Best practice to update bulk data in table used for reporting in SQL - sql

I have created a table for reporting purpose where I am storing data for about 50 columns and at some time interval my scheduler executes a service which processes other tables and fill up data in my flat table.
Currently I am deleting and inserting data in that table But I want to know if this is the good practice or should I check every column in every row and update it if any change found and insert new record if data does not exists.
FYI, total number of rows which are being reinserted is 100k+.

This is a very broad question that can only really be answered with access to your environment and discussion on your personal requirements. Obviously this is not possible via Stack Overflow.
This means you will need to make this decision yourself.
The information you need to understand to be able to do this are the types of table updates available and how you can achieve them, normally referred to as Slowly Changing Dimensions. There are several different types, each with their own advantages, disadvantages and optimal use cases.
Once you understand the how of getting your data to incrementally update as required, you can then look at the why and whether the extra processing logic required to achieve this is actually worth it. Your dataset of a few hundred thousand rows of data is not large and probably may therefore not need this level of processing just yet, though that assessment will depend on how complex and time consuming your current process is and how long you have to run it.

It is probably faster to repopulate the table of 100k rows. To do an update, you still need to:
generate all the rows to insert
compare values in every row
update the values that have changed
The expense of updating rows is heavily on the logging and data movement operations at the data page level. In addition, you need to bring the data together.
If the update is updating a significant portion of rows, perhaps even just a few percent of them, then it is likely that all data pages will be modified. So the I/O is pretty similar.
When you simply replace the table, you will start by either dropping the table or truncating it. Those are relatively cheap operations because they are not logged at the row level. Then you are inserting into the table. Inserting 100,000 rows from one table to another should be pretty fast.
The above is general guidance. Of course, if you are only changing 3 rows in the table each day, then update is going to be faster. Or, if you are adding a new layer of data each day, then just an insert, with a handful of changed historical values might be a fine approach.

Related

Is it possible to implement point in time recovery (PITR) in PostgreSQL for a single table?

Let's say I have a database with lots of tables, but there's one big table that's being updated regularly. At any given point in time, this table contains billions of rows, and let's say that the table is updated so regularly that we can expect a 100% refresh of the table by the end of each quarter. So the volume of data being moved around is in the order tens of billions. Because this table is changing so constantly, I want to implement a PITR, but only for this one table. I have two options:
Hack PostgreSQL's in-house PITR to apply only for one table.
Build it myself by creating a base backup, set up continuous archiving, and using a python script to execute the log of SQL statements up to a point in time (or use PostgreSQL's EXECUTE statement to loop through the archive). The big con with this is that it won't have the timeline functionality.
My problem is, I don't know if option 1 is even possible, and I don't know if option 2 even makes sense (looping through billions of rows sounds like it defeats the purpose of PITR, which is speed and convenience.) What other options do I have?

Design question: Best approach to store and retrieve deltas in an SQL table

I have a historical table which contains many price columns and only few columns change at a time. Currently I am just inserting all the data as new records and this change could come 100+ times every second. So it is resulting in growing of table size pretty quick.
I am trying to find the better design for the table to keep the table size to minimum and the best query to retrieve the data when required. I am not much worried about the data retrieval performance, but it should be somewhere in the middle when used for reports. Priority is to keep the table size to its minimum.
Data from this historical table is not retrieved on a day to day basis. I have a transaction table like *1 Current Design for that purpose.
Here are the details of my implementation.
1) Current Design
2) Planned design - 1
Question:
1) If I use the above table structure what is the best query to get the results like shown in Current design #1
3) Planned design - 2
Question:
1) How much performance hit this would be compared to Planned design #1.
2) Also if I go in that route what is the best query to get the results like what shown in Current design #1?
End question:
I assume planned design #1 will take more table space VS planned design #2. But planned design 2 will take more time to retrieve the query. Is there any case I assumption can go wrong?
Edit: I have only inserts going to this table. No updates or deletion is ever made to this.
In fact, I think you have better plan. You can use Temporal Tables that come from SQL Server 2016.
This type managed by sql to track change of table in best way.
Visit this Link: https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-2017
I have s similar situation where I'm loading a bunch of temperature sensors every 10 seconds. As I'm using the express version of MSSQL I'm looking a at max database size of 10GB so I went creative to make it last as long as possible.
My table-layout is pretty much identical to yours in that I've got 1 timestamp + 30 value columns + another 30 flag columns.
The value columns are numeric(9,2)
The value columns are marked SPARSE, if the value is identical (enough) to the value before it I store NULL instead of repeating the value.
The flag columns are bit and indicate whether the value is 'extrapolated' or from an actual source (later on that)
I've also got another table that holds the following information for each sensor:
last time the sensor was updated; that way if a new value comes in I can easily decide if this requires just a new insert at the end of the table or whether I need to go through all the logic of inserting/updating a value somewhere in-between existing numbers.
the value of that latest entry
the sensitivity for said sensor; this way I don't have to hardcode it and can set it on a per-sensor basis
Anyway, for now my stream of information is that I've got several programs each reading data from different sources (arduino, web, ...) and dumping this in .csv files and then a 'parser' program that reads these files into the database once in a while. As I'm loading the values 1 by 1 instead of row-based this isn't very efficient but I'm now doing about 3500 values / second so I'm not overly concerned. I'll agree that this is only true when loading the values in historical order and because I'm using the helper table.
Currently I've got almost 1 year of information in there which corresponds to
2.209.883 rows
5.799.511 values spread over 18 sensors (yes, I've still got room for 12 more without needing to change the table)
This means I've only got 15% of the fields filled in, or looking at it the other way around, when I'd fill in every record rather than putting NULL in case of repetition, I'd have almost 8 times that many numbers there.
As for space requirements: I decided to reload all numbers last night 'for fun' but noticed that even though most .csv files come in historically, they would do a range of columns from Jan to Dec, then another couple of columns from Jan to Dec etc.. This resulted in quite a bit of fragmentation: 70% in fact! At that time the table required 282Mb of disk-space.
I then rebuilt the table bringing the fragmentation down to 0% and the reserved space went down to 118Mb (!).
For me this is more than good enough
it's unlikely the table will outgrow the 10GB limit anytime soon, especially if I stick to (online) rebuilding it now and then.
loading data is sufficiently fast (although reloading the entire year took a couple of hours)
reporting is sufficiently fast (for now, haven't tried to connect any 'interactive' reporting tools to it yet; but for some simple graphs in excel it works just fine IMHO).
FYI: for reporting I've created a rater simple stored procedure that picks a from-to range for a given set of columns; dumps it in a temp-table and then fills in the blanks by figuring out the NULL-ranges and then filling these in with the value that preceded the range. This works quite well although fetching the 'first' value sometimes takes a while as I can't predict how far back in time the last value should be looked for (sometimes there is none).
To work around this I've added another process that extrapolates the values for every 'hour' timestamp. That way the report never needs to go back more than 1 hour. A flag-column in the readings table indicates whether the value on a record for a given field was extrapolated or not.
(note: this makes updating values in the past more problematic but not impossible)
Hope this helps you out a bit in your endeavors, good luck!

Is it a good idea to index every column if the users can filter by any column?

In my application, users can create custom tables with three column types, Text, Numeric and Date. They can have up to 20 columns. I create a SQL table based on their schema using nvarchar(430) for text, decimal(38,6) for numeric and datetime, along with an Identity Id column.
There is the potential for many of these tables to be created by different users, and the data might be updated frequently by users uploading new CSV files. To get the best performance during the upload of the user data, we truncate the table to get rid of existing data, and then do batches of BULK INSERT.
The user can make a selection based on a filter they build up, which can include any number of columns. My issue is that some tables with a lot of rows will have poor performance during this selection. To combat this I thought about adding indexes, but as we don't know what columns will be included in the WHERE condition we would have to index every column.
For example, on a local SQL server one table with just over a million rows and a WHERE condition on 6 of its columns will take around 8 seconds the first time it runs, then under one second for subsequent runs. With indexes on every column it will run in under one second the first time the query is ran. This performance issue is amplified when we test on an SQL Azure database, where the same query will take over a minute the first time its run, and does not improve on subsequent runs, but with the indexes it takes 1 second.
So, would it be a suitable solution to add a index on every column when a user creates a column, or is there a better solution?
Yes, it's a good idea given your model. There will, of course, be more overhead maintaining the indexes on the insert, but if there is no predictable standard set of columns in the queries, you don't have a lot of choices.
Suppose by 'updated frequently,' you mean data is added frequently via uploads rather than existing records being modified. In that case, you might consider one of the various non-SQL databases (like Apache Lucene or variants) which allow efficient querying on any combination of data. For reading massive 'flat' data sets, they are astonishingly fast.

Join or storing directly

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ....), but didnt have such a case yet and would like to see whether someone here on stackoverflow had a similar situation and has a recommendation for either one for a specific reason.
Below a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
Thanks
Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on id and date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B-Pkid to A-pkid.
I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.
This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticable performance impact unless you have HUGE amounts of data.
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.
We had a similar situation in our project tracking system where the latest state of the project is stored in the projects table (Cols: project_id, description etc.,) and the history of the project is stored in the project_history table (Cols: project_id, update_id, description etc.,). Whenever there is a new update to the project, we need find out the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and get the MAX(update_id), but the cost would be high considering the number of the project updates (in a couple of hundreds of thousands) and the frequency of update. So, we decided to store the value in the projects table itself in max_update_id column and keep updating it whenever there is a new update to a given project. HTH.
If I understand correctly, you have a table whose each row is a parameter and another table that logs each parameter value historically in a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table hosts a listing of measures (29K recs) and the historical parameter value table has the value for that parameter every 1 hr - so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being in the last record in the parameter value table. While this may look like duplication of data, from the performance standpoint it makes perfect sense because
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.

How should I keep accurate records summarising multiple tables?

I have a normalized database and need to produce web based reports frequently that involve joins across multiple tables. These queries are taking too long, so I'd like to keep the results computed so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all update so far.
All tables have autoincrement primary integer keys, and I almost always add new rows and can arrange to clear the computed results in they change.
I approached a similar problem where I needed a summary of a single table by arranging to iterate over each row in the table, and keep track of the iterator state and the highest primary keen (i.e. "highwater") seen. That's fine for a single table, but for multiple tables I'd end up keeping one highwater value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels a step backwards and would probably change my database size from about 5GB to about 20GB.
(I'm using sqlite3 at the moment, but MySQL is also an option).
I see two approaches:
You move the data in a separate database, denormalized, putting some precalculation, to optimize it for quick access and reporting (sounds like a small datawarehouse). This implies you have to think of some jobs (scripts, separate application, etc.) that copies and transforms the data from the source to the destination. Depending on the way you want the copying to be done (full/incremental), the frequency of copying and the complexity of data model (both source and destination), it might take a while to implement and then to optimizie the process. It has the advantage that leaves your source database untouched.
You keep the current database, but you denormalize it. As you said, this might imply changing in the logic of the application (but you might find a way to minimize the impact on the logic using the database, you know the situation better than me :) ).
Can the reports be refreshed incrementally, or is it a full recalculation to rework the report? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create some tables to contain the report output (and metadata table to define what report output versions are available), but most of the time this is overkill and you are better off just saving the query results off to a file or other cache store.
If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high water mark data (except you may want to store min/max pairs).
You can create triggers.
As soon as one of the calculated values changes, you can do one of the following:
Update the calculated field (Preferred)
Recalculate your summary table
Store a flag that a recalculation is necessary. The next time you need the calculated values check this flag first and do the recalculation if necessary
Example:
CREATE TRIGGER update_summary_table UPDATE OF order_value ON orders
BEGIN
UPDATE summary
SET total_order_value = total_order_value
- old.order_value
+ new.order_value
// OR: Do a complete recalculation
// OR: Store a flag
END;
More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.
You haven't said anything about your indexing strategy. I would look at that first - making sure that your indexes are covering.
Then I think the trigger option discussed is also a very good strategy.
Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).