Design question: Best approach to store and retrieve deltas in an SQL table

I have a historical table that contains many price columns, of which only a few change at a time. Currently I insert the full row as a new record on every change, and changes can arrive 100+ times per second, so the table grows very quickly.
I am trying to find a better table design that keeps the table size to a minimum, along with the best query to retrieve the data when required. I am not too worried about retrieval performance, although it should be reasonable when used for reports. The priority is to keep the table as small as possible.
Data from this historical table is not retrieved on a day-to-day basis. I have a transaction table like #1 Current Design for that purpose.
Here are the details of my implementation.
1) Current Design
2) Planned design - 1
Question:
1) If I use the above table structure, what is the best query to get results like those shown in Current design #1?
3) Planned design - 2
Question:
1) How much of a performance hit would this be compared to Planned design #1?
2) Also, if I go that route, what is the best query to get results like those shown in Current design #1?
End question:
I assume planned design #1 will take more table space than planned design #2, but planned design #2 will take more time to query. Is there any case where my assumption could be wrong?
Edit: This table only receives inserts. No updates or deletions are ever made to it.

In fact, I think there is a better option: you can use Temporal Tables, which were introduced in SQL Server 2016.
This type of table is managed by SQL Server itself, which tracks changes to the table for you.
See: https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-2017

I have a similar situation where I'm loading a bunch of temperature sensors every 10 seconds. As I'm using the Express edition of MSSQL, I'm looking at a maximum database size of 10GB, so I got creative to make it last as long as possible.
My table-layout is pretty much identical to yours in that I've got 1 timestamp + 30 value columns + another 30 flag columns.
The value columns are numeric(9,2)
The value columns are marked SPARSE; if a value is identical (enough) to the value before it, I store NULL instead of repeating the value.
The flag columns are bit and indicate whether the value is 'extrapolated' or from an actual source (more on that later)
I've also got another table that holds the following information for each sensor:
last time the sensor was updated; that way if a new value comes in I can easily decide if this requires just a new insert at the end of the table or whether I need to go through all the logic of inserting/updating a value somewhere in-between existing numbers.
the value of that latest entry
the sensitivity for said sensor; this way I don't have to hardcode it and can set it on a per-sensor basis
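The "store NULL when the value has not changed enough" rule, together with the helper table, can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere, it handles a single sensor column, it only covers in-order appends, and the names `readings`, `sensor_state` and `store` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (ts TEXT PRIMARY KEY, s1 REAL);   -- NULL means "same as before"
CREATE TABLE sensor_state (sensor TEXT PRIMARY KEY, last_ts TEXT,
                           last_value REAL, sensitivity REAL);
INSERT INTO sensor_state VALUES ('s1', NULL, NULL, 0.1);
""")

def store(ts, value):
    """Append one reading for sensor s1 (assumes readings arrive in time order)."""
    last_value, sensitivity = conn.execute(
        "SELECT last_value, sensitivity FROM sensor_state WHERE sensor = 's1'"
    ).fetchone()
    # A value counts as changed if nothing was stored yet or it moved
    # by more than the per-sensor sensitivity.
    changed = last_value is None or abs(value - last_value) > sensitivity
    conn.execute("INSERT INTO readings (ts, s1) VALUES (?, ?)",
                 (ts, value if changed else None))
    if changed:
        conn.execute("UPDATE sensor_state SET last_ts = ?, last_value = ? "
                     "WHERE sensor = 's1'", (ts, value))
    else:
        conn.execute("UPDATE sensor_state SET last_ts = ? WHERE sensor = 's1'", (ts,))

store("2024-01-01 10:00:00", 20.0)
store("2024-01-01 10:00:10", 20.05)   # within sensitivity, stored as NULL
store("2024-01-01 10:00:20", 20.5)    # real change, stored as a value

stored = conn.execute("SELECT ts, s1 FROM readings ORDER BY ts").fetchall()
print(stored)
```

Comparing against the last stored value (rather than the last received one) is a choice made for this sketch; it prevents slow drift from going unrecorded.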
Anyway, for now my stream of information is that I've got several programs each reading data from different sources (arduino, web, ...) and dumping this in .csv files and then a 'parser' program that reads these files into the database once in a while. As I'm loading the values 1 by 1 instead of row-based this isn't very efficient but I'm now doing about 3500 values / second so I'm not overly concerned. I'll agree that this is only true when loading the values in historical order and because I'm using the helper table.
Currently I've got almost 1 year of information in there which corresponds to
2.209.883 rows
5.799.511 values spread over 18 sensors (yes, I've still got room for 12 more without needing to change the table)
This means I've only got 15% of the fields filled in or, looking at it the other way around, if I filled in every record rather than putting NULL in case of repetition, I'd have almost 7 times that many numbers there.
As for space requirements: I decided to reload all numbers last night 'for fun' but noticed that even though most .csv files come in historically, they would do a range of columns from Jan to Dec, then another couple of columns from Jan to Dec, etc. This resulted in quite a bit of fragmentation: 70% in fact! At that time the table required 282 MB of disk space.
I then rebuilt the table, bringing the fragmentation down to 0%, and the reserved space went down to 118 MB (!).
For me this is more than good enough
it's unlikely the table will outgrow the 10GB limit anytime soon, especially if I stick to (online) rebuilding it now and then.
loading data is sufficiently fast (although reloading the entire year took a couple of hours)
reporting is sufficiently fast (for now, haven't tried to connect any 'interactive' reporting tools to it yet; but for some simple graphs in excel it works just fine IMHO).
FYI: for reporting I've created a rather simple stored procedure that picks a from-to range for a given set of columns, dumps it in a temp table and then fills in the blanks by figuring out the NULL ranges and filling these in with the value that preceded the range. This works quite well, although fetching the 'first' value sometimes takes a while as I can't predict how far back in time the last value should be looked for (sometimes there is none).
To work around this I've added another process that extrapolates the values for every 'hour' timestamp. That way the report never needs to go back more than 1 hour. A flag-column in the readings table indicates whether the value on a record for a given field was extrapolated or not.
(note: this makes updating values in the past more problematic but not impossible)
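As a rough illustration of the "fill in the blanks" step described above (not the actual stored procedure), the query below replaces each NULL with the most recent non-NULL value before it. It runs on SQLite through Python's sqlite3 module so that it is self-contained; the table `readings` and column `s1` are assumed names, and in T-SQL the `LIMIT 1` would be written as `TOP 1`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (ts TEXT PRIMARY KEY, s1 REAL);
INSERT INTO readings VALUES
  ('2024-01-01 10:00:00', 20.0),
  ('2024-01-01 10:00:10', NULL),
  ('2024-01-01 10:00:20', NULL),
  ('2024-01-01 10:00:30', 20.5),
  ('2024-01-01 10:00:40', NULL);
""")

# For every row in the requested range, keep the stored value if there is one,
# otherwise look back for the latest earlier row that has a value.
FILL_FORWARD = """
SELECT r.ts,
       COALESCE(r.s1,
                (SELECT p.s1
                   FROM readings p
                  WHERE p.ts < r.ts AND p.s1 IS NOT NULL
                  ORDER BY p.ts DESC
                  LIMIT 1)) AS s1
  FROM readings r
 WHERE r.ts BETWEEN ? AND ?
 ORDER BY r.ts
"""

filled = conn.execute(
    FILL_FORWARD, ("2024-01-01 10:00:10", "2024-01-01 10:00:40")
).fetchall()
for row in filled:
    print(row)
```

The look-back in the subquery is exactly the part that can get slow when the previous value is far in the past, which is what the hourly extrapolated rows are meant to bound.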
Hope this helps you out a bit in your endeavors, good luck!

Related

Best practice to update bulk data in table used for reporting in SQL

I have created a table for reporting purposes in which I store data for about 50 columns. At a set time interval my scheduler executes a service that processes other tables and fills up my flat table.
Currently I am deleting and re-inserting the data in that table, but I want to know whether this is good practice, or whether I should instead check every column in every row, update it if a change is found, and insert a new record if the data does not exist.
FYI, total number of rows which are being reinserted is 100k+.

This is a very broad question that can only really be answered with access to your environment and discussion on your personal requirements. Obviously this is not possible via Stack Overflow.
This means you will need to make this decision yourself.
The information you need to understand to be able to do this are the types of table updates available and how you can achieve them, normally referred to as Slowly Changing Dimensions. There are several different types, each with their own advantages, disadvantages and optimal use cases.
Once you understand the how of getting your data to incrementally update as required, you can then look at the why and whether the extra processing logic required to achieve this is actually worth it. Your dataset of a few hundred thousand rows is not large and therefore probably does not need this level of processing just yet, though that assessment will depend on how complex and time consuming your current process is and how long you have to run it.

It is probably faster to repopulate the table of 100k rows. To do an update, you still need to:
generate all the rows to insert
compare values in every row
update the values that have changed
The expense of updating rows is heavily on the logging and data movement operations at the data page level. In addition, you need to bring the data together.
If the update is updating a significant portion of rows, perhaps even just a few percent of them, then it is likely that all data pages will be modified. So the I/O is pretty similar.
When you simply replace the table, you will start by either dropping the table or truncating it. Those are relatively cheap operations because they are not logged at the row level. Then you are inserting into the table. Inserting 100,000 rows from one table to another should be pretty fast.
The above is general guidance. Of course, if you are only changing 3 rows in the table each day, then update is going to be faster. Or, if you are adding a new layer of data each day, then just an insert, with a handful of changed historical values might be a fine approach.

Correlation between amount of rows and amount columns in database performance

Is there a correlation between the number of rows/number of columns used and its impact within the (MS)SQL database?
A little more background:
We have to store lots of data from measurement devices. These devices ping a string with data back to us around 100 times a day. These strings contain roughly 300 fields. Assuming we have 100 devices in operation, that means we get 10000 records back every day. At our back-end we split these data strings and have to put them into the database. When the format of these data strings is fixed, that means we add around 10000 new rows to the database each day. No big deal.
However, the contents of these data strings may change over time. There are two options we are considering:
Using vertical tables to store the data dynamically
Using horizontal tables and add a new column now and then when it's needed.
From the perspective of ease we'd like to choose the first approach. However, that means we're adding 100*100*300=3000000 rows each day. Data has to be stored for 1 year and a month (395 days), so we end up at around 1.2 billion rows, not counting expected growth.
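For reference, the row counts for both layouts over the retention period work out as below. This is just the arithmetic from the figures above, written out in Python; the variable names are made up for the example.

```python
devices = 100
messages_per_device_per_day = 100
fields_per_message = 300
retention_days = 395

# Horizontal layout: one row per received data string.
horizontal_rows_per_day = devices * messages_per_device_per_day
horizontal_rows_retained = horizontal_rows_per_day * retention_days

# Vertical layout: one row per field of every data string.
vertical_rows_per_day = horizontal_rows_per_day * fields_per_message
vertical_rows_retained = vertical_rows_per_day * retention_days

print(horizontal_rows_per_day, horizontal_rows_retained)  # 10000 3950000
print(vertical_rows_per_day, vertical_rows_retained)      # 3000000 1185000000
```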
Is it from a performance perspective smarter to use a 'vertical' or a 'horizontal' approach?
When choosing the 'vertical' solution, how can we actually optimize performance by using PKs/FKs wisely?
When choosing for the 'horizontal' solution, are there recommendations for adding columns to the table?

I have a vertical DB with 275 million rows in the "values" table. We took this approach because we couldn't accurately define the schema at the outset either. Inserts are fantastic. Selects suck. To be fair, we throw in a couple of extra doohickies the typical vertical schema doesn't have to deal with.
Have a search for EAV, aka Entity Attribute Value, models. You'll find a lot of heat on both sides of the debate. Two good articles on making it work are
What is so bad about EAV, anyway?
dave’s guide to the eav
My guess is these sensors don't just start sending you extra fields. You have to release new sensors or sensor code for this to happen. That's your chance to do change control on your schema and add the extra columns. If external parties can connect sensors without notifying you this argument is null and void and you may be stuck with an EAV.
For the horizontal option you can split tables, putting the frequently-used columns in one table and the less-used ones in a second; both tables have the same primary key values, so you can link less-used to more-used columns. You can also use the RDBMS's built-in partitioning functionality to split each day's (or week's or month's) data from the others'.

Generally, you can tune a table more for inserts (or any DML) or for queries. Improving one side comes at the expense of the other. Usually, it's a balancing act.
First of all, 10K inserts a day is not really a large number. Sure, it's not insignificant, but it doesn't even come close to what would be considered "large" nowadays. So, while we don't want to make inserts downright sluggish, this gives you some wiggle room.
Creating an index on the device id and/or entry timestamp will do some logical partitioning of the data for you. The exact makeup of your index(es) will depend on your queries. Are you looking for all entries for a given date or date range? Then index the timestamp column. Are you looking for all entries received from a particular device? Then index the device id column. Are you looking for entries from a particular device on a particular date or date range or sorted by the date? Then create an index on both columns.
So if you ask for the entries for device x on date y, then you are going out to the table and looking only at the rows you need. The fact that the table is much larger than the small subset you query is incidental. It's as if the rest of the table doesn't even exist. The total size of the table need not be intimidating.
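A small sketch of the composite-index case follows. It uses SQLite via Python's sqlite3 module purely so that it is runnable as-is; SQL Server's optimizer and plan output are different, and the table `entries`, its columns and the index name are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (device_id INTEGER, received_at TEXT, payload TEXT);
-- One index on both columns: device first, then timestamp.
CREATE INDEX ix_entries_device_time ON entries (device_id, received_at);
""")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [(device, f"2024-01-{day:02d} 12:00:00", "x")
     for device in range(1, 101) for day in range(1, 29)],
)

QUERY = """
SELECT * FROM entries
 WHERE device_id = ? AND received_at >= ? AND received_at < ?
"""
params = (42, "2024-01-10", "2024-01-11")

# Ask SQLite how it would run the query; the last column holds the description.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY, params).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)

rows = conn.execute(QUERY, params).fetchall()
print(rows)
```

The plan should mention the index rather than a full table scan, which is the "as if the rest of the table doesn't even exist" effect described above.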
Another option: As it looks like the data is written to the table and never altered after that, then you may want to create a data warehouse schema for the data. New entries can be moved to the warehouse every day or several times a day. The point is, the warehouse schema can have the data sliced, diced, quartered and cubed to make queries much more efficient. So you can have the existing table tuned for more efficient inserts and the warehouse tuned for more efficient queries. That is, after all, what data warehouses are for.
You also imply that some of each entry is (or can be) duplicated from one entry to the next. See if you can segment the data into three types:
Type 1: Data that never changes (the device id, for example)
Type 2: Data that rarely changes
Type 3: Data that changes often
Now all you have is a normalization problem, something a lot easier to solve. Let's say the row is equally split between the types. So you have one table with 100 rows of 33 columns. That's it. It never changes. Linked to that is a table with at least 100 rows of 33 columns but maybe several new rows are added each day. Finally, linked to the second table a table with rows of 33 columns that possibly grows by the full 10K every day.
This minimizes the grow-space required by the online database. The warehouse could then denormalize back to one huge table for ease of querying.

SQL server data backup/warehouse

I've been asked to take snapshots of certain tables from the database, so that in the future we can have a clear view of the situation on any given day in the past. Let's say that one such table looks like this:
GKEY  Time_in           Time_out          Category  Commodity
1001  2014-05-01 10:50  NULL              EXPORT    Apples
1002  2014-05-02 11:23  2014-05-20 12:05  IMPORT    Bananas
1003  2014-05-05 11:23  NULL              STORAGE   NULL
The simplest way to take a snapshot would be to create a copy of the table with another column SNAPSHOT_TAKEN (Datetime) and populate it with an INSERT statement
INSERT INTO UNITS_snapshot (SNAPSHOT_TAKEN, GKEY,Time_in, Time_out, Category, Commodity)
SELECT getdate() as SNAPSHOT_TAKEN, * FROM UNITS
OK, it works fine, but it would make the destination table quite big pretty soon, especially if I'd like to run this query often. A better solution would be to check for changes between the current live table and the latest snapshot and write only those down, omitting everything that hasn't changed.
Is there a simple way to write such a query?
EDIT: Possible solution for the "Forward delta" (assuming no deletes from original table)
INSERT INTO UNITS_snapshot
SELECT getdate() as SNAP_DATE,
r.*, -- all columns from the original table
CASE when b.gkey is null then 'I' else 'U' END AS change_type
FROM UNITS r left outer join UNITS_snapshot b
on b.gkey = r.gkey
-- compare against the most recent stored version of this row only
and b.snap_date = (SELECT max(s.snap_date) from UNITS_snapshot s where s.gkey = r.gkey)
WHERE b.gkey is null -- new row: nothing stored for it yet
or r.time_in <> b.time_in or r.time_out <> b.time_out
or r.category <> b.category or r.commodity <> b.commodity
Assumptions: no row is ever deleted from the original table. Every column compared in the WHERE clause should probably also be wrapped in COALESCE (xxx,'') so that a change between NULL and a value is detected, since a plain <> comparison against NULL never evaluates to true.

Both Dan Bracuk and ITroubs have made very good comments.
Solution 1 - Daily snapshot
The first solution you proposed is very simple. You can build the snapshot with a simple query and you can also consult it and rebuild any day's snapshot with a very simple query, by just filtering on the SNAPSHOT_TAKEN column.
If you have just some thousands of records, I'd go with this one, without worrying too much about its growing size.
Solution 2 - Daily snapshot with rolling history
This is basically the same as solution 1, but you keep only some of the snapshots over time... to avoid having the snapshot DB growing indefinitely over time.
The simplest approach is just to save the snapshots of the last N days... maybe a month or two of data. A more sophisticated approach is to keep snapshots with a density that depends on age... so, for example, you could have every day of the last month, plus every Sunday of the last 3 months, plus every end-of-month of the last year, etc...
This solution requires you to develop a procedure to handle deletion of the snapshots that are not required any more. It's not as simple as using getdate() within a query. But you obtain a good balance between space and historic information. You just need to work out a snapshot retention strategy that suits your needs.
Solution 3 - Forward row delta
Building any type of delta is a much more complex procedure.
A forward delta is built by storing the initial snapshot (as if all rows had been inserted on that day) and then, on the following snapshots, just storing information about the difference between snapshot(N) and snapshot(N-1). This is done by analyzing each row and just storing the data if the row is new or updated or deleted. If the main table does not change much over time, you can save quite a lot of space, as no info is stored for unchanged rows.
Obviously, to handle deltas, you now need 2 extra columns, not just one:
delta id (your snapshot_taken is good, if you only want 1 delta per day)
row change type (could be D=deleted, I=inserted, U=updated... or something similar)
The main complexity derives from the necessity to identify rows (usually by primary key) so as to calculate if between 2 snapshots any individual row has been inserted, updated, deleted... or none of the above.
The other complexity comes from reading the snapshot DB and building the latest (or any other) snapshot. This is necessary because, having only row differences in the table, you cannot simply select a day's snapshot by filtering on snapshot_taken.
This is not easy in SQL. For each row you must take into account just the final version... the one with MAX snapshot_taken that is <= the date of the snapshot you want to build. If it is an insert or update, then keep the data for that row, else (if it is a delete) then ignore it.
To build a delta of snapshot(N), you must first build the latest snapshot (N-1) from the snapshot DB. Then you must compare the two snapshots by primary key or row identity and calculate the change type (I/U/D) and insert the changes in the snapshot DB.
Beware that you cannot delete old snapshot data without consolidating it first. That is because all snapshots are calculated from the oldest initial one and all the subsequent difference data. If you want to remove a year's worth of old snapshots, you'll have to consolidate the old initial snapshot and all the year's variations into a new initial snapshot.
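To make the "rebuild a snapshot from forward deltas" step more concrete, here is a minimal sketch. It assumes at most one delta per row per day and a simplified version of the example table, and it runs on SQLite via Python's sqlite3 module so that it is self-contained; `units_snapshot`, `snapshot_as_of` and the sample data are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE units_snapshot (snap_date TEXT, change_type TEXT, gkey INTEGER,
                             category TEXT, commodity TEXT);
INSERT INTO units_snapshot VALUES
  ('2014-05-01', 'I', 1001, 'EXPORT', 'Apples'),
  ('2014-05-02', 'I', 1002, 'IMPORT', 'Bananas'),
  ('2014-05-03', 'U', 1001, 'EXPORT', 'Pears'),
  ('2014-05-04', 'D', 1002, NULL, NULL);
""")

# For each row key, keep only the last delta on or before the requested day,
# then drop the keys whose last delta is a delete.
AS_OF = """
SELECT s.gkey, s.category, s.commodity
  FROM units_snapshot s
 WHERE s.snap_date = (SELECT MAX(x.snap_date)
                        FROM units_snapshot x
                       WHERE x.gkey = s.gkey AND x.snap_date <= ?)
   AND s.change_type <> 'D'
 ORDER BY s.gkey
"""

def snapshot_as_of(day):
    """Return the reconstructed table contents as of the given day."""
    return conn.execute(AS_OF, (day,)).fetchall()

print(snapshot_as_of("2014-05-02"))
print(snapshot_as_of("2014-05-04"))
```

The correlated MAX subquery is the "final version of each row" rule from the text; the same query shape should carry over to T-SQL, although that has not been verified here.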
Solution 4 - Backward row delta
This is very similar to solution 3, but a bit more complex.
A backward delta is built by storing the final snapshot and then, on the following snapshots, just storing information about the difference between snapshot(N-1) and snapshot(N).
The advantage is that the latest snapshot is always readily available through a simple select on the snapshot DB. You only need to merge the difference data when you want to retrieve an older snapshot. Compare this to the forward delta, where you always need to rebuild the snapshot from the difference data unless you are actually interested in the very first snapshot.
Another advantage (compared to solution 3) is that you can remove older snapshots by just deleting the difference data older than a particular snapshot. You can do this easily because snapshots are calculated from the final snapshot and not from the initial one.
The disadvantage is in the obscure logic. Difference data is calculated backwards. Values must be stored on the (U)pdate and (D)elete variations, but are unnecessary on the I variations. Going backwards, rows must be ignored if the first variation you find is an (I)nsert. Doable, but a bit trickier.
Solution 5 - Forward and backward column delta
If the main table has many columns, or many long text or varchar columns, and only a bunch of these are updated, then it could make sense to store only column variations instead of row variations.
This is done by using a table with this structure:
delta id (your snapshot_taken is good, if you only want 1 delta per day)
change type (could be D=deleted, I=inserted, U=updated... or something similar)
column name
value
The difference can be calculated forward or backward, as per row deltas.
I've seen this done, but I really advise against it. There are just too many disadvantages and added complexity.
Value is a text or varchar, and there are typecasting issues to handle if you have numeric, boolean or date/time values... and, if you have a lot of these, it could very well be you won't be saving as much space as you think you are.
Rebuilding any snapshot is hell. Altogether... any operation on this type of table really requires a lot of knowledge of the main table's structure.

Join or storing directly

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ...), but I haven't had such a case yet and would like to see whether someone here on Stack Overflow has had a similar situation and has a recommendation for either one for a specific reason.
Below a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
Thanks

Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on id and date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B-Pkid to A-pkid.
I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.

This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticeable performance impact unless you have HUGE amounts of data.
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.
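For reference, the normalized version (no extra column in table A) can be as small as the query below. It is a sketch only: it runs on SQLite through Python's sqlite3 module, the table and column names are adapted from the schema in the question (`a_id` for the foreign key, `processed_on` for the date), and the index on both columns is an assumption that would need checking against real query plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, some_data TEXT);
CREATE TABLE b (a_id INTEGER REFERENCES a(id), data TEXT, processed_on TEXT);
CREATE INDEX ix_b_a_date ON b (a_id, processed_on);

INSERT INTO a VALUES (1, 'first'), (2, 'second'), (3, 'never processed');
INSERT INTO b VALUES (1, 'r1', '2024-01-01'),
                     (1, 'r2', '2024-02-01'),
                     (2, 'r3', '2024-01-15');
""")

# LEFT JOIN keeps entries of A that have never been processed (date is NULL).
latest = conn.execute("""
SELECT a.id, MAX(b.processed_on) AS last_processed
  FROM a
  LEFT JOIN b ON b.a_id = a.id
 GROUP BY a.id
 ORDER BY a.id
""").fetchall()
print(latest)
```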

We had a similar situation in our project tracking system, where the latest state of the project is stored in the projects table (Cols: project_id, description etc.) and the history of the project is stored in the project_history table (Cols: project_id, update_id, description etc.). Whenever there is a new update to the project, we need to find out the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and getting the MAX(update_id), but the cost would be high considering the number of project updates (a couple of hundred thousand) and the frequency of updates. So we decided to store the value in the projects table itself, in a max_update_id column, and keep updating it whenever there is a new update to a given project. HTH.

If I understand correctly, you have a table whose each row is a parameter and another table that logs each parameter value historically in a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table hosts a listing of measures (29K recs) and the historical parameter value table has the value for that parameter every 1 hr - so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being in the last record in the parameter value table. While this may look like duplication of data, from the performance standpoint it makes perfect sense because
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.
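If you do go the denormalized route, the main thing to get right is that the history insert and the "latest value" update succeed or fail together. A minimal sketch, assuming made-up table names (`parameter`, `parameter_value`) and using SQLite via Python's sqlite3 module so that it runs as-is:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parameter (id INTEGER PRIMARY KEY, name TEXT,
                        latest_value REAL, latest_at TEXT);
CREATE TABLE parameter_value (parameter_id INTEGER, recorded_at TEXT, value REAL);
INSERT INTO parameter (id, name) VALUES (1, 'flow_rate');
""")

def record(parameter_id, recorded_at, value):
    """Store a history row and refresh the cached latest value atomically."""
    with conn:  # commits both statements together, rolls both back on error
        conn.execute("INSERT INTO parameter_value VALUES (?, ?, ?)",
                     (parameter_id, recorded_at, value))
        # Only move the cached value forward in time, so a late-arriving
        # older reading does not overwrite a newer one.
        conn.execute(
            "UPDATE parameter SET latest_value = ?, latest_at = ? "
            "WHERE id = ? AND (latest_at IS NULL OR latest_at <= ?)",
            (value, recorded_at, parameter_id, recorded_at),
        )

record(1, "2024-01-01 10:00", 3.5)
record(1, "2024-01-01 11:00", 4.0)
record(1, "2024-01-01 09:00", 1.0)   # late arrival: history grows, cache stays

current = conn.execute(
    "SELECT latest_value, latest_at FROM parameter WHERE id = 1").fetchone()
history_count = conn.execute("SELECT COUNT(*) FROM parameter_value").fetchone()[0]
print(current, history_count)
```

The guard on `latest_at` is an addition for this sketch; it is not something the answer above describes.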

Handling 100's of 1,000,000's of rows in T-SQL2005

I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding the count of rows from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.

Most importantly, consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes). You might find that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
On the other hand, it might make more sense to actually split the tables into 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime and provide functions to convert from datetime to int, thus having queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size saving will probably come at a performance cost if you need to do complex datetime queries. On cubes, though, there is the standard technique of storing an int in YYYYMMDD format instead of an epoch.
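Both integer encodings mentioned above are easy to sketch. The helpers below are hypothetical application-side conversions written in Python (they are not the `dateTimeToEpoch` function from the query above, which would have to live in the database), and the epoch helper assumes naive datetimes are in UTC.

```python
from datetime import date, datetime, timezone

def to_yyyymmdd(d: date) -> int:
    """Encode a date as an int such as 20100123 (sorts in date order)."""
    return d.year * 10000 + d.month * 100 + d.day

def from_yyyymmdd(key: int) -> date:
    """Decode an int such as 20100123 back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)

def to_epoch_seconds(dt: datetime) -> int:
    """Seconds since 1970-01-01, treating a naive datetime as UTC."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

print(to_yyyymmdd(date(2010, 1, 23)))            # 20100123
print(from_yyyymmdd(20100123))                   # 2010-01-23
print(to_epoch_seconds(datetime(2010, 1, 23)))   # 1264204800
```

The YYYYMMDD form keeps range filters readable (`WHERE day_key >= 20100123`), at the cost of losing the time of day.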

What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store similar set of items, you need one and one only table. Setting multiple tables to store the same type of things will cause problems, like:
Queries would be extremely difficult to write, and performance will be decreased if you have to query from multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require automating a bunch of tasks. Let's say you have a table per year. If a new record is inserted on 2011-01-01 00:00:00.001, will a new table be created? Will you check at each insert whether you must create a new table? How would that affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example you have to use daily the data saved the last month only, and you have to keep everything older, but you do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server, will serve to handle recent data. The second, less available and optimized for writing, will store everything else. Then, on schedule, a program will move old data from the first one to the second.

With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL server 2005 does not limit the number of rows in a table.
If you go down this road and run in to performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes which means that you can store the entire table in less than 4 GB of memory. If you have a decent server(8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up time before the cache is populated.
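The estimate above can be checked with a few lines of arithmetic. The column sizes are SQL Server's fixed storage sizes for int, datetime and a decimal with precision up to 9; the calculation deliberately ignores row headers, page overhead and indexes, so the real footprint will be larger.

```python
INT_BYTES = 4
DATETIME_BYTES = 8
DECIMAL9_BYTES = 5          # decimal with precision 1-9

# Primary key + foreign key (both int), one datetime, one decimal.
row_bytes = 2 * INT_BYTES + DATETIME_BYTES + DECIMAL9_BYTES
row_count = 200_000_000

total_bytes = row_bytes * row_count
total_gib = total_bytes / 2**30

print(row_bytes)             # 21
print(total_bytes)           # 4200000000
print(round(total_gib, 2))   # 3.91
```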
