Why is the end date in a temporal table the maximum system time and not just NULL - sql-server-2016

We are looking at implementing a bitemporal solution for a few of our tables, because they must record both application time and system time.
I know SQL 2016 has native support for only the system-time component, so we will be converting tables to be temporal and modifying what we have now to create a fully functional bitemporal solution.
My question is about consistency. The system-time end date component of the temporal table is set to 9999-12-31 23:59:59.9999999, so I thought it would be a good idea to set our application/valid time end date to also be 9999-12-31 23:59:59.9999999.
However, I have been asked "Why can't we just set it to NULL to indicate that the period in the main table has no end?"
So why is that? Why did MS choose to use 9999-12-31 23:59:59.9999999 rather then NULL?
Is it as simple as making queries (potentially) easier to write? I guess BETWEEN works better with two actual date values, but I can't think of much else.

Its because those columns are Period columns. (In essence you are correct) Since a period is a definition of time, it logically shouldn't be compared against null.This way MSSQL can compare time values against time values and keep track of all updates ( as opposed to comparing against null and having to assume null doents mean lack of data but an end of a period.
Period start column: The system records the start time for the row in this column, typically denoted as the SysStartTime column.
Period end column: The system records the end time for the row in this column, typically denoted at the SysEndTime column.
MSDN

Related

How many records were counted in specific day

Is it possible to check in DB2 how many records were counted in specific table in specific day in past
I have a table with name 'XYZ' and I would like to check row count for specific day e.g. for 10.09.2020, for 05.09.2020 and for 01.09.2020
In ordinary SQL, without special provisions, no, you can´t!
Depending on your usage scenario, there are several ways to achieve this function. Here are three that I can think of:
If you table has a timestamp field or you can add one and you can guarantee there will be no rows deleted: You can just count the rows where the timestamp is smaller then your desired date. Cheap, performance wise, but deletes may make this impossible.
You could set up a procedure that runs daily and counts your rows to write them in a different table. This van also be rather cheap from a performance point of view, but you will be limited to the specific "snapshot" times you configured beforehand and you may have conditions where the count procedure did not run an therefore data is missing.
You could create an audit-table and a trigger on the table you are interested in to log every insert and delete operation on the table with a timestamp. This is the most performance heavy solution, but the only one that will give you always a full picture of the row count at any given time.

Keeping track of mutated rows in BigQuery?

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save intermediate state of the table using the time travel feature. Or store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
An approach would be to have 3 tables:
one basetable in "append only" mode, only inserts are added, and updates as full row, in this table would be every record like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if there is a special column kept in the first table)
a livetable where you hold the current data (in this table you would do your MERGE statements most probably from the first base table.
If you choose partitioning and clustering, you could end up leverage a lot for long time storage discounted price and scan less data by using partitioning and clustering.
If the table is large but the amount of data updated per day is modest then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases, like the first today's check should filter for last_updated_date being either today or yesterday.
Depending of how modest this amount of data updated throughout a day is, even repeatedly querying the entire table all day could be affordable because BQ engine will scan one daily partition only.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
I inferred from that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones) but you can modify the table UPDATE statements so that this column will be added to the newly modified rows.
Therefore I assumed you can modify the updates further to set the additional last_updated_date column which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old this row is, it will acquire two new columns last_updated and last_updated_date - unless both columns have already been added by the previous update in which cases the two columns will be updated rather than added. If there are several updates to the same row between the update checks, then the latest update will still make the row to be discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check will be stored in last_checked and where this piece of data is held (table, durable config) is implementation dependent.
discover if the current check is the first today's check. If so then additionally search for last_updated_date=yesterday AND last_updated>last_checked.
Note 1If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause table scan. And subject to ‘modest’ assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2The downside of this approach is that the checks for updates will not find rows that had been updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the__NULL__ partition with rows that never were updated.) But I assume until the changes to the UPDATE statements are made it will be impossible to distinguish between updated rows and non-updated ones anyway.
Note 3 This is an explanatory concept. In the real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning or clustering (with partitioning on a fake column) or both.
The detailed explanation of the initial (e.g. above P.S.) answer ends here.
Note 4
clustering only helps performance
From the point of view of table scan avoidance and achieving a reduction in the data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comment you mentioned there is already some partitioning in place. I’d suggest to examine if the existing partitioning is indispensable, can it be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have an identical structure to MyData with the exception of MyDataStaging has an additional Timestamp field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
DatFlow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging to MyDataUpdate (MyDataUpdate will now always contain only a unique list of rows/values that have been changed). Then the process upserts/merges from MyDataUpdate into MyData as well as being exported & downloaded to be loaded into PostgreSQL. Then staging/update tables are emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.

Design question: Best approach to store and retrieve deltas in an SQL table

I have a historical table which contains many price columns and only few columns change at a time. Currently I am just inserting all the data as new records and this change could come 100+ times every second. So it is resulting in growing of table size pretty quick.
I am trying to find the better design for the table to keep the table size to minimum and the best query to retrieve the data when required. I am not much worried about the data retrieval performance, but it should be somewhere in the middle when used for reports. Priority is to keep the table size to its minimum.
Data from this historical table is not retrieved on a day to day basis. I have a transaction table like *1 Current Design for that purpose.
Here are the details of my implementation.
1) Current Design
2) Planned design - 1
Question:
1) If I use the above table structure what is the best query to get the results like shown in Current design #1
3) Planned design - 2
Question:
1) How much performance hit this would be compared to Planned design #1.
2) Also if I go in that route what is the best query to get the results like what shown in Current design #1?
End question:
I assume planned design #1 will take more table space VS planned design #2. But planned design 2 will take more time to retrieve the query. Is there any case I assumption can go wrong?
Edit: I have only inserts going to this table. No updates or deletion is ever made to this.
In fact, I think you have better plan. You can use Temporal Tables that come from SQL Server 2016.
This type managed by sql to track change of table in best way.
Visit this Link: https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-2017
I have s similar situation where I'm loading a bunch of temperature sensors every 10 seconds. As I'm using the express version of MSSQL I'm looking a at max database size of 10GB so I went creative to make it last as long as possible.
My table-layout is pretty much identical to yours in that I've got 1 timestamp + 30 value columns + another 30 flag columns.
The value columns are numeric(9,2)
The value columns are marked SPARSE, if the value is identical (enough) to the value before it I store NULL instead of repeating the value.
The flag columns are bit and indicate whether the value is 'extrapolated' or from an actual source (later on that)
I've also got another table that holds the following information for each sensor:
last time the sensor was updated; that way if a new value comes in I can easily decide if this requires just a new insert at the end of the table or whether I need to go through all the logic of inserting/updating a value somewhere in-between existing numbers.
the value of that latest entry
the sensitivity for said sensor; this way I don't have to hardcode it and can set it on a per-sensor basis
Anyway, for now my stream of information is that I've got several programs each reading data from different sources (arduino, web, ...) and dumping this in .csv files and then a 'parser' program that reads these files into the database once in a while. As I'm loading the values 1 by 1 instead of row-based this isn't very efficient but I'm now doing about 3500 values / second so I'm not overly concerned. I'll agree that this is only true when loading the values in historical order and because I'm using the helper table.
Currently I've got almost 1 year of information in there which corresponds to
2.209.883 rows
5.799.511 values spread over 18 sensors (yes, I've still got room for 12 more without needing to change the table)
This means I've only got 15% of the fields filled in, or looking at it the other way around, when I'd fill in every record rather than putting NULL in case of repetition, I'd have almost 8 times that many numbers there.
As for space requirements: I decided to reload all numbers last night 'for fun' but noticed that even though most .csv files come in historically, they would do a range of columns from Jan to Dec, then another couple of columns from Jan to Dec etc.. This resulted in quite a bit of fragmentation: 70% in fact! At that time the table required 282Mb of disk-space.
I then rebuilt the table bringing the fragmentation down to 0% and the reserved space went down to 118Mb (!).
For me this is more than good enough
it's unlikely the table will outgrow the 10GB limit anytime soon, especially if I stick to (online) rebuilding it now and then.
loading data is sufficiently fast (although reloading the entire year took a couple of hours)
reporting is sufficiently fast (for now, haven't tried to connect any 'interactive' reporting tools to it yet; but for some simple graphs in excel it works just fine IMHO).
FYI: for reporting I've created a rater simple stored procedure that picks a from-to range for a given set of columns; dumps it in a temp-table and then fills in the blanks by figuring out the NULL-ranges and then filling these in with the value that preceded the range. This works quite well although fetching the 'first' value sometimes takes a while as I can't predict how far back in time the last value should be looked for (sometimes there is none).
To work around this I've added another process that extrapolates the values for every 'hour' timestamp. That way the report never needs to go back more than 1 hour. A flag-column in the readings table indicates whether the value on a record for a given field was extrapolated or not.
(note: this makes updating values in the past more problematic but not impossible)
Hope this helps you out a bit in your endeavors, good luck!

SQL server data backup/warehouse

I've been asked to do a snapshots of certain tables from the database, so in the future we can have a clear view of the situation for any given day in the past. lets say that one of such tables looks like this:
GKEY Time_in Time_out Category Commodity
1001 2014-05-01 10:50 NULL EXPORT Apples
1002 2014-05-02 11:23 2014-05-20 12:05 IMPORT Bananas
1003 2014-05-05 11:23 NULL STORAGE Null
The simples way to do a snapshot would be creating copy of the table with another column SNAPSHOT_TAKEN (Datetime) and populate it with an INSERT statement
INSERT INTO UNITS_snapshot (SNAPSHOT_TAKEN, GKEY,Time_in, Time_out, Category, Commodity)
SELECT getdate() as SNAPSHOT_TAKEN, * FROM UNITS
OK, it works fine, but it would make the destination table quite big pretty soon, especially if I'd like to run this query often. Better solution would be checking for changes between current live table and the latest snapshot and write them down, omitting everything that hasn't been changed.
Is there a simply way to write such query?
EDIT: Possible solution for the "Forward delta" (assuming no deletes from original table)
INSERT INTO UNITS_snapshot
SELECT getdate() as SNAP_DATE,
r.* -- Here goes all data from from the original table
CASE when b.gkey is null then 'I' else 'U' END AS change_type
FROM UNITS r left outer join UNITS_snapshot b
WHERE (r.time_in <>b.time_in or r.time_out<>b.time_out or r.category<>b.category or r.commodity<>b.commodity or b.gkey is null)
and (b.snap_date =
(SELECT max (b.snap_date) from UNITS_snapshot b right outer join UNITS r
on r.gkey=b.gkey) or b.snap_date is null)
Assumptions: no value from original table is deleted. Probably also every field in WHERE should be COALESCE (xxx,'') to avoid comparing null values with set ones.
Both Dan Bracuk and ITroubs have made very good comments.
Solution 1 - Daily snapshop
The first solution you proposed is very simple. You can build the snapshot with a simple query and you can also consult it and rebuild any day's snapshot with a very simple query, by just filtering on the SNAPSHOT_TAKEN column.
If you have just some thousands of records, I'd go with this one, without worrying too much about its growing size.
Solution 2 - Daily snapshop with rolling history
This is basically the same as solution 1, but you keep only some of the snapshots over time... to avoid having the snapshot DB growing indefinitely over time.
The simplest approach is just to save the snapshots of the last N days... maybe a month or two of data. A more sophisticated approach is to keep snapshot with a density that depends on age... so, for example, you could have every day of the last month, plus every sunday of the last 3 months, plus every end-of-month of the last year, etc...
This solution requires you develop a procedure to handle deletion of the snapshots that are not required any more. It's not as simple as using getdate() within a query. But you obtain a good balance between space and historic information. You just need to balance out a good snapshot retainment strategy to suit your needs.
Solution 3 - Forward row delta
Building any type of delta is a much more complex procedure.
A forward delta is built by storing the initial snapshot (as if all rows had been inserted on that day) and then, on the following snapshots, just storing information about the difference between snapshot(N) and snapshot(N-1). This is done by analyzing each row and just storing the data if the row is new or updated or deleted. If the main table does not change much over time, you can save quite a lot of space, as no info is stored for unchanged rows.
Obviously, to handle deltas, you now need 2 extra columns, not just one:
delta id (you snapshot_taken is good, if you only want 1 delta per
day)
row change type (could be D=deleted, I=inserted, U=updated... or
something similar)
The main complexity derives from the necessity to identify rows (usually by primary key) so as to calculate if between 2 snapshots any individual row has been inserted, updated, deleted... or none of the above.
The other complexity comes from reading the snapshot DB and building the latest (or any other) snapshot. This is necessary because, having only row differences in the table, you cannot simply select a day's snapshot by filtering on snapshot_taken.
This is not easy in SQL. For each row you must take into account just the final version... the one with MAX snapshot_taken that is <= the date of the snapshot you want to build. If it is an insert or update, then keep the data for that row, else (if it is a delete) then ignore it.
To build a delta of snapshot(N), you must first build the latest snapshot (N-1) from the snapshot DB. Then you must compare the two snapshots by primary key or row identity and calculate the change type (I/U/D) and insert the changes in the snapshot DB.
Beware that you cannot delete old snapshot data without consolidating it first. That is because all snapshots are calculated from the oldest initial one and all the subsequent difference data. If you want to remove a year's of old snapshots, you'll have to consolidate the old initial snapshot and all the year's variations in a new initial snapshot.
Solution 4 - Backward row delta
This is very similar to solution 3, but a bit more complex.
A backward delta is built by storing the final snapshot and then, on the following snapshots, just storing information about the difference between snapshot(N-1) and snapshot(N).
The advantage is that the latest snapshot is always readily available through a simple select on the snapshot DB. You only need to merge the difference data when you want to retrieve an older snapshot. Compare this to the forward delta, where you always need to rebuild the snapshot from the difference data unless you are actually interested in the very first snapshot.
Another advantage (compared to solution 3) is that you can remove older snapshots by just deleting the difference data older than a particular snapshot. You can do this easily because snapshots are calculated from the final snapshot and not from the initial one.
The disadvantage is in the obscure logic. Difference data is calculated backwards. Values must be stored on the (U)pdate and (D)elete variations, but are unnecessary on the I variations. Going backwards, rows must be ignored if the first variation you find is an (I)nsert. Doable, but a bit trickier.
Solution 5 - Forward and backward column delta
If the main table has many columns, or many long text or varchar columns, and only a bunch of these are updated, then it could make sense to store only column variations instead of row variations.
This is done by using a table with this structure:
delta id (you snapshot_taken is good, if you only want 1 delta per
day)
change type (could be D=deleted, I=inserted, U=updated... or
something similar)
column name
value
The difference can be calculated forward or backward, as per row deltas.
I've seen this done, but I really advise against it. There are just too many disadvantages and added complexity.
Value is a text or varchar, and there are typecasting issues to handle if you have numeric, boolean or date/time values... and, if you have a lot of these, it could very well be you won't be saving as much space as you think you are.
Rebuilding any snapshot is hell. Altogether... any operation on this type of table really requires a lot of knowledge of the main table's structure.

SQL Server - resume cursor from last updated row

I need to do a cursor update of a table (milions of rows). The script should resume from last updated row if it would be started again (e.g. in case of a server restart).
What is the best way to resolve this? Create a new table with the last saved id? Use the tables extended propertes to save this info?
I would add an "UpdateDate" or "LastProcessDate" or some similarly named datetime column to your table and use this. When running your update, simply process any number of records that aren't the max UpdateDate or are null:
where UpdateDate < (select max(UpdateDate) from MyTable) or UpdateDate is null
It's probably a good idea to grab the max UpdateDate (#maxUpdateDate?) at the beginning of your process/loop so it does not change during a batch, and similarly get a new UpdateDate (#newUpdateDate?) at the beginning of your process to update each row as you go. A UTC date will work best to avoid DST time changes.
This data would now be a real attribute of your entity, not metadata or a temporary placeholder, and this would seem to be the best way to be transactionally consistent and otherwise fully ACID. It would also be more self-documenting than other methods, and can be indexed should the need arise. A date can also hold important temporal information about your data, whereas IDs and flags do not.
Doing it in this way would make storing data in other tables or extended properties redundant.
Some other thoughts:
Don't use a temp table that can disappear in many of the scenarios where you haven't processed all rows (connection loss, server restart, etc.).
Don't use an identity or other ID that can have gaps filled, be reseeded, truncated back to 0, etc.
The idea of having a max value stored in another table (essentially rolling your own sequence object) has generally been frowned upon and shown to be a dubious practice in SQL Server from what I've read, though I'm oddly having trouble locating a good article right now.
If at all possible, avoid cursors in favor of batches, and generally avoid batches in favor of full set-based updates.
sp_updateextendedproperty does seem to behave correctly with a rollback, though I'm not sure how locking works with that -- just FYI if you ultimately decide to go down that path.