System Versioning with ETL - sql

I have just started to investigate System Versioning (temporal tables), as it is a pretty cool feature of SQL Server. I was able to successfully set it up on one of my existing tables with the query below. However, whenever my daily ETL runs, it adds data to the history table for all items in Table1, whether there are changes or not. I am using an Insert Into, Update, and Delete SQL Task in SSIS for my ETL. The ETL updates all existing rows, usually with the same data, but I was hopeful System Versioning would only add a new row if the existing row's data actually changed.
Is this a limitation of using an Update statement in a SQL Task in my ETL? Would using Slowly Changing Dimension in Data Flow make a difference or is there a better way to make this work?
Or is this a limitation of System Versioning with ETLs and I should use something else for tracking table changes?
CREATE SCHEMA History
GO
ALTER TABLE Table1
ADD
SysStartTime datetime2(0) GENERATED ALWAYS AS ROW START HIDDEN
CONSTRAINT DF_SysStart DEFAULT SYSUTCDATETIME()
, SysEndTime datetime2(0) GENERATED ALWAYS AS ROW END HIDDEN
CONSTRAINT DF_SysEnd DEFAULT CONVERT(datetime2 (0), '9999-12-31 23:59:59'),
PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime);
GO
ALTER TABLE Table1
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = History.Table1))

Comments to answer
If you issue an update statement that is the equivalent of setting every field to itself, SQL Server doesn't have any shortcut logic built into it that flags it as a noop and does nothing. It sounds like your current ETL pattern is updating every record per run and thus, your historical tables are growing
That makes sense. Would redesigning my ETL to only update changed rows be the best way to address my historical table growing? Does a Slowly Changing Dimension table accomplish this, or is there a better way to make this happen? – Hslew 2 days ago
It's been ages since I used the Slowly Changing Dimension Wizard in SSIS. It was a bit crap back then and I can't see them having improved it at all. I find I have the most success with a source query to deduplicate within my data, then a lookup to determine whether the inbound row matches an existing row. If it doesn't match, I load it into my table as new data. If it matches, I do a second check to see if it's changed; only if it has changed do I send it to a second (staging) table. SSIS doesn't scale updates well, so only after the data flow completes do I update the destination from that staging table (Execute SQL Task).
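To illustrate that pattern, here is a minimal T-SQL sketch of the post-data-flow update, assuming a staging table Staging.Table1 and business columns Id, Col1, Col2 (all names invented, not from the post); only rows whose values actually differ get updated, so System Versioning only writes history for real changes:
UPDATE t
SET t.Col1 = s.Col1,
    t.Col2 = s.Col2
FROM dbo.Table1 AS t
JOIN Staging.Table1 AS s
    ON s.Id = t.Id
WHERE EXISTS (SELECT s.Col1, s.Col2
              EXCEPT
              SELECT t.Col1, t.Col2); -- true only when at least one value differs (NULL-safe comparison)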

Related

How to make 2 remote tables in sync along with determination of Delta records

We have 10 tables on a vendor system and the same 10 tables on our DB side, along with 10 _HISTORIC tables, i.e. one per table, in order to capture updated/new records.
We are reading the main tables from the vendor system using Informatica to truncate and load into our tables. How do we find delta records without using triggers and CDC, as those come with a cost on the vendor system?
Four of the tables have around 200 columns and about 31K records each, with the expectation that 100-500 records might update daily.
We are using a Left Join in Informatica to load new records into our main and _HISTORIC tables.
But what is an efficient approach to find the updated records of a vendor table and load them into our _HISTORIC table?
For new records we use this query:
-- NEW RECORDS
INSERT INTO TABLEA_HISTORIC
SELECT A.*
FROM TABLEA A
LEFT JOIN TABLEB B
ON A.PK = B.PK
WHERE B.PK IS NULL
I believe a system-versioned temporal table is what you are looking for here. You can create a system-versioned table for any table in SQL Server 2016 or later.
For example, say I have a table Employee:
CREATE TABLE Employee
(
    EmployeeId VARCHAR(20) PRIMARY KEY,
    EmployeeName VARCHAR(255) NOT NULL,
    EmployeeDOJ DATE,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START, -- automatically set by the system when the row is inserted/updated
    ValidTo datetime2 GENERATED ALWAYS AS ROW END,     -- auto-updated
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)        -- defines the row validity period
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory)); -- history table name is just an example
The columns ValidFrom and ValidTo determine the time period during which that particular row was active.
For more info, refer to the Microsoft article:
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
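As a hedged illustration of how those period columns are used when querying (the key value 'E001' below is made up):
SELECT EmployeeId, EmployeeName, ValidFrom, ValidTo
FROM Employee FOR SYSTEM_TIME ALL -- returns the current and all historical versions of each row
WHERE EmployeeId = 'E001'
ORDER BY ValidFrom;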
Create staging tables and wipe & load them. Next, use them to find the differences that need to be loaded into your target tables.
The CDC logic needs to be performed this way, but it will not affect your source system.
Another way - not sure if possible in your case - is to load partial data based on some source system date or key. This way you stage only the new data. This improves performance a lot, but makes finding the deletes in source impossible.
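A minimal sketch of that compare step, assuming a wiped-and-reloaded staging table STG_TABLEA with the same layout as TABLEA (table and column names are illustrative):
-- Staged rows that are new or different from what we currently hold
INSERT INTO TABLEA_HISTORIC (PK, COL1, COL2)
SELECT S.PK, S.COL1, S.COL2
FROM STG_TABLEA S
EXCEPT
SELECT A.PK, A.COL1, A.COL2
FROM TABLEA A;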
A. To replicate a smaller subset of records in the source without making schema changes, there are a few options.
Transactional Replication; however, this is not very flexible. For example, it would not allow any differences in the target database, and therefore is not a solution for you.
Identify a "date modified" field in the source. This obviously has to already exist, and will not allow you to identify deletes
Use a "windowing approach" where you simply delete and reload the last months transactions, again based on an existing date. Requires an existing date that isn't back dated and doesn't work for non transactional tables (which are usually small enough to just do full copies anyway)
Turn on change tracking. Your vendor may or may not argue that tihs is a costly change (it isn't) or impacts application performance (it probably doesn't)
https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver15
Turning on change tracking will allow you to more easily identify changes to all tables.
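As a rough sketch of what enabling and consuming change tracking looks like (the database name and the version bookkeeping are illustrative, not from the post):
ALTER DATABASE VendorDb
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);
ALTER TABLE dbo.TABLEA ENABLE CHANGE_TRACKING;

-- Later, pull only the rows changed since the last sync
DECLARE @last_sync_version bigint = 0; -- persist this value between runs
SELECT CT.PK, CT.SYS_CHANGE_OPERATION -- I, U or D
FROM CHANGETABLE(CHANGES dbo.TABLEA, @last_sync_version) AS CT;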
You need to ask yourself: is it really an issue to copy the entire table? I have built solutions that simply copy entire large tables (far larger than 31K records) every hour and there is never an issue.
You need to consider what complications you introduce by building an incremental solution, and whether the associated maintenance and complexity is worth being able to reduce a record copy from 31K (full table) to 500 records (changed). Again a full copy of 31K records is actually pretty fast under normal circumstances (like 10 seconds or so)
B. Target table
As already recommended by many, you might want to consider a temporal table, although if you do decide to do full copies, a temporal table might not be the best option.

Proper way of updating a whole table in Redshift, drop table + create table vs. truncate + insert into table

Currently I have many tables for which I have to update the information they hold, sometimes on a daily or weekly basis. So far, I've been doing this by a combination of DROP TABLE IF EXIST some_schema.some_table_name; followed by CREATE TABLE some_schema.some_table_name AS ( SELECT ... FROM ... WHERE ...); and I would like to know what is the "best-practice" or proper way of doing so.
I've read that INSERT operations in Redshift are quite expensive, so I've been avoiding its usage, but maybe the use of TRUNCATE with INSERT is better than dropping and creating.
How can I confirm which option is better?
I've seen this article from Redshift docs, but I'm not sure if it is the best option, since I could have not only to remove records, but keeping and inserting as well.
If your desire is to completely erase a table and replace the data then the general pattern you are following is fine. However, there are a few things you should be doing to make things safer / better.
There are 3 patterns to do this and one is clearly the lowest performance. These are Delete/Insert, Truncate/Insert, and Drop/Insert. Of these, Delete/Insert is NOT what you want to do from a performance point of view. This process invalidates all the rows in the table (it doesn't delete them) and adds new valid rows. This doubles the size of the table, wasting space, and the table needs to be vacuumed. The only upside of this approach is that it doesn't have the downsides of the other approaches, but this only matters in certain situations. Go down this path only if you have to.
Truncate/Insert is fast and maintains the same table id as the original table. Because truncate operates on the blocks of the table (unlinking them) it is fast but there is some small overhead in managing all the block links. Since the table definition is unchanged all DDL stays defined and dependent views can keep pointing to the table. The downside with truncate is that it forces a COMMIT to occur which means that until the table is repopulated with new data other users of the database can see an empty table. This can lead to incorrect results during these windows. Not good.
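A minimal sketch of the Truncate/Insert pattern (table and source names are placeholders, not from the question):
TRUNCATE some_schema.some_table_name; -- forces an implicit COMMIT, so readers can briefly see an empty table
INSERT INTO some_schema.some_table_name
SELECT *
FROM some_schema.some_source_table; -- placeholder for your SELECT ... FROM ... WHERE ... query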
Lastly there is Drop/Create/Insert. This approach is marginally (very slightly and only for large tables) faster than truncate in the ideal case. It just throws away the old blocks. There is some additional cost to setting up the new table (of the same name) so truncate and drop are about the same speed unless the table is large. Since Drop can be inside of a transaction block the empty table won't be seen by third parties (if done correctly). The downside with this approach is that the old table and new table are entirely different tables (different oids) - they just happen to have the same name. This means that any dependent (regular) views will need to be dropped and recreated as well. Also, since this table is "going away", the commit of the transaction cannot complete until all uses of the table are complete. This becomes a large problem when someone leaves a transaction open in their bench and goes home for the night. Since the table needs to be recreated, your process needs to know the complete and correct DDL for the table.
Hopefully this gives you some idea of when to use these different approaches. Two things I see that could be better in your current code - 1) You are not using a transaction block (as far as I can tell), so there is a window when others will see that the table doesn't exist or is empty. This may or may not be important to you, but be aware. 2) "Create table As" doesn't define the DDL of the table in a performant structure (and possibly defines it incorrectly). You should always specify your permanent tables fully. Sort and dist keys matter, as do varchar lengths, data types etc. This is a time bomb waiting to go off.
Per request for an example of drop/create/insert:
As I mentioned, there are lock dependency issues that can arise with this method, so I like to use a "swap & drop" approach to this path. This makes the new information visible to users at the "swap", so even if the "drop" gets blocked, things get published on time. This doesn't remove the lock risk, as a lock can still prevent the process (session) from completing; it just makes it so that the new data is visible (published) while you hunt down the offender.
(Please note that for transactions to execute properly you need to be sure that extra COMMITs are not being inserted into the process. This can happen with benches that are configured in "autocommit" mode.)
Create table new_table ( ... ) ...; -- make the new table but with a different name (and unique from other tables) than the existing table
Insert into new_table ... ; -- put the desired data into the new table
Analyze new_table; -- to ensure metadata is up to date
Begin; -- start transaction
Alter table perm_table rename to old_table; -- rename existing table
Alter table new_table rename to perm_table; -- complete the swap
Commit; -- publish the new data for all to see but transactions still using the original data can keep doing so
Drop table old_table; -- remove the old data to free up space
Commit;
This process is just one example. Sometimes you want to keep the old versions of the table around for a while (history / error recovery) so you will date stamp the old data and have a separate process to free up the space. This also helps with stray locks clogging up the works - only the clean up process gets stalled. You can also have the recreation of views in the process so that these are updated in the same transaction. And so on.
I think you will need to use the Update command. I understand that dropping a table is a risky move, as you might lose all of the data from your database.
UPDATE some_table_name
SET Id = 'whatever you want to update',
    Name = 'whatever you want to update',
    LastName = 'whatever you want to update',
    OtherTableColumn = 'whatever you want to update';
In the above code I assumed your table had four columns (1-Id, 2-Name, 3-LastName, 4-OtherTableColumn). If you have more or fewer columns, adjust accordingly.
I would also write an update procedure for this (and for each table), so if you need to update somewhat frequently you can just use the procedure; I think it's quicker. Below would be my procedure:
CREATE PROC sp_UpdateSome_table_name
    @Id int,
    @name nvarchar(255),
    @lastname nvarchar(255),
    @OtherTableColumn int
AS
BEGIN
    UPDATE some_table_name
    SET Name = @name,
        LastName = @lastname,
        OtherTableColumn = @OtherTableColumn
    WHERE Id = @Id
END
You want to make sure that each column in your table is defined with the correct data type in the procedure. For example, I assumed above that @Id was int, Name was nvarchar(255) etc. If you want to allow yourself not to enter any data (allowing null) in certain table columns when updating, then after the data type you can add a NULL default; for example, if you write @Id int = NULL, then that parameter can be omitted and treated as null; but if you are not sure what this is, simply ignore this sentence for now.
Once you are sure the above paragraph is good (data types are correct), select the entire procedure and execute it (F5). This will store the procedure.
Then you can execute the procedure every time you want to update your table, as shown below:
EXEC sp_UpdateSome_table_name 1, 'John', 'Smith', 77
If you highlight the above command and execute it (F5), it will update the row that has Id=1, setting the name to John, the last name to Smith and the other column to 77, whatever they were before. If there is no row in the table with Id=1, the update simply affects nothing.
Keep in mind that the last column assignment in the SET list must not end with a comma. The code above is written correctly; I'm just pointing it out because you might add a comma out of habit.

DATA_CONSISTENCY_CHECK is on for my table, but the temporal table still inserts another row for a same-data update. How can I restrict this in T-SQL?

DATA_CONSISTENCY_CHECK is on for my table. I'm trying to check data consistency for audit purposes. When I update a row in the main table with the same values, the temporal table still keeps a history row for that unchanged data, which makes it difficult to track the version changes. I'm using MS SQL Server.
You misunderstood the function of the DATA_CONSISTENCY_CHECK option. It's used to check that the time ranges defined by the system_start_time_column_name and system_end_time_column_name columns in PERIOD FOR SYSTEM_TIME do not overlap between the base and the historical table when you enable the link between them (this is done when you execute the CREATE/ALTER TABLE command).
If you need data deduplication in the historical table, you have to implement it yourself. It can be a maintenance task which disables the link, removes duplicates, updates the time range columns correctly and then re-enables the link between the base and historical table.
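A hedged sketch of such a maintenance task (schema, table and column names are illustrative, and the fixing-up of the time range columns described above is left out):
ALTER TABLE dbo.Table1 SET (SYSTEM_VERSIONING = OFF); -- break the link to the history table

WITH Dups AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY Id, Col1, Col2 -- the business columns that define "the same data"
               ORDER BY SysStartTime) AS rn
    FROM History.Table1
)
DELETE FROM Dups WHERE rn > 1; -- keep one row per identical version

ALTER TABLE dbo.Table1
    SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = History.Table1,
                                 DATA_CONSISTENCY_CHECK = ON)); -- re-enable the link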

Keeping BigQuery table data up-to-date

This is probably incorrect use case for BigQuery but I have following problem: I need to periodically update Big Query table. Update should be "atomic" in a sense that clients which read data should either use only old version of data or completely new version of data. The only solution I have now is to use date partitions. The problem with this solution is that clients which just need to read up to date data should know about partitions and get data only from certain partitions. Every time I want to make a query I would have first to figure out which partition to use and only then select from the table. Is there any way to improve this? Ideally I would like solution to be easy and transparent for clients who read data.
You didn't mention the size of your update, so I can only give some general guidelines.
Most BigQuery updates, including single DML (INSERT/UPDATE/DELETE/MERGE) and single load job, are atomic. Your reader reads either old data or new data.
Lacking multi-statement transactions right now, if you do have updates which don't fit into a single load job, the solution is:
Load the updates into a staging table; after all loads have finished,
use a single INSERT or MERGE to merge the updates from the staging table into the primary data table (a sketch follows below).
The drawback: scanning the staging table is not free.
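A minimal sketch of that second step, assuming a primary table dataset.main and a staging table dataset.staging keyed by an id column (all names are invented):
MERGE dataset.main T
USING dataset.staging S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET value = S.value -- overwrite with the staged version
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (S.id, S.value);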
Update: since you have multiple tables to update atomically, there is a tiny trick which may be helpful.
Assuming each table that you need to update has an ActivePartition column as its partition key, you can have a table with only one row:
CREATE TABLE ActivePartition (active DATE);
Each time after loading, you set ActivePartition.active to the new active date, then your users use a script:
DECLARE active DATE DEFAULT (SELECT active FROM ActivePartition);
-- Actual query
SELECT ... FROM dataTable WHERE ActivePartition = active

The best way to Update the database table through a pyspark job

I have a spark job that gets data from multiple sources and aggregates into one table. The job should update the table only if there is new data.
One approach I could think of is to fetch the data from the existing table, and compare with the new data that comes in. The comparison happens in the spark layer.
I was wondering if there is any better way to compare, that can improve the comparison performance.
Please let me know if anyone has a suggestion on this.
Thanks much in advance.
One approach I could think of is to fetch the data from the existing
table, and compare with the new data that comes in
IMHO, comparing the entire existing data in order to load new data is not performant.
Option 1:
Instead, you can create a google-bigquery partitioned table with a partition column to load the data into, and while loading new data you can check whether it belongs to the same partition (see the sketch after the links below).
Hitting partition-level data in Hive or BigQuery is more useful/efficient than selecting the entire dataset and comparing it in Spark.
The same is applicable for Hive as well.
see this Creating partitioned tables
or
Creating and using integer range partitioned tables
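A hedged sketch of such a partitioned table (dataset, table and column names are invented):
CREATE TABLE IF NOT EXISTS mydataset.aggregated_data
(
  id STRING,
  value NUMERIC,
  load_date DATE -- partition column checked on each load
)
PARTITION BY load_date;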
Option 2:
Another alternative: Google BigQuery has a MERGE statement. If your requirement is to merge the data without comparison, then you can go ahead with the MERGE statement; see the doc link below.
A MERGE statement is a DML statement that can combine INSERT, UPDATE, and DELETE operations into a single statement and perform the operations atomically.
Using this, we can get a performance improvement because all three operations (INSERT, UPDATE, and DELETE) are performed in one pass. We do not need to write an individual statement to update changes in the target table.
There are many ways this problem can be solved; one of the less expensive, more performant and scalable ways is to use a datastore on the file system to determine truly new data.
As data comes in for the 1st time, write it to 2 places - the database and a file (say in s3). If data is already in the database, then you need to initialize the local/s3 file with the table data.
As data comes in from the 2nd time onwards, check if it is new based on its presence in the local/s3 file.
Mark delta data as new or updated. Export this to the database as an insert or update.
As time goes by this file will get bigger and bigger. Define a date range beyond which updated data won’t be coming. Regularly truncate this file to keep data within that time range.
You can also bucket and partition this data. You can use deltalake to maintain it too.
One downside is that whenever the database is updated, this file may need to be updated depending on whether the relevant data has changed or not. You can maintain a marker on the database table to signify the sync date. Index that column too. Read changed records based on this column and update the file/deltalake.
This way your Spark app will be less dependent on the database. Database operations are not very scalable, so keeping them away from the critical path is better.
Shouldn't you have a last update time in your DB? The approach you are using doesn't sound scalable, so if you had a way to set an update time on each row in the table, it would solve the problem.