DBT: set valid_from and valid_to date when retrieving historical data with dbt snapshots

We set up dbt snapshots to build a Redshift schema that tracks slowly changing dimensions.
Our dbt snapshots read a Spectrum external schema partitioned by year | month | day in an S3 bucket.
Everything works fine when we run it day by day going forward.
But we cannot load historical data with the correct change dates.
To retrieve historical data, our dbt snapshot selects where the partition = date_to_snapshot, BUT dbt_valid_from and dbt_valid_to are set to the day of execution and not to the date of the partition we snapshot.
Is it possible to set dbt_valid_from and dbt_valid_to so that they reflect the date we snapshot?
Thanks for your help

There are two "strategies" for a snapshot, timestamp and check.
The timestamp strategy relies on having an updated_at field in the source, and is the recommended way to snapshot a table. If you use the timestamp strategy, the dbt_valid_from and dbt_valid_to fields will be populated as you want. See the docs.
If you don't have an updated_at field in your source, or if that field isn't reliable and you need to use the check strategy, then you should build a model on top of your snapshot that calculates the timestamp you want (although if you don't have an updated_at field, I don't know how you would know when that record changed). You should also consider snapshotting more frequently (I typically configure a separate "job" for dbt snapshot that runs much more frequently than the job that runs dbt build).
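When history has to be rebuilt from dated partitions, the validity range can be derived from the partition date instead of the run date. The following is a minimal Python sketch of that idea, not dbt code: the function name build_history, the "id" key and the row layout are assumptions made for illustration, and hard deletes are ignored.

```python
from datetime import date

def build_history(partitions, key="id"):
    """Replay daily partitions in date order and emit SCD2-style rows.

    partitions: dict mapping a partition date to a list of row dicts.
    valid_from / valid_to come from the partition date, not the run date.
    """
    history = []   # every version, closed or still open
    current = {}   # key value -> the open version for that key
    for part_date in sorted(partitions):
        for row in partitions[part_date]:
            open_row = current.get(row[key])
            if open_row is not None and open_row["data"] == row:
                continue  # unchanged since the previous partition
            if open_row is not None:
                open_row["valid_to"] = part_date  # close the old version
            version = {"data": row, "valid_from": part_date, "valid_to": None}
            history.append(version)
            current[row[key]] = version
    return history

# Hypothetical partitions: the row only changes on the third day.
partitions = {
    date(2021, 1, 1): [{"id": 1, "status": "new"}],
    date(2021, 1, 2): [{"id": 1, "status": "new"}],
    date(2021, 1, 3): [{"id": 1, "status": "shipped"}],
}
history = build_history(partitions)
```

Here the first version is valid from 2021-01-01 to 2021-01-03 and the second one stays open, regardless of the day the replay is executed.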

Related

Hive Date Partitioned table - Streaming Data in S3 with mixed dates

I have extensive experience working with Hive Partitioned tables. I use Hive 2.X. I was interviewing for a Big Data Solution Architect role and I was asked the below question.
Question: How would you ingest streaming data into a Hive table partitioned on date? The streaming data is first stored in an S3 bucket and then loaded into Hive. Although the S3 bucket names have a date identifier such as S3_ingest_YYYYMMDD, the content could have data for more than one date.
My Answer: Since the content could have more than one date, creating an external table directly on it might not be possible, since we want to read each file and distribute its rows based on the date. I suggested we first load the S3 bucket into an external staging table with no partitions and then load/insert into the final date-partitioned table using dynamic partition settings, which will distribute the data to the correct partition directories.
The interviewer said my answer was not correct and I was curious to know what the correct answer was, but ran out of time.
The only caveat in my answer is that, over time, the partitioned date directories will accumulate many small files, which can lead to the small-files problem; this can be handled by a batch maintenance process.
What are the other/correct options to handle this scenario?
Thanks.

It depends on the requirements.
As I understand it, if a file or folder named S3_ingest_YYYYMMDD can contain more than one date, then some events are loaded the next day or even later. This is a rather common scenario.
Ingestion date and event date are two different dates. Put the ingested files into a landing zone (LZ) table partitioned by ingestion date. That way you keep track of the data as it originally arrived. If reprocessing is possible, use ingestion_date as a bookmark for reprocessing the LZ table.
Then schedule a process that takes the last two or more days of ingestion dates and loads them into a table partitioned by event_date. The last day will always be incomplete, and you may need to increase the look-back period to 3 or more ingestion days (using a filter like ingestion_date >= current_date - 2 days), depending on how far back the event dates in an ingestion can reach. This process uses dynamic partitioning by event_date and applies your logic (cleaning, etc.) while loading into the ODS or DM.
This approach is very similar to what you proposed. The difference is in the first table: it should be partitioned so that you can process data incrementally and easily restate it if you need to change the logic, or if upstream data was restated and reloaded into the LZ.
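The look-back step described above can be sketched in plain Python (not Hive). The function name reload_event_partitions, the record layout and the sample dates are assumptions for illustration; how the regrouped records are merged into existing event_date partitions (append with deduplication, or overwrite) is left open.

```python
from collections import defaultdict
from datetime import date, timedelta

def reload_event_partitions(landing_zone, today, lookback_days=2):
    """Regroup recent landing-zone data by event date.

    landing_zone: dict mapping ingestion_date -> list of records,
    each record carrying its own "event_date".
    Only ingestion dates >= today - lookback_days are read (the bookmark filter).
    """
    cutoff = today - timedelta(days=lookback_days)
    by_event_date = defaultdict(list)
    for ingestion_date, records in landing_zone.items():
        if ingestion_date < cutoff:
            continue  # older ingestion partitions are assumed to be final
        for record in records:
            by_event_date[record["event_date"]].append(record)
    return dict(by_event_date)

# Hypothetical landing zone: events can arrive a day or more late.
landing_zone = {
    date(2024, 1, 1): [{"id": 1, "event_date": date(2023, 12, 31)}],
    date(2024, 1, 9): [{"id": 2, "event_date": date(2024, 1, 8)},
                       {"id": 3, "event_date": date(2024, 1, 9)}],
    date(2024, 1, 10): [{"id": 4, "event_date": date(2024, 1, 8)}],
}
result = reload_event_partitions(landing_zone, today=date(2024, 1, 10))
```

The late record with id 4 lands in the 2024-01-08 event partition even though it was ingested on 2024-01-10, while the old 2024-01-01 ingestion is not touched.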

Comparing yesterday's data with today's data

I have 2 parquet tables, one for today and one for yesterday. What I want to do is compare what has changed in today's table, e.g.:
which new rows have been added
which rows have been deleted and when they have been deleted
which rows have been changed
The tables themselves have the columns "createdAt" and "updatedAt", which I can use for this purpose.
I'm working with Databricks/Apache Spark so I can either use their built-in functions or an SQL query. I'm not sure how to go about this, any general ideas are appreciated!

Maintain an audit table behind your main table. A row must be inserted into the audit table whenever you perform an insert, update or delete on the main table. The audit table should include the createdAt of the main table and the current date stamp.
If you encode the transaction type (insert, update or delete) as 1, 2, 3, that will help query performance.

As I don't know the load type (full or delta) for your table, I will try to cover both scenarios:
Full Load -
For this, you only need today's table, as it contains all the previous days' records as well.
Hence you only need a condition that selects the records modified after yesterday's load, using the updatedAt column, i.e.
updatedAt > yesterday's load date
Delta Load -
For a delta load, each day you get only the modified records (new, updated or deleted), hence querying today's table without any condition will serve the purpose.
Now, on the Spark side, as you have a large number of records, you can adjust the number of shuffle partitions at runtime using something like the following:
spark.sql("set spark.sql.shuffle.partitions = 1500");
You can find other optimization techniques here:
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/
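With full loads, the updatedAt filter finds new and modified rows, but rows deleted since yesterday are simply absent from today's table. A minimal sketch, in plain Python rather than Spark, of comparing the two loads by key; the "id" key column and the sample rows are assumptions, since the question does not name a key.

```python
def diff_tables(yesterday, today, key="id"):
    """Classify rows as added, deleted or changed between two full loads."""
    old = {row[key]: row for row in yesterday}
    new = {row[key]: row for row in today}
    added = [new[k] for k in sorted(new.keys() - old.keys())]
    deleted = [old[k] for k in sorted(old.keys() - new.keys())]
    changed = [new[k] for k in sorted(new.keys() & old.keys())
               if new[k] != old[k]]
    return added, deleted, changed

# Hypothetical sample data.
yesterday = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
today = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]
added, deleted, changed = diff_tables(yesterday, today)
```

Note that a comparison like this only shows that a row disappeared between the two loads; the exact deletion time is not recoverable from the two tables alone.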

SQL Updating only from previous day

I have loaded all the content from TableA, where TableA has a field called CreatedDate (DateTime).
I have copied only the content of this table that is actually useful.
Rather than using SQL Agent to schedule a job that copies this table every night, I would like to append only the newly added data and perhaps apply any updates made in the last day.
Just wondering what's the best way to do this, taking into account that we already have the initial data loaded and just want to append new data each day.
I was thinking about using the following in the WHERE clause:
CreatedDate between getdate()-1 and getdate()
Is this the best way to do it?
Thank you

This is exactly the kind of situation that SSIS is built for. Essentially you want to load only new data every day and in the ETL process this is known as an incremental load.
The best way of accomplishing your desired outcome is to create an ETL package that only grabs rows where CreatedDate > MaxDateTime, with MaxDateTime being the latest CreatedDate already loaded into the destination (or use ModifiedDate, assuming you have that column and want to include updates to previous data). If your CreatedDate doesn't change when previous data is modified, then you would need to add a lookup and conditional split to go through the data.
Then you would create a SQL Agent job to run the incremental ETL daily.
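The incremental-load idea can be sketched as follows, using Python's sqlite3 as a stand-in for SQL Server and SSIS. The destination table name TableB, the Id column and the sample rows are assumptions for illustration. A strict > comparison can miss rows that arrive later with the same CreatedDate as the current maximum, so treat this as a starting point only.

```python
import sqlite3

def incremental_load(conn):
    """Copy rows from TableA that are newer than anything already in TableB."""
    (max_loaded,) = conn.execute(
        "SELECT COALESCE(MAX(CreatedDate), '') FROM TableB"
    ).fetchone()
    cur = conn.execute(
        "INSERT INTO TableB (Id, CreatedDate) "
        "SELECT Id, CreatedDate FROM TableA WHERE CreatedDate > ?",
        (max_loaded,),
    )
    conn.commit()
    return cur.rowcount  # number of rows appended in this run

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (Id INTEGER PRIMARY KEY, CreatedDate TEXT);
    CREATE TABLE TableB (Id INTEGER PRIMARY KEY, CreatedDate TEXT);
    INSERT INTO TableA VALUES (1, '2024-01-01 08:00:00'),
                              (2, '2024-01-02 09:30:00');
""")
first = incremental_load(conn)    # initial run copies both rows
conn.execute("INSERT INTO TableA VALUES (3, '2024-01-03 10:00:00')")
second = incremental_load(conn)   # next run copies only the new row
```

Using the maximum date already loaded, rather than getdate()-1, means a skipped or late run does not leave a gap.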

SQL server data backup/warehouse

I've been asked to take snapshots of certain tables from the database, so that in the future we can have a clear view of the situation for any given day in the past. Let's say that one of these tables looks like this:
GKEY  Time_in           Time_out          Category  Commodity
1001  2014-05-01 10:50  NULL              EXPORT    Apples
1002  2014-05-02 11:23  2014-05-20 12:05  IMPORT    Bananas
1003  2014-05-05 11:23  NULL              STORAGE   NULL
The simplest way to take a snapshot would be to create a copy of the table with an extra column SNAPSHOT_TAKEN (Datetime) and populate it with an INSERT statement:
INSERT INTO UNITS_snapshot (SNAPSHOT_TAKEN, GKEY,Time_in, Time_out, Category, Commodity)
SELECT getdate() as SNAPSHOT_TAKEN, * FROM UNITS
OK, it works fine, but it would make the destination table quite big pretty soon, especially if I'd like to run this query often. A better solution would be to check for changes between the current live table and the latest snapshot and write only those down, omitting everything that hasn't changed.
Is there a simple way to write such a query?
EDIT: Possible solution for the "Forward delta" (assuming no deletes from the original table)
INSERT INTO UNITS_snapshot
SELECT getdate() AS SNAP_DATE,
       r.*, -- all columns from the original table
       CASE WHEN b.gkey IS NULL THEN 'I' ELSE 'U' END AS change_type
FROM UNITS r
LEFT OUTER JOIN UNITS_snapshot b ON r.gkey = b.gkey
WHERE (r.time_in <> b.time_in OR r.time_out <> b.time_out OR r.category <> b.category OR r.commodity <> b.commodity OR b.gkey IS NULL)
  AND (b.snap_date =
       -- latest stored version of this particular row
       (SELECT max(s.snap_date) FROM UNITS_snapshot s WHERE s.gkey = r.gkey)
       OR b.snap_date IS NULL)
Assumptions: no row is deleted from the original table. Also, every field compared in the WHERE clause should probably be wrapped in COALESCE(xxx, '') to avoid comparing NULL values with set ones.

Both Dan Bracuk and ITroubs have made very good comments.
Solution 1 - Daily snapshot
The first solution you proposed is very simple. You can build the snapshot with a simple query, and you can also query it and rebuild any day's snapshot just as simply, by filtering on the SNAPSHOT_TAKEN column.
If you have just a few thousand records, I'd go with this one, without worrying too much about its growing size.
Solution 2 - Daily snapshot with rolling history
This is basically the same as solution 1, but you keep only some of the snapshots over time... to avoid having the snapshot DB grow indefinitely.
The simplest approach is just to save the snapshots of the last N days... maybe a month or two of data. A more sophisticated approach is to keep snapshots with a density that depends on age... so, for example, you could have every day of the last month, plus every Sunday of the last 3 months, plus every end-of-month of the last year, etc...
This solution requires you to develop a procedure to handle deletion of the snapshots that are no longer required. It's not as simple as using getdate() within a query. But you obtain a good balance between space and historic information. You just need to work out a snapshot retention strategy that suits your needs.
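As a rough illustration of an age-dependent retention rule like the one in the example above, here is a small Python sketch. The function name keep_snapshot and the 30/90/365-day thresholds (standing in for "month", "3 months" and "year") are assumptions; a cleanup procedure would delete every snapshot for which it returns False.

```python
from datetime import date, timedelta

def keep_snapshot(snapshot_date, today):
    """Age-dependent retention: denser for recent snapshots."""
    age = (today - snapshot_date).days
    if age <= 30:
        return True   # every day of the last month
    if age <= 90 and snapshot_date.weekday() == 6:
        return True   # Sundays of the last 3 months
    is_month_end = (snapshot_date + timedelta(days=1)).day == 1
    if age <= 365 and is_month_end:
        return True   # month ends of the last year
    return False

today = date(2024, 6, 30)
recent = keep_snapshot(date(2024, 6, 15), today)     # within the last month
sunday = keep_snapshot(date(2024, 5, 5), today)      # a Sunday, 56 days old
weekday = keep_snapshot(date(2024, 5, 6), today)     # a Monday, 55 days old
month_end = keep_snapshot(date(2024, 1, 31), today)  # month end, 151 days old
```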
Solution 3 - Forward row delta
Building any type of delta is a much more complex procedure.
A forward delta is built by storing the initial snapshot (as if all rows had been inserted on that day) and then, on the following snapshots, just storing information about the difference between snapshot(N) and snapshot(N-1). This is done by analyzing each row and just storing the data if the row is new or updated or deleted. If the main table does not change much over time, you can save quite a lot of space, as no info is stored for unchanged rows.
Obviously, to handle deltas, you now need 2 extra columns, not just one:
delta id (your snapshot_taken is good, if you only want 1 delta per day)
row change type (could be D=deleted, I=inserted, U=updated... or something similar)
The main complexity derives from the necessity to identify rows (usually by primary key) so as to calculate if between 2 snapshots any individual row has been inserted, updated, deleted... or none of the above.
The other complexity comes from reading the snapshot DB and building the latest (or any other) snapshot. This is necessary because, having only row differences in the table, you cannot simply select a day's snapshot by filtering on snapshot_taken.
This is not easy in SQL. For each row you must take into account just the final version... the one with MAX snapshot_taken that is <= the date of the snapshot you want to build. If it is an insert or update, then keep the data for that row, else (if it is a delete) then ignore it.
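The rebuild rule just described is easier to see outside SQL. A minimal Python sketch, assuming each delta row is a dict with the hypothetical fields snapshot_taken, change_type, gkey and data:

```python
from datetime import date

def rebuild_snapshot(deltas, as_of):
    """Rebuild the table as of a snapshot date from forward row deltas.

    deltas: iterable of dicts with "snapshot_taken", "change_type"
    ("I", "U" or "D"), "gkey" and "data" (row values, None for a delete).
    """
    latest = {}  # gkey -> most recent delta with snapshot_taken <= as_of
    for d in deltas:
        if d["snapshot_taken"] > as_of:
            continue
        seen = latest.get(d["gkey"])
        if seen is None or d["snapshot_taken"] > seen["snapshot_taken"]:
            latest[d["gkey"]] = d
    # Keep the final version of each row unless that final change is a delete.
    return {k: d["data"] for k, d in latest.items() if d["change_type"] != "D"}

# Hypothetical delta table contents.
deltas = [
    {"snapshot_taken": date(2014, 5, 1), "change_type": "I", "gkey": 1001,
     "data": {"commodity": "Apples"}},
    {"snapshot_taken": date(2014, 5, 1), "change_type": "I", "gkey": 1002,
     "data": {"commodity": "Bananas"}},
    {"snapshot_taken": date(2014, 5, 2), "change_type": "U", "gkey": 1001,
     "data": {"commodity": "Pears"}},
    {"snapshot_taken": date(2014, 5, 3), "change_type": "D", "gkey": 1002,
     "data": None},
]
day1 = rebuild_snapshot(deltas, date(2014, 5, 1))
day3 = rebuild_snapshot(deltas, date(2014, 5, 3))
```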
To build a delta of snapshot(N), you must first build the latest snapshot (N-1) from the snapshot DB. Then you must compare the two snapshots by primary key or row identity and calculate the change type (I/U/D) and insert the changes in the snapshot DB.
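The comparison step can be sketched like this, assuming both the rebuilt snapshot (N-1) and the live table are available as dicts keyed by primary key. The function name forward_delta and the tuple layout are hypothetical.

```python
def forward_delta(previous, current, snapshot_taken):
    """Compare snapshot N-1 with the live table and emit I/U/D rows.

    previous, current: dicts mapping primary key -> row values.
    Returns tuples of (snapshot_taken, change_type, key, values).
    """
    changes = []
    for key in sorted(current.keys() - previous.keys()):
        changes.append((snapshot_taken, "I", key, current[key]))
    for key in sorted(current.keys() & previous.keys()):
        if current[key] != previous[key]:
            changes.append((snapshot_taken, "U", key, current[key]))
    for key in sorted(previous.keys() - current.keys()):
        changes.append((snapshot_taken, "D", key, None))
    return changes

# Hypothetical data: 1001 changed, 1002 was removed, 1003 is new.
previous = {1001: {"category": "EXPORT"}, 1002: {"category": "IMPORT"}}
current = {1001: {"category": "STORAGE"}, 1003: {"category": "EXPORT"}}
changes = forward_delta(previous, current, "2014-05-02")
```

Unchanged rows produce no output at all, which is where the space saving comes from.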
Beware that you cannot delete old snapshot data without consolidating it first. That is because all snapshots are calculated from the oldest initial one and all the subsequent difference data. If you want to remove a year's worth of old snapshots, you'll have to consolidate the old initial snapshot and all of that year's variations into a new initial snapshot.
Solution 4 - Backward row delta
This is very similar to solution 3, but a bit more complex.
A backward delta is built by storing the latest snapshot in full and then, each time a new snapshot N is taken, storing only the information needed to go back from snapshot(N) to snapshot(N-1).
The advantage is that the latest snapshot is always readily available through a simple select on the snapshot DB. You only need to merge the difference data when you want to retrieve an older snapshot. Compare this to the forward delta, where you always need to rebuild the snapshot from the difference data unless you are actually interested in the very first snapshot.
Another advantage (compared to solution 3) is that you can remove older snapshots by just deleting the difference data older than a particular snapshot. You can do this easily because snapshots are calculated from the final snapshot and not from the initial one.
The disadvantage is in the obscure logic. Difference data is calculated backwards. Values must be stored on the (U)pdate and (D)elete variations, but are unnecessary on the I variations. Going backwards, rows must be ignored if the first variation you find is an (I)nsert. Doable, but a bit trickier.
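To make the backward logic a little less obscure, here is a minimal Python sketch under the same dict-per-snapshot assumption. The function names backward_delta and step_back are hypothetical, and only a single step back is shown.

```python
def backward_delta(previous, current):
    """Describe how to get from the new snapshot back to the previous one.

    Old values are stored for U and D variations, but not for I.
    """
    delta = {}
    for key in current.keys() - previous.keys():
        delta[key] = ("I", None)            # row did not exist before
    for key in previous.keys() - current.keys():
        delta[key] = ("D", previous[key])   # old values needed to restore it
    for key in previous.keys() & current.keys():
        if previous[key] != current[key]:
            delta[key] = ("U", previous[key])
    return delta

def step_back(snapshot, delta):
    """Apply one backward delta to a snapshot, yielding the older snapshot."""
    older = dict(snapshot)
    for key, (change_type, old_values) in delta.items():
        if change_type == "I":
            older.pop(key, None)            # inserted later, so absent before
        else:
            older[key] = old_values         # undo the update or the delete
    return older

# Hypothetical snapshots keyed by primary key.
previous = {1: "a", 2: "b"}
current = {1: "A", 3: "c"}
delta = backward_delta(previous, current)
restored = step_back(current, delta)
```

Going further back means applying the stored deltas one after another, newest first, starting from the latest full snapshot.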
Solution 5 - Forward and backward column delta
If the main table has many columns, or many long text or varchar columns, and only a bunch of these are updated, then it could make sense to store only column variations instead of row variations.
This is done by using a table with this structure:
delta id (your snapshot_taken is good, if you only want 1 delta per day)
change type (could be D=deleted, I=inserted, U=updated... or something similar)
column name
value
The difference can be calculated forward or backward, as per row deltas.
I've seen this done, but I really advise against it. There are just too many disadvantages and added complexity.
Value is a text or varchar, and there are typecasting issues to handle if you have numeric, boolean or date/time values... and, if you have a lot of these, it could very well be you won't be saving as much space as you think you are.
Rebuilding any snapshot is hell. Altogether... any operation on this type of table really requires a lot of knowledge of the main table's structure.

How would you maintain a history in SQL tables?

I am designing a database to store product information, and I want to store several months of historical (price) data for future reference. However, I would like to, after a set period, start overwriting the initial entries, with minimal effort to find them. Does anyone have a good idea of how to approach this problem? My initial design is to have a table named historical data and, every day, pull the active data and store it in the historical table with a time stamp. Does anyone have a better idea? Or can anyone see what is wrong with mine?

First, I'd like to comment on your proposed solution. The weak part, of course, is that there can be more than one change between your intervals. For example, if a record was changed three times during the day, you only archive the last change.
A better solution is possible, but it must be event-driven. If you have a database server that supports events or triggers (like MS SQL), you should write trigger code that creates an entry in the history table. If your server does not support triggers, you can add the archiving code to your application (during the Save operation).

You could place a trigger on your price table. That way you can archive the old price in another table at each update or delete event.
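A small sketch of the trigger approach, using SQLite through Python's sqlite3 so that it runs anywhere; trigger syntax differs on other servers such as MS SQL. The table and column names (price, price_history, old_price, archived_at) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE price (product_id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE price_history (
        product_id  INTEGER,
        old_price   REAL,
        archived_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Archive the previous price whenever a row is updated or deleted.
    CREATE TRIGGER price_update AFTER UPDATE OF price ON price
    BEGIN
        INSERT INTO price_history (product_id, old_price)
        VALUES (OLD.product_id, OLD.price);
    END;
    CREATE TRIGGER price_delete AFTER DELETE ON price
    BEGIN
        INSERT INTO price_history (product_id, old_price)
        VALUES (OLD.product_id, OLD.price);
    END;
""")
conn.execute("INSERT INTO price VALUES (1, 9.99)")
conn.execute("UPDATE price SET price = 10.49 WHERE product_id = 1")
conn.execute("UPDATE price SET price = 10.99 WHERE product_id = 1")
history = conn.execute(
    "SELECT product_id, old_price FROM price_history ORDER BY rowid"
).fetchall()
```

Both intermediate prices end up in price_history, including changes that happen several times within the same day.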

It's a much broader topic than it initially seems. Martin Fowler has a nice narrative about "things that change with time".

IMO your approach seems sound if your required history data is a snapshot of the end of the day's data - in the past I have used a similar approach with overnight jobs (SP's) that pick up the day's new data, timestamp it and then use a "delete all data that has a timestamp < today - x" where x is the time period of data I want to keep.
If you need to track all history changes, then you need to look at triggers.

I would like to, after a set period, start overwriting initial entries with minimal effort to find the initial entries
We store data in Archive tables, using a Trigger, as others have suggested. Our archive table has an additional column for AuditDate, and stores the "Deleted" data, i.e. the previous version of the data. The current data is only stored in the actual table.
We prune the Archive table with a business rule along the lines of "Delete all Archive data more than 3 months old where there exists at least one archive record younger than 3 months old; delete all archive data more than 6 months old"
So if there has been no price change in the last 3 months you would still have a price change record from the 3-6 months ago period.
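The pruning rule can be sketched in Python as follows. This reads "at least one archive record younger than 3 months" as applying per record key, which is an assumption, and it approximates 3 and 6 months as 90 and 180 days. The function name prune_archive and the tuple layout are hypothetical.

```python
from datetime import date, timedelta

def prune_archive(rows, today):
    """Apply the two pruning rules to archive rows.

    rows: list of (key, audit_date) tuples. Returns the rows to keep.
    """
    three_months = today - timedelta(days=90)
    six_months = today - timedelta(days=180)
    has_recent = {key for key, audit_date in rows if audit_date >= three_months}
    kept = []
    for key, audit_date in rows:
        if audit_date < six_months:
            continue  # rule 2: always drop rows older than 6 months
        if audit_date < three_months and key in has_recent:
            continue  # rule 1: a newer archive row exists for this key
        kept.append((key, audit_date))
    return kept

# Hypothetical archive contents.
rows = [
    ("A", date(2024, 3, 1)),   # 3-6 months old, A also has a recent row
    ("A", date(2024, 6, 1)),   # recent
    ("B", date(2024, 3, 1)),   # 3-6 months old, no recent row for B
    ("C", date(2023, 12, 1)),  # older than 6 months
]
kept = prune_archive(rows, today=date(2024, 7, 1))
```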
(Ask if you need an example of the self-referencing-join to do the delete, or the Trigger to store changes in the Archive table)
