SCD Type-2 Using INSERT strategy - sql

I am using the Data Vault 2.0 model, where we are not supposed to use UPDATE on a table.
Normally in an RDBMS, we implement SCD-2 using an UPDATE & INSERT strategy.
But in my case, I can only use an INSERT strategy.
My source is Kafka input, and it is loaded into Snowflake (in Avro format); I am flattening it and loading it into another table in relational (RDBMS) format.
My question is: can anyone please help me implement SCD Type-2 logic without using UPDATE? I have to do this with an INSERT-only strategy.
I also need to keep the history. For example, if I get rec-1 in interval-a and then get two updates for rec-1 in interval-b, I need to load all three records into my target, with the last record that came in marked as the latest one.

Your satellite key needs to consist of the hash key plus the load date timestamp. Each individual version then has a new primary key and can therefore simply be inserted into the satellite. By querying for the latest timestamp, you get the latest version.
You can find a sample script here: https://simplesqlbi.home.blog/2019/06/29/part-3-data-vault-for-beginners/
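For illustration, here is a minimal insert-only satellite sketch in Snowflake-flavoured SQL. The table and column names (sat_customer, customer_hk, load_dts, and so on) are made up for this example, and QUALIFY is Snowflake syntax; on other platforms, wrap the ROW_NUMBER in a subquery.

-- Hypothetical satellite keyed on hash key + load date timestamp:
-- every change is appended with a plain INSERT, never an UPDATE.
CREATE TABLE sat_customer (
    customer_hk    CHAR(32)     NOT NULL,  -- hash key from the hub
    load_dts       TIMESTAMP    NOT NULL,  -- load date timestamp
    record_source  VARCHAR(100) NOT NULL,
    customer_name  VARCHAR(200),
    customer_email VARCHAR(200),
    PRIMARY KEY (customer_hk, load_dts)
);

-- rec-1 arrives in interval-a, then two updates arrive in interval-b:
INSERT INTO sat_customer VALUES ('hash_of_rec_1', '2021-01-01 10:00:00', 'KAFKA', 'Alice', 'alice@old.example');
INSERT INTO sat_customer VALUES ('hash_of_rec_1', '2021-01-02 09:00:00', 'KAFKA', 'Alice', 'alice@new.example');
INSERT INTO sat_customer VALUES ('hash_of_rec_1', '2021-01-02 11:30:00', 'KAFKA', 'Alice B.', 'alice@new.example');

-- Current view: the row with the latest load_dts per hash key is the active version.
SELECT *
FROM sat_customer
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_hk ORDER BY load_dts DESC) = 1;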

Related

How do I modify Upsert key for a snowflake output in Upsolver

Is it possible to add an Upsert key to an existing output or change the Upsert key for an existing output without creating a new output?
This is all very hypothetical right now! Since this is new data and a lot of the schemas have been changing during design phases, I just want to get ahead of any problems we might have.
You won't be able to update the Upsert keys for an existing output in Upsolver. The best way to handle this is to create a new Snowflake table (or delete the existing data and keep the same table), duplicate your output, change your Upsert key, and replay.

Should I apply type 2 history to tables with duplicate keys?

I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates", meaning that if the DWH table has 5 identical rows and the staging table has 6 identical rows, then we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding an extra "sub row" key to the dataset like this:
row_number() over (partition by "all data columns" order by SystemTime) as data_row_nr
I've tried to find out if this is good practice or not, but without any luck. Something about it just seems wrong to me, and I can't see what unforeseen consequences can arise from doing it like this.
Can anybody tell me what the best way to go is when dealing with full loads of ledger data on a daily basis, for which we want to maintain some kind of history in the DWH?
No, I do not think it would be a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You will solve the technical problem, but I doubt there will be much business value.
First of all, you should distinguish: the tables you get with a primary key are dimensions, and for those you can recognise changes and build history.
But the tables without a PK are most probably fact tables (i.e. transaction records), which are typically not full-loaded but loaded based on some delta criterion.
In any case, you will never be able to recognise an update in those records; the only possible change is an insert (deletes are typically not relevant, as the data warehouse keeps a longer history than the source system).
So my to-do list:
Check whether the duplicates are intended or illegal.
Try to find a delta criterion to load the fact tables.
If everything else fails, make the primary key out of all columns plus a single attribute holding the duplicate number, and build the history (sketched below).
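As a rough sketch of that last option, where the names (stg_ledger and its columns) are made up for illustration: the combination of the data columns plus data_row_nr then serves as the artificial key used to open and close type-2 records.

-- Number otherwise-identical rows within each daily full load, so that
-- "5 copies in the DWH vs. 6 copies in staging" becomes a comparable key.
SELECT
    account_no,
    position_amount,
    currency,
    ROW_NUMBER() OVER (
        PARTITION BY account_no, position_amount, currency  -- "all data columns"
        ORDER BY SystemTime
    ) AS data_row_nr
FROM stg_ledger;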

Sql table Default Value

Hi everyone, I have a project I am working on that consists of keeping tables the same at 3 different locations.
I pull data that doesn't exist yet from each of these locations into a corporate table, and I then need to send the new data back down to the locations so they are all the same.
The table I am pulling from has an identity column.
My question is: in SQL, is there any way to make a table behave like an identity without actually making it an identity, e.g. by setting the default value to max(id) + 1? This is the only way I can figure out how to keep the data structure the same without going to replication.
The problem is that you're generating records in an IDENTITY field in multiple sources and are then unable to combine them without those records being assigned new IDENTITY values.
By using a GUID as your key field, each of the 3 databases can create records which will have a unique ID, and you'll be able to then combine them without issue. You can still have a UNIQUE constraint on the field, but the likelihood of generating the same GUID is astronomically small.
Most replication processes utilize this GUID approach at some level already, so it's a common solution to this problem.
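A minimal sketch, assuming SQL Server (since IDENTITY and replication are mentioned); the table and column names are made up for illustration.

-- Each location generates its own GUID keys, so rows can later be merged
-- into the corporate table without key collisions or re-numbering.
CREATE TABLE LocationOrders (
    OrderGuid UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_LocationOrders_OrderGuid DEFAULT NEWSEQUENTIALID(),
    OrderDate DATETIME       NOT NULL,
    Amount    DECIMAL(10, 2) NOT NULL,
    CONSTRAINT PK_LocationOrders PRIMARY KEY (OrderGuid)
);

NEWSEQUENTIALID() can only be used in a DEFAULT constraint, but it avoids the clustered-index fragmentation that purely random NEWID() values cause; either works for this purpose.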

Versioning in relational database

I have a problem introducing good versioning into my database design.
Let's take an easy example: a little rental service.
You have a table Person (P_ID, Name), a table Computer (C_ID, Type), and a table Rent (R_ID, P_ID, C_ID, FromDate, ToDate).
I want to be able to change, say, the user name, create a new version, and still have the old data at hand if I need it.
My goal is to have some kind of system on my website which makes it easy to version some records in a table.
More information:
I have business logic that demands that I can release a record for a version. I also have to be able to roll back to the old ones. The reason is that I want exports for different versions of the data.
Before jumping into the solution, it might be a good idea to ask what behaviour you are trying to achieve. Do you need versioning for some auditing purpose, do you need versioning so that users can roll back changes, do you need versioning for some business rule, or is there another reason?
Once you know this, the answer should pretty much jump out at you. E.g., if auditing is your purpose, you could add database triggers and store the old and new values in a separate [Audit] table.
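For example, a minimal trigger-based audit sketch in SQL Server syntax, using the Person table from the rental example; the audit table and trigger names are made up.

CREATE TABLE Person_Audit (
    Audit_ID  INT IDENTITY PRIMARY KEY,
    P_ID      INT NOT NULL,
    OldName   VARCHAR(100),
    NewName   VARCHAR(100),
    ChangedAt DATETIME NOT NULL DEFAULT GETDATE()
);
GO

CREATE TRIGGER trg_Person_Audit
ON Person
AFTER UPDATE
AS
BEGIN
    -- 'deleted' holds the old row values, 'inserted' the new ones.
    INSERT INTO Person_Audit (P_ID, OldName, NewName)
    SELECT d.P_ID, d.Name, i.Name
    FROM deleted d
    JOIN inserted i ON i.P_ID = d.P_ID;
END;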
You have made a statement (that you want versioning), but not asked a question (exactly what your problem is). Without a question, it's hard to provide an answer.
In general, you could provide versioning by:
Identifying what entity needs to be versioned. In this case it sounds like you may want to be versioning a "deal" or "rental agreement".
Add a PK column, version number column, and "originalID" column to the table at the top of the model for that entity.
To create a new version, copy the top-level record to a new PK, placing the original PK in the "originalID" column and incrementing the version number column. Copy the related tables, changing the FK in those tables to match the PK of the new record. Then allow the user to modify the records pertaining to the new-PK version of the record.
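A rough sketch of this copy-based approach, using the rental example; the column names OriginalID and VersionNr and the new PK value 43 are made up for illustration.

CREATE TABLE RentalAgreement (
    R_ID       INT PRIMARY KEY,
    OriginalID INT NULL,      -- PK of the first version; NULL for version 1
    VersionNr  INT NOT NULL,
    P_ID       INT NOT NULL,
    C_ID       INT NOT NULL,
    FromDate   DATE,
    ToDate     DATE
);

-- Create version 2 of agreement 42 as a copy under a new PK (43):
INSERT INTO RentalAgreement (R_ID, OriginalID, VersionNr, P_ID, C_ID, FromDate, ToDate)
SELECT 43, COALESCE(OriginalID, R_ID), VersionNr + 1, P_ID, C_ID, FromDate, ToDate
FROM RentalAgreement
WHERE R_ID = 42;
-- Related child rows would be copied the same way, with their FK changed to 43.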
You could use triggers:
http://weblogs.asp.net/jgalloway/archive/2008/01/27/adding-simple-trigger-based-auditing-to-your-sql-server-database.aspx
You could create an Archive table, populated via a stored procedure or trigger with a copy of all the fields of the affected row in the primary table after every update or insert. The archive table would have its own PK and timestamps for when changes were made.
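For instance, a sketch in SQL Server syntax using the Rent table from the question; the archive table and trigger names are made up.

CREATE TABLE Rent_Archive (
    Archive_ID INT IDENTITY PRIMARY KEY,  -- the archive's own PK
    R_ID       INT NOT NULL,
    P_ID       INT NOT NULL,
    C_ID       INT NOT NULL,
    FromDate   DATE,
    ToDate     DATE,
    ArchivedAt DATETIME NOT NULL DEFAULT GETDATE()
);
GO

CREATE TRIGGER trg_Rent_Archive
ON Rent
AFTER INSERT, UPDATE
AS
BEGIN
    -- Copy every affected row into the archive after each insert or update.
    INSERT INTO Rent_Archive (R_ID, P_ID, C_ID, FromDate, ToDate)
    SELECT R_ID, P_ID, C_ID, FromDate, ToDate
    FROM inserted;
END;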

FNhibernate, GeneratedBy.HiLo, hibernate_unique_key etc

I have started using the S#arp Architecture, which uses Fluent NHibernate and GeneratedBy.HiLo to generate primary keys (there is also a table hibernate_unique_key). Apparently this is recommended practice, and I would like to stick with it. Now to my problem: I have used NHibernate and hbm mappings quite a bit and usually used identity columns for my primary keys. That allowed me to seed the database using SQL. Can I do this with the aforementioned setup (hibernate_unique_key table etc.)? I need to do this because a SQL insert is much more efficient than using NHibernate + C# to seed the db with a million entities. Any feedback would be very much appreciated. Thanks.
Christian
Maybe it's a bit late, but the Identity generator will break the unit-of-work pattern.
If you perform a Save on your currentSession, it will immediately try to insert the entity into the DB and thus defeat the whole point of the UoW.
After many hours I found out why it was broken, and the reason was this Identity generator. I now use the HiLo generator.
The following links helped me through this:
Nice article about the behaviour of these generators
You should be able to seed the database using plain SQL and still use HiLo to generate the primary keys in NHibernate. What you have to do is set the NextHi value(s) in the HiLo table to values high enough that the next entity you save gets an id higher than the highest id used when seeding the database.
So, you should be able to do something like this:
run the schema export
seed the database using a custom sql script (you would have to supply your own id's in the script, since they are not generated by the database)
manually insert a big enough value into the hibernate_unique_key table, so that the next id generated by NHibernate is larger than the largest id inserted during seeding (see the sketch after this list)
use NHibernate as usual
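A sketch of steps 2 and 3, assuming NHibernate's default HiLo table and column names (hibernate_unique_key / next_hi) and a made-up Customer table.

-- Step 2: seed with explicit ids; they are not generated by the database.
INSERT INTO Customer (Id, Name) VALUES (1, 'Acme Ltd');
INSERT INTO Customer (Id, Name) VALUES (2, 'Globex Inc');
-- ... bulk-load the remaining rows ...

-- Step 3: bump the hi value so every id NHibernate generates afterwards lies
-- above the seeded range. Using max(Id) + 1 is deliberately conservative
-- (it skips some key space) but is safe regardless of the max_lo setting.
UPDATE hibernate_unique_key
SET next_hi = (SELECT MAX(Id) + 1 FROM Customer);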
There are a few different approaches to using HiLo with NHibernate (one shared next-hi for all entities, a next hi per entity, etc.) so you might have to do a little experimenting to find out what value(s) would be appropriate to write to the hibernate_unique_key table after the seeding, depending on your hilo strategy and what max_lo you are using etc.
As a side note, schema export does not seem to support multiple rows in the hibernate_unique_key table that well, so you might have to do some manual stuff to create all the rows in the table if you use a hilo row per entity.
You could also use Identity to generate the ids, but at the cost of worse performance with NHibernate. The reason for the performance loss is that NHibernate has to do an extra read for each insert to get the id that was generated by the database. With hilo NHibernate already knows the id that the entity will get, so there is no need for that extra read.
Another option could be to use GuidComb, which also allows NHibernate to generate the ids, and therefore removes the need to query the database to get the id after an insert. However, you then have to look at ugly guids instead of nice integers when developing. :)
I guess the problem is that the PK generation is controlled by NHibernate and not the DB, so an option would be to use instance.GeneratedBy.Identity(). Do you reckon that would be sensible?
I would really appreciate any comments.
Christian