I am developing a new data warehouse, and the source tables for my employee dimension get truncated every day and reloaded with all history, including updates, deletes and new inserts.
The columns which track these changes are effective date and effective sequence. We also have an audit table which helps us determine which records are updated, inserted and deleted every day by comparing the table from today with the one from the previous day.
My question is: how can I do an incremental load on the table in my staging layer so that the surrogate key, which is an identity column, remains the same? If I truncate my final dimension, I get new surrogate keys each time, which messes up my fact table.
Truncating a dimension is never a good idea. You will lose the ability to keep track of the primary keys, which are referenced by the fact table.
If you must truncate the dimension every day, then you shouldn't have auto-increment keys. Instead, you should compare the previous state of the dimension with the new state, and look up the key values so that they can be kept.
Example: your dim has 2 entries, employee A and employee B, with keys 1 and 2 respectively. The next day, employee A is updated to AA and employee C is added. You should be able to compare this new data set with the old one, so that AA still has key 1, B is kept with key 2 and C is added with key 3. Of course, you can't rely on auto-increment keys and must set them from what was there previously.
Also, beware of deletes: just because an employee is deleted, that doesn't mean the facts pertaining to that employee also disappear. Don't delete the record from the fact table; instead, add a "deleted" flag and set it to Y for deleted records. In your reporting, filter out those deleted employees, so you report only on non-deleted ones.
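As a rough illustration of that A/B/C example, here is a minimal Python sketch. The function name `assign_keys` and the business ids `emp_a`, `emp_b`, `emp_c` are made up for the illustration; it assumes each employee carries a stable business id from the source system.

```python
def assign_keys(previous, incoming):
    """Carry surrogate keys over from yesterday's dimension to today's load.

    previous: {business_id: (surrogate_key, attributes)} from the old dimension
    incoming: {business_id: attributes} from today's full extract
    """
    # Start numbering new rows after the highest key ever handed out.
    next_key = max((key for key, _ in previous.values()), default=0) + 1
    result = {}
    for business_id, attributes in incoming.items():
        if business_id in previous:
            # Known employee: reuse the key the fact table already points at.
            key = previous[business_id][0]
        else:
            # New employee: hand out the next unused key.
            key = next_key
            next_key += 1
        result[business_id] = (key, attributes)
    return result


# Hypothetical data mirroring the example: A becomes AA, B is unchanged, C is new.
previous = {"emp_a": (1, "A"), "emp_b": (2, "B")}
incoming = {"emp_a": "AA", "emp_b": "B", "emp_c": "C"}
dimension = assign_keys(previous, incoming)
```

In a real load the same lookup would typically be a join between the staging table and the saved copy of the dimension on the business key.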
But, the best scenario is always to not truncate the table, and instead perform the necessary updates in the dimension, keeping the primary keys (which should be synthetic and not coming from the source system anyway) and any attributes that didn't change, marking as deleted those that were deleted from the source system, and updating the version numbers, validity dates, etc. accordingly.
Your problem seems to be very close to what Kimball describes as a Type II Slowly Changing Dimension and your ETL should be able to handle that.
Table truncation on the source wouldn't be a real issue as long as you have a business key that uniquely identifies each employee. If so, the best way to address your requirement is to handle your employee dimension as a type 2 SCD.
Typically, ETL software provides components to manage SCDs. Nevertheless, one way to handle an SCD is to define a hash based on the attributes you want to track. Then, if for a given business key you notice that the new hash calculated on the source differs from the hash you stored in your dimension, you will update all the attributes for that record.
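A minimal sketch of that hash comparison in Python, assuming a business key column called `employee_id` and tracked attributes `name` and `department` (all hypothetical names). The helpers `row_hash` and `classify` are invented for the illustration and are not part of any ETL tool.

```python
import hashlib


def row_hash(row, tracked):
    """Hash the tracked attributes of a row in a fixed column order."""
    # A separator avoids ("ab", "c") and ("a", "bc") hashing to the same value.
    joined = "\x1f".join(str(row[col]) for col in tracked)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def classify(dimension_hashes, source_rows, business_key, tracked):
    """Split source rows into new, changed and unchanged business keys."""
    new, changed, unchanged = [], [], []
    for row in source_rows:
        key = row[business_key]
        if key not in dimension_hashes:
            new.append(key)
        elif dimension_hashes[key] != row_hash(row, tracked):
            changed.append(key)
        else:
            unchanged.append(key)
    return new, changed, unchanged


tracked = ["name", "department"]
# Hashes as they would be stored alongside the dimension rows.
stored = {
    101: row_hash({"name": "Ann", "department": "Sales"}, tracked),
    102: row_hash({"name": "Bob", "department": "IT"}, tracked),
}
source = [
    {"employee_id": 101, "name": "Ann", "department": "Marketing"},
    {"employee_id": 102, "name": "Bob", "department": "IT"},
    {"employee_id": 103, "name": "Cy", "department": "HR"},
]
new, changed, unchanged = classify(stored, source, "employee_id", tracked)
```

Only the keys in `changed` need their dimension rows touched, and the keys in `new` get fresh surrogate keys.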
Hope this helps.
I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates", meaning that if the DWH table has 5 identical rows and the staging table has 6 identical rows, then we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding an extra "sub row" key to the dataset like this:
ROW_NUMBER() OVER (PARTITION BY "all data columns" ORDER BY SystemTime) AS data_row_nr
I've tried to find out if this is good practice or not, but without any luck. Something about it just seems wrong to me, and I can't see what unforeseen consequences can arise from doing it like this.
Can anybody tell me what the best way to go is when dealing with full loads of ledger data on a daily basis, for which we want to maintain some kind of history in the DWH?
No, I do not think it would be a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You will solve the technical problem, but I doubt there will be any business value.
First of all, you should distinguish: the tables you get with a primary key are dimensions, and you can recognise changes and build history for them.
But the tables without a PK are most probably fact tables (i.e. transaction records), which are typically not fully loaded but loaded based on some DELTA criterion.
In any case, you will never be able to recognise an update in those records; the only possible change is an insert (deletes are typically not relevant, as a data warehouse keeps a longer history than the source system).
So my to-do list:
Check whether the duplicates are intended or illegal.
Try to find a delta criterion to load the fact tables.
If everything fails, make a primary key of all columns plus a single attribute holding the number of duplicates, and build the history.
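For that last-resort option, the "compare duplicates" idea from the question boils down to comparing how often each identical row occurs on both sides. A minimal Python sketch, assuming rows are represented as tuples of all data columns (the function name `diff_duplicates` and the sample rows are hypothetical):

```python
from collections import Counter


def diff_duplicates(dwh_open_rows, staging_rows):
    """Compare occurrence counts of identical rows between DWH and staging.

    Returns (to_insert, to_close): for each distinct row, how many copies
    must be inserted into the DWH and how many open copies must be closed.
    """
    dwh_counts = Counter(dwh_open_rows)
    staging_counts = Counter(staging_rows)
    # Counter subtraction keeps only positive counts.
    to_insert = staging_counts - dwh_counts
    to_close = dwh_counts - staging_counts
    return dict(to_insert), dict(to_close)


# 5 identical open rows in the DWH versus 6 in staging, plus one row that vanished.
dwh = [("ACC1", 100)] * 5 + [("ACC2", 50)]
staging = [("ACC1", 100)] * 6
to_insert, to_close = diff_duplicates(dwh, staging)
```

Note that this cannot tell an update from a delete plus an insert, which is the limitation described above.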
In my design, I have many tables which use FKs. The issue is that, because certain records will be deleted and re-added at various points in time as they are linked to specific project files, the references will always be inaccurate if I rely on the traditional auto-incrementing ID (because each time they are re-added they will be given a new ID).
I previously asked a question (Sqlite - composite PK with two auto-incrementing values) as to whether I can create a composite auto-incrementing ID; however, it appears not to be possible, as answered by the question I was linked to.
The only automatic value I can think of that'll always be unique and never repeated is a full date value, down to the second - however the idea of using a date for the tables' IDs feels like bad design. So, if I instead place a full date field in every table and use these as the FK reference, am I looking at any potential issues down the line? And am I correct in thinking it would be more efficient to store it as integer rather than a text value?
Thanks for the help
Update
To clarify, I am not asking about Primary Keys. The PK will be a standard auto-incrementing ID. I am asking about basing hundreds of FKs on dates.
Thank you for the replies below. The difficulty I'm having is that I can't find a similar model to learn from. The end result I'd like is for the application to use project files (like Word has its docx files) to import data into the database. Once a new project is loaded, the previous project's records are cleared, but their data is preserved in the project file (the application's custom file format / a txt file) so they can be added once again. The FKs will all be project-based, so they will only reference records that exist in the database at the time.
For example, as it's a world-building application, let's say a user adds a subject type that would be relevant to any project (e.g. mathematics). Due to the form it's entered on in the application, the record is given a_type number of 1, meaning it's something that persists regardless of the project loaded. Another subject type, however, may be Demonology, which only applies to the specific project loaded (e.g. a fantasy world). A school_subject junction table needs to reference both of these in the same table as the FK. So let's say Demonology is the second record in the subject types table and has an auto-increment value of 2; the junction table therefore records 2 as its FK value.
The issue is that, before this project is re-opened, the user may have added 10 more subject types that are universal and persist, so the next time the project's subject type records and school_subject records are added back, Demonology is given the ID of 11. However, the school_subject junction table is re-created with the same record still having 2 as its value. This is why I'd like an FK which will always remain the same. I don't want all projects to be present in the database, because I want users to be able to back up and duplicate individual projects, as well as know that even if the application is deleted, they can re-download it and re-open their project files.
This is a bit long for a comment.
Something seems wrong with your design. When you delete a row in a table, there should be no foreign key references to that key. The entity is gone. Does not exist (as far as the database is concerned). Under most circumstances, you will get an error if you try to delete a row in one table where another row refers to that row using a foreign key reference.
When you insert a row into a table, the database becomes aware of that entity. Before that point, there should not be any references to it.
Hence, you have an unusual situation. It sounds like you have primary keys that represent something in the real world -- such as a social security number or vehicle identification number. If that is the case, you might want this id to be the primary key of the table.
Another option is soft deletion. Once one of these rows is inserted in the table, it cannot be deleted. However, you can set a flag that says that it is deleted. Then, foreign key references can stay to the "soft" deleted row.
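Since the question is about SQLite, here is a small sketch of soft deletion using Python's built-in `sqlite3` module. The table names come from the question; the column names (`is_deleted`, `school_id`, `subject_type_id`) are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE subject_type (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    name       TEXT NOT NULL,
    is_deleted INTEGER NOT NULL DEFAULT 0   -- soft-delete flag
);
CREATE TABLE school_subject (
    school_id       INTEGER NOT NULL,
    subject_type_id INTEGER NOT NULL REFERENCES subject_type(id)
);
INSERT INTO subject_type (name) VALUES ('Mathematics'), ('Demonology');
INSERT INTO school_subject VALUES (1, 2);
""")

# "Delete" Demonology by flagging it; the row and its id stay in place.
conn.execute("UPDATE subject_type SET is_deleted = 1 WHERE name = 'Demonology'")

# Normal queries hide flagged rows...
visible = [row[0] for row in conn.execute(
    "SELECT name FROM subject_type WHERE is_deleted = 0 ORDER BY id")]

# ...while the existing foreign key reference still resolves.
referenced = conn.execute("""
    SELECT st.name
    FROM school_subject AS ss
    JOIN subject_type AS st ON st.id = ss.subject_type_id
""").fetchone()[0]
```

Because the row is never physically removed, its auto-increment id never changes, so the junction table keeps pointing at the right record.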
I have a fact table with five dimension tables associated with it. Typically, the fact table contains the surrogate keys of each dimension and has no business/surrogate key of its own. I am trying to load the fact table with data from the staging fact table, i.e. insert new records. However, I notice the fact table may also need other operations, such as Update or Delete, on its data. A conditional split was used in the SSIS package for this purpose, to check whether all surrogate keys are 0 and, if so, make the new insert. My question is: can I use the surrogate keys for Update or Delete?
I made an insert on the fact table just to give an idea of what the data will look like.
The answer is yes, you can. BUT, will there be a situation where one employee sold the same product, from the same supplier, to the same customer, on the same day? Perhaps a different order on the same day? (this is based on the data you present in the question)
If all the surrogate keys together can uniquely identify a record, update fact records to your heart's content. But if that is not the case, you could end up updating records you do not intend to update.
I tend to include an order number in the fact tables I design to help avoid that situation, but you may not have that in your actual fact tables. Including the order number in the fact table is a pattern referred to as a degenerate dimension. I have found it to be pretty handy.
Anyway, the answer is the same. You can update fact records based on surrogate keys, as long as all of them together can uniquely identify the row(s) you want to update.
Don't throw caution to the wind; be sure your data warehouse is designed such that you can do this if you need to. Being able to do in-place updates of facts can be nice, compared with delete and replace, in that there could be fewer steps in the ETL process.
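One way to check that condition before relying on it is to count how often each combination of surrogate keys occurs. A minimal Python sketch; the function name `ambiguous_key_combinations` and the column names (`employee_key`, `product_key`, `customer_key`, `date_key`) are assumptions for the illustration, not taken from the actual tables:

```python
from collections import Counter


def ambiguous_key_combinations(fact_rows, key_columns):
    """Return surrogate-key combinations that match more than one fact row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in fact_rows)
    return {combo: n for combo, n in counts.items() if n > 1}


key_columns = ["employee_key", "product_key", "customer_key", "date_key"]
facts = [
    {"employee_key": 1, "product_key": 10, "customer_key": 7, "date_key": 20240101, "qty": 2},
    # Same employee, product, customer and day: a second order.
    {"employee_key": 1, "product_key": 10, "customer_key": 7, "date_key": 20240101, "qty": 5},
    {"employee_key": 2, "product_key": 10, "customer_key": 7, "date_key": 20240101, "qty": 1},
]
ambiguous = ambiguous_key_combinations(facts, key_columns)
```

If the result is non-empty, an update keyed on the surrogate keys alone would hit several rows at once, which is where a degenerate dimension such as an order number helps.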
I have designed place-related warehouse tables: DimPlaces, FactPlaces and DimGeography. It is a straightforward design. All the locations are in DimPlaces (Addrline1, Addrline2, placename, etc.) and the geography hierarchy is in DimGeography (City, State, Country, PostCode). FactPlaces is the table which has foreign keys to DimPlaces and DimGeography.
I would like to maintain historical data, as place names or their properties might change, and if the location of a place changes then its geographic hierarchy key changes too.
I have found this design pattern:
Another useful design pattern is to add the durable account key to the fact table in addition to the dimension’s surrogate key. This joins back to the current rows in the dimension to make it easier to report all of history by the current dimension attributes.
Could you please advise whether it is OK to follow this solution? If yes, do I need to use a key of type UNIQUEIDENTIFIER for a unique value?
Another question on this: I have employee data (DimEmployee and FactEmployee). Each employee is associated with the places where he works. How do I connect these employee tables with the places tables? Do I need to connect FactEmployee with FactPlaces?
I think in the first instance, they're referring to business keys? So if your dimension table has two rows, surrogate key 1 & 2, but they both refer to the same thing, so both have AccountId/ProductId/WhateverId of 1, then you will have some fact table rows with surrogate key 1 and business key 1, and later ones with surrogate key 2 and business key 1.
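To make the two join paths concrete, here is a small Python sketch with made-up account data. The `is_current` flag and the column names are assumptions; the quoted pattern only says that the durable key joins back to the current dimension rows.

```python
# Two versions of the same account: surrogate keys 1 and 2, durable key 1.
dim_account = [
    {"surrogate_key": 1, "account_id": 1, "name": "Acme Ltd", "is_current": False},
    {"surrogate_key": 2, "account_id": 1, "name": "Acme Corp", "is_current": True},
]
# Each fact row carries both the surrogate key and the durable key.
facts = [
    {"surrogate_key": 1, "account_id": 1, "amount": 100},
    {"surrogate_key": 2, "account_id": 1, "amount": 250},
]

by_surrogate = {row["surrogate_key"]: row for row in dim_account}
current_by_durable = {
    row["account_id"]: row for row in dim_account if row["is_current"]
}

# Join on the surrogate key: attributes as they were when the fact was recorded.
as_recorded = [(by_surrogate[f["surrogate_key"]]["name"], f["amount"]) for f in facts]

# Join on the durable key: all history reported under the current attributes.
as_current = [(current_by_durable[f["account_id"]]["name"], f["amount"]) for f in facts]
```

Here the durable key is a plain integer, which is enough to serve the purpose.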
Uniqueidentifiers are very wide; try to avoid using them on fact tables and for joins if possible.
For your last question: that's really more of a reporting thing. Do you need to do that? Is that what people need to see? Do they need to slice by that? You could consider a referenced dimension, where the places table links to the fact tables via a placeId on the employees dimension. Or you could have a factemployees table with start and stop dates. It depends on what you need to achieve.
I've inherited a (Microsoft?) SQL database that wasn't very pristine in its original state. There are still some very strange things in it that I'm trying to fix - one of them is inconsistent ID entries.
In the accounts table, each entry has a number called accountID, which is referenced in several other tables (notes, equipment, etc.). The problem is that the numbers (for some random reason) range from about -100000 to +2000000 when there are only about 7000 entries.
Is there any good way to re-number them while changing the corresponding numbers in the other tables? At my disposal I also have ColdFusion, so anything that works with SQL and/or ColdFusion I'll accept.
Surrogate keys are meant to be meaningless, so unless you actually had a database integrity issue (like no foreign key constraints being properly defined) or your identity was approaching the maximum for its datatype, I would leave them alone and go after some other low-hanging fruit that would have more impact.
In this instance, it sounds like "why" is a better question than "how". The OP notes that there is a strange problem that needs to be fixed but doesn't say why it is a problem. Is it causing problems? What positive impact would changing these numbers have? Unless you originally programmed the system and understand precisely why the numbers are in their current state, you are taking quite a risk making changes like this.
I would talk to an accountant (or at least your financial people) before messing in any way with the numbers in the accounts tables if this is a financial app. The table of accounts is very critical to how finances are reported. These IDs may have a meaning you don't understand. No one puts in a negative ID unless they had a reason. I would under no circumstances change that unless I understood why it was negative to begin with. You could truly screw up your tax reporting or some other thing by making an unneeded change.
You could probably disable the foreign key relationships (if you're able to take it offline temporarily) and then update the primary keys using a script. I've used this update script before to change values, and you could pretty easily wrap this code in a cursor to go through the key values in question, one by one, and update the arbitrary value to an incrementing value you're keeping track of.
Check out the script here: http://vyaskn.tripod.com/sql_server_search_and_replace.htm
If you just have a list of tables that use the primary key, you could set up a series of UPDATE statements that run inside your cursor, and then you wouldn't need to use this script (which can be a little slow).
It's worth asking, though, why these values appear out of whack. Does this database have values added and deleted constantly? Are the primary key values really arbitrary, or do they just appear to be while actually having meaning? Though I'm all for consolidating, you'd have to ensure that there's no purpose to those values.
With ColdFusion this shouldn't be a herculean task, but it will be messy and you'll have to be careful. One method you could use would be to script the database and then generate a brand new, blank table schema. Set the accountID as an identity field in the new database.
Then, using ColdFusion, write a query that will pull all of the old account data and insert the rows into the new database one by one. For each row, let the new database assign a new ID. After each insert, pull the new ID (using either @@IDENTITY or MAX(accountID)) and store the new ID and the old ID together in a temporary table so you know which old IDs belong to which new IDs.
Next, repeat the process with each of the child tables. For each old ID, pull its child entries and re-insert them into the new database using the new IDs. If the primary keys on the child tables are fine, you can insert them as-is or let the server assign new ones if they don't matter.
Assigning new IDs in place by disabling relationships temporarily may work, but you might also run into conflicts if one of the entries is assigned an ID that is already being used by the old data.
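The copy-and-remap approach described above can be sketched in a few lines of Python, with plain lists standing in for the tables. The function name `renumber` and the sample rows are hypothetical; only `accountID`, `notes` and `equipment` come from the question.

```python
def renumber(accounts, children, start=1):
    """Copy accounts and child rows into new lists with sequential accountIDs.

    accounts: list of dicts with an "accountID" field
    children: {table_name: list of dicts with an "accountID" field}
    Returns (new_accounts, new_children, id_map), where id_map is old -> new.
    """
    id_map = {}
    new_accounts = []
    ordered = sorted(accounts, key=lambda a: a["accountID"])
    for new_id, account in enumerate(ordered, start=start):
        id_map[account["accountID"]] = new_id  # remember old -> new
        new_accounts.append({**account, "accountID": new_id})
    # Rewrite every child row through the mapping; originals are left untouched.
    new_children = {
        table: [{**row, "accountID": id_map[row["accountID"]]} for row in rows]
        for table, rows in children.items()
    }
    return new_accounts, new_children, id_map


accounts = [
    {"accountID": 2000000, "name": "C"},
    {"accountID": -100000, "name": "A"},
    {"accountID": 57, "name": "B"},
]
children = {
    "notes": [{"accountID": 57, "text": "call back"}],
    "equipment": [
        {"accountID": -100000, "item": "router"},
        {"accountID": 2000000, "item": "modem"},
    ],
}
new_accounts, new_children, id_map = renumber(accounts, children)
```

Because the new rows are written to fresh copies, a new ID can never collide with an old ID that has not been processed yet, which is the risk with renumbering in place.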
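Create a new column in the accounts table for your new ID, and a new column in each of your related tables to reference the new ID column.
-- Add the new key column to accounts; IDENTITY fills it with sequential values.
-- Note: SQL Server allows only one IDENTITY column per table, so this assumes
-- accountID is not itself an identity column.
ALTER TABLE accounts
ADD new_accountID int IDENTITY;
-- Add a plain column to each referencing table to hold the new key.
ALTER TABLE notes
ADD new_accountID int;
ALTER TABLE equipment
ADD new_accountID int;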
Then you can map the new_accountID column on each of your referencing tables to the accounts table.
-- Copy the new key into each referencing table by matching on the old key.
UPDATE notes
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN notes ON (notes.accountID = accounts.accountID);
UPDATE equipment
SET new_accountID = accounts.new_accountID
FROM accounts
INNER JOIN equipment ON (equipment.accountID = accounts.accountID);
At this point, each table has both accountID with the old keys, and new_accountID with the new keys. From here it should be pretty straightforward.
Break all of the foreign keys on accountID.
On each table, UPDATE [table] SET accountID = new_accountID.
Re-add the foreign keys for accountID.
Drop new_accountID from all of the tables, as it's no longer needed.