I want to load a fact table, but one of my dimensions is not in the stream, and I want to store that dimension's key in the fact table when my transformation runs.
The issue is that the dimension key is not present in my stream, so how can I compare my key against the stream?
I am developing a new data warehouse. The source tables for my employee dimension get truncated every day and reloaded with the full history, including updates, deletes, and new inserts.
The columns that track these changes are effective date and effective sequence. We also have an audit table that helps us determine which records were updated, inserted, and deleted each day by comparing today's table with the previous day's.
My question is how to do an incremental load on the table in my staging layer so that the surrogate key, which is an identity column, remains the same. If I truncate my final dimension, I get new surrogate keys each time, and that messes up my fact table.
Truncating a dimension is never a good idea. You will lose the ability to keep track of the primary keys, which are referenced by the fact table.
If you must truncate the dimension every day, then you shouldn't have auto-increment keys. Instead, you should compare the previous state of the dimension with the new state, and look up the key values so that they can be kept.
Example: your dim has 2 entries, employee A and employee B, with keys 1 and 2 respectively. The next day, employee A is updated to AA and employee C is added. You should be able to compare this new data set with the old one, so that AA still has key 1, B keeps key 2, and C is added with key 3. Of course, you can't rely on auto-increment keys and must set them from what was there previously.
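To make the A/B/C example concrete, here is a minimal Python sketch of carrying keys over when the dimension is rebuilt. The function name carry_over_keys, the EMP_* business keys, and the dictionary layout are assumptions made up for illustration; in practice this would be a lookup step in your ETL tool or a SQL join against the previous dimension state.

```python
def carry_over_keys(previous, incoming):
    """Assign surrogate keys to a freshly loaded dimension snapshot.

    previous: dict mapping business key -> surrogate key from the last load.
    incoming: dict mapping business key -> attributes from today's load.
    Returns a dict mapping surrogate key -> row.
    """
    # New keys continue after the highest key ever handed out in `previous`.
    next_key = max(previous.values(), default=0) + 1
    result = {}
    for business_key, attributes in incoming.items():
        if business_key in previous:
            surrogate = previous[business_key]  # existing member keeps its key
        else:
            surrogate = next_key                # new member gets the next free key
            next_key += 1
        result[surrogate] = {"business_key": business_key, **attributes}
    return result


# Hypothetical data matching the example: A becomes AA, C is new.
previous = {"EMP_A": 1, "EMP_B": 2}
incoming = {
    "EMP_A": {"name": "AA"},
    "EMP_B": {"name": "B"},
    "EMP_C": {"name": "C"},
}
dimension = carry_over_keys(previous, incoming)
```

Note that the mapping from business key to surrogate key has to be saved before the truncate, otherwise there is nothing to compare against.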
Also, beware of deletes: just because an employee is deleted, that doesn't mean the facts pertaining to that employee also disappear. Don't delete the record from the fact table; instead, add a "deleted" flag and set it to Y for deleted records. In your reporting, filter out those deleted employees, so you report only on non-deleted ones.
But, the best scenario is always to not truncate the table, and instead perform the necessary updates in the dimension, keeping the primary keys (which should be synthetic and not coming from the source system anyway) and any attributes that didn't change, marking as deleted those that were deleted from the source system, and updating the version numbers, validity dates, etc. accordingly.
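As a rough illustration of the "don't truncate, update in place" approach, the sketch below syncs a dimension held as a list of rows against today's source extract. It assumes a single tracked attribute (name) and a Y/N deleted flag on the dimension row; sync_dimension and all column names are hypothetical, and version numbers and validity dates are left out to keep it short.

```python
def sync_dimension(dimension, source_rows):
    """Update a dimension in place instead of truncating it.

    dimension: list of dicts with keys sk, business_key, name, deleted.
    source_rows: dict mapping business key -> name from today's source load.
    """
    by_business_key = {row["business_key"]: row for row in dimension}
    next_sk = max((row["sk"] for row in dimension), default=0) + 1

    for business_key, name in source_rows.items():
        row = by_business_key.get(business_key)
        if row is None:
            # New member: assign the next surrogate key.
            dimension.append({"sk": next_sk, "business_key": business_key,
                              "name": name, "deleted": "N"})
            next_sk += 1
        else:
            # Existing member: keep the surrogate key, refresh attributes.
            row["name"] = name
            row["deleted"] = "N"

    # Members missing from the source are flagged, never physically removed.
    for row in dimension:
        if row["business_key"] not in source_rows:
            row["deleted"] = "Y"
    return dimension


dim = [
    {"sk": 1, "business_key": "EMP_A", "name": "A", "deleted": "N"},
    {"sk": 2, "business_key": "EMP_B", "name": "B", "deleted": "N"},
]
sync_dimension(dim, {"EMP_A": "AA", "EMP_C": "C"})

# Reporting would filter on the flag rather than rely on rows being gone.
active = [row["business_key"] for row in dim if row["deleted"] == "N"]
```

Facts that point at key 2 still resolve after employee B disappears from the source, which is the whole point of flagging instead of deleting.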
Your problem seems to be very close to what Kimball describes as a Type II Slowly Changing Dimension and your ETL should be able to handle that.
Table truncation on the source wouldn't represent a real issue as long as you have a business key to uniquely identify one employee. If so, the best way to address your requirement is to handle your employee dimension as a type 2 SCD.
Typically, ETL software provides components to manage SCDs. Nevertheless, another way to handle an SCD is to define a hash based on the attributes you want to track. Then, if for a given business key you notice that the new hash calculated on the source differs from the hash you stored in your dimension, you know the record has changed and can update its attributes (or, for a type 2 dimension, close the current row and insert a new version).
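The hash comparison could look something like the following Python sketch. It assumes SHA-256 over the tracked attributes joined with a "|" separator; row_hash, detect_changes, and the name/dept columns are made-up names for illustration, and a real implementation would also need to decide how to represent NULLs in the hashed string.

```python
import hashlib


def row_hash(attributes, tracked):
    """Hash only the attributes we want to track for changes."""
    payload = "|".join(str(attributes[column]) for column in tracked)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(dimension_hashes, source_rows, tracked):
    """Classify source rows by comparing their hash with the stored one.

    dimension_hashes: dict mapping business key -> hash stored in the dimension.
    source_rows: dict mapping business key -> attribute dict from the source.
    """
    result = {"new": [], "changed": [], "unchanged": []}
    for business_key, attributes in source_rows.items():
        stored = dimension_hashes.get(business_key)
        if stored is None:
            result["new"].append(business_key)
        elif stored != row_hash(attributes, tracked):
            result["changed"].append(business_key)
        else:
            result["unchanged"].append(business_key)
    return result


tracked = ["name", "dept"]
stored = {
    "EMP_A": row_hash({"name": "A", "dept": "Sales"}, tracked),
    "EMP_B": row_hash({"name": "B", "dept": "IT"}, tracked),
}
source = {
    "EMP_A": {"name": "AA", "dept": "Sales"},  # attribute changed
    "EMP_B": {"name": "B", "dept": "IT"},      # identical to stored row
    "EMP_C": {"name": "C", "dept": "HR"},      # not in the dimension yet
}
changes = detect_changes(stored, source, tracked)
```

Comparing one hash per row avoids comparing every tracked column individually, at the cost of storing the hash in the dimension.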
Hope this helps.
I have a fact table containing lines of different types. Based upon each type I have a different dimension table that I need to connect. Is there a solution to this problem without creating different fact tables for each type of line?
When connecting the fact to one dimension, I get an error that a key cannot be found in the dimension table. That is expected, because that value exists in another dimension table.
Here is a picture of what I intend to achieve:
I need to create a simple cube from a single table (view), without any star-schema style dimension and fact tables.
I have a large flat table (100+ columns). This table is a straight import from a CSV file, so I then create a view that includes an ID column.
As an example:
CREATE VIEW [dbo].[v_dw]
AS
SELECT
    NEWID() AS Id,
    x.[customer]
FROM dwdump AS x;
GO
In the SSAS designer, I create my DSV from the view; all the int columns end up as fact data and all the varchar columns end up in a single dimension.
When I try to process this cube, it throws a "duplicate record exists" error, so I set it to ignore that error. Then it throws:
The attribute key cannot be found when processing
The full error is...
Errors in the OLAP storage engine: The attribute key cannot be found when processing: Table: '[dbo].[v_dw]', Column: 'Id', Value: '{D0B94A2D-7024-4634-844F-64768ED4B203}'. The attribute is 'Id'. Errors in the OLAP storage engine: The record was skipped because the attribute key was not found.
I know that building a cube without proper fact/dimensions defined in the table is against best practices, but I need something simple and quick.
Can we not create a cube from a single table and use an arbitrary [Id] key column?
This can be the result of measures being processed before dimensions, so that the corresponding key is not yet found in the dimension. As you indicated in your comment, processing the dimensions doesn't pose any problems. Since this post is tagged with SSIS, I'm assuming that you're either using an Analysis Services Processing task or processing via commands such as XMLA. When you define how the cube is processed, set the dimensions to process before the fact table containing the measures.
First, run Process Update on the affected dimension. Once that is done, run Process Full on the affected measure group individually. I have faced this issue several times, and this fix has always worked.
In data modelling is it acceptable for a dimension to have a surrogate key to another dimension as an attribute or should this always be a business key?
I have an Item dimension, which has a Department Number as an attribute. I also have a Department dimension. Is it acceptable for the Item Dimension to hold the SK to the Department Dimension or just the business key?
Usually you would avoid having both the natural and the surrogate key as foreign keys in the table, because that is redundant and can lead to data inconsistency. Example: someone updates the natural key and forgets to also update the surrogate key.
In a data warehouse, which you tagged your question with, however, redundancy is not considered much of a problem. There is usually a transaction processing system with loads of inserts, updates and deletes, and then there is the data warehouse. The data warehouse gets all its data neatly arranged from the processing system, and there simply are no updates like the one mentioned above. If data is redundant, who cares? It simplifies data access. You can even store the employee-department join as a table with all the department data redundant. A data warehouse is all about easy and quick access to data, so as to make reporting easier. Redundant foreign keys are no problem in a data warehouse.
I currently have a source fact table that references all of its source dimensions. I have already used SSIS to take the source dimensions and load them into our destination dimensions. While doing this, I had a new PK created in each dimension and moved the original source PK into another column within the table.
The trouble I am encountering now is how to perform the lookup when I load the source fact table into the destination fact table, so that each source dimension primary key (now in a new column in the destination dimensions) is mapped to the correct destination dimension primary key. That destination primary key is what will be stored in the destination fact table.
Would I need to use SK lookups, or just a transformation lookup? Furthermore, for a novice user, what would be the easiest / quickest to learn?
Hopefully some of this makes sense!
Thanks in advance for any help or advice!
I didn't completely understand your scenario; it's quite confusing. Maybe if you give more specific examples it would be easier to help.
Nevertheless, the logical approach in this type of scenario is to always load the dimensions first; then, when loading the fact, use Lookup transformation components to get the correct foreign key value from the dimension tables.
Here is an official video on YouTube that teaches how to use this component.
When loading the data from the source "fact", you would be looking up the source PKs, which in your case would be the business keys, or whatever you call them. If you are using SCD type 2 dimensions, you would probably also want the Start/End dates in your lookup.
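For the type 2 case, a lookup that includes the Start/End dates might look like this Python sketch. It assumes half-open validity ranges (start_date inclusive, end_date exclusive) and an open-ended current row dated 9999-12-31; lookup_type2 and the column names are hypothetical, so adjust the comparison to however your dimension stores its dates.

```python
from datetime import date


def lookup_type2(dimension_rows, business_key, event_date):
    """Return the PK of the dimension version valid on event_date, or None."""
    for row in dimension_rows:
        if (row["business_key"] == business_key
                and row["start_date"] <= event_date < row["end_date"]):
            return row["pk"]
    return None


# Two versions of the same employee: one closed, one current.
dim_employee = [
    {"pk": 1, "business_key": "EMP_A",
     "start_date": date(2020, 1, 1), "end_date": date(2021, 1, 1)},
    {"pk": 4, "business_key": "EMP_A",
     "start_date": date(2021, 1, 1), "end_date": date(9999, 12, 31)},
]

old_version = lookup_type2(dim_employee, "EMP_A", date(2020, 6, 15))
new_version = lookup_type2(dim_employee, "EMP_A", date(2021, 3, 1))
missing = lookup_type2(dim_employee, "EMP_X", date(2021, 3, 1))
```

The fact row gets the PK of whichever version was valid on the date of the event, not simply the latest version.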
For non-type 2 dimensions, the easiest (and fastest) option would be to just do a regular lookup. Your source fact table has a DimA_id (which is the business key). Use SELECT PK, BK FROM DimA in your lookup task, join DimA_id to the dimension's BK, and pass the PK downstream. Use the PK when inserting into the destination fact table.
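Outside of SSIS, the same regular lookup can be sketched in a few lines of Python. The dictionary plays the role of the cached result of SELECT PK, BK FROM DimA; load_fact, the amount column, and the DimA_PK output column are assumptions for illustration, and rows without a match are collected separately, roughly like a no-match output.

```python
def load_fact(source_fact_rows, dim_a_rows):
    """Replace the business key on each fact row with the dimension PK."""
    # Cache the dimension once, keyed by business key.
    pk_by_bk = {row["BK"]: row["PK"] for row in dim_a_rows}

    loaded, unmatched = [], []
    for row in source_fact_rows:
        pk = pk_by_bk.get(row["DimA_id"])
        if pk is None:
            unmatched.append(row)  # business key missing from the dimension
        else:
            loaded.append({"DimA_PK": pk, "amount": row["amount"]})
    return loaded, unmatched


dim_a = [{"PK": 1, "BK": "A100"}, {"PK": 2, "BK": "B200"}]
source_fact = [
    {"DimA_id": "B200", "amount": 10},
    {"DimA_id": "A100", "amount": 25},
    {"DimA_id": "Z999", "amount": 5},  # not present in DimA
]
loaded, unmatched = load_fact(source_fact, dim_a)
```

Deciding what to do with the unmatched rows (fail the load, redirect them, or map them to an "unknown" member) is a design choice the answers above do not cover.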