SSIS error on insertion - error when new primary keys appear in the dimension table

I am using SSIS to bring in a bunch of large files each month and insert their records into a single SQL Server table. My fact table holds the actual financial transactions that occur during the month. It looks something like:
FactTransactions
'Acct Number' 'Product Number' 'Total Value'
000001 1A 1000
000002 1A 2000
000001 3B 3000
I'd like to track this information against some manually generated information about accounts in a dimension table, where 'Acct Number' is the primary key:
DimAcct
'Acct Number' 'Acct Name' 'Acct Type'
000001 Sales Revenue
000002 Returns Revenue (Contra)
My process is:
1) Clear the transaction table
2) Reload all the transactions, including anything new or corrected
3) Do Analysis through Joins, etc
When I went to run the load for a new month, I received the following error in SSIS:
"The INSERT statement conflicted with the FOREIGN KEY constraint
"FK_GLTransaction_List_Master_DimAccount". The conflict occurred in
database "GLTransactions", table "dbo.DimAccount", column 'Acct_Number'.".
I am guessing this is because new accounts have been created and used in transactions, but I haven't manually updated my dimension table and had no warning about them. This is going to happen every month, because new accounts get added whenever new accounting items need to be tracked separately in their own accounts, and I only add the few new accounts to the dimension table manually afterwards. Is there a way to avoid this, or better, what should I do before/during the SSIS run to handle these new accounts and avoid the error?

The message you receive tells you that, in the data you are trying to load into the fact table, there is at least one record referencing a record that doesn't exist in the dimension table.
How do you load the dimension table? Do you load it only before running the fact load? If so, you should consider managing so-called "inferred dimension members", that is, dimension records that you do not know about before loading the fact table. This situation is also referred to as "early arriving facts".
Therefore, you should scan the facts you are trying to load, looking for dimension records that are not yet in your dimension table. Insert those records into the dimension table and flag them as inferred; only then load the facts.
Note that flagging these records as "inferred" lets you refine them at a later time.
Let's say you insert an inferred account for which you know only the account number (the business key) but none of the other information, such as the account description; you can fill those attributes in later.
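For example, you could run something like the following in an Execute SQL Task right before the fact load. This is only a sketch: the staging table name dbo.StagingTransactions, the IsInferred flag column and the 'Unknown' placeholder values are assumptions, not objects taken from your database.

-- Sketch: insert a placeholder dimension row for every account that appears
-- in the incoming transactions but is not yet in the dimension table.
INSERT INTO dbo.DimAccount (Acct_Number, Acct_Name, Acct_Type, IsInferred)
SELECT DISTINCT s.Acct_Number, 'Unknown', 'Unknown', 1
FROM dbo.StagingTransactions AS s
LEFT JOIN dbo.DimAccount AS d
       ON d.Acct_Number = s.Acct_Number
WHERE d.Acct_Number IS NULL;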
Notice that the SSIS Slowly Changing Dimension (SCD) component lets you define proper inferred member handling.
Hope this helps.

You mention that you are only inserting rows into the dimension table manually; that needs to change.
You should insert all new accounts as "New/Unknown" rows in the account dimension as the first step of the SSIS process.
Then have a report that lists these new accounts, and manually (or in some other way) update them to contain the correct data.
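The report can be as simple as a query over the placeholder rows; how you mark them (here by the literal 'New/Unknown' name) is an assumption, so adapt it to whatever marker you choose.

-- Sketch: list the placeholder accounts that still need real attributes.
SELECT Acct_Number, Acct_Name, Acct_Type
FROM dbo.DimAccount
WHERE Acct_Name = 'New/Unknown';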
You can read more about possible solutions at: https://www.matillion.com/blog/late-arriving-dimension/

Related

Should I apply type 2 history to tables with duplicate keys?

I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates": if the DWH table has 5 identical rows and the staging table has 6 identical rows, we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding an extra "sub row" key to the dataset, like this:
Row_number() over(partition by "all data columns" order by SystemTime) as data_row_nr
I've tried to find out whether this is good practice or not, but without any luck. Something about it just seems wrong to me, and I can't tell what unforeseen consequences might arise from doing it like this.
Can anybody tell me the best way to handle daily full loads of ledger data for which we want to maintain some kind of history in the DWH?
No, I do not think it is a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You would solve the technical problem, but I doubt there would be much business value in it.
First of all, you should distinguish: the tables you get with a primary key are dimensions, for which you can recognise changes and build history.
But the tables without a PK are most probably fact tables (i.e. transaction records), which are typically not fully loaded but loaded based on some delta criterion.
In any case you will never be able to recognise an update in those records; the only possible change is an insert (deletes are typically not relevant, as the data warehouse keeps a longer history than the source system).
So my to-do list:
Check whether the duplicates are intended or illegal.
Try to find a delta criterion for loading the fact tables.
If everything fails, make the primary key out of all the columns plus a single attribute holding the number of duplicates, and build the history on that.
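If you do end up at the last option, one way to implement it is to collapse the duplicates into one row per distinct combination and carry the duplicate count as an ordinary attribute, so the combination of all data columns becomes the key you build history on. A minimal BigQuery sketch; the table name staging.ledger_positions and its columns are made up, so substitute whatever "all data columns" means for your ledger extract.

-- Sketch only: one row per distinct combination of data columns,
-- with the number of duplicates stored alongside it.
SELECT
  account_id,
  position_date,
  amount,
  COUNT(*) AS duplicate_count
FROM staging.ledger_positions
GROUP BY account_id, position_date, amount;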

Employee dimension truncated every day in data warehouse

I am developing a new data warehouse, and the source table for my employee dimension gets truncated every day and reloaded with the full history: updates, deletes and new inserts.
The columns which track these changes are effective date & effective sequence. We also have an audit table which helps us determine which records were updated, inserted and deleted each day, by comparing today's table with the previous day's.
My question is how I can do an incremental load on the table in my staging layer so that the surrogate key, which is an identity column, remains the same. If I truncate my final dimension, I get new surrogate keys each time, which messes up my fact table.
Truncating a dimension is never a good idea. You will lose the ability to keep track of the primary keys, which are referenced by the fact table.
If you must truncate the dimension every day, then you shouldn't use auto-increment keys. Instead, you should compare the previous state of the dimension with the new state, and look up the key values so that they can be kept.
Example: your dim has 2 entries, employee A and employee B, with keys 1 and 2 respectively. The next day, employee A is updated to AA and employee C is added. You should be able to compare this new data set with the old one, so that AA still has key 1, B keeps key 2 and C is added with key 3. Of course you can't rely on auto-increment keys, and must set the keys from what was there previously.
Also, beware of deletes: just because an employee is deleted doesn't mean the facts pertaining to that employee also disappear. Don't delete the record from the fact table; instead add a "deleted" flag and set it to Y for deleted records. In your reporting, filter out those deleted employees, so you report only on the non-deleted ones.
But the best scenario is always not to truncate the table at all, and instead perform the necessary updates in the dimension: keep the primary keys (which should be synthetic and not come from the source system anyway) and any attributes that didn't change, mark as deleted those records that were deleted from the source system, and update the version numbers, validity dates, etc. accordingly.
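A simplified sketch of that approach in T-SQL: existing rows keep their surrogate keys because matched rows are updated in place rather than truncated and re-inserted. The staging table dbo.StagingEmployee, the column names and the IsDeleted flag are assumptions about your schema, and real type 2 handling (validity dates, version numbers) is left out for brevity.

-- Sketch: synchronize the dimension from staging without losing surrogate keys.
MERGE dbo.DimEmployee AS d
USING dbo.StagingEmployee AS s
    ON d.EmployeeNumber = s.EmployeeNumber          -- business key
WHEN MATCHED THEN
    UPDATE SET d.EmployeeName = s.EmployeeName,
               d.Department   = s.Department,
               d.IsDeleted    = 0
WHEN NOT MATCHED BY TARGET THEN
    INSERT (EmployeeNumber, EmployeeName, Department, IsDeleted)
    VALUES (s.EmployeeNumber, s.EmployeeName, s.Department, 0)
WHEN NOT MATCHED BY SOURCE THEN
    UPDATE SET d.IsDeleted = 1;                     -- soft delete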
Your problem seems to be very close to what Kimball describes as a Type II Slowly Changing Dimension and your ETL should be able to handle that.
Table truncation on the source isn't a real issue as long as you have a business key to uniquely identify an employee. If so, the best way to address your requirement is to handle your employee dimension as a type 2 SCD.
Typically, ETL software provides components to manage SCDs. Nevertheless, one way to handle an SCD is to define a hash based on the attributes you want to track. If, for a given business key, the new hash calculated on the source differs from the hash stored in your dimension, you update all the attributes for that record.
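For illustration, the comparison might look like the query below (T-SQL, assuming SQL Server 2012 or later for CONCAT and SHA2_256). The table names, the tracked columns and the AttributeHash column stored on the dimension are assumptions, not part of your schema.

-- Sketch: find business keys whose tracked attributes changed,
-- by comparing a hash of the source attributes with the stored hash.
SELECT s.EmployeeNumber
FROM dbo.StagingEmployee AS s
JOIN dbo.DimEmployee AS d
    ON d.EmployeeNumber = s.EmployeeNumber
WHERE HASHBYTES('SHA2_256',
        CONCAT(s.EmployeeName, '|', s.Department, '|', s.JobTitle))
      <> d.AttributeHash;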
Hope this helps.

3-level authorization structure

I am working on a banking application and I want to add a maker/checker/authorizer feature for every record in a table. I explain the details below.
Suppose I have one table called invmast. There are 3 users: a maker, a checker and an authorizer. When the maker creates a transaction in the database, the record is not yet live (meaning it is not available in the invmast table). Once the checker has checked the record and the authorizer has authorized it, the record goes live (meaning it is inserted into the invmast table). The same applies to updates and deletes. I would like a table structure to achieve this. Please advise.
I am using VB.NET and SQL Server 2008.
Reads like a homework assignment.....
Lots of ways to solve this; here's a common design pattern:
Have an invmast_draft table that is identical to invmast but has an additional status column. Apps need to be aware of this table, the status column and what its values mean. In your case it can have at least 3 values: draft, checked, authorized. The maker first creates a transaction in this table; once the maker is done, the row is committed with the value "draft" in the status column. The checker then knows there is a new row to check and does his job; when done, the row is updated with the status set to "checked". The authorizer does her thing, and when she updates the status to "authorized" you can copy or move the row to the final invmast table right away. Alternatively, you can have a process that wakes up periodically to copy/move batches of rows. It all depends on your business requirements. All kinds of optimizations can be performed here, but you get the general idea.
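A possible shape for that draft table, as a sketch only (SQL Server 2008 syntax): everything except the invmast_draft name and the status idea is an assumption about your schema, and the business columns of invmast are left as a placeholder comment.

-- Sketch: workflow companion table for invmast.
CREATE TABLE invmast_draft (
    draft_id      INT IDENTITY(1,1) PRIMARY KEY,
    -- ... the same business columns as invmast go here ...
    status        VARCHAR(20) NOT NULL,   -- 'draft', 'checked', 'authorized'
    created_by    VARCHAR(50) NOT NULL,   -- maker
    checked_by    VARCHAR(50) NULL,       -- checker
    authorized_by VARCHAR(50) NULL        -- authorizer
);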

How to Troubleshoot composite key violation in MS Access

My Access database (2007) has four tables: Customer, Supplier, Account, and AccountAgeing.
AccountAgeing has a composite key made up of the foreign keys of two of the other tables, plus a date, i.e.:
AsAtDate, SupplierID, AccountNumber
I am importing data from Excel via a temporary table, and my parent tables (Customer, Supplier, Account) are importing well.
However, importing AccountAgeing from my tempTable continually hits a key violation. Of 749 possible imports, 746 violate the key. A query to test was:
SELECT DISTINCT tempTable.[SupplierID], #31/7/14#, tempTable.[AccountNumber]
FROM tempTable;
This returned 749 records (all of them). If that is the case, how can I have a key violation?
The composite key fields are all indexed with 'Duplicates OK'. There is no data in the destination table.
I have date and [Account Number] indexed as these are the fields searches will be on.
Here is a sequence of some troubleshooting steps you can try.
Remove the primary key from your target table and populate it. If you can't populate the target table, your problem may not be the key itself, and may become apparent from the error messages you receive.
If the target table does populate, try adding your desired composite key to the already populated target table.
If you are unable to add the key, re-run your "select distinct" query on the populated target table.
If you don't select 749 distinct rows, visually inspect the table contents to see what's going on.
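One way to do that inspection without eyeballing every row is a grouping query over the two varying key columns (since the date is the same constant for every row, duplicates can only come from these two); this is just a suggested check, using the same tempTable column names as your query.

SELECT tempTable.[SupplierID], tempTable.[AccountNumber], Count(*) AS DupCount
FROM tempTable
GROUP BY tempTable.[SupplierID], tempTable.[AccountNumber]
HAVING Count(*) > 1;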
These steps should lead you to some insight. Just a guess, but it sounds as though you may have a data type mismatch somewhere. In cases like this, Access will sometimes convert data on the fly and insert it without giving an error, but in the process the nature of the data is changed, resulting in a key violation in the target table.
I'm curious to hear what you find. Please post a comment when you figure out what the problem is.
Hope it helps. Good luck with the troubleshooting.
Thank you Marty!! I attempted to populate manually, which errored because there was no matching record in the Customers table.
I discovered that I had incorrectly assigned AccountAgeing to be the parent of Customers, rather than of Accounts.
The business logic is that an AccountAgeing record will always have an Account, but an AccountAgeing record does not always mention Company Number (the primary key of the Customer table).
The fix was binding part of the Account Ageing composite key to the Accounts composite key.
I am unsure what will happen when I add an ATBRecord which has an account number but no Company number, but that is another question
Check the Indexed property in the table properties - make sure it is not set to 'Duplicates OK' on any of the composite key fields.

identity to be incremented only if record is inserted

SQL Server 2005: I have a column empid in the employee table with identity on. If an error occurs while inserting data into the table, the identity is still incremented. I want the identity to be incremented only if the record is actually inserted. For example, if I have generated emp ids 1 to 5 and an error occurs on the 6th insertion, the next successful insertion gets identity value 7; I want it to be 6.
Why do you want to do that?
The identity column should only be used as an 'internal administrative value' for the database, and it should have no 'business value', so why does it matter that there are gaps in the sequence?
If the identity is used correctly, the users of your software will never be confronted with the identity value of the column; you just use it to uniquely identify a record.
I don't think this can be done. If you want your identity numbers to be exactly sequential, you may have to generate them yourself rather than using the SQL Server identity feature.
Edit: even rolling back the failed transaction will not make the identity count go back down; this is by design, see this other question.
What valid business reason do you have for caring whether there are gaps? There is no reason for the database to care, and every reason to make sure that identity values are never reused for something else, as reuse can cause major problems with data integrity when looking up information based on old reports, etc. Suppose you have a report that shows last month's orders, and then you delete one of the records because the customer was duplicated and thus de-duped. Then you reuse the identity value of the removed duplicate for a new customer. Now someone looking at last month's report goes to look up customer 12345, and the data associated with that customer belongs to John Smith rather than Sally Jones. But the person doesn't know that, because she is using an aggregate, so now she has incorrect information that was totally avoidable. If she had looked up the deleted customer instead, the process could have redirected her to the correct customer left after the de-duplication.
When you need this specific behaviour you should use a stored procedure to generate the ID. That way a rollback really does take the number back. But keep in mind that the current behaviour is deliberate.
Transaction isolation and different read levels (dirty reads) will most likely get you into trouble if you don't use locking on the ID field in the master data table that holds the current or next ID value.
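A minimal sketch of that idea in SQL Server 2005 T-SQL; the one-row key table dbo.EmployeeKey, its NextEmpId column and the target table dbo.Employee are made-up names. Because the id comes from an ordinary column updated inside the same transaction as the insert, a rollback takes the number back, at the cost of serializing concurrent inserts on that key table.

BEGIN TRY
    BEGIN TRANSACTION;

    DECLARE @NewId int;

    -- The exclusive lock taken here is held until COMMIT/ROLLBACK,
    -- which serializes concurrent callers on the key table.
    UPDATE dbo.EmployeeKey
    SET    @NewId = NextEmpId = NextEmpId + 1;

    INSERT INTO dbo.Employee (EmpId, EmpName)
    VALUES (@NewId, 'John Doe');

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- A failed insert rolls back the increment as well,
    -- so the next attempt reuses the same number.
    ROLLBACK TRANSACTION;
END CATCH;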