How can I create a Primary Key in Oracle when composite key and a "generated always as identity" option won't work? - sql

I'm working on an SSIS project that pulls data form Excel and loads to Oracle Database every month. I plan to pull data from Excel file and load to Oracle stage table. I will be using a merge statement because the data that gets loaded each month is a rolling 12 month list and the data can change, so need to be able to INSERT when records don't match or UPDATE when they do. My control flow looks like this: Truncate Stage Table (to clear out table from last package run)---> DATA FLOW from Excel to Stage Table---> Merge to Target Table in Oracle.
My problem is that the data in the source Excel file doesn't have any unique columns to select a primary key or a composite key, as it is a possibility (although very unlikely) that a new record could have the exact same information. I am unable to utilize the "generated always as identity" because my SSIS package needs to truncate at the beginning of each job to clear out the Stage Table. This would generate the same ID numbers in the new load and create problems in the Target Table.
Any suggestions as to how I can get around this problem?

Welcome to SO and ETL. Instead of using a staging table, in SSIS use two sources: Excel file and existing production table. Sort both inputs and then perform a merge join on the unique identifier. From there, use a derived column transformation to add a new column called 'Action' which will mark a row as either an INSERT/UPDATE/DELETE based on whether the join key is NULL. So:
NULL from file means DELETE (not in file, in database)
NULL from database means INSERT (in file, not in database)
Not NULL for both means UPDATE (in file, in database)
From there, use a conditional split to split rows to either a OLE DB Destination (INSERT), or SQL Command (UPDATE or DELETE). You can now remove the stage environment and MERGE command from your process. This has the added benefit of removing the ETL load from the SQL Server, assuming SSIS is running on a separate server.
Note: The sort transformation has the option to remove duplicates.

Related

Using SSIS Package, How to validate the source records for duplicate before inserting?

SQL Server 2012: using a SSIS package, how to validate the source records for duplicate before inserting?
Our source file is a .csv. We are facing duplicate records loaded in the staging table.
At present , we are following manual process of loading data.
How to validate the source file data against the destination table before loading and load only the valid records? Possibility of loading duplicate records not only because of the source file having duplicate records in it but also reloading the same file to the staging table.
We are not Truncate the staging table. We are keeping records as is.
Second question : How to pick the name of the source file and pass it in the loading ? Possibly having a derived column as "FileName" which will get loaded along with raw data to the staging table.
The typical load pattern I use in this case is:
Prepare a staging table that matches the source file
In SSIS run a SQL Task with TRUNCATE StagingTable; (which clears it out)
Then, run a data flow task that loads the entire data file into the staging table
Lastly, merge the staging table into the final table.
I prefer to do this last step in a SQL Task also:
INSERT INTO FinalTable
(PrimaryKey,Column1,Column2,Column3)
SELECT
PrimaryKey,Column1,Column2,Column3
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM FinalTable TGT WHERE TGT.PrimaryKey=SRC.PrimaryKey
);
If you prefer a graphical UI, and you don't mind the extra network traffic, and slower processing time, you can do the same type of merge operation using lookups. You can even use the SCD component but I strongly discourage it's use.
Whether you do it in T-SQL or the UI, you need a key that can be used to uniquely identify the records (referred to as PrimaryKey in my example). If you don't have this key, there is no way to 'deduplicate'
Note in this example you have a 'real' staging table whose only purpose is to get the data file into the database. Then you have a final table that contains the final consistent result
Also note that this pattern only adds new rows - it will not update existing rows if they change in the data file.
Given your exact scenario (of loading the same file again), I would first check if the data is even loaded to the staging table. If you do that, you don't have to worry about checking the duplicates at record level.
How are you setting the connection to the file? Most of the data loads I have dealt with, I designed for-each-loop-container where the file name/path would be populated in a user variable. As you said, you could just use a derived column transform to add a new column which gets the value from a variable. If you don't have the file name in a user variable, you could use expression task in the control flow to populate it.
To cover your exact requirement, I would use the above step to populate the file name in the table. You could even normalize to a different table instead of storing long file name for every data record. Once you have all the file names in the database, you could just have an "Execute SQL" at the beginning to see if that file name is already in the database.
Two years back I have faced the same problem with importing TSV files.
I tried many other solutions but best I could design is C# code script for such validation at its best.
What I did as a solution
Create one C# DataTable object in memory with Primary Key constraints,
like:-
DataColumn[] keyColumn = new DataColumn[30];
keyColumn[intJ] = dtFilterdPK.Columns["Column name"];
Then try to add one by one row from your CSV to this DataTables.
Whenever your data will get Duplication based on Primary Key will have an error
Handle this error code in (TRY)..CATCH block and make this duplication error as per your logging requirement.
Avoid those error records importing in DataTable object.
Atlast import your CSV file into your table as BulkImport
Like:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(myConnection))
{
bulkCopy.DestinationTableName = "Your DB Table Name"; //Assign table name
bulkCopy.WriteToServer(dtToBeImport); //Write into Actual table.
}
Hope this will help you.

How can I copy and overwrite data of tables from database1 to database2 in SQL Server

I have a database1 which has more than 500 tables and I have database2 which also has the same number of tables and in both the databases the name of tables are same.. some of the tables have different table definitions, for example a table reports in database1 has 9 columns and the table reports in database2 has 10.
I want to copy all the data from database1 to database2 and it should overwrite the same data and append the columns if structure does not match. I have tried the import export wizard in SQL Server 2008 but it gives an error when it comes to the last step of copying rows. I don't have the screen shot of that error right now, it is my office PC. It says that error inserting into the readonly column xyz, some times it says that vs_isbroken, for the read only column error as I mentioned a enabled the identity insert but it did not help..
Please help me. It is an opportunity in my office for me.
SSIS and SQL Server 2008 Wizards can be finicky tools.
If you get a "can't insert into column ABC", then it could be one of the following:
Inserting into a PK column -> when setting up the mappings, you need to indicate to overwrite the value
Inserting into a column with a smaller range -> for example from nvarchar(256) into nvarchar(50)
Inserting into a calculated column (pointed out by #Nick.McDermaid)
You could also get issues with referential integrity if your database uses this (most do).
If you're going to do this more often, then I suggest you build an SSIS package instead of using the wizard tooling. This way you will see warnings on all sorts of issues like the ones I've described above. You can then run your package on demand.
Another suggestion I would make, is that you insert DB1 into "stage" tables in DB2. These tables should have no relational integrity and will allow you to break the process into several steps as follows.
Stage the data from DB1 into DB2
Produce reports/queries on issues pertinent to your database/rules
Merge the data from stage tables into target tables using SQL
That last step is where you can use merge statements, or simple insert/updates depending on a key match. Using SQL here in the local database is then able to use set theory to manage the overlap of the two sets and figure out what is new or to be updated.
SSIS "can" do this, but you will not be able to do a bulk update using SSIS, whereas with SQL you can. SSIS would do what is known as RBAR (row by agonizing row), something slow and to be avoided.
I suggest you inform your seniors that this will take a little longer to ensure it is reliable and the results reportable. Then work step by step, reporting on each stages completion.
Another two small suggestions:
Create _Archive tables of each of the stage tables and add a Tstamp column to each. Merge into these after the stage step which will allow you to quickly see when which rows were introduced into DB2
After stage and before the SQL merge step, create indexes on your stage tables. This will improve the merge performance
Drop those Indexes after each merge, this will increase the bulk insert Performance
Basic on Staging (response to question clarification):
Links:
http://www.codeproject.com/Articles/173918/How-to-Create-your-First-SQL-Server-Integration-Se
http://www.jasonstrate.com/tag/31daysssis/
http://blogs.msdn.com/b/andreasderuiter/archive/2012/12/05/designing-an-etl-process-with-ssis-two-approaches-to-extracting-and-transforming-data.aspx
Staging is the act of moving data from one place to another without any checks.
First you need to create the target tables, the schema should match the source tables.
Open up BIDS and create a new Project and in it a new SSIS package.
In the package, create a connection for the source server and another for the destination.
Then create a data flow step, in the step create a data source for each table you want to copy from.
Connect each source to a new data destination and set the appropriate connection and table.
When done, save and do a test run.
Before the data flow step, you might like to add a SQL step that will truncate all the target tables.
If you're open to using tools then what about using something like Red Gate Sql Compare and Red Gate SQL Data Compare?
First I would use data compare to manage the schema differences, add the new columns you want to your destination database (database2) from the source (database1). Then with data compare you match the contents of the tables any columns it can't match based on names you specify how to handle. Then you can pick and choose what data you want to copy from your destination. So you'll see what data is new and what's different (you can delete data in the destination that's not in the source or ignore it). You can either have the tool do the work or create you a script to run when you want.
There's a 15 day trial if you want to experiment.
Seems like maybe you are looking for Replication technology as is offered by SQL Server Replication.
Well, if i understood your requirement correctly, you need to make database2 a replica of database1. Why not take a full backup of database1 and restore it as database2? Your database2 will be exactly what database1 is at the time of backup.

Using SSIS to create new Database from two separate databases

I am new to SSIS.I got the task have according to the scenario as explained.
Scenario:
I have two databases A and B on different machines and have around 25 tables and 20 columns with relationships and dependencies. My task is to create a database C with selected no of tables and in each table I don't require all the columns but selected some. Conditions to be met are that the relationships should be intact and created automatically in new database.
What I have done:
I have created a package using the transfer SQL Server object task to transfer the tables and relationships.
then I have manually edited the columns that are not required
and then I transferred the data using the data source and destination
My question is: can I achieve all these things in one package? Also after I have transferred the data how can I schedule the package to just transfer the recently inserted rows in the database to the new database?
Please help me
thanks in advance
You can schedule the package by using a SQL Server Agent Job - one of the options for a job step is run SSIS package.
With regard to transferring new rows, I would either:
Track your current "position" in another table, assumes you have either an ascending key or a time stamp column - load the current position into an SSIS variable, use this variable in the WHERE statement of your data source queries.
Transfer all data across into "dump" copies of each table (no relationships/keys etc required just the same schema) & use a T-SQL MERGE statement to load new rows in, then truncate "dump" tables.
Hope this makes sense - its a bit difficult to get across in writing.

Importing Excel data using SSIS using unique ID's

I have one excel file that I want to import into two different tables, tblUni and tblUser.
I have a third table which contains the id's from the other two tables:
tblUni_Students
Id
UniId
StudentId
What I need is when I import the excel data into the first two tables, for each record, the newly created ids to be inserted into the Uni_Students table also.
Using SSIS, I have managed to import the data into two sql destinations but cannot seem to then take the new ids from these destinations to then insert into the lookup table.
Can anyone advise please. Thanks.
It's a bit difficult to answer without knowing the target database or the structure of the data but speaking generally this would be much better done by adding the data into a "load" table. i.e. one who's sole reason is to temporarily hold data while you process it, you would then update the tblStudent, tblUni and tblUni_Student tables from the load area using SQL statements either via Procedure or via an Execute SQL Task component.
You'd it as an oledbcommand component, where the command is to insert values into the table. Then in the same component you'd output the generated identity. Assign the generated identity to a new column in the output, and now you have all your data plus the generated identity in the dataflow.
This will be processed one row at a time, so it will be slow. Personally I'd put it in a staging table and do it as CiarĂ¡n described.

using ssis to move tabels and parts of tables from one database to another

I am currently using a database that is poorly designed and a slow pipeline so i decided to copy a small portion of the database (15 tables) and only bring over some of those tables for example i want to bring only the rows that have a certain id.
But this is not a one time move i need to add all the stuff that is added to the old database added to the new one on an hourly basis. My research has led me to SSIS and that it may have a way of accomplishing this but i have found no clear examples on how it is done if in fact it is possible. Thanks in advance.
Yes it is possible . You can schedule your ssis package through sql agent to run on hourly basis .
For a table ,you can drag a Data Flow Task on to the control flow .Inside DFT ,you need to place an oledb Source component ,Lookup ,Data conversion (if the types are different in source and target table) and Oledb destination .
oledb Source component : Create a variable of type string and in the expression write your sql query to fetch the data based on ID.Now use this variable in source component.
Lookup: You need to select your source table and combine the primary key column from the source and destination table.It acts similar to inner join query .After combining the primary key from the both the tables ,select the columns which you need from the source .
Oledb destination : Simply select your target table and map the columns from Lookup no matched output .If you need to update the values from the source then use Lookup matched output and connect it to an execute SQL task and write the update query .
Please go thru the link and SO
Scheduling of SSIS package