Using SSIS Package, How to validate the source records for duplicate before inserting? - sql

SQL Server 2012: using a SSIS package, how to validate the source records for duplicate before inserting?
Our source file is a .csv. We are facing duplicate records loaded in the staging table.
At present , we are following manual process of loading data.
How to validate the source file data against the destination table before loading and load only the valid records? Possibility of loading duplicate records not only because of the source file having duplicate records in it but also reloading the same file to the staging table.
We are not Truncate the staging table. We are keeping records as is.
Second question : How to pick the name of the source file and pass it in the loading ? Possibly having a derived column as "FileName" which will get loaded along with raw data to the staging table.

The typical load pattern I use in this case is:
Prepare a staging table that matches the source file
In SSIS run a SQL Task with TRUNCATE StagingTable; (which clears it out)
Then, run a data flow task that loads the entire data file into the staging table
Lastly, merge the staging table into the final table.
I prefer to do this last step in a SQL Task also:
INSERT INTO FinalTable
(PrimaryKey,Column1,Column2,Column3)
SELECT
PrimaryKey,Column1,Column2,Column3
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM FinalTable TGT WHERE TGT.PrimaryKey=SRC.PrimaryKey
);
If you prefer a graphical UI, and you don't mind the extra network traffic, and slower processing time, you can do the same type of merge operation using lookups. You can even use the SCD component but I strongly discourage it's use.
Whether you do it in T-SQL or the UI, you need a key that can be used to uniquely identify the records (referred to as PrimaryKey in my example). If you don't have this key, there is no way to 'deduplicate'
Note in this example you have a 'real' staging table whose only purpose is to get the data file into the database. Then you have a final table that contains the final consistent result
Also note that this pattern only adds new rows - it will not update existing rows if they change in the data file.

Given your exact scenario (of loading the same file again), I would first check if the data is even loaded to the staging table. If you do that, you don't have to worry about checking the duplicates at record level.
How are you setting the connection to the file? Most of the data loads I have dealt with, I designed for-each-loop-container where the file name/path would be populated in a user variable. As you said, you could just use a derived column transform to add a new column which gets the value from a variable. If you don't have the file name in a user variable, you could use expression task in the control flow to populate it.
To cover your exact requirement, I would use the above step to populate the file name in the table. You could even normalize to a different table instead of storing long file name for every data record. Once you have all the file names in the database, you could just have an "Execute SQL" at the beginning to see if that file name is already in the database.

Two years back I have faced the same problem with importing TSV files.
I tried many other solutions but best I could design is C# code script for such validation at its best.
What I did as a solution
Create one C# DataTable object in memory with Primary Key constraints,
like:-
DataColumn[] keyColumn = new DataColumn[30];
keyColumn[intJ] = dtFilterdPK.Columns["Column name"];
Then try to add one by one row from your CSV to this DataTables.
Whenever your data will get Duplication based on Primary Key will have an error
Handle this error code in (TRY)..CATCH block and make this duplication error as per your logging requirement.
Avoid those error records importing in DataTable object.
Atlast import your CSV file into your table as BulkImport
Like:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(myConnection))
{
bulkCopy.DestinationTableName = "Your DB Table Name"; //Assign table name
bulkCopy.WriteToServer(dtToBeImport); //Write into Actual table.
}
Hope this will help you.

Related

Azure Data Factory Incremental Load data by using Copy Activity

I would like to load incremental data from data lake into on premise SQL, so that i created data flow do the necessary data transformation and cleaning the data.
after that i copied all the final data sink to staging data lake to stored CSV format.
I am facing two kind of issues here.
when ever i am trigger / debug to loading my dataset(data flow full activity ), the first time data loaded in CSV, if I load second time similar pipeline, the target data lake, the CSV file loaded empty data, which means, the column header loaded but i could not see the any value inside file.
coming to copy activity, which is connected to on premise SQL server, i am trying to load the data but if we trigger this pipeline again and again, the duplicate data loaded, i want to load only incremental or if updated data comes from data lake CSV file. how do we handle this.
Kindly suggest.
When we want to incrementally load our data to a database table, we need to use the Upsert option in copy data tool.
Upsert helps you to incrementally load the source data based on a key column (or columns). If the key column is already present in target table, it will update the rest of the column values, else it will insert the new key column with other values.
Look at following demonstration to understand how upsert works. I used azure SQL database as an example.
My initial table data:
create table player(id int, gname varchar(20), team varchar(10))
My source csv data (data I want to incrementally load):
I have taken an id which already exists in target table (id=1) and another which is new (id=4).
My copy data sink configuration:
Create/select dataset for the target table. Check the Upsert option as your write behavior and select a key column based on which upsert should happen.
Table after upsert using Copy data:
Now, after the upsert using copy data, the id=1 row should be updated and id=4 row should be inserted. The following is the final output achieved which is inline with expected output.
You can use the primary key in your target table (which is also present in your source csv) as the key column in Copy data sink configuration. Any other configuration (like source filter by last modified configuration) should not effect the process.

loading data from external stage - only truncate + load when theres new file

I'm loading data from a named external stage (S3) by using COPY INTO, and this S3 bucket keep all old files.
Here's what I want:
When a new file comes in, truncate the table and load the new file only, if there's no new file coming in, just keep the old data without truncation.
I understand that I can put option like FORCE = False to avoid loading old files again, but how do I only truncate the table when there's new file coming in?
I would likely do this a bit differently, since there isn't a way to truncate/delete records in the target table from the COPY command. This will be a multi-step process, but can be automated via Snowflake:
Create a transient table. For sake of description, I'll just call this STG_TABLE. You will also maintain your existing target table called TABLE.
Modify your COPY command to load to STG_TABLE.
Create a STREAM called STR_STG_TABLE over STG_TABLE.
Create a TASK called TSK_TABLE with the following statement
This statement will execute only if your COPY command actually loaded any new data.
CREATE OR REPLACE TASK TSG_TABLE
WAREHOUSE = warehouse_name
WHEN SYSTEM$STREAM_HAS_DATA('STR_STG_TABLE')
AS
INSERT OVERWRITE INTO TABLE (fields)
SELECT fields FROM STR_STG_TABLE;
The other benefit of using this method is that your transient table will have the full history of your files, which can be nice for debugging issues.

How can I create a Primary Key in Oracle when composite key and a "generated always as identity" option won't work?

I'm working on an SSIS project that pulls data form Excel and loads to Oracle Database every month. I plan to pull data from Excel file and load to Oracle stage table. I will be using a merge statement because the data that gets loaded each month is a rolling 12 month list and the data can change, so need to be able to INSERT when records don't match or UPDATE when they do. My control flow looks like this: Truncate Stage Table (to clear out table from last package run)---> DATA FLOW from Excel to Stage Table---> Merge to Target Table in Oracle.
My problem is that the data in the source Excel file doesn't have any unique columns to select a primary key or a composite key, as it is a possibility (although very unlikely) that a new record could have the exact same information. I am unable to utilize the "generated always as identity" because my SSIS package needs to truncate at the beginning of each job to clear out the Stage Table. This would generate the same ID numbers in the new load and create problems in the Target Table.
Any suggestions as to how I can get around this problem?
Welcome to SO and ETL. Instead of using a staging table, in SSIS use two sources: Excel file and existing production table. Sort both inputs and then perform a merge join on the unique identifier. From there, use a derived column transformation to add a new column called 'Action' which will mark a row as either an INSERT/UPDATE/DELETE based on whether the join key is NULL. So:
NULL from file means DELETE (not in file, in database)
NULL from database means INSERT (in file, not in database)
Not NULL for both means UPDATE (in file, in database)
From there, use a conditional split to split rows to either a OLE DB Destination (INSERT), or SQL Command (UPDATE or DELETE). You can now remove the stage environment and MERGE command from your process. This has the added benefit of removing the ETL load from the SQL Server, assuming SSIS is running on a separate server.
Note: The sort transformation has the option to remove duplicates.

Importing Excel data using SSIS using unique ID's

I have one excel file that I want to import into two different tables, tblUni and tblUser.
I have a third table which contains the id's from the other two tables:
tblUni_Students
Id
UniId
StudentId
What I need is when I import the excel data into the first two tables, for each record, the newly created ids to be inserted into the Uni_Students table also.
Using SSIS, I have managed to import the data into two sql destinations but cannot seem to then take the new ids from these destinations to then insert into the lookup table.
Can anyone advise please. Thanks.
It's a bit difficult to answer without knowing the target database or the structure of the data but speaking generally this would be much better done by adding the data into a "load" table. i.e. one who's sole reason is to temporarily hold data while you process it, you would then update the tblStudent, tblUni and tblUni_Student tables from the load area using SQL statements either via Procedure or via an Execute SQL Task component.
You'd it as an oledbcommand component, where the command is to insert values into the table. Then in the same component you'd output the generated identity. Assign the generated identity to a new column in the output, and now you have all your data plus the generated identity in the dataflow.
This will be processed one row at a time, so it will be slow. Personally I'd put it in a staging table and do it as CiarĂ¡n described.

Import SQL to SQL DB: How can I populate columns that exist in destination, not source?

I'm using SSIS to import data from one DB to another existing DB. Some columns in the destination tables do not exist in the source tables. Seems the Import & Export Wizard only allows me to select unmapped columns from the source and match them with these new columns in the destination. I'd like to be able to just provide one piece of data to import into all rows of these new columns.
Would like to use the GUI if possible because I'm not skilled at writing scripts. Thanks!
In SSIS, you can add a "derived column" component that will add columns to the buffer rows with the value you want (either a string or an expression).
I don't believe this is possible in the GUI. However, it would be a simple script after the data is loaded with SSIS:
UPDATE table SET newcolumn = new value
If you need to filter the rows, just add
WHERE column = value ...
You could change your source to a select query and list out the columns along with the static value you want to map.
SELECT SOURCECOLUMN_1,SOURCECOLUMN_2,....,SOURCECOLUMN_N,'VALUE' AS DESTINATIONCOLUMN FROM Source_Table
My original thought was that you could use the query right in the Import & Export wizard. you can obviously do alot more if you go in and edit the package, but it sounded like you didn't have much expereince with that. Here is how you would do this in the wizard.
After you have selected your source and destination databases you can Specify Table Copy or Query. Select the Write a query to specify the data to transfer option
On the next screen enter the query listing out all of the columns and add in your static columns.
On the Next screen You will need to select the Destination table or it will default to creating a new table named Query. You should be able to choose from the drop down. As long as you aliased your extra columns with the same names it should map correctly. You can go in and edit mappings here if needed.
You can then save off the SSIS package and it will source form the query.
Alternatively if you already have the SSIS pacakge created without the extra columns you can go in to the Data Flow and change the Data access mode in the OLE DB Source to be a SQL Command instead of a table or view. Add your query here.
You can then go into the properties of the OLE DB Desitination in the Dataflow and map the new column. You could also add in a derived column as #DominicGoulet by adding in a Dervied Column task and putting your static information here and then mapping. If you want to see that solution too let me know.