SSIS delete duplicate data - sql

I have a problem making an SSIS package. Can anyone help?
Here is the case:
I have two tables, A and B, with the same structure, but they are stored on different servers.
I have already made an SSIS package to transfer data from A to B (about one million rows at a time, which takes between one and two minutes).
After the transfer I want to delete the rows in table A that have already been copied to B, and my SSIS package does that next. I use a Merge Join and a Conditional Split to identify the matching rows.
Then I use an OLE DB Command to delete table A's rows (the SqlCommand is simply "DELETE RE_FormTo WHERE ID = ?"). It works, but it is far too slow: deleting the duplicate data took about an hour. Does anyone know of a more efficient way of doing this?
SSIS Package Link

The execution is bound to be slow because of the poor SSIS package design.
Kindly refer to the document Best Practices of SSIS Design.
Let me explain the mistakes in your package.
1. You are using a blocking transformation (the Sort component). These transformations don't reuse the input buffer; they create a new buffer for their output, and they are generally slower than synchronous components such as Lookup and Derived Column, which reuse the input buffer.
As per MSDN
Do not sort within Integration Services unless it is absolutely necessary. In order to perform a sort, Integration Services allocates the memory space of the entire data set that needs to be transformed. If possible, presort the data before it goes into the pipeline. If you must sort data, try your best to sort only small data sets in the pipeline. Instead of using Integration Services for sorting, use an SQL statement with ORDER BY to sort large data sets in the database – mark the output as sorted by changing the Integration Services pipeline metadata on the data source.
2. Merge Join is a semi-blocking transformation, which also hampers performance, though much less than a fully blocking transformation.
There are two ways in which you can solve the issue:
Use a Lookup
Use an Execute SQL Task and write the MERGE SQL
DECLARE @T TABLE (ID INT);

-- Remove from the staged copy of A the rows that already exist in B,
-- capturing their IDs in the table variable
MERGE #TableA AS target
USING #TableB AS source
    ON target.ID = source.ID
WHEN MATCHED THEN
    DELETE
OUTPUT source.ID INTO @T;

-- One set-based delete against the real table instead of a row-by-row OLE DB Command
DELETE FROM TableA
WHERE ID IN (SELECT ID FROM @T);
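Since A and B sit on different servers, the MERGE above assumes the compared keys are available locally in #TableA and #TableB. One possible way to stage just the keys is sketched below; the linked server and database names (ServerB, DatabaseB) are placeholders, and a data flow writing into #TableB would work just as well.

-- Stage the keys of the rows that have already arrived on server B
SELECT ID
INTO #TableB
FROM [ServerB].[DatabaseB].dbo.TableB;

-- Stage the keys currently present on this server
SELECT ID
INTO #TableA
FROM dbo.TableA;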

After you join both tables, just insert a Sort component with "Remove rows with duplicate sort values" checked; it will remove the duplicates...
http://sqlblog.com/blogs/jamie_thomson/archive/2009/11/12/sort-transform-arbitration-ssis.aspx

Related

SSIS Incremental Load-15 mins

I have two tables: the source table comes from a linked server and the destination table is on the other server.
I want my data load to happen in the following manner:
Every day at night, a scheduled job does a full dump, i.e. it truncates the destination table and loads all the data from the source into the destination.
Every 15 minutes, an incremental load runs, because data is ingested into the source every second and I need to replicate it on the destination.
For the incremental load I currently have scripts stored in a stored procedure, but going forward we would like to implement this in SSIS.
The scripts run in the below manner:
I have an Inserted_Date column. I take the Max(Inserted_Date) from the destination, delete all destination rows with Inserted_Date greater than or equal to that value, and insert the corresponding rows from the source. This job runs every 15 minutes.
How to implement similar scenario in SSIS?
I have worked with SSIS using a Lookup and a Conditional Split on ID columns, but the tables I am working with have a lot of rows, so the Lookup takes a long time and is not the right solution for my scenario.
Is there any way to get the Max(Inserted_Date) logic into an SSIS solution as well? My end goal is to drop the script-based approach and replicate it in SSIS.
Here is the general Control Flow:
There's plenty to go on here, but you may need to learn how to set variables from an Execute SQL Task and so on.
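For illustration only, the Max(Inserted_Date) logic could live in a single Execute SQL Task along these lines; the table, column and linked-server names below (DestTable, SrcTable, LinkedSrv) are placeholders, not from the question:

DECLARE @MaxDate DATETIME;

-- Fall back to a very old date when the destination is empty
SELECT @MaxDate = ISNULL(MAX(Inserted_Date), '19000101')
FROM dbo.DestTable;

-- Remove rows that may have been only partially loaded last time
DELETE FROM dbo.DestTable
WHERE Inserted_Date >= @MaxDate;

-- Reload everything from that point onwards
INSERT INTO dbo.DestTable (Col1, Col2, Inserted_Date)
SELECT Col1, Col2, Inserted_Date
FROM [LinkedSrv].[SrcDb].dbo.SrcTable
WHERE Inserted_Date >= @MaxDate;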

SSIS Alternatives to one-by-one update from RecordSet

I'm looking for a way to speed up the following process: I have a SSIS package that loads data from Excel files on a weekly basis to SQL Server. There are 3 fields: Brand, Date, Value.
In the data flow, I check for existing combinations of Brand+Date; new combinations go to the table directly, while the existing ones go to a Recordset destination for updates:
The next step is to update the Value of the existing combinations:
As you can see, there are thousands of records to update, and it takes too long. The number of records tends to grow week by week. Please suggest a faster approach.
The fastest way is to do this inside a stored procedure using an ELT (Extract, Load, Transform) approach.
Push all the data from Excel as-is into a table (in theory this is called loading to a staging table). Since you do not seem to be concerned with data validation steps, this table can be a replica of the final destination table's columns.
The next step is to call a stored procedure using an Execute SQL Task. Inside this procedure you can put all your business logic. Since this step uses native data manipulation on SQL Server entities, it is the fastest alternative.
As a last step, delete all entries from the staging table.
You can use indexes on the staging table to make the stored procedure part even faster.
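As a rough sketch of what such a stored procedure might look like (all object names here are made up; the staging table mirrors the Brand, Date, Value columns):

CREATE PROCEDURE dbo.usp_MergeBrandValues
AS
BEGIN
    SET NOCOUNT ON;

    -- Update the Value of Brand+Date combinations that already exist
    UPDATE d
    SET d.Value = s.Value
    FROM dbo.BrandValues AS d
    INNER JOIN dbo.Staging_BrandValues AS s
        ON d.Brand = s.Brand AND d.[Date] = s.[Date];

    -- Insert the combinations that are new
    INSERT INTO dbo.BrandValues (Brand, [Date], Value)
    SELECT s.Brand, s.[Date], s.Value
    FROM dbo.Staging_BrandValues AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.BrandValues AS d
                      WHERE d.Brand = s.Brand AND d.[Date] = s.[Date]);

    -- Clear the staging table for the next weekly load
    TRUNCATE TABLE dbo.Staging_BrandValues;
END;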

Bulk Insert from table to table

I am implementing an A/B/View scenario, meaning that the View points to table A, while table B is updated, then a switch occurs and the view points to table B while table A is loaded.
The switch occurs daily. There are millions of rows to update and thousands of users looking at the view. I am on SQL Server 2012.
My questions are:
how do I insert data into a table from another table in the fastest possible way? (within a stored proc)
Is there any way to use BULK INSERT? Or, is using regular insert/select the fastest way to go?
You could do a SELECT ColA, ColB INTO DestTable_New FROM SrcTable. Once DestTable_New is loaded, recreate the indexes and constraints.
Then rename DestTable to DestTable_Old and rename DestTable_New to DestTable. Renaming is extremely quick. If something turns out to have gone wrong, you also have a backup of the previous table close by (DestTable_Old).
I did this scenario once where we had to have the system running 24/7 and needed to load tens of millions of rows each day.
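A minimal sketch of that load-and-rename pattern, assuming the live table is dbo.DestTable and the column list is purely illustrative:

-- Load into a brand-new table
SELECT ColA, ColB
INTO dbo.DestTable_New
FROM dbo.SrcTable;

-- Recreate indexes and constraints on DestTable_New here, then swap the names
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.DestTable', 'DestTable_Old';
    EXEC sp_rename 'dbo.DestTable_New', 'DestTable';
COMMIT TRANSACTION;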
I'd be inclined to use SSIS.
Make table A an OLE DB source and table B an OLE DB destination. With the fast-load option the inserts can be minimally logged, which reduces the load on the DB. The only way (I can think of) to do this using T-SQL is to change the recovery model for your entire database, which is far from ideal because it changes the logging behaviour for all transactions, not just the ones for your transfer.
Setting up SSIS Transfer
Create a new project and drag a dataflow task to your design surface
Double click on your dataflow task which will take you through to the Data Flow tab. Then drag and drop an OLE DB source from the "Data flow Sources" menu, and an OLE DB destination from the "Data flow Destinations" menu
Double click on the OLE DB source, set up the connection to your server, choose the table you want to load from and click OK. Drag the green arrow from the OLE DB source to the destination then double click on the destination. Set up your connection manager, destination table name and column mappings and you should be good to go.
OLE DB Source docs on MSDN
OLE DB Destination docs on MSDN
You could do the
SELECT fieldnames
INTO DestinationTable
FROM SourceTable
as a couple answers suggest, that should be as fast as it can get (depending on how many indexes you'd need to recreate, etc).
But I would suggest using synonyms in order to change the pointer from one table to another. They're very transparent and in my opinion, cleaner than updating the view, or renaming tables.
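A small sketch of the synonym approach, assuming the application queries a synonym (here called dbo.CurrentData, a made-up name) instead of the table directly:

-- One-time setup: point the synonym at the currently active table
CREATE SYNONYM dbo.CurrentData FOR dbo.TableA;

-- Daily switch after TableB has been loaded: repoint the synonym
BEGIN TRANSACTION;
    DROP SYNONYM dbo.CurrentData;
    CREATE SYNONYM dbo.CurrentData FOR dbo.TableB;
COMMIT TRANSACTION;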
I know the question is old, but I was hunting for an answer to the same question and didn't find anything really helpful. Yes the SSIS approach is a possibility, but the question wanted a stored proc.
To my delight I have discovered (almost) the solution that the original question wanted; you can do it with a CLR SP.
Select the data from TableA into a DataTable and then use the WriteToServer(DataTable dt) method of the SqlBulkCopy class with TableB as the DestinationTableName.
The only slight drawback is that the CLR procedure must use external access in order to use SqlBulkCopy, and does not work with context connection, so you need to fiddle a little bit with permissions and connection strings. But hey! nothing is ever perfect.
INSERT... SELECT... functions fairly similarly to BULK INSERT. You could use SSIS, like @GarethD says, but that might be overly complex if you're just copying rows from table1 to table2.
If you are copying serious quantities of data, keep an eye on the transaction log -- it can bloat up pretty fast when doing huge inserts. One work-around is to "chunkify" the data you are inserting by looping over an insert statement that processes, say, only 100,000 or 10,000 rows at a time (depending on how wide your rows are, i.e. how many MB per pass).
(Just curious, are you doing ALTER VIEW to reset the view? I did something similar once, though we had to have four tables and four views to support past/present/next/swap sets.)
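One way such a chunked copy might look, purely as a sketch (the table, key and column names are placeholders, and the batch size needs tuning to your row width):

DECLARE @BatchSize INT = 100000;

WHILE 1 = 1
BEGIN
    -- Copy the next batch of rows that have not been transferred yet
    INSERT INTO dbo.Table2 (ID, ColA, ColB)
    SELECT TOP (@BatchSize) a.ID, a.ColA, a.ColB
    FROM dbo.Table1 AS a
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Table2 AS b WHERE b.ID = a.ID);

    -- Stop once a batch comes back smaller than the batch size
    IF @@ROWCOUNT < @BatchSize
        BREAK;
END;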
You can simply do this:
select * into A from B Where [criteria]
This selects the data from B based on the criteria and inserts it into A; note that SELECT ... INTO creates table A, so it must not already exist. You can specify column names instead of * if you don't need every column.

SSIS storing logging variables in a derived column

I am developing SSIS packages that consist of 2 main steps:
Step 1: Grab all sorts of data from existing legacy systems and dump them into a series of staging tables in my database.
Step 2: Move the data from my staging tables into a more relational set of tables that I'm using specifically for my project.
In step 1 I'm just doing a bulk SELECT and a bulk INSERT; however, in step 2 I'm doing row-by-row inserts into my tables using OLEDB Command tasks so that I can log very specific row-level activity of everything that's happening. Here is my general layout for step 2 processes.
(screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_1.png)
You'll notice 3 OLEDB tasks: 1 for the actual INSERT, and 2 for success/fail INSERTs into our logging table.
The main thing I'm logging is source table/id and destination table/id for each row that passes through this flow. I'm storing this stuff in variables and adding them to the data flow using a Derived Column so that I can easily map them to the query parameters of the stored procedures.
(screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_3.png)
I've decided to store these logging values in variables instead of hard-coding the values in the SqlCommand field on the task, because I'm pretty sure you CAN'T put variable expressions in that field (i.e. exec storedproc @[User::VariableName], ..., ..., ...). So, this is the best solution I've found.
(screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_2.png)
Is this the best solution? Probably not.
Is it good performance wise to add 4 logging columns to a data flow that consists of 500,000 records? Probably not.
Can you think of a better way?
I really don't think calling an OLEDBCommand 500,000 times is going to be performant.
If you are already going to staging tables - load it all to a staging table and take it from there in T-SQL or even another dataflow (or to a raw file and then something else depending on your complete operation). A Bulk insert is going to be hugely more efficient.
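For instance, once the data is in a staging table, the row-level source/destination logging can often be captured set-based with an OUTPUT clause; a hypothetical sketch follows (DestID is assumed to be an identity column, and the log table could carry a default for the load timestamp):

-- Insert from staging into the destination and log the ID pairs in the same statement
INSERT INTO dbo.DestTable (SourceID, ColA, ColB)
OUTPUT inserted.DestID, inserted.SourceID
INTO dbo.RowLoadLog (DestinationID, SourceID)
SELECT s.SourceID, s.ColA, s.ColB
FROM dbo.StagingTable AS s;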
To add to Cade's answer: if you truly need the logging info on a row-by-row basis, your best bet is to leverage the OLE DB destination and use one or both of the following transformations to add columns to the data flow:
Derived Column Transformation
Audit Transformation
This should be your best bet and shouldn't add much overhead.

Table Variables in SSIS

In one SQL Task, can I create a table variable:
DECLARE @TableVar TABLE (...)
and then in another SQL Task or data source/destination select from or insert into that table variable?
The other option I have considered is using a Temp Table.
CREATE TABLE #TempTable (...)
I would prefer to use a table variable so that it remains in memory. But I can use a temp table if a table variable is not possible. Also, I cannot use the Recordset destination, as I need to perform straight SQL tasks on the data later on.
The use case this is trying to solve is essentially performing a transformation in place of BizTalk. There is a very large flat-file-to-flat-file transformation that BizTalk would have to perform, but the data volume would put an unacceptable load on the BizTalk server, so the idea is to offload it to SSIS. However, it is not a simple row-to-row transformation; there are different types of rows that have relations to each other. The first task in SSIS loads the rows into appropriate (temp) tables, then in the second data flow task a select is performed that produces the correct output format.
You could use some of the techniques in this post: http://consultingblogs.emc.com/jamiethomson/archive/2006/11/19/SSIS_3A00_-Using-temporary-tables.aspx
especially the ones about using RetainSameConnection=TRUE on the connection manager.
I would be interested to see more information about the use case that requires you to write data out to a temp table or table variable before further SSIS processing. Couldn't you take care of all the required SQL steps in your source query before you start processing the data flow with SSIS?
Table variables are not kept solely in memory and can be written to disk under memory pressure. I tend to use table variables for very small lookups. If you need to push a table into SQL Server for necessary and complex transformations, then use a 'permanent' temp table that is truncated within the SSIS package prior to insert. Simple, and it will get what you need done.
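A minimal sketch of that 'permanent' staging-table pattern (the table and columns below are invented for illustration):

-- One-time setup: an ordinary table used purely as a staging area
CREATE TABLE dbo.Staging_FlatFileRows
(
    RowType VARCHAR(10)   NOT NULL,
    RawData VARCHAR(4000) NULL
);

-- First Execute SQL Task in the package: clear out the previous run
TRUNCATE TABLE dbo.Staging_FlatFileRows;

-- The data flow then bulk-inserts into dbo.Staging_FlatFileRows,
-- and a later task selects from it to produce the output format.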
The SSIS package would presumably run inside a SQL Agent job. In that case, using a temp table won't do any harm; such jobs generally run after office hours, so it does not matter.