I'm relatively new to SSIS and know that handling duplicates is an oft repeated question, so thank you in advance for reading through my wall of text and for any help with my complicated situation.
I have a small 18179 row table (we'll call it Destination) that needs to be updated with SSIS using a flat file. The 18179 row flat file I am testing contains only records that exist in Destination and have changed. Currently, I have a package that loads a staging table (we'll call it Stage) from the flat file, then moves to the Data Flow and Lookup.
This Data Flow reads Stage and uses the Lookup LKP_OrderID against Destination on the primary key OrderID to see whether each record exists.
If the OrderID does not exist in Destination, then it follows the New OrderID path and the record is inserted into Destination at DST_OLE_Dest.
Here is where I am having trouble: If the OrderID does exist in Destination, then it follows the Existing OrderID path. The CMD_Delete_Duplicates OLE DB Command executes:
DELETE d
FROM dbo.Destination d
INNER JOIN dbo.Stage s ON d.OrderID = s.OrderID
This should delete any records from Destination that exist in Stage. Then it should insert the updated version of those records from Stage at DST_OLE_Desti.
However, it seems to process the 18179 rows in 2 batches: in the first batch it processes 9972 rows.
Then, in the 2nd batch it processes the remaining 8207 rows. It displays that it inserted all 18179 rows to Destination, but I only end up with the last batch of 8207 rows in Destination.
I believe it deletes and inserts the 1st batch of 9972 rows, then runs the above delete from inner join SQL again for the 2nd batch of 8207 rows, inadvertently deleting the just-inserted 9972 rows and leaving me with the 8207.
I've found that raising DefaultBufferSize to its maximum of 104857600 bytes and increasing DefaultBufferMaxRows in the Data Flow, so that the package processes all 18179 rows in one buffer, correctly deletes and inserts all 18179 rows. But once my data no longer fits in a single 104857600-byte buffer, this will again be an issue. I can also use the OLE DB Command transformation to run
DELETE FROM dbo.Destination WHERE OrderID = ?
This should pass OrderID from Stage and delete from Destination where there is a match, but this is time intensive and takes ~10 minutes for this small table. Are there any other solutions out there for this problem? How would I go about doing an Update rather than an Insert and Delete if that is a better option?
Yeah, you've got logic issues in there. Your OLE DB Command is firing that delete statement for EVERY row that flows through it.
Instead, you'd want to have that step be a precedent (Execute SQL Task) to the Data Flow. That would clear out the existing data in the target table before you began loading it. Otherwise, you're going to back out the freshly loaded data, much as you've observed.
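To make the difference concrete, here is a minimal sketch of both orderings. It uses Python's built-in sqlite3 as a stand-in for SQL Server, models the Data Flow buffers as plain list slices, and the `load` helper and `Amount` column are hypothetical names made up for the illustration. It only mimics the behaviour described in the question (the join delete running again for a later batch); it is not how SSIS is implemented.

```python
import sqlite3

def load(rows, batch_size, delete_per_batch):
    """Load rows into Destination in batches and return the final row count."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Stage (OrderID INTEGER PRIMARY KEY, Amount INTEGER)")
    con.execute("CREATE TABLE Destination (OrderID INTEGER PRIMARY KEY, Amount INTEGER)")
    con.executemany("INSERT INTO Stage VALUES (?, ?)", rows)
    # Destination starts with stale versions of the same orders.
    con.executemany("INSERT INTO Destination VALUES (?, 0)", [(r[0],) for r in rows])

    # Same effect as the DELETE ... INNER JOIN in the question.
    delete_sql = "DELETE FROM Destination WHERE OrderID IN (SELECT OrderID FROM Stage)"

    if not delete_per_batch:
        # Fixed ordering: clear matching rows once, before the load starts.
        con.execute(delete_sql)

    for start in range(0, len(rows), batch_size):
        if delete_per_batch:
            # Broken ordering: the delete runs again and removes rows
            # that an earlier batch has just inserted.
            con.execute(delete_sql)
        con.executemany("INSERT INTO Destination VALUES (?, ?)",
                        rows[start:start + batch_size])

    count = con.execute("SELECT COUNT(*) FROM Destination").fetchone()[0]
    con.close()
    return count

rows = [(i, i * 10) for i in range(1, 11)]
broken = load(rows, 6, delete_per_batch=True)   # only the last batch survives
fixed = load(rows, 6, delete_per_batch=False)   # every row survives
```

With ten rows and a batch size of six, the broken ordering keeps only the four rows of the last batch, which mirrors the 8207-of-18179 result in the question.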
There are different approaches for handling this. If deletes work for you, then keep at it. Otherwise, people generally stage the updates to a secondary table, then use an Execute SQL Task as a successor to the Data Flow Task to perform a set-based update.
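A set-based update from the staging table might look roughly like the sketch below. It again runs on sqlite3 so it can be tried anywhere, and the `Status` column is a hypothetical placeholder for whatever columns actually change. In T-SQL the same idea is usually written as an UPDATE with a FROM clause joining Destination to Stage; the correlated subquery here is just the portable form.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Destination (OrderID INTEGER PRIMARY KEY, Status TEXT);
CREATE TABLE Stage (OrderID INTEGER PRIMARY KEY, Status TEXT);
INSERT INTO Destination VALUES (1, 'old'), (2, 'old'), (3, 'old');
INSERT INTO Stage VALUES (2, 'new'), (3, 'new');
""")

# One statement updates every staged order; no per-row command is needed.
con.execute("""
UPDATE Destination
SET Status = (SELECT s.Status FROM Stage s
              WHERE s.OrderID = Destination.OrderID)
WHERE OrderID IN (SELECT OrderID FROM Stage)
""")
con.commit()

result = con.execute(
    "SELECT OrderID, Status FROM Destination ORDER BY OrderID").fetchall()
```

Order 1 is not in Stage and keeps its old value; orders 2 and 3 pick up the staged values in a single pass.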
You could use the Slowly Changing Dimension tool from the SSIS toolbox to update the rows (as opposed to a delete and re-insert). You only have 'Type 1' changes by the sounds of it, so you won't need to use the Historical Attributes Inserts Output.
It would automatically take care of both streams in your illustration: inserts and updates.
I am using MonetDB (MDB) for OLAP queries. I am storing source data in PostgreSQL (PGSQL) and syncing it with MonetDB in batches written in Python.
In PGSQL there is a wide table with an ID (non-unique) and a few columns. Every few seconds a Python script takes a batch of 10k records that changed in PGSQL and uploads them to MDB.
The process of upload to MDB is as follows:
1. Create a staging table in MDB.
2. Use the COPY command to upload the 10k records into the staging table.
3. DELETE from the destination table all IDs that are in the staging table.
4. INSERT into the destination table all rows from the staging table.
So, it is basically a DELETE & INSERT. I cannot use MERGE statement, because I do not have a PK - one ID can have multiple values in the destination. So I need to do a delete and full insert for all IDs currently synced.
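The steps above can be sketched as follows. This is only an illustration of the delete-then-insert logic for a non-unique ID: it uses sqlite3 instead of MonetDB, `executemany` stands in for the COPY upload, and the `sync_batch` helper and `value` column are hypothetical names. It says nothing about MonetDB's timing behaviour.

```python
import sqlite3

def sync_batch(con, batch):
    """Replace every destination row whose id appears in batch.

    batch is a list of (id, value) tuples; one id may occur several times.
    """
    con.execute("CREATE TEMP TABLE IF NOT EXISTS staging (id INTEGER, value TEXT)")
    with con:  # commits on success, rolls back if any statement fails
        con.execute("DELETE FROM staging")
        con.executemany("INSERT INTO staging VALUES (?, ?)", batch)
        # Remove all old rows for the synced ids, then insert the new set.
        con.execute("DELETE FROM destination WHERE id IN (SELECT id FROM staging)")
        con.execute("INSERT INTO destination SELECT id, value FROM staging")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE destination (id INTEGER, value TEXT)")
con.executemany("INSERT INTO destination VALUES (?, ?)",
                [(1, "a"), (1, "b"), (2, "c"), (3, "d")])
con.commit()

sync_batch(con, [(1, "x"), (2, "y"), (2, "z")])
rows = con.execute("SELECT id, value FROM destination ORDER BY id, value").fetchall()
```

ID 1 goes from two rows to one and ID 2 from one row to two, which is the case a key-based MERGE cannot express without a primary key.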
Now to the problem: the DELETE is slow.
When I run a DELETE on the destination table, deleting 10k records from a table of 25M rows, it takes 500ms.
However! If I run simple SELECT * FROM destination WHERE id = 1 and THEN do a DELETE, it takes 2ms.
I think that it has something to do with automatic creation of auxiliary indices. But this is where my knowledge ends.
I tried to solve this problem of "pre-heating" by doing the lookup myself and it works - but only for the first DELETE after pre-heat.
Once I do the DELETE and INSERT, the next DELETE is slow again. And doing the pre-heating before each DELETE does not make sense, because the pre-heat itself takes 500ms.
Is there any way to sync data to MDB without breaking the auxiliary indices already built? Or to make the DELETE faster without the pre-heat? Or should I use a different technique to sync data into MDB without a PK (does MERGE have the same problem)?
Thanks!
I have a table set up in my sql server that keeps track of inventory items (in another database) that have changed. This table is fed by several different triggers. Every 15 minutes a scheduled task runs a batch file that executes a number of different queries that send updates on the items flagged in this table to update several ecommerce websites. The last query in the batch file resets the flags.
As you can imagine there is potential to lose changes if an item is flagged while this batch file is running. I have worked around this by replaying the last 25 hours of updates every 24 hours, just in case this scenario happened. It works, but IMO is kind of clumsy.
What I would like to do is delay any writes to this table until my script finishes, and resets the flags on all the rows that were flagged when the script started. Then allow all of these delayed writes to happen.
I've looked into doing this with table hints (TABLOCK) but this seems to be limited to one query--unless I'm misunderstanding what I have read, which is certainly possible. I have several that run in succession. TIA.
Alex
Could you turn your script into a stored procedure that extracts all the data into a temporary table, using a select statement that applies a lock to the production table? You could then drop your lock on the main table and do all your processing in the temporary table (or a permanent table built for the purpose), away from the live system. It will be a lot slower and put more load on your SQL box, but speed shouldn't be an issue if you have a point-in-time snapshot of the data.
If that option is not applicable then maybe you could play with wrapping the whole thing in a transaction and putting a table lock on your production table with the first select statement.
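As a rough sketch of the snapshot idea: copy the flagged IDs aside when the run starts, work only from that copy, and at the end reset only the rows in the copy. The example uses sqlite3, and the `ChangedItems`, `Flagged` and `FlaggedSnapshot` names are hypothetical. On SQL Server the snapshot step is where the lock mentioned above would be taken. It assumes an item is not changed again while it is already flagged; covering that case would need something like a change counter.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ChangedItems (ItemID INTEGER PRIMARY KEY, Flagged INTEGER NOT NULL);
INSERT INTO ChangedItems VALUES (1, 1), (2, 1), (3, 0);
""")

# Step 1: take a point-in-time copy of the flagged items.
con.execute("""
CREATE TEMP TABLE FlaggedSnapshot AS
SELECT ItemID FROM ChangedItems WHERE Flagged = 1
""")

# Step 2: the export queries read from the snapshot, not the live table.
exported = [r[0] for r in con.execute(
    "SELECT ItemID FROM FlaggedSnapshot ORDER BY ItemID")]

# Meanwhile a trigger flags another item while the batch is still running.
con.execute("UPDATE ChangedItems SET Flagged = 1 WHERE ItemID = 3")

# Step 3: reset only the rows that were in the snapshot.
con.execute("""
UPDATE ChangedItems SET Flagged = 0
WHERE ItemID IN (SELECT ItemID FROM FlaggedSnapshot)
""")
con.commit()

still_flagged = [r[0] for r in con.execute(
    "SELECT ItemID FROM ChangedItems WHERE Flagged = 1")]
```

Item 3 was flagged mid-run, so it stays flagged and is picked up by the next run instead of being lost.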
Good luck mate
I have an issue with my data flow task locking. The task compares a couple of tables on the same server, and the result is inserted into one of the tables being compared. The table being inserted into is compared using a NOT EXISTS clause.
When performing a fast load, the task freezes without errors. When doing a regular insert, the task gives a deadlock error.
I have 2 other tasks that perform the same action on the same table and they work fine, but the amount of information being inserted is a lot smaller. I am not running these tasks in parallel.
I am considering using a NOLOCK hint to get around this, because this is the only task that writes to a certain table partition. However, I am only coming to this conclusion because I cannot figure out anything else, aside from using a temp table or a hashed anti join.
You probably have a so-called deadlock situation. Your Data Flow Task (DFT) has two separate connection instances to the same table. The first connection instance runs the SELECT and places a Shared lock on the table; the second runs the INSERT and places a page or table lock.
A few words on the possible cause. The SSIS DFT reads table rows and processes them in batches. When the number of rows is small, the read completes within a single batch, and the Shared lock is released by the time the Insert takes place. When the number of rows is substantial, SSIS splits the rows into several batches and processes them sequentially. This allows the steps that follow the DFT Data Source to start before the Data Source has finished reading.
Reading and writing the same table in the same Data Flow is not a good design because of possible locking issues. Ways to work around it:
1. Move all the DFT logic inside a single INSERT statement and get rid of the DFT. Might not be possible.
2. Split the DFT: move the data into an intermediate table, and then move it to the target table with a following DFT or SQL command. Requires an additional table.
3. Enable Read Committed Snapshot Isolation (RCSI) on the DB and use Read Committed on the SELECT. Applicable to MS SQL DB only.
The most universal way is the second with an additional table. The third is for MS SQL only.
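The second approach can be sketched like this. The point is only the ordering: the step that reads the target writes somewhere else, and the step that writes the target reads only the intermediate table. It runs on sqlite3 (which does not reproduce SQL Server's locking), and the `Source`, `Target`, `Intermediate` and `val` names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Source (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE Target (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE Intermediate (id INTEGER PRIMARY KEY, val TEXT);
INSERT INTO Source VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO Target VALUES (1, 'a');
""")

# Step 1: the NOT EXISTS comparison reads Target but writes to Intermediate.
con.execute("""
INSERT INTO Intermediate (id, val)
SELECT s.id, s.val FROM Source s
WHERE NOT EXISTS (SELECT 1 FROM Target t WHERE t.id = s.id)
""")
con.commit()

# Step 2: the write to Target reads only Intermediate, so the read and the
# write on Target never happen in the same step.
con.execute("INSERT INTO Target (id, val) SELECT id, val FROM Intermediate")
con.execute("DELETE FROM Intermediate")
con.commit()

target_ids = [r[0] for r in con.execute("SELECT id FROM Target ORDER BY id")]
```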
I need to alter the size of a column on a large table (millions of rows). It will be set to a nvarchar(n) rather than nvarchar(max), so from what I understand, it will not be a long change. But since I will be doing this on production I wanted to understand the ramifications in case it does take long.
Should I just hit F5 from SSMS like I execute normal queries? What happens if my machine crashes? Or goes to sleep? What's the general best practice for doing long running updates? Should it be scheduled as a job on the server maybe?
Thanks
Please DO NOT just hit F5. I did this once and lost all the data in the table. Depending on the change, the update statement that is created for you actually stores the data in memory, drops the table, creates the new one with the change you want, and populates it from memory. However, in my case one of the changes I made was adding a unique constraint, so the population failed, and once the statement was over, the data in memory was dropped. This left me with the new, empty table.
I would create the table you are changing, with the change(s) you want, as a new table. Then copy the existing rows into the new table with an INSERT ... SELECT, and rename the tables in a single transaction. If there is potential for data to be entered into the table while this is running and that is an issue, you may want to lock the table.
Depending on the size of the table and duration of the statement, you may want to save the locking and re-naming for later, and after the initial population of the new table do a differential population of new data and re-name the tables.
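A minimal sketch of the copy-and-swap sequence, using sqlite3 so it can be run anywhere. The `Orders` table and `note` column are hypothetical, and SQLite does not enforce the VARCHAR length, so the new column definition is only illustrative. On SQL Server the renames would be done with sp_rename inside a transaction.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN and COMMIT below are explicit.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, note TEXT)")
con.executemany("INSERT INTO Orders VALUES (?, ?)", [(1, "first"), (2, "second")])

# 1. Create the new table with the changed column definition.
con.execute("CREATE TABLE Orders_new (id INTEGER PRIMARY KEY, note VARCHAR(100))")

# 2. Copy the data while the original table stays untouched.
con.execute("INSERT INTO Orders_new (id, note) SELECT id, note FROM Orders")

# 3. Check the copy before swapping anything.
original = con.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
copied = con.execute("SELECT COUNT(*) FROM Orders_new").fetchone()[0]
if copied != original:
    raise RuntimeError("row counts differ; not swapping tables")

# 4. Swap the names in one transaction and keep the old table as a fallback.
con.execute("BEGIN")
con.execute("ALTER TABLE Orders RENAME TO Orders_old")
con.execute("ALTER TABLE Orders_new RENAME TO Orders")
con.execute("COMMIT")

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

If the copy fails partway through, the original table is still intact, which is the failure case described in the answer above.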
Sorry for the long post.
Edit:
Also, if the connection times out due to duration, then run the insert statement locally on the DB server. You could also create a job and run that, however it is essentially the same thing.
I'm using SSIS and BIDS to process a text file that contains lots (millions) of records. I decided to use the Bulk Insert Task and it worked great, but then the destination table needed an additional column with a default value on the insert operation, and the Bulk Insert Task stopped working. After that, I decided to use a Derived Column with the default value and an OleDB Destination to insert the bulk data. It solved that problem but created a new one: if there is an error when inserting the data in the OleDB Destination, it executes a full rollback and no rows are added to my table, whereas with the Bulk Insert Task the rows from the completed batches (based on the BatchSize configuration) were kept. Let me explain it with a sample:
I use a text file with 5000 lines. The file contained a duplicate id (intentionally) between the rows 3000 and 4000.
Before starting the DTS, the destination table was totally empty.
Using the Bulk Insert Task with the BatchSize attribute set to 1000, after the error was raised (and the DTS stopped), the destination table had 3000 rows.
Using the OleDB Destination, after the error was raised, the destination table had 0 rows! I set the Rows per batch attribute to 1000 and the Maximum insert commit size to its max value: 2147483647. I tried changing the latter to 0, with no effect.
Is this the normal behavior of OleDB Destination? Can someone provide me a guide about working with these tasks? Should I forget to use these tasks and use the Bulk Insert from T-SQL?
As a side note, I also tried following the instructions for KEEPNULLS in "Keep Nulls or Use Default Values During Bulk Import (SQL Server)" to avoid the OleDB Destination task, but it didn't work (maybe it's just me).
EDIT: Additional info about the problem.
Table structure (sample)
Table T
id int, name varchar(50), processed int default 0
CSV File (sample)
1, hello
2, world
There is no rolling back on Bulk Inserts, that's why they are fast.
Take a look at using format files:
http://msdn.microsoft.com/en-us/library/ms179250.aspx
You could potentially place this in a transaction in SSIS (you'll need MSDTC running), or you could create a T-SQL script with a try-catch to handle any exceptions from the bulk insert (probably just a rollback or a commit).
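The two outcomes from the question (3000 rows kept versus 0 rows kept) come down to where the commits happen. The sketch below reproduces that with sqlite3 and the sample table T from the question; the `load` helper is a hypothetical name, and it models commit granularity only, not the internals of the Bulk Insert Task or the OleDB Destination.

```python
import sqlite3

def load(ids, batch_size):
    """Insert ids in batches, committing each batch; return rows kept."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE T (id INTEGER PRIMARY KEY, name TEXT, "
                "processed INTEGER DEFAULT 0)")
    try:
        for start in range(0, len(ids), batch_size):
            with con:  # commit this batch, or roll it back on error
                con.executemany(
                    "INSERT INTO T (id, name) VALUES (?, ?)",
                    [(i, "row") for i in ids[start:start + batch_size]])
    except sqlite3.IntegrityError:
        pass  # the load stops at the failing batch, as in the question
    count = con.execute("SELECT COUNT(*) FROM T").fetchone()[0]
    con.close()
    return count

ids = list(range(1, 5001))
ids[3499] = 3000  # duplicate id placed between rows 3000 and 4000

per_batch = load(ids, 1000)           # earlier batches stay committed
all_or_nothing = load(ids, len(ids))  # one transaction: everything rolls back
```

With a batch size of 1000 the first three batches are already committed when the fourth fails. With a single batch covering the whole file, the failure rolls back every row, which matches a commit size large enough to cover the file.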