Data flow insert lock - SQL

I have an issue with my data flow task locking. The task compares a couple of tables from the same server, and the result is inserted into one of the tables being compared. The table being inserted into is compared via a NOT EXISTS clause.
When performing a fast load, the task freezes without errors; when doing a regular insert, the task gives a deadlock error.
I have 2 other tasks that perform the same action on the same table, and they work fine, but the amount of data being inserted is a lot smaller. I am not running these tasks in parallel.
I am considering using the NOLOCK hint to get around this, since this is the only task that writes to a certain table partition. However, I am only coming to this conclusion because I cannot figure out anything else, aside from using a temp table or a hashed anti-join.

You probably have a classic deadlock situation. Your Data Flow Task (DFT) holds two separate connections to the same table: the first connection runs the SELECT and places a shared lock on the table, while the second runs the INSERT and places a page or table lock.
A few words on the likely cause. The SSIS DFT reads table rows and processes them in batches. When the number of rows is small, the read completes within a single batch, and the shared lock is released before the insert takes place. When the number of rows is substantial, SSIS splits them into several batches and processes them sequentially. This allows the steps after the DFT Data Source to run before the Data Source has finished reading.
Reading and writing the same table in the same Data Flow is a risky design precisely because of this locking issue. Ways to work around it:
1. Move all the DFT logic into a single INSERT ... SELECT statement and drop the DFT. This might not be possible.
2. Split the DFT: move the data into an intermediate table first, then move it to the target table with a subsequent DFT or SQL command. An additional table is needed.
3. Enable Read Committed Snapshot Isolation (RCSI) on the database and use Read Committed on the SELECT. Applicable to MS SQL Server only.
The most universal approach is the second, with the additional table; the third works on MS SQL Server only.
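The second way (the intermediate table) can be sketched like this. This is a minimal stand-in using SQLite and placeholder names (`src`, `tgt`, `staging`) just to show the shape: the NOT EXISTS comparison writes only to a staging table, so the read of the target and the insert into it never overlap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src     (id INTEGER PRIMARY KEY);
CREATE TABLE tgt     (id INTEGER PRIMARY KEY);
CREATE TABLE staging (id INTEGER PRIMARY KEY);
INSERT INTO src VALUES (1), (2), (3);
INSERT INTO tgt VALUES (2);
""")
# step 1: find the missing rows (reads tgt, writes only staging)
conn.execute("""
    INSERT INTO staging
    SELECT id FROM src s
    WHERE NOT EXISTS (SELECT 1 FROM tgt t WHERE t.id = s.id)
""")
# step 2: load the target (writes tgt, with no concurrent read of it)
conn.execute("INSERT INTO tgt SELECT id FROM staging")
ids = [r[0] for r in conn.execute("SELECT id FROM tgt ORDER BY id")]
print(ids)  # [1, 2, 3]
```

In the SSIS package this would be one data flow into the staging table followed by a plain Execute SQL Task for step 2.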

Related

DB2: Working with concurrent DDL operations

We are working on a data warehouse using IBM DB2 and we wanted to load data by partition exchange. That means we prepare a temporary table with the data we want to load into the target table and then use that entire table as a data partition in the target table. If there was previous data we just discard the old partition.
Basically you just do "ALTER TABLE target_table ATTACH PARTITION pname [starting and ending clauses] FROM temp_table".
It works wonderfully, but only for one operation at a time. If we do multiple loads in parallel or try to attach multiple partitions to the same table it's raining deadlock errors from the database.
From what I understand, the problem isn't necessarily with parallel access to the target table itself (locking it changes nothing), but accesses to system catalog tables in the background.
I have combed through the DB2 documentation, but the only reference to concurrent DDL statements I found at all was advice to avoid them. Surely the answer can't simply be to not attempt it?
Does anyone know a way to deal with this problem?
I tried having a global, single synchronization table that must be locked before attaching any partition, but it didn't help either. Either I'm missing something (implicit commits somewhere?) or some of the catalog updates happen asynchronously, which would make the whole problem much worse. If that is the case, is there any chance at all to query whether an attach is safe to perform at a given moment?

What are the performance consequences of different isolation levels in PostgreSQL?

I am writing an archival script (in Python using psycopg2) that needs to pull a very large amount of data out of a PostgreSQL database (9.4), process, upload and then delete it from the database.
I start a transaction, execute a select statement to create a named cursor, fetch N rows at a time from the cursor and do processing and uploading of parts (using S3 multipart upload). Once the cursor is depleted and no errors occurred, I finalize the upload and execute a delete statement using the same conditions as I did in select. If delete succeeds, I commit the transaction.
The database is being actively written to, and it is important both that exactly the same rows get archived and deleted, and that reads and writes to the database (including the table being archived) continue uninterrupted. That said, the tables being archived contain logs, so existing records are never modified; only new records are added.
So the questions I have are:
What isolation level should I use to ensure that the same rows get archived and deleted?
What impact will these operations have on database read/write ability? Does anything get write or read locked in the process I described above?
You have two good options:
Get the data with
SELECT ... FOR UPDATE
so that the rows get locked. Then they are guaranteed to still be there when you delete them.
Use
DELETE FROM ... RETURNING *
Then insert the returned rows into your archive.
The second solution is better, because you need only one statement.
Nothing bad can happen. If the transaction fails for whatever reason, no row will be deleted.
You can use the default READ COMMITTED isolation level for both solutions.
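A minimal sketch of the second option, using SQLite as a stand-in for PostgreSQL (both support `DELETE ... RETURNING`; SQLite needs version 3.35+). Table and column names here are hypothetical; with psycopg2 the statement would look the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, msg TEXT)")
conn.executemany("INSERT INTO logs (msg) VALUES (?)", [("a",), ("b",), ("c",)])

# Delete and capture the deleted rows in one atomic statement, so the
# archived set and the deleted set cannot diverge.
archived = conn.execute(
    "DELETE FROM logs WHERE id <= 2 RETURNING id, msg"
).fetchall()
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
print(sorted(archived))  # [(1, 'a'), (2, 'b')]
print(remaining)         # 1
```

If the transaction rolls back before the commit, neither the delete nor the archive "happens", which is exactly the safety property described above.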

How do I lock out writes to a specific table while several queries execute?

I have a table set up in my SQL Server that keeps track of inventory items (in another database) that have changed. This table is fed by several different triggers. Every 15 minutes a scheduled task runs a batch file that executes a number of different queries, which send updates on the items flagged in this table to several ecommerce websites. The last query in the batch file resets the flags.
As you can imagine, there is potential to lose changes if an item is flagged while this batch file is running. I have worked around this by replaying the last 25 hours of updates every 24 hours, just in case this scenario happens. It works, but IMO it is kind of clumsy.
What I would like to do is delay any writes to this table until my script finishes, and resets the flags on all the rows that were flagged when the script started. Then allow all of these delayed writes to happen.
I've looked into doing this with table hints (TABLOCK) but this seems to be limited to one query--unless I'm misunderstanding what I have read, which is certainly possible. I have several that run in succession. TIA.
Alex
Could you modify your script into a stored procedure that extracts all the data into a temporary table, using a SELECT statement that takes a lock on the production table? You could then release the lock on the main table and do all your processing in the temporary table (or a permanent table built for the purpose), away from the live system. It will be slower and put more load on your SQL box, but speed shouldn't be an issue once you have a point-in-time snapshot to work from.
If that option is not applicable, then maybe you could wrap the whole thing in a transaction and put a table lock on your production table with the first SELECT statement.
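As a rough analogy for the transaction-plus-table-lock idea, here is a SQLite sketch using `BEGIN IMMEDIATE` to hold the write lock for the whole run (on SQL Server you would instead use `WITH (TABLOCKX, HOLDLOCK)` on the first SELECT inside a transaction). All names here are hypothetical.

```python
import sqlite3, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
maint = sqlite3.connect(path, isolation_level=None)  # autocommit; explicit BEGIN/COMMIT
maint.execute("CREATE TABLE flags (item_id INTEGER PRIMARY KEY, changed INTEGER)")

writer = sqlite3.connect(path, timeout=0, isolation_level=None)  # fail fast when blocked

maint.execute("BEGIN IMMEDIATE")               # take the write lock up front
maint.execute("UPDATE flags SET changed = 0")  # ...the batch of queries, then reset flags...

blocked = False
try:
    writer.execute("INSERT INTO flags VALUES (1, 1)")  # a trigger's write meanwhile
except sqlite3.OperationalError:
    blocked = True                             # the write is delayed, not lost

maint.execute("COMMIT")                        # release the lock
writer.execute("INSERT INTO flags VALUES (1, 1)")      # now it goes through
print(blocked)  # True
```

The key point is that the delayed write happens after the flags are reset, so the next batch run picks it up instead of it being silently clobbered.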
Good luck mate

Best way to do a long running schema change (or data update) in MS Sql Server?

I need to alter the size of a column on a large table (millions of rows). It will be set to an nvarchar(n) rather than nvarchar(max), so from what I understand it should not be a long change. But since I will be doing this in production, I wanted to understand the ramifications in case it does take long.
Should I just hit F5 from SSMS like I execute normal queries? What happens if my machine crashes? Or goes to sleep? What's the general best practice for doing long running updates? Should it be scheduled as a job on the server maybe?
Thanks
Please DO NOT just hit F5. I did this once and lost all the data in the table. Depending on the change, the update script that is created for you actually stores the data in memory, drops the table, creates the new one with the change you want, and repopulates the data from memory. However, in my case one of the changes I made was adding a unique constraint, so the repopulation failed, and once the statement finished the data in memory was discarded. This left me with a new, empty table.
I would create the table you are changing, with the change(s) you want, as a new table. Then SELECT * INTO the new table, and finally rename the tables in a single statement. If there is potential for data to be entered into the table while this is running and that is an issue, you may want to lock the table.
Depending on the size of the table and the duration of the statement, you may want to save the locking and renaming for later: after the initial population of the new table, do a differential load of the new data and then rename the tables.
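The copy-and-swap pattern above can be sketched like this, using SQLite just to show the shape and hypothetical names; on SQL Server, `orders_new` would carry the new nvarchar(n) definition and the swap would be done with `sp_rename`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, notes TEXT);
INSERT INTO orders VALUES (1, 'abc'), (2, 'def');

-- 1) create the replacement with the changed column definition
CREATE TABLE orders_new (id INTEGER PRIMARY KEY, notes TEXT);
-- 2) copy the data (the long-running part; consider locking here)
INSERT INTO orders_new SELECT id, notes FROM orders;
-- 3) swap the names, keeping the original around as a fallback
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
""")
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 2
```

Keeping `orders_old` until you have verified the new table gives you a cheap rollback path.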
Sorry for the long post.
Edit:
Also, if the connection times out because of the duration, run the insert statement locally on the DB server. You could also create a job and run that; however, it is essentially the same thing.

SSIS - Delete Existing Rows then Insert, Incomplete Result

I'm relatively new to SSIS and know that handling duplicates is an oft repeated question, so thank you in advance for reading through my wall of text and for any help with my complicated situation.
I have a small 18179-row table (we'll call it Destination) that needs to be updated with SSIS from a flat file. The 18179-row flat file I am testing contains only records that exist in Destination and have changed. Currently, I have a package that loads a staging table (we'll call it Stage) from the flat file, then moves to the Data Flow and Lookup.
The Data Flow takes Stage and performs the Lookup LKP_OrderID from Stage against Destination, using primary key OrderID, to see whether the record exists.
If the OrderID does not exist in Destination, then it follows the New OrderID path and the record is inserted into Destination at DST_OLE_Dest.
Here is where I am having trouble: If the OrderID does exist in Destination, then it follows the Existing OrderID path. The CMD_Delete_Duplicates OLE DB Command executes:
DELETE d
FROM dbo.Destination d
INNER JOIN dbo.Stage s ON d.OrderID = s.OrderID
This should delete any records from Destination that exist in Stage. Then it should insert the updated version of those records from Stage at DST_OLE_Desti.
However, it seems to process the 18179 rows in 2 batches: in the first batch it processes 9972 rows.
Then, in the 2nd batch it processes the remaining 8207 rows. It displays that it inserted all 18179 rows to Destination, but I only end up with the last batch of 8207 rows in Destination.
I believe it deletes and inserts the 1st batch of 9972 rows, then runs the above delete from inner join SQL again for the 2nd batch of 8207 rows, inadvertently deleting the just-inserted 9972 rows and leaving me with the 8207.
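This failure mode is easy to reproduce outside SSIS. A tiny sketch with hypothetical names, using SQLite (and firing the delete once per batch rather than per row, which is enough to show the mechanism): because the delete joins the whole Stage table, re-running it for batch 2 removes the rows batch 1 just inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stage (order_id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE dest  (order_id INTEGER PRIMARY KEY, val TEXT);
""")
rows = [(i, f"new{i}") for i in range(1, 5)]
conn.executemany("INSERT INTO stage VALUES (?, ?)", rows)
conn.executemany("INSERT INTO dest VALUES (?, ?)", [(i, "old") for i in range(1, 5)])

for batch in (rows[:2], rows[2:]):  # two SSIS buffers
    # the delete is scoped to ALL of stage, not just this batch
    conn.execute("DELETE FROM dest WHERE order_id IN (SELECT order_id FROM stage)")
    conn.executemany("INSERT INTO dest VALUES (?, ?)", batch)

survivors = conn.execute("SELECT COUNT(*) FROM dest").fetchone()[0]
print(survivors)  # 2 -- only the last batch survives, not all 4 rows
```

Hoisting the delete out of the loop (run once, before any loading) leaves all four rows in place, which is the fix described below.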
I've found that maximizing DefaultBufferSize to 104857600 bytes and increasing DefaultBufferMaxRows in the Data Flow, so that the package processes all 18179 rows in one batch, correctly deletes and inserts all 18179. But once my data exceeds that 104857600-byte buffer, this will again be an issue. I can also use the OLE DB Command transformation to run
DELETE FROM dbo.Destination WHERE OrderID = ?
This should pass OrderID from Stage and delete from Destination where there is a match, but it is time intensive and takes ~10 minutes for this small table. Are there any other solutions out there for this problem? How would I go about doing an update rather than a delete and insert, if that is a better option?
Yeah, you've got logic issues in there. Your OLE DB Command is firing that delete statement for EVERY row that flows through it.
Instead, you'd want to run that delete as a preceding Execute SQL Task before the Data Flow. That would clear out the existing data in the target table before you begin loading it. Otherwise you're going to back out the freshly loaded data, much as you've observed.
There are different approaches for handling this. If deletes work for you, keep at it. Otherwise, people generally stage the updates to a secondary table and then use an Execute SQL Task, as a successor to the Data Flow Task, to perform a set-based update.
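The staged set-based update can be sketched like this, with hypothetical table and column names, using SQLite to show the shape (the `UPDATE ... FROM` syntax needs SQLite 3.33+; T-SQL has the equivalent `UPDATE d ... FROM dbo.Destination d JOIN dbo.Stage s ...`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE destination (order_id INTEGER PRIMARY KEY, amount INTEGER);
CREATE TABLE stage       (order_id INTEGER PRIMARY KEY, amount INTEGER);
INSERT INTO destination VALUES (1, 10), (2, 20);
INSERT INTO stage       VALUES (2, 99), (3, 30);
""")
# update matched rows in place instead of delete + re-insert
conn.execute("""
    UPDATE destination
    SET amount = stage.amount
    FROM stage
    WHERE stage.order_id = destination.order_id
""")
# insert only the genuinely new rows
conn.execute("""
    INSERT INTO destination
    SELECT order_id, amount FROM stage s
    WHERE NOT EXISTS (SELECT 1 FROM destination d WHERE d.order_id = s.order_id)
""")
rows = conn.execute(
    "SELECT order_id, amount FROM destination ORDER BY order_id"
).fetchall()
print(rows)  # [(1, 10), (2, 99), (3, 30)]
```

Both statements would live in a single Execute SQL Task after the data flow that fills the stage table.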
You could use the Slowly Changing Dimension tool from the SSIS toolbox to update the rows (as opposed to a delete and re-insert). You only have 'Type 1' changes by the sounds of it, so you won't need to use the Historical Attributes Inserts Output.
It would automatically take care of both streams in your illustration: inserts and updates.