Update Million Rows with Foreach using WHERE with PK - sql

I have the following use case:
I have a batch file with approximately 10 million lines. Each line represents a row in my target database table, but I don't know whether a given row should be updated or inserted, so I have the following logic:
Take a line and try to make an UPDATE (always using the PK in the WHERE clause); if the result is 0 (no rows affected), go to step 2.
Take the same line as in step 1 and make an INSERT.
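To make those two steps concrete, here is a minimal per-row sketch in T-SQL, using the placeholder table and column names from the example further down (the @LINE_* variables stand for the values parsed from one batch-file line); the real code runs this in chunks of 1000 lines:
-- step 1: try the update by primary key
UPDATE TABLEA
SET ATTR1 = @LINE_ATTR1,
    ATTR2 = @LINE_ATTR2
WHERE ATTR_PK1 = @LINE_ATTR_PK1 AND ATTR_PK2 = @LINE_ATTR_PK2;

-- step 2: if nothing was updated, the row does not exist yet, so insert it
IF @@ROWCOUNT = 0
    INSERT INTO TABLEA (ATTR_PK1, ATTR_PK2, ATTR1, ATTR2)
    VALUES (@LINE_ATTR_PK1, @LINE_ATTR_PK2, @LINE_ATTR1, @LINE_ATTR2);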
Some considerations about this logic:
I'm using SQL Server.
I execute steps 1 and 2 in chunks of 1000 lines: I take 1000 lines and run step 1, and with the lines that were not updated I run step 2.
My table has an index only on the PK, and the UPDATE's WHERE clause uses only the PK.
UPDATE is about 100 times slower than INSERT: the UPDATEs run in about 1 hour while the INSERTs run in about 1 minute.
The INSERTs are executed inside a transaction, as a batch insert.
The UPDATEs are not executed inside a transaction. I tried running 1000 UPDATEs inside a transaction, but I got deadlocks because my application is multithreaded and needs to execute steps 1 and 2 in parallel.
Some things I have thought of to solve this problem:
Try to understand why the UPDATE is so slow. But really, I don't know where to start in this case; my update seems to be correct. This is an example:
UPDATE TABLEA SET ATTR1 = LINE1_ATTR1, ATTR2 = LINE2_ATTR2
WHERE ATTR_PK1 = LINE1_ATTR_PK1 AND ATTR_PK2 = LINE1_ATTR_PK2;
Change the logic: insert all 10 million rows into a TEMP TABLE (because INSERT is faster), and afterwards do an UPDATE ... FROM joining the ORIGINAL table with the TEMP TABLE.
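A rough sketch of that second idea, under the assumption that a staging table mirrors the target's key and data columns (the data types below are placeholders): one set-based UPDATE and one set-based INSERT replace the 10 million single-row statements.
-- 1) bulk load the 10 million lines into a staging table (BULK INSERT, bcp, batched INSERTs, ...)
CREATE TABLE #STAGING (
    ATTR_PK1 INT,          -- placeholder types
    ATTR_PK2 INT,
    ATTR1    VARCHAR(100),
    ATTR2    VARCHAR(100)
);

-- 2) update the rows that already exist, joining on the PK
UPDATE T
SET T.ATTR1 = S.ATTR1,
    T.ATTR2 = S.ATTR2
FROM TABLEA AS T
JOIN #STAGING AS S
  ON T.ATTR_PK1 = S.ATTR_PK1 AND T.ATTR_PK2 = S.ATTR_PK2;

-- 3) insert the rows that are not there yet
INSERT INTO TABLEA (ATTR_PK1, ATTR_PK2, ATTR1, ATTR2)
SELECT S.ATTR_PK1, S.ATTR_PK2, S.ATTR1, S.ATTR2
FROM #STAGING AS S
WHERE NOT EXISTS (
    SELECT 1 FROM TABLEA AS T
    WHERE T.ATTR_PK1 = S.ATTR_PK1 AND T.ATTR_PK2 = S.ATTR_PK2
);
An index on #STAGING (ATTR_PK1, ATTR_PK2) usually helps both the join and the NOT EXISTS check.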

Is there any way to turn off trigger on MariaDB?

I am using triggers on a table, after insert and after delete, to count items,
but I think they make the queries somewhat inefficient.
A transaction inserting 250,000 rows takes 75 seconds with the triggers enabled, but 60 seconds without them.
I have seen session variables that turn off things like unique checks; is there any similar way to turn off triggers?
I don't think MariaDB optimizes the trigger in any way (I mean it just repeats the +1 operation 250,000 times instead of doing a single +250,000).
Below are my triggers:
CREATE TRIGGER incrementTableA
AFTER INSERT ON TableA
FOR EACH ROW
UPDATE Counts
SET Counts.value = Counts.value + 1
WHERE Counts.var='totalTableA';
CREATE TRIGGER decrementTotalTableA
AFTER DELETE ON TableA
FOR EACH ROW
UPDATE Counts
SET Counts.value = Counts.value - 1
WHERE Counts.var='totalTableA';
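As far as I know MariaDB has no session switch for disabling triggers, so a common workaround is to drop the triggers around the bulk load and adjust the counter once for the whole batch. A rough sketch (only the insert trigger is shown; @inserted_rows stands for the row count your application already knows, or ROW_COUNT() taken right after the insert):
DROP TRIGGER IF EXISTS incrementTableA;

-- ... run the bulk insert of the 250,000 rows into TableA here ...

-- bump the counter once for the whole batch instead of 250,000 times
UPDATE Counts
SET Counts.value = Counts.value + @inserted_rows
WHERE Counts.var = 'totalTableA';

-- recreate the trigger exactly as before
CREATE TRIGGER incrementTableA
AFTER INSERT ON TableA
FOR EACH ROW
UPDATE Counts
SET Counts.value = Counts.value + 1
WHERE Counts.var = 'totalTableA';
Note that inserts happening in the window between DROP TRIGGER and CREATE TRIGGER would not be counted, so this only works if the bulk load is the only writer at that time.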

PostgreSQL - FOR UPDATE SKIP LOCKED deadlock

I have a parallel process that is using a queue table in PostgreSQL. The logic is:
Begin a transaction.
Mark 100 unprocessed records with a randomly generated ID.
Commit.
Run some heavy app logic that takes some time and processes the queue records marked with the ID generated in step 2.
Update the 100 processed records with a success/bad status.
Up to 20 threads are doing these steps.
However, sometimes when I try to do step 2 with this query:
UPDATE QUEUE_TABLE
SET QUEUE_TXN_GUID=$RANDOM_GUID,
QUEUE_STATUS=1
WHERE QUEUE_ROW_GUID IN
(SELECT QUEUE_ROW_GUID from QUEUE_TABLE
WHERE QUEUE_STATUS IS NULL OR QUEUE_STATUS = -1
LIMIT 100 FOR UPDATE SKIP LOCKED) RETURNING QUEUE_ROW_GUID
I get a "deadlock detected" error.
The query that I'm using in step 5 is:
UPDATE QUEUE_TABLE SET CDC_QUEUE_REZ_STATUS=$STATUS WHERE CDC_QUEUE_REZ_TXN_GUID=$RANDOM_GUID;
I don't know why I'm getting this strange deadlock, given the FOR UPDATE SKIP LOCKED in the first update's subquery.
The reason for the issue is that there are duplicates in QUEUE_ROW_GUID. The subquery locks some rows, but the outer UPDATE then matches (and updates) other rows that happen to share the same QUEUE_ROW_GUID and were never locked. That is why a concurrently running query may try to update the same rows as this one; SKIP LOCKED does not help in this case.
Given that the rows may be updated in a different order, the first query (which tries to update, say, rows 1 and 2) may update row 1 first and then wait on the lock for row 2, while a concurrently running query (which also targets rows 1 and 2) has already updated row 2 and is waiting for the lock on row 1. Hence the deadlock.
You need to use unique identifiers to update the rows after they are locked.
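One way to do that, sketched here as an untested suggestion, is to join on PostgreSQL's physical row identifier ctid (or on a column that is genuinely unique), so the outer UPDATE touches exactly the rows the subquery locked:
UPDATE QUEUE_TABLE AS q
SET QUEUE_TXN_GUID = $RANDOM_GUID,
    QUEUE_STATUS   = 1
FROM (
    SELECT ctid
    FROM QUEUE_TABLE
    WHERE QUEUE_STATUS IS NULL OR QUEUE_STATUS = -1
    LIMIT 100
    FOR UPDATE SKIP LOCKED
) AS locked
WHERE q.ctid = locked.ctid
RETURNING q.QUEUE_ROW_GUID;
Alternatively, deduplicate QUEUE_ROW_GUID and put a unique constraint (or primary key) on it, so the original IN-list query can only ever match the rows that were actually locked.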

Splitting a merge into batches in SQL Server 2008

I am trying to find the best way to perform a SQL Server merge in batches.
TEMPTABLENAME is a temp table with an id column (and all the other columns).
I am thinking of some kind of while loop using the id as a counter (a sketch appears after the sample merge below).
The real problem with the MERGE statement is that it locks the entire table for an extended amount of time when processing 200K+ records. I want to loop every 100 rows so I can release the lock and let the other applications access the table. The table has millions of rows and also fires an audit every time data is updated; that is why 160K records take around 20 to 30 minutes with the merge below.
The merge code below is a sample; in reality about 25 columns get updated/inserted.
I would be open to another way of inserting the data besides MERGE. I just cannot change the audit system or the number of records in the table.
merge Employee as Target
using TEMPTABLENAME as Source on (Target.ClientId = Source.ClientId and
Target.EmployeeReferenceId = Source.EmployeeReferenceId)
when matched then
update
set
Target.FirstName = Source.FirstName,
Target.MiddleName = Source.MiddleName,
Target.LastName = Source.LastName,
Target.BirthDate = Source.BirthDate
when not matched then
INSERT ([FirstName], [MiddleName], [LastName], [BirthDate])
VALUES (Source.FirstName, Source.MiddleName, Source.LastName, Source.BirthDate)
OUTPUT $action INTO #SummaryOfChanges;
SELECT Change, COUNT(*) AS CountPerChange
FROM #SummaryOfChanges
GROUP BY Change;
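A sketch of how the merge could be batched with a WHILE loop over the temp table's id column, as mentioned above. This is an outline under a few assumptions: id is an integer that covers TEMPTABLENAME, #SummaryOfChanges already exists, and the join keys are added to the INSERT list (the sample above omits them). Each pass runs as its own implicit transaction, so locks are released between batches:
DECLARE @BatchSize INT = 100;
DECLARE @MinId INT, @MaxId INT;
SELECT @MinId = MIN(id), @MaxId = MAX(id) FROM TEMPTABLENAME;

WHILE @MinId <= @MaxId
BEGIN
    MERGE Employee AS Target
    USING (SELECT * FROM TEMPTABLENAME
           WHERE id BETWEEN @MinId AND @MinId + @BatchSize - 1) AS Source
       ON (Target.ClientId = Source.ClientId AND
           Target.EmployeeReferenceId = Source.EmployeeReferenceId)
    WHEN MATCHED THEN
        UPDATE SET Target.FirstName  = Source.FirstName,
                   Target.MiddleName = Source.MiddleName,
                   Target.LastName   = Source.LastName,
                   Target.BirthDate  = Source.BirthDate
    WHEN NOT MATCHED THEN
        INSERT (ClientId, EmployeeReferenceId, [FirstName], [MiddleName], [LastName], [BirthDate])
        VALUES (Source.ClientId, Source.EmployeeReferenceId,
                Source.FirstName, Source.MiddleName, Source.LastName, Source.BirthDate)
    OUTPUT $action INTO #SummaryOfChanges;

    SET @MinId = @MinId + @BatchSize;
END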

Non Repeatable Read from database table in SQL Server

Suppose I have a table with 100 rows. I just want to select the top 10 rows of the table, but only those rows that have not been processed before.
For this I have added a Flag column that I update whenever I process rows.
But the problem arises when concurrent requests come in for the top 10 rows: both may get the same rows and try to update the same rows (which I don't want).
I can't use BEGIN TRANSACTION here because it will lock the table and the concurrent request will not be handled.
Requirement: my actual requirement is that when I am selecting the top 10 rows
using the flag condition and updating them, and another request comes in for the
same thing, it should select a different top 10 rows that are not being handled by
request 1.
Example: my table contains 100 rows.
{
Select top 10 * from table_name where flag=0
update table_name set top 10 flag = 1
}
(This will select the top 10 of the 100 rows and update them.)
If another request comes in at the same time, during the request above:
{
Select top 10 * from table_name where flag=0 (should skip the previous request's rows)
update table_name set top 10 flag = 1
}
Needed: (this should select the top 10 of the remaining 90 rows and update them)
I need a lock on the top 10 rows of the first request, but the lock should let the second request skip those rows even while the SELECT statements of both requests run simultaneously.
Please help me solve this.
You can use an OUTPUT clause to do both the selecting and the updating of the flag in one statement, e.g.
UPDATE TOP (10) table_name
SET flag = 1
OUTPUT inserted.*
WHERE flag = 0;
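If you also need to work with the claimed rows afterwards, the OUTPUT clause can write them into a table variable. A minimal sketch, assuming a hypothetical dbo.table_name that has an int id column in addition to flag:
DECLARE @Claimed TABLE (id INT, flag BIT);

UPDATE TOP (10) dbo.table_name
SET flag = 1
OUTPUT inserted.id, inserted.flag INTO @Claimed (id, flag)
WHERE flag = 0;

-- the 10 rows this request claimed, captured atomically with the flag update
SELECT id, flag FROM @Claimed;
Because the claim and the flag update happen in one atomic statement, two concurrent requests cannot end up flagging the same rows.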
If I understand you correctly, you don't want to use a transaction because it will lock the table for the duration of the update.
Maybe you could split the process into one part that selects the rows and updates the flag, and a second part where you actually do your update with the selected rows.
Use a transaction only for the first part of the task. This ensures the table is only locked for the absolute minimum of time.
As for your non-repeatable reads:
If you really want to enforce this policy, you should delete the selected rows from the table and optionally save them to another table where the read history is kept. The lowest-level way to guarantee this is an update of another flag (say, an "updated" flag) and a trigger after the update.
Transaction with ISOLATION LEVEL REPEATABLE READ
{
select top 10 rows
update select-flag
return the 10 rows
}
normal query
{
take the returned 10 rows and do something
change updated-flag
}
Trigger after update if updated-flag changed
{
copy updated to read-history-table
delete updated-rows
}
ISOLATION LEVELS on MSDN
REPEATABLE READ "Specifies that statements cannot read data that has
been modified but not yet committed by other transactions and that
no other transactions can modify data that has been read by the
current transaction until the current transaction completes."
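A rough T-SQL translation of the outline above, purely as an illustration: the table dbo.work_table and the columns id, select_flag and updated_flag are hypothetical, and whether REPEATABLE READ is the right isolation level for your workload would need to be verified.
-- Part 1: claim the rows inside a short transaction
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
    UPDATE TOP (10) dbo.work_table
    SET select_flag = 1
    OUTPUT inserted.*              -- return the 10 claimed rows to the caller
    WHERE select_flag = 0;
COMMIT;

-- Part 2: normal query, outside the transaction, after the app has processed the rows
UPDATE dbo.work_table
SET updated_flag = 1
WHERE id IN (/* ids of the rows processed above */);

-- Part 3: the after-update trigger (not shown) would copy rows with updated_flag = 1
-- to a read-history table and delete them from dbo.work_table.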

Running large queries in the background MS SQL

I am using MS SQL Server 2008.
I have a table which is constantly in use (data is always changing and being inserted into it); it now contains ~70 million rows.
I am trying to run a simple query over the table with a stored procedure that will probably take a few days.
I need the table to stay usable. I executed the stored procedure, and after a while every simple select-by-identity query that I try to run on the table either does not respond or runs for so long that I cancel it.
What should I do?
Here is what my stored procedure looks like:
SET NOCOUNT ON;
update SOMETABLE
set
[some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
WHERE
[some_col] = 243
Even if I try it with this added to the WHERE clause (with AND logic):
ID_COL > 57000000 and ID_COL < 60000000 and
it still doesn't work.
BTW, SomeFunction does some simple mathematical operations and looks up rows in another table that contains about 300k items but is never changed.
From my perspective your server has a serious performance problem. Even if we assume that none of the records in the query
select some_col from SOMETABLE with (nolock) where id_col between 57000000 and 57001000
were in memory, it shouldn't take 21 seconds to read the few pages sequentially from disk (your clustered index on id_col should not be fragmented if it's an auto-identity and you didn't do something odd like adding "desc" to the index definition).
But if you can't/won't fix that, my advice would be to make the update in small packages like 100-1000 records at a time (depending on how much time the lookup function consumes). One update/transaction should take no more than 30 seconds.
You see, each update keeps an exclusive lock on all the records it modified until the transaction is complete. If you don't use an explicit transaction, each statement is executed in its own automatic transaction context, so the locks get released when the update statement is done.
But you can still run into deadlocks that way, depending on what the other processes do. If they also modify more than one record at a time, or even if they gather and hold read locks on several rows, you can get deadlocks.
To avoid the deadlocks, your update statement needs to take a lock on all the records it will modify at once. The way to do this is to place the single update statement (with only the few rows limited by the id_col) in a serializable transaction like
IF @@TRANCOUNT > 0
    RAISERROR('You are in a transaction context already', 16, 1);
SET NOCOUNT ON
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
-- Insert Loop here to work "x" through the id range
BEGIN TRANSACTION
UPDATE SOMETABLE
SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
WHERE [some_col] = 243 AND id_col BETWEEN x AND x+500 -- or whatever keeps the update in the small timerange
COMMIT
-- Next loop
-- Get all the new records that arrived while you were running the loop. If there are too many you may have to paginate this as well:
BEGIN TRANSACTION
UPDATE SOMETABLE
SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
WHERE [some_col] = 243 AND id_col >= x
COMMIT
For each update this will take an update/exclusive key-range lock on the given records (but only on them, because you limit the update through the clustered index key). It will wait for any other updates on the same records to finish, then acquire its lock (blocking all other transactions, but still only for the given records), then update the records and release the lock.
The last extra statement is important, because it will take a key range lock up to "infinity" and thus prevent even inserts on the end of the range while the update statement runs.
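To make the "-- Insert Loop here" placeholder concrete, here is a rough sketch of the whole batching loop. The starting id, the batch size of 500, and taking MAX(id_col) up front are assumptions you would tune for your table:
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

DECLARE @x BIGINT = 1;                                       -- assumed starting id
DECLARE @maxId BIGINT = (SELECT MAX(id_col) FROM SOMETABLE); -- snapshot of the current end of the table
DECLARE @batch INT = 500;                                    -- assumed batch size

WHILE @x <= @maxId
BEGIN
    BEGIN TRANSACTION;
    UPDATE SOMETABLE
    SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
    WHERE [some_col] = 243 AND id_col BETWEEN @x AND @x + @batch;
    COMMIT;

    SET @x = @x + @batch + 1;
END

-- catch the records that were inserted while the loop was running
BEGIN TRANSACTION;
UPDATE SOMETABLE
SET [some_col] = dbo.ufn_SomeFunction(CONVERT(NVARCHAR(500), another_column))
WHERE [some_col] = 243 AND id_col >= @x;
COMMIT;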