Simultaneous updates to SQL table - sql

I have a table that needs to be updated based on values within the table. The whole table will be updated, and I want to ensure that SQL Server will handle the locking for me. I know I can use WITH (LOCKTYPE) hints. But as my update will be something like:
SELECT MyTable.ID, MyTable.Cost
INTO #tempTable from MyTable
-- More complex calculations here
UPDATE MyTable
SET MyTable.Cost = #tempTable.Cost+1
FROM #tempTable
WHERE MyTable.ID = #TempTable.ID
I wanted to be sure that I could lock the entire table so that if another request comes in to update the table it has to wait, but ideally, if a read request comes in it is processed. Is that possible?
Edit
Process:
Read from Table X
Perform calculations
Write to temp table Y
Average values between X and Y
Write back to X
Within the SP, once step 1 has happened, I don't want any other calls to the SP to be processed; I want them to sit in a queue. Because each update is dependent on the values already in the table, I cannot allow another call to the SP to read from the table.
In general, the SP needs to run serially; it should not be allowed to run twice at the same time. I don't mind if the call goes through and the SELECT is queued, but once a SELECT happens, another SELECT (within the SP) cannot be allowed to happen in another thread.

You can use a SQL transaction: http://msdn.microsoft.com/en-us/library/ms188929.aspx#fbid=-CQjt8-slkk
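To make the queuing behaviour described in the edit concrete, here is a minimal sketch (not the linked article's example) that wraps the procedure body in a transaction and takes a table-level update lock up front. The UPDLOCK/HOLDLOCK hints make a second call to the SP queue on its first SELECT, while plain readers taking shared locks can still read the table until the final UPDATE runs:
BEGIN TRANSACTION;

-- Table-level update lock, held until the end of the transaction. Another
-- call to the SP queues here; readers taking shared locks are not blocked yet.
SELECT MyTable.ID, MyTable.Cost
INTO #tempTable
FROM MyTable WITH (TABLOCK, UPDLOCK, HOLDLOCK);

-- More complex calculations here

UPDATE m
SET m.Cost = t.Cost + 1
FROM MyTable AS m
JOIN #tempTable AS t ON m.ID = t.ID;

COMMIT TRANSACTION;
An application lock (sp_getapplock with @LockOwner = 'Transaction') is another way to serialise the procedure itself without touching the table's locks at all.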

Related

SQL Server 2 Table Sync Recursion

I'm using triggers to keep two identical tables in a single database in sync. One is for an internal proprietary system, the other is used to expose a subset of data to the outside world. I'm not able to use the same table for both.
I need updates, inserts, deletes in either of the tables to be applied to the other.
So far I'm using triggers on both tables instead of a scheduled stored procedure because I wanted immediate updates. The problem is, an update to table A fires the trigger to update table B, which fires a trigger to update table A, which fires the trigger to table B.... and so on.
What's the best way to stop the recursion?
One way is to check the data first to see if it is different, something like this:
DECLARE @cempno int, @count int  -- types assumed for illustration
SELECT @cempno = inserted.cempno FROM inserted
SELECT @count = COUNT(*)
FROM jcempy j INNER JOIN zhhjcempy z ON j.cempno = z.cempno AND j.cempno = @cempno
WHERE (j.ccostcode <> z.ccostcode)
OR (j.cimearnreg <> z.cimearnreg)
OR (j.cimearnot <> z.cimearnot)
OR (j.cimearndt <> z.cimearndt)
OR (j.cimearnl1 <> z.cimearnl1)
IF @count = 1
BEGIN
-- Update the record
END
Another way could be to use a third table that holds status flags to show which table first initiated the update, but I've a feeling managing that will be a problem once I have 100 users hammering away at the system.
Any ideas, or comments on what sort of performance penalties the data check will incur?
Thanks!
If you can't avoid triggers, look at @@NESTLEVEL (or TRIGGER_NESTLEVEL()) within your trigger.
When the trigger is fired directly it will always be the same number; if the number is bigger, the trigger was fired by another trigger, so just do nothing.
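A minimal sketch of that guard, assuming the table names from the question (the trigger name and the sync body are placeholders):
CREATE TRIGGER trg_jcempy_sync ON jcempy
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- TRIGGER_NESTLEVEL() is 1 when the trigger is fired directly by a DML
    -- statement and greater than 1 when it was fired by another trigger
    -- (i.e. the sync trigger on the mirror table), so bail out to break the loop.
    IF TRIGGER_NESTLEVEL() > 1
        RETURN;

    -- ... apply the changes from inserted/deleted to zhhjcempy here ...
END;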

Best incremental load method using SSIS with over 20 million records

What is needed: I need 25 million records from Oracle incrementally loaded to SQL Server 2012. The package will need to handle UPDATEs, DELETEs and NEW RECORDS. The Oracle data source is always changing.
What I have: I've done this many times before, but never with anything past 10 million records. First I have an [Execute SQL Task] that is set to grab the result set of the [Max Modified Date]. I then have a query that only pulls data from the [ORACLE SOURCE] > [Max Modified Date] and have that lookup against my destination table.
I have the [ORACLE Source] connecting to the [Lookup-Destination table]; the lookup is set to NO CACHE mode. I get errors if I use partial or full cache mode because, I assume, the [ORACLE Source] is always changing. The [Lookup] then connects to a [Conditional Split] where I would input an expression like the one below.
(REPLACENULL(ORACLE.ID,"") != REPLACENULL(Lookup.ID,""))
|| (REPLACENULL(ORACLE.CASE_NUMBER,"")
!= REPLACENULL(Lookup.CASE_NUMBER,""))
I would then have the rows that the [Conditional Split] outputs go into a staging table. I then add an [Execute SQL Task] and perform an UPDATE to the DESTINATION-TABLE with the query below:
UPDATE SD
SET SD.CASE_NUMBER = UP.CASE_NUMBER,
    SD.ID = UP.ID
FROM Destination SD
JOIN STAGING.TABLE UP
ON UP.ID = SD.ID
Problem: This becomes very slow, takes a very long time, and just keeps running. How can I improve the time and get it to work? Should I use a cache transformation? Should I use a merge statement instead?
How would I use the expression REPLACENULL in the conditional split when it is a date column? Would I use something like:
(REPLACENULL(ORACLE.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000")
!= REPLACENULL(Lookup.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000"))
A pattern that is usually faster for larger datasets is to load the source data into a local staging table, then use a query like the one below to identify the new records:
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Then you just feed that dataset into an insert:
INSERT INTO TargetTable (column1, column2)
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM TargetTable TGT
WHERE TGT.MatchKey = SRC.MatchKey
)
Updates look like this:
UPDATE TGT
SET
    column1 = SRC.column1,
    column2 = SRC.column2,
    DTUpdated = GETDATE()
FROM TargetTable TGT
JOIN StagingTable SRC
    ON TGT.MatchKey = SRC.MatchKey
Note the additional column DTUpdated. You should always have a 'last updated' column in your table to help with auditing and debugging.
This is an INSERT/UPDATE approach. There are other data load approaches such as windowing (pick a trailing window of data to be fully deleted and reloaded) but the approach depends on how your system works and whether you can make assumptions about data (i.e. posted data in the source will never be changed)
You can squash the separate INSERT and UPDATE statements into a single MERGE statement, although it gets pretty big, I've had performance issues with it, and there are other documented issues with MERGE.
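For reference, a rough sketch of what that combined MERGE might look like, using the same placeholder table and column names as above:
MERGE TargetTable AS TGT
USING StagingTable AS SRC
    ON TGT.MatchKey = SRC.MatchKey
WHEN MATCHED THEN
    UPDATE SET TGT.column1   = SRC.column1,
               TGT.column2   = SRC.column2,
               TGT.DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (column1, column2, DTUpdated)
    VALUES (SRC.column1, SRC.column2, GETDATE());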
Unfortunately, there's not a good way to do what you're trying to do. SSIS has some controls and documented ways to do this, but as you have found they don't work as well when you start dealing with large amounts of data.
At a previous job, we had something similar that we needed to do. We needed to update medical claims from a source system to another system, similar to your setup. For a very long time, we just truncated everything in the destination and rebuilt every night. I think we were doing this daily with more than 25M rows. If you're able to transfer all the rows from Oracle to SQL in a decent amount of time, then truncating and reloading may be an option.
We eventually had to get away from this as our volumes grew, however. We tried to do something along the lines of what you're attempting, but never got anything we were satisfied with. We ended up with a sort of non-conventional process. First, each medical claim had a unique numeric identifier. Second, whenever the medical claim was updated in the source system, there was an incremental ID on the individual claim that was also incremented.
Step one of our process was to bring over any new medical claims, or claims that had changed. We could determine this quite easily, since the unique ID and the "change ID" column were both indexed in source and destination. These records would be inserted directly into the destination table.
The second step was our "deletes", which we handled with a logical flag on the records. For actual deletes, where records existed in destination but were no longer in source, I believe it was actually fastest to do this by selecting the DISTINCT claim numbers from the source system and placing them in a temporary table on the SQL side. Then, we simply did a LEFT JOIN update to set the missing claims to logically deleted.
We did something similar with our updates: if a newer version of the claim was brought over by our original Lookup, we would logically delete the old one. Every so often we would clean up the logical deletes and actually delete them, but since the logical delete indicator was indexed, this didn't need to be done too frequently. We never saw much of a performance hit, even when the logically deleted records numbered in the tens of millions.
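A rough sketch of that logical-delete step, with hypothetical names (#SourceClaims holds the DISTINCT claim numbers pulled from the source, Claims is the destination table, IsDeleted is the logical-delete flag):
UPDATE c
SET c.IsDeleted = 1
FROM Claims AS c
LEFT JOIN #SourceClaims AS s
    ON s.ClaimNumber = c.ClaimNumber
WHERE s.ClaimNumber IS NULL  -- claim no longer exists in the source
  AND c.IsDeleted = 0;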
This process was always evolving as our server loads and data source volumes changed, and I suspect the same may be true for your process. Because every system and setup is different, some of the things that worked well for us may not work for you, and vice versa. I know our data center was relatively good and we were on some stupid fast flash storage, so truncating and reloading worked for us for a very, very long time. This may not be true on conventional storage, where your data interconnects are not as fast, or where your servers are not colocated.
When designing your process, keep in mind that deletes are one of the more expensive operations you can perform, followed by updates and by non-bulk inserts, respectively.
Incremental Approach using SSIS
Get Max(ID) and Max(ModifiedDate) from the destination table and store them in variables
Create a temporary staging table using an EXECUTE SQL TASK and store that temporary staging table name in a variable
Take a Data Flow Task and use an OLEDB Source and OLEDB Destination to pull the data from the source system and load it into the temporary staging table
Take two Execute SQL Tasks, one for the insert process and the other for the update
Drop the temporary table
INSERT INTO sales.salesorderdetails
(
salesorderid,
salesorderdetailid,
carriertrackingnumber ,
orderqty,
productid,
specialofferid,
unitprice,
unitpricediscount,
linetotal ,
rowguid,
modifieddate
)
SELECT sd.salesorderid,
sd.salesorderdetailid,
sd.carriertrackingnumber,
sd.orderqty,
sd.productid ,
sd.specialofferid ,
sd.unitprice,
sd.unitpricediscount,
sd.linetotal,
sd.rowguid,
sd.modifieddate
FROM ##salesdetails AS sd WITH (nolock)
LEFT JOIN sales.salesorderdetails AS sa WITH (nolock)
ON sa.salesorderdetailid = sd.salesorderdetailid
WHERE NOT EXISTS
(
SELECT *
FROM sales.salesorderdetails sa
WHERE sa.salesorderdetailid = sd.salesorderdetailid)
AND sd.salesorderdetailid > ?
UPDATE sa
SET SalesOrderID = sd.salesorderid,
CarrierTrackingNumber = sd.carriertrackingnumber,
OrderQty = sd.orderqty,
ProductID = sd.productid,
SpecialOfferID = sd.specialofferid,
UnitPrice = sd.unitprice,
UnitPriceDiscount = sd.unitpricediscount,
LineTotal = sd.linetotal,
rowguid = sd.rowguid,
ModifiedDate = sd.modifieddate
FROM sales.salesorderdetails sa
LEFT JOIN ##salesdetails sd
ON sd.salesorderdetailid = sa.salesorderdetailid
WHERE sd.modifieddate > sa.modifieddate
AND sa.salesorderdetailid < ?
Entire Process took 2 Minutes to Complete
I am assuming you have some identity-like (PK) column in your Oracle table.
1 Get the max identity (business key) from the destination database (the SQL Server one)
2 Create two data flows:
a) Pull only the data > max identity from Oracle and put it into the destination directly (as these are new records).
b) Get all records < max identity with an update date > last load and put them into a temp (staging) table (as this is updated data).
3 Update the destination table with the records from the temp table (created at step b)

SQL locking issue

I have a lot of data in the table named "A". The table named "B" is empty.
I get 500 rows from table A which are not in table "B":
SELECT TOP 500 * FROM A WHERE
NOT EXISTS (
SELECT * FROM B WHERE
(
A.id1=B.id1 AND A.id2=B.id2
)
)
Everything works well, but I have one problem here.
I have a web application which calls that SQL query. Imagine that 200 people call my application at the same time; there will be duplicates in the table named "B". So how can I lock those rows?
Is this correct?
SELECT TOP 500 * FROM A WITH (ROWLOCK) WHERE
NOT EXISTS (
SELECT * FROM B WITH (ROWLOCK) WHERE
(
A.id1=B.id1 AND A.id2=B.id2
)
)
TO BE MORE CLEAR:
Imagine that when you open my web application, the page calls (for instance from a Servlet or PHP) the SQL query.
If many people open my application at the same time, there is a threading problem: while one query is writing data to table B, another query is doing the same thing in parallel.
Threading problem: while one thread reads 500 rows, another thread may read the same data, because the first thread is still writing and has not finished yet.
Thread 1 reads this data from table "A":
1,2,3,4,5,6,7 ... 500
Thread 1 writes this data to the table named "B":
1,2,3,4,5,6,7,...495
Thread 1 is not finished yet and thread 2 reads this data from table "A":
496, 497 .. 500 ,.... 995
Thread 2 writes:
496, 497 .. 500 ,.... 995
Then thread 1 again writes:
495 .. 500
A ROWLOCK is only going to lock the row.
Moving to a table lock would impact performance.
I would add a status column to A, such as InProcess.
Then, in a transaction, mark it to true.
So your query would also have a WHERE A.InProcess = false.
I need to point out that the queries that you have posted won't actually write anything to Table B.
That being said, if you're just looking to lock down rows that are being worked on in Table A, I would suggest adding a status column to the table and using a stored procedure similar to the following.
DECLARE @Results TABLE
(
id1 int NOT NULL,
id2 int NOT NULL --Assuming these are ints
)
UPDATE TOP (500) A with (ROWLOCK, READPAST)
SET [Status] = 'InProcess'
OUTPUT inserted.id1,
inserted.id2
INTO @Results
WHERE [Status] <> 'InProcess'
SELECT * FROM @Results temp
INNER JOIN A
ON A.id1 = temp.id1 AND A.id2=temp.id2
I have used this method when I had multiple workers coming to a single table looking for tasks to do.
The key points here are:
SELECT statements do not lock or block, but UPDATE statements will.
The ROWLOCK hint is only used to prevent table locks. It does not magically create row locks. We provide the hint only so that we don't lock down the whole table.
The READPAST hint tells each query that if it encounters a row lock, it should proceed on to the next row instead of waiting for the lock to release. We provide this hint so that multiple queries can execute at the same time.
We set the [Status] field to indicate that this row is being worked on and should be ignored until such time as we decide to reset the status (if ever). This prevents multiple queries from picking up the same rows after the locks have been released.
While we've used an UPDATE to give us the locks we want, we still need to actually retrieve some data. The OUTPUT clause into a table variable lets us capture the keys so that we can join back to the table later.
There's a lot going on here, and maybe you don't need all this, but this is the method I found to give me high concurrency while preventing race conditions.
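As a usage note, if the work happens in the same procedure, the final step might release the rows by moving them to a terminal status once processing is done (the status value is illustrative):
UPDATE A
SET [Status] = 'Complete'
FROM A
INNER JOIN @Results temp
    ON A.id1 = temp.id1 AND A.id2 = temp.id2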

Updating 100k records in one update query

Is it possible, or recommended at all, to run one update query that will update nearly 100k records at once?
If so, how can I do that? I am trying to pass an array to my stored proc, but it seems not to work. This is my SP:
CREATE PROCEDURE [dbo].[UpdateAllClients]
@ClientIDs varchar(max)
AS
BEGIN
DECLARE @vSQL varchar(max)
SET @vSQL = 'UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN (' + @ClientIDs + ')';
EXEC(@vSQL);
END
I have no idea what's not working, but it's just not updating the relevant rows.
Anyone?
The UPDATE is reading your @ClientIDs (a comma-separated value) as a whole. To illustrate it more, you are doing this:
assume @ClientIDs = 1,2,3,4,5
your UPDATE command interprets it like this
UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN ('1,2,3,4,5')
and not
UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN (1,2,3,4,5)
One suggestion is to use a subquery in your UPDATE, for example:
UPDATE Clients
SET LastUpdate = GETDATE()
WHERE ID IN
(
SELECT ID
FROM tableName
-- where condition
)
Hope this makes sense.
A few notes to be aware of.
Big updates like this can lock up the target table. If more than about 5000 row locks are taken by the operation, the individual row locks can be escalated to a table lock, which would block other processes. Worth bearing in mind if this could cause an issue in your scenario. See: Lock Escalation
With a large number of rows to update like this, an approach I'd consider is (basic):
bulk insert the 100K Ids into a staging table (e.g. from .NET, use SqlBulkCopy)
update the target table, using a join onto the above staging table
drop the staging table
This gives you some more room for controlling the process, breaking the workload up into chunks and doing it x rows at a time.
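A rough sketch of steps 2 and 3 under those assumptions (the staging table name #ClientIdStaging and the batch size are illustrative, and the IDs are assumed to have been bulk copied in already):
DECLARE @batch TABLE (ID int PRIMARY KEY);

WHILE EXISTS (SELECT 1 FROM #ClientIdStaging)
BEGIN
    -- Pull the next batch of IDs out of the staging table...
    DELETE TOP (5000) FROM #ClientIdStaging
    OUTPUT deleted.ID INTO @batch;

    -- ...and update only those rows in the target table.
    UPDATE c
    SET c.LastUpdate = GETDATE()
    FROM Clients AS c
    JOIN @batch AS b ON b.ID = c.ID;

    DELETE FROM @batch;
END

DROP TABLE #ClientIdStaging;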
There is a limit to the number of items you can pass to IN if you are giving it an array.
So, if you just want to update the whole table, skip the IN condition.
If not, specify a SQL query inside the IN clause. That should do the job.
The database will very likely reject that SQL statement because it is too long.
When you need to update so many records at once, then maybe your database schema isn't appropriate. Maybe the LastUpdate datum should not be stored separately for each client but only once globally or only once for a constant group of clients?
But it's hard to recommend a good course of action without seeing the whole picture.
What version of SQL Server are you using? If it is 2008+, I would recommend using TVPs (table-valued parameters - http://msdn.microsoft.com/en-us/library/bb510489.aspx). The transfer of data will be faster (as opposed to building a huge string) and your query would look nicer:
update c
set lastupdate = getdate()
from clients c
join @mytvp t on c.Id = t.Id
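A minimal sketch of the TVP plumbing that query assumes (the type, procedure and parameter names are illustrative):
CREATE TYPE dbo.IdList AS TABLE (Id int PRIMARY KEY);
GO

CREATE PROCEDURE dbo.UpdateAllClients
    @ClientIds dbo.IdList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE c
    SET c.LastUpdate = GETDATE()
    FROM dbo.Clients AS c
    JOIN @ClientIds AS t ON c.Id = t.Id;
END
GO
From .NET the parameter is passed as a DataTable (or IEnumerable<SqlDataRecord>) with SqlDbType.Structured, which avoids building the huge string entirely.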
Each SQL statement on its own is a transaction. This means SQL Server is going to grab locks for all these millions of rows, which can really degrade the performance of the table, so you generally don't want to update millions of rows in a single statement. The workaround is to set ROWCOUNT before the DML operation:
SET ROWCOUNT 100
UPDATE Clients SET LastUpdate=GETDATE()
WHERE ID IN (1,2,3,4,5)
SET ROWCOUNT 0
or, from SQL Server 2008, you can parameterize the TOP keyword:
DECLARE @value int
SET @value = 100000
again:
UPDATE TOP (@value) Clients SET LastUpdate=GETDATE()
WHERE ID IN (1,2,3,4,5)
IF @@ROWCOUNT <> 0 GOTO again
See how fast the above query runs and then adjust the value of the variable. You need to break the task into smaller units, as suggested in the other answers.
Method 1:
Split @ClientIDs on the ',' delimiter,
put the values in an array and iterate over that array,
updating the Clients table for each id.
OR
Method 2:
Instead of taking @ClientIDs as a varchar, follow the steps below:
create a table type for the ids and use a join.
For faster processing you can also create an index on ClientID.

Have "select for update" block on nonrexisting rows

We have some persistent data in an application that is queried from a server and then stored in a database so we can keep track of additional information. Because we do not want to query the server again while an object is being used in memory, we do a SELECT FOR UPDATE so that other threads that want to get the same data will be blocked.
I am not sure how SELECT FOR UPDATE handles non-existing rows. If the row does not exist and another thread does another SELECT FOR UPDATE on the same row, will this thread be blocked until the other transaction finishes, or will it also get an empty result set? If it only gets an empty result set, is there any way to make it block as well, for example by inserting the missing row immediately?
EDIT:
Because there was a remark that we might be locking too much, here are some more details on the concrete usage in our case. In reduced pseudocode, our program flow looks like this:
d = queue.fetch();
r = SELECT * FROM table WHERE key = d.key() FOR UPDATE;
if r.empty() then
r = get_data_from_somewhere_else();
new_r = process_stuff( r );
if Data was present then
update row to new_r
else
insert new_r
This code is run in multiple threads, and the data that is fetched from the queue might concern the same row in the database (hence the lock). If multiple threads are using data that needs the same row, then these threads need to be serialized (order does not matter). However, this serialization fails if the row is not present, because we do not get a lock.
EDIT:
For now I have the following solution, which seems like an ugly hack to me.
select the data for update
if zero rows match then
insert some dummy data // this will block if multiple transactions try to insert
if insertion failed then
// somebody beat us at the race
select the data for update
do processing
if data was changed then
update the old or dummy data
else
rollback the whole transaction
However, I am neither 100% sure that this actually solves the problem, nor does this solution seem like good style. So if anybody has something more usable to offer, that would be great.
I am not sure how select for update handles non-existing rows.
It doesn't.
The best you can do is to use an advisory lock if you know something unique about the new row. (Use hashtext() if needed, and the table's oid to lock it.)
The next best thing is a table lock.
That being said, your question makes it sound like you're locking way more than you should. Only lock rows when you actually need to, i.e. write operations.
Example solution (I haven't found a better one :/)
Thread A:
BEGIN;
SELECT pg_advisory_xact_lock(42); -- database semaphore arbitrary ID
SELECT * FROM t WHERE id = 1;
DELETE FROM t WHERE id = 1;
INSERT INTO t (id, value) VALUES (1, 'thread A');
SELECT 1 FROM pg_sleep(10); -- only for race condition simulation
COMMIT;
Thread B:
BEGIN;
SELECT pg_advisory_xact_lock(42); -- database semaphore arbitrary ID
SELECT * FROM t WHERE id = 1;
DELETE FROM t WHERE id = 1;
INSERT INTO t (id, value) VALUES (1, 'thread B');
SELECT 1 FROM pg_sleep(10); -- only for race condition simulation
COMMIT;
This always results in the correct ordering of transaction execution.
Looking at the code added in the second edit, it looks right.
As for it looking like a hack, there's a couple options - basically it's all about moving the database logic to the database.
One is simply to put the whole "select for update, if not exists then insert" logic in a function, and do select get_object(key1,key2,etc) instead.
Alternatively, you could make an insert trigger that will ignore attempts to add an entry if it already exists, and simply do an insert before you do the select for update. This does have more potential to interfere with other code already in place, though.
(If I remember to, I'll edit and add example code later on when I'm in a position to check what I'm doing.)
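A rough sketch of the function approach, assuming a table t(key1, key2, data) with a unique constraint on (key1, key2); the names are illustrative, not taken from the question:
CREATE OR REPLACE FUNCTION get_object(p_key1 integer, p_key2 integer)
RETURNS t AS $$
DECLARE
    r t%ROWTYPE;
BEGIN
    -- Try to lock an existing row.
    SELECT * INTO r FROM t
    WHERE key1 = p_key1 AND key2 = p_key2
    FOR UPDATE;

    IF NOT FOUND THEN
        BEGIN
            -- Insert a placeholder row so later callers have something to block on.
            INSERT INTO t (key1, key2, data)
            VALUES (p_key1, p_key2, NULL)
            RETURNING * INTO r;
        EXCEPTION WHEN unique_violation THEN
            -- Somebody beat us at the race; block on their row instead.
            SELECT * INTO r FROM t
            WHERE key1 = p_key1 AND key2 = p_key2
            FOR UPDATE;
        END;
    END IF;

    RETURN r;
END;
$$ LANGUAGE plpgsql;
Callers then run SELECT * FROM get_object(1, 2) inside their transaction and hold the row lock until commit, which gives the serialization the question asks for.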