Here's the situation:
MS SQL 2008 database with a table that is updated approximately once a minute.
The table structure is similar to following:
[docID], [warehouseID], [docDate], [docNum], [partID], [partQty]
Typical working cycle:
User starts data exchange from in-house developed system:
BEGIN TRANSACTION
SELECT * FROM t1
WHERE [docDate] BETWEEN &DateStart AND &DateEnd
AND [warehouseID] IN ('w1','w2','w3')
...then the system performs rather long processing of the selected data, generates the list of [docID]s to delete from t1, and then runs
DELETE FROM t1 WHERE [docID] IN ('d1','d2','d3',...,'dN')
COMMIT TRANSACTION
Here, the problem is that while the first transaction is processing the selected data, another transaction reads the same rows, and they both end up loading the same data into the in-house system.
At first, I added a (TABLOCKX) table hint to the SELECT query. It worked pretty well until users started to complain about the system's performance.
Then I changed the hints to (ROWLOCK, XLOCK, HOLDLOCK), assuming that it would:
exclusively lock...
the selected rows (instead of the whole table)...
until the end of the transaction
But this seems to lock the whole table anyway. I have no access to the database itself, so I can't just analyze these locks (actually, I have no idea yet how to do that, even if I had access).
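For reference, given access and the VIEW SERVER STATE permission, a minimal way to inspect the locks held while such a transaction is open is the sys.dm_tran_locks DMV, run from a second connection:
SELECT request_session_id,
       resource_type,                  -- OBJECT, PAGE, KEY, RID, ...
       resource_associated_entity_id,
       request_mode,                   -- S, U, X, IX, ...
       request_status
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID();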
What I would like to have as a result:
users are able to process data related to different warehouses and dates in parallel
as a result of 1., avoid duplication of the downloaded data
Apart from locks, the other solutions I have are (although they both seem clumsy):
Implement a flag in t1 showing that the data is under processing (and then do 'SELECT ... WHERE NOT [flag]')
Divide t1 into two parts, header and details, and apply locks separately.
I believe that I might have misunderstood some concepts with regard to transaction isolation levels and/or table hints, and that there is another (better) way.
Please, advise!
You may change the concept of the workflow.
Instead of deleting records, update them by setting an extra field, Deprecated, from 0 to 1.
And read the data not from the table but from a view where Deprecated = 0.
BEGIN TRANSACTION
SELECT * FROM vT1
WHERE [docDate] BETWEEN &DateStart AND &DateEnd
AND [warehouseID] IN ('w1','w2','w3')
where vT1 view looks like this:
select *
from t1
where Deprecated = 0
And the deletion will look like this:
UPDATE t1 SET Deprecated = 1 WHERE [docID] IN ('d1','d2','d3',...,'dN')
COMMIT TRANSACTION
Using such a concept you will achieve two goals:
decrease the probability of locking
keep a history of movements between warehouses
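The one-time setup behind this is small; a sketch, using the column names from the question:
ALTER TABLE t1 ADD Deprecated bit NOT NULL DEFAULT 0;
GO
CREATE VIEW vT1
AS
SELECT [docID], [warehouseID], [docDate], [docNum], [partID], [partQty]
FROM t1
WHERE Deprecated = 0;
GO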
I've tried to ask this question at least once, but I never seem to put it across properly. I really have two questions.
My database has a table called PatientCarePlans
( ID, Name, LastReviewed, LastChanged, PatientID, DateStarted, DateEnded). There are many other fields, but these are the most important.
Every hour, a JSON extract gets a fresh copy of the data for PatientCarePlans, which may or may not be different from the existing records. That data is stored temporarily in PatientCarePlansDump. Unlike other tables, which rarely change (and if they do, only in one or two fields), this table has MANY fields which may now be different. Therefore, rather than simply copying the Dump rows to the live table based on whether each record already exists or not, my code does the no doubt wrong thing: I empty out any records in PatientCarePlans for that location, and then copy them all from the Dump table back to the live one. Since I don't know whether or not there are any changes, and there are far too many fields to check manually, I must assume that each record is different in some way or another and act accordingly.
My first question is how best (I have OKish basic knowledge, but this is essentially a useful hobby, and therefore have limited technical / theoretical knowledge) do I ensure that there is minimal disruption to the PatientCarePlans table whilst doing so? At present, my code is:
IF Object_ID('PatientCarePlans') IS NOT NULL
BEGIN
    BEGIN TRANSACTION
    DELETE FROM [PatientCarePlans] WHERE PatientID IN (SELECT PatientID FROM [Patients] WHERE location = #facility)
    COMMIT TRANSACTION
END
ELSE
    SELECT TOP 0 * INTO [PatientCarePlans]
    FROM [PatientCareplansDUMP]

INSERT INTO [PatientCarePlans] SELECT * FROM [PatientCarePlansDump]
DROP TABLE [PatientCarePlansDUMP]
My second question relates to how this process affects the numerous queries that run on and around the same time as this import. Very often those queries will act as though there are no records in the PatientCarePlans table, which causes obvious problems. I'm vaguely aware of transaction locks etc, but it goes a bit over my head given the hobby status! How can I ensure that a query is executed and results returned whilst this process is taking place? Is there a more efficient or less obstructive method of updating the table, rather than simply removing them and re-adding? I know there are merge and update commands, but none of the examples seem to fit my issue, which only confuses me more!
Apologies for the lack of knowhow, though that of course is why I'm here asking the question.
Thanks
I suggest you do not delete and re-create the table. The DDL script to create the table should be part of your database setup, not part of regular modification scripts.
You are going to want to do the DELETE and INSERT inside a transaction. Preferably you would do this under SERIALIZABLE isolation in order to prevent concurrency issues. (You could instead use a WITH (TABLOCK) hint, which would be less likely to cause a deadlock, but would completely lock the table.)
SET XACT_ABORT, NOCOUNT ON; -- always set XACT_ABORT if you have a transaction
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;

DELETE FROM [PatientCarePlans]
WHERE PatientID IN (
    SELECT p.PatientID
    FROM [Patients] p
    WHERE location = #facility
);

INSERT INTO [PatientCarePlans] (YourColumnsHere) -- always specify columns
SELECT YourColumnsHere
FROM [PatientCarePlansDump];

COMMIT;
You could also do this with a single MERGE statement. However, it is complex to write (owing to the need to restrict the set of rows being targeted), is not usually more performant than separate statements, and also needs SERIALIZABLE.
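For illustration only, a rough sketch of the MERGE form, assuming ID is the key and listing just the columns named in the question (the real statement would list all of them); it would still run under SERIALIZABLE as above:
WITH target AS
(
    SELECT *
    FROM [PatientCarePlans]
    WHERE PatientID IN (SELECT p.PatientID FROM [Patients] p WHERE p.location = #facility)
)
MERGE target
USING [PatientCarePlansDump] AS src
    ON target.ID = src.ID
WHEN MATCHED THEN
    UPDATE SET Name = src.Name,
               LastReviewed = src.LastReviewed,
               LastChanged = src.LastChanged,
               PatientID = src.PatientID,
               DateStarted = src.DateStarted,
               DateEnded = src.DateEnded
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, Name, LastReviewed, LastChanged, PatientID, DateStarted, DateEnded)
    VALUES (src.ID, src.Name, src.LastReviewed, src.LastChanged,
            src.PatientID, src.DateStarted, src.DateEnded)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;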
Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to prevent concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or at least I haven't implemented it in such a way that rows are processed only once across the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
    in in_limit int
    ,in in_run_group_id varchar(36)
    ,out ot_result table (
        id bigint
        ,runGroupId varchar(36)
        ,sourceTableRefId integer
        ,name nvarchar(22)
        ,location nvarchar(13)
        ,regionCode nvarchar(3)
        ,countryCode nvarchar(3)
    )
) language sqlscript sql security definer as
begin
    -- insert new records:
    insert into "ACOX"."search_result_v4" (
        "RUN_GROUP_ID"
        ,"BEGIN_DATE_TS"
        ,"SOURCE_TABLE"
        ,"SOURCE_TABLE_REFID"
    )
    select
        in_run_group_id as "RUN_GROUP_ID"
        ,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
        ,'acox.searchCriteria' as "SOURCE_TABLE"
        ,fp.descriptor_id as "SOURCE_TABLE_REFID"
    from
        acox.searchCriteria fp
        left join "ACOX"."us_state_codes" st
            on trim(fp.region) = trim(st.usps)
        left outer join "ACOX"."search_result_v4" r
            on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;

    -- select records inserted for return:
    ot_result =
        select
            r.ID id
            ,r.RUN_GROUP_ID runGroupId
            ,fp.descriptor_id sourceTableRefId
            ,fp.merch_name name
            ,fp.Location location
            ,st.usps regionCode
            ,'USA' countryCode
        from
            acox.searchCriteria fp
            left join "ACOX"."us_state_codes" st
                on trim(fp.region) = trim(st.usps)
            inner join "ACOX"."search_result_v4" r
                on fp.descriptor_id = r.source_table_refid
                and r.COMPLETE_DATE_TS is null
                and r.RUN_GROUP_ID = in_run_group_id
        where
            st.usps is not null
        limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly; it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks, it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented, does not really matter.
One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs).
In whatever way this is implemented, this is the point in time, where this processed status needs to be captured.
COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially being processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB transaction context, allowing for a guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
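A minimal sketch of that flow (the PROCESSED flag column and the batch size are assumptions; a separate tracking table works just as well):
-- 1) Lock a batch of unprocessed rows; other sessions can still read them,
--    but cannot update or lock them until this transaction ends.
select descriptor_id
from acox.searchCriteria
where processed = 0
limit 100
for update;

-- 2) ... do the processing, write the results to the target table ...

-- 3) Still inside the same transaction, mark exactly the rows locked in 1).
update acox.searchCriteria
set processed = 1
where descriptor_id in (/* the ids returned by step 1 */);

commit; -- or rollback, which releases the locks without marking anything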
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.
We face the following situation (Teradata):
Business layer frequently executes long-running queries on Table X_Past UNION ALL Table X_Today.
Table X_Today gets updated frequently, say once every 10 minutes. X_Past only once after midnight (per full-load).
Writing process should not block reading process.
Writing should happen as soon as new data is available.
Proposed approach:
2 "Today" and a "past" table, plus a UNION ALL view that selects from one of them based on the value in a load status table.
X_Today_1
X_Today_0
X_Past
the loading process will load into X_Today_1 and set the active_table value in the load status table to "X_Today_1"
next time it will load into X_Today_0 and set the active_table value to "X_Today_0"
etc.
The view that is used to select on the table will be built as follows:
select *
from X_PAST
UNION ALL
select td1.*
from X_Today_1 td1
, ( select active_table from LOAD_STATUS ) active_tab1
where active_tab1.active_table = 'X_Today_1'
UNION ALL
select td0.*
from X_Today_0 td0
, ( select active_table from LOAD_STATUS ) active_tab0
where active_tab0.active_table = 'X_Today_0'
my main questions:
when executing the select, will there be a lock on ALL tables, or only on those that are actually accessed for data? Since, because of the where clause, data from one of the Today_1/0 tables will always be ignored, that table should be available for loading;
do we need any form of locking or is the default locking mechanism that what we want (which I suspect it is)?
will this work, or am I overlooking something?
It is important that the loading process will wait in case the reading process takes longer than 20 minutes and the loader is about to refresh the second table again. The reading process should never really be blocked, except maybe by itself.
Any input is much appreciated...
thank you for your help.
A few comments on your questions:
Depending on the query structure, the Optimizer will try to get the default locks (in this case a READ lock) at different levels -- most likely table or row-hash locks. For example, if you do a SELECT * FROM my_table WHERE PI_column = 'value', you should get a row-hash lock and not a table lock.
Try running an EXPLAIN on your SELECT and see if it gives you any locking info. The Optimizer might be smart enough to determine there are 0 rows in one of the joined tables and reduce the lock requests. If it still locks both tables, see the end of this post for an alternative approach.
Your query written as-is will result in READ locks, which would block any WRITE requests on the tables. If you are worried about locking issues / concurrency, have you thought about using an explicit ACCESS lock? This would allow your SELECT to run without ever having to wait for your write queries to complete. This is called a "dirty read", since there could be other requests still modifying the tables while they are being read, so it may or may not be appropriate depending on your requirements.
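For example, one common way to get that (a sketch; the combined view name is assumed, and your status-table predicates would stay in place) is to build the LOCKING modifier into the view itself, so every reader gets it automatically:
REPLACE VIEW X_Combined AS
LOCKING ROW FOR ACCESS
SELECT * FROM X_Past
UNION ALL
SELECT * FROM X_Today_1
UNION ALL
SELECT * FROM X_Today_0;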
Your approach seems feasible. You could also do something similar, but instead of having two UNIONs, have a single "X_Today" view that points to the "active" table. After your load process completes, you could re-point the view to the appropriate table as needed via a MACRO call:
-- macros (switch between active / loading)
REPLACE MACRO switch_to_today_table_0 AS (
    REPLACE VIEW X_Today AS SELECT * FROM X_Today_0;
);
REPLACE MACRO switch_to_today_table_1 AS (
    REPLACE VIEW X_Today AS SELECT * FROM X_Today_1;
);
-- SELECT query
SELECT * FROM X_PAST UNION ALL SELECT * FROM X_Today;
-- Write request
MERGE INTO x_today_0...;
-- Switch active "today" table to the most recently loaded one
EXEC switch_to_today_table_0;
You'd have to manage which table to write to (or possibly do that using a view too) and which "switch" macro to call within your application.
One thing to think about is that having two physical tables that logically represent the same table (i.e. should have the same data) may potentially allow for situations where one table is missing data and needs to be manually synced.
Also, if you haven't looked at them already, a few ideas to optimize your SELECT queries to run faster: row partitioning, indexes, compression, statistics, primary index selection.
I've got a 3-tier app and have data cached on a client side, so I need to know when data changed on the server to keep this cache in sync.
So I added a "lastmodification" field to the tables, and update this field when the data changes. But some 'parent' lastmodification rows must also be updated when child rows (linked via FK) are modified.
Fetching the MAX(lastmodification) from the main table, and MAX from a related table, and then MAX of these several values was working but was a bit slow.
I mean:
MAX(MAX(MAIN_TABLE), MAX(CHILD1_TABLE), MAX(CHILD2_TABLE))
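In SQL terms, roughly this (a sketch using the table names above):
SELECT MAX(lastmodification) AS lastmodification
FROM (
    SELECT MAX(lastmodification) AS lastmodification FROM MAIN_TABLE
    UNION ALL
    SELECT MAX(lastmodification) FROM CHILD1_TABLE
    UNION ALL
    SELECT MAX(lastmodification) FROM CHILD2_TABLE
) AS all_tables;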
So I switched and added a trigger to this table so that it updates a row in a TABLE_METADATA table:
CREATE TABLE [TABLE_METADATA](
    [TABLE_NAME] [nvarchar](250) NOT NULL,
    [TABLE_LAST_MODIFICATION] [datetime] NOT NULL
)
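The trigger on the main table then looks something like this sketch (the base table name is assumed):
CREATE TRIGGER TRG_MainTable_LastModification ON [MAIN_TABLE]
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE [TABLE_METADATA]
    SET [TABLE_LAST_MODIFICATION] = GETDATE()
    WHERE [TABLE_NAME] = 'MAIN_TABLE';
END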
Now a related table can update the 'main' table's last modification time by simply also updating the last modification row in the metadata table.
Fetching the lastmodification is now fast.
But... now I get random deadlocks related to updating this table.
This is due to two transactions modifying TABLE_METADATA at different steps and then blocking each other.
My question: Do you see a way to keep this lastmodification update without locking the row?
In my case I really don't care if:
the lastmodification stays updated even if the transaction is rolled back
a 'dirty' lastmodification (updated but not yet committed) is overwritten by a new value
In fact, I really don't need these updates to be in the transaction, but as they are executed by the trigger, they automatically run in the current transaction.
Thank you for any help
As far as I know, you cannot prevent a U-lock. However, you could try reducing the number of locks to a minimum by using with (rowlock).
This will tell the query optimiser to lock rows one by one as they are updated, rather than to use a page or table lock.
You can also use with (nolock) on tables which are joined to the table which is being updated. An alternative to this would be to use set transaction isolation level read uncommitted.
Be careful using this method though, as you can potentially create corrupted data.
For example:
update mt with (rowlock)
set SomeColumn = Something
from MyTable mt
inner join AnotherTable at with (nolock)
on mt.mtId = at.atId
You can also add with (rowlock) and with (nolock)/set transaction isolation level read uncommitted to other database objects which often read and write the same table, to further reduce the likelihood of a deadlock occurring.
If deadlocks are still occurring, you can reduce read locking on the target table by self joining like this:
update mt with (rowlock)
set SomeColumn = Something
from MyTable mt
where mt.Id in (select Id from MyTable mt2 where Column = Condition)
More documentation about table hints can be found here.
I have a table in a SQL Server 2005 Database that is used a lot. It has our product on hand availability information. We get updates every hour from our warehouse and for the past few years we've been running a routine that truncates the table and updates the information. This only takes a few seconds and has not been a problem, until now. We have a lot more people using our systems that query this information now, and as a result we're seeing a lot of timeouts due to blocking processes.
... so ...
We researched our options and have come up with an idea to mitigate the problem.
We would have two tables. Table A (active) and table B (inactive).
We would create a view that points to the active table (table A).
All things needing this tables information (4 objects) would now have to go through the view.
The hourly routine would truncate the inactive table, update it with the latest information, then update the view to point at the inactive table, making it the active one.
This routine would determine which table is active and basically switch the view between them.
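For concreteness, the swap step might look roughly like this sketch (the view, table, and config-table names are assumptions; ALTER VIEW must be the first statement in a batch, hence the dynamic SQL):
IF EXISTS (SELECT 1 FROM dbo.TableConfig WHERE ActiveTable = 'A')
BEGIN
    EXEC('ALTER VIEW dbo.vProductOnHand AS SELECT * FROM dbo.ProductOnHand_B');
    UPDATE dbo.TableConfig SET ActiveTable = 'B';
END
ELSE
BEGIN
    EXEC('ALTER VIEW dbo.vProductOnHand AS SELECT * FROM dbo.ProductOnHand_A');
    UPDATE dbo.TableConfig SET ActiveTable = 'A';
END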
What's wrong with this? Will switching the view mid query cause problems? Can this work?
Thank you for your expertise.
Extra Information
the routine is an SSIS package that performs many steps and eventually truncates/updates the table in question
The blocking processes are two other stored procedures that query this table.
Have you considered using snapshot isolation? It would allow you to begin a big fat transaction for your SSIS stuff and still read from the table.
This solution seems much cleaner than switching the tables.
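A sketch of what enabling it involves (the database name is a placeholder); with these options on, readers see the last committed row versions instead of being blocked by the loading transaction:
ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;
-- Optionally make versioned reads the default for READ COMMITTED as well
-- (this one needs the database to be otherwise idle, or a ROLLBACK IMMEDIATE clause):
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;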
I think this is going about it the wrong way - updating a table has to lock it, although you can limit that locking to per page or even per row.
I'd look at not truncating the table and refilling it. That's always going to interfere with users trying to read it.
If you did update rather than replace the table you could control this the other way - the reading users shouldn't block the table and may be able to get away with optimistic reads.
Try adding the WITH (NOLOCK) hint to the SELECT statements that read the view. You should be able to get very large volumes of users reading even while the table is being regularly updated.
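For example (view and column names assumed), something like:
SELECT ProductId, QuantityOnHand
FROM dbo.vProductOnHand WITH (NOLOCK); -- the hint propagates to the underlying table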
Personally, if you are always going to be introducing down time to run a batch process against the table, I think you should manage the user experience at the business/data access layer. Introduce a table management object that monitors connections to that table and controls the batch processing.
When new batch data is ready, the management object stops all new query requests (maybe even queueing?), allows the existing queries to complete, runs the batch, then reopens the table for queries. The management object can raise an event (BatchProcessingEvent) that the UI layer can interpret to let people know that the table is currently unavailable.
My $0.02,
Nate
I just read that you are using SSIS.
You could use the TableDifference component from: http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/0/Default.aspx
This way you can apply the changes to the table one by one, but of course this will be much slower and, depending on the table size, will require more RAM on the server; however, the locking problem will be completely corrected.
Why not use transactions to update the information rather than a truncate operation?
TRUNCATE is only minimally logged and takes a schema-modification lock on the whole table, so it is a poor fit when users are querying the table at the same time.
If your operation is done in a transaction then existing users will not be affected.
How this is done would depend on things like the size of the table and how radically the data changes. If you give more detail perhaps I could advise further.
One possible solution would be to minimize the time needed to update the table.
I would first create a staging table to download the data from the warehouse.
If you have to do inserts, updates and deletes in the final table:
Let's suppose the final table looks like this:
Table Products:
ProductId int
QuantityOnHand Int
And you need to update QuantityOnHand from the warehouse.
First, create a staging table like:
Table Products_WareHouse
ProductId int
QuantityOnHand Int
And then create an "Actions" table like this:
Table Products_Actions
ProductId int
QuantityOnHand Int
Action Char(1)
The update process should then be something like this:
1. Truncate table Products_WareHouse
2. Truncate table Products_Actions
3. Fill the Products_WareHouse table with the data from the warehouse
4. Fill the Products_Actions table with this:
Inserts:
INSERT INTO Products_Actions (ProductId, QuantityOnHand, Action)
SELECT SRC.ProductId, SRC.QuantityOnHand, 'I' AS Action
FROM Products_WareHouse AS SRC LEFT OUTER JOIN
Products AS DEST ON SRC.ProductId = DEST.ProductId
WHERE (DEST.ProductId IS NULL)
Deletes
INSERT INTO Products_Actions (ProductId, QuantityOnHand, Action)
SELECT DEST.ProductId, DEST.QuantityOnHand, 'D' AS Action
FROM Products_WareHouse AS SRC RIGHT OUTER JOIN
Products AS DEST ON SRC.ProductId = DEST.ProductId
WHERE (SRC.ProductId IS NULL)
Updates
INSERT INTO Products_Actions (ProductId, QuantityOnHand, Action)
SELECT SRC.ProductId, SRC.QuantityOnHand, 'U' AS Action
FROM Products_WareHouse AS SRC INNER JOIN
Products AS DEST ON SRC.ProductId = DEST.ProductId AND SRC.QuantityOnHand <> DEST.QuantityOnHand
Until now you haven't locked the final table.
5. In a transaction, update the final table:
BEGIN TRAN
DELETE Products FROM Products INNER JOIN
Products_Actions ON Products.ProductId = Products_Actions.ProductId
WHERE (Products_Actions.Action = 'D')
INSERT INTO Products (ProductId, QuantityOnHand)
SELECT ProductId, QuantityOnHand FROM Products_Actions WHERE Action = 'I';
UPDATE Products SET QuantityOnHand = SRC.QuantityOnHand
FROM Products INNER JOIN
Products_Actions AS SRC ON Products.ProductId = SRC.ProductId
WHERE (SRC.Action = 'U')
COMMIT TRAN
With the whole process above, you reduce the number of records to be updated to the minimum necessary, and so the time the final table is locked while updating.
You could even skip the transaction in the final step, so that the table is released between commands.
If you have the Enterprise Edition of SQL Server at your disposal then may I suggest that you use SQL Server Partitioning technology.
You could have your currently required data reside within the 'Live' partition and the updated version of the data in the 'Secondary' partition (which is not available for querying, but rather for administering data).
Once the data has been imported into the 'Secondary' partition you can instantly SWITCH the 'Live' partition OUT and the 'Secondary' partition IN, thereby incurring zero downtime and no blocking.
Once you have made the switch, you can go about truncating the no longer needed data without adversely affecting users of the newly live data (previously the Secondary partition).
Each time you need to do an import job, you simply repeat/reverse the process.
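A rough sketch of the switch itself (object names and partition numbers are assumptions; the staging and archive tables must have an identical structure on the same filegroup):
-- Move the current live data out, then the freshly loaded data in;
-- both operations are metadata-only and effectively instantaneous.
ALTER TABLE dbo.ProductOnHand SWITCH PARTITION 1 TO dbo.ProductOnHand_Old PARTITION 1;
ALTER TABLE dbo.ProductOnHand_Staging SWITCH PARTITION 1 TO dbo.ProductOnHand PARTITION 1;
-- The superseded data can now be discarded without touching readers:
TRUNCATE TABLE dbo.ProductOnHand_Old;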
To learn more about SQL Server Partitioning see:
http://msdn.microsoft.com/en-us/library/ms345146(SQL.90).aspx
Or you can just ask me :-)
EDIT:
On a side note, in order to address any blocking issues, you could use SQL Server Row Versioning technology.
http://msdn.microsoft.com/en-us/library/ms345124(SQL.90).aspx
We do this on our high-usage systems and haven't had any problems. However, as with all things database, the only way to be sure it would help would be to make the change in dev and then load test it. Not knowing what else your SSIS package does, it may still cause blocks.
If the table is not very large you could cache the data in your application for a short time. It may not eliminate blocking altogether, but it would reduce the chances that the table would be queried when an update occurs.
Perhaps it would make sense to do some analysis of the processes which are blocking, since they seem to be the part of your landscape which has changed. It only takes one poorly written query to create the blocks which you are seeing. Barring a poorly written query, maybe the table needs one or more covering indexes to speed up those queries and get you back on your way without having to re-engineer your already working code.
Hope this helps,
Bill