What's the best way to update data in a table while it's in use without locking the table? - sql

I have a table in a SQL Server 2005 database that is used a lot. It holds our product on-hand availability information. We get updates every hour from our warehouse, and for the past few years we've been running a routine that truncates the table and reloads the information. This only takes a few seconds and has not been a problem until now. We have a lot more people using our systems that query this information now, and as a result we're seeing a lot of timeouts due to blocking processes.
... so ...
We researched our options and have come up with an idea to mitigate the problem.
We would have two tables. Table A (active) and table B (inactive).
We would create a view that points to the active table (table A).
All things needing this table's information (4 objects) would now have to go through the view.
The hourly routine would truncate the inactive table, update it with the latest information then update the view to point at the inactive table, making it the active one.
This routine would determine which table is active and basically switch the view between them.
What's wrong with this? Will switching the view mid-query cause problems? Can this work?
Thank you for your expertise.
Extra Information
The routine is an SSIS package that performs many steps and eventually truncates/updates the table in question.
The blocking processes are two other stored procedures that query this table.

Have you considered using snapshot isolation? It would allow you to begin a big fat transaction for your SSIS stuff and still let other sessions read from the table.
This solution seems much cleaner than switching the tables.
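For example (a sketch; the database and table names are illustrative):
-- Allow versioned reads so readers see the last committed data instead of blocking
ALTER DATABASE InventoryDB SET ALLOW_SNAPSHOT_ISOLATION ON;

-- A reading session opts in and queries as usual
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
SELECT ProductId, QuantityOnHand
FROM dbo.Products;                  -- sees the pre-load version while the SSIS transaction runs

-- Or make default READ COMMITTED readers use row versioning automatically
-- (needs a moment with no other active sessions in the database to switch on)
ALTER DATABASE InventoryDB SET READ_COMMITTED_SNAPSHOT ON;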

I think this is going about it the wrong way - updating a table has to lock it, although you can limit that locking to per page or even per row.
I'd look at not truncating the table and refilling it. That's always going to interfere with users trying to read it.
If you updated rather than replaced the table you could control this the other way round - the reading users shouldn't block the table and may be able to get away with optimistic reads.
Try adding the WITH (NOLOCK) hint to the SELECT statement inside the reading view. You should be able to support very large volumes of readers even with the table being regularly updated.
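For example, a sketch of such a view (object names are illustrative); note that NOLOCK means readers may see uncommitted rows, which is often acceptable for an availability lookup:
CREATE VIEW dbo.vProductAvailability
AS
SELECT ProductId, QuantityOnHand
FROM dbo.Products WITH (NOLOCK);    -- dirty reads, but readers never block behind the update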

Personally, if you are always going to be introducing down time to run a batch process against the table, I think you should manage the user experience at the business/data access layer. Introduce a table management object that monitors connections to that table and controls the batch processing.
When new batch data is ready, the management object stops all new query requests (maybe even queueing?), allows the existing queries to complete, runs the batch, then reopens the table for queries. The management object can raise an event (BatchProcessingEvent) that the UI layer can interpret to let people know that the table is currently unavailable.
My $0.02,
Nate

Just read that you are using SSIS.
You could use the TableDifference component from: http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/0/Default.aspx
This way you can apply the changes to the table one by one. Of course this will be much slower and, depending on the table size, will require more RAM on the server, but the locking problem will be eliminated.

Why not use transactions to update the information rather than a truncate operation?
TRUNCATE is a minimally logged DDL operation that takes a schema-modification lock on the table, which is what blocks your readers.
If your operation is done in a transaction then existing users will not be affected.
How this is done would depend on things like the size of the table and how radically the data changes. If you give more detail perhaps I could advise further.

One possible solution would be to minimize the time needed to update the table.
I would first create a staging table to receive the data from the warehouse.
If you have to do inserts, updates, and deletes in the final table, let's suppose the final table, Products, looks like this:
CREATE TABLE Products (
    ProductId INT,
    QuantityOnHand INT
);
And you need to update QuantityOnHand from the warehouse.
First create a staging table like:
CREATE TABLE Products_WareHouse (
    ProductId INT,
    QuantityOnHand INT
);
And then create an "Actions" table like this:
CREATE TABLE Products_Actions (
    ProductId INT,
    QuantityOnHand INT,
    Action CHAR(1)
);
The update process should then be something like this:
1. Truncate table Products_WareHouse
2. Truncate table Products_Actions
3. Fill the Products_WareHouse table with the data from the warehouse
4. Fill the Products_Actions table with this:
Inserts:
INSERT INTO Products_Actions (ProductId, QuantityOnHand, Action)
SELECT SRC.ProductId, SRC.QuantityOnHand, 'I' AS Action
FROM Products_WareHouse AS SRC
LEFT OUTER JOIN Products AS DEST ON SRC.ProductId = DEST.ProductId
WHERE DEST.ProductId IS NULL
Deletes:
INSERT INTO Products_Actions (ProductId, QuantityOnHand, Action)
SELECT DEST.ProductId, DEST.QuantityOnHand, 'D' AS Action
FROM Products_WareHouse AS SRC
RIGHT OUTER JOIN Products AS DEST ON SRC.ProductId = DEST.ProductId
WHERE SRC.ProductId IS NULL
Updates:
INSERT INTO Products_Actions (ProductId, QuantityOnHand, Action)
SELECT SRC.ProductId, SRC.QuantityOnHand, 'U' AS Action
FROM Products_WareHouse AS SRC
INNER JOIN Products AS DEST ON SRC.ProductId = DEST.ProductId
    AND SRC.QuantityOnHand <> DEST.QuantityOnHand
Until now you haven't locked the final table.
5. In a transaction, update the final table:
BEGIN TRAN;

DELETE Products
FROM Products
INNER JOIN Products_Actions ON Products.ProductId = Products_Actions.ProductId
WHERE Products_Actions.Action = 'D';

INSERT INTO Products (ProductId, QuantityOnHand)
SELECT ProductId, QuantityOnHand FROM Products_Actions WHERE Action = 'I';

UPDATE Products
SET QuantityOnHand = SRC.QuantityOnHand
FROM Products
INNER JOIN Products_Actions AS SRC ON Products.ProductId = SRC.ProductId
WHERE SRC.Action = 'U';

COMMIT TRAN;
With the process above, you reduce the number of records to be changed to the minimum necessary, and therefore the time the final table is locked while updating.
You could even skip the transaction in the final step, so the table is released between commands.

If you have the Enterprise Edition of SQL Server at your disposal then may I suggest that you use SQL Server Partitioning technology.
You could have your currently required data reside within the 'Live' partition and the updated version of the data in the 'Secondary' partition (which is not available for querying but rather for administering data).
Once the data has been imported into the 'Secondary' partition you can instantly SWITCH the 'Live' partition OUT and the 'Secondary' partition IN, thereby incurring zero downtime and no blocking.
Once you have made the switch, you can go about truncating the no-longer-needed data without adversely affecting users of the newly live data (previously the Secondary partition).
Each time you need to do an import job, you simply repeat/reverse the process.
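A rough sketch of the switch step between staging and live tables (object names are illustrative; the tables must have identical schemas, indexes, and constraints, and the switch target must be empty):
-- Stage the new hourly data without touching the live table
TRUNCATE TABLE dbo.Products_Staging;
INSERT INTO dbo.Products_Staging (ProductId, QuantityOnHand)
SELECT ProductId, QuantityOnHand FROM dbo.WarehouseFeed;   -- stand-in for the warehouse feed

-- The switches are metadata-only and effectively instantaneous for readers
TRUNCATE TABLE dbo.Products_Old;                           -- the target of a switch must be empty
ALTER TABLE dbo.Products SWITCH TO dbo.Products_Old;       -- move the current data out
ALTER TABLE dbo.Products_Staging SWITCH TO dbo.Products;   -- move the new data in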
To learn more about SQL Server Partitioning see:
http://msdn.microsoft.com/en-us/library/ms345146(SQL.90).aspx
Or you can just ask me :-)
EDIT:
On a side note, in order to address any blocking issues, you could use SQL Server Row Versioning technology.
http://msdn.microsoft.com/en-us/library/ms345124(SQL.90).aspx

We do this on our high-usage systems and haven't had any problems. However, as with all things database, the only way to be sure it would help would be to make the change in dev and then load test it. Not knowing what else your SSIS package does, it may still cause blocks.

If the table is not very large you could cache the data in your application for a short time. It may not eliminate blocking altogether, but it would reduce the chances that the table would be queried when an update occurs.

Perhaps it would make sense to do some analysis of the processes which are blocking, since they seem to be the part of your landscape which has changed. It only takes one poorly written query to create the blocks which you are seeing. Barring a poorly written query, maybe the table needs one or more covering indexes to speed up those queries and get you back on your way without having to re-engineer your already working code.
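For example, a hypothetical covering index (the key and included columns should match whatever the blocking procedures actually filter on and select):
CREATE NONCLUSTERED INDEX IX_Products_Availability
ON dbo.Products (ProductId)
INCLUDE (QuantityOnHand);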
Hope this helps,
Bill

Related

SQL: Insert Into Table 1 From Table 2 then Update Table 2 - Performance Increase

I am working to increase the speed and performance of a database process that I have inherited. The basic step prior to this process is that a utility uploads a million or more records into an Upload Table. That part is pretty quick, but things start to slow down once we start adding/updating/moving items from the Upload Table into other tables in the database.
I have read a few articles stating that using IF NOT EXISTS may be quicker than SELECT DISTINCT, so I was thinking about refactoring the following code to do so, but I was also wondering whether there is a way to combine these two queries in order to increase the speed.
The Upload Table contains many columns; I am just showing the Product portion, but there is also a set of Store columns (the same kind of set as the Product columns) and many other details that are not a one-to-one relationship between tables.
The first query inserts the product into the Product table if it does not already exist, then the next step updates the Upload Table with Product IDs for all the records in the Upload Table.
INSERT INTO Product (ProductCode, ProductDescription, ProductCodeQualifier, UnitOfMeasure)
SELECT DISTINCT
    ut.ProductCode, ut.ProductDescription, ut.ProductCodeQualifier, ut.UnitOfMeasure
FROM Upload_Table ut
LEFT JOIN Product p
    ON  (ut.ProductCode = p.ProductCode)
    AND (ut.ProductDescription = p.ProductDescription)
    AND (ut.ProductCodeQualifier = p.ProductCodeQualifier)
    AND (ut.UnitOfMeasure = p.UnitOfMeasure)
WHERE p.Id IS NULL
  AND ut.UploadId = 123456;
UPDATE Upload_Table
SET ProductId = Product.Id
FROM Upload_Table
INNER JOIN Product ON Upload_Table.ProductCode = Product.ProductCode
AND Upload_Table.ProductDescription = Product.ProductDescription
AND Upload_Table.ProductCodeQualifier = Product.ProductCodeQualifier
AND Upload_Table.UnitOfMeasure = Product.UnitOfMeasure
WHERE (Upload_Table.UploadId = 123456)
Any help or suggestions would be greatly appreciated. I am decent with my understanding of SQL but I am not an expert.
Thanks!
I currently have not tried to make any changes for this part, as I am trying to find the best approach for speed increases and a better understanding of how this process can be improved.
Recommendations:
You can disable triggers, foreign keys, constraints, and indexes before inserting and updating, and enable them all again afterwards, because indexes, triggers, and foreign keys hurt insert/update performance considerably.
It is not recommended to use auto-commit transaction mode during the update or insert process; in auto-commit mode a transaction is committed after every inserted record, which performs very badly. For better performance, commit only after every 1,000 (or 10,000) records, as in the sketch after these recommendations.
If you can, run this process (insert or update) periodically multiple times a day, or even drive it with triggers. I don't know your business logic, so this variant may not suit you.
And don't forget to analyze the execution plans for your queries.
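For example, a rough sketch of committing the Product inserts in batches of 10,000 (the batch size is an assumption; table and column names are taken from the question):
DECLARE @BatchSize int = 10000;
DECLARE @Inserted  int = @BatchSize;

WHILE @Inserted = @BatchSize
BEGIN
    BEGIN TRAN;

    INSERT INTO Product (ProductCode, ProductDescription, ProductCodeQualifier, UnitOfMeasure)
    SELECT DISTINCT TOP (@BatchSize)
        ut.ProductCode, ut.ProductDescription, ut.ProductCodeQualifier, ut.UnitOfMeasure
    FROM Upload_Table ut
    LEFT JOIN Product p
        ON  ut.ProductCode = p.ProductCode
        AND ut.ProductDescription = p.ProductDescription
        AND ut.ProductCodeQualifier = p.ProductCodeQualifier
        AND ut.UnitOfMeasure = p.UnitOfMeasure
    WHERE p.Id IS NULL
      AND ut.UploadId = 123456;

    SET @Inserted = @@ROWCOUNT;   -- rows already inserted no longer match p.Id IS NULL next time round

    COMMIT TRAN;
END;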

How to make 2 remote tables in sync along with determination of Delta records

We have 10 tables on the vendor system and the same 10 tables on our DB side, along with 10 _HISTORIC tables (one per table) to capture updated/new records.
We are reading the main tables from the vendor system using Informatica to truncate and load into our tables. How do we find delta records without using triggers or CDC, as those come with a cost on the vendor system?
Four of the tables have 200 columns and around 31K records each, with the expectation that 100-500 records might change daily.
We are using a left join in Informatica to load new records into our main and _HISTORIC tables.
But what's an efficient approach to find the updated records of a vendor table and load them into our _HISTORIC table?
For new records we use this query:
-- NEW RECORDS
INSERT INTO TABLEA_HISTORIC
SELECT A.*
FROM TABLEA A
LEFT JOIN TABLEB B
    ON A.PK = B.PK
WHERE B.PK IS NULL
I believe a system-versioned temporal table is what you are looking for here. You can create a system-versioned table for any table in SQL Server 2016 or later.
For example, say I have a table Employee:
CREATE TABLE Employee
(
    EmployeeId VARCHAR(20) PRIMARY KEY,
    EmployeeName VARCHAR(255) NOT NULL,
    EmployeeDOJ DATE,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL, -- automatically set by the system when the row is inserted/updated
    ValidTo datetime2 GENERATED ALWAYS AS ROW END NOT NULL,     -- auto-updated
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)                 -- defines the row validity period
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory)); -- history table name is illustrative
The columns ValidFrom and ValidTo determine the time period during which that particular row was active.
For more info, refer to the Microsoft article:
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
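For example, a sketch of pulling the row versions that were active during a window (the dates are illustrative):
SELECT EmployeeId, EmployeeName, EmployeeDOJ, ValidFrom, ValidTo
FROM Employee
    FOR SYSTEM_TIME BETWEEN '2023-01-01' AND '2023-01-02'
ORDER BY EmployeeId, ValidFrom;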
Create staging tables and wipe & load them. Next, use them for finding the differences that need to be loaded into your target tables (see the sketch below).
The CDC logic needs to be performed this way, but it will not affect your source system.
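For example, a minimal sketch of the comparison, assuming a wiped-and-reloaded staging copy named TABLEA_STAGE (illustrative) with the same columns as TABLEA:
-- Rows that are new or changed in the source (present in staging, not matching the target)
SELECT * FROM TABLEA_STAGE
EXCEPT
SELECT * FROM TABLEA;

-- Rows whose target version no longer exists in staging (deleted or superseded in the source)
SELECT * FROM TABLEA
EXCEPT
SELECT * FROM TABLEA_STAGE;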
Another way - not sure if possible in your case - is to load partial data based on some source system date or key. This way you stage only the new data. This improves performance a lot, but makes finding deletes in the source impossible.
A. To replicate a smaller subset of records in the source without making schema changes, there are a few options.
Transactional replication; however, this is not very flexible. For example, it would not allow any differences in the target database, and therefore is not a solution for you.
Identify a "date modified" field in the source. This obviously has to already exist, and will not allow you to identify deletes.
Use a "windowing approach" where you simply delete and reload the last month's transactions, again based on an existing date. This requires an existing date that isn't back-dated, and doesn't work for non-transactional tables (which are usually small enough to just do full copies anyway).
Turn on change tracking. Your vendor may or may not argue that this is a costly change (it isn't) or that it impacts application performance (it probably doesn't).
https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver15
Turning on change tracking will allow you to more easily identify changes to all tables.
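A minimal sketch, assuming the source database is named VendorDB and the table's primary key column is PK (both illustrative):
-- Enable change tracking at the database level (retention values are assumptions)
ALTER DATABASE VendorDB
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

-- Enable it per table
ALTER TABLE dbo.TABLEA
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);

-- Pull inserts/updates/deletes since the last sync version you persisted
DECLARE @last_sync_version bigint = 0;   -- store this between runs

SELECT ct.PK, ct.SYS_CHANGE_OPERATION    -- 'I', 'U' or 'D'
FROM CHANGETABLE(CHANGES dbo.TABLEA, @last_sync_version) AS ct;

-- Remember the current version for the next incremental load
SELECT CHANGE_TRACKING_CURRENT_VERSION();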
You need to ask yourself: is it really an issue to copy the entire table? I have built solutions that simply copy entire large tables (far larger than 31K records) every hour, and there is never an issue.
You need to consider what complications you introduce by building an incremental solution, and whether the associated maintenance and complexity are worth being able to reduce a record copy from 31K (full table) to 500 records (changed). Again, a full copy of 31K records is actually pretty fast under normal circumstances (10 seconds or so).
B. Target table
As already recommended by many, you might want to consider a temporal table, although if you do decide to do full copies, a temporal table might not be the best option.

How to establish read-only-once implement within SAP HANA?

Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.
High-level pseudo-code:
Collect request via db proc (query)
Call API with request
Store results of the request (response)
I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).
Perhaps this is all solved if I just add a column to table A to prevent concurrent processors from selecting the same records from A?
I am wondering how to do this without adding the column to source table A.
What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or at least I haven't implemented it such that rows are processed only once by any of the processors.
I have a stored proc to handle batch selection:
/*
* getBatch.sql
*
* SYNOPSIS: Retrieve the next set of criteria to be used in a search
* request. Use left outer join between input source table
* and results table to determine the next set of inputs, and
* provide support so that concurrent processes may call this
* proc and get their inputs exclusively.
*/
alter procedure "ACOX"."getBatch" (
in in_limit int
,in in_run_group_id varchar(36)
,out ot_result table (
id bigint
,runGroupId varchar(36)
,sourceTableRefId integer
,name nvarchar(22)
,location nvarchar(13)
,regionCode nvarchar(3)
,countryCode nvarchar(3)
)
) language sqlscript sql security definer as
begin
-- insert new records:
insert into "ACOX"."search_result_v4" (
"RUN_GROUP_ID"
,"BEGIN_DATE_TS"
,"SOURCE_TABLE"
,"SOURCE_TABLE_REFID"
)
select
in_run_group_id as "RUN_GROUP_ID"
,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
,'acox.searchCriteria' as "SOURCE_TABLE"
,fp.descriptor_id as "SOURCE_TABLE_REFID"
from
acox.searchCriteria fp
left join "ACOX"."us_state_codes" st
on trim(fp.region) = trim(st.usps)
left outer join "ACOX"."search_result_v4" r
on fp.descriptor_id = r.source_table_refid
where
st.usps is not null
and r.BEGIN_DATE_TS is null
limit :in_limit;
-- select records inserted for return:
ot_result =
select
r.ID id
,r.RUN_GROUP_ID runGroupId
,fp.descriptor_id sourceTableRefId
,fp.merch_name name
,fp.Location location
,st.usps regionCode
,'USA' countryCode
from
acox.searchCriteria fp
left join "ACOX"."us_state_codes" st
on trim(fp.region) = trim(st.usps)
inner join "ACOX"."search_result_v4" r
on fp.descriptor_id = r.source_table_refid
and r.COMPLETE_DATE_TS is null
and r.RUN_GROUP_ID = in_run_group_id
where
st.usps is not null
limit :in_limit;
end;
When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.
Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly, but it includes an update statement prior to the insert.
Useful references:
SAP HANA Concurrency Control
Exactly-Once Semantics Are Possible: Here’s How Kafka Does It
First off: there is no "read-only-once" in any RDBMS, including MS SQL.
Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (that's effectively what a queue does, or the well-known special-case of a queue: the pipe)
I assume that that is not what you are looking for.
Instead, I believe you want to implement a processing-semantic analogous to "once-and-only-once" aka "exactly-once" message delivery. While this is impossible to achieve in potentially partitioned networks it is possible within the transaction context of databases.
This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).
Sorry for the long pre-text, but any solution for this will depend on being clear on what we want to actually achieve. I will get to an approach for that now.
The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup).
The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.
Both of these potential issues can be addressed with the following approach:
SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .
Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...
As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented, does not really matter.
One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs).
In whatever way this is implemented, this is the point in time, where this processed status needs to be captured.
COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.
As you see, Step 1 takes care of the issue that records may be missed by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process).
Step 3 takes care of the issue of records potentially being processed more than once by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB-transaction context, allowing for guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
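A minimal sketch of the locking pattern, reusing the names from the procedure above; the processed flag column is an assumption, not part of the original schema:
-- Step 1: take exclusive row locks on a batch of unprocessed criteria rows;
-- other sessions can still read them, but cannot lock them
SELECT descriptor_id
FROM acox.searchCriteria
WHERE processed = 0
LIMIT 100
FOR UPDATE;

-- Step 2: do the processing (call the API, insert into "ACOX"."search_result_v4", ...)

-- Step 3: mark exactly the rows this run processed (processed is the hypothetical flag)
UPDATE acox.searchCriteria
SET processed = 1
WHERE descriptor_id IN (SELECT source_table_refid
                        FROM "ACOX"."search_result_v4"
                        WHERE run_group_id = :in_run_group_id);

-- Step 4: COMMIT releases the locks; ROLLBACK undoes both the results and the markers
COMMIT;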
Now, why is this approach preferable to making records "un-readable"?
Because of the other processes in the system.
Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.
Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.
Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.

Truncate and insert new content into table with the least amount of interruption

Twice a day, I run a heavy query and save the results (40MBs worth of rows) to a table.
I truncate this results table before inserting the new results such that it only ever has the latest query's results in it.
The problem is that while the update to the table is being written, there is technically no data and/or the table is locked. When that is the case, anyone interacting with the site could experience an interruption. I haven't experienced this yet, but I am looking to mitigate it in the future.
What is the best way to remedy this? Is it proper to write the new results to a table named results_pending, then drop the results table and rename results_pending to results?
Two methods come to mind. One is to swap partitions for the table. To be honest, I haven't done this in SQL Server, but it should work at a low level.
I would normally have all access go through a view. Then, I would create the new day's data in a separate table -- and change the view to point to the new table. The view change is close to "atomic". Well, actually, there is a small period of time when the view might not be available.
Then, at your leisure you can drop the old version of the table.
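A rough sketch of that approach, with illustrative object names (dbo.results is the view everyone queries; the dated tables hold each run's data):
-- Build the new run's data in a brand-new table
SELECT id, payload
INTO dbo.results_20240101_am
FROM dbo.heavy_query_source;       -- stand-in for the heavy twice-daily query
GO
-- Repoint the view; ALTER VIEW must run in its own batch and is a short operation
ALTER VIEW dbo.results
AS
SELECT id, payload FROM dbo.results_20240101_am;
GO
-- Later, at your leisure:
DROP TABLE dbo.results_20231231_pm;   -- the previous run's table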
TRUNCATE is a DDL operation which causes problems like this. If you are using snapshot isolation with row versioning and want users to either see the old or new data then use a single transaction to DELETE the old records and INSERT the new data.
Another option if a lot of the data doesn't actually change is to UPDATE / INSERT / DELETE only those records that need it and leave unchanged records alone.
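A minimal sketch of the single-transaction approach, assuming snapshot isolation or READ_COMMITTED_SNAPSHOT is already enabled and using illustrative names:
-- Versioned readers keep seeing the previously committed rows until this commits
BEGIN TRAN;

DELETE FROM dbo.results;           -- fully logged and transactional, unlike TRUNCATE

INSERT INTO dbo.results (id, payload)
SELECT id, payload
FROM dbo.heavy_query_source;       -- stand-in for the heavy query

COMMIT TRAN;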

MS SQL table hints and locking, parallelism

Here's the situation:
MS SQL 2008 database with table that is updated approximately once a minute.
The table structure is similar to following:
[docID], [warehouseID], [docDate], [docNum], [partID], [partQty]
Typical working cycle:
User starts data exchange from in-house developed system:
BEGIN TRANSACTION
SELECT * FROM t1
WHERE [docDate] BETWEEN &DateStart AND &DateEnd
AND [warehouseID] IN ('w1','w2','w3')
...then the system performs rather long processing of the selected data, generates the list of [docID]s to delete from t1, and then runs:
DELETE FROM t1 WHERE [docID] IN ('d1','d2','d3',...,'dN')
COMMIT TRANSACTION
Here, the problem is that while the 1st transaction processes the selected data, another one reads it too, and then both populate the same data into the in-house system.
At first, I added the (TABLOCKX) table hint to the SELECT query, and it worked pretty well until users started to complain about the system's performance.
Then I changed hints to (ROWLOCK, XLOCK, HOLDLOCK), assuming that it would:
exclusively lock...
selected rows (instead of whole table)...
until the end of transaction
But this seems to lock the whole table anyway. I have no access to the database itself, so I can't just analyze these locks (actually, I have no idea yet how to do it, even if I had access).
What I would like to have as a result:
users are able to process data related with different warehouses and dates in parallel
as a result of 1., avoid duplication of downloaded data
Apart from locks, the other solutions I have are (although they both seem clumsy):
Implement a flag in t1, showing that the data is under processing (and then do 'SELECT ... WHERE NOT [flag]')
Divide t1 into two parts: header and details, and apply locks separately.
I believe that I might have misunderstood some concepts with regard to transaction isolation levels and/or table hints, and that there is another (better) way.
Please, advise!
You could change the workflow concept.
Instead of deleting records, update them by setting an extra field, Deprecated, from 0 to 1.
And read data not from the table itself but from a view where Deprecated = 0.
BEGIN TRANSACTION
SELECT * FROM vT1
WHERE [docDate] BETWEEN &DateStart AND &DateEnd
AND [warehouseID] IN ('w1','w2','w3')
where the vT1 view looks like this:
CREATE VIEW vT1 AS
SELECT *
FROM t1
WHERE Deprecated = 0
And the deletion will look like this:
UPDATE t1 SET Deprecated = 1 WHERE [docID] IN ('d1','d2','d3',...,'dN')
COMMIT TRANSACTION
Using such a concept you will achieve two goals:
decrease the probability of locks
get a history of movements in the warehouses