I have a question regarding how to handle SQL queries against a table while performing batch inserts into the same table.
I have an ASP.NET web application that creates lots of objects (perhaps 50,000) that are inserted in a batch fashion into a table using NHibernate. Even with the NHibernate optimizations in place this takes up to two minutes. I perform this in a database transaction with the isolation level set to read committed.
During the batch insert, clients of the web application must be able to read previously created data in this table. However, they should not be able to read uncommitted data. My problem is that if I use isolation level "read committed" on the select queries, they time out because they are waiting for the batch insert job to finish.
Is there any way to query the database in such a way so that the query runs fast and returns all of the committed rows in the table without waiting on the batch insert job to finish? I do not want to return any uncommitted data.
I have tested setting the isolation level to "snapshot" and that seems to solve my problem, but is it the best approach?
Best regards Whimsical
SNAPSHOT isolation returns data that existed prior to the beginning of the transaction, and it doesn't take a lock on the table, so it doesn't block. It also ignores other locking transactions, so in your scenario it sounds like the best fit for you. What it does mean is that, since your data is being inserted in a batch, no data from that batch will be available to the SELECT statement until the batch completes, i.e.:
Time 1: Dataset A exists in Table
Time 2: Batch starts inserting dataset B into table (but doesn't commit).
Time 3: App takes snapshot, and reads in dataset A.
Time 4: App finishes returning dataset A (and only dataset A).
Time 5: Batch finishes writing dataset B; dataset A and dataset B are both available in the table.
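As a rough sketch of that approach (the database, table, and column names here are assumed, not from the question): enable snapshot isolation once on the database, then set the isolation level in the sessions that read while the batch insert is running.

ALTER DATABASE MyAppDb SET ALLOW_SNAPSHOT_ISOLATION ON;

-- In the web application's read connection:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

SELECT Id, CreatedAt, Payload      -- columns assumed
FROM dbo.BatchTarget               -- table name assumed
WHERE CreatedAt >= '2010-01-01';

The SELECT sees only rows committed before it started and never waits on the batch insert's locks.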
Related
I have a data flow with an OLE DB source and an OLE DB destination (both are SQL Server). In the source there are two tables, A and B; A has 4M rows, B has 6M rows, and both have 30+ columns. My query selects 30 columns from A left joined to B where a.date > '2020-01-01'. It returns about 50K rows and takes 9-10 seconds. Sometimes I get the error
Transaction (Process ID 75) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Even when I ran the query directly on the source server, I could still get
Transaction (Process ID 67) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
but not as frequently as in SSIS.
Is that because they are transactional tables and users could be doing updates at the same time?
How can I avoid it? For example, in SSIS, if the query fails, can SSIS wait for 5 seconds and rerun it?
SSIS doesn't know anything about scheduling. Typically that is done through SQL Agent, in which you can specify retry-on-failure values.
The root of your question is: why am I getting these deadlocks? You are asking for data, and your request is preventing a more important query from completing. Since your query is less important, it gets snuffed out so the database as a system can remain operational.
Your question identifies that you are querying against transactional tables, and yes, the day-to-day operation of the system is likely what is killing your queries. The deadlock graph in the default extended event session would reveal precisely what happened (ask your DBAs for help).
As David Browne points out, you likely need to look at using a different isolation level to allow your read queries to operate on data while concurrent activity inserts/deletes/updates it. These tend to be decision points where the business units you are generating ETL for can provide guidance. Maybe working with "dirty" data is acceptable. If so, add SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED to your query.
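If the business signs off on dirty reads, a minimal sketch of the source query (the column and join-key names are assumed placeholders) would look like this:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT a.Col01, a.Col02, a.Col03       -- the 30 columns from A go here
FROM dbo.A AS a
LEFT JOIN dbo.B AS b
    ON b.A_Id = a.Id                   -- join key assumed
WHERE a.[date] > '2020-01-01';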
If not, then you need to look at the query plans being generated and optimize them. Maybe the left join could be reworked as an EXISTS if you're only using it to test whether a condition holds. Perhaps there's implicit conversion going on all over the place. Or statistics are out of date. Or a covering index could be created. Lots of options here, but the key takeaway is to make the query go faster so there's less resource contention.
Use one of the row-versioning isolation levels, READ_COMMITTED_SNAPSHOT or SNAPSHOT, to prevent your SSIS source query from acquiring locks on the data it reads.
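A hedged sketch of both options (the database name is assumed; enabling READ_COMMITTED_SNAPSHOT needs a moment with no other open transactions, hence WITH ROLLBACK IMMEDIATE):

-- Option 1: make versioned reads the database-wide default.
ALTER DATABASE SourceDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Option 2: opt in per query.
ALTER DATABASE SourceDb SET ALLOW_SNAPSHOT_ISOLATION ON;
-- ...and prefix the SSIS source query with:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;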
While using the Python connector for Snowflake with queries of the form
UPDATE X.TABLEY SET STATUS = %(status)s, STATUS_DETAILS = %(status_details)s WHERE ID = %(entry_id)s
I sometimes get the following message:
(snowflake.connector.errors.ProgrammingError) 000625 (57014): Statement 'X' has locked table 'XX' in transaction 1588294931722 and this lock has not yet been released.
and soon after that
Your statement X' was aborted because the number of waiters for this lock exceeds the 20 statements limit
This usually happens when multiple queries are trying to update a single table. What I don't understand is that when I look at the query history in Snowflake, it says the query finished successfully (Succeeded status), but in reality the UPDATE never happened, because the table did not change.
So according to https://community.snowflake.com/s/article/how-to-resolve-blocked-queries I used
SELECT SYSTEM$ABORT_TRANSACTION(<transaction_id>);
to release the lock, but still nothing happened, and even with the Succeeded status the query seems not to have executed at all. So my question is: how does this really work, and how can a lock be released without losing the execution of the query? (Also, what happens to the other 20+ queries that are queued because of the lock? Sometimes it seems that when the lock is released, the next one takes the lock and has to be aborted as well.)
I would appreciate it if you could help me. Thanks!
Not sure if Sergio got an answer to this. The problem in this case is not with the table. Based on my experience with Snowflake, below is my understanding.
In Snowflake, every table operation also involves a change to the metadata table that keeps track of micro-partitions and their min/max values. This metadata table supports only 20 concurrent DML statements by default. So if a table is continuously being updated and hit at the same partition, there is a chance this limit will be exceeded. In that case, we should look at redesigning the table update/insert logic. In one of our use cases, we increased the limit to 50 after speaking to the Snowflake support team.
UPDATE, DELETE, and MERGE cannot run concurrently on a single table; they will be serialized, as only one can take a lock on a table at a time. The others queue up in the "blocked" state until it is their turn to take the lock. There is a limit on the number of queries that can be waiting on a single lock.
If you see an update finish successfully but don't see the updated data in the table, then you are most likely not COMMITting your transactions. Make sure you run COMMIT after an update so that the new data is committed to the table and the lock is released.
Alternatively, you can make sure AUTOCOMMIT is enabled so that DML will commit automatically after completion. You can enable it with ALTER SESSION SET AUTOCOMMIT=TRUE; in any sessions that are going to run an UPDATE.
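For example, a minimal sketch in plain Snowflake SQL (the literal values stand in for the question's bind parameters):

-- Explicit transaction: the table lock is held until COMMIT runs.
BEGIN;
UPDATE X.TABLEY
   SET STATUS = 'DONE', STATUS_DETAILS = 'processed'
 WHERE ID = 42;
COMMIT;

-- Or let every statement commit on its own:
ALTER SESSION SET AUTOCOMMIT = TRUE;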
I have an operational table, call it Ops. The table gets queried by our customers via a web service every other second.
There are two processes that affect the table:
Deleting expired records (daily)
Inserting new records (weekly)
My goal is to reduce downtime to a minimum during these processes. I know Oracle, but this is the first time I'm using SQL Server and T-SQL. In Oracle, I would do a truncate to speed up the first process of deleting expired records and a partition exchange to insert new records.
Partition exchanges in SQL Server seem a bit harder to handle because, from what I can read, one has to create filegroups, partition schemes and partition functions (?).
What are your recommendations for reducing downtime?
A table is not offline because someone is deleting or inserting rows. The table can be read and updated concurrently.
However, under the default isolation level READ COMMITTED, readers are blocked by writers and writers are blocked by readers. This means that a SELECT statement can take longer to complete because a not-yet-committed transaction is locking some rows the SELECT statement is trying to read. The SELECT statement is blocked until the transaction completes. This can be a problem if the transaction takes a long time, since it appears as if the table were offline.
On the other hand, under READ COMMITTED SNAPSHOT and SNAPSHOT isolation levels readers don't block writers and writers don't block readers. This means that a SELECT statement can run concurrently with INSERT, UPDATE and DELETE statements without waiting to acquire locks, because under these isolation levels SELECT statements don't request locks.
The simplest thing you can do is to enable READ COMMITTED SNAPSHOT isolation level on the database. When this isolation level is enabled it becomes the default isolation level, so you don't need to change the code of your application.
ALTER DATABASE MyDataBase SET READ_COMMITTED_SNAPSHOT ON
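If the ALTER seems to hang because of open transactions, one sketch (the database name is taken from the statement above) is to force it through and then confirm the setting:

ALTER DATABASE MyDataBase SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE name = 'MyDataBase';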
If your problem is SELECTs getting blocked, you can try the NOLOCK hint, but be sure to read up on its implications. See https://www.mssqltips.com/sqlservertip/2470/understanding-the-sql-server-nolock-hint/ for details.
I have an ETL process that is building dimension tables incrementally in RedShift. It performs actions in the following order:
Begins transaction
Creates a table staging_foo like foo
Copies data from external source into staging_foo
Performs mass insert/update/delete on foo so that it matches staging_foo
Drop staging_foo
Commit transaction
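A minimal sketch of that sequence (the S3 path, IAM role, and key column are assumed placeholders, Redshift syntax):

BEGIN;

CREATE TABLE staging_foo (LIKE foo);

COPY staging_foo
FROM 's3://my-bucket/foo/'                                -- source location assumed
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'   -- credentials assumed
FORMAT AS CSV;

-- Simplified merge: delete rows that will be replaced, then insert the new versions.
DELETE FROM foo
USING staging_foo
WHERE foo.id = staging_foo.id;                            -- key column assumed

INSERT INTO foo
SELECT * FROM staging_foo;

DROP TABLE staging_foo;

COMMIT;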
Individually this process works, but in order to achieve continuous streaming refreshes of foo and redundancy in the event of failure, I have several instances of the process running at the same time. When that happens I occasionally get concurrent serialization errors. This is because both processes are replaying some of the same changes to foo from staging_foo in overlapping transactions.
What happens is that the first process creates the staging_foo table, and the second process is blocked when it attempts to create a table with the same name (this is what I want). When the first process commits its transaction (which can take several seconds) I find that the second process gets unblocked before the commit is complete. So it appears to be getting a snapshot of the foo table before the commit is in place, which causes the inserts/updates/deletes (some of which may be redundant) to fail.
I am theorizing based on the documentation http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html where it says:
Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each concurrent transaction will create a snapshot of the database at the beginning of the transaction. A database snapshot is created within a transaction on the first occurrence of most SELECT statements, DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL commands:
ALTER TABLE (to add or drop columns)
CREATE TABLE
DROP TABLE
TRUNCATE TABLE
The documentation quoted above is somewhat confusing to me because it first says a snapshot will be created at the beginning of a transaction, but subsequently says a snapshot will be created only at the first occurrence of some specific DML/DDL operations.
I do not want to do a deep copy where I replace foo instead of incrementally updating it. I have other processes that continually query this table, so there is never a time when I can replace it without interruption. Another post asks a similar question about deep copy, but that approach will not work for me: How can I ensure synchronous DDL operations on a table that is being replaced?
Is there a way for me to perform my operations in a way that I can avoid concurrent serialization errors? I need to ensure that read access is available for foo so I can't LOCK that table.
OK, Postgres (and therefore Redshift [more or less]) uses MVCC (Multi Version Concurrency Control) for transaction isolation instead of a db/table/row/page locking model (as seen in SQL Server, MySQL, etc.). Simplistically every transaction operates on the data as it existed when the transaction started.
So your comment "I have several instances of the process running at the same time" explains the problem. If Process 2 starts while Process 1 is running then Process 2 has no visibility of the results from Process 1.
Understanding that the final decision is a business decision, what are the accuracy considerations between NOLOCK and READPAST running on SQL 2008 R2? I would like to have a better understanding before discussing changes with the business area.
I have inherited a number of queries used to create data views for management reporting. WITH (NOLOCK) is used liberally but inconsistently. The data being read is from the production server of a widely used application that is constantly being updated. We are migrating from a SQL 2005 server to a SQL 2008 R2 server. These reports want data fresher than the 24-hour-old data on the archive server. The use of NOLOCK suggests a past decision: potential for conflict exists and a bit of accuracy loss is acceptable. Data is used to populate dashboards for human awareness/decision making.
All the queries are SELECTs, with read-only access for the data-view login. The majority of the queries are single-table, with a few 2- and 3-table joins. Given the low level of joins, WITH () seems a better choice than SET TRANSACTION ISOLATION LEVEL {}.
Table Hints (Transact-SQL) http://msdn.microsoft.com/en-us/library/ms187373.aspx (as well as multiple questions on SO) says that NOLOCK and/or READUNCOMMITTED are likely to have duplicate read issues, in addition to missing locked records.
READPAST looks like the more accurate option, as it will only miss locked records, without a chance of duplicates. But I am not sure the level of missed locked records is consistent between it and NOLOCK.
There is a good article by Tim Chapman comparing the two, but it was written in 2007; most of the comments revolve around 2000 and 2005, with one comment indicating READPAST is problematic in 2008 R2.
References
Effect of NOLOCK hint in SELECT statements
When should you use "with (nolock)"
Using NOLOCK and READPAST table hints in SQL Server (By Tim Chapman)
Edit:
Snapshot isolation is suggested in two answers below. Snapshot isolation depends on a database setting; this Q/A https://serverfault.com/questions/117104/how-can-i-tell-if-snapshot-isolation-is-turned-on describes how to see what settings are in place on a database. I now know it is disabled; I am reading for reports from a major application's database, and changing the setting is not an option. Plus or minus a couple of percent accuracy is acceptable; application (OLTP) impact is not. Most simple queries do not need lock considerations, but in some extreme cases lock consideration is required. With the advent of snapshot isolation in SQL 2005, little information is available on NOLOCK and READPAST behavior in SQL 2008 or higher. Yet they remain my only choices.
A better option worth consideration is enabling READ COMMITTED SNAPSHOT for the database itself. This uses row versioning in tempdb to give each statement a consistent view of the committed data as of the start of that statement.
There is a very good read on various aspects of NOLOCK, READPAST etc, at http://www.brentozar.com/archive/2013/01/implementing-snapshot-or-read-committed-snapshot-isolation-in-sql-server-a-guide/
WITH (NOLOCK) can return incorrect results if someone is updating the table while you are selecting from it. If a page split happens as a result of an insert while you are reading the table, and the new page happens to be beyond the point you've read, WITH (NOLOCK) will have already returned rows from the old page and will then return duplicate rows from the new page. This is just a single example of why (NOLOCK) is bad.
WITH (READPAST) will skip any records that are being updated or inserted while you are reading from the table. Neither option is good in a busy database.
In light of the recent edit to your question, where you state you cannot change the database setting for READ COMMITTED SNAPSHOT, perhaps you should consider using a stored procedure to gather data for your reports, and setting the transaction isolation level at the beginning of the stored procedure using SET TRANSACTION ISOLATION LEVEL SNAPSHOT;. In order to do this, you would need to change the database option 'allow snapshot isolation'.
From SQL Server Books Online:
SNAPSHOT
Specifies that data read by any statement in a transaction will be the transactionally consistent version of the data that existed at the start of the transaction. The transaction can only recognize data modifications that were committed before the start of the transaction. Data modifications made by other transactions after the start of the current transaction are not visible to statements executing in the current transaction. The effect is as if the statements in a transaction get a snapshot of the committed data as it existed at the start of the transaction.
Except when a database is being recovered, SNAPSHOT transactions do not request locks when reading data. SNAPSHOT transactions reading data do not block other transactions from writing data. Transactions writing data do not block SNAPSHOT transactions from reading data.
During the roll-back phase of a database recovery, SNAPSHOT transactions will request a lock if an attempt is made to read data that is locked by another transaction that is being rolled back. The SNAPSHOT transaction is blocked until that transaction has been rolled back. The lock is released immediately after it has been granted.
The ALLOW_SNAPSHOT_ISOLATION database option must be set to ON before you can start a transaction that uses the SNAPSHOT isolation level. If a transaction using the SNAPSHOT isolation level accesses data in multiple databases, ALLOW_SNAPSHOT_ISOLATION must be set to ON in each database.
A transaction cannot be set to SNAPSHOT isolation level that started with another isolation level; doing so will cause the transaction to abort. If a transaction starts in the SNAPSHOT isolation level, you can change it to another isolation level and then back to SNAPSHOT. A transaction starts the first time it accesses data.
A transaction running under SNAPSHOT isolation level can view changes made by that transaction. For example, if the transaction performs an UPDATE on a table and then issues a SELECT statement against the same table, the modified data will be included in the result set.
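A minimal sketch of that stored-procedure approach (the database and procedure names are assumed; the audit table name is taken from the test queries later in this thread):

ALTER DATABASE ReportSourceDb SET ALLOW_SNAPSHOT_ISOLATION ON;
GO

CREATE PROCEDURE dbo.usp_GetAuditDataForDashboard
AS
BEGIN
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

    SELECT ID, [TABLE_NAME], NUMBER, FIELD, OLD_VALUE, NEW_VALUE
    FROM dbo.UPMCINCIDENTMGMTAUDITRECORDSM1;
END
GO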
NOLOCK can cause duplicate data to be read, data to be missed, and the query to actually fail with an error message (something about "data movement").
On the other hand, a non-NOLOCK query can also read duplicate data and miss data! It is by no means a consistent snapshot of the database. The difference is that it will not read uncommitted data and will never fail.
The problem with NOLOCK is mostly that it can fail randomly, so you need retry logic. Also, the probability of wrong data being read is slightly higher.
NOLOCK has a big advantage when you're doing table scans: SQL Server can use allocation order scanning instead of index-order scans. TABLOCK has the same effect. This can be a significant speedup in the presence of fragmentation.
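For illustration, a small sketch (using the audit table from the test queries later in this thread): both hints allow an allocation-order scan, but NOLOCK also reads uncommitted rows while TABLOCK takes a shared table lock instead.

SELECT COUNT(*) FROM dbo.UPMCINCIDENTMGMTAUDITRECORDSM1 WITH (NOLOCK);
SELECT COUNT(*) FROM dbo.UPMCINCIDENTMGMTAUDITRECORDSM1 WITH (TABLOCK);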
Consider just using the snapshot isolation level as it gets rid of all of these concerns. It comes with some other trade-offs and you don't get allocation-order scans. But it permanently and comprehensively removes locking problems.
Answering my own question after stress testing with SQLQueryStress http://www.datamanipulation.net/sqlquerystress/ (a wonderful tool that is extremely easy to use). Results from SQLQueryStress were checked against SQL Server Profiler; the accuracy is the same as SQL Server Profiler, though the precision is two decimal places of a second less (which is sufficient for this test).
As mentioned in the question, the primary concern is application performance impact, with report accuracy and performance the secondary consideration. All testing occurred on the test server, where the test application is active and has some minor activity.
After downloading and becoming familiar with SQLQueryStress, I set up a simple 'ReportQuery' to act as a resource hog. It is set to run 15 iterations with 15 threads (225 total queries). Total run time is around 28 seconds, with an average iteration time of 1.49 seconds.
I created an Add/Delete 'ApplicationQuery' to represent ongoing application activity. It is set to run 2000 iterations with 1 thread. There are two versions, one with a select statement (runs in 31 seconds) and one without (runs in 28 seconds). These represent normal peak-time application activity.
10 test runs of each of the three versions of 'ReportQuery' were run, to identify whether there is any performance benefit between 'with (nolock)', 'with (readpast)', and no hints. Results indicate no significant difference: the ReportQuery consistently runs in about 28 seconds with an average 1.5-second iteration time.
There were no big outliers, so I decided to drop to 5 test runs for the following tests.
5 test runs of the ApplicationQuery with a select statement, with each of the three versions of 'ReportQuery' also running. In each of the 15 total tests, the ApplicationQuery is started manually, with the ReportQuery started manually immediately after. This scenario represents a resource-heavy report query competing with the application's ongoing activity for resources.
I repeated the test runs, but this time used the ApplicationQuery without a select statement.
Results: In every case the ApplicationQuery was throttled back to almost no forward progress, while the ReportQuery was running.
The ReportQuery had no significant loss of performance when competing for resources with multiple ApplicationQueries against the database.
The ApplicationQuery was able to run queries in parallel with the ReportQuery, but progress was very slow while competing for resources. In essence, the total time to run 2000 Application Add/Delete queries was extended by the time used by the ReportQuery.
The initial question, about which is more accurate, becomes pointless. There is essentially no report or application performance difference between using and not using the NOLOCK or READPAST hints, so don't use either in a busy database and get the highest accuracy possible.
‘ReportQuery’
select
ID
, [TABLE_NAME]
, NUMBER
, FIELD
, OLD_VALUE
, NEW_VALUE
, SYSMODUSER
, SYSMODTIME
, SYSMODCOUNT
from dbo.UPMCINCIDENTMGMTAUDITRECORDSM1
where Number like '%'
or NUMBER like '2010-01-01'
‘ApplicationQuery’ (with Select Statement)
select *
from dbo.UPMCINCIDENTMGMTAUDITRECORDSM1
where FIELD = 'JJTestingPerformance'
insert into dbo.UPMCINCIDENTMGMTAUDITRECORDSM1 (ID
, [TABLE_NAME]
, NUMBER
, FIELD
, OLD_VALUE
, NEW_VALUE
)
values ('Test+Time'
, 'none'
, 'tst01'
, 'JJTestingPerformance'
, 'No Value'
, 'Test'
)
delete from dbo.UPMCINCIDENTMGMTAUDITRECORDSM1
where FIELD = 'JJTestingPerformance'
‘ApplicationQuery’ (without Select Statement)
insert into dbo.UPMCINCIDENTMGMTAUDITRECORDSM1 (ID
, [TABLE_NAME]
, NUMBER
, FIELD
, OLD_VALUE
, NEW_VALUE
)
values ('Test+Time'
, 'none'
, 'tst01'
, 'JJTestingPerformance'
, 'No Value'
, 'Test'
)
delete from dbo.UPMCINCIDENTMGMTAUDITRECORDSM1
where FIELD = 'JJTestingPerformance'