SQL Server - Dirty Reads Pros & Cons

Why should I or shouldn't I use dirty reads:
set transaction isolation level read uncommitted
in SQL Server?

From MSDN:
When this option is set, it is possible to read uncommitted or dirty data; values in the data can be changed and rows can appear or disappear in the data set before the end of the transaction.
Simply put, when you are using this isolation level, and you are performing multiple queries on an active table as part of one transaction, there is no guarantee that the information returned to you within different parts of the transaction will remain the same. You could query the same data twice within one transaction and get different results (this might happen in the case where a different user was updating the same data in the midst of your transaction). This can obviously have severe ramifications for parts of your application that rely on data integrity.
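To make the effect concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not SQL Server code: it uses SQLite's shared-cache mode with PRAGMA read_uncommitted as a stand-in because that runs without a server, and the table name games and the function dirty_read_demo are made up for the example. SQL Server's locking internals differ. The point is only that the same query gives different answers before, during and after another connection's transaction that is later rolled back.

```python
import sqlite3

# Assumption: SQLite shared-cache mode stands in for SQL Server here.
# Both connections live in one process and share one in-memory database.
URI = "file:dirty_read_demo?mode=memory&cache=shared"


def dirty_read_demo():
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so a transaction starts only where we issue BEGIN ourselves.
    writer = sqlite3.connect(URI, uri=True, isolation_level=None)
    reader = sqlite3.connect(URI, uri=True, isolation_level=None)
    try:
        writer.execute("CREATE TABLE games (id INTEGER PRIMARY KEY, score INTEGER)")
        writer.execute("INSERT INTO games (score) VALUES (10)")

        # SQLite's counterpart of READ UNCOMMITTED for shared-cache connections.
        reader.execute("PRAGMA read_uncommitted = 1")

        def total():
            # fetchall() runs the statement to completion before we continue.
            rows = reader.execute(
                "SELECT COALESCE(SUM(score), 0) FROM games"
            ).fetchall()
            return rows[0][0]

        before = total()
        writer.execute("BEGIN")
        writer.execute("INSERT INTO games (score) VALUES (90)")  # not committed
        during = total()  # dirty read: includes the uncommitted row
        writer.execute("ROLLBACK")
        after = total()  # the row counted a moment ago never officially existed
        return before, during, after
    finally:
        reader.close()
        writer.close()
```

On a SQLite build with shared-cache support, the middle value includes the 90 that is rolled back a moment later, so the three totals come out as 10, 100 and 10.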

Generally, use it when you need to run sizeable (or frequent) queries against busy tables, where read committed might be blocked by locks from uncommitted transactions, but ONLY when you can live with inaccurate data.
As an example, on a gaming web site I worked on recently there was a summary display of some stats about recent games. It was all based on dirty reads: it was more important for us to include than to exclude the transactional data not yet committed (we knew anyway that few, if any, transactions would be backed out), and we felt that on average the data would be more accurate that way.

Use it if you want the data back right away and it is not that important that it is right.
Do not use it if the data must be correct or if you are doing updates based on it.
Also take a look at snapshot isolation, which was introduced in SQL Server 2005.

If you want to read data before it has been committed, you can do so with set transaction isolation level read uncommitted; the data you read may or may not change afterwards.
You can also read uncommitted data from a single table with a table hint:
Select * from table_name with(nolock)
The nolock hint gives that one table the same behaviour as the read uncommitted isolation level.

Related

SQL: Are dirty reads guaranteed to see very latest data?

If I open a transaction in READ UNCOMMITTED isolation level, am I guaranteed to see the latest data on every table/row? I.e. as soon as some other transaction updates a row, will my transaction see that change? (This would be analogous to a write-through to main memory.)
Could it even be that my SELECT will get a row containing part of an UPDATE, but not all of it? What would in this case be the smallest element that is atomically updated/read?
Are there differences in the various relational database systems?

No. "Dirty data" means that you are relying on the internals of the database, so there are no guarantees. Data could be written to the data page and then removed due to a transaction rollback. Data could be written to the data page -- and then a later step in the same transaction could overwrite it.
In addition, what you are asking for is not possible. Your query could be scanning an entire table. Your reads are occurring at the page level. Each page could be a different amalgamation of transactions, with no consistency.
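A toy model in Python of why a page-by-page scan without locks has no consistent view. Nothing here is database code: pages are plain lists and scan_with_concurrent_move is a hypothetical helper invented for the illustration. While the scan is under way, a concurrent writer relocates one row. If the row moves from a page already read to a page not yet read, the scan returns it twice; if it moves the other way, the scan never sees it.

```python
def scan_with_concurrent_move(pages, row, src, dst, move_after_page):
    """Scan pages in order; right after page `move_after_page` has been read,
    a concurrent writer moves `row` from page `src` to page `dst`."""
    seen = []
    for page_no in range(len(pages)):
        seen.extend(pages[page_no])  # read the page as it is right now
        if page_no == move_after_page:
            pages[src].remove(row)  # the concurrent writer relocates the row
            pages[dst].append(row)
    return seen


# Row "B" moves forward, from page 0 (already read) to page 2 (not yet read).
duplicated = scan_with_concurrent_move([["A", "B"], ["C"], ["D"]], "B", 0, 2, 0)

# Row "D" moves backward, from page 2 (not yet read) to page 0 (already read).
missed = scan_with_concurrent_move([["A", "B"], ["C"], ["D"]], "D", 2, 0, 0)
```

Here duplicated contains "B" twice and missed does not contain "D" at all, even though every row existed exactly once at every moment.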

Could a query with READ UNCOMMITTED isolation level cause locks on the tables it accesses?

My app needs to batch process 10M rows, the result of a complex SQL query that joins tables.
I plan to iterate over the result set, reading a hundred rows per iteration.
To run this on a busy OLTP production DB and avoid locks, I figured I'd query with a READ UNCOMMITTED isolation level.
Would that keep the query out of the way of any DB writes, avoiding any row/table locks?
My main concern is my query blocking any other DB activity, I'm far less concerned with the other way around.
Side Notes:
1. I'll be reading historical data, so I'm unlikely to meet uncommitted data. It's OK if I do.
2. The iteration process could take hours. The DB connection would remain open through this process.
3. I'll have two such concurrent batch instances at most.
4. I can tolerate dup rows (a by-product of read uncommitted).
5. DB2 is the target DB, but I want a solution that fits other DB vendors as well.
6. Will snapshot isolation level help me clear out server memory?

Have you actually encountered any real locks on read?
As far as I'm concerned, the only reason that READ UNCOMMITTED exists in the SQL standard is to allow non-locking reads. I don't know DB2, but I would blindly bet that it does not lock data during reads in READ UNCOMMITTED mode. Most modern RDBMS systems, however, don't lock data at all during reads (*). So READ UNCOMMITTED is either not available (in Oracle, for example) or is silently promoted to READ COMMITTED (PostgreSQL).
If you can freely choose the engine, either check DB2 transaction isolation level handling or go for Oracle/PostgreSQL/other.
(*) More precisely, they don't exclusively lock the data. Some shared locks can be placed on queried tables so no DDL alters them during read.

My answer applies to SQL Server.
Read committed releases each lock right after the row is read (approximately). Locking is probably not your problem.
I recommend you use the safer READ COMMITTED. Better yet, use snapshot isolation. That removes many locking problems. There are disadvantages as well, so you had better read a little about it.
"My main concern is my query blocking any other DB activity"
Snapshot isolation makes all locking concerns go away for read-only transactions. No blocking either way, full data consistency. Be aware that long-running transactions can cause TempDB to fill with snapshot versions.
"The DB connection would remain open through this process."
That's a problem because a network hiccup, app deployment or mirroring failover would kill your batch process.
Be aware that read uncommitted can cause queries to sporadically fail outright. You need retry logic, or you have to tolerate failed jobs.
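One way to act on the last two points is sketched below in Python, using sqlite3 only so the example is self-contained. It is an illustration under assumptions, not DB2 or SQL Server code: the table history, the helper names, and the choice of sqlite3.OperationalError as the "transient" error are all made up for the sketch. Each batch is a short query keyed on the last id seen (keyset pagination, a technique swapped in here rather than taken from the answer), so a failed batch can be retried on a fresh connection instead of depending on one connection staying open for hours.

```python
import sqlite3
import time


def fetch_batch(conn, last_id, batch_size):
    # Short, independent query: rows strictly after the last key we processed.
    return conn.execute(
        "SELECT id, payload FROM history WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()


def process_all(connect, handle_row, batch_size=100, max_retries=3, delay=0.0):
    """Process every row in key order, reconnecting and retrying failed batches.

    `connect` is a zero-argument factory returning a new connection.
    Returns the last id that was processed.
    """
    last_id = 0  # assumption: ids are positive integers
    conn = connect()
    try:
        while True:
            for attempt in range(1, max_retries + 1):
                try:
                    rows = fetch_batch(conn, last_id, batch_size)
                    break
                except sqlite3.OperationalError:
                    if attempt == max_retries:
                        raise  # give up: the job fails and can be restarted
                    time.sleep(delay)
                    conn.close()
                    conn = connect()  # retry the same batch on a new connection
            if not rows:
                return last_id
            for row in rows:
                handle_row(row)
            last_id = rows[-1][0]  # resume point for the next batch
    finally:
        conn.close()
```

Because progress is just the last id handled, persisting that value somewhere would also let the job resume after a crash. A retried batch may hand a row to handle_row more than once only if handle_row itself failed midway, so the handler should tolerate repeats.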

In SQL Server, a query under the read uncommitted transaction isolation level takes no shared (read) locks on the table.

read-committed for rows whose keys are in a list (is the read atomic?)

I happen to be using innodb, read-committed.
My simple question is this relative to a transaction:
I have a table (TreeNodeId) which holds a set of 4 different nodekeys, that represent all extant nodes in my system that relate to available paths to webpages. Each key represents an item in the database, and each row in the table represents various combinations in which items are used.
At the beginning of a transaction, based on the items being changed, I make a single query for all rows in TreeNodeId that reflect some extant combination of my one or 2 items.
Will this single query be internally consistent, even if it fetches 10,000 rows? Is it possible for the db query to get the first 100 rows, and then for some other simultaneous transaction to commit new or deleted rows that would cause the remaining results to be inconsistent?
Andy

If your isolation level is read 'committed', it will only return results that have been 'committed' according to the transaction log. So if you start a query in isolation level 'committed', the SQL transaction log will only give you transactions that had been posted to its log as committed at that point in time. If someone posts records in the middle of the select, they will be seen as 'uncommitted' until that operation ends and they become 'committed'. However, even if you change the level to 'uncommitted', you should not get data as it is in midstream; you should get what is available to the engine at the moment you began your operation, according to the transaction logs.
Committed versus uncommitted determines which of the records present at the moment of the select you get back. Say I had 3,000,000 records and another 200,000 being inserted and committed one at a time, of which only 100,000 had committed while the other 100,000 were known to the logs but not committed yet.
Committed would give me 3,100,000 and uncommitted would give 3,200,000. However, there are schools of thought, and I just got into a discussion yesterday with someone on this. Uncommitted will give you the uncommitted results, known as 'dirty reads', in that you are reading logs that are not set yet (you rebel). You are saying: "Hey database, I don't care whether what you have incoming is finalized, I WANT IT NOW." When you say committed you are saying: "Database, I only want qualified data; if something is not finalized I don't want it."
Advantages with each:
With uncommitted you will not LOCK anything. You are basically saying to the system: "Don't lock anything out, just let me go through the system freely getting what is there, and I don't care if you change something. I want it as of the moment of operation." If something is trying to insert or update when you perform this, your read WILL NOT LOCK IT.
Committed will not lock anything except what is in the process of being committed, until your operation has completed. You are safe in knowing the data returned is finalized, but you run the risk of BLOCKING transactions trying to insert or post. You are essentially telling the database: "Wait for me to finish before continuing your commits on tables I am accessing. I need my data accurate, so hands off till I am done." This will potentially lock data while it is performing the reads on a table you are gathering from. This is not that common, as most selects are near instant, but on huge transactional systems posting thousands of records a second it is a BIG CONCERN.
Honestly, in my discussion I favored uncommitted and the other person favored committed. I argued it is far more acceptable to get dirty data than to stop production inserts. They argued that phantom reads and other anomalies were worse. This is an opinion, and SQL systems are designed around inserts and selects, but seldom can you do both exceptionally fast without taking a little away from the other. My answer, if you want accurate reporting, is to do nightly backups, SSIS packages, binary collections, or something similar in an isolation level such as snapshot or committed, and put that data somewhere. Let that data be stored in a way that we know it is finalized and locked so it cannot be changed later, and report off of that. Don't report off of hot production data, and make it a point to tell everyone the same. It is bad practice in and of itself to tell people to report off of live data that is receiving inserts and updates in real time.
Will it hurt if you are a small mom-and-pop store with only 5 or 10 people using the database? Probably not. Will it hurt if you are a little bigger and have 50 people accessing the same database, but it is about 100 gigs and semi-transactional in that you get trickles of data during the day? Still probably not. Will it hurt if you have 200 people and multiple servers and databases and a main transactional database brain storing the composite of all the data? ABSOLUTELY. Don't read from a main production database with intense operations if its main purpose is to get data to store.
EDIT to further point from real world example:
That is why, at the top of most operations where I am not using table variables (declare @Table table), I usually set this: "set transaction isolation level read uncommitted". Will I be using this intensely every time I query? LOL, I hope not. In fact, full disclosure, it may NEVER EVER help me from this point on, because I isolate my data a lot with temp tables for huge transaction reporting. But I will not be getting yelled at by others because I have a long-running transaction blocking their inserts. You will also see a lot of people do this: "select * from table (nolock)". I generally give code like this to less experienced query designers, as it embeds the nolock hint in the query. If I tell everyone to do this they will make it policy.
You do not have to do this, and in fact some people may follow me and claim this is wrong and post their side. I do it MOSTLY FOR PRODUCTION PROTECTION, and anyone who tells me that is wrong, I would like to hear why they would rather lock tables and report off of them in production than get their data in or updated in real time first. I would have a hard time going to a manager and saying: "You know that huge account you were waiting to post 2 million records on, and wanted to know the instant it was done? Well, John down the hall really wanted to run this query that takes an hour to run because it was sloppily designed. He chose to use committed and is hitting some of the tables doing inserts, so we are getting occasional locks. Well, I think it is more important he get his report than we get business." I wonder what the manager would tell me back?

Serializable isolation level atomicity

I have several threads executing some SQL select queries with serializable isolation level. I am not sure which implementation to choose. This:
_repository.Select(...)
or this
lock (_lockObject)
{
    _repository.Select(...);
}
In other words, is it possible that several transactions will start executing at the same time and partially block records inside the Select operation's range?
P. S. I am using MySQL but I guess it is a more general question.

Transactions performing SELECT queries at the serializable level place a shared lock on the rows, permitting other transactions to read those rows but preventing them from making changes to the rows (including inserting new records into the gaps).
Locking in the application does something else: it does not allow other threads to enter the code block that fetches the data from the repository. This approach can lead to very bad performance for a few reasons:
1. If any of the rows are locked by another transaction (outside the application) via an exclusive lock, the lock in the application will not help.
2. Multiple transactions will not be able to perform reads even on rows that are not locked in exclusive mode (not being updated).
3. The lock will not be released until all the data is fetched and returned to the client. This includes the network latency and any other overhead of converting the MySQL result set to a code object.
Most importantly, enforcing data integrity and atomicity is the database's job. It knows how to handle it very well: how to detect potential deadlocks, when to perform record locks, and when to add index gap locks. It is what databases are for, and MySQL is ACID compliant and is proven to handle these situations.
I suggest you read through Section 13.2.8, The InnoDB Transaction Model and Locking, of the MySQL docs; it will give you great insight into how locking in InnoDB is performed.

Default SQL Server IsolationLevel Changes

We have a customer that's been experiencing some blocking issues with our database application. We asked them to run a Blocked Process Report trace, and the trace they gave us shows blocking occurring between a SELECT and an UPDATE operation. The trace files show the following:
The same SELECT query is being executed at different isolation levels. One trace shows a Serializable IsolationLevel while a later trace shows a RepeatableRead IsolationLevel. We do not use an explicit transaction while executing the query.
The UPDATE query is being executed with a RepeatableRead isolation level but is being blocked by the SELECT query. This is expected as our updates are wrapped in an explicit transaction with IsolationLevel of RepeatableRead.
So basically we're at a loss as to why the Isolation Level of the SELECT query would not be the default ReadCommitted IsolationLevel but, even more confusingly, why the IsolationLevel of the query would change over time? It is only one customer that is seeing this behaviour so we suspect it may be a database configuration issue.
Any ideas?
Thanks in advance,
Graham

In your scenario, I would recommend explicitly setting the isolation level to snapshot. That will prevent reads from getting in the way of writes (inserts and updates) by avoiding locks, yet those reads would still be "good" reads (i.e. not dirty data; it is not the same as NOLOCK).
Generally I find that where I have locking issues with my queries, I manually control the locks applied. E.g. I would do updates with row-level locks to avoid page/table-level locking, and set my reads to readpast (accepting that I may miss some data; in some scenarios that might be OK).
EDIT-- Combining all the comments into the answer
As part of the optimisation process, SQL Server avoids getting committed reads on a page that it knows hasn't changed, and automatically falls back to a lesser locking strategy. In your case, SQL Server drops from a serializable read to a repeatable read.
Q: Thanks for that useful info regarding dropping Isolation Levels. Can you think of any reason that it would use Serializable IsolationLevel in the first place, given that we don't use an explicit transaction for the SELECT - it was our understanding that the implicit transaction would use ReadCommitted?
A: By default, SQL Server will use Read Committed if that is your default isolation level, BUT if you do not additionally specify a locking strategy in your query, you are basically saying to SQL Server "do what you think is best, but my preference is Read Committed". Since SQL Server is free to choose, it does so in order to optimise the query. (The optimisation algorithm in SQL Server is very complex and I do not fully understand it myself.) Not explicitly executing within a transaction does not, afaik, affect the isolation level that SQL Server uses.
Q: One last thing, does it seem reasonable that SQL Server would increase the Isolation Level (and presumably the number of locks required) to optimise the query? I'm also wondering whether the reuse of a pooled connection would affect this if it inherited the last used Isolation Level?
A: SQL Server will do that as part of a process called "Lock Escalation". From http://support.microsoft.com/kb/323630, I quote: "Microsoft SQL Server dynamically determines when to perform lock escalation. When making this decision, SQL Server takes into account the number of locks that are held on a particular scan, the number of locks that are held by the whole transaction, and the memory that is being used for locks in the system as a whole. Typically, SQL Server's default behavior results in lock escalation occurring only at those points where it would improve performance or when you must reduce excessive system lock memory to a more reasonable level. However, some application or query designs may trigger lock escalation at a time when it is not desirable, and the escalated table lock may block other users".
Although lock escalation is not exactly the same thing as changing the isolation level a query runs under, this surprises me because I would not have expected SQL Server to take more locks than the default isolation level permits.

More info regarding why SQL would take more locks by escalating: this is incorrect. Escalating reduces (not increases) the number of locks required. A table lock is a single lock vs. all the page or row locks required to do the same from a lower level. Lock escalation is always done for one reason: it's more efficient to take a higher-level lock than to lock all the lower-level objects.
For example, perhaps there is no index available to lock efficiently against. I.e. if you take a count with UPDLOCK on all records with a year of 2010 in a field, and there is no index on that date field, this will require a row lock on each record in 2010, which is not efficient if many records are hit, and a page lock will not help either since they are presumably distributed randomly across pages, therefore SQL takes a table lock. Moreover, SQL MUST also lock other records from changing to being in the year 2010 while the UPDLOCK is held, and with no index on this field to do a range lock, SQL has NO CHOICE but to take a table lock to prevent this from happening. This latter point is one often missed by those new to optimization: the realization that SQL must also "protect" the integrity of the queries already executed in the transaction.
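The arithmetic behind "escalating reduces the number of locks" can be sketched with a toy model in Python. The function locks_held and its escalation_threshold parameter are hypothetical; the default of 5,000 reflects the documented approximate point at which SQL Server attempts escalation for a single statement on one table, but the real decision also weighs lock memory, as the KB quote above says.

```python
def locks_held(rows_touched, escalation_threshold=5000):
    """Toy model: one lock per row until the threshold, then one table lock."""
    if rows_touched >= escalation_threshold:
        return 1  # a single table-level lock replaces all the row locks
    return rows_touched


small_scan = locks_held(200)        # stays at row granularity: 200 locks
large_scan = locks_held(2_000_000)  # escalated: one table lock
```

The cost of that single lock is concurrency rather than memory: it covers every row in the table, which is why an escalated lock can block other users.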
