read-committed for rows whose keys are in a list (is the read atomic?)

read-committed for rows whose keys are in a list (is the read atomic?) - sql

I happen to be using innodb, read-committed.
My simple question is this relative to a transaction:
I have a table (TreeNodeId) which holds a set of 4 different nodekeys, that represent all extant nodes in my system that relate to available paths to webpages. Each key represents an item in the database, and each row in the table represents various combinations in which items are used.
At the beginning of a transaction, based on the items being changed, I make a single query for all rows in TreeNodeId that reflect some extant combination of my one or 2 items.
Will this single query be internally consistent, even if it fetches 10,000 rows? Is it possible for the db query set to get the first 100 rows, and then for some other simultaneous transaction to commit new or deleted rows that would cause the remaing results to be inconsistent?
Andy

If you isolation level is read 'committed' it will only return results that have been 'committed' by the transaction log. So if you start a query that is in isolation level 'committed' at that point in time the sql transaction log will only give you transactions that had posted to it's log as committed. If in the middle of the select someone posts records they will be seen as 'uncommitted' at that point in time till they end their operation and will be 'committed'. However even if you change the level to 'uncommitted' you should not get data as it is in mid stream, you should get what is available to the engine at the moment you began your operation according to the transaction logs.
Committed versus uncommitted will get records in the system at the moment of select that are there based on your select. So if I had say 3,000,000 records and 200,000 records inserting but they were committing one at a time and only 100,000 had committed and 100,000 were aware of operation in the logs but not committed yet.
Committed would give me 3,100,000 and Uncommitted would give 3,200,000. However there are schools of thought and I just got into a discussion yesterday with someone on this.... Uncommitted will give you the uncommitted results and are known as 'dirty reads' in that you are reading logs that are not set yet(you rebel). You are saying "Hey database I don't care what you got incoming that is finalized I WANT IT NOW." When you say committed you are saying: "Database I only want qualified data, if something is not finalized I don't want it."
Advantages with each:
Uncommitted you will not LOCK anything. You are basically saying to the system: "Don't lock anything out, just let me go through the system freely getting what is there and I don't care if you change something. I want it at moment of operation." If something is trying to insert or update when you perform this it WILL NOT LOCK IT.
Committed will not lock anything except that which is in process to commit till your operation has been completed. You are safe in knowing your data returned is finalized but your run the risk of BLOCKING transactions trying to insert or post. Your are essentially telling the database: "Wait for me to finish before continuing your commits on tables I am accessing. I need my data accurate so hands off till I am done". This will potentially lock data while it is performing the reads on a table you are gathering from. This is not that common as most selects are near instant but on huge systems that are transactional based on posting thousands of records a second it is a BIG CONCERN.
Honestly in my discussion I favored uncommitted and the other person favored committed. I argued it is far more acceptable to get dirty data than stop production inserts. They argued that phantom reads and other instances were worse. This is an opinion and SQL systems are designed around inserts and selects but seldom can you do both exceptionally fast without taking a little away from the other. My answer if you want accurate reporting is do nightly backups, SSIS packages, binary collections, or something similar in an isolation level such as snapshot or committed and put that data somewhere. Let that data have been set in a way that we know it is finalized and it is locked so it may not be changed later and report off of that. Don't report off of production data hot and make it a point to tell everyone to do that. That is bad practice in and of itself to tell people to report off of live data performing inserts and updates in real time.
Will it hurt if you are a small mom and pop store with only 5 or 10 people using the database, probably not. Will it hurt if you are little bigger and have 50 people accessing the same database but it is about 100 gigs and semi transactional in that you get trickle's of data during the day. Still probably not. Will it hurt if you have 200 people and multiple servers and databases and a main transactional database brain storing the composite of all the data. ABSOLUTELY, don't read from a main production database with intense operations if it's main purpose is to get data to store.
EDIT to further point from real world example:
That is why usually at the top of most operations where I am not using table variables (declare #Table table) I set this: "set transaction isolation level read uncommitted". Will I be using this intensely every time I query? LOL, I hope not. In fact Full disclosure it may NEVER EVER help me from this point on because I isolate my data a lot with temp tables for huge transaction reporting. But I will not be getting yelled at by others I have a long running transaction blocking their inserts. You will also see a lot of people do this: "select * from table (nolock)" I Generally give code like this to lesser query designers as it embeds the nolock hint with the query. If I tell everyone to do this they will make it policy.
You do not have to do this and in fact some people will maybe follow me and claim this is wrong and post their side. I do it MOSTLY FOR PRODUCTION PROTECTION and anyone that tells me that is wrong I would like to hear why they like to lock tables and report off of them in production versus getting their data in or updated in real time first. I would have a hard time going to a manager and saying: "You know that huge account you were waiting to post 2 million records on and know the instance it was done. Well John down the hall really wanted to run this query that takes an hour to run because it was sloppily designed. He chose to use committed and is hitting some of the tables doing inserts so we are getting occasional locks. Well I think it is more important he get his report than we get business." I wonder what the manager would tell me back?

Related

When a SQL DELETE query times out, what happens with the data?

If I were to run a DELETE FROM some_table, and that were to timeout, what happens to the data?
The way I see it, one of two things might happen:
The data is deleted up to the point where the query times out, so if there were 1,000,000 entries in the database and the first 500,000 were deleted, they'd stay deleted. The database now contains half as many as it did before the query was run.
The data is deleted, the query times out, the data is rolled back (I would guess from the logs made by DELETE?). The database now contains the exact same data it started with.
Both seem logical. Would one happen 100% of the time? Or is this dependent on some settings I'm unaware of? Note that I'm not asking about the viability of the DELETE, I realize that TRUNCATE would likely be opportune. This is purely out of curiosity of how timeout functions with DELETE.

The Oracle, SQL Server,MySQL, PostgreSQL databases follows ACID properties. Hence whenever delete statement shows timed-out it must get rolled back.
You can get overview of ACID from the this Link.

Removing rows without transaction logging?

We have pretty big table with hundreds of millions of rows. It takes about 5-15 minutes to run removal of rows for a specific foreign key value. For example removing 8 million rows takes 15 minutes.
The questions is that does the removal of the rows actually even free up space as the database has transaction logging on? Can I remove rows with by-passing transaction logging for that operation?

In simple terms, you can't get around the transaction logging. That's just how the database ensures consistency - if the transaction fails halfway through (or the server's power fails, for example), the database engine needs to know how to get into a consistent state again. Also, appending the things to be changed into the transaction log is much faster than actually performing a change on the data files of the DB, especially in cases like yours.
There's a few special cases where it's safe to get around those things - truncate table will remove all the rows at once, and only if the table has no foreign keys, which makes it rather trivial. You can't limit it in any way, though.
The newly free space will be reclaimed as part of the database maintenance cycle. During each database backup, the database is synchronized to have all the data written in the data files, and the transaction log is backed up and emptied in the DB itself (I'm oversimplifying, since there's a lot of possible configurations - in any case, this is something your DBA should care about).
If this is posing a problem to you, the solution wouldn't be to get around the transaction logging anyway. You probably want to ask why (and how often) you need to delete millions of rows at a time.

.NET SQLDatareader isolated read

I have a SQL Server database which stores accounts with credits (about 200.000 records), and a separate table which stores the transactions (about 20.000.000).
Whenever a transaction is added to the database the credits are updated.
What I need to do is update client programs (using a web service) to store the credits locally, and whenever new transactions are added to the server they are sent to the clients as well (using timestamps for the delta). My main problem is creating the first data set for the client. I need to supply the list of all accounts and the last timestamp on the transaction table.
This would mean I have to create this list and the last timestamp within a snapshot, because any updates during creating this list would mean a mismatch in credits total and last transaction timestamp.
I've researched the ALLOW_SNAPSHOT_ISOLATION setting and using snapshot isolation on the SqlCommand transaction, but from what I've read this will induce a significant performance penalty. Is this true, and can this problem be solved using other means ?

but from what I've read this will induce a significant performance penalty.
I don't even want to know where you read that. I'll refer you to the official document. The costs come from additional tempdb space used for row versions and from traversing old row versions. These problems do not concern you if you have a low write rate.
Snapshot isolation is a boon for solving blocking and consistency issues. It is a perfect match for your scenario.
Many SQL Server questions on Stack Overflow lead me to comment "did you investigate snapshot isolation yet?". Most underused feature.
Oracle and Postgres have it always on.

Don't jump onto SI wagon hastily. As everythng else it has it's benefits and has it's drawbacks.
As the drawbacks are concerned, for example, the application might count on blocking behaviour or/and is willing to wait for that last version of the data. You should thoroughly test the application under SI to be sure it behaves correctly. Further, an uncommitted transaction could make a mess of the version store and lead to dramatic tempdb growth, so monitoring is a must.
Also, SI might be an overkill for you, if normally you don't have blocking issues.
Instead, if what you need is a one-off or close to it, create a database snapshot of your database, create the initial list from that snapshot, and then simply drop it.

Firebird backup restore is frustrating, is there a way to avoid it?

I am using Firebird, but lately the database grows really seriously.
There is really a lot of delete statements running, as well update/inserts, and the database file size grows really fast.
After tons of deleting records the database size doesn't decrease, and even worse, i have the feeling that actually the query getting slowed down a bit.
In order to fix this a daily backup/restore process have been involved, but because of it's time to complete - i could say that it is really frustrating to use Firebird.
Any ideas on workarounds or solution on this will be welcome.
As well, I am considering switching to Interbase because I heard from a friend that it is not having this issue - it is so ?

We have a lot of huge databases on Firebird in production but never had an issue with a database growth. Yes, every time a record being deleted or updated an old version of it will be kept in the file. But sooner or later a garbage collector will sweap it away. Once both processes will balance each other the database file will grow only for the size of new data and indices.
As general precaution to prevent an enormous database growth try to make your transactions as short as possible. In our applications we use one READ ONLY transaction for reading all the data. This transaction is open through whole application life time. For every batch of insert/update/delete statements we use short separate transactions.
Slowing of database operations could be resulted from obsolete indices stats. Here you can find an example of how to recalculate statistics for all indices: http://www.firebirdfaq.org/faq167/

Check if you have unfinished transactions in your applications. If transaction is started but not committed or rolled back, database will have own revision for each transaction after the oldest active transaction.
You can check the database statistics (gstat or external tool), there's oldest transaction and the next transaction. If the difference between those numbers keeps growing, you have the stuck transaction problem.
There are also monitoring tools the check situation, one I've used is Sinatica Monitor for Firebird.
Edit: Also, database file doesn't shrink automatically ever. Parts of it get marked as unused (after sweep operation) and will be reused. http://www.firebirdfaq.org/faq41/

The space occupied by deleted records will be re-used as soon as it is garbage collected by Firebird.
If GC is not happening (transaction problems?), DB will keep growing, until GC can do its job.
Also, there is a problem when you do a massive delete in a table (ex: millions of records), the next select in that table will "trigger" the garbage collection, and the performance will drop until GC finishes. The only way to workaround this would be to do the massive deletes in a time when the server is not very used, and run a sweep after that, making sure that there are no stuck transactions.
Also, keep in mind that if you are using "standard" tables to hold temporary data (ie: info is inserted and delete several times), you can get corrupted database in some circumstances. I strongly suggest you to start using Global Temporary Tables feature.

Are single statement UPDATES atomic, regardless of the isolation level? (SQL Server 2005)

In an app, Users and Cases have a many-to-many relationship. Users pull their list of Cases often, Users can update a single case at a time (a 1-10 second operation, requiring more than one UPDATE). Under READCOMMITTED, any in-use Case would block all associated Users from pulling their list of Cases. Also, the most recent data is a hotspot for both reads and writes to the Cases table.
I think I want to employ dirty reads to keep the experience snappy. READPAST on Cases won't work for this purpose. NOLOCK will work, but I'd like to be able to show which records are dirty when they are listed.
I don't know of any native way to show which records are dirty, so I'm thinking that for each update or insert to Cases, an INUSE flag will be set. This flag must be cleared by the end of the updating transaction such that under READCOMMITTED, this flag will never appear to be set. Note that this is NOT to replace concurrency management, only to show which records are potentially dirty to the User.
My question is whether this is reliable - if we UPDATE two or more fields (INUSE plus the other fields) in a single statement, is it possible that a concurrent NOLOCK query would read some of the new values but not others? If so, is it possible to guarantee that INUSE be set first?
And if I'm thinking about this all wrong, please enlighten me. My ideal situation would be to, in a manageable way, be able to show the values as they were PRIOR to any related transaction so the data is immediately available and always consistent (but partially out-dated). But I don't think this is available - especially in the more complex actual database.
Thanks!

Restating the problem just to be sure: User A on connection A updates two columns (col1, col2) in MyTable. While this is going on, user B on connection B issues a dirty read, selecting data from that row. You are wondering if user B could get, say, the updated value in col1 AND the old/not updated value in col2. Correct?
I have to say: no way could this happen. As I understand it, updates are indeed an atomic transaction, and if you're writing data to the page (in memory), then the entire row update would have to finish on that set of bytes before anything else (another thread) could get access to them.
But I don't know for sure, and I can't imagine how to set up a test to confirm or deny this. The only answer I'd rely on would have to come from someone who actually had a hand in writing the code, or perhaps a Microsoft technician who has similar access. If you don't get any good answers here, posting the question on the appropriate MSDN forum (link) might get a good answer.

Have you considered using SNAPSHOT isolation level? When used for a query, it requires no locks whatsoever, and it gives precisely the semantics that you're asking for:
show the values as they were PRIOR to any related transaction so the data is immediately available and always consistent (but partially out-dated)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas