I have a task that takes quite a long time. So I would like to let several programs/threads/computers execute the same task to speed things up. Each task requires unique ids which are stored in a db – so I thought these ids could be obtained like this:
NHibernateSession.Current.BeginTransaction(IsolationLevel.Serializable);
list = NHibernateSession.Current.CreateCriteria<RelevantId>().SetFirstResult(0).SetMaxResults(500).List<RelevantId>();
foreach (RelevantId x in list)
{
RelevantIdsRepository.Delete(x);
}
NHibernateSession.Current.Transaction.Commit();
Unfortunately, this throws an exception after a while if several processes access the database (nr of deleted objects is not the same as batch size). Why is this? The isolation level of the db should be ok shouldn’t it? Thanks.
Best wishes,
Christian
I'm not sure that I understand what you are doing here. It looks like each process should take some ids and process them but no two processes should take the same.
It doesn't work like you implemented it. All processes are reading the same ids. After committing the transaction they disappear from the database. Until then, they are visible to everyone. Isolation level only make sure that other transactions can't read them after they got deleted. But until then, they all can read them.
It's not so easy to distribute load. You could
maintain ids in a table where each process is registering itself as the executer and commits it before starting (handling conflicts, eg. StaleObjectStateException). Make sure to clean it up even when a process crashes.
write a central service which distributes ids.
The problem that it runs slow, is possibly due to the fact that you perform multiple SQL statements in a loop.
You should see if it is not possible to delete all entities in one batch-statement.
Related
Recently we see a huge amount of errors that pertain to Serializable isolation violation on table we have some base tables that forms our core data and we extract values from these tables to run our business logic in Lambda's.
Scenario :
Lambda 1 : Runs every 15 mins and gets the latest data from the source (RDBMS) into Redshift which forms our base tables (Does a DELETE and INSERT)
Lambda 2 : Triggered after successful run of the above Lambda and this is where the business logic is written and is normal SELECT statements
Lambda 3 : Triggered every 15 mins and is also run using the base tables and has only SELECT statements
When the Lambda 1 is triggered for its next run at instances we see that it fails with the Serializable isolation violation error.
Based on most of the posts putting a LOCK on the table might solve the issue but will increase the wait time for the other queries to run longer than expected and due to the constraint of the Lambda it will timeout after 15 mins which is not ideal. And I did see posts that stated putting a LOCK didn't entirely solve it too so skeptical to use to.
So something that struck me is would creating a VIEW on top of the base table and use the view in all the SELECT statements would that help here, if someone has any insights on this would really be helpful.
So the issue is with the locks each transaction is creating and being unable to determine the correct order the locks need to be resolved in. See: https://aws.amazon.com/premiumsupport/knowledge-center/redshift-serializable-isolation/
Now your description doesn't have enough writes in flight as stated to see how you are getting this from one pass of the Lambdas runs. So either this description isn't complete (multiple Lambdas updating tables) or the issue is coming between runs of Lambdas. A possibility is that the transaction aren't being closed and the Lambda invocations are having locks that cannot be resolved. Do you have COMMITs closing transactions? More info is needed to know which.
You can inspect pg_locks between Lambdas to see what locks are left around. The XID is the transaction that has the lock. I'd guess you have many more open transactions than you expect. Are your Lambda sessions in autocommit mode? Are you updating tables from multiple sessions? Are you COMMITting your changes? Are you reusing scratch tables?
Adding an explicit LOCK to serialize things can work if you know why / what tables are causing the serialization issue. This is also not likely the best solution and a better approach will likely be apparent when the issue is understood.
Adding a view to the mix won't resolve the issue (though it might move it if it changes the timing of events).
Given an SQL table with timestamped records. Every once in a while an application App0 does something like foreach record in since(certainTimestamp) do process(record); commitOffset(record.timestamp), i.e. periodically it consumes a batch of "fresh" data, processes it sequentially and commits success after each record and then just sleeps for reasonable time (to accumulate yet another batch). That works perfect with single instance.. however how to load balance multiple ones?
In exactly the same environment App0 and App1 concurrently competite for the fresh data. The idea is that ready query executed by the App0 must not overlay with the same read query executed by the App1 - such that they never try to process the same item. In other words, I need SQL-based guarantees that concurrent read queries return different data. Is that even possible?
P.S. Postgres is preferred option.
The problem description is rather vague on what App1 should do while App0 is processing the previously selected records.
In this answer, I make the following assumptions:
all Apps somehow know what the last certainTimestamp is and it is the same for all Apps whenever they start a DB query.
while App0 is processing, say the 10 records it found when it started working, new records come in. That means, the pile of new records with respect to certainTimestamp grows.
when App1 (or any further App) starts, the should process only those new records with respect to certainTimestamp that are not yet being handled by other Apps.
yet, if on App fails/crashes, the unfinished records should be picked the next time another App runs.
This can be achieved by locking records in many SQL databases.
One way to go about this is to use
SELECT ... FOR UPDATE SKIP LOCKED
This statement, in combination with the range-selection since(certainTimestamp) selects and locks all records matching the condition and not being locked currently.
Whenever a new App instance runs this query, it only gets "what's left" to do and can work on that.
This solves the problem of "overlay" or working on the same data.
What's left is then the definition and update of the certainTimestamp.
In order to keep this answer short, I don't go into that here and just leave the pointer to the OP that this needs to be thought through properly to avoid situations where e.g. a single record that cannot be processed for some reason keeps the certainTimestamp at a permanent minimum.
I happen to be using innodb, read-committed.
My simple question is this relative to a transaction:
I have a table (TreeNodeId) which holds a set of 4 different nodekeys, that represent all extant nodes in my system that relate to available paths to webpages. Each key represents an item in the database, and each row in the table represents various combinations in which items are used.
At the beginning of a transaction, based on the items being changed, I make a single query for all rows in TreeNodeId that reflect some extant combination of my one or 2 items.
Will this single query be internally consistent, even if it fetches 10,000 rows? Is it possible for the db query set to get the first 100 rows, and then for some other simultaneous transaction to commit new or deleted rows that would cause the remaing results to be inconsistent?
Andy
If you isolation level is read 'committed' it will only return results that have been 'committed' by the transaction log. So if you start a query that is in isolation level 'committed' at that point in time the sql transaction log will only give you transactions that had posted to it's log as committed. If in the middle of the select someone posts records they will be seen as 'uncommitted' at that point in time till they end their operation and will be 'committed'. However even if you change the level to 'uncommitted' you should not get data as it is in mid stream, you should get what is available to the engine at the moment you began your operation according to the transaction logs.
Committed versus uncommitted will get records in the system at the moment of select that are there based on your select. So if I had say 3,000,000 records and 200,000 records inserting but they were committing one at a time and only 100,000 had committed and 100,000 were aware of operation in the logs but not committed yet.
Committed would give me 3,100,000 and Uncommitted would give 3,200,000. However there are schools of thought and I just got into a discussion yesterday with someone on this.... Uncommitted will give you the uncommitted results and are known as 'dirty reads' in that you are reading logs that are not set yet(you rebel). You are saying "Hey database I don't care what you got incoming that is finalized I WANT IT NOW." When you say committed you are saying: "Database I only want qualified data, if something is not finalized I don't want it."
Advantages with each:
Uncommitted you will not LOCK anything. You are basically saying to the system: "Don't lock anything out, just let me go through the system freely getting what is there and I don't care if you change something. I want it at moment of operation." If something is trying to insert or update when you perform this it WILL NOT LOCK IT.
Committed will not lock anything except that which is in process to commit till your operation has been completed. You are safe in knowing your data returned is finalized but your run the risk of BLOCKING transactions trying to insert or post. Your are essentially telling the database: "Wait for me to finish before continuing your commits on tables I am accessing. I need my data accurate so hands off till I am done". This will potentially lock data while it is performing the reads on a table you are gathering from. This is not that common as most selects are near instant but on huge systems that are transactional based on posting thousands of records a second it is a BIG CONCERN.
Honestly in my discussion I favored uncommitted and the other person favored committed. I argued it is far more acceptable to get dirty data than stop production inserts. They argued that phantom reads and other instances were worse. This is an opinion and SQL systems are designed around inserts and selects but seldom can you do both exceptionally fast without taking a little away from the other. My answer if you want accurate reporting is do nightly backups, SSIS packages, binary collections, or something similar in an isolation level such as snapshot or committed and put that data somewhere. Let that data have been set in a way that we know it is finalized and it is locked so it may not be changed later and report off of that. Don't report off of production data hot and make it a point to tell everyone to do that. That is bad practice in and of itself to tell people to report off of live data performing inserts and updates in real time.
Will it hurt if you are a small mom and pop store with only 5 or 10 people using the database, probably not. Will it hurt if you are little bigger and have 50 people accessing the same database but it is about 100 gigs and semi transactional in that you get trickle's of data during the day. Still probably not. Will it hurt if you have 200 people and multiple servers and databases and a main transactional database brain storing the composite of all the data. ABSOLUTELY, don't read from a main production database with intense operations if it's main purpose is to get data to store.
EDIT to further point from real world example:
That is why usually at the top of most operations where I am not using table variables (declare #Table table) I set this: "set transaction isolation level read uncommitted". Will I be using this intensely every time I query? LOL, I hope not. In fact Full disclosure it may NEVER EVER help me from this point on because I isolate my data a lot with temp tables for huge transaction reporting. But I will not be getting yelled at by others I have a long running transaction blocking their inserts. You will also see a lot of people do this: "select * from table (nolock)" I Generally give code like this to lesser query designers as it embeds the nolock hint with the query. If I tell everyone to do this they will make it policy.
You do not have to do this and in fact some people will maybe follow me and claim this is wrong and post their side. I do it MOSTLY FOR PRODUCTION PROTECTION and anyone that tells me that is wrong I would like to hear why they like to lock tables and report off of them in production versus getting their data in or updated in real time first. I would have a hard time going to a manager and saying: "You know that huge account you were waiting to post 2 million records on and know the instance it was done. Well John down the hall really wanted to run this query that takes an hour to run because it was sloppily designed. He chose to use committed and is hitting some of the tables doing inserts so we are getting occasional locks. Well I think it is more important he get his report than we get business." I wonder what the manager would tell me back?
How are people coping with changes to redis object schemas - adding or removing properties from objects?
Sharing from my own experience (one year old project with thousands of user requests per second).
Usually, there were two scenarios for me:
Add new information to existing structures (like, "email" field to a user)
Remove or change existing values in existing structures (like, change format of some field)
Drop stuff from the database
For 1 I keep following simple strategy: degrade gracefully, e.g. if user doesn't have email record - treat it as empty email. Worked all the time.
For 2 and 3 it depends, whether data can be changed/calculated/fixed before releasing or after. I run a job on database that does all the work for me, for few millions of keys it takes considerable time (minutes). If that job can be run only after I release the new code - then degrading gracefully helps a lot, I simply release and then run the job.
PS: If you affect a lot of keys in redis then it is very important to use http://redis.io/topics/pipelining Saves a lot of time.
Take a list of all affected (i.e. you want to fix them in any way) keys or records in pipeline
Do whatever you want on them. If it's possible try to queue writing operations into pipeline too
Send queued operations to redis.
It is also very important for you to make indexes of your structures. I keep sets with ids. Then I simply iterate over SMEMBERS(set_with_ids).
It is much, much better than iterating over KEYS command.
For extremely simple versioning, you could use different database numbers. This could be quite limiting in cases where almost everything is the same between two versions but it's also a very clean way to do it if it will work for you.
Do you know of any ORM tool that offers deadlock recovery? I know deadlocks are a bad thing but sometimes any system will suffer from it given the right amount of load. In Sql Server, the deadlock message says "Rerun the transaction" so I would suspect that rerunning a deadlock statement is a desirable feature on ORM's.
I don't know of any special ORM tool support for automatically rerunning transactions that failed because of deadlocks. However I don't think that a ORM makes dealing with locking/deadlocking issues very different. Firstly, you should analyze the root cause for your deadlocks, then redesign your transactions and queries in a way that deadlocks are avoided or at least reduced. There are lots of options for improvement, like choosing the right isolation level for (parts) of your transactions, using lock hints etc. This depends much more on your database system then on your ORM. Of course it helps if your ORM allows you to use stored procedures for some fine-tuned command etc.
If this doesn't help to avoid deadlocks completely, or you don't have the time to implement and test the real fix now, of course you could simply place a try/catch around your save/commit/persist or whatever call, check catched exceptions if they indicate that the failed transaction is a "deadlock victim", and then simply recall save/commit/persist after a few seconds sleeping. Waiting a few seconds is a good idea since deadlocks are often an indication that there is a temporary peak of transactions competing for the same resources, and rerunning the same transaction quickly again and again would probably make things even worse.
For the same reason you probably would wont to make sure that you only try once to rerun the same transaction.
In a real world scenario we once implemented this kind of workaround, and about 80% of the "deadlock victims" succeeded on the second go. But I strongly recommend to digg deeper to fix the actual reason for the deadlocking, because these problems usually increase exponentially with the number of users. Hope that helps.
Deadlocks are to be expected, and SQL Server seems to be worse off in this front than other database servers. First, you should try to minimize your deadlocks. Try using the SQL Server Profiler to figure out why its happening and what you can do about it. Next, configure your ORM to not read after making an update in the same transaction, if possible. Finally, after you've done that, if you happen to use Spring and Hibernate together, you can put in an interceptor to watch for this situation. Extend MethodInterceptor and place it in your Spring bean under interceptorNames. When the interceptor is run, use invocation.proceed() to execute the transaction. Catch any exceptions, and define a number of times you want to retry.
An o/r mapper can't detect this, as the deadlock is always occuring inside the DBMS, which could be caused by locks set by other threads or other apps even.
To be sure a piece of code doesn't create a deadlock, always use these rules:
- do fetching outside the transaction. So first fetch, then perform processing then perform DML statements like insert, delete and update
- every action inside a method or series of methods which contain / work with a transaction have to use the same connection to the database. This is required because for example write locks are ignored by statements executed over the same connection (as that same connection set the locks ;)).
Often, deadlocks occur because either code fetches data inside a transaction which causes a NEW connection to be opened (which has to wait for locks) or uses different connections for the statements in a transaction.
I had a quick look (no doubt you have too) and couldn't find anything suggesting that hibernate at least offers this. This is probably because ORMs consider this outside of the scope of the problem they are trying to solve.
If you are having issues with deadlocks certainly follow some of the suggestions posted here to try and resolve them. After that you just need to make sure all your database access code gets wrapped with something which can detect a deadlock and retry the transaction.
One system I worked on was based on “commands” that were then committed to the database when the user pressed save, it worked like this:
While(true)
start a database transaction
Foreach command to process
read data the command need into objects
update the object by calling the command.run method
EndForeach
Save the objects to the database
If not deadlock
commit the database transaction
we are done
Else
abort the database transaction
log deadlock and try again
EndIf
EndWhile
You may be able to do something like with any ORM; we used an in house data access system, as ORM were too new at the time.
We run the commands outside of a transaction while the user was interacting with the system. Then rerun them as above (when you use did a "save") to cope with changes other people have made. As we already had a good ideal of the rows the command would change, we could even use locking hints or “select for update” to take out all the write locks we needed at the start of the transaction. (We shorted the set of rows to be updated to reduce the number of deadlocks even more)