Handle "Lock not granted. Try restarting the transaction" error in TokuMX - locking

We have several parallel processes that may execute a findAndModify query against TokuMX at the same time.
This sometimes leads to the exception: Lock not granted. Try restarting the transaction, with code 16759.
There is an article about this error and even a closed bug ticket, but unfortunately neither of them provides information on how to deal with the issue.
So what are the approaches to handling such an exception?
Otherwise it's hard to run parallel processes, because we cannot rely on the locking mechanism of TokuMX.
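One approach we are considering is to retry with backoff; here is a minimal sketch using the MongoDB Java driver, assuming the server error surfaces as a MongoCommandException carrying code 16759 (the exception type, error-code check, and retry limits are assumptions on our side):

import com.mongodb.MongoCommandException;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.bson.conversions.Bson;

public class RetryingFindAndModify {
    private static final int LOCK_NOT_GRANTED = 16759;

    // Retries findOneAndUpdate while TokuMX reports "Lock not granted".
    public static Document findAndModifyWithRetry(MongoCollection<Document> coll,
                                                  Bson filter, Bson update,
                                                  int maxAttempts) throws InterruptedException {
        long backoffMillis = 50;
        for (int attempt = 1; ; attempt++) {
            try {
                return coll.findOneAndUpdate(filter, update);
            } catch (MongoCommandException e) {
                if (e.getErrorCode() != LOCK_NOT_GRANTED || attempt >= maxAttempts) {
                    throw e; // a different error, or out of retries
                }
                Thread.sleep(backoffMillis);                       // let the competing writer finish
                backoffMillis = Math.min(backoffMillis * 2, 2000); // exponential backoff, capped
            }
        }
    }
}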

Related

The wait operation timed out (.aspx)

I created an internal website for our company. It ran smoothly for several months, and then I added more items to the website. When I ran it live, it worked normally. Then suddenly one of my users from another server sent me a "The wait operation timed out." error. When I checked that particular link, it worked normally for me and for some others whom I asked to check whether they could access that page. I have already increased the connection timeout, but still no luck. Does the error come from another server? Can someone explain the possible causes?
This is how another plant experiences it: every time they first open the website, the error screen shows up, but when they refresh it, they can use the website. I don't know why this happens. I need your help.
Below are the error details:
Exception Details: System.ComponentModel.Win32Exception: The wait operation timed out
Source Error: An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.
Thanks in advance
The fact that this happens for one user but not for the testers implies it may occur when the system is under load; timeouts are pretty common for database queries running under stress if the database has been set up "out of the box" without tuning.
I would suggest referring to: The wait operation timed out. ASP
I don't have enough information to troubleshoot this question properly, since I don't know which DBMS you are working with. But as a rule this seems to happen because a call to the database is timing out. In SQL Server, increasing the CommandTimeout (NOT the connection timeout) is one of the quick-and-dirty ways to solve the problem.
In SQL Server, CommandTimeout is the time allowed for an operation to complete before exiting with a timeout error. ConnectionTimeout, by contrast, is the time the system waits when trying to open the initial connection to the database. Changing ConnectionTimeout won't help with an operation timing out, but CommandTimeout will.
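To make the distinction concrete outside ADO.NET, here is a minimal JDBC sketch of the two knobs (the connection URL, credentials, and query are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        // Connection (login) timeout: how long to wait to OPEN the connection.
        DriverManager.setLoginTimeout(15); // seconds

        String url = "jdbc:sqlserver://dbhost;databaseName=app"; // hypothetical
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // Command (query) timeout: how long a single statement may run.
            // This is the knob to raise when long queries time out under load.
            stmt.setQueryTimeout(120); // seconds
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM big_table")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1));
                }
            }
        }
    }
}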
Other DBMS systems will have other mechanisms for resolving timeout issues.
That's the quick and dirty solution. The longer solution is to add more logging to your system to identify which calls are timing out, and then do some DBA work to optimize the query and database performance. My understanding is that entity frameworks also have tuning options for automatically generated queries, but exactly what those are depends on which one you're using.

CXSYNC_PORT wait type in Azure SQL Database

I'm facing this issue intermittently: a query (called from a stored procedure) enters the CXSYNC_PORT wait type and stays there for a long time (sometimes 8 hours at a stretch). I have to kill the process and then rerun the procedure. The procedure is called every 2 hours from an ADF pipeline.
What's the reason for this behavior, and how do I fix the issue?
I searched a lot, and there are no Microsoft documents that discuss the wait type CXSYNC_PORT. Others have asked the same question, but still with no more details.
Most suggestions are to ask the same question in other forums, or to ask a professional engineer for help, who will deal with your problem separately and confidentially.
You can ask Azure support for detailed help: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request
And here's the same question, where a Microsoft engineer gave more details about the issue:
As part of a fix, CXPACKET waits were further broken down into CXSYNC_CONSUMER and CXSYNC_PORT (with data transfer waits still reported as CXPACKET), so as to distinguish between different wait times for a correct diagnosis of the problem.
Basically, CXPACKET is divided into 3: CXPACKET, CXSYNC_PORT, CXSYNC_CONSUMER. CXPACKET is used for data transfer sync, while CXSYNC_* are used for other synchronizations. CXSYNC_PORT is used for synchronizing the opening/closing of the exchange port between the consuming thread and the producing thread. Long waits here may indicate server load and a lack of available threads. Plans containing a sort may contribute to this wait type, because the complete sort may occur before the port is synchronized.
Please refer to this link: What is causing wait type CXSYNC_PORT and what to do about it? for more useful information. But for now, there isn't an exact solution.
Use the query hint OPTION (MAXDOP 1).
This will run your long-running query in a single thread, and you won't get the CX* type waits. In my experience this can bring a massive 10-20x decrease in execution time and will free up CPU for other tasks, as there will be no context switching and thread coordination activity.
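For illustration, here is a hedged sketch of attaching the hint, executed via JDBC; the table and column names are hypothetical stand-ins for the statement inside your procedure (in practice the hint goes on the query inside the stored procedure itself):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class MaxdopExample {
    static void runSerialQuery(Connection conn) throws Exception {
        String sql =
            "SELECT c.Region, SUM(o.Amount) AS Total "  // hypothetical schema
          + "FROM dbo.Orders o JOIN dbo.Customers c ON c.Id = o.CustomerId "
          + "GROUP BY c.Region "
          + "OPTION (MAXDOP 1)"; // serial plan: no exchange ports, no CXSYNC_* waits
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getBigDecimal(2));
            }
        }
    }
}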

Spring Batch restart crashed jobs

Hi Spring Batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I am trying to find and restart the stale job executions using:
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
for (JobExecution jobExecution : jobExecutions) {
    // Mark the stale execution as failed and closed so it becomes restartable.
    jobExecution.setStatus(BatchStatus.FAILED);
    jobExecution.setEndTime(new Date());
    jobRepository.update(jobExecution);
    // Launch a new execution that resumes from the last committed point.
    jobOperator.restart(jobExecution.getId());
}
But this seems very inconvenient.
1) I have to do this before other (new) jobs can be started.
2) I have to handle multiple instances of running servers, so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution that registers a "start up, clean jobs" listener. This still would not fix the problems originating from the multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently from your application throwing a caught Exception. The reason is that you've effectively pulled the rug out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step statuses. Therefore, the last status touchpoint with the database remains, and the job still looks STARTED.
"OK, fine," you say, "but if I start another execution, I want it to find that STARTED execution and pick up where it left off." The problem is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that failed but couldn't update the database. The framework correctly errs on the side of caution and prevents you from starting a job that already appears to be running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution, and fail with the message "A job execution for this job is already running." I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the second execution would instead be allowed to start, and you'd have two different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event that Linux kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs gracefully:
Stop via throw Exception
Stop via JobOperator.stop()
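For the second pattern, here is a minimal sketch of a graceful stop, assuming the jobOperator and jobExplorer beans are wired in (the jobName parameter is illustrative):

import java.util.Set;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobOperator;

public class GracefulStop {
    private final JobOperator jobOperator;
    private final JobExplorer jobExplorer;

    public GracefulStop(JobOperator jobOperator, JobExplorer jobExplorer) {
        this.jobOperator = jobOperator;
        this.jobExplorer = jobExplorer;
    }

    // Sends a stop signal; each job halts at its next chunk boundary and its
    // status is persisted as STOPPED, so a later restart() is safe.
    public void stopRunning(String jobName) throws Exception {
        Set<JobExecution> running = jobExplorer.findRunningJobExecutions(jobName);
        for (JobExecution execution : running) {
            jobOperator.stop(execution.getId());
        }
    }
}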

Deadlocks when running an NServiceBus service cause a corrupt connection

We're running NServiceBus for a web application to handle situations where the user performs "batch-like" actions, such as firing a command that affects 1000 entities.
It works well, but under moderate load we get some deadlocks. That isn't a problem in itself; just retry the message, right? :)
The problem occurs when the next message arrives and tries to open a connection. The connection is then "corrupt".
We get the following error:
System.Data.SqlClient.SqlException (0x80131904): New request is not allowed to start because it should come with valid transaction descriptor
I've searched the web, and I think our problem is a reported NHibernate "bug":
A workaround is supposed to be disabling connection pooling, but I don't like that, since performance will degrade.
We're running NServiceBus 2.6 and NHibernate 3.3.
Does anyone have any experience with this? Could an upgrade of NServiceBus help?
I've seen this in the past. If your design warrants it, try breaking the transaction in two. If you flow the message transaction all the way down to your database operations, any failure will have a cascading effect, and it will impact (ideally it shouldn't) subsequent messages as well.
Instead of updating the 1000 entities in one command, could you publish an event saying that the command has completed, and then have several subscribers act on this event to update the affected entities? It sounds to me like a command that updates 1000 entities should be split into a number of smaller commands. Take a look at sagas to see how you can handle a long-running business process. For example, you might have something like: process started, step 1 completed, step 2 completed, process completed, etc.
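To make the splitting idea concrete, here is a minimal, language-neutral sketch in Java (NServiceBus itself is .NET; the Bus interface, the batch command, and the batch size are hypothetical):

import java.util.List;

public class BatchSplitter {
    // Hypothetical stand-in for the message bus (e.g. NServiceBus' IBus).
    interface Bus {
        void send(Object command);
    }

    // A small command touching only a slice of the entities, so each message
    // holds short-lived locks and can be retried cheaply on deadlock.
    static final class UpdateEntitiesBatch {
        final List<Long> entityIds;
        UpdateEntitiesBatch(List<Long> entityIds) { this.entityIds = entityIds; }
    }

    static void dispatchInBatches(Bus bus, List<Long> allIds, int batchSize) {
        for (int from = 0; from < allIds.size(); from += batchSize) {
            int to = Math.min(from + batchSize, allIds.size());
            bus.send(new UpdateEntitiesBatch(allIds.subList(from, to)));
        }
    }
}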

ORM Support for Handling Deadlocks

Do you know of any ORM tool that offers deadlock recovery? I know deadlocks are a bad thing, but sometimes any system will suffer from them given the right amount of load. In SQL Server, the deadlock message says "Rerun the transaction", so I would suspect that rerunning a deadlocked statement is a desirable feature in ORMs.
I don't know of any special ORM tool support for automatically rerunning transactions that failed because of deadlocks. However, I don't think an ORM makes dealing with locking/deadlocking issues very different. First, you should analyze the root cause of your deadlocks, then redesign your transactions and queries in a way that avoids deadlocks, or at least reduces them. There are lots of options for improvement, like choosing the right isolation level for (parts of) your transactions, using lock hints, etc. This depends much more on your database system than on your ORM. Of course, it helps if your ORM allows you to use stored procedures for some fine-tuned commands, etc.
If this doesn't help you avoid deadlocks completely, or you don't have the time to implement and test the real fix now, you could of course simply place a try/catch around your save/commit/persist (or whatever) call, check the caught exceptions to see whether they indicate that the failed transaction was a "deadlock victim", and then simply call save/commit/persist again after sleeping a few seconds. Waiting a few seconds is a good idea, since deadlocks are often an indication of a temporary peak of transactions competing for the same resources, and rerunning the same transaction quickly again and again would probably make things even worse.
For the same reason, you probably want to make sure that you retry the same transaction only once.
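A minimal sketch of that workaround in Java, assuming SQL Server, where a deadlock victim surfaces as a SQLException with error code 1205 (the saveAction parameter and the sleep time are illustrative):

import java.sql.SQLException;

public class DeadlockRetry {
    // Runs the save/commit/persist action, retrying exactly once if the
    // first attempt was chosen as a deadlock victim.
    public static void saveWithRetry(Runnable saveAction) throws InterruptedException {
        try {
            saveAction.run();
        } catch (RuntimeException e) {
            if (!isDeadlockVictim(e)) {
                throw e;
            }
            Thread.sleep(3000); // let the contention peak pass
            saveAction.run();   // second and final attempt
        }
    }

    // Walks the cause chain looking for SQL Server's deadlock error 1205.
    static boolean isDeadlockVictim(Throwable t) {
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            if (cause instanceof SQLException
                    && ((SQLException) cause).getErrorCode() == 1205) {
                return true;
            }
        }
        return false;
    }
}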
In a real-world scenario we once implemented this kind of workaround, and about 80% of the "deadlock victims" succeeded on the second go. But I strongly recommend digging deeper to fix the actual reason for the deadlocking, because these problems usually increase exponentially with the number of users. Hope that helps.
Deadlocks are to be expected, and SQL Server seems to be worse off on this front than other database servers. First, you should try to minimize your deadlocks. Use the SQL Server Profiler to figure out why they are happening and what you can do about it. Next, configure your ORM not to read after making an update in the same transaction, if possible. Finally, after you've done that, if you happen to use Spring and Hibernate together, you can put in an interceptor to watch for this situation. Extend MethodInterceptor and place it in your Spring bean under interceptorNames. When the interceptor runs, use invocation.proceed() to execute the transaction, catch any exceptions, and define the number of times you want to retry.
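Here is a hedged sketch of such an interceptor; the retry count and the deadlock check are assumptions. Note that Spring cannot safely replay proceed() on the same invocation (the chain index advances), so each attempt here works on a clone:

import java.sql.SQLException;
import org.aopalliance.intercept.MethodInterceptor;
import org.aopalliance.intercept.MethodInvocation;
import org.springframework.aop.ProxyMethodInvocation;

public class DeadlockRetryInterceptor implements MethodInterceptor {

    private static final int MAX_ATTEMPTS = 3; // assumption, tune to taste

    @Override
    public Object invoke(MethodInvocation invocation) throws Throwable {
        Throwable lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            // Retry on a fresh clone so the interceptor chain can be replayed.
            MethodInvocation attemptInvocation =
                    ((ProxyMethodInvocation) invocation).invocableClone();
            try {
                return attemptInvocation.proceed();
            } catch (Throwable t) {
                if (!isDeadlockVictim(t)) {
                    throw t;
                }
                lastFailure = t;
            }
        }
        throw lastFailure;
    }

    // SQL Server reports deadlock victims with error code 1205.
    private boolean isDeadlockVictim(Throwable t) {
        for (Throwable cause = t; cause != null; cause = cause.getCause()) {
            if (cause instanceof SQLException
                    && ((SQLException) cause).getErrorCode() == 1205) {
                return true;
            }
        }
        return false;
    }
}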
An O/R mapper can't detect this, as the deadlock always occurs inside the DBMS, and could even be caused by locks set by other threads or other applications.
To be sure a piece of code doesn't create a deadlock, always use these rules (a JDBC sketch follows below):
- Do fetching outside the transaction. So first fetch, then perform processing, then perform DML statements like insert, delete and update.
- Every action inside a method, or series of methods, which contains or works with a transaction has to use the same connection to the database. This is required because, for example, write locks are ignored by statements executed over the same connection (as that same connection set the locks ;)).
Often, deadlocks occur because code either fetches data inside a transaction, which causes a NEW connection to be opened (which then has to wait for locks), or uses different connections for the statements in a transaction.
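A minimal JDBC sketch of those two rules; loadRows and writeRows are hypothetical helpers standing in for the real fetch and DML code:

import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class NoDeadlockPattern {
    void process(DataSource dataSource) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            // 1) Fetch first, outside the transaction (auto-commit still on).
            List<String> rows = loadRows(conn);

            // ...process the objects in memory here...

            // 2) Then one transaction, with all DML on the SAME connection.
            conn.setAutoCommit(false);
            try {
                writeRows(conn, rows);
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }

    // Hypothetical helpers; real implementations would run SELECT/UPDATE here.
    private List<String> loadRows(Connection conn) { return List.of(); }
    private void writeRows(Connection conn, List<String> rows) { }
}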
I had a quick look (no doubt you have too) and couldn't find anything suggesting that Hibernate, at least, offers this. That is probably because ORMs consider it outside the scope of the problem they are trying to solve.
If you are having issues with deadlocks, certainly follow some of the suggestions posted here to try to resolve them. After that, you just need to make sure all your database access code is wrapped with something that can detect a deadlock and retry the transaction.
One system I worked on was based on "commands" that were committed to the database when the user pressed save. It worked like this (sketched here in Java-style pseudocode; the helper calls stand in for our in-house data access layer):
while (true) {
    startTransaction();
    for (Command command : commandsToProcess) {
        readDataFor(command);    // read the data the command needs into objects
        command.run();           // update the objects
    }
    saveObjectsToDatabase();
    if (!deadlockDetected()) {
        commitTransaction();     // we are done
        break;
    } else {
        abortTransaction();
        logDeadlock();           // and try again
    }
}
You may be able to do something like this with any ORM; we used an in-house data access system, as ORMs were too new at the time.
We ran the commands outside of a transaction while the user was interacting with the system, then reran them as above (when the user did a "save") to cope with changes other people had made. As we already had a good idea of the rows each command would change, we could even use locking hints or "select for update" to take out all the write locks we needed at the start of the transaction. (We sorted the set of rows to be updated, to reduce the number of deadlocks even more.)
We run the commands outside of a transaction while the user was interacting with the system. Then rerun them as above (when you use did a "save") to cope with changes other people have made. As we already had a good ideal of the rows the command would change, we could even use locking hints or “select for update” to take out all the write locks we needed at the start of the transaction. (We shorted the set of rows to be updated to reduce the number of deadlocks even more)