This worked fine with Ignite 1.7. I recently upgraded to 2.0 and noticed that calling
grid.compute().withAsync().run([my runnable here])
stops working after about an hour. The call is made roughly once a minute and executes a short-lived runnable on the cluster. After about an hour, the call above simply does nothing.
No warnings, no errors, it just does nothing.
Turns out there was nothing wrong with Ignite 2.0, which makes me very happy.
I had a scenario where, after about an hour of running, a semaphore would block a thread, which ended up blocking the work the runnables were doing.
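For illustration only, here is a minimal sketch (not the actual code; the semaphore, permit count, and class name are invented) of how a runnable that blocks on a semaphore can quietly tie up the threads executing these jobs, so that later submissions appear to do nothing:

import java.util.concurrent.Semaphore;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class BlockedRunnableSketch {

    // Hypothetical permit pool that is acquired but never released.
    private static final Semaphore PERMITS = new Semaphore(60);

    public static void main(String[] args) {
        Ignite grid = Ignition.start();

        // Submitted roughly once a minute in the scenario above.
        grid.compute().withAsync().run(() -> {
            // Once all 60 permits are gone (about an hour at one call per minute),
            // this blocks forever, the executing thread is stuck, and subsequent
            // runnables never get to do their work.
            PERMITS.acquireUninterruptibly();
            // ... the actual short-lived work would go here ...
        });
    }
}

As a side note, Ignite 2.x also offers grid.compute().runAsync(runnable), which returns an IgniteFuture and replaces the now-deprecated withAsync() chain.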
I have a script running on one of my Linux servers which handles batch file processing with a ProcessPoolExecutor and generally runs fine for days or even weeks on end without any issue. Sometimes, though, it looks like a few of my child processes just die (there is no error message or exception at all, and I can't reproduce it even by killing child processes from the shell), which leaves the parent process waiting for the return indefinitely...
That's the call (the initializer doesn't have any effect in this case; it's just there to handle the reverse scenario described in another very helpful thread on Stack Overflow):
with ProcessPoolExecutor(max_workers=int(config['PERFORMANCE']['NumberOfProcesses']),
                         initializer=start_thread_to_terminate_when_parent_process_dies,
                         initargs=(os.getpid(),)
                         ) as executor:
    executor.map(process_main, file_list)
From what I've gathered, the Pool should be able to recover in exactly the described scenario:
https://bugs.python.org/issue9205
Anyone got any idea? (I've thought about switching to the pebble library with its timeout functionality, or creating a separate watchdog script.)
Well, I'm a new developer with Vert.x... so I have a problem with an implementation involving a database connection.
In one query (or several), I get a lot of information, around 160K records; those records are returned as a JSON object through GraphQL. So... when the query time goes over 30000 ms, the console says:
Thread Thread[vert.x-eventloop-thread-1,5,main] has been blocked for 5026 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
So I investigated this, and I cannot find a way to resolve it, or to raise the limit so that the query can finish and return all the records.
This question is actually covered in detail in the official documentation.
you can’t call blocking operations directly from an event loop, as
that would prevent it from doing any other useful work
That's what you're doing at the moment - calling a blocking operation.
An alternative way to run blocking code is to use a worker verticle. A
worker verticle is always executed with a thread from the worker pool.
Run your "slow" code in a worker verticle. Communicate between EventLoop verticls and workers using EventBus. As long as you're inside same VM, passing even large collections over EventBus has no overhead.
I'm trying to execute a process to update my database, but the problem is that I set different RecurringJobs for it at different hours.
Today when I checked the Hangfire status (I set Hangfire up yesterday), I found that the job that should have executed yesterday and the one task for today had both executed 30 minutes ago, at the same time, and this has created duplicates in the database.
Can you help me with this?
If your problem is one of concurrency, you can solve it by running Hangfire single-threaded. Simply configure the number of Hangfire worker threads on startup:
var server = new BackgroundJobServer(new BackgroundJobServerOptions
{
    WorkerCount = 1
});
This will force hangfire to process queued jobs sequentially.
Alternatively, if you have the Pro version of hangfire you can control order using batch chaining.
I don't know if a worker can be considered a thread.
Within a Hangfire worker, single-threaded code will be run by exactly one thread.
This doesn't look like a concurrency issue as has been suggested. It's not completely clear what you are trying to do but I'm assuming you want the job to run at 7, 12:45, and 17:30 and had issues because both the 7am and 17:30 job ran at the same time (7am).
Based on the created time it looks like you created these around 14:30. That means the 17:30 job should have run but didn't until the next morning around 7am. My best guess is this was hosted in IIS and the site app pool was recycled.
This would cause any recurring jobs that were supposed to run to be delayed until the app pool / site was started again (which I assume was around 7am).
Check out these documents on how to ensure your site is always running: http://docs.hangfire.io/en/latest/deployment-to-production/making-aspnet-app-always-running.html
If it's not an IIS issue something must have caused the BackgroundJobServer to stop monitoring the database for jobs until ~7:00am (server shutdown, error, etc).
Hi Spring Batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions by using
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
...
// mark the stale execution as failed, then restart it
jobExecution.setStatus(BatchStatus.FAILED);
jobExecution.setEndTime(new Date());
jobRepository.update(jobExecution);
jobOperator.restart(jobExecution.getId());
But this seems to be very inconvenient.
1) I have to do this before other (new) jobs can be started.
2) I have to handle multiple running server instances, so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution that registers a "start up clean jobs listener". This would still not fix the problems originating from the multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is actually running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than your application throwing a caught Exception. The reason for this is that you've effectively pulled the carpet out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain and the job will still look STARTED.
"OK, fine" you say, "but if I start another execution, I want it to find that STARTED execution, and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't up the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution, and fail with the message "A job execution for this job is already running." I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event your job is killed from the Linux terminal or dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs (a sketch of the JobOperator approach follows the list):
Stop via throwing an Exception
Stop via JobOperator.stop()
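A minimal sketch of the JobOperator route, assuming the operator and explorer are already wired up (the class and method names here are illustrative):

import java.util.Set;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobOperator;

public class GracefulStop {

    private final JobExplorer jobExplorer;
    private final JobOperator jobOperator;

    public GracefulStop(JobExplorer jobExplorer, JobOperator jobOperator) {
        this.jobExplorer = jobExplorer;
        this.jobOperator = jobOperator;
    }

    // Ask every running execution of the given job to stop at its next chunk boundary.
    // stop() sets the status to STOPPING, and the execution moves to STOPPED once the
    // step reaches a commit point, leaving it in a cleanly restartable state.
    public void stopRunningExecutions(String jobName) throws Exception {
        Set<JobExecution> running = jobExplorer.findRunningJobExecutions(jobName);
        for (JobExecution execution : running) {
            jobOperator.stop(execution.getId());
        }
    }
}

Unlike a kill -9, this gives the framework a chance to commit its state, so the execution ends up STOPPED rather than stuck looking STARTED.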
We are running Hangfire single-threaded using BackgroundJobServerOptions.WorkerCount = 1 (because we have a requirement for ordered processing).
Most of the time this is no problem, but occasionally a job gets stuck for entirely expected reasons (e.g., the actual code it is running goes into an infinite loop), and because we are running single-threaded this prevents other jobs in the queue from starting.
In order to try and work around this, we delete the job, but then it stays on the queue, blocking any other job from starting.
The only way I have found to resolve this is to drop and recreate the hangfire DB which is obviously not great.
Why does deleting a running job in Hangfire not also remove it from the queue? Is this weird delete behavior a bug that will be fixed in a later version, or is it by design because we're running single-threaded?
If this is by design then how do you cancel a processing job in a way which removes it from the queue?
Well it seems that this behavior is by design.
If the IIS app pool worker is recycled, Hangfire will start processing the next task immediately. However, without this restart Hangfire will "hang" indefinitely.
An issue was raised on GitHub about this; however, it has not been solved yet:
https://github.com/HangfireIO/Hangfire/issues/80
With no way to cancel or manually "fail" a job, this makes Hangfire a lot less useful in a single-threaded scenario.
Update: this has been partially or fully addressed in some later version of Hangfire.