I'm using Bee-Queue for video transcoding job scheduling and processing.
So far everything is fine, but I'm now facing the challenge of a distributed environment: we auto-scale Amazon instances to add more workers for the jobs pending in the queue. We scale well, but I need the system to be fail-safe. If an instance whose workers were processing a job shuts down, we never receive that job's status or events, so the job that was running on that instance goes into a black hole and can't be recovered and processed again.
What I did:
I'm looking for a ready-made solution that works fail-safe in a distributed environment.
Thanks
Related
I have a FastAPI-based REST application that needs some scheduled tasks to run every 30 minutes. The application runs on Kubernetes, so the number of instances is not fixed. I want the scheduled jobs to trigger from only one of the available instances, not from all running instances, which would create a race condition, so I need some kind of locking mechanism that prevents a scheduler from firing if one is already running. My app connects to a MySQL-compatible Aurora DB running on AWS. Can I achieve this with APScheduler? If not, are there any alternatives available?
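One approach that reuses the pieces already in place (APScheduler plus the Aurora MySQL database) is to let every instance run the scheduler but guard the job body with a MySQL named lock. This is only a minimal sketch under those assumptions; the DSN, lock name and run_the_actual_work() helper are hypothetical.

# Every instance runs the scheduler, but MySQL's GET_LOCK() acts as a
# cross-instance mutex so only one pod actually does the work per run.
from apscheduler.schedulers.background import BackgroundScheduler
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@aurora-endpoint/mydb")  # hypothetical DSN
scheduler = BackgroundScheduler()

@scheduler.scheduled_job("interval", minutes=30)
def half_hourly_task():
    with engine.connect() as conn:
        # GET_LOCK returns 1 only for the first caller; the other instances skip this run.
        acquired = conn.execute(text("SELECT GET_LOCK('half_hourly_task', 0)")).scalar()
        if acquired != 1:
            return
        try:
            run_the_actual_work()  # placeholder for the real job
        finally:
            conn.execute(text("SELECT RELEASE_LOCK('half_hourly_task')"))

scheduler.start()

Because the instances' timers are not perfectly synchronised, you may also want to record the last successful run in a table so a late-firing instance doesn't repeat work the lock holder already finished.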
On a Celery service on CentOS that runs a single task at a time, terminating a task is simple:
revoke(id, terminate=True, signal='SIGINT')
However, while the interrupt signal is still being handled, the running task gets revoked and a new task from the queue starts on the node. This is troublesome: two tasks end up running at the same time on the node, and the signal handling can take up to a minute.
The question is: how can a signal be sent to a running task without actually terminating the task in Celery?
Or, put another way, is there any way at all to send a signal to a running task?
The assumption is that the user should be able to send the signal from a remote node; in other words, the user cannot list the running processes on the node.
Any other solution is welcome.
I don't understand your goal.
Are you trying to kill the worker? If so, I guess you are talking about a "warm shutdown", so you can send SIGTERM to the worker's process. The running task will get a chance to finish, but no new task will be started.
If you're just interested in revoking a specific task and keeping the same worker, can you share your Celery configuration and the worker command? Are you sure you're running with concurrency 1?
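For the warm-shutdown option, a minimal sketch (assuming you can resolve the worker's main process PID, e.g. from its pidfile) would be:

import os
import signal

def warm_shutdown(worker_pid: int) -> None:
    # SIGTERM tells the Celery worker to stop picking up new tasks and exit
    # once the task it is currently executing has finished.
    os.kill(worker_pid, signal.SIGTERM)

This has to run on the worker's host, since it needs OS-level access to the process.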
We are using Hangfire 1.7.6 to download data from Azure. However, after running for some time, Hangfire hits a deadlock and jobs seem stuck in Processing; we have to restart the service to get it working again.
There is a recurring job that adds jobs for the other background server. The jobs mostly get stuck when downloading a big file.
Has anyone faced this type of problem of Hangfire jobs stuck in Processing?
Please let me know if any further information is required. Any help/guidance is appreciated.
Isn't this caused by the length of time it takes to complete the download from Azure?
You could try testing this with large files and see how it handles them.
Also, as #jbl asked, how is your Hangfire server hosted? If it is hosted in IIS, remember that the Hangfire server may lose its heartbeat if IIS shuts down the application process after it has been idle for a given period of time.
I came across this issue in the past and ended up running the application as a process on the server.
IIS is optimised to save resources, so it will shut down processes that aren't being used. When a request is made to your application, it spins the process back up. This also means any scheduled background jobs won't fire while the process is down.
I'm looking into an issue where we're seeing CPU usage maxed out on a RavenDB instance used by the NServiceBus outbox implementation.
The design currently has all outbox workers configuring the deduplication data cleanup settings in their startup configuration, i.e. these settings:
endpointConfiguration.SetTimeToKeepDeduplicationData(TimeSpan.FromDays(7));
endpointConfiguration.SetFrequencyToRunDeduplicationDataCleanup(TimeSpan.FromMinutes(1));
If you've got multiple workers processing messages for the system, should the cleanup be implemented on each of them, or should it be treated like a cron job, where you run the cleanup on just one of those workers, or on a dedicated machine in the environment that is not a worker but has more of a utility role?
I would imagine the latter; otherwise, if you scale out the workers, all of them will try to run the cleanup process every minute. Or am I misunderstanding how this configuration executes the cleanup?
Thanks
"rabbitmqctl list_connections" shows as running but on the UI in the connections tab, under client properties, i see "connection.blocked: true".
I can see that messages are in queued in RabbitMq and the connection is in idle state.
I am running Airflow with Celery. My jobs are not executing at all.
Is this the reason why jobs are not executing?
How can I resolve the issue so that my jobs start running?
I'm experiencing the same kind of issue just using Celery.
It seems that when you have a lot of messages in the queue, and these are fairly chunky, and the node's memory usage gets high, the RabbitMQ memory high watermark gets crossed, and this blocks consumer connections, so no worker can access that node (and its queues).
At the same time, publishers keep happily sending messages via the exchange, so you end up in a lose-lose situation.
The only solution we had was to avoid hitting that memory watermark and to scale up the number of consumers.
Keep messages/tasks lean so that the signature is kilobytes, not megabytes.
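As an illustration of the "keep tasks lean" point, here is a sketch under the assumption that the large payloads already live in shared storage the workers can reach (an S3 bucket in this example; the broker URL and download_from_s3 helper are hypothetical):

from celery import Celery

app = Celery("transcode", broker="amqp://user:password@rabbitmq-host//")

@app.task
def process_video(s3_key: str) -> None:
    # The message body carries only a short reference; the worker fetches the
    # actual data itself, so the queue never holds multi-megabyte payloads.
    data = download_from_s3(s3_key)  # hypothetical helper
    ...

if __name__ == "__main__":
    # Publisher side: enqueue the key, not the bytes.
    process_video.delay("uploads/video-123.mp4")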