Run a Job in ALL distributed server instances - Hangfire

Today we are using Hangfire to distribute the load between different servers, and it works OK with BackgroundJob.Enqueue().
But I need a way to run a job on all Hangfire servers, such as the following:
BackgroundJob.Enqueue(() => refreshCachedVariables());
Is there a way to do this?

Related

Scheduling a task in a Distributed Environment like Kubernetes

I have a FastAPI-based REST application where I need some scheduled tasks that run every 30 minutes. This application will run on Kubernetes, so the number of instances is not fixed. I want the scheduled jobs to trigger from only one of the available instances, not from all running instances, which would create a race condition, so I need some kind of locking mechanism that prevents a scheduler from firing if another one is already running. My app connects to a MySQL-compatible Aurora DB running on AWS. Can I achieve this with APScheduler, and if not, are there any alternatives available?
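Not an authoritative answer, but a minimal sketch of the locking idea, assuming pymysql and APScheduler are available; the connection details, the lock name, and do_scheduled_work are placeholders. Every instance schedules the job, but MySQL's GET_LOCK (which Aurora MySQL supports) lets only one of them actually do the work. In a FastAPI app you would likely use APScheduler's BackgroundScheduler rather than the blocking variant:

# Sketch: every instance runs the scheduler, but only the instance that
# wins the MySQL advisory lock executes the task.
import pymysql
from apscheduler.schedulers.blocking import BlockingScheduler

def do_scheduled_work():
    # hypothetical placeholder for the real 30-minute task
    print("running the task on exactly one instance")

def run_if_lock_acquired():
    conn = pymysql.connect(host="aurora-endpoint", user="app",
                           password="secret", database="app")
    try:
        with conn.cursor() as cur:
            # GET_LOCK returns 1 if this session acquired the lock, 0 otherwise.
            cur.execute("SELECT GET_LOCK('refresh_job', 0)")
            if cur.fetchone()[0] != 1:
                return  # another instance is already running the job
            try:
                do_scheduled_work()
            finally:
                cur.execute("SELECT RELEASE_LOCK('refresh_job')")
    finally:
        conn.close()

if __name__ == "__main__":
    scheduler = BlockingScheduler()
    scheduler.add_job(run_if_lock_acquired, "interval", minutes=30)
    scheduler.start()

Because the lock is requested with a zero-second timeout, instances that lose the race simply skip that run instead of queueing up behind the winner.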

Hangfire jobs stuck in processing state

We are using Hangfire to download data from Azure. We are using Hangfire 1.7.6. However, after running for some time, Hangfire hits a deadlock and seems stuck processing the job. We had to restart the service to keep it working.
There is a recurring job that adds jobs to the other background server. The jobs mostly get stuck when downloading a big file.
Has anyone faced this kind of problem with Hangfire jobs stuck in processing?
Please let me know if any further information is required. Any help/guidance is appreciated.
Is this not caused by the length of time it takes to complete the download from Azure?
You could try testing this with large files and see how it handles them.
Also, like @jbl asked, how is your Hangfire server hosted? If it is hosted in IIS, remember that the Hangfire server may lose its heartbeat if IIS shuts down the application process after it has been idle for a given period of time.
I came across this issue in the past and ended up running the application as a standalone process on the server.
IIS is optimised to save resources, so it will shut down processes that aren't being used. When a request is made to your application, it fires the process back up. This will also cause any scheduled background jobs not to fire.

Scheduler not queuing jobs

I'm trying to test out Airflow on Kubernetes. The Scheduler, Worker, Queue, and Webserver are all in different deployments, and I am using the Celery executor to run my tasks.
Everything is working fine except that the Scheduler is not able to queue up jobs. Airflow runs my tasks fine when I execute them manually from the Web UI or CLI, but I am trying to get the scheduler to work.
My configuration is almost the same as it is on a single server:
sql_alchemy_conn = postgresql+psycopg2://username:password@localhost/db
broker_url = amqp://user:password@$RABBITMQ_SERVICE_HOST:5672/vhost
celery_result_backend = amqp://user:password@$RABBITMQ_SERVICE_HOST:5672/vhost
I believe that with this configuration I should be able to make it run, but for some reason only the workers are able to see the DAGs and their state, not the scheduler, even though the scheduler logs its heartbeats just fine. Is there anything else I should debug or look at?
First, you are using Postgres as the database for Airflow, aren't you? Did you deploy a pod and a service for Postgres? If so, verify that in your config file you have:
sql_alchemy_conn = postgresql+psycopg2://username:password@serviceNamePostgres/db
You can use this GitHub repo. I used it three weeks ago for a first test and it worked pretty well.
Its entrypoint is useful for verifying that RabbitMQ and Postgres are configured correctly.
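As a quick sanity check - a sketch that assumes SQLAlchemy and kombu are importable in the scheduler image and uses placeholder URLs - you can confirm from inside the scheduler pod that both the metadata database and the broker are reachable through the service names you put in airflow.cfg:

# Sketch: verify the scheduler pod can reach Postgres and RabbitMQ.
# Replace the URLs with the values from your airflow.cfg.
from sqlalchemy import create_engine, text
from kombu import Connection

SQL_ALCHEMY_CONN = "postgresql+psycopg2://username:password@serviceNamePostgres/db"
BROKER_URL = "amqp://user:password@rabbitmq-service:5672/vhost"

# Metadata database check
engine = create_engine(SQL_ALCHEMY_CONN)
with engine.connect() as conn:
    print("Postgres reachable:", conn.execute(text("SELECT 1")).scalar() == 1)

# Broker check
with Connection(BROKER_URL) as broker:
    broker.connect()
    print("RabbitMQ reachable:", broker.connected)

If either call fails from the scheduler pod but works from a worker pod, the problem is the scheduler deployment's configuration or service DNS rather than Airflow itself.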

Amazon EMR managing my spark cluster

I have a Spark setup on Amazon EC2 machines with 2 worker machines running. It reads data from Cassandra, does some processing, and writes to SQL Server. I have heard about Amazon EMR and read about it. I want a managed system where worker machines are automatically added to my cluster if my job is taking more time, and shut down when my job completes.
Can I achieve this through Amazon EMR?
The requirements are:
1. Worker machines are automatically added to my cluster if my job is taking more time.
2. They shut down when my job completes.
No. 2 is definitely possible if your job is launched from EMR steps. There is an option that auto-terminates the cluster after the last step completes. Alternatively, this could also be done programmatically with the SDK.
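A sketch of the SDK route with boto3; the release label, instance types, S3 path, and IAM roles below are placeholder assumptions. Setting KeepJobFlowAliveWhenNoSteps to False makes the cluster terminate on its own once the submitted steps finish:

# Sketch: launch an EMR cluster that auto-terminates after its steps complete.
# Names, versions, and paths below are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-batch",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False means the cluster terminates once the last step finishes.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/my_job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", response["JobFlowId"])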
No. 1 is a little more difficult, but EMR has three classes of nodes: master, core, and task. Task nodes can be added after cluster creation. The trigger for that would probably have to be done programmatically or by using another Amazon service, like Lambda.
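And a sketch of the scale-out side, e.g. run from a Lambda or a monitoring script once the job is observed to be running long; the cluster ID, instance type, and count are placeholders:

# Sketch: grow a running cluster by adding a TASK instance group.
# The cluster ID, instance type, and count are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",
    InstanceGroups=[{
        "Name": "extra-task-nodes",
        "InstanceRole": "TASK",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
        "Market": "SPOT",  # or ON_DEMAND
    }],
)

If you want to shrink the group again once the backlog clears, the same client exposes modify_instance_groups for resizing an existing group.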

Resque Workers from other hosts registered and active on my system

The Rails application I'm currently working on is hosted on Amazon EC2 servers. It uses Resque for running background jobs, and there are 2 such instances (would-be production and a stage). I've also mounted the Resque monitoring web app at the /resque route (on stage only).
Here is my question:
Why are there workers from multiple hosts registered within my stage system, and how can I avoid this?
Some additional details:
I see workers from apparently 3 different machines, but I managed to identify only 2 of them - the stage (obviously) and the production. The third has a different address format (it starts with domU) and I have no clue what it could be.
It looks like you're sharing a single Redis server across multiple Resque server environments.
The best way to do this safely is to use separate Redis servers, separate Redis databases, or namespaces. The redis-namespace gem can be used with Resque to isolate each environment's Resque queues and worker data.
I can't really help you with what the unknown one is, but I had something similar happen when moving hosts and having DNS names change. The only way I found to clear out the old ones was to stop all workers on the machine, fire up IRB, require 'resque', and look at Resque.workers. This will list all the workers Resque knows about, which in your case will include about 20 bogus ones. You can then do:
Resque.workers.each { |worker| worker.unregister_worker }
This should prune all the not-really-there workers and get you back to a proper display of the real workers.