Running multiple tasks of one DAG on separate machines in Airflow - redis

I need to create a DAG which looks like this:
The print_date task needs to run on server A and the templated task needs to run on server B. From the documentation it is clear that Celery with Redis or RabbitMQ is required. I am using Celery with Redis (puckel/docker-airflow), and Airflow is already running on server B with the CeleryExecutor.
Do I need to have the same setup on server A as well?
Also, how do I connect these two tasks in a single DAG when they actually run on different servers?
A sample framework for this kind of use case would be much appreciated.

Use Airflow queues. When you define a task, add a queue parameter to assign it to a particular queue.
For example, queue1 would run all of its tasks on machine 1 and queue2 would run all of its tasks on machine 2.
So you can assign task A to queue1 (it will run on machine 1) and task B to queue2 (it will run on machine 2).
See the documentation at https://airflow.apache.org/concepts.html#queues
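A rough sketch of what that could look like, assuming one worker listens on a queue named server_a_queue on server A and another on server_b_queue on server B (the queue names, dag_id and operators below are illustrative assumptions, not part of the original setup):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Illustrative DAG; the dag_id and queue names are assumptions.
dag = DAG(
    dag_id="multi_machine_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Picked up only by a worker started with -q server_a_queue (server A).
print_date = BashOperator(
    task_id="print_date",
    bash_command="date",
    queue="server_a_queue",
    dag=dag,
)

# Picked up only by a worker started with -q server_b_queue (server B).
templated = BashOperator(
    task_id="templated",
    bash_command="echo {{ ds }}",
    queue="server_b_queue",
    dag=dag,
)

print_date >> templated

Each server then runs its own Celery worker bound to its queue, for example airflow worker -q server_a_queue on server A and airflow worker -q server_b_queue on server B, while the scheduler, webserver and the Redis broker / metadata database stay in one place.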

Related

Distributed JobRunr using single data source

I want to create a scheduler using JobRunr that will run on two different servers. The scheduler will select data from an SQL database and call an API endpoint. How can I make sure these two schedulers, running on two different servers, will not pick the same data from the database?
The main concern is that they should not call the API with duplicate data from the two servers.
As per the documentation, JobRunr pushes the job onto the queue first, but I am wondering how one scheduler's queue will know that the same data has not been picked up by the other scheduler on a different server. Is there any extra locking mechanism I need to maintain?
JobRunr will run any job only once; the locking is already handled by JobRunr itself.

Prevent Celery From Grabbing One More Task Than Concurrency

In Celery with RabbitMQ, I have distributed workers running very long tasks on individual EC2 instances.
My concurrency is set to 2, -Ofair is enabled, and task_acks_late = True and worker_prefetch_multiplier = 1 are set, yet a Celery worker runs 2 tasks in parallel and then grabs a third task that it does not run. This leaves other workers with no tasks to run.
I would like workers to grab jobs only when they can actually work on them, so that other workers that are free can pick up those tasks and run them.
Does anyone know how to achieve this? Attached below is an example where my concurrency is 2 and there are three jobs on the worker, one of which is not yet acknowledged. I would like there to be only two tasks there, with the third remaining on the server until another worker can start it.

Celery multiple workers but one queue

I am new to Celery and Redis.
I started my Redis server using redis-server.
Celery was started with this command:
celery -A proj worker
There are no other configurations. However, I realised that when I have a long-running job in Celery, it does not process another task in the queue until the long-running task is completed. My understanding is that since I have 8 cores on my CPU, I should be able to process 8 tasks concurrently, since the default value for -c is the number of cores?
Am I missing something here?
Your problem is a classic one; everybody with long-running tasks runs into it.
The root cause is that Celery tries to optimize your execution flow by reserving some tasks for each worker, but if one of those tasks is long-running, the others get stuck behind it. This is governed by the 'prefetch count'; by default Celery is tuned for short tasks.
Another related setting is 'late ack'. By default, a worker takes a task from the queue and immediately sends an 'acknowledge' signal, after which the broker removes the task from the queue. This means even more messages get prefetched for that worker. With 'late ack' enabled, the worker sends the acknowledgement only after the task is completed.
That is the short version; you can read more about prefetching and late acknowledgement in the Celery documentation.
As for the solution, just use these settings (Celery 4.x):
task_acks_late = True
worker_prefetch_multiplier = 1
or for previous versions (2.x - 3.x):
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
Also, starting the worker with the -Ofair option has a similar effect.
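For reference, a minimal sketch of how these settings might look on a Celery 4.x app object (the module name and broker URL are assumptions):

# tasks.py - illustrative only; module name and broker URL are assumptions.
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

# Acknowledge a message only after the task has completed,
# so the broker keeps it until then instead of considering it delivered.
app.conf.task_acks_late = True

# Reserve at most one message per worker process instead of prefetching a batch.
app.conf.worker_prefetch_multiplier = 1

With the worker started as celery -A tasks worker -Ofair, a long-running task no longer blocks other queued tasks from being picked up by free worker processes.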

Celery, zmq, message passing approach for a distributed system

I need to implement a system which does the following:
1. Transfer data to a remote place.
2. Once the data is transferred fully, start a computation on the remote server.
3. Once the computation is done, fetch the resulting computed data back to the source.
4. A web interface to track/edit the progress of each task.
I am thinking of using:
1. Ruby on Rails for 4).
2. Celery as the distributed solution.
3. ZeroMQ to pass messages to the RoR app and between the different "categories" of Celery workers described below.
To decouple these components from each other, I'm considering having 3 sets of Celery workers, each belonging to a separate category:
A. 'Sync' workers,
B. 'Render' workers, and
C. 'Fetch' workers.
I want to use a ZeroMQ pub/sub or broadcast model to pass messages between these sets of workers and the web app so that they can be synchronised properly. For example, B should only kick in once A is done, and C should follow B.
Does this approach sound reasonable, or could it be done better using perhaps just ZeroMQ or Celery alone? Should I instead be using a Celery backend such as Redis or AMQP?
The reasons I want to use Celery are, of course, data persistence as well as a web interface to monitor the workers.
I'm relatively new to Celery, ZeroMQ and distributed computation in general, so any advice would be welcome.
Thanks all.
I have done something similar at work, but it was all done using RabbitMQ and Celery. The way I would approach this is to have a Celery worker running on the remote server and another on the local host. Give each worker its own unique queue and fire off a chain, something like
chain(sync.s(file), compute.s(), sync_back.s()).delay()
and have the two sync tasks go to the localhost queue and the compute task go to the remote host queue.
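A rough sketch of that chain with explicit queue routing, assuming a RabbitMQ broker, a queue named local consumed by the worker on the source host and a queue named remote consumed by the worker on the remote server (the task bodies, queue names and broker URL are placeholders):

from celery import Celery, chain

app = Celery("pipeline", broker="amqp://guest@localhost//")

@app.task
def sync(path):
    # placeholder: transfer the data to the remote server
    return path

@app.task
def compute(path):
    # placeholder: run the computation on the transferred data
    return path

@app.task
def sync_back(path):
    # placeholder: fetch the computed results back to the source
    return path

# Route each step explicitly: the two sync steps to the local worker's queue,
# the compute step to the remote worker's queue.
chain(
    sync.s("/data/input").set(queue="local"),
    compute.s().set(queue="remote"),
    sync_back.s().set(queue="local"),
).delay()

The local host would then run celery -A pipeline worker -Q local and the remote server celery -A pipeline worker -Q remote, so each step of the chain executes on the machine that consumes its queue.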

Distributor and worker endpoint queue on the same machine

I am using NServiceBus 3.2.2.0 and trying to test a distributor and a worker on the same machine.
I noticed the distributor creates the following queues:
EndPointQueue
EndPointQueue.distributor.control
EndPointQueue.distributor.storage
EndPointQueue.retries
EndPointQueue.timeouts
And the worker creates a new queue, something like:
EndPointQueue.5eb1d8d2-8274-45cf-b639-7f2276b56c0c
Is there a way to specify the worker endpoint queue name, instead of the worker creating a queue by appending a random string to the endpoint queue name?
Since it doesn't really make sense to run a worker on the same machine as the master (distributor), NServiceBus assumes that you're doing this for test purposes only and generates this kind of queue name.
In a true distributed scenario where the worker is running on its own box, it will have the same queue name as the master. The whole idea is that you shouldn't have to make any code or config changes to go from a single machine to a scaled-out deployment.