High performance Airflow Celery workers do not work - RabbitMQ

I have:
2 Airflow and Celery nodes (node 1, node 2)
1 PostgreSQL node (node 3)
2 RabbitMQ nodes (node 4, node 5)
I want to implement failover of Celery workers.
Node 1 - Airflow (webserver, scheduler), Celery (worker_1, flower_1)
Node 2 - Celery (worker_2, flower_2)
I run a task that writes the current timestamp to the database for 5 minutes. After that, I kill worker_1 (using kill ... (pid worker), or systemctl stop/restart worker.service). I expect that Celery, having detected the failure, will re-execute the task on worker_2, but the task is not re-executed.
I tried adding the following variables to airflow.cfg and ./airflow-certified:
task_acks_late=True
CELERY_ACKS_LATE=True
CELERY_TASK_REJECT_ON_WORKER_LOST=True
task_reject_on_worker_lost=True
task_track_started=True
But it did not help.
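As far as I understand, these are Celery settings rather than Airflow ones, so they only take effect if they end up in the Celery config that Airflow hands to its workers. This is roughly how I would expect to apply them (a sketch, assuming Airflow 2.x; my_celery_config is a placeholder module name that [celery] celery_config_options in airflow.cfg would point to as my_celery_config.CELERY_CONFIG):
# my_celery_config.py -- placeholder module; it must be importable on every Airflow/Celery node
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

CELERY_CONFIG = {
    **DEFAULT_CELERY_CONFIG,
    # ack only after the task finishes, and requeue it if the worker process is lost
    "task_acks_late": True,
    "task_reject_on_worker_lost": True,
}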
How can I make it so that if my worker dies, the task starts again on another worker?

Related

Airflow 2.0 - Task stuck in queued state even though pool slots available

python dependencies:
apache-airflow==2.1.3
redis==3.5.3
celery==4.4.7
airflow config:
core.parallelism = 1000
core.dag_concurrency = 400
core.max_active_runs_per_dag = 1
celery.worker_concurrency = 6
celery.broker_url = redis
celery.result_backend = mysql
Facing this issue after upgrading to Airflow 2.0.
Not all tasks get stuck in the "queued" state; only 1 task is stuck at a time (out of 100+).
Sometimes tasks get stuck in the "running" state as well.
This issue happens once every 1-2 days.
It happens even though pool slots are available to use (Slots = 128, Running slots = 0, Queued slots = 1).
As per worker logs (for both stuck tasks and normal tasks):
INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[066398d7-98e0-4f17-bd2b-93ae36438577]
INFO/ForkPoolWorker-6] Executing command in Celery: ['airflow', 'tasks', 'run', 'dag_name', 'task01', '2021-10-26T14:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/dag_dir/dag.py']
I see the log line below for all normally working tasks; it is missing for the stuck tasks:
WARNING/ForkPoolWorker-1] Running <TaskInstance: dag.task02 2021-10-26T14:00:00+00:00 [queued]> on host worker02.localdomain

Apache Ignite - Distributed Queue and Executors

I am planning to use Apache Ignite's distributed queue.
I am using Ignite with a Spring Boot application, so on bootup I add 20 names to a queue. But since there are 3 servers in the cluster, the same 20 names get added 3 times. I want to add them to the queue only once.
Ignite ignite = Ignition.ignite();
IgniteQueue<String> queue = ignite.queue(
    "queueName", // Queue name.
    0,           // Queue capacity. 0 for unbounded queue.
    null         // Collection configuration.
);
Distributed executors will be able to poll from the queue and run the task. Here, the executor is expected to poll, run the task, and then add the same name back to the queue, to achieve round robin.
Only one executor should be running a given task at any point in time, even though there are multiple servers in the cluster.
Any suggestions for this?
You can launch an Ignite cluster singleton service (https://apacheignite.readme.io/docs/cluster-singletons) which will fill the queue with data. Alternatively, you can add the data from the coordinator node (the oldest node in the cluster): ignite.cluster().forOldest().node().isLocal()
I fixed the bootup-time duplicate cache loading issue this way:
// The atomic long acts as a cluster-wide counter: only the first node to see 0 loads the cache.
final IgniteAtomicLong cacheLoadCnt = ignite.atomicLong(cacheName + "Cnt", 0, true);
if (cacheLoadCnt.get() == 0) {
    loadCache();
    cacheLoadCnt.addAndGet(1);
}

Replication issue on basic 3-node Disque cluster

I'm hitting a replication issue on a three-node Disque cluster. It seems weird because the use case is fairly typical, so it's entirely possible I'm doing something wrong.
This is how to reproduce it locally:
# relevant disque info
disque_version:1.0-rc1
disque_git_sha1:0192ba7e
disque_git_dirty:0
disque_build_id:b02910aa5c47590a
Start 3 Disque nodes on ports 9001, 9002 and 9003, and then have the servers on ports 9002 and 9003 meet with 9001.
127.0.0.1:9002> CLUSTER MEET 127.0.0.1 9001 #=> OK
127.0.0.1:9003> CLUSTER MEET 127.0.0.1 9001 #=> OK
The HELLO reports the same data for all three nodes, as expected.
127.0.0.1:9003> hello
1) (integer) 1
2) "e93cbbd17ad12369dd2066a55f9d4c51be9c93dd"
3) 1) "b61c63e8fd0c67544f895f5d045aa832ccb47e08"
2) "127.0.0.1"
3) "9001"
4) "1"
4) 1) "b32eb6501e272a06d4c20a1459260ceba658b5cd"
2) "127.0.0.1"
3) "9002"
4) "1"
5) 1) "e93cbbd17ad12369dd2066a55f9d4c51be9c93dd"
2) "127.0.0.1"
3) "9003"
4) "1"
Enqueuing a job succeeds, but the job does not show up in either QLEN or QPEEK on the other nodes.
127.0.0.1:9001> addjob myqueue body 1 #=> D-b61c63e8-IFA29ufvL37FRVjVVWisbO/x-05a1
127.0.0.1:9001> qlen myqueue #=> 1
127.0.0.1:9002> qlen myqueue #=> 0
127.0.0.1:9002> qpeek myqueue 1 #=> (empty list or set)
127.0.0.1:9003> qlen myqueue #=> 0
127.0.0.1:9003> qpeek myqueue 1 #=> (empty list or set)
When explicitly setting a replication value higher than the number of nodes, Disque fails with NOREPL as one would expect. An explicit replication level of 2 succeeds, but the jobs are still nowhere to be seen on nodes 9002 and 9003. The same behavior happens regardless of the node to which I add the job.
My understanding is that replication happens synchronously when calling ADDJOB (unless explicitly using ASYNC), but it doesn't seem to be working properly. The test suite passes on the master branch, so I'm hitting a wall here and will have to dig into the source code; any help will be greatly appreciated!
The job is replicated, but it's enqueued only in one node. Try killing the first node to see the job enqueued in a different one.
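If you want to confirm the copies exist without killing a node, SHOW on the job ID should list the replication set. A rough sketch against the ID above (assuming the usual SHOW output, heavily abbreviated):
127.0.0.1:9002> show D-b61c63e8-IFA29ufvL37FRVjVVWisbO/x-05a1
#=> job details; if replication worked, "nodes-delivered" should include both the
#=> 9001 and 9002 node IDs, even though the job is queued only on the node that accepted the ADDJOB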

Spark execution occasionally gets stuck at mapPartitions at Exchange.scala:44

I am running a Spark job on a two node standalone cluster (v 1.0.1).
Spark execution often gets stuck at the task mapPartitions at Exchange.scala:44.
This happens at the final stage of my job in a call to saveAsTextFile (as I expect from Spark's lazy execution).
It is hard to diagnose the problem because I never experience it in local mode with local IO paths, and occasionally the job on the cluster does complete as expected with the correct output (same output as with local mode).
This seems possibly related to reading a ~170MB file from S3 immediately prior, as I see the following logging in the console:
DEBUG NativeS3FileSystem - getFileStatus returning 'file' for key '[PATH_REMOVED].avro'
INFO FileInputFormat - Total input paths to process : 1
DEBUG FileInputFormat - Total # of splits: 3
...
INFO DAGScheduler - Submitting 3 missing tasks from Stage 32 (MapPartitionsRDD[96] at mapPartitions at Exchange.scala:44)
DEBUG DAGScheduler - New pending tasks: Set(ShuffleMapTask(32, 0), ShuffleMapTask(32, 1), ShuffleMapTask(32, 2))
The last logging I see before the task apparently hangs/gets stuck is:
INFO NativeS3FileSystem: Opening key '[PATH_REMOVED].avro' for reading at position '67108864'
Has anyone else experienced non-deterministic problems related to reading from S3 in Spark?

In celery, how to ensure tasks are retried when worker crashes

First of all, please don't consider this question a duplicate of this question.
I have set up an environment which uses celery with redis as the broker and result_backend. My question is: how can I make sure that when the celery workers crash, all the scheduled tasks are re-tried once the celery worker is back up?
I have seen advice on using CELERY_ACKS_LATE = True, so that the broker will re-deliver the tasks until it gets an ACK, but in my case it's not working. Whenever I schedule a task, it immediately goes to the worker, which holds on to it until the scheduled time of execution. Let me give an example:
I am scheduling a task like this: res=test_task.apply_async(countdown=600), but immediately in the celery worker logs I can see something like: Got task from broker: test_task[a137c44e-b08e-4569-8677-f84070873fc0] eta:[2013-01-...]. Now when I kill the celery worker, these scheduled tasks are lost. My settings:
BROKER_URL = "redis://localhost:6379/0"
CELERY_ALWAYS_EAGER = False
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"
CELERY_ACKS_LATE = True
Apparently this is how celery behaves.
When a worker is abruptly killed (but the dispatching process isn't), the message will be considered 'failed' even though you have acks_late=True.
The motivation (to my understanding) is that if the consumer was killed by the OS due to out-of-memory, there is no point in redelivering the same task.
You may see the exact issue here: https://github.com/celery/celery/issues/1628
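Note also that Celery 4+ has a task_reject_on_worker_lost setting which, as I understand it, targets exactly this case: with it enabled, the message is requeued when the worker process is lost instead of being marked as failed (at the risk of the task looping if it keeps killing its worker). A minimal sketch, assuming Celery 4+ and the lowercase setting names (the module and app names here are made up):
# tasks.py -- illustrative module; broker/backend URLs taken from the question
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

app.conf.update(
    task_acks_late=True,              # ack only after the task has finished
    task_reject_on_worker_lost=True,  # requeue the message if the worker process dies mid-task
)

@app.task(bind=True)
def test_task(self):
    ...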
I actually disagree with this behaviour. IMO it would make more sense not to acknowledge.
I've had this issue where I was using some open-source C libraries that went totally amok and crashed my worker ungracefully without throwing an exception. As a workaround, one can simply wrap the content of a task in a child process and check its status in the parent:
import os

# inside a task defined with @app.task(bind=True):
n = os.fork()
if n > 0:  # parent process
    status = os.wait()  # wait until the child terminates; returns (pid, exit status)
    print("Signal number that killed the child process:", status[1])
    if status[1] > 0:  # the child did not terminate gracefully
        # here one can do whatever they want, like retry or raise an exception
        self.retry(exc=SomeException(), countdown=2 ** self.request.retries)
else:  # child process: here comes the actual task content with its return
    return myResult  # make sure child and parent do not both return