Airflow 2.0 - Task stuck in queued state even though pool slots available - redis

python dependencies:
apache-airflow==2.1.3
redis==3.5.3
celery==4.4.7
airflow config:
core.parallelism = 1000
core.dag_concurrency = 400
core.max_active_runs_per_dag = 1
celery.worker_concurrency = 6
celery.broker_url = redis
celery.result_backend = mysql
Facing this issue after upgrade to airflow 2.0
Not all tasks stuck in "queued" state, only 1 task stuck at a time (out of 100+).
Sometimes tasks stuck in "running" state also.
This issue happens once in 1-2 days.
This happens even though pool slots available to use (Slots = 128, Running slots = 0, Queued slots = 1)
As per worker logs (for both stuck tasks and normal tasks):
INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[066398d7-98e0-4f17-bd2b-93ae36438577]
INFO/ForkPoolWorker-6] Executing command in Celery: ['airflow', 'tasks', 'run', 'dag_name', 'task01', '2021-10-26T14:00:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/dag_dir/dag.py']
I see below log for all normal working tasks, this is missing for the stuck tasks:
WARNING/ForkPoolWorker-1] Running <TaskInstance: dag.task02 2021-10-26T14:00:00+00:00 [queued]> on host worker02.localdomain

Related

High performance Airflow Celery Workers do not work

I have
2 Airflow and Celary nodes (node 1, node 2)
1 Postgresql node (node 3)
2 RebbitMQ nodes (node 4, node 5)
I want to implement failover of Сelary workers.
Node 1 - Airflow(webserver, scheduler), Celary(workers_1, flower_1)
Node 2 - Celary(workers_2, flower_2)
I run tasks that write the current timestamp to the database for 5 minutes. After that, I kill the worker_1 (I use kill ... (pid worker) , or systemctl stop/restart worker.service). I expect that the Celary, having received an error, will start re-executing the task on worker_2, but the task is not re-executed.
I tried adding variables to airflow.cfg and ./airflow-certified.
task_acks_late=True
CELERY_ACKS_LATE=True
CELERY_TASK_REJECT_ON_WORKER_LOST=True
task_reject_on_worker_lost=True
task_track_started=True
But it did not help.
How can I make it so that if my worker dies, the task starts again on another worker?

Long running jobs redelivering after broker visibility timeout with celery and redis

I am using celery 4.3 + Redis + flower. I have a few long-running jobs with acks_late=True and task_reject_on_worker_lost=True. I am using celery grouping to group jobs run parallelly and append result and use in parent job.
In this scenario, my few jobs will run more than an hour, after every one hour the same child jobs are redelivering again to the worker.
The sample jobs as below.
#app.task(queue='q1', bind=True, acks_late=True, task_reject_on_worker_lost=True, max_retries=3)
def job_1(self):
do_something()
task_group = group(job_2.s(batch) for batch in range(0, len([1,2,3,4,5,6]), 3))
result_group = task_group.apply_async()
#app.task(queue='q1', bind=True, acks_late=True, task_reject_on_worker_lost=True, max_retries=3)
def job_2(self, batch):
do_something()
return result
The above job_2 will run more than hour and after one hour the same is redelivering again to the worker.
My celery setup and config as shown below:
c = Celery(app.import_name,
backend=app.config['CELERY_RESULT_BACKEND'],
broker=app.config['CELERY_BROKER_URL'])
config.py
CELERY_BROKER_URL = os.environ['CELERY_BROKER_URL']
CELERY_RESULT_BACKEND = os.environ['CELERY_RESULT_BACKEND']
CELERY_BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 36000}
I tried to increase visibility timeout to 10 hours after the redelivering issue like above in configuration. but it looks like not working.
Please help with this issue and let me know if any solution.

Inconsistent behavior of Quartz2 scheduler in Apache Camel

I have an Apache Camel project that is using Quartz2 as the scheduler. The requirement is to make it a cluster. The code is deployed to weblogic 12c. the quartz is configured as per many samples with clustering enabled.
This is my properties file (without the datasource)
org.quartz.scheduler.instanceName = MyScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 10
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
org.quartz.jobStore.useProperties=true
org.quartz.JobBuilder.requestRecovery=true
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
When I deploy and start both nodes I see that the QRTZ_SCHEDULER_STATE table has extra entry for one of the nodes:
MyScheduler-routerContext server_node21567108546690
MyScheduler-routerContext-1 server_node11565896495100
MyScheduler-routerContext-1 server_node11567108547295
And I am guessing because of that the one node is being called once in a while while the other node gets called all the time (so occasionally both nodes are invoked at the same time).
I have tried to do a clean restart of weblogic nodes but the issue is still there
This is how my route(s) look like:
from("quartz2://provRegGroup/createUsersTrigger?cron={{create_users_cron}}&job.name=createUsersJob")
.routeId("createUsersRB")
.log("**** starting check for create users");
//where
//create_users_cron=0+0,5,10,15,20,25,30,35,40,45,50,55+*+*+*+?
//expecting one node being called by the scheduler at a time..
I figured out what caused the issue. apparently there were orphan weblogic processes that were running on one (or even both nodes) - this would be a question to our tech archs - why this was such a mess.. ps was showing two weblogic servers running on a node - one that I started recently and one that was there for say a month..
expecting this would never happen to production environment I assume the issue has been resolved..

Spark execution occasionally gets stuck at mapPartitions at Exchange.scala:44

I am running a Spark job on a two node standalone cluster (v 1.0.1).
Spark execution often gets stuck at the task mapPartitions at Exchange.scala:44.
This happens at the final stage of my job in a call to saveAsTextFile (as I expect from Spark's lazy execution).
It is hard to diagnose the problem because I never experience it in local mode with local IO paths, and occasionally the job on the cluster does complete as expected with the correct output (same output as with local mode).
This seems possibly related to reading from s3 (of a ~170MB file) immediately prior, as I see the following logging in the console:
DEBUG NativeS3FileSystem - getFileStatus returning 'file' for key '[PATH_REMOVED].avro'
INFO FileInputFormat - Total input paths to process : 1
DEBUG FileInputFormat - Total # of splits: 3
...
INFO DAGScheduler - Submitting 3 missing tasks from Stage 32 (MapPartitionsRDD[96] at mapPartitions at Exchange.scala:44)
DEBUG DAGScheduler - New pending tasks: Set(ShuffleMapTask(32, 0), ShuffleMapTask(32, 1), ShuffleMapTask(32, 2))
The last logging I see before the task apparently hangs/gets stuck is:
INFO NativeS3FileSystem: INFO NativeS3FileSystem: Opening key '[PATH_REMOVED].avro' for reading at position '67108864'
Has anyone else experience non-deterministic problems related to reading from s3 in Spark?

In celery, how to ensure tasks are retried when worker crashes

First of all please don't consider this question as a duplicate of this question
I have a setup an environment which uses celery and redis as broker and result_backend. My question is how can I make sure that when the celery workers crash, all the scheduled tasks are re-tried, when the celery worker is back up.
I have seen advice on using CELERY_ACKS_LATE = True , so that the broker will re-drive the tasks until it get an ACK, but in my case its not working. Whenever I schedule a task its immediately goes to the worker which persists it until the scheduled time of execution. Let me give some example:
I am scheduling a task like this: res=test_task.apply_async(countdown=600) , but immediately in celery worker logs i can see something like : Got task from broker: test_task[a137c44e-b08e-4569-8677-f84070873fc0] eta:[2013-01-...] . Now when I kill the celery worker, these scheduled tasks are lost. My settings:
BROKER_URL = "redis://localhost:6379/0"
CELERY_ALWAYS_EAGER = False
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"
CELERY_ACKS_LATE = True
Apparently this is how celery behaves.
When worker is abruptly killed (but dispatching process isn't), the message will be considered as 'failed' even though you have acks_late=True
Motivation (to my understanding) is that if consumer was killed by OS due to out-of-mem, there is no point in redelivering the same task.
You may see the exact issue here: https://github.com/celery/celery/issues/1628
I actually disagree with this behaviour. IMO it would make more sense not to acknowledge.
I've had the issue, where I was using some open-source C libraries that went totaly amok and crashed my worker ungraceful without throwing an exception. For any reason whatsoever, one can simply wrap the content of a task in a child process and check its status in the parent.
n = os.fork()
if n > 0: //inside the parent process
status = os.wait() //wait until child terminates
print("Signal number that killed the child process:", status[1])
if status[1] > 0: // if the signal was something other then graceful
// here one can do whatever they want, like restart or throw an Exception.
self.retry(exc=SomeException(), countdown=2 ** self.request.retries)
else: // here comes the actual task content with its respected return
return myResult // Make sure there are not returns in child and parent at the same time.