Why is my supervised django celeryd process not accepting tasks? - rabbitmq

We've had a django-celery process with 5 worker processes running in production for ages now. It properly receives and runs tasks. These processes run tasks which are inserted into two queues: live and celery.
The command used to run the celery process is roughly:
manage.py celeryd -E --loglevel=WARNING --concurrency=5 \
--settings=django_settings.production_celery -Q live,celery
I've now just built a new system which is supposed to process different tasks on a different queue called foobar. These celery processes are run with a command roughly like:
manage.py celeryd -E --loglevel=WARNING --concurrency=5 \
--settings=django_settings.production_foobar -Q foobar
However when I attempt to run tasks in the new queue using my_task.apply_async(queue='foobar'), the result object remains in a PENDING state indefinitely.
Through logging I have determined that the foobar workers never receive the task. So now I'm trying to debug at what point the task message is being lost.
(We use RabbitMQ as our AMQP message broker.)
How can I determine the current contents of a celery queue? Can I directly inspect the contents of the RabbitMQ queue?
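One hedged way to check (assuming RabbitMQ's default vhost and that rabbitmqctl is run on the broker host; the celery inspect form shown is the newer CLI, so older django-celery setups may need the equivalent manage.py subcommand):
# List every queue with how many messages are waiting in it
rabbitmqctl list_queues name messages messages_ready messages_unacknowledged
# Show which exchanges and routing keys are bound to which queues
rabbitmqctl list_bindings
# Ask the running workers which queues they actually consume from
celery inspect active_queues
If the foobar queue never shows up in list_queues, or no worker reports it under active_queues, the task message is being routed elsewhere (or dropped as unroutable) before the new workers ever see it.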

Related

Failure to execute tasks by Celery after a month

My program runs specific tasks daily; they are scheduled by django-celery-beat.
Recently I noticed that the tasks are not being performed, and the only thing that fixes it is restarting the Celery service managed by supervisorctl.
command=/opt/taskjo/taskjo-venv/bin/celery -A taskjo worker --pool=gevent --autoscale 4,2 -B -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler --loglevel=INFO
I added these items
--pool=gevent --autoscale 4,2
New error in the log:
Traceback (most recent call last):
File "/home/ubuntu/venv/lib/python3.5/site-packages/billiard/pool.py", line 1224, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
Recently I have been running 4 Odoo services on a server with 6 GB of RAM, and a large amount of the RAM is occupied.
What happens:
Celery is sent several tasks but cannot execute them, and only when the worker is restarted are all the queued tasks processed.
Do you have a solution?
WorkerLostError: Worker exited prematurely: exitcode 0 when shutting down worker #273
WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM). #6291
The problem was solved.
Steps taken:
Checked the server's RAM (it was almost 98% full).
Reviewed the Celery configuration (about one gigabyte of RAM was occupied because of the number of workers).
Shut down a number of services and changed the Celery configuration (see the command sketch after the config below).
As a result, it has been working without problems for several days.
The config was replaced with the following:
[program:celery]
directory=/opt/taskjo/taskjo
command=/opt/taskjo/taskjo-venv/bin/celery -A taskjo worker -B -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler --loglevel=INFO
;user=taskjo
;numprocs=1
stdout_logfile=/opt/taskjo/logs/celery/worker-access.log
stderr_logfile=/opt/taskjo/logs/celery/worker-error.log
stdout_logfile_maxbytes=50
stderr_logfile_maxbytes=50
stdout_logfile_backups=10
stderr_logfile_backups=10
autostart=true
autorestart=true
startsecs=10
; Need to wait for currently executing tasks to finish at shutdown.
; Increase this if you have very long running tasks.
stopwaitsecs = 600
; Causes supervisor to send the termination signal (SIGTERM) to the whole process group.
stopasgroup=true
; Set Celery priority higher than default (999)
; so, if rabbitmq is supervised, it will start first.
priority=1000
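As a rough sketch of the checks described above (the paths reuse the taskjo virtualenv from this config; the concurrency value is only an example):
# See how much RAM is actually free on the server
free -h
# Ask the running workers for pool and resource statistics
/opt/taskjo/taskjo-venv/bin/celery -A taskjo inspect stats
# Run the worker with a small fixed pool instead of autoscaling
/opt/taskjo/taskjo-venv/bin/celery -A taskjo worker --concurrency=2 -l info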

celery flower does not show previously run tasks after restart

I am using celery with django and using flower to inspect tasks. I am using rabbitmq as the broker. It all works fine, but after I restart flower, the previously listed task states are lost and I see 0 entries in flower.
I am running flower in persistent mode too.
python -m celery -A curatepro -l debug flower --persistent
--db=/var/lib/flower/flowerdb
There is a similar question asked on this SO post:
Celery Flower - how can i load previous catched tasks?
It suggests using the --persistent flag, and I am using it, but it still doesn't seem to work.
You have to set the flag as --persistent=True for this to work.
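For example, keeping the app name and database path from the question (both are the asker's, not defaults), the invocation would look roughly like:
python -m celery -A curatepro -l debug flower --persistent=True --db=/var/lib/flower/flowerdb
The task history is then only as durable as the file passed to --db, so that directory must exist and be writable by the user running flower.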

Flink job started from another program on YARN fails with "JobClientActor seems to have died"

I'm a new Flink user and I have the following problem.
I use Flink on a YARN cluster to transfer related data extracted from an RDBMS to HBase.
I wrote a Flink batch application in Java with multiple ExecutionEnvironments (one per RDBMS table, to transfer the table rows in parallel) that transfers the tables one by one sequentially (because the call to env.execute() is blocking).
I start the YARN session like this:
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/yarn-session.sh -n 1 -s 4 -d -jm 2048 -tm 8096
Then I run my application on the started YARN session via a shell script, transfer.sh. Its contents are:
#!/bin/bash
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/flink run -p 4 transfer.jar
When I start this script manually from the command line, it works fine: jobs are submitted to the YARN session one by one without errors.
Now I need to run this script from another Java program.
For this aim I use
Runtime.exec("transfer.sh");
(Maybe there are better ways to do this? I have looked at the REST API, but there are some difficulties because the job manager is proxied by YARN.)
At the beginning it works as usual: the first several jobs are submitted to the session and finish successfully. But the following jobs are not submitted to the YARN session.
In /opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log I see this error (and no other errors are found at DEBUG level):
The program execution failed: JobClientActor seems to have died before the JobExecutionResult could be retrieved.
I tried to analyse this problem myself and found that the error occurs in the JobClient class while it sends a ping request with a timeout to the JobClientActor (i.e. the YARN cluster).
I tried increasing several heartbeat and timeout options, such as akka.*.timeout, akka.watch.heartbeat.* and yarn.heartbeat-delay, but it didn't solve the problem: new jobs are still not submitted to the YARN session from CliFrontend.
The environment in both cases (manual call and call from another program) is the same. When I run
$ ps axu | grep transfer
it gives me this output:
/usr/lib/jvm/java-8-oracle/bin/java -Dlog.file=/opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log -Dlog4j.configuration=file:/opt/flink-1.3.1/conf/log4j-cli.properties -Dlogback.configurationFile=file:/opt/flink-1.3.1/conf/logback.xml -classpath /opt/flink-1.3.1/lib/flink-metrics-graphite-1.3.1.jar:/opt/flink-1.3.1/lib/flink-python_2.11-1.3.1.jar:/opt/flink-1.3.1/lib/flink-shaded-hadoop2-uber-1.3.1.jar:/opt/flink-1.3.1/lib/log4j-1.2.17.jar:/opt/flink-1.3.1/lib/slf4j-log4j12-1.7.7.jar:/opt/flink-1.3.1/lib/flink-dist_2.11-1.3.1.jar:::/etc/hadoop/conf org.apache.flink.client.CliFrontend run -p 4 transfer.jar
I also tried updating Flink to the 1.4.0 release and changing the job parallelism (even to -p 1), but the error still occurred.
I have no idea what could be different. Is there any workaround?
Thank you for any help.
Finally I found out how to resolve the error.
Just replace Runtime.exec(...) with new ProcessBuilder(...).inheritIO().start().
I really don't know why the call to inheritIO() helps in this case, because as I understand it, it just redirects the IO streams from the child process to the parent process.
But I have checked that if I comment out that line of code, the program starts failing again.

Celery node fails, pidbox already in use on restart

I have Celery running with RabbitMQ broker.
Today I had a failure of a Celery node: it didn't execute tasks and didn't respond to the service celeryd stop command. After a few attempts the node stopped, but on start I get this message:
[WARNING/MainProcess] celery#nodename ready.
[WARNING/MainProcess] /home/ubuntu/virtualenv/project_1/local/lib/python2.7/site-packages/kombu/pidbox.py:73: UserWarning: A node named u'nodename' is already using this process mailbox!
Maybe you forgot to shutdown the other node or did not do so properly?
Or if you meant to start multiple nodes on the same host please make sure
you give each node a unique node name!
warnings.warn(W_PIDBOX_IN_USE % {'hostname': self.hostname})
Can anyone suggest how to unlock the process mailbox?
According to http://celery.readthedocs.org/en/latest/userguide/workers.html#starting-the-worker, you might need to name each node uniquely. Example:
$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker1.%h
In supervisor, escape the % by using %%h.
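For instance, a hedged sketch of the corresponding supervisor command line (the app name proj is a placeholder, as in the example above):
command=celery -A proj worker --loglevel=INFO --concurrency=10 -n worker1.%%h
Supervisor turns %% into a literal %, and Celery then expands %h to the host name, so each node gets a unique name.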
A large log file or not enough free space was the reason, I think.
After deleting it, everything is OK.

Fork shell script (not &)

I'm accessing a webserver via PHP. I want to update some info in the Apache configs, so I start a shell script that makes the changes. Then I want to stop and restart Apache.
Problem: as soon as I stop Apache, my process stops and my shell script, being a child process, is killed. Apache never restarts. This also happens with Apache restart.
Is there a way to fork an independent, non-child process for the shell script, so I can restart Apache?
Thx,
Mr B
You can use disown:
disown [-ar] [-h] [jobspec ...]
Without options, each jobspec is removed from the table of active jobs. If the `-h' option is given, the job is not removed from the table, but is marked so that SIGHUP is not sent to the job if the shell receives a SIGHUP. If jobspec is not present, and neither the `-a' nor `-r' option is supplied, the current job is used. If no jobspec is supplied, the `-a' option means to remove or mark all jobs; the `-r' option without a jobspec argument restricts operation to running jobs.
./myscript.sh &
disown
./myscript.sh will continue running even if the script that started it dies.
Take a look at nohup; it may fit your needs.
Let's say you have a script called test.sh:
#!/bin/bash
# Append a counter to test.temp once per second for 100 seconds
for i in $(seq 100); do
    echo $i >> test.temp
    sleep 1
done
If you run nohup ./test.sh & you can kill the shell and the process stays alive.
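Applied back to the original Apache question, a hedged sketch (restart_apache.sh is a hypothetical name for the script that edits the configs and restarts Apache):
nohup ./restart_apache.sh > /tmp/restart_apache.log 2>&1 &
Redirecting stdout and stderr matters if you launch this from PHP's exec(), which otherwise waits for the command's output; with nohup and the redirect, the script keeps running while Apache stops and starts.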