Celery node fails with "pidbox already in use" on restart - RabbitMQ

I have Celery running with a RabbitMQ broker.
Today one Celery node failed: it stopped executing tasks and did not respond to the service celeryd stop command. After a few attempts the node stopped, but on start I get this message:
[WARNING/MainProcess] celery@nodename ready.
[WARNING/MainProcess] /home/ubuntu/virtualenv/project_1/local/lib/python2.7/site-packages/kombu/pidbox.py:73: UserWarning: A node named u'nodename' is already using this process mailbox!
Maybe you forgot to shutdown the other node or did not do so properly?
Or if you meant to start multiple nodes on the same host please make sure
you give each node a unique node name!
warnings.warn(W_PIDBOX_IN_USE % {'hostname': self.hostname})
Can anyone suggest how to unlock the process mailbox?

According to http://celery.readthedocs.org/en/latest/userguide/workers.html#starting-the-worker you might need to give each node a unique name. Example:
$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker1.%h
In a Supervisor config, escape the percent sign by writing %%h.
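The naming and escaping rules in this answer can be sketched in Python. This is only an illustration: unique_node_name and escape_for_supervisor are hypothetical helper names, not part of Celery or Supervisor, and the "." separator simply mirrors the example command (newer Celery documentation writes worker1@%h instead).

```python
import socket


def unique_node_name(prefix, hostname=None, separator="."):
    """Build a per-host node name such as 'worker1.myhost'.

    Mirrors what `-n worker1.%h` asks Celery to do; the helper itself
    is hypothetical and only shows the resulting shape of the name.
    """
    host = hostname or socket.gethostname()
    return f"{prefix}{separator}{host}"


def escape_for_supervisor(command):
    """Double every '%' so Supervisor's own %-expansion leaves it alone."""
    return command.replace("%", "%%")
```

For example, escape_for_supervisor("celery -A proj worker -n worker1.%h") yields the command with %%h, which is the form to paste into the Supervisor program section.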

I think the cause was a large log file or not enough free disk space.
After deleting it, everything works.

Related

Celery worker receives unregistered task from celerybeat run by systemd

On my staging server I had a Celery worker (4.3.0) and Celery beat running as daemons via systemd, with RabbitMQ as the broker. Everything was fine for a few weeks until, 4 days ago, a connection error occurred between Celery and amqp through kombu: [Errno 104] Connection reset by peer, raised after startup.
I wasn't paying much attention to the server logs, since the project is still a work in progress, but when I tried to deploy the newest version of the code I realized that something was wrong with the worker.
I googled the issue and this is what came up:
https://github.com/celery/celery/issues/4867
The easy solution was to downgrade Celery to 4.1.1 and wait for a fix in a future stable release.
I removed celery, amqp, billiard and kombu from my venv and installed celery 4.1.1, which pulled in those packages at the appropriate versions.
At the moment the celery and celerybeat services are active and celerybeat sends tasks to the Celery worker, but the Celery log shows an error (see the error message after the downgrade). This is odd, because I haven't changed anything in the task declarations or in my settings (which may be the issue here).
The weirdest thing is that if I shut down the systemd services and run them with the command:
celery -A celery_cfg:app worker -B --loglevel=DEBUG
all current tasks are processed just as before. So the celery and celerybeat configs seem to work as they are.
Things I have tried so far:
1) Made sure all modules are imported without relative imports.
2) In the past I had an issue with missing packages in the venv --> they are up to date.
3) Restarted celery/celerybeat/gunicorn/systemd/rabbitmq and the server itself.
4) Double-checked the paths in the systemd services (though maybe I have been debugging this too long and just can't see a typo).
5) Tried the development version 4.4.0rc2 (the Celery worker won't start).
6) Checked that INSTALLED_APPS contains all required apps.
Error message after downgrading Celery
[2019-06-16 19:35:00,092: ERROR/MainProcess] Received unregistered task of type 'apps.mailing.tasks.execute_sending_system_mail'.
The message has been ignored and discarded.
Did you remember to import the module containing this task?
Or maybe you're using relative imports?
Please see
http://docs.celeryq.org/en/latest/internals/protocol.html
for more information.
The full contents of the message body was:
'[[], {}, {"callbacks": null, "errbacks": null, "chain": null, "chord": null}]' (77b)
Traceback (most recent call last):
File "/home/user/apps/venv/loans/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 557, in on_task_received
strategy = strategies[type_]
KeyError: 'apps.mailing.tasks.execute_sending_system_mail'
Celery Service Systemd Code
[Unit]
Description=Celery Service
After=network.target
[Service]
Type=forking
User=<user>
Group=<user>
EnvironmentFile=/etc/default/celery
WorkingDirectory=/home/<user>/apps/loans
ExecStart=/bin/sh -c '${CELERY_BIN} multi start ${CELERYD_NODES} \
-A ${CELERY_APP} --pidfile=${CELERYD_PID_FILE} \
--logfile=${CELERYD_LOG_FILE} --loglevel=${CELERYD_LOG_LEVEL} ${CELERYD_OPTS}'
ExecStop=/bin/sh -c '${CELERY_BIN} multi stopwait ${CELERYD_NODES} \
--pidfile=${CELERYD_PID_FILE}'
ExecReload=/bin/sh -c '${CELERY_BIN} multi restart ${CELERYD_NODES} \
-A ${CELERY_APP} --pidfile=${CELERYD_PID_FILE} \
Celery Beat Service Systemd Code
[Unit]
Description=Celery Beat Service
After=network.target
[Service]
Type=simple
User=user
Group=user
EnvironmentFile=/etc/default/celery
WorkingDirectory=/home/user/apps/loans
ExecStart=/bin/sh -c '${CELERY_BIN} beat \
-A ${CELERY_APP} --pidfile=${CELERYBEAT_PID_FILE} \
--logfile=${CELERYBEAT_LOG_FILE} --loglevel=${CELERYD_LOG_LEVEL}'
[Install]
WantedBy=multi-user.target
Conf file for variables
CELERYD_NODES="w1"
CELERY_BIN="/home/user/apps/venv/loans/bin/celery"
CELERY_APP="celery_cfg:app"
CELERYD_MULTI="multi"
CELERYD_OPTS=""
CELERYD_PID_FILE="/home/user/apps/pids/celery/%n.pid"
CELERYD_LOG_FILE="/home/user/apps/logs/celery/%n%I.log"
CELERYD_LOG_LEVEL="INFO"
CELERYBEAT_PID_FILE="/home/user/apps/pids/celery/beat.pid"
CELERYBEAT_LOG_FILE="/home/user/apps/logs/celery/beat.log"
celery_cfg file
from celery import Celery
from celery.schedules import crontab
from django.conf import settings

app = Celery('loans_apps')
app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)
app.set_default()

# <==== CELERY BEAT PERIODIC TASKS ====>
app.conf.beat_schedule = {
    'execute_sending_system_mail': {
        'task': 'apps.mailing.tasks.execute_sending_system_mail',
        'schedule': crontab(minute='*/5'),
        'args': (),
    },
}


@app.task(bind=True)
def debug_task(self):
    print('Request: {0!r}'.format(self.request))
Excerpt from settings containing the Celery config variables
BROKER_URL = 'amqp://localhost//'
CELERY_ENABLE_UTC = True
I know I could try running celery and celerybeat without systemd, but I treat that as a last resort. I'd like to keep the configuration as it was, even though I have no clue what is wrong with it.
EDIT
By accident, and guided by a friend, I just found out that both the celery and celerybeat services seem to work fine when run as user root. That is obviously not a solution, but it narrows down the number of possible causes.
It would be rude to leave the question unanswered, so even though the answer comes from me, here it is:
If anyone ever encounters this issue, after following the steps I listed above, check the permissions of the directories that celery and celerybeat use. You might have created them as root, which can lead to the issue described. Good luck to everyone in the future!
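The permission check described in this answer can be sketched as a small Python script. It is a minimal sketch under assumptions: unwritable_dirs is a hypothetical helper, the paths are the pid and log file settings quoted in the question, and the script has to be run as the service user (running it as root would report no problems, because root can write almost anywhere).

```python
import os

# Pid and log file settings copied from the conf file in the question.
CELERY_PATHS = [
    "/home/user/apps/pids/celery/%n.pid",
    "/home/user/apps/logs/celery/%n%I.log",
    "/home/user/apps/pids/celery/beat.pid",
    "/home/user/apps/logs/celery/beat.log",
]


def unwritable_dirs(paths):
    """Return parent directories that are missing or not writable.

    Each directory is reported once, in the order it was first seen.
    The check uses the permissions of the user running this script.
    """
    problems = []
    for path in paths:
        directory = os.path.dirname(path) or "."
        usable = os.path.isdir(directory) and os.access(
            directory, os.W_OK | os.X_OK
        )
        if not usable and directory not in problems:
            problems.append(directory)
    return problems
```

Calling unwritable_dirs(CELERY_PATHS) as the user named in the unit files lists the directories that need their ownership or mode fixed; an empty list means the pid and log directories are usable by that user.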

celery flower does not show previously run tasks after restart

I am using Celery with Django and Flower to inspect tasks, with RabbitMQ as the broker. It all works fine, but after I restart Flower the previously listed task states are lost and I see 0 entries in Flower.
I am also running Flower in persistent mode:
python -m celery -A curatepro -l debug flower --persistent \
--db=/var/lib/flower/flowerdb
There is a similar question in this SO post:
Celery Flower - how can i load previous catched tasks?
It suggests using the --persistent flag, which I am already doing, but it still doesn't seem to work.

You have to pass the flag as --persistent=True for this to work.
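The same two settings can also be kept in a Flower config file, which is plain Python. This is a minimal sketch, assuming Flower picks up a flowerconfig.py from the directory it is started in (the documented default; worth verifying for the installed version) and that the database path from the question is writable by the user running Flower.

```python
# flowerconfig.py -- assumed to sit in the directory Flower is started from.

# Equivalent of --persistent=True: keep task state across restarts.
persistent = True

# Equivalent of --db=...: the state file path from the question.
db = "/var/lib/flower/flowerdb"
```

With this file in place the command line no longer needs the two flags, which removes the risk of passing --persistent without a value.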

Flink job started from another program on YARN fails with "JobClientActor seems to have died"

I'm a new Flink user and I have the following problem.
I use Flink on a YARN cluster to transfer related data extracted from an RDBMS to HBase.
I wrote a Flink batch application in Java with multiple ExecutionEnvironments (one per RDB table, to transfer the table's rows in parallel) that transfers the tables one by one, sequentially (because the call to env.execute() is blocking).
I start the YARN session like this:
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/yarn-session.sh -n 1 -s 4 -d -jm 2048 -tm 8096
Then I run my application on the started YARN session via the shell script transfer.sh. Its content is:
#!/bin/bash
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/flink run -p 4 transfer.jar
When I start this script manually from the command line it works fine: jobs are submitted to the YARN session one by one without errors.
Now I need to be able to run this script from another Java program.
For this I use
Runtime.exec("transfer.sh");
(Are there better ways to do this? I have looked at the REST API, but there are some difficulties because the job manager is proxied by YARN.)
At the beginning it works as usual: the first several jobs are submitted to the session and finish successfully. But the following jobs are not submitted to the YARN session.
In /opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log I see this error (no other errors appear, even at DEBUG level):
The program execution failed: JobClientActor seems to have died before the JobExecutionResult could be retrieved.
I tried to analyse the problem myself and found that the error occurs in the JobClient class while sending a ping request with a timeout to the JobClientActor (i.e. the YARN cluster).
I tried increasing several heartbeat and timeout options such as akka.*.timeout, akka.watch.heartbeat.* and yarn.heartbeat-delay, but that doesn't solve the problem: new jobs are still not submitted to the YARN session from CliFrontend.
The environment is the same in both cases (manual call and call from another program). When I call
$ ps axu | grep transfer
it gives me this output:
/usr/lib/jvm/java-8-oracle/bin/java -Dlog.file=/opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log -Dlog4j.configuration=file:/opt/flink-1.3.1/conf/log4j-cli.properties -Dlogback.configurationFile=file:/opt/flink-1.3.1/conf/logback.xml -classpath /opt/flink-1.3.1/lib/flink-metrics-graphite-1.3.1.jar:/opt/flink-1.3.1/lib/flink-python_2.11-1.3.1.jar:/opt/flink-1.3.1/lib/flink-shaded-hadoop2-uber-1.3.1.jar:/opt/flink-1.3.1/lib/log4j-1.2.17.jar:/opt/flink-1.3.1/lib/slf4j-log4j12-1.7.7.jar:/opt/flink-1.3.1/lib/flink-dist_2.11-1.3.1.jar:::/etc/hadoop/conf org.apache.flink.client.CliFrontend run -p 4 transfer.jar
I also tried updating Flink to the 1.4.0 release and changing the job's parallelism (even to -p 1), but the error still occurs.
I have no idea what could be different. Is there any workaround?
Thank you for any help.

I finally found out how to resolve this error.
Just replace Runtime.exec(...) with new ProcessBuilder(...).inheritIO().start().
I don't really know why the call to inheritIO helps in this case because, as I understand it, it just redirects the child process's IO streams to the parent process.
But I have verified that if I comment out this line of code, the program starts failing again.
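A likely explanation, offered as an assumption rather than something confirmed in this thread: a child started with Runtime.exec gets its stdout and stderr connected to OS pipes, and those pipes have a limited buffer. If the parent never reads them, the child blocks on write once the buffer is full, which would fit the pattern of the first few jobs succeeding and later ones hanging. inheritIO() sidesteps this by letting the child write straight to the parent's streams. The sketch below reproduces the mechanism in Python instead of Java; child_finishes and CHILD are hypothetical names used only for this demonstration.

```python
import subprocess
import sys

# A child process that writes 8 MiB to stdout, far more than a typical
# OS pipe buffer can hold.
CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write('x' * (8 * 1024 * 1024))",
]


def child_finishes(stdout_target, timeout=3.0):
    """Start the child and report whether it exits within `timeout` seconds.

    With subprocess.PIPE the parent deliberately never reads the pipe,
    imitating a parent that ignores the child's output.
    """
    proc = subprocess.Popen(CHILD, stdout=stdout_target)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # The child is stuck writing into a full pipe; clean it up.
        proc.kill()
        proc.wait()
        return False
    finally:
        if proc.stdout is not None:
            proc.stdout.close()
```

On typical systems child_finishes(subprocess.PIPE) should report that the child never exits, while child_finishes(subprocess.DEVNULL) or child_finishes(None), which inherits the parent's stdout much like inheritIO(), should report a normal exit. If this is the cause, draining the streams in the Java parent would be an alternative to inheriting them.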

yarn not getting nodes

This is on an AWS EMR cluster with 2 task nodes and a master.
I'm trying hello-samza, which launches a YARN job. The job gets stuck in the ACCEPTED state. I looked at other posts and it seems that my YARN is getting no nodes. Any help on why YARN is not getting the task nodes would be appreciated.
[hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn node -list
17/04/18 23:30:45 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
Total Nodes:0
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
[hadoop@xxx hello-samza]$ deploy/yarn/bin/yarn application -list -appStates ALL
17/04/18 23:26:30 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
Total number of applications (application-types: [] and states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1492557889328_0001 wikipedia-parser_1 Samza hadoop default ACCEPTED UNDEFINED 0% N/A

I wrote a complete answer for a similar case I experienced: have a look at it, it might be this kind of configuration issue.

It seems the NodeManagers are not running on either node (either never started or exited with an error). Use the jps command to check whether all the daemons associated with YARN are running on the two nodes. Additionally, check both NodeManager logs to see whether an exception might have killed them.
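The jps check suggested here can be automated with a few lines of Python. This is a minimal sketch: missing_daemons is a hypothetical helper, it assumes the usual jps output of one "<pid> <ClassName>" pair per line, and the list of expected daemons is a guess that has to be adapted to each node's role.

```python
def missing_daemons(jps_output, expected=("NodeManager",)):
    """Return the expected daemon names that do not appear in jps output.

    `jps_output` is the text printed by the jps command on one node.
    """
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each useful line looks like "<pid> <ClassName>".
        if len(parts) >= 2:
            running.add(parts[1])
    return [name for name in expected if name not in running]
```

On a task node, feeding it the captured output of jps and getting ["NodeManager"] back would confirm the diagnosis above; on the master the expected tuple would be ("ResourceManager",) instead.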

Why is my supervised django celeryd process not accepting tasks?

We've had a django-celery process with 5 worker processes running in production for ages now. It properly receives and runs tasks. These processes run tasks which are inserted into two queues: live and celery.
The command used to run the celery process is roughly:
manage.py celeryd -E --loglevel=WARNING --concurrency=5 \
--settings=django_settings.production_celery -Q live,celery
I've now just built a new system which is supposed to process different tasks on a different queue called foobar. These celery processes are run with a command roughly like:
manage.py celeryd -E --loglevel=WARNING --concurrency=5 \
--settings=django_settings.production_foobar -Q foobar
However when I attempt to run tasks in the new queue using my_task.apply_async(queue='foobar'), the result object remains in a PENDING state indefinitely.
Through logging I have determined that the foobar workers never receive the task. So now I'm trying to debug at what point the task message is being lost.
(We use RabbitMQ as our AMQP message broker.)
How can I determine the current contents of a celery queue? Can I directly inspect the contents of the RabbitMQ queue?