Celery worker receives unregistered task from celerybeat run by systemd - rabbitmq

On my staging server I had my celery worker (4.3.0) and celery beat up and running as daemons via systemd, with RabbitMQ as the broker. Everything was fine for a few weeks, until 4 days ago some sort of connection error occurred between celery and amqp through kombu: [Errno 104] Connection reset by peer after started.
I wasn't paying much attention to the server logs, since the project is still in the WIP stage, but when I tried to deploy the newest version of the code, I realized that something was wrong with the worker.
I googled the issue and this is what popped up:
https://github.com/celery/celery/issues/4867
The easy solution was to downgrade celery to 4.1.1 and wait for a fix in a future stable release.
I removed celery, amqp, billiard and kombu from my venv and installed celery 4.1.1, which pulled in the above packages in the appropriate versions.
At the moment the celery and celerybeat services are active and celerybeat sends the tasks to the celery worker, but the celery log shows me an error message (please see the error output after the downgrade below). It is weird, because I haven't changed anything in the task declarations or my settings (which may be the issue here).
The weirdest thing is that if I shut down the systemd services and run everything with the command:
celery -A celery_cfg:app worker -B --loglevel=DEBUG
all current tasks are processed, just like the past ones. So the celery and celerybeat configs as such seem to be working.
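For what it's worth, here is a quick way to double-check which task names actually get registered - a sketch run from python manage.py shell in the same venv; it only shows registration in that shell process, not inside the systemd-started worker:
# run inside `python manage.py shell` in the project venv
from django.conf import settings
from celery_cfg import app

# force=True imports the tasks modules immediately instead of lazily
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS, force=True)
print([name for name in sorted(app.tasks) if not name.startswith('celery.')])
# 'apps.mailing.tasks.execute_sending_system_mail' should show up in this list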
A few specific approaches I tried:
1) Made sure all modules are imported without relative imports.
2) In the past I ran into an issue with missing packages in the venv --> they are up to date.
3) Rebooted celery/celerybeat/gunicorn/systemd/rabbitmq and the server itself.
4) Double checked the paths in the systemd services (but maybe I have been debugging this for too long and just can't see the typo or something).
5) Tried the development version 4.4.0rc2 (the celery worker won't start up).
6) INSTALLED_APPS contains all the required apps (see the sketch after this list).
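To illustrate point 6, a minimal sketch of the relevant part of the settings - the exact app list is an assumption here, only 'apps.mailing' is known from the task path in the error:
# settings.py (sketch; app labels other than 'apps.mailing' are assumed)
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    # 'apps.mailing' has to be listed here so that
    # app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)
    # can find apps/mailing/tasks.py
    'apps.mailing',
]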
Error message after downgrading celery
[2019-06-16 19:35:00,092: ERROR/MainProcess] Received unregistered task of type 'apps.mailing.tasks.execute_sending_system_mail'.
The message has been ignored and discarded.
Did you remember to import the module containing this task?
Or maybe you're using relative imports?
Please see
http://docs.celeryq.org/en/latest/internals/protocol.html
for more information.
The full contents of the message body was:
'[[], {}, {"callbacks": null, "errbacks": null, "chain": null, "chord": null}]' (77b)
Traceback (most recent call last):
File "/home/user/apps/venv/loans/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 557, in on_task_received
strategy = strategies[type_]
KeyError: 'apps.mailing.tasks.execute_sending_system_mail'
Celery Service Systemd Code
[Unit]
Description=Celery Service
After=network.target
[Service]
Type=forking
User=<user>
Group=<user>
EnvironmentFile=/etc/default/celery
WorkingDirectory=/home/<user>/apps/loans
ExecStart=/bin/sh -c '${CELERY_BIN} multi start ${CELERYD_NODES} \
-A ${CELERY_APP} --pidfile=${CELERYD_PID_FILE} \
--logfile=${CELERYD_LOG_FILE} --loglevel=${CELERYD_LOG_LEVEL} ${CELERYD_OPTS}'
ExecStop=/bin/sh -c '${CELERY_BIN} multi stopwait ${CELERYD_NODES} \
--pidfile=${CELERYD_PID_FILE}'
ExecReload=/bin/sh -c '${CELERY_BIN} multi restart ${CELERYD_NODES} \
-A ${CELERY_APP} --pidfile=${CELERYD_PID_FILE} \
--logfile=${CELERYD_LOG_FILE} --loglevel=${CELERYD_LOG_LEVEL} ${CELERYD_OPTS}'
[Install]
WantedBy=multi-user.target
Celery Beat Service Systemd Code
[Unit]
Description=Celery Beat Service
After=network.target
[Service]
Type=simple
User=user
Group=user
EnvironmentFile=/etc/default/celery
WorkingDirectory=/home/user/apps/loans
ExecStart=/bin/sh -c '${CELERY_BIN} beat \
-A ${CELERY_APP} --pidfile=${CELERYBEAT_PID_FILE} \
--logfile=${CELERYBEAT_LOG_FILE} --loglevel=${CELERYD_LOG_LEVEL}'
[Install]
WantedBy=multi-user.target
Conf file with the variables (/etc/default/celery)
CELERYD_NODES="w1"
CELERY_BIN="/home/user/apps/venv/loans/bin/celery"
CELERY_APP="celery_cfg:app"
CELERYD_MULTI="multi"
CELERYD_OPTS=""
CELERYD_PID_FILE="/home/user/apps/pids/celery/%n.pid"
CELERYD_LOG_FILE="/home/user/apps/logs/celery/%n%I.log"
CELERYD_LOG_LEVEL="INFO"
CELERYBEAT_PID_FILE="/home/user/apps/pids/celery/beat.pid"
CELERYBEAT_LOG_FILE="/home/user/apps/logs/celery/beat.log"
celery_cfg file
from celery import Celery
from celery.schedules import crontab
from django.conf import settings

# (DJANGO_SETTINGS_MODULE is assumed to be set elsewhere; that part is not shown here)
app = Celery('loans_apps')
app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)
app.set_default()

# <====CELERY BEAT PERIODIC TASKS ====>
app.conf.beat_schedule = {
    'execute_sending_system_mail': {
        'task': 'apps.mailing.tasks.execute_sending_system_mail',
        'schedule': crontab(minute='*/5'),
        'args': (),
    },
}


@app.task(bind=True)
def debug_task(self):
    print('Request: {0!r}'.format(self.request))
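And a minimal sketch of the task module that the beat entry points at - only the module path and the task name come from the config and the error above, the body is a placeholder:
# apps/mailing/tasks.py (sketch; the real implementation is not shown in the question)
from celery import shared_task


@shared_task
def execute_sending_system_mail():
    # placeholder body -- the real task sends the system mail
    pass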
A minor cut of the settings containing the celery config variables
BROKER_URL = 'amqp://localhost//'
CELERY_ENABLE_UTC = True
I know I could try running celery and celerybeat without systemd, but I treat that as a last resort solution. I'd like to keep the conf as it is, even though I have no clue what's wrong up there.
EDIT
By accident, and guided by my friend, I just found out that both the celery and celerybeat services seem to work fine when run as the root user, which is obviously not the solution, but it narrows down the number of possible flaws.

It would be rude to leave the question unanswered, so even though the answer comes from me, here it is:
If you ever encounter such an issue, after following the steps listed above, check the permissions of the directories that celery and celerybeat use (the pid and log directories) - you might have created them with root permissions, which can end up causing the issue described here. Good luck to everyone in the future!
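A minimal sketch of such a check, assuming the pid and log paths from the conf file above and the service user from the unit files:
# check_celery_dirs.py -- sketch; paths and user taken from the config shown above
import os
import pwd

SERVICE_USER = 'user'  # the User= value from the systemd units
DIRS = [
    '/home/user/apps/pids/celery',
    '/home/user/apps/logs/celery',
]

for path in DIRS:
    st = os.stat(path)
    owner = pwd.getpwuid(st.st_uid).pw_name
    print('{}: owner={}, mode={}'.format(path, owner, oct(st.st_mode & 0o777)))
    if owner != SERVICE_USER:
        print('  -> owned by {}, expected {}; fix with: '
              'sudo chown -R {}:{} {}'.format(owner, SERVICE_USER,
                                              SERVICE_USER, SERVICE_USER, path))
If the directories turn out to be owned by root, chown them back to the service user and restart the celery and celerybeat services.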

Related

celery flower does not show previously run tasks after restart

I am using celery with django and flower to inspect tasks. I am using rabbitmq as the broker. It all works fine, but after I restart flower, the previously listed task states are lost and I see 0 entries in flower.
I am running flower in persistent mode too.
python -m celery -A curatepro -l debug flower --persistent
--db=/var/lib/flower/flowerdb
There is a similar question asked on this SO post:
Celery Flower - how can i load previous catched tasks?
and it suggests using the --persistent flag; I am already using it, but it still doesn't seem to work.
You have to set the flag explicitly as --persistent=True for this to work.

Flink job started from another program on YARN fails with "JobClientActor seems to have died"

I'm a new flink user and I have the following problem.
I use flink on a YARN cluster to transfer related data extracted from an RDBMS to HBase.
I wrote a flink batch application in Java with multiple ExecutionEnvironments (one per RDB table, to transfer the table rows in parallel) and transfer the tables one by one sequentially (because the call to env.execute() is blocking).
I start the YARN session like this:
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/yarn-session.sh -n 1 -s 4 -d -jm 2048 -tm 8096
Then I run my application on the started YARN session via the shell script transfer.sh. Its content is here:
#!/bin/bash
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/flink run -p 4 transfer.jar
When I start this script from the command line manually, it works fine - jobs are submitted to the YARN session one by one without errors.
Now I need to be able to run this script from another Java program.
For this I use
Runtime.exec("transfer.sh");
(Maybe there are better ways to do this? I have looked at the REST API, but there are some difficulties because the job manager is proxied by YARN.)
At the beginning it works as usual - the first several jobs are submitted to the session and finish successfully. But the following jobs are not submitted to the YARN session.
In /opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log I see the following error (and no other errors were found at DEBUG level):
The program execution failed: JobClientActor seems to have died before the JobExecutionResult could be retrieved.
I have tried to analyse this problem myself and found out that this error occurs in the JobClient class while sending a ping request with a timeout to the JobClientActor (i.e. the YARN cluster).
I tried to increase several heartbeat and timeout options like akka.*.timeout, akka.watch.heartbeat.* and the yarn.heartbeat-delay option, but it doesn't solve the problem - new jobs are not submitted to the YARN session from CliFrontend.
The environment for both cases (manual call and call from another program) is the same. When I call
$ ps axu | grep transfer
it gives me this output:
/usr/lib/jvm/java-8-oracle/bin/java -Dlog.file=/opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log -Dlog4j.configuration=file:/opt/flink-1.3.1/conf/log4j-cli.properties -Dlogback.configurationFile=file:/opt/flink-1.3.1/conf/logback.xml -classpath /opt/flink-1.3.1/lib/flink-metrics-graphite-1.3.1.jar:/opt/flink-1.3.1/lib/flink-python_2.11-1.3.1.jar:/opt/flink-1.3.1/lib/flink-shaded-hadoop2-uber-1.3.1.jar:/opt/flink-1.3.1/lib/log4j-1.2.17.jar:/opt/flink-1.3.1/lib/slf4j-log4j12-1.7.7.jar:/opt/flink-1.3.1/lib/flink-dist_2.11-1.3.1.jar:::/etc/hadoop/conf org.apache.flink.client.CliFrontend run -p 4 transfer.jar
I also tried updating flink to the 1.4.0 release and changing the parallelism of the job (even to -p 1), but the error still occurred.
I have no idea what could be different. Is there any workaround, by the way?
Thank you for any help.
Finally I found out how to resolve that error.
Just replace Runtime.exec(...) with new ProcessBuilder(...).inheritIO().start().
I really don't know why the call to inheritIO helps in this case, because as I understand it, it just redirects the IO streams from the child process to the parent process.
But I have checked that if I comment out this line of code, the program starts to fail again.

Celery error "Received 0x00 while expecting 0xce"

I am trying to use celery. I installed rabbitmq-server with the command from the celery tutorial:
sudo apt-get install rabbitmq-server
Everything worked well while I wrote my code in one file and ran it to test the functionality. But when I tried to add my code to Django views and then made concurrent requests to my views, I got this kind of exception:
File "/home/kinmanz/PycharmProjects/GitFace/myvenv/lib/python3.5/site-packages/amqp/connection.py", line 464, in drain_events
return self.blocking_read(timeout)
File "/home/kinmanz/PycharmProjects/GitFace/myvenv/lib/python3.5/site-packages/amqp/connection.py", line 468, in blocking_read
frame = self.transport.read_frame()
File "/home/kinmanz/PycharmProjects/GitFace/myvenv/lib/python3.5/site-packages/amqp/transport.py", line 251, in read_frame
'Received {0:#04x} while expecting 0xce'.format(ch))
amqp.exceptions.UnexpectedFrame: Received 0x00 while expecting 0xce
I think the problem may be in the concurrency of the requests, and that I should somehow make the queue concurrency safe.
I use Python 3.5, Celery 4.0.0, RabbitMQ 3.5.7.
Actually the problem is in amqplib, see the answer below.
Maybe for someone who has the same problem, I will list the possible solutions that I have managed to find. If you know a better solution, please add your answer or comment on mine.
If you are using Python 2.x then see this issue: https://github.com/celery/celery/issues/922
The problem is actually in amqplib; if you change it to librabbitmq everything should work. It's quite easy to do, see:
Framing Errors in Celery 3.0.1
But if you are using Python 3.x you can't solve the problem that way, because there is no Python 3-compatible librabbitmq available, see this issue: https://github.com/celery/celery/issues/2066. In that case you can change your result backend to redis, for example:
1) Install redis server:
$ sudo aptitude install redis-server
2) Change your app configuration
app = Celery('tasks', backend='redis://localhost', broker='pyamqp://')
Some useful links about installing redis: Setting up an asynchronous task queue for django using celery redis and Celery-redis quick guide.
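If the Celery app is configured from the Django settings (via config_from_object('django.conf:settings')), the equivalent change can also be made there - a sketch using the old-style uppercase setting names that Celery 4.0 still accepts:
# settings.py -- sketch; equivalent of backend='redis://localhost' in the Celery() call
BROKER_URL = 'pyamqp://'                     # keep RabbitMQ as the broker
CELERY_RESULT_BACKEND = 'redis://localhost'  # store task results in redis instead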
Also, for Python 3 you can try to run the celery worker on Python 2.7 while your app runs on Python 3; in that case don't forget to install librabbitmq instead of amqplib. (This way seems inconvenient, though.)

Stopping systemd service with salt-stack

I am using salt-stack to manage my production machine.
The minions run Raspbian and I have configured a systemd service. The service's config file is located at /lib/systemd/system/my_service.service.
When I run the following command:
sudo salt my_minion service.stop my_service
The following error is returned:
ERROR: Unable to run command ['/etc/init.d/my_service', 'stop'] with the context {'with_communicate': True, 'shell': False, 'env': {'LANG': 'en_GB.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'LC_ALL': 'C'}, 'stdout': -1, 'close_fds': True, 'stdin': None, 'stderr': -2, 'cwd': '/root'}, reason: [Errno 2] No such file or directory
I understand that salt tries to use sysvinit instead of systemd.
Is there any way to tell salt to use systemd?
EDIT:
I tried adding
providers:
  service: systemd
to /etc/salt/minion as suggested by Eric. Still getting the same error.
EDIT 2
The issue was fixed by using Eric's suggestion + upgrading salt-minion from 2015.8.3 to 2015.8.8.
This is almost certainly because newer Raspbian is based on Debian 8, and Salt's systemd execution module does not properly detect newer Raspbian as needing systemd. Can the OP please reply to this message with the output from sudo salt my_minion grains.items? Please redact any grains which you feel contain personally-identifiable information; I'm mainly interested in the grains that deal with the OS name and version.
EDIT: One more thing. Please confirm that /run/systemd/system exists on the Raspbian box. What I think is happening here is that two modules are both claiming to be the ones to provide the service module.
https://github.com/saltstack/salt/pull/32421 should fix this, but you can work around this immediately (without waiting for a new Salt release) by adding the following to /etc/salt/minion on your Raspbian minions:
providers:
  service: systemd

Celery node fails, pidbox already in use on restart

I have Celery running with the RabbitMQ broker.
Today one of my Celery nodes failed: it doesn't execute tasks and doesn't respond to the service celeryd stop command. After a few attempts the node stopped, but on start I get this message:
[WARNING/MainProcess] celery@nodename ready.
[WARNING/MainProcess] /home/ubuntu/virtualenv/project_1/local/lib/python2.7/site-packages/kombu/pidbox.py:73: UserWarning: A node named u'nodename' is already using this process mailbox!
Maybe you forgot to shutdown the other node or did not do so properly?
Or if you meant to start multiple nodes on the same host please make sure
you give each node a unique node name!
warnings.warn(W_PIDBOX_IN_USE % {'hostname': self.hostname})
Can anyone suggest how to unlock the process mailbox?
According to http://celery.readthedocs.org/en/latest/userguide/workers.html#starting-the-worker you might need to name each node uniquely. Example:
$ celery -A proj worker --loglevel=INFO --concurrency=10 -n worker1.%h
In supervisor, escape the %h by writing %%h.
A large log file or not enough free space was the reason, I think.
After deleting it, everything is OK.