I tried to shut down Celery workers gracefully using this command:
ps aux | grep celery\ worker | awk '{print $2}' | xargs kill -SIGINT
However, some worker processes are still throwing WorkerLostError.
This is part of the Celery worker log, in which one of the workers gets killed before completing its task:
[2016-01-18 09:10:56,667: INFO/MainProcess] Received task: task[b83163b4-9fd6-4c06-88bb-8cf8fc98bb23]
[2016-01-18 09:10:58,390: INFO/MainProcess] Received task: task[d5239190-2cea-4ee1-bac6-7600a9f05839]
[2016-01-18 09:11:00,621: INFO/MainProcess] Received task: task[1139f6f8-31b5-449b-9425-0cf6943496d4]
[2016-01-18 09:11:01,543: INFO/MainProcess] Received task: task[0547455c-83e4-4a9e-a0a5-e872afdbbc62]
[2016-01-18 09:11:01,695: INFO/MainProcess] Received task: task[520b41da-7e3a-4c9d-a807-8e11aebd4bcd]
[2016-01-18 09:11:02,286: INFO/MainProcess] Task task[d5239190-2cea-4ee1-bac6-7600a9f05839] succeeded in 3.8944714s: []
worker: Hitting Ctrl+C again will terminate all running tasks!
worker: Warm shutdown (MainProcess)
[2016-01-18 09:11:04,451: INFO/MainProcess] Task task[520b41da-7e3a-4c9d-a807-8e11aebd4bcd] succeeded in 2.754699837s: []
[2016-01-18 09:11:04,497: ERROR/MainProcess] Task task[b83163b4-9fd6-4c06-88bb-8cf8fc98bb23] raised unexpected: WorkerLostError('Worker exited prematurely: exitcode 0.',)
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/billiard-3.3.0.22-py2.6.egg/billiard/pool.py", line 1175, in mark_as_worker_lost
human_status(exitcode)),
WorkerLostError: Worker exited prematurely: exitcode 0.
[2016-01-18 09:11:05,500: INFO/MainProcess] Task task[1139f6f8-31b5-449b-9425-0cf6943496d4] succeeded in 4.877286218s: []
[2016-01-18 09:11:06,002: INFO/MainProcess] Task task[0547455c-83e4-4a9e-a0a5-e872afdbbc62] succeeded in 4.457699244s: []
-------------- celery@ip-10-0-1-78 v3.1.19 (Cipater)
---- **** -----
--- * *** * -- Linux-3.14.26-24.46.amzn1.x86_64-x86_64-with-glibc2.2.5
The code snippet below will gracefully shut down a worker, i.e. the worker will shut down once the task it is currently executing has finished.
from celery import Celery
celery = Celery('vwadaptor', broker='redis://workerdb:6379/0', backend='redis://workerdb:6379/0')
celery.control.broadcast('shutdown', destination=[<celery_worker_name>])
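If you don't know the worker names up front, they can be discovered with a ping before broadcasting the shutdown. This is a minimal sketch assuming the same app and broker URL as above; control.ping() and control.broadcast() are Celery's standard remote-control calls.
from celery import Celery

celery = Celery('vwadaptor',
                broker='redis://workerdb:6379/0',
                backend='redis://workerdb:6379/0')

# Each reply looks like {'celery@hostname': {'ok': 'pong'}}.
replies = celery.control.ping(timeout=5)
worker_names = [name for reply in replies for name in reply]

# Ask each worker to finish its current task and then exit (warm shutdown).
celery.control.broadcast('shutdown', destination=worker_names)
Leaving out destination broadcasts the shutdown to every worker listening on that broker.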
That's not a very graceful way to shut down workers.
I recommend using supervisord to manage your workers; there is an example configuration file in the Celery sources that you can use as a starting point.
I have a pretty stable setup with celery 4.4.7 and a redis node on AWS.
However, once every few weeks, I notice that the celery workers suddenly stop processing the queue.
If I inspect the logs, I see the following:
[2021-10-14 06:36:52,250: DEBUG/ForkPoolWorker-1] Executing task PutObjectTask(transfer_id=0, {'bucket': 'cdn.my-app.com', 'key': 'thumbs/-0DqXnEL1k3J_1824x0_n1wMX-gx.jpg', 'extra_args': {'ContentType': 'image/jpeg', 'Expires': 'Wed, 31 Dec 2036 23:59:59 GMT', 'StorageClass': 'INTELLIGENT_TIERING', 'ACL': 'private'}}) with kwargs {'client': <botocore.client.S3 object at 0xffff90746040>, 'fileobj': <s3transfer.utils.ReadFileChunk object at 0xffff9884d670>, 'bucket': 'cdn.my-app.com', 'key': 'thumbs/-0DqXnEL1k3J_1824x0_n1wMX-gx.jpg', 'extra_args': {'ContentType': 'image/jpeg', 'Expires': 'Wed, 31 Dec 2036 23:59:59 GMT', 'StorageClass': 'INTELLIGENT_TIERING', 'ACL': 'private'}}
[2021-10-14 06:36:52,253: DEBUG/ForkPoolWorker-1] Releasing acquire 0/None
[2021-10-14 06:36:52,429: DEBUG/ForkPoolWorker-1] Releasing acquire 0/None
[2021-10-14 06:37:50,202: INFO/MainProcess] missed heartbeat from celery@3a05aa0db663
[2021-10-14 06:37:50,202: INFO/MainProcess] missed heartbeat from celery@7c9532243cd0
[2021-10-14 06:37:50,202: INFO/MainProcess] missed heartbeat from celery@686f7e96d04f
[2021-10-14 06:54:06,510: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 706, in send_packed_command
sendall(self._sock, item)
File "/usr/local/lib/python3.8/dist-packages/redis/_compat.py", line 9, in sendall
return sock.sendall(*args, **kwargs)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/usr/local/lib/python3.8/dist-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.8/dist-packages/celery/worker/consumer/consumer.py", line 599, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python3.8/dist-packages/celery/worker/loops.py", line 83, in asynloop
next(loop)
File "/usr/local/lib/python3.8/dist-packages/kombu/asynchronous/hub.py", line 364, in create_loop
cb(*cbargs)
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 1083, in on_readable
self.cycle.on_readable(fileno)
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 354, in on_readable
chan.handlers[type]()
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 688, in _receive
ret.append(self._receive_one(c))
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 698, in _receive_one
response = c.parse_response()
File "/usr/local/lib/python3.8/dist-packages/redis/client.py", line 3501, in parse_response
self.check_health()
File "/usr/local/lib/python3.8/dist-packages/redis/client.py", line 3521, in check_health
conn.send_command('PING', self.HEALTH_CHECK_MESSAGE,
File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 725, in send_command
self.send_packed_command(self.pack_command(*args),
File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 717, in send_packed_command
raise ConnectionError("Error %s while writing to socket. %s." %
redis.exceptions.ConnectionError: Error 110 while writing to socket. Connection timed out.
[2021-10-14 06:54:06,520: DEBUG/MainProcess] Canceling task consumer...
[2021-10-14 06:54:06,538: INFO/MainProcess] Connected to redis://my-redis-endpoint.cache.amazonaws.com:6379/0
[2021-10-14 06:54:06,548: INFO/MainProcess] mingle: searching for neighbors
[2021-10-14 07:09:45,507: INFO/MainProcess] mingle: sync with 1 nodes
[2021-10-14 07:09:45,507: DEBUG/MainProcess] mingle: processing reply from celery#3a05aa0db663
[2021-10-14 07:09:45,508: INFO/MainProcess] mingle: sync complete
[2021-10-14 07:09:45,536: INFO/MainProcess] Received task: my-app.tasks.retrieve_thumbnail[f50d66fd-28f5-4f1d-9b69-bf79daab43c0]
Please note the amount of time that passed with seemingly no activity between 06:37:50,202 (loss of connection to the other workers detected) and 06:54:06,510 (loss of connection to the broker detected), and also between 06:54:06,548 (reconnected to the broker) and 07:09:45,536 (started receiving tasks again).
All in all, my Celery setup was not processing tasks between 06:37:50 and 07:09:45, which impacted the functioning of my app.
Any suggestions on how to tackle this issue?
Thanks!
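One thing worth checking (a sketch, not a confirmed fix for the outage above): Celery's Redis transport accepts socket timeout, keepalive, and health-check options through broker_transport_options, which can make a silently dropped broker connection fail fast instead of hanging until the kernel's TCP timeout. The option names below are the ones documented for the Redis transport; the values are only illustrative.
from celery import Celery

# Broker URL taken from the log above; adjust as needed.
app = Celery('my-app', broker='redis://my-redis-endpoint.cache.amazonaws.com:6379/0')

app.conf.broker_transport_options = {
    'socket_timeout': 30,          # give up quickly on a dead TCP connection
    'socket_connect_timeout': 10,  # bound the time spent (re)connecting
    'socket_keepalive': True,      # let the OS probe half-open connections
    'health_check_interval': 25,   # periodically PING the broker connection
}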
I'll need a bit of help here.
I'm running Airflow in a Docker container with the LocalExecutor.
The Airflow version is 2.0.0 (https://pypi.org/project/apache-airflow/2.0.0/).
I'm running a long-running task with a wrapper around SSHOperator.
Basically, I open an SSH session to run a spark-submit job on the Spark edge node
(the YARN job succeeds, but the Airflow task fails).
Task starts with PID 31675:
[2021-06-24 18:29:09,664] {standard_task_runner.py:51} INFO - Started process 31675 to run task
Then, after some time, I get this warning:
Recorded pid 1098 does not match the current pid 31631
And then the task fails:
[2021-06-24 19:45:44,493] {local_task_job.py:166} WARNING - Recorded pid 1098 does not match the current pid 31631
[2021-06-24 19:45:44,496] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 31675
[2021-06-24 19:45:44,496] {taskinstance.py:1214} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-06-24 19:45:44,528] {taskinstance.py:1396} ERROR - LatamSSH operator error: Task received SIGTERM signal
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/latamairflow/operators/latam_ssh_operator.py", line 453, in execute
readq, _, _ = select([channel], [], [], self.timeout)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1216, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1086, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1260, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1300, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/latamairflow/operators/latam_ssh_operator.py", line 502, in execute
raise AirflowException("LatamSSH operator error: {0}".format(str(e)))
airflow.exceptions.AirflowException: LatamSSH operator error: Task received SIGTERM signal
[2021-06-24 19:45:44,529] {taskinstance.py:1440} INFO - Marking task as FAILED.
Using Filebeat 7.5.2:
I'm using a Filebeat configuration with close_eof enabled, and I run Filebeat with the --once flag. I can see the harvester reaching EOF, but Filebeat keeps running.
Filebeat conf:
filebeat.inputs:
- type: log
  close_eof: true
  enabled: true
  paths:
    - "${LOGS_PATH}"
  scan_frequency: 1s
  fields:
    machine: "${HOST}"

output.logstash:
  hosts: ["192.168.41.6:5044"]
  bulk_max_size: 1024
  timeout: 30s
  pipelining: 1
  workers: 1
And I run it using:
filebeat run --once -v -c "PATH TO CONF..."
And some logs from the filebeat instance:
...
2020-02-04T18:30:16.950Z INFO instance/beat.go:297 Setup Beat: filebeat; Version: 7.5.2
2020-02-04T18:30:17.059Z INFO [publisher] pipeline/module.go:97 Beat name: logstash
2020-02-04T18:30:17.167Z WARN beater/filebeat.go:152 Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-02-04T18:30:17.168Z INFO instance/beat.go:429 filebeat start running.
2020-02-04T18:30:17.168Z INFO [monitoring] log/log.go:118 Starting metrics logging every 30s
2020-02-04T18:30:17.168Z INFO registrar/migrate.go:104 No registry home found. Create: /tmp/tmp.BXJtfiaEzb/data/registry/filebeat
2020-02-04T18:30:17.179Z INFO registrar/migrate.go:112 Initialize registry meta file
2020-02-04T18:30:17.192Z INFO registrar/registrar.go:108 No registry file found under: /tmp/tmp.BXJtfiaEzb/data/registry/filebeat/data.json. Creating a new registry file.
2020-02-04T18:30:17.193Z INFO registrar/registrar.go:145 Loading registrar data from /tmp/tmp.BXJtfiaEzb/data/registry/filebeat/data.json
2020-02-04T18:30:17.193Z INFO registrar/registrar.go:152 States Loaded from registrar: 0
2020-02-04T18:30:17.193Z WARN beater/filebeat.go:368 Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-02-04T18:30:17.193Z INFO crawler/crawler.go:72 Loading Inputs: 1
2020-02-04T18:30:17.194Z INFO log/input.go:152 Configured paths: [/tmp/tmp.BXJtfiaEzb/*.log]
2020-02-04T18:30:17.206Z INFO input/input.go:114 Starting input of type: log; ID: 13918413832820009056
2020-02-04T18:30:17.225Z INFO input/input.go:167 Stopping Input: 13918413832820009056
2020-02-04T18:30:17.225Z INFO crawler/crawler.go:106 Loading and starting Inputs completed. Enabled inputs: 1
2020-02-04T18:30:17.225Z INFO log/harvester.go:251 Harvester started for file: /tmp/tmp.BXJtfiaEzb/dcbgw-20200124080032_darkblue.log
2020-02-04T18:30:17.231Z INFO beater/filebeat.go:384 Running filebeat once. Waiting for completion ...
2020-02-04T18:30:17.231Z INFO beater/filebeat.go:386 All data collection completed. Shutting down.
2020-02-04T18:30:17.231Z INFO crawler/crawler.go:139 Stopping Crawler
2020-02-04T18:30:17.231Z INFO crawler/crawler.go:149 Stopping 1 inputs
2020-02-04T18:30:17.258Z INFO pipeline/output.go:95 Connecting to backoff(async(tcp://192.168.41.6:5044))
2020-02-04T18:30:17.296Z INFO pipeline/output.go:105 Connection to backoff(async(tcp://192.168.41.6:5044)) established
... Only metrics here ...
2020-02-04T18:35:55.686Z INFO log/harvester.go:274 End of file reached: /tmp/tmp.BXJtfiaEzb/dcbgw-20200124080032_darkblue.log. Closing because close_eof is enabled.
2020-02-04T18:35:55.686Z INFO crawler/crawler.go:165 Crawler stopped
... MORE METRICS ...
2020-02-04T18:36:26.609Z ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 192.168.41.6:49662->192.168.41.6:5044: i/o timeout
2020-02-04T18:36:26.621Z ERROR logstash/async.go:256 Failed to publish events caused by: client is not connected
2020-02-04T18:36:28.520Z ERROR pipeline/output.go:121 Failed to publish events: client is not connected
2020-02-04T18:36:28.520Z INFO pipeline/output.go:95 Connecting to backoff(async(tcp://192.168.41.6:5044))
2020-02-04T18:36:28.521Z INFO pipeline/output.go:105 Connection to backoff(async(tcp://192.168.41.6:5044)) established
... MORE METRICS ...
From this machine I'm outputting to Logstash 7.5.2, running in the same Ubuntu 18 VM. Running Logstash with log level trace does not output any errors.
I have a Celery worker with a Redis backend that has been running for more than half a year, and I have not had any problems so far.
Suddenly, I do not get any reply from the nodes.
I can successfully start Celery; there is no error message when executing the command:
celery multi start myqueue -A myapp.celery -Ofair
celery multi v4.3.0 (rhubarb)
> Starting nodes...
> myqueue@myhost: OK
However, when I check the status of the celery worker
celery -A myapp.celery status
I get the message:
Error: No nodes replied within time constraint.
If I look up the processes, the celery worker appears to be running:
/usr/bin/python3 -m celery worker -Ofair -A myapp.celery --concurrency=4
\_ /usr/bin/python3 -m celery worker -Ofair -A myapp.celery --concurrency=4
\_ /usr/bin/python3 -m celery worker -Ofair -A myapp.celery --concurrency=4
\_ /usr/bin/python3 -m celery worker -Ofair -A myapp.celery --concurrency=4
\_ /usr/bin/python3 -m celery worker -Ofair -A myapp.celery --concurrency=4
When I do a
celery -A myapp.celery control shutdown
the above processes are removed as expected.
Starting in the foreground does not give any hint either:
$ celery -A myapp.celery worker -l debug
Please specify a different user using the --uid option.
User information: uid=1000120000 euid=1000120000 gid=0 egid=0
uid=uid, euid=euid, gid=gid, egid=egid,
[2019-08-23 11:36:36,790: DEBUG/MainProcess] | Worker: Preparing bootsteps.
[2019-08-23 11:36:36,792: DEBUG/MainProcess] | Worker: Building graph...
[2019-08-23 11:36:36,793: DEBUG/MainProcess] | Worker: New boot order: {StateDB, Beat, Timer, Hub, Pool, Autoscaler, Consumer}
[2019-08-23 11:36:36,808: DEBUG/MainProcess] | Consumer: Preparing bootsteps.
[2019-08-23 11:36:36,808: DEBUG/MainProcess] | Consumer: Building graph...
[2019-08-23 11:36:36,862: DEBUG/MainProcess] | Consumer: New boot order: {Connection, Events, Mingle, Tasks, Control, Heart, Gossip, Agent, event loop}
-------------- celery@myapp-163-m4hs9 v4.3.0 (rhubarb)
---- **** -----
--- * *** * -- Linux-3.10.0-862.3.2.el7.x86_64-x86_64-with-Ubuntu-16.04-xenial 2019-08-23 11:36:36
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: myapp:0x7f2094fcd978
- ** ---------- .> transport: redis://:**@${redis-host}:6379/0
- ** ---------- .> results: redis://:**@${redis-host}:6379/0
- *** --- * --- .> concurrency: 4 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> myqueue exchange=myqueue(direct) key=myqueue
[tasks]
. sometask1
. sometask2
[2019-08-23 11:36:36,874: DEBUG/MainProcess] | Worker: Starting Hub
[2019-08-23 11:36:36,874: DEBUG/MainProcess] ^-- substep ok
[2019-08-23 11:36:36,874: DEBUG/MainProcess] | Worker: Starting Pool
[2019-08-23 11:36:37,278: DEBUG/MainProcess] ^-- substep ok
[2019-08-23 11:36:37,279: DEBUG/MainProcess] | Worker: Starting Consumer
[2019-08-23 11:36:37,280: DEBUG/MainProcess] | Consumer: Starting Connection
[2019-08-23 11:36:37,299: INFO/MainProcess] Connected to redis://:**@${redis-host}:6379/0
[2019-08-23 11:36:37,299: DEBUG/MainProcess] ^-- substep ok
[2019-08-23 11:36:37,299: DEBUG/MainProcess] | Consumer: Starting Events
[2019-08-23 11:36:37,311: DEBUG/MainProcess] ^-- substep ok
[2019-08-23 11:36:37,312: DEBUG/MainProcess] | Consumer: Starting Mingle
[2019-08-23 11:36:37,312: INFO/MainProcess] mingle: searching for neighbors
[2019-08-23 11:36:38,343: INFO/MainProcess] mingle: all alone
[2019-08-23 11:36:38,343: DEBUG/MainProcess] ^-- substep ok
[2019-08-23 11:36:38,343: DEBUG/MainProcess] | Consumer: Starting Tasks
[2019-08-23 11:36:38,350: DEBUG/MainProcess] ^-- substep ok
[2019-08-23 11:36:38,350: DEBUG/MainProcess] | Consumer: Starting Control
[2019-08-23 11:36:38,359: DEBUG/MainProcess] ^-- substep ok
[2019-08-23 11:36:38,359: DEBUG/MainProcess] | Consumer: Starting Heart
[2019-08-23 11:36:38,363: DEBUG/MainProcess] ^-- substep ok
[2019-08-23 11:36:38,363: DEBUG/MainProcess] | Consumer: Starting Gossip
[2019-08-23 11:36:38,371: DEBUG/MainProcess] ^-- substep ok
[2019-08-23 11:36:38,371: DEBUG/MainProcess] | Consumer: Starting event loop
[2019-08-23 11:36:38,372: DEBUG/MainProcess] | Worker: Hub.register Pool...
[2019-08-23 11:36:38,373: INFO/MainProcess] celery@myapp-163-m4hs9 ready.
[2019-08-23 11:36:38,373: DEBUG/MainProcess] basic.qos: prefetch_count->16
[2019-08-23 11:36:38,838: DEBUG/MainProcess] pidbox received method enable_events() [reply_to:None ticket:None]
[2019-08-23 11:36:38,839: INFO/MainProcess] Events of group {task} enabled by remote.
[2019-08-23 11:36:43,838: DEBUG/MainProcess] pidbox received method enable_events() [reply_to:None ticket:None]
Redis is up and running:
redis-cli -h ${redis-host}
redis:6379> ping
PONG
The log file does not contain any hint.
As already mentioned, when I check the status of the celery worker
celery -A myapp.celery status
I get the message:
Error: No nodes replied within time constraint.
Instead, celery should respond with
> myqueue@myhost: OK
or at least give some error message.
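For what it's worth, celery status is essentially a broadcast ping over the broker's remote-control (pidbox) channel, so the same check can be reproduced from Python with an explicit timeout. A minimal sketch; the import path and attribute name are assumptions, adjust them to your project:
from myapp.celery import app   # hypothetical: wherever your Celery() instance lives

# Equivalent to `celery -A myapp.celery status`, but with a generous timeout.
replies = app.control.ping(timeout=10)
print(replies)   # a healthy worker shows up here; an empty list corresponds to
                 # "Error: No nodes replied within time constraint."
If this also comes back empty while the worker process is clearly alive, the problem is on the broker/transport side rather than in the worker itself.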
Interim solution and further investigation:
For now, the immediate measure was to switch the message queue to RabbitMQ, and the worker is online and responding again. So this issue seems to be specific to using Redis as the message queue.
Updating Celery and the Redis client to the most recent versions (Celery 4.3.0, redis 3.3.8) did not help.
Python version is 3.5 (on OpenShift).
There's a bug in the latest version (4.6.4) of the Kombu library (a Celery dependency) that causes this issue with Redis, as documented in this GitHub issue.
The bug was recently fixed in a pull request in the Kombu repository, but the fix hasn't made it into a release yet.
Downgrading Kombu to version 4.6.3 will fix the issue.
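To check which versions an environment actually has before and after the downgrade, a quick sketch:
import celery
import kombu

print('celery:', celery.__version__)   # 4.3.0 in the setup above
print('kombu:', kombu.__version__)     # affected: 4.6.4; downgrade target: 4.6.3
Pinning kombu==4.6.3 in your requirements keeps the downgrade in place until a fixed Kombu release is published.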
I'm trying to set up rabbitmq/celery/django-celery/django so that it is "rebootproof", i.e. it all just comes back up by itself. Everything seems to work fine except this:
When I reboot, all services get started, but it seems celeryd is started before rabbitmq, and celerybeat subsequently gets terminated because it can't connect (?):
[2011-06-14 00:48:35,128: WARNING/MainProcess] celery@inquire has started.
[2011-06-14 00:48:35,130: INFO/Beat] child process calling self.run()
[2011-06-14 00:48:35,131: INFO/Beat] Celerybeat: Starting...
[2011-06-14 00:48:35,134: ERROR/MainProcess] Consumer: Connection Error: [Errno 111] Connection refused. Trying again in 2 seconds...
[2011-06-14 00:48:35,688: INFO/Beat] process shutting down
[2011-06-14 00:48:35,689: WARNING/Beat] Process Beat:
[2011-06-14 00:48:35,689: WARNING/Beat] Traceback (most recent call last):
...
[2011-06-14 00:48:35,756: WARNING/Beat] File "/home/inquire/inquire.env/lib/python2.6/site-packages/amqplib/client_0_8/transport.py", line 220, in create_transport
[2011-06-14 00:48:35,760: WARNING/Beat] return TCPTransport(host, connect_timeout)
[2011-06-14 00:48:35,761: WARNING/Beat] File "/home/inquire/inquire.env/lib/python2.6/site-packages/amqplib/client_0_8/transport.py", line 58, in __init__
[2011-06-14 00:48:35,761: WARNING/Beat] self.sock.connect((host, port))
[2011-06-14 00:48:35,761: WARNING/Beat] File "<string>", line 1, in connect
[2011-06-14 00:48:35,761: WARNING/Beat] error: [Errno 111] Connection refused
[2011-06-14 00:48:35,761: INFO/Beat] process exiting with exitcode 1
[2011-06-14 00:48:37,137: ERROR/MainProcess] Consumer: Connection Error: [Errno 111] Connection refused. Trying again in 4 seconds...
On Ubuntu, I installed rabbitmq-server with apt and django-celery with pip into my virtualenv. Then I symlinked the "celeryd" initscript from https://github.com/ask/celery/tree/master/contrib/debian/init.d into /etc/init.d, configured it in /etc/default/celeryd to use the django celeryd from my virtualenv, and made it "rebootproof" via (maybe "defaults" is the problem?)
update-rc.d celeryd defaults
Rather than running celeryd and celerybeat with separate initscripts, I just configured celeryd to include Beat (maybe that's the problem?):
CELERYD_OPTS="-v 2 -B -s celery -E"
Any pointers how to solve this issue?
If I
sudo /etc/init.d/celeryd restart
there are no complaints:
[2011-06-14 00:54:29,157: WARNING/MainProcess] celery@inquire has started.
[2011-06-14 00:54:29,161: INFO/Beat] child process calling self.run()
[2011-06-14 00:54:29,162: INFO/Beat] Celerybeat: Starting...
but I need to eliminate the need for any manual steps.
celerybeat's dependency on the broker service was indeed the issue.
Installing the initscript with
update-rc.d celeryd defaults
doesn't work here, because the rabbitmq-server script is also installed with sequence number 20 for start and kill. celerybeat's dependency has to be resolved by explicitly starting celeryd after (and killing it before) rabbitmq-server:
update-rc.d celeryd defaults 21 19
NB: I've actually opted for the separate celerybeat service instead of the -B invocation, and only did 21 19 for that script, i.e. the one with the problem.
I think the problem is not in Celery itself but in your startup script; when celeryd starts, the broker is probably not listening yet.
I'm using almost the same command as you and I don't have any issues; launching the celeryd script with the -B option is not wrong.
I think that in your boot sequence you have to wait for RabbitMQ to be fully up before launching celeryd, perhaps with a connection test as well.
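A simple way to add that connection test is a small wait-for-broker script run right before celeryd in the boot sequence. This is a minimal sketch assuming RabbitMQ on its default port 5672 on localhost; adjust host, port, and deadline to your setup.
#!/usr/bin/env python
"""Exit 0 once the broker accepts TCP connections, exit 1 if it never does."""
import socket
import sys
import time

HOST, PORT = 'localhost', 5672     # default RabbitMQ port
DEADLINE = time.time() + 120       # give up after two minutes

while time.time() < DEADLINE:
    try:
        socket.create_connection((HOST, PORT), timeout=2).close()
        sys.exit(0)                # broker is reachable; safe to start celeryd
    except (socket.error, socket.timeout):
        time.sleep(2)              # not listening yet; retry

sys.exit(1)                        # broker never came up; don't start celeryd
Calling this from the celeryd initscript (or a wrapper) just before starting the daemon avoids the race without relying on init ordering alone.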