celerybeat shutdown - initscript order? - rabbitmq

I'm trying to set up rabbitmq/celery/django-celery/django so that it is "rebootproof", i.e. everything just comes back up by itself after a reboot. Everything seems to work fine except this:
When I reboot, all services get started, but it seems celeryd is started before rabbitmq, and celerybeat subsequently gets terminated because it can't connect (?):
[2011-06-14 00:48:35,128: WARNING/MainProcess] celery@inquire has started.
[2011-06-14 00:48:35,130: INFO/Beat] child process calling self.run()
[2011-06-14 00:48:35,131: INFO/Beat] Celerybeat: Starting...
[2011-06-14 00:48:35,134: ERROR/MainProcess] Consumer: Connection Error: [Errno 111] Connection refused. Trying again in 2 seconds...
[2011-06-14 00:48:35,688: INFO/Beat] process shutting down
[2011-06-14 00:48:35,689: WARNING/Beat] Process Beat:
[2011-06-14 00:48:35,689: WARNING/Beat] Traceback (most recent call last):
...
[2011-06-14 00:48:35,756: WARNING/Beat] File "/home/inquire/inquire.env/lib/python2.6/site-packages/amqplib/client_0_8/transport.py", line 220, in create_transport
[2011-06-14 00:48:35,760: WARNING/Beat] return TCPTransport(host, connect_timeout)
[2011-06-14 00:48:35,761: WARNING/Beat] File "/home/inquire/inquire.env/lib/python2.6/site-packages/amqplib/client_0_8/transport.py", line 58, in __init__
[2011-06-14 00:48:35,761: WARNING/Beat] self.sock.connect((host, port))
[2011-06-14 00:48:35,761: WARNING/Beat] File "<string>", line 1, in connect
[2011-06-14 00:48:35,761: WARNING/Beat] error: [Errno 111] Connection refused
[2011-06-14 00:48:35,761: INFO/Beat] process exiting with exitcode 1
[2011-06-14 00:48:37,137: ERROR/MainProcess] Consumer: Connection Error: [Errno 111] Connection refused. Trying again in 4 seconds...
On Ubuntu, I installed rabbitmq-server with apt and django-celery with pip into my virtualenv. I then symlinked into /etc/init.d the "celeryd" initscript from https://github.com/ask/celery/tree/master/contrib/debian/init.d, configured it in /etc/default/celeryd to use the django celeryd from my virtualenv, and made it "rebootproof" via (maybe "defaults" is the problem?)
update-rc.d celeryd defaults
Rather than running celeryd and celerybeat with separate initscripts, I just configured celeryd to include Beat (maybe that's the problem?):
CELERYD_OPTS="-v 2 -B -s celery -E"
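For completeness, the relevant bits of my /etc/default/celeryd look roughly like this (the project directory and user are placeholders; only the virtualenv path matches the traceback above, and the variable names follow the sample config shipped with that generation of the initscript, so treat this as an approximation):
# placeholder project directory; run the django-celery celeryd from the virtualenv
CELERYD_CHDIR="/home/inquire/myproject"
CELERYD="/home/inquire/inquire.env/bin/python $CELERYD_CHDIR/manage.py celeryd"
CELERYD_USER="inquire"
CELERYD_OPTS="-v 2 -B -s celery -E"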
Any pointers on how to solve this issue?
If I
sudo /etc/init.d/celeryd restart
there are no complaints:
[2011-06-14 00:54:29,157: WARNING/MainProcess] celery@inquire has started.
[2011-06-14 00:54:29,161: INFO/Beat] child process calling self.run()
[2011-06-14 00:54:29,162: INFO/Beat] Celerybeat: Starting...
but I need to eliminate the need for any manual steps.

celerybeat's dependency on the broker service was indeed the issue.
Rather than installing the initscript with
update-rc.d celeryd defaults
the dependency has to be resolved explicitly: since the rabbitmq-server script is installed with sequence number 20 for both start and kill, celeryd must be started after it (and killed before it) by using
update-rc.d celeryd defaults 21 19
NB: I've actually opted for the separate celerybeat service instead of the -B invocation, and only used 21 19 for that script, i.e. the one that had the problem.
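If the symlinks from the earlier defaults run are already in place, they have to be removed before the new sequence numbers take effect; roughly (shown here for the separate celerybeat script I ended up using; substitute celeryd if you kept the combined setup):
# remove the old rc symlinks, then re-add them so the service starts after (S21 > S20) and stops before (K19 < K20) rabbitmq-server
sudo update-rc.d -f celerybeat remove
sudo update-rc.d celerybeat defaults 21 19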

I think the problem is not in celery itself but in your script: when celeryd starts, the broker is probably not listening yet.
I'm using almost the same command as you and I don't have any issue; launching the celeryd script with the -B option is not wrong.
I think that in your boot script you have to wait for rabbitmq to finish starting before launching celeryd, maybe with a connection test as well.
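Something along these lines, as a minimal sketch, assuming the broker runs locally on the default port 5672 and netcat is available (adjust host, port and the retry limit to your setup):
# wait until rabbitmq accepts connections, then start celeryd
for i in $(seq 1 30); do
    nc -z localhost 5672 && break
    sleep 1
done
/etc/init.d/celeryd start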

Related

Celery task disconnects from redis broker and takes a long time to reconnect

I have a pretty stable setup with celery 4.4.7 and a redis node on AWS.
However, once every few weeks, I notice that the celery workers suddenly stop processing the queue.
If I inspect the logs, I see the following:
[2021-10-14 06:36:52,250: DEBUG/ForkPoolWorker-1] Executing task PutObjectTask(transfer_id=0, {'bucket': 'cdn.my-app.com', 'key': 'thumbs/-0DqXnEL1k3J_1824x0_n1wMX-gx.jpg', 'extra_args': {'ContentType': 'image/jpeg', 'Expires': 'Wed, 31 Dec 2036 23:59:59 GMT', 'StorageClass': 'INTELLIGENT_TIERING', 'ACL': 'private'}}) with kwargs {'client': <botocore.client.S3 object at 0xffff90746040>, 'fileobj': <s3transfer.utils.ReadFileChunk object at 0xffff9884d670>, 'bucket': 'cdn.my-app.com', 'key': 'thumbs/-0DqXnEL1k3J_1824x0_n1wMX-gx.jpg', 'extra_args': {'ContentType': 'image/jpeg', 'Expires': 'Wed, 31 Dec 2036 23:59:59 GMT', 'StorageClass': 'INTELLIGENT_TIERING', 'ACL': 'private'}}
[2021-10-14 06:36:52,253: DEBUG/ForkPoolWorker-1] Releasing acquire 0/None
[2021-10-14 06:36:52,429: DEBUG/ForkPoolWorker-1] Releasing acquire 0/None
[2021-10-14 06:37:50,202: INFO/MainProcess] missed heartbeat from celery@3a05aa0db663
[2021-10-14 06:37:50,202: INFO/MainProcess] missed heartbeat from celery@7c9532243cd0
[2021-10-14 06:37:50,202: INFO/MainProcess] missed heartbeat from celery@686f7e96d04f
[2021-10-14 06:54:06,510: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 706, in send_packed_command
sendall(self._sock, item)
File "/usr/local/lib/python3.8/dist-packages/redis/_compat.py", line 9, in sendall
return sock.sendall(*args, **kwargs)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/usr/local/lib/python3.8/dist-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.8/dist-packages/celery/worker/consumer/consumer.py", line 599, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python3.8/dist-packages/celery/worker/loops.py", line 83, in asynloop
next(loop)
File "/usr/local/lib/python3.8/dist-packages/kombu/asynchronous/hub.py", line 364, in create_loop
cb(*cbargs)
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 1083, in on_readable
self.cycle.on_readable(fileno)
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 354, in on_readable
chan.handlers[type]()
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 688, in _receive
ret.append(self._receive_one(c))
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 698, in _receive_one
response = c.parse_response()
File "/usr/local/lib/python3.8/dist-packages/redis/client.py", line 3501, in parse_response
self.check_health()
File "/usr/local/lib/python3.8/dist-packages/redis/client.py", line 3521, in check_health
conn.send_command('PING', self.HEALTH_CHECK_MESSAGE,
File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 725, in send_command
self.send_packed_command(self.pack_command(*args),
File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 717, in send_packed_command
raise ConnectionError("Error %s while writing to socket. %s." %
redis.exceptions.ConnectionError: Error 110 while writing to socket. Connection timed out.
[2021-10-14 06:54:06,520: DEBUG/MainProcess] Canceling task consumer...
[2021-10-14 06:54:06,538: INFO/MainProcess] Connected to redis://my-redis-endpoint.cache.amazonaws.com:6379/0
[2021-10-14 06:54:06,548: INFO/MainProcess] mingle: searching for neighbors
[2021-10-14 07:09:45,507: INFO/MainProcess] mingle: sync with 1 nodes
[2021-10-14 07:09:45,507: DEBUG/MainProcess] mingle: processing reply from celery@3a05aa0db663
[2021-10-14 07:09:45,508: INFO/MainProcess] mingle: sync complete
[2021-10-14 07:09:45,536: INFO/MainProcess] Received task: my-app.tasks.retrieve_thumbnail[f50d66fd-28f5-4f1d-9b69-bf79daab43c0]
Please note the amount of time that passed with seemingly no activity between 06:37:50,202 (loss of connection from other workers detected) and 06:54:06,510 (loss of connection to the broker detected), and also between 06:54:06,548 (reconnected to the broker) and 07:09:45,536 (started receiving tasks again).
All in all, my celery setup was not processing tasks between 06:37:50 and 07:09:45, which impacted the functioning of my app.
Any suggestions on how to tackle this issue?
Thanks!

Airflow SSH received SIGTERM after warning: Recorded pid 1098 does not match the current pid 31631

I'll need a bit of help here.
I'm running Airflow in a Docker container with the LocalExecutor.
Airflow version is 2.0.0 (https://pypi.org/project/apache-airflow/2.0.0/)
I'm running a long-running task with a wrapper around SSHOperator.
Basically I open an SSH session to run a spark-submit job on the Spark edge node.
(The YARN job succeeds but the Airflow task fails.)
Task starts with PID 31675:
[2021-06-24 18:29:09,664] {standard_task_runner.py:51} INFO - Started process 31675 to run task
Then after some time I get this warning:
Recorded pid 1098 does not match the current pid 31631
And then the task fails:
[2021-06-24 19:45:44,493] {local_task_job.py:166} WARNING - Recorded pid 1098 does not match the current pid 31631
[2021-06-24 19:45:44,496] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 31675
[2021-06-24 19:45:44,496] {taskinstance.py:1214} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-06-24 19:45:44,528] {taskinstance.py:1396} ERROR - LatamSSH operator error: Task received SIGTERM signal
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/latamairflow/operators/latam_ssh_operator.py", line 453, in execute
readq, _, _ = select([channel], [], [], self.timeout)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1216, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1086, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1260, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1300, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/latamairflow/operators/latam_ssh_operator.py", line 502, in execute
raise AirflowException("LatamSSH operator error: {0}".format(str(e)))
airflow.exceptions.AirflowException: LatamSSH operator error: Task received SIGTERM signal
[2021-06-24 19:45:44,529] {taskinstance.py:1440} INFO - Marking task as FAILED.

[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed on automate-git.py

I'm trying to compile CEF locally on my Ubuntu 20.10 machine, but my automate-git.py can't finish due to a strange error while running hooks:
Apply runhooks.patch in /home/user/code/chromium_git/chromium/src
9 5 build/toolchain/win/setup_toolchain.py
11 0 build/vs_toolchain.py
... successfully applied.
-------- Running "gclient runhooks --jobs 16" in "/home/user/code/chromium_git/chromium"...
Running hooks: 5% ( 6/101) nacltools
________ running 'vpython src/build/download_nacl_toolchains.py --mode nacl_core_sdk sync --extract' in '/home/user/code/chromium_git/chromium'
INFO: --Syncing arm_trusted to revision 2--
INFO: Downloading package archive: emulator_arm_trusted_precise.tgz (1/1)
package_version: Could not download URL (https://storage.googleapis.com/nativeclient-archive2/toolchain/2/emulator_arm_trusted_precise.tgz): <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)>
Error: Command 'vpython src/build/download_nacl_toolchains.py --mode nacl_core_sdk sync --extract' returned non-zero exit status 1 in /home/user/code/chromium_git/chromium
Traceback (most recent call last):
File "../automate/automate-git.py", line 1385, in <module>
run("gclient runhooks --jobs 16", chromium_dir, depot_tools_dir)
File "../automate/automate-git.py", line 69, in run
return subprocess.check_call(
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gclient', 'runhooks', '--jobs', '16']' returned non-zero exit status 2.
On a restart it succeeds, though, but there are compilation errors later on. I put check_certificate = off in ~/.wgetrc and insecure in ~/.curlrc, but no luck yet. What do I do?
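For the record, this is exactly what I added (note that these settings only affect wget and curl, so they may well never reach the Python urlopen call shown in the error above):
# ~/.wgetrc
check_certificate = off
# ~/.curlrc
insecure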
I solved this issue by setting up a VM and compiling CEF inside it, and it all magically started working, so I guess it was my system's issue.

Ambari cluster restart error: Timeline Service V2.0 Reader not restarting

I'm attempting to restart an Ambari-managed cluster and getting errors related to starting the Timeline Service V2.0 Reader service:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/timelinereader.py", line 108, in <module>
ApplicationTimelineReader().execute()
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 353, in execute
method(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/timelinereader.py", line 51, in start
hbase(action='start')
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/hbase_service.py", line 80, in hbase
createTables()
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/hbase_service.py", line 147, in createTables
logoutput=True)
File "/usr/lib/ambari-agent/lib/resource_management/core/base.py", line 166, in __init__
self.env.run()
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 160, in run
self.run_action(resource, action)
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 124, in run_action
provider_action()
File "/usr/lib/ambari-agent/lib/resource_management/core/providers/system.py", line 263, in action_run
returns=self.resource.returns)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 72, in inner
result = function(command, **kwargs)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 102, in checked_call
tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy, returns=returns)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 150, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 308, in _call
raise ExecuteTimeoutException(err_msg)
resource_management.core.exceptions.ExecuteTimeoutException: Execution of 'ambari-sudo.sh su yarn-ats -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/lib64/qt-3.3/bin:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/maven/bin:/root/bin:/opt/maven/bin:/opt/maven/bin:/var/lib/ambari-agent'"'"' ; sleep 10;export HBASE_CLASSPATH_PREFIX=/usr/hdp/3.0.0.0-1634/hadoop-yarn/timelineservice/*; /usr/hdp/3.0.0.0-1634/hbase/bin/hbase --config /usr/hdp/3.0.0.0-1634/hadoop/conf/embedded-yarn-ats-hbase org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -Dhbase.client.retries.number=35 -create -s'' was killed due timeout after 300 seconds
I have not changed any configs or installed anything new before this restart attempt; I simply stopped the cluster services and attempted to restart them. I'm not sure what this error message means. Any debugging tips or fixes?
Found the solution on another community post:
Navigate to the host where the Timeline Reader is installed and install the HBase Client on that host.
Here is how I installed the HBase Client via the Ambari UI:
In the Ambari UI, go to Hosts, then click the host you want to install the HBase client component on.
In the list of components, you will have the option to add more.
From there I installed the HBase client.
Then I stopped and restarted the cluster via the Ambari UI. I got a notification about stale configs (though I'm not sure whether that was my problem all along, or whether installing the HBase Client raised the stale configs alert).
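For anyone who prefers scripting this instead of clicking through the UI, the standard Ambari REST endpoint for adding a host component should be roughly equivalent (cluster name, hostname, server address and credentials are placeholders; this sketch is my own addition rather than part of the original answer, so verify it against your Ambari version):
# register the HBASE_CLIENT component on the Timeline Reader host, then ask Ambari to install it
curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
  http://ambari-server:8080/api/v1/clusters/MYCLUSTER/hosts/timeline-reader-host/host_components/HBASE_CLIENT
curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
  -d '{"HostRoles": {"state": "INSTALLED"}}' \
  http://ambari-server:8080/api/v1/clusters/MYCLUSTER/hosts/timeline-reader-host/host_components/HBASE_CLIENT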

Celery not connecting to Redis broker: Connection to broker lost

I'm trying to get Redis to work as a broker for my Celery 3.0.19 install on Django. I see that redis-server is running on port 6379. When I run a simple Celery test, I get the following stack trace:
Ubuntu Lucid 10.04
Celery 3.0.19
celery -A tasks worker --loglevel=info
[2013-05-02 18:56:27,835: INFO/MainProcess] consumer: Connected to redis://127.0.0.1:6379/0.
[2013-05-02 18:56:27,835: ERROR/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/celery/worker/consumer.py", line 394, in start
self.reset_connection()
File "/usr/local/lib/python2.6/dist-packages/celery/worker/consumer.py", line 744, in reset_connection
self.connection, on_decode_error=self.on_decode_error,
File "/usr/local/lib/python2.6/dist-packages/celery/app/amqp.py", line 311, in __init__
**kw
File "/usr/local/lib/python2.6/dist-packages/kombu/messaging.py", line 355, in __init__
self.revive(self.channel)
File "/usr/local/lib/python2.6/dist-packages/kombu/messaging.py", line 367, in revive
self.declare()
File "/usr/local/lib/python2.6/dist-packages/kombu/messaging.py", line 377, in declare
queue.declare()
File "/usr/local/lib/python2.6/dist-packages/kombu/entity.py", line 490, in declare
self.queue_declare(nowait, passive=False)
File "/usr/local/lib/python2.6/dist-packages/kombu/entity.py", line 516, in queue_declare
nowait=nowait)
File "/usr/local/lib/python2.6/dist-packages/kombu/transport/virtual/__init__.py", line 404, in queue_declare
return queue, self._size(queue), 0
File "/usr/local/lib/python2.6/dist-packages/kombu/transport/redis.py", line 516, in _size
sizes = cmds.execute()
File "/usr/local/lib/python2.6/dist-packages/redis/client.py", line 1919, in execute
return execute(conn, stack, raise_on_error)
File "/usr/local/lib/python2.6/dist-packages/redis/client.py", line 1811, in _execute_transaction
self.parse_response(connection, '_')
File "/usr/local/lib/python2.6/dist-packages/redis/client.py", line 1882, in parse_response
self, connection, command_name, **options)
File "/usr/local/lib/python2.6/dist-packages/redis/client.py", line 387, in parse_response
response = connection.read_response()
File "/usr/local/lib/python2.6/dist-packages/redis/connection.py", line 312, in read_response
raise response
ResponseError: unknown command 'MULTI'
You need redis version >= 2.2.0.
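A quick way to confirm which version the server is actually running, assuming the redis command-line tools are installed:
# either of these shows the server version
redis-server --version
redis-cli info | grep redis_version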