I'm intermittently (about 20% of the time) getting an IOError exception from Celery when I attempt to retry a failed task.
Here is my task:
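import urllib2
from celery.task import task

@task
def update_data(pk_id):
    try:
        pk = PK.objects.get(pk=pk_id)  # PK is a Django model defined elsewhere in the project
        results = pk.get_update()
        return results
    except urllib2.HTTPError as exc:
        print "Let's retry in a few minutes."
        # Ask Celery to run this task again in 10 minutes.
        update_data.retry(exc=exc, countdown=600)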
The exception:
[2011-10-07 11:35:53,594: ERROR/MainProcess] Task report.tasks.update_data[1babd4e3-45eb-4fa3-a497-68b67bb4a6df] raised exception: IOError()
Traceback (most recent call last):
File "/home/prj/prj_env/lib/python2.6/site-packages/celery/execute/trace.py", line 36, in trace
return cls(states.SUCCESS, retval=fun(*args, **kwargs))
File "/home/prj/prj_env/lib/python2.6/site-packages/celery/app/task/__init__.py", line 232, in __call__
return self.run(*args, **kwargs)
File "/home/prj/prj_env/lib/python2.6/site-packages/celery/app/__init__.py", line 172, in run
return fun(*args, **kwargs)
File "/home/prj/prj/report/tasks.py", line 109, in update_data
update_data.retry(exc=exc, countdown=600)
File "/home/prj/prj_env/lib/python2.6/site-packages/celery/app/task/__init__.py", line 520, in retry
self.name, options["task_id"], args, kwargs))
HTTPError
RabbitMQ Logs
=INFO REPORT==== 7-Oct-2011::15:35:43 ===
closing TCP connection <0.4294.17> from 10.254.122.225:59704
=WARNING REPORT==== 7-Oct-2011::15:35:43 ===
exception on TCP connection <0.4330.17> from 10.254.122.225:59715
connection_closed_abruptly
=INFO REPORT==== 7-Oct-2011::15:35:43 ===
closing TCP connection <0.4330.17> from 10.254.122.225:59715
=WARNING REPORT==== 7-Oct-2011::15:35:49 ===
exception on TCP connection <0.4313.17> from 10.254.122.225:59709
connection_closed_abruptly
=INFO REPORT==== 7-Oct-2011::15:35:49 ===
closing TCP connection <0.4313.17> from 10.254.122.225:59709
=WARNING REPORT==== 7-Oct-2011::15:35:49 ===
exception on TCP connection <0.4350.17> from 10.254.122.225:59720
connection_closed_abruptly
=INFO REPORT==== 7-Oct-2011::15:35:49 ===
closing TCP connection <0.4350.17> from 10.254.122.225:59720
=INFO REPORT==== 7-Oct-2011::15:36:22 ===
accepted TCP connection on [::]:5672 from 10.255.199.63:50526
=INFO REPORT==== 7-Oct-2011::15:36:22 ===
starting TCP connection <0.4501.17> from 10.255.199.63:50526
Any ideas why this might be happening?
Thanks!
Maybe save each task in the database and retry it if no result is
received for some time? Or maybe the dispatcher has its own persistent
storage? And what happens if a worker thread crashes while receiving the task or
while executing it?
Retry Lost or Failed Tasks (Celery, Django and RabbitMQ)
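By default, max_retries in Celery is 3, so once a task has used up its retries (which apparently happens in about 20% of your runs), retry() re-raises the exception passed in exc instead of scheduling another attempt. urllib2.HTTPError is a subclass of IOError, which is why the log reports IOError().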
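To make the retry limit concrete, here is a minimal pure-Python model of the behaviour described above. It does not use Celery at all: retry(), RetryScheduled and run_until_done() are hypothetical stand-ins written for illustration, and the only assumption taken from Celery is the default limit of 3 retries.

```python
class RetryScheduled(Exception):
    """Stand-in signal meaning 'another attempt was scheduled' (not Celery's class)."""


def retry(exc, retries_so_far, max_retries=3):
    """Model of the retry decision: re-raise `exc` once the limit is reached."""
    if max_retries is not None and retries_so_far >= max_retries:
        raise exc
    raise RetryScheduled(retries_so_far + 1)


def run_until_done(task, max_retries=3):
    """Call `task` until it succeeds or retries run out; return (result, attempts)."""
    retries = 0
    while True:
        try:
            return task(), retries + 1
        except IOError as exc:
            try:
                retry(exc, retries, max_retries)
            except RetryScheduled:
                retries += 1  # a real worker would wait `countdown` seconds here
```

In this model a task that always fails is executed four times (the first run plus three retries) before the original exception escapes, which is the point at which the worker would log it as a task failure.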
I'm trying to sort out why my local RabbitMQ is not starting.
I had an issue with a previous version of RabbitMQ on the system not starting, so I decided to uninstall it and reinstall using Chocolatey. The service had stopped starting after it had accumulated quite a few messages in the queue and the system had gone to sleep and restarted multiple times. The uninstall did remove all the files from the AppData\Roaming\RabbitMQ directory, the service wasn't running, and the system was rebooted.
I currently have RabbitMQ 3.8.2, which was installed with Erlang 20.0.
Here's the snippet from the RabbitMQ log file:
=INFO REPORT==== 22-Jan-2020::19:39:24 ===
Starting RabbitMQ 3.6.11 on Erlang 20.0
Copyright (C) 2007-2017 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
=INFO REPORT==== 22-Jan-2020::19:39:24 ===
node : rabbit@myhostname
home dir : C:\WINDOWS
config file(s) : c:/Users/username/AppData/Roaming/RabbitMQ/rabbitmq.config
cookie hash : a hash goes here
log : C:/Users/username/AppData/Roaming/RabbitMQ/log/RABBIT~1.LOG
sasl log : C:/Users/username/AppData/Roaming/RabbitMQ/log/RABBIT~2.LOG
database dir : c:/Users/username/AppData/Roaming/RabbitMQ/db/RABBIT~1
=INFO REPORT==== 22-Jan-2020::19:39:25 ===
RabbitMQ hasn't finished starting yet. Waiting for startup to finish before stopping...
=INFO REPORT==== 22-Jan-2020::19:39:31 ===
Memory high watermark set to 6505 MiB (6821275238 bytes) of 16263 MiB (17053188096 bytes) total
=INFO REPORT==== 22-Jan-2020::19:39:31 ===
Enabling free disk space monitoring
=INFO REPORT==== 22-Jan-2020::19:39:31 ===
Disk free limit set to 50MB
=INFO REPORT==== 22-Jan-2020::19:39:31 ===
Limiting to approx 8092 file handles (7280 sockets)
=INFO REPORT==== 22-Jan-2020::19:39:31 ===
FHC read buffering: OFF
FHC write buffering: ON
=INFO REPORT==== 22-Jan-2020::19:39:31 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left
=INFO REPORT==== 22-Jan-2020::19:39:31 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left
=INFO REPORT==== 22-Jan-2020::19:39:31 ===
Priority queues enabled, real BQ is rabbit_variable_queue
=INFO REPORT==== 22-Jan-2020::19:39:52 ===
Error description:
{could_not_start,rabbit,
{error,
{{shutdown,
{failed_to_start_child,rabbit_epmd_monitor,
{{badmatch,noport},
[{rabbit_epmd_monitor,init,1,
[{file,"src/rabbit_epmd_monitor.erl"},{line,56}]},
{gen_server,init_it,2,
[{file,"gen_server.erl"},{line,365}]},
{gen_server,init_it,6,
[{file,"gen_server.erl"},{line,333}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,247}]}]}}},
{child,undefined,rabbit_epmd_monitor_sup,
{rabbit_restartable_sup,start_link,
[rabbit_epmd_monitor_sup,
{rabbit_epmd_monitor,start_link,[]},
false]},
transient,infinity,supervisor,
[rabbit_restartable_sup]}}}}
Log files (may contain more information):
C:/Users/username/AppData/Roaming/RabbitMQ/log/RABBIT~1.LOG
C:/Users/username/AppData/Roaming/RabbitMQ/log/RABBIT~2.LOG
=ERROR REPORT==== 22-Jan-2020::19:39:53 ===
Error trying to stop RabbitMQ: error:{badmatch,false}
=INFO REPORT==== 22-Jan-2020::19:39:53 ===
Halting Erlang VM with the following applications:
sasl
stdlib
kernel
Not a lot of help to a new RabbitMQ user trying to get an install working.
These are the first few lines from the erl_crash.dump file in the same dir as the logs:
=erl_crash_dump:0.3
Wed Jan 22 20:38:13 2020
Slogan: init terminating in do_boot ({undef,[{rabbit_nodes_common,make,rabbit@myhostname,[]},{rabbit_prelaunch,start,0,[{_},{_}]},{init,start_em,1,[{_},{_}]},{init,do_boot,3,[{_},{_}]}]})
System version: Erlang/OTP 20 [erts-9.0] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10]
Compiled: Tue Jun 20 19:49:32 2017
I've been going through the docs here, but haven't found much of a solution to this.
We are using RabbitMQ for Celery task execution. We had one queue holding over 230,000 tasks, which crashed yesterday with the log below:
<code>2019-02-11 22:30:32,770 WARNING 13003 [celery.worker.consumer] consumer.py:289 - consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/celery/worker/consumer.py", line 278, in start
blueprint.start(self)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/celery/bootsteps.py", line 123, in start
step.start(parent)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/celery/worker/consumer.py", line 821, in start
c.loop(*c.loop_args())
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/celery/worker/loops.py", line 70, in asynloop
next(loop)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/kombu/async/hub.py", line 340, in create_loop
cb(*cbargs)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/kombu/transport/base.py", line 164, in on_readable
reader(loop)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/kombu/transport/base.py", line 146, in _read
drain_events(timeout=0)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/amqp/connection.py", line 324, in drain_events
return amqp_method(channel, args)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/amqp/channel.py", line 1647, in _basic_cancel_notify
raise ConsumerCancelled(consumer_tag, (60, 30))
ConsumerCancelled: Basic.cancel: (0) None8
2019-02-11 22:30:32,878 INFO 13003 [celery.worker.consumer] consumer.py:479 - Connected to amqp://celery:**@127.0.0.1:5672//
2019-02-11 22:31:20,308 ERROR 13003 [celery.worker.consumer] consumer.py:364 - consumer: Cannot connect to amqp://celery:**@127.0.0.1:5672//: [Errno 104] Connection res$
Trying again in 2.00 seconds...
</code>
After RabbitMQ crashed, I restarted it using the command below:
sudo service rabbitmq-server restart
Once RabbitMQ had restarted, I had lost all my queues. The queue was durable and the message delivery mode was non-persistent.
Is there any way to recover the messages that were in the queue? They held very important user data that was being processed.
Nope. Non-persistent means they were in RAM, not stored on the disk.
A general comment - RabbitMQ is not a database. Even if you had set the messages to be persistent, expecting a message broker to reliably handle temporary storage of 200,000 messages is madness. Your system should be designed such that the broker is a buffer between tasks, with an average queue length of zero. If you find such large numbers, either speed up processing or store the data in a database designed to survive occasional restarts with little to no consequences.
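As a rough sketch of the "store it in a database, use the broker only as a buffer" idea: the job is written to a durable table first and only its id travels through the queue, so a broker restart loses nothing that cannot be re-queued. Everything here is an illustrative assumption: sqlite3 stands in for the real database, a deque stands in for the RabbitMQ queue, and the table and function names are made up.

```python
import sqlite3
from collections import deque


def open_store(path=":memory:"):
    """Create the job table (use a file path in real use so it survives restarts)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    return db


def submit(db, queue, payload):
    """Record the job durably first, then hand only its id to the broker."""
    cur = db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    db.commit()
    queue.append(cur.lastrowid)
    return cur.lastrowid


def process_one(db, queue):
    """Take one id off the queue and mark that job as done."""
    job_id = queue.popleft()
    db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    db.commit()
    return job_id


def requeue_pending(db, queue):
    """After a broker restart, rebuild the queue from jobs that are not done yet."""
    rows = db.execute(
        "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    queue.extend(job_id for (job_id,) in rows)
    return len(rows)
```

If the in-memory queue is wiped while jobs are still pending, calling requeue_pending() puts exactly those jobs back, which is the recovery path that was missing in the situation above.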
I want to test an RPC nameko server with .gitlab-ci.yml.
I can't get RabbitMQ to work inside .gitlab-ci.yml::
image: python:latest
before_script:
  - apt-get update -yq
  - apt-get install -y python-dev python-pip tree
  - curl -I http://guest:guest@rabbitmq:8080/api/overview
mytest:
  artifacts:
    paths:
      - dist
  script:
    - pip install -r requirements.txt
    - pip install .
    - pytest --amqp-uri=amqp://guest:guest@rabbitmq:5672 --rabbit-ctl-uri=http://guest:guest@rabbitmq:15672 tests
    # - python setup.py test
    - python setup.py bdist_wheel
look:
  stage: deploy
  script:
    - ls -lah dist
services:
  - rabbitmq:3-management
RabbitMQ starts correctly::
2017-04-13T18:19:23.436309219Z
2017-04-13T18:19:23.436409026Z RabbitMQ 3.6.9. Copyright (C) 2007-2016 Pivotal Software, Inc.
2017-04-13T18:19:23.436432568Z ## ## Licensed under the MPL. See http://www.rabbitmq.com/
2017-04-13T18:19:23.436451431Z ## ##
2017-04-13T18:19:23.436468542Z ########## Logs: tty
2017-04-13T18:19:23.436485607Z ###### ## tty
2017-04-13T18:19:23.436501886Z ##########
2017-04-13T18:19:23.436519036Z Starting broker...
2017-04-13T18:19:23.440790736Z
2017-04-13T18:19:23.440809836Z =INFO REPORT==== 13-Apr-2017::18:19:23 ===
2017-04-13T18:19:23.440819014Z Starting RabbitMQ 3.6.9 on Erlang 19.3
2017-04-13T18:19:23.440827601Z Copyright (C) 2007-2016 Pivotal Software, Inc.
2017-04-13T18:19:23.440835737Z Licensed under the MPL. See http://www.rabbitmq.com/
2017-04-13T18:19:23.443408721Z
2017-04-13T18:19:23.443429311Z =INFO REPORT==== 13-Apr-2017::18:19:23 ===
2017-04-13T18:19:23.443439837Z node : rabbit@ea1a207b738e
2017-04-13T18:19:23.443449307Z home dir : /var/lib/rabbitmq
2017-04-13T18:19:23.443460663Z config file(s) : /etc/rabbitmq/rabbitmq.config
2017-04-13T18:19:23.443470393Z cookie hash : h6vFB5LezZ4GR1nGuQOVSg==
2017-04-13T18:19:23.443480053Z log : tty
2017-04-13T18:19:23.443489256Z sasl log : tty
2017-04-13T18:19:23.443498676Z database dir : /var/lib/rabbitmq/mnesia/rabbit@ea1a207b738e
2017-04-13T18:19:27.717290199Z
2017-04-13T18:19:27.717345348Z =INFO REPORT==== 13-Apr-2017::18:19:27 ===
2017-04-13T18:19:27.717355143Z Memory limit set to 3202MB of 8005MB total.
2017-04-13T18:19:27.726821043Z
2017-04-13T18:19:27.726841925Z =INFO REPORT==== 13-Apr-2017::18:19:27 ===
2017-04-13T18:19:27.726850927Z Disk free limit set to 50MB
2017-04-13T18:19:27.732864417Z
2017-04-13T18:19:27.732882507Z =INFO REPORT==== 13-Apr-2017::18:19:27 ===
2017-04-13T18:19:27.732891347Z Limiting to approx 1048476 file handles (943626 sockets)
2017-04-13T18:19:27.733030868Z
2017-04-13T18:19:27.733041770Z =INFO REPORT==== 13-Apr-2017::18:19:27 ===
2017-04-13T18:19:27.733049763Z FHC read buffering: OFF
2017-04-13T18:19:27.733126168Z FHC write buffering: ON
2017-04-13T18:19:27.793026622Z
2017-04-13T18:19:27.793043832Z =INFO REPORT==== 13-Apr-2017::18:19:27 ===
2017-04-13T18:19:27.793052900Z Database directory at /var/lib/rabbitmq/mnesia/rabbit@ea1a207b738e is empty. Initialising from scratch...
2017-04-13T18:19:27.800414211Z
2017-04-13T18:19:27.800429311Z =INFO REPORT==== 13-Apr-2017::18:19:27 ===
2017-04-13T18:19:27.800438013Z application: mnesia
2017-04-13T18:19:27.800464988Z exited: stopped
2017-04-13T18:19:27.800473228Z type: temporary
2017-04-13T18:19:28.129404329Z
2017-04-13T18:19:28.129482072Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.129491680Z Waiting for Mnesia tables for 30000 ms, 9 retries left
2017-04-13T18:19:28.153509130Z
2017-04-13T18:19:28.153526528Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.153535638Z Waiting for Mnesia tables for 30000 ms, 9 retries left
2017-04-13T18:19:28.193558406Z
2017-04-13T18:19:28.193600316Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.193611144Z Waiting for Mnesia tables for 30000 ms, 9 retries left
2017-04-13T18:19:28.194448672Z
2017-04-13T18:19:28.194464866Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.194475629Z Priority queues enabled, real BQ is rabbit_variable_queue
2017-04-13T18:19:28.208882072Z
2017-04-13T18:19:28.208912016Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.208921824Z Starting rabbit_node_monitor
2017-04-13T18:19:28.211145158Z
2017-04-13T18:19:28.211169236Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.211182089Z Management plugin: using rates mode 'basic'
2017-04-13T18:19:28.224499311Z
2017-04-13T18:19:28.224527962Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.224538810Z msg_store_transient: using rabbit_msg_store_ets_index to provide index
2017-04-13T18:19:28.226355958Z
2017-04-13T18:19:28.226376272Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.226385706Z msg_store_persistent: using rabbit_msg_store_ets_index to provide index
2017-04-13T18:19:28.227832476Z
2017-04-13T18:19:28.227870221Z =WARNING REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.227891823Z msg_store_persistent: rebuilding indices from scratch
2017-04-13T18:19:28.230832501Z
2017-04-13T18:19:28.230872729Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.230893941Z Adding vhost '/'
2017-04-13T18:19:28.385440862Z
2017-04-13T18:19:28.385520360Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.385540022Z Creating user 'guest'
2017-04-13T18:19:28.398092244Z
2017-04-13T18:19:28.398184254Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.398206496Z Setting user tags for user 'guest' to [administrator]
2017-04-13T18:19:28.413704571Z
2017-04-13T18:19:28.413789806Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.413810378Z Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2017-04-13T18:19:28.451109821Z
2017-04-13T18:19:28.451162892Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.451172185Z started TCP Listener on [::]:5672
2017-04-13T18:19:28.475429729Z
2017-04-13T18:19:28.475491074Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.475501172Z Management plugin started. Port: 15672
2017-04-13T18:19:28.475821397Z
2017-04-13T18:19:28.475835599Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.475844143Z Statistics database started.
2017-04-13T18:19:28.487572236Z completed with 6 plugins.
2017-04-13T18:19:28.487797794Z
2017-04-13T18:19:28.487809763Z =INFO REPORT==== 13-Apr-2017::18:19:28 ===
2017-04-13T18:19:28.487818426Z Server startup complete; 6 plugins started.
2017-04-13T18:19:28.487826288Z * rabbitmq_management
2017-04-13T18:19:28.487833914Z * rabbitmq_web_dispatch
2017-04-13T18:19:28.487841610Z * rabbitmq_management_agent
2017-04-13T18:19:28.487861057Z * amqp_client
2017-04-13T18:19:28.487875546Z * cowboy
2017-04-13T18:19:28.487883514Z * cowlib
*********
But I get this error:
$ pytest --amqp-uri=amqp://guest:guest@rabbitmq:5672 --rabbit-ctl-uri=http://guest:guest@rabbitmq:15672 tests
============================= test session starts ==============================
platform linux -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
...
E Exception: Connection error for the RabbitMQ management HTTP API at http://guest:guest@rabbitmq:15672/api/overview, is it enabled?
...
source:565: DeprecationWarning: invalid escape sequence \*
ERROR: Job failed: exit code 1
I used it the following way and it worked for me:
image: "ruby:2.3.3"  # not required by rabbitmq
services:
  - rabbitmq:latest
variables:
  RABBITMQ_DEFAULT_USER: guest
  RABBITMQ_DEFAULT_PASS: guest
  AMQP_URL: 'amqp://guest:guest@rabbitmq:5672'
Now you can use the AMQP_URL environment variable to connect to the RabbitMQ server. As a rule of thumb, any declared service is reachable under its image name (e.g. rabbitmq for rabbitmq:latest) as the hostname. However, if you are running it on your own server or Kubernetes cluster, the host will be localhost or 127.0.0.1. In my humble opinion, that might be the issue in your code. Hope it helps. :)
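For illustration, this is one way test code could pick up that variable instead of hard-coding the host. It is only a sketch using the standard library; the helper name broker_endpoint is made up, and the fallback URL simply mirrors the value from the configuration above.

```python
import os
from urllib.parse import urlsplit


def broker_endpoint(default="amqp://guest:guest@rabbitmq:5672"):
    """Return (host, port, user) parsed from the AMQP_URL environment variable."""
    url = urlsplit(os.environ.get("AMQP_URL", default))
    # 5672 is the standard AMQP port, used when the URL omits one.
    return url.hostname, url.port or 5672, url.username
```

With this in place the same tests can run against the CI service (host rabbitmq) or a local broker (host localhost) just by changing the environment variable.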
I ran into an issue where the Celery worker's connection to RabbitMQ hits a broken pipe error in gevent mode. There is no problem when the Celery worker runs in process pool mode (without gevent and without monkey patching).
After that, the Celery workers do not receive task messages from RabbitMQ any more until they are restarted.
The issue always happens when the Celery workers consume task messages more slowly than the Django applications produce them, and about 3 thousand messages have piled up in RabbitMQ.
Gevent version 1.1.0
Celery version 3.1.22
====== Celery log ======
[2016-08-08 13:52:06,913: CRITICAL/MainProcess] Couldn't ack 293, reason:error(32, 'Broken pipe')
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/kombu/message.py", line 93, in ack_log_error
self.ack()
File "/usr/local/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
self.channel.basic_ack(self.delivery_tag)
File "/usr/local/lib/python2.7/site-packages/amqp/channel.py", line 1584, in basic_ack
self._send_method((60, 80), args)
File "/usr/local/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
self.channel_id, method_sig, args, content,
File "/usr/local/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
write_frame(1, channel, payload)
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
frame_type, channel, size, payload, 0xce,
File "/usr/local/lib/python2.7/site-packages/gevent/_socket2.py", line 412, in sendall
timeleft = self.__send_chunk(chunk, flags, timeleft, end)
File "/usr/local/lib/python2.7/site-packages/gevent/_socket2.py", line 351, in __send_chunk
data_sent += self.send(chunk, flags)
File "/usr/local/lib/python2.7/site-packages/gevent/_socket2.py", line 320, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
======= Rabbitmq log ==================
=ERROR REPORT==== 8-Aug-2016::14:28:33 ===
closing AMQP connection <0.15928.4> (10.26.39.183:60732 -> 10.26.39.183:5672):
{writer,send_failed,{error,enotconn}}
=ERROR REPORT==== 8-Aug-2016::14:29:03 ===
closing AMQP connection <0.15981.4> (10.26.39.183:60736 -> 10.26.39.183:5672):
{writer,send_failed,{error,enotconn}}
=ERROR REPORT==== 8-Aug-2016::14:29:03 ===
closing AMQP connection <0.15955.4> (10.26.39.183:60734 -> 10.26.39.183:5672):
{writer,send_failed,{error,enotconn}}
A similar issue appears when the Celery worker uses eventlet.
[2016-08-09 17:41:37,952: CRITICAL/MainProcess] Couldn't ack 583, reason:error(32, 'Broken pipe')
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/kombu/message.py", line 93, in ack_log_error
self.ack()
File "/usr/local/lib/python2.7/site-packages/kombu/message.py", line 88, in ack
self.channel.basic_ack(self.delivery_tag)
File "/usr/local/lib/python2.7/site-packages/amqp/channel.py", line 1584, in basic_ack
self._send_method((60, 80), args)
File "/usr/local/lib/python2.7/site-packages/amqp/abstract_channel.py", line 56, in _send_method
self.channel_id, method_sig, args, content,
File "/usr/local/lib/python2.7/site-packages/amqp/method_framing.py", line 221, in write_method
write_frame(1, channel, payload)
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 182, in write_frame
frame_type, channel, size, payload, 0xce,
File "/usr/local/lib/python2.7/site-packages/eventlet/greenio/base.py", line 385, in sendall
tail = self.send(data, flags)
File "/usr/local/lib/python2.7/site-packages/eventlet/greenio/base.py", line 379, in send
return self._send_loop(self.fd.send, data, flags)
File "/usr/local/lib/python2.7/site-packages/eventlet/greenio/base.py", line 366, in _send_loop
return send_method(data, *args)
error: [Errno 32] Broken pipe
Added setup and load test info:
We use supervisor to launch Celery with the following options:
celery worker -A celerytasks.celery_worker_init -Q default -P gevent -c 1000 --loglevel=info
Celery uses RabbitMQ as the broker.
We have 4 Celery worker processes, set by specifying "numprocs=4" in the supervisor configuration.
We use JMeter to emulate web access load; the Django applications produce tasks for the Celery workers to consume. Those tasks basically need to access a MySQL DB to get/update some data.
According to the RabbitMQ web admin page, tasks are produced at about 50 per second while they are consumed at about 20 per second. After about 1 minute of load testing, the log file shows that many connections between RabbitMQ and Celery hit the Broken pipe error.
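As a back-of-the-envelope check of those rates (assuming they stay constant, which real load rarely does), the backlog grows by the difference between the two rates every second. The helper name below is made up for illustration.

```python
def backlog_after(seconds, produce_rate, consume_rate, initial=0):
    """Messages waiting in the queue after `seconds` at constant rates."""
    growth = (produce_rate - consume_rate) * seconds
    # A queue cannot hold a negative number of messages.
    return max(0, initial + growth)
```

At 50 produced and 20 consumed per second, backlog_after(60, 50, 20) gives 1800 messages after one minute, so a pile of a few thousand messages is reached within a couple of minutes of load.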
We noticed that this issue can also be caused by a combination of a high prefetch count and high concurrency.
We had concurrency set to 500 and prefetch set to 100, which means the effective prefetch is 500*100=50,000 per worker.
We had around 100k tasks piled up, and with this configuration one worker reserved all the tasks for itself while the other workers weren't used at all. That one worker kept getting the Broken pipe error and never acknowledged any task, so the tasks were never cleared from the queue.
We then changed the prefetch to 3 and restarted all the workers, which fixed the issue. Since lowering the prefetch we have seen 0 instances of the Broken pipe error, whereas before we saw it quite often.
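The arithmetic behind that can be written down directly. This is only a sketch of the calculation, assuming "prefetch" above refers to Celery's prefetch multiplier (the CELERYD_PREFETCH_MULTIPLIER setting in Celery 3.x, worker_prefetch_multiplier in later versions); the function name is made up.

```python
def effective_prefetch(concurrency, prefetch_multiplier):
    """Maximum number of unacknowledged messages one worker may reserve."""
    return concurrency * prefetch_multiplier


before = effective_prefetch(500, 100)  # the configuration that caused trouble
after = effective_prefetch(500, 3)     # the configuration after the fix
```

Here before is 50,000 and after is 1,500, so with the lower setting a single worker can hold back far fewer messages from the other workers.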
I'm having some trouble with keeping RabbitMQ up.
I start it via the provided /etc/init.d/rabbitmq-server start, and it starts up fine. status shows that it's fine.
But after a while, the server dies. status prints
Error: unable to connect to node 'rabbit@myserver': nodedown
Checking the log file, it seems I've reached the memory threshold. Here are the logs:
# start
=INFO REPORT==== 26-Mar-2014::03:24:13 ===
Limiting to approx 924 file handles (829 sockets)
=INFO REPORT==== 26-Mar-2014::03:24:13 ===
Memory limit set to 723MB of 1807MB total.
=INFO REPORT==== 26-Mar-2014::03:24:13 ===
Disk free limit set to 953MB
=INFO REPORT==== 26-Mar-2014::03:24:13 ===
Management plugin upgraded statistics to fine.
=INFO REPORT==== 26-Mar-2014::03:24:13 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 26-Mar-2014::03:24:13 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index
=WARNING REPORT==== 26-Mar-2014::03:24:13 ===
msg_store_persistent: rebuilding indices from scratch
=INFO REPORT==== 26-Mar-2014::03:24:27 ===
started TCP Listener on [::]:5672
=INFO REPORT==== 26-Mar-2014::03:24:27 ===
Management agent started.
=INFO REPORT==== 26-Mar-2014::03:24:27 ===
Management plugin started. Port: 55672, path: /
=INFO REPORT==== 26-Mar-2014::03:24:39 ===
accepting AMQP connection <0.1999.0> (127.0.0.1:34788 -> 127.0.0.1:5672)
=WARNING REPORT==== 26-Mar-2014::03:24:40 ===
closing AMQP connection <0.1999.0> (127.0.0.1:34788 -> 127.0.0.1:5672):
connection_closed_abruptly
=INFO REPORT==== 26-Mar-2014::03:24:42 ===
accepting AMQP connection <0.2035.0> (127.0.0.1:34791 -> 127.0.0.1:5672)
=INFO REPORT==== 26-Mar-2014::03:24:46 ===
accepting AMQP connection <0.2072.0> (127.0.0.1:34792 -> 127.0.0.1:5672)
=INFO REPORT==== 26-Mar-2014::03:25:19 ===
vm_memory_high_watermark set. Memory used:768651448 allowed:758279372
=INFO REPORT==== 26-Mar-2014::03:25:19 ===
alarm_handler: {set,{{resource_limit,memory,'rabbit@myserver'},
[]}}
=INFO REPORT==== 26-Mar-2014::03:25:48 ===
Statistics database started.
# server dies here
I seem to have reached the memory threshold, but according to the docs, that shouldn't shut down the server? It should just prevent publishing until some memory is freed up?
And yes, I am aware that my Celery workers are the cause of the memory usage. I'd just thought that RabbitMQ would handle it correctly, which the docs seem to imply. So am I doing something wrong?
EDIT: Refactored my task so its message is just a single string (max 15 chars). Doesn't seem to be making any difference.
I tried starting RabbitMQ and celery worker --purge, with no events coming in to trigger the tasks, but it seems RabbitMQ's memory usage still steadily climbs to 40%. It then crashes shortly afterwards. It crashes, with none of my tasks having the chance to run.
Updating RabbitMQ to official stable version fixed the issue. The RabbitMQ package in Ubuntu 12.04's repository was really old.
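For reference, the numbers in the log above can be checked by hand. The sketch below assumes the default vm_memory_high_watermark fraction of 0.4, which is consistent with the "723MB of 1807MB" line; the function names are made up for illustration and are not part of RabbitMQ.

```python
DEFAULT_WATERMARK = 0.4  # assumed default vm_memory_high_watermark fraction


def memory_limit_mb(total_mb, watermark=DEFAULT_WATERMARK):
    """Approximate memory limit the broker reports at startup, in MB."""
    return round(total_mb * watermark)


def alarm_raised(used_bytes, allowed_bytes):
    """The memory alarm is set once usage exceeds the allowed amount."""
    return used_bytes > allowed_bytes
```

memory_limit_mb(1807) gives 723, matching the startup line, and alarm_raised(768651448, 758279372) is true, matching the "vm_memory_high_watermark set" entry. The alarm on its own is only supposed to block publishers, which is why the crash pointed to a bug in the old package rather than to the configuration.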