Nova services intermittently fail to connect to RabbitMQ with a timeout error

I am using OpenStack Pike with RabbitMQ 3.6.5 on Erlang 18.3.4.4 as the messaging queue. When the nova services are restarted, one of them (usually nova-scheduler) logs the error shown below. It says the service cannot connect to RabbitMQ, and the error keeps repeating until the service is restarted again. Instance launches also fail because of this. The service openstack-nova- start commands themselves exit successfully.
2018-05-31 06:19:44.737 5510 INFO nova.service [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Starting scheduler node (version 16.1.1-1.el7)
2018-05-31 06:19:44.745 5510 DEBUG nova.service [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Creating RPC server for service scheduler start /usr/lib/python2.7/site-packages/nova/service.py:179
2018-05-31 06:19:44.749 5510 DEBUG oslo.messaging._drivers.pool [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Pool creating new connection create /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/pool.py:143
2018-05-31 06:19:44.753 5510 DEBUG oslo.messaging._drivers.impl_rabbit [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] [57700d60-c761-4072-b4a3-d33706864086] Connecting to AMQP server on localhost:5672 __init__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:597
2018-05-31 06:19:44.761 5510 DEBUG nova.scheduler.host_manager [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Found 2 cells: 00000000-0000-0000-0000-000000000000, 4963dd3d-246d-4f9b-9d4b-0cb1ab3032ae _load_cells /usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py:642
2018-05-31 06:19:44.761 5510 DEBUG nova.scheduler.host_manager [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] START:_async_init_instance_info _async_init_instance_info /usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py:421
2018-05-31 06:19:44.763 5510 DEBUG oslo_concurrency.lockutils [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Lock "00000000-0000-0000-0000-000000000000" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:270
2018-05-31 06:19:44.764 5510 DEBUG oslo_concurrency.lockutils [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] Lock "00000000-0000-0000-0000-000000000000" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282
2018-05-31 06:19:49.761 5510 DEBUG oslo.messaging._drivers.impl_rabbit [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] [57700d60-c761-4072-b4a3-d33706864086] Received recoverable error from kombu: on_error /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py:744
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 494, in _ensured
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return fun(*args, **kwargs)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 569, in __call__
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self.revive(self.connection.default_channel)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 819, in default_channel
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self.connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 802, in connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self._connection = self._establish_connection()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 757, in _establish_connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit conn = self.transport.establish_connection()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit conn.connect()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 300, in connect
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self.drain_events(timeout=self.connect_timeout)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return self.blocking_read(timeout)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 468, in blocking_read
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit frame = self.transport.read_frame()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 237, in read_frame
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit frame_header = read(7, True)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 377, in _read
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit s = recv(n - len(rbuf))
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 354, in recv
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return self._recv_loop(self.fd.recv, b'', bufsize, flags)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 348, in _recv_loop
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit self._read_trampoline()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 319, in _read_trampoline
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit timeout_exc=socket.timeout("timed out"))
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 203, in _trampoline
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit mark_as_closed=self._mark_as_closed)
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/hubs/__init__.py", line 162, in trampoline
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return hub.switch()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit return self.greenlet.switch()
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit timeout: timed out
2018-05-31 06:19:49.761 5510 ERROR oslo.messaging._drivers.impl_rabbit
2018-05-31 06:19:49.763 5510 ERROR oslo.messaging._drivers.impl_rabbit [req-ba58e560-cca2-43ee-a770-5bc94bc14624 - - - - -] [57700d60-c761-4072-b4a3-d33706864086] AMQP server on 127.0.0.1:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
The RabbitMQ logs show the connection being accepted and then closed with a handshake timeout:
=INFO REPORT==== 31-May-2018::06:39:41 === accepting AMQP connection <0.957.0> (127.0.0.1:33456 -> 127.0.0.1:5672)
=ERROR REPORT==== 31-May-2018::06:40:11 === closing AMQP connection <0.948.0> (127.0.0.1:33450 -> 127.0.0.1:5672): {handshake_timeout,frame_header}
On investigation I found that the RabbitMQ error suggests the client never sent the expected frame header, so the broker closed the connection. But I can't find a way to see what is wrong with the frame headers, or whether the timeout is caused by something else entirely.
Can anyone shed some light on how I can debug this further? Thanks.

I guess it's a problem with TCP buffers etc. Try increasing RabbitMQ's memory buffer and the number of TCP connections, and try disabling TCP selective acknowledgements. The two commands below are equivalent ways of turning SACK off:
$ sysctl -w net.ipv4.tcp_sack=0
$ echo 0 > /proc/sys/net/ipv4/tcp_sack
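Independent of the kernel tuning above, it can also help to drive the AMQP handshake directly with kombu (the library oslo.messaging uses underneath) to see whether the timeout reproduces outside of nova. A minimal sketch, assuming the default guest/guest credentials and the localhost:5672 endpoint from the logs:
import socket
from kombu import Connection

# Probe the broker the same way oslo.messaging would; adjust the URL to your transport_url.
conn = Connection('amqp://guest:guest@localhost:5672//', connect_timeout=10)
try:
    conn.connect()   # TCP connect plus the AMQP handshake that is timing out above
    print('AMQP handshake OK: %s' % conn.as_uri())
except socket.timeout:
    print('AMQP handshake timed out, matching the handshake_timeout in the broker log')
finally:
    conn.release()
If this probe also stalls, the problem lies between the host and the broker rather than in the nova services themselves.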

Related

Celery task disconnects from redis broker and takes a long time to reconnect

I have a pretty stable setup with celery 4.4.7 and a redis node on AWS.
However, once every few weeks, I notice that the celery workers suddenly stop processing the queue.
If I inspect the logs, I see the following:
[2021-10-14 06:36:52,250: DEBUG/ForkPoolWorker-1] Executing task PutObjectTask(transfer_id=0, {'bucket': 'cdn.my-app.com', 'key': 'thumbs/-0DqXnEL1k3J_1824x0_n1wMX-gx.jpg', 'extra_args': {'ContentType': 'image/jpeg', 'Expires': 'Wed, 31 Dec 2036 23:59:59 GMT', 'StorageClass': 'INTELLIGENT_TIERING', 'ACL': 'private'}}) with kwargs {'client': <botocore.client.S3 object at 0xffff90746040>, 'fileobj': <s3transfer.utils.ReadFileChunk object at 0xffff9884d670>, 'bucket': 'cdn.my-app.com', 'key': 'thumbs/-0DqXnEL1k3J_1824x0_n1wMX-gx.jpg', 'extra_args': {'ContentType': 'image/jpeg', 'Expires': 'Wed, 31 Dec 2036 23:59:59 GMT', 'StorageClass': 'INTELLIGENT_TIERING', 'ACL': 'private'}}
[2021-10-14 06:36:52,253: DEBUG/ForkPoolWorker-1] Releasing acquire 0/None
[2021-10-14 06:36:52,429: DEBUG/ForkPoolWorker-1] Releasing acquire 0/None
[2021-10-14 06:37:50,202: INFO/MainProcess] missed heartbeat from celery#3a05aa0db663
[2021-10-14 06:37:50,202: INFO/MainProcess] missed heartbeat from celery#7c9532243cd0
[2021-10-14 06:37:50,202: INFO/MainProcess] missed heartbeat from celery#686f7e96d04f
[2021-10-14 06:54:06,510: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 706, in send_packed_command
sendall(self._sock, item)
File "/usr/local/lib/python3.8/dist-packages/redis/_compat.py", line 9, in sendall
return sock.sendall(*args, **kwargs)
TimeoutError: [Errno 110] Connection timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/usr/local/lib/python3.8/dist-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.8/dist-packages/celery/worker/consumer/consumer.py", line 599, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python3.8/dist-packages/celery/worker/loops.py", line 83, in asynloop
next(loop)
File "/usr/local/lib/python3.8/dist-packages/kombu/asynchronous/hub.py", line 364, in create_loop
cb(*cbargs)
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 1083, in on_readable
self.cycle.on_readable(fileno)
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 354, in on_readable
chan.handlers[type]()
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 688, in _receive
ret.append(self._receive_one(c))
File "/usr/local/lib/python3.8/dist-packages/kombu/transport/redis.py", line 698, in _receive_one
response = c.parse_response()
File "/usr/local/lib/python3.8/dist-packages/redis/client.py", line 3501, in parse_response
self.check_health()
File "/usr/local/lib/python3.8/dist-packages/redis/client.py", line 3521, in check_health
conn.send_command('PING', self.HEALTH_CHECK_MESSAGE,
File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 725, in send_command
self.send_packed_command(self.pack_command(*args),
File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 717, in send_packed_command
raise ConnectionError("Error %s while writing to socket. %s." %
redis.exceptions.ConnectionError: Error 110 while writing to socket. Connection timed out.
[2021-10-14 06:54:06,520: DEBUG/MainProcess] Canceling task consumer...
[2021-10-14 06:54:06,538: INFO/MainProcess] Connected to redis://my-redis-endpoint.cache.amazonaws.com:6379/0
[2021-10-14 06:54:06,548: INFO/MainProcess] mingle: searching for neighbors
[2021-10-14 07:09:45,507: INFO/MainProcess] mingle: sync with 1 nodes
[2021-10-14 07:09:45,507: DEBUG/MainProcess] mingle: processing reply from celery#3a05aa0db663
[2021-10-14 07:09:45,508: INFO/MainProcess] mingle: sync complete
[2021-10-14 07:09:45,536: INFO/MainProcess] Received task: my-app.tasks.retrieve_thumbnail[f50d66fd-28f5-4f1d-9b69-bf79daab43c0]
Please note the long stretches with seemingly no activity: between 06:37:50,202 (missed heartbeats from the other workers) and 06:54:06,510 (detected loss of the connection to the broker), and again between 06:54:06,548 (reconnected to the broker) and 07:09:45,536 (started receiving tasks again).
All in all, my celery setup was not processing tasks between 06:37:50 and 07:09:45, which impacts the functioning of my app.
Any suggestions on how to tackle this issue?
Thanks!
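One thing worth trying (a sketch, not a verified fix) is tightening the Redis socket timeouts and enabling kombu's transport-level health checks, so that a dead connection is noticed within seconds instead of waiting on the kernel's TCP retransmission timeout. The option names below are kombu's standard Redis transport options; the numeric values are illustrative:
from celery import Celery

# Broker URL taken from the log above; the app name is a placeholder.
app = Celery('my_app', broker='redis://my-redis-endpoint.cache.amazonaws.com:6379/0')

app.conf.broker_transport_options = {
    'socket_timeout': 30,           # give up on a blocked read/write after 30 seconds
    'socket_connect_timeout': 10,   # fail fast when (re)connecting
    'socket_keepalive': True,       # let the OS detect half-open connections
    'health_check_interval': 25,    # periodically PING the broker (kombu >= 4.6.4)
}
app.conf.broker_connection_retry = True
app.conf.broker_connection_max_retries = None   # keep retrying forever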

SageMaker GetBucketLocation operation: Access Denied

I have a SageMaker notebook in the Oregon region, and I've run estimator.fit with data in an S3 bucket in the same region several times before. Today, on attempting to rerun this script, I'm getting the following error:
INFO:sagemaker:Creating training-job with name: sagemaker-tensorflow-scriptmode-2019-01-31-18-51-07-292
Creating tmp_axa2vmo_algo-1-s8vfu_1 ...
Attaching to tmp_axa2vmo_algo-1-s8vfu_1
algo-1-s8vfu_1 | 2019-01-31 18:51:11,828 sagemaker-containers INFO Imported framework sagemaker_tensorflow_container.training
algo-1-s8vfu_1 | 2019-01-31 18:51:11,940 sagemaker-containers ERROR Reporting training FAILURE
algo-1-s8vfu_1 | 2019-01-31 18:51:11,940 sagemaker-containers ERROR framework error:
algo-1-s8vfu_1 | Traceback (most recent call last):
algo-1-s8vfu_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 67, in train
algo-1-s8vfu_1 | entrypoint()
algo-1-s8vfu_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_tensorflow_container/training.py", line 167, in main
algo-1-s8vfu_1 | s3_utils.configure(env.hyperparameters.get('model_dir'), os.environ.get('SAGEMAKER_REGION'))
algo-1-s8vfu_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_tensorflow_container/s3_utils.py", line 23, in configure
algo-1-s8vfu_1 | os.environ['S3_REGION'] = _s3_region(job_region, model_dir)
algo-1-s8vfu_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_tensorflow_container/s3_utils.py", line 39, in _s3_region
algo-1-s8vfu_1 | bucket_location = s3.get_bucket_location(Bucket=bucket_name)['LocationConstraint']
algo-1-s8vfu_1 | File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 357, in _api_call
algo-1-s8vfu_1 | return self._make_api_call(operation_name, kwargs)
algo-1-s8vfu_1 | File "/usr/local/lib/python3.6/dist-packages/botocore/client.py", line 661, in _make_api_call
algo-1-s8vfu_1 | raise error_class(parsed_response, operation_name)
algo-1-s8vfu_1 | botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetBucketLocation operation: Access Denied
algo-1-s8vfu_1 |
algo-1-s8vfu_1 | An error occurred (AccessDenied) when calling the GetBucketLocation operation: Access Denied
tmp_axa2vmo_algo-1-s8vfu_1 exited with code 1
Aborting on container exit...
To my knowledge no role permissions or access policies have been updated. Any idea what could be causing this?
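One way to narrow this down is to call GetBucketLocation from the notebook itself with boto3; if it fails there too, the problem is the execution role's s3:GetBucketLocation permission (or a bucket policy), not anything SageMaker-specific. A sketch, with a placeholder bucket name:
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
try:
    # Use the bucket passed to estimator.fit / model_dir ('my-training-bucket' is a placeholder).
    resp = s3.get_bucket_location(Bucket='my-training-bucket')
    print('LocationConstraint: %s' % resp['LocationConstraint'])
except ClientError as err:
    # AccessDenied here means the role or a bucket policy lacks s3:GetBucketLocation.
    print('GetBucketLocation failed: %s' % err.response['Error']['Code'])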

Tensorflow Object-detection model training failed on google cloud

ERROR 2017-11-23 18:39:51 -0800 service The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
ERROR 2017-11-23 18:39:51 -0800 service Traceback (most recent call last):
ERROR 2017-11-23 18:39:51 -0800 service   File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
ERROR 2017-11-23 18:39:51 -0800 service     "__main__", fname, loader, pkg_name)
ERROR 2017-11-23 18:39:51 -0800 service   File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
ERROR 2017-11-23 18:39:51 -0800 service     exec code in run_globals
ERROR 2017-11-23 18:39:51 -0800 service   File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
ERROR 2017-11-23 18:39:51 -0800 service     from object_detection import trainer
ERROR 2017-11-23 18:39:51 -0800 service   File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
ERROR 2017-11-23 18:39:51 -0800 service     from object_detection.builders import preprocessor_builder
ERROR 2017-11-23 18:39:51 -0800 service   File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
ERROR 2017-11-23 18:39:51 -0800 service     from object_detection.protos import preprocessor_pb2
ERROR 2017-11-23 18:39:51 -0800 service   File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
ERROR 2017-11-23 18:39:51 -0800 service     options=None, file=DESCRIPTOR),
ERROR 2017-11-23 18:39:51 -0800 service TypeError: __new__() got an unexpected keyword argument 'file'
Using protobuf (3.5.0.post1).
But when I run the training locally, there is no error!
The Cloud ML Engine doesn't support the latest versions of TensorFlow or protobuf. You can see the current packages and versions here. Did you add protobuf to the list of required packages in setup.py?
In setup.py, you can request a more current version of TensorFlow with code like the following:
REQUIRED_PACKAGES = ['tensorflow>=1.3']
setup(
...
install_requires=REQUIRED_PACKAGES,
...
)
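For example, pinning both packages might look like the sketch below; the package name and version specifiers are illustrative, and the protobuf version should match whatever your *_pb2.py files were generated with:
from setuptools import find_packages, setup

# Pin tensorflow and protobuf so the Cloud ML Engine workers install versions
# compatible with the locally generated *_pb2.py files.
REQUIRED_PACKAGES = ['tensorflow>=1.3', 'protobuf>=3.3']

setup(
    name='object_detection_training',   # placeholder package name
    version='0.1',
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
)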

Airflow CROSSSLOT Keys in request don't hash to the same slot error using AWS ElastiCache

I am running apache-airflow 1.8.1 on AWS ECS and I have an AWS ElastiCache cluster (redis 3.2.4) running 2 shards / 2 nodes with multi-AZ enabled (clustered redis engine). I've verified that airflow can access the host/port of the cluster without any problem.
Here are the logs:
Thu Jul 20 01:39:21 UTC 2017 - Checking for redis (endpoint: redis://xxxxxx.xxxxxx.clustercfg.usw2.cache.amazonaws.com:6379) connectivity
Thu Jul 20 01:39:21 UTC 2017 - Connected to redis (endpoint: redis://xxxxxx.xxxxxx.clustercfg.usw2.cache.amazonaws.com:6379)
logging to s3://xxxx-xxxx-xxxx/logs/airflow
Starting worker
[2017-07-20 01:39:44,020] {__init__.py:57} INFO - Using executor CeleryExecutor
[2017-07-20 01:39:45,960] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2017-07-20 01:39:45,989] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
[2017-07-20 01:39:53,352] {__init__.py:57} INFO - Using executor CeleryExecutor
[2017-07-20 01:39:55,187] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2017-07-20 01:39:55,210] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
[2017-07-20 01:53:09,536: ERROR/MainProcess] Unrecoverable error: ResponseError("CROSSSLOT Keys in request don't hash to the same slot",)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/worker/__init__.py", line 206, in start
self.blueprint.start(self)
File "/usr/local/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
step.start(parent)
File "/usr/local/lib/python2.7/dist-packages/celery/bootsteps.py", line 374, in start
return self.obj.start()
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 278, in start
blueprint.start(self)
File "/usr/local/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
step.start(parent)
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 569, in start
replies = I.hello(c.hostname, revoked._data) or {}
File "/usr/local/lib/python2.7/dist-packages/celery/app/control.py", line 112, in hello
return self._request('hello', from_node=from_node, revoked=revoked)
File "/usr/local/lib/python2.7/dist-packages/celery/app/control.py", line 71, in _request
timeout=self.timeout, reply=True,
File "/usr/local/lib/python2.7/dist-packages/celery/app/control.py", line 307, in broadcast
limit, callback, channel=channel,
File "/usr/local/lib/python2.7/dist-packages/kombu/pidbox.py", line 294, in _broadcast
serializer=serializer)
File "/usr/local/lib/python2.7/dist-packages/kombu/pidbox.py", line 259, in _publish
maybe_declare(self.reply_queue(channel))
File "/usr/local/lib/python2.7/dist-packages/kombu/common.py", line 120, in maybe_declare
return _maybe_declare(entity, declared, ident, channel)
File "/usr/local/lib/python2.7/dist-packages/kombu/common.py", line 127, in _maybe_declare
entity.declare()
File "/usr/local/lib/python2.7/dist-packages/kombu/entity.py", line 522, in declare
self.queue_declare(nowait, passive=False)
File "/usr/local/lib/python2.7/dist-packages/kombu/entity.py", line 548, in queue_declare
nowait=nowait)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/virtual/__init__.py", line 447, in queue_declare
return queue_declare_ok_t(queue, self._size(queue), 0)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/redis.py", line 690, in _size
sizes = pipe.execute()
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2626, in execute
return execute(conn, stack, raise_on_error)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2518, in _execute_transaction
response = self.parse_response(connection, '_')
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2584, in parse_response
self, connection, command_name, **options)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 585, in parse_response
response = connection.read_response()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 582, in read_response
raise response
ResponseError: CROSSSLOT Keys in request don't hash to the same slot
I had the exact same issue and solved it by not using a clustered setup with ElastiCache. Perhaps Celery workers don't support clustered Redis; I was unable to find any information that definitively confirms this either way.
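For what it's worth, the CROSSSLOT error comes from kombu's Redis transport issuing multi-key operations (for example the MULTI/EXEC pipeline in _size shown in the traceback), and cluster-mode Redis only allows those when every key hashes to the same slot. A sketch that reproduces the error directly with redis-py, using the cluster configuration endpoint from the logs and illustrative key names:
import redis

# Placeholder host: the ElastiCache cluster configuration endpoint from the logs above.
r = redis.Redis(host='xxxxxx.xxxxxx.clustercfg.usw2.cache.amazonaws.com', port=6379)

try:
    # Two queue-style keys that almost certainly hash to different slots;
    # cluster mode rejects the multi-key command instead of serving it.
    r.mget('celery', 'celeryev.reply')
except redis.ResponseError as err:
    print(err)   # CROSSSLOT Keys in request don't hash to the same slot
Since the kombu version used here does not add cluster hash tags to its keys, a non-clustered (cluster mode disabled) replication group is the practical workaround.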

Ambari registration phase fails for SSL on EC2

I am trying to use Apache Ambari to configure a Hadoop cluster on EC2.
During the registration phase I get this error:
Command start time 2016-11-23 20:25:12
('Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 312, in <module>
main(heartbeat_stop_callback)
File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 248, in main
stop_agent()
File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 198, in stop_agent
sys.exit(1)
SystemExit: 1
INFO 2016-11-23 20:25:18,716 ExitHelper.py:53 - Performing cleanup before exiting...
INFO 2016-11-23 20:25:18,907 main.py:74 - loglevel=logging.INFO
INFO 2016-11-23 20:25:18,907 DataCleaner.py:39 - Data cleanup thread started
INFO 2016-11-23 20:25:18,908 DataCleaner.py:120 - Data cleanup started
INFO 2016-11-23 20:25:18,909 DataCleaner.py:122 - Data cleanup finished
INFO 2016-11-23 20:25:18,930 PingPortListener.py:50 - Ping port listener started on port: 8670
INFO 2016-11-23 20:25:18,931 main.py:289 - Connecting to Ambari server at https://IPADDRESS.us-west-2.compute.internal:8440 (172.31.37.172)
INFO 2016-11-23 20:25:18,931 NetUtil.py:59 - Connecting to https://IPADDRESS.us-west-2.compute.internal:8440/ca
ERROR 2016-11-23 20:25:18,983 NetUtil.py:77 - [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
ERROR 2016-11-23 20:25:18,983 NetUtil.py:78 - SSLError: Failed to connect. Please check openssl library versions.
Refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1022468 for more details.
WARNING 2016-11-23 20:25:18,983 NetUtil.py:105 - Server at https://IPADDRESS.us-west-2.compute.internal:8440 is not reachable, sleeping for 10 seconds...
', None)
('Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 312, in <module>
main(heartbeat_stop_callback)
File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 248, in main
stop_agent()
File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 198, in stop_agent
sys.exit(1)
SystemExit: 1
INFO 2016-11-23 20:25:18,716 ExitHelper.py:53 - Performing cleanup before exiting...
INFO 2016-11-23 20:25:18,907 main.py:74 - loglevel=logging.INFO
INFO 2016-11-23 20:25:18,907 DataCleaner.py:39 - Data cleanup thread started
INFO 2016-11-23 20:25:18,908 DataCleaner.py:120 - Data cleanup started
INFO 2016-11-23 20:25:18,909 DataCleaner.py:122 - Data cleanup finished
INFO 2016-11-23 20:25:18,930 PingPortListener.py:50 - Ping port listener started on port: 8670
INFO 2016-11-23 20:25:18,931 main.py:289 - Connecting to Ambari server at https://IPADDRESS.us-west-2.compute.internal:8440 (172.31.37.172)
INFO 2016-11-23 20:25:18,931 NetUtil.py:59 - Connecting to https://IPADDRESS.us-west-2.compute.internal:8440/ca
ERROR 2016-11-23 20:25:18,983 NetUtil.py:77 - [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
ERROR 2016-11-23 20:25:18,983 NetUtil.py:78 - SSLError: Failed to connect. Please check openssl library versions.
Refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1022468 for more details.
WARNING 2016-11-23 20:25:18,983 NetUtil.py:105 - Server at https://IPADDRESS.us-west-2.compute.internal:8440 is not reachable, sleeping for 10 seconds...
', None)
Connection to IPADDRESS.us-west-2.compute.internal closed.
SSH command execution finished
host=IPADDRESS.us-west-2.compute.internal, exitcode=0
Command end time 2016-11-23 20:25:21
Registering with the server...
Registration with the server failed.
I think it is something basic, but I was not able to solve it.
The openssl version is 1.0.2g
Any advice?
Thank you
This seems to be a known issue related to the JDK being used on the Ambari server host.
The post here mentions that the Oracle JDK should be used to get past this problem.
In case it is not the JDK issue mentioned there, the problem is likely a mismatch between the Python versions used to launch ambari-agent and ambari-server. Make sure both use the same version (e.g. Python 2.7) and restart them.
P.S. After struggling for hours when I ran into this issue, it turned out that ambari-server was running Python 2.6 while the agent was running Python 2.7.
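Separately from the JDK/Python theory, one way to confirm whether it is purely the certificate check that fails is to probe the same /ca endpoint the agent polls (see NetUtil.py in the log), once with and once without verification. A sketch, assuming Python 2.7.9 or later on the agent host (older Pythons do not verify certificates by default, so the comparison is only meaningful there):
import ssl
import urllib2

# The registration endpoint the agent polls; replace the hostname with your Ambari server.
URL = 'https://IPADDRESS.us-west-2.compute.internal:8440/ca'

def probe(context=None):
    try:
        urllib2.urlopen(URL, timeout=10, context=context)
        return 'ok'
    except Exception as err:   # ssl.SSLError, urllib2.URLError, socket errors
        return 'failed: %s' % err

print('verified:   %s' % probe())
print('unverified: %s' % probe(ssl._create_unverified_context()))
# If only the verified probe fails, the agent is tripping over certificate
# verification (enabled by default since Python 2.7.9) rather than the JDK.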