Celery RabbitMQ broker failover connection issue

I have 3 RabbitMQ nodes in a cluster in HA mode. Each node runs in a separate Docker container.
I am using Celery version 4 and kombu version 4.
I used this command to set the HA policy:
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
The Celery config looks like this:
CELERY = dict(
    broker_url=[
        'amqp://guest@rabbitmq1:5672',
        'amqp://guest@rabbitmq2:5672',
        'amqp://guest@rabbitmq3:5672',
    ],
    celery_queue_ha_policy='all',
    ...
)
Everything works fine until I stop the master RabbitMQ application to test Celery's failover feature, using the command:
rabbitmqctl stop_app
Immediately after the RabbitMQ application is stopped, I start seeing the errors shown in the log below. The messages appear at a very high frequency and do not slow down as the number of attempts grows.
According to the logs, Celery tries to reconnect using the next failover broker, but the attempt is interrupted by another try to reconnect to the master node that was stopped. The same thing happens over and over, like an infinite loop.
[2017-03-17 15:10:28,084: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbitmq1:5672//: [Errno 111] Connection refused.
Will retry using next failover.
[2017-03-17 15:10:28,300: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL. See http://www.rabbitmq.com/', 'product': 'RabbitMQ', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'capabilities': {'exchange_exchange_bindings': True, 'connection.blocked': True, 'authentication_failure_close': True, 'direct_reply_to': True, 'basic.nack': True, 'per_consumer_qos': True, 'consumer_priorities': True, 'consumer_cancel_notify': True, 'publisher_confirms': True}, 'cluster_name': 'rabbit#rabbitmq1', 'platform': 'Erlang/OTP', 'version': '3.6.6'}, mechanisms: [u'PLAIN', u'AMQPLAIN'], locales: [u'en_US']
[2017-03-17 15:10:28,302: DEBUG/MainProcess] ^-- substep ok
[2017-03-17 15:10:28,303: DEBUG/MainProcess] | Consumer: Starting Mingle
[2017-03-17 15:10:28,303: INFO/MainProcess] mingle: searching for neighbors
[2017-03-17 15:10:28,303: DEBUG/MainProcess] using channel_id: 1
[2017-03-17 15:10:28,318: DEBUG/MainProcess] Channel open
[2017-03-17 15:10:28,470: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/mingle.py", line 38, in start
self.sync(c)
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/mingle.py", line 42, in sync
replies = self.send_hello(c)
File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/mingle.py", line 55, in send_hello
replies = inspect.hello(c.hostname, our_revoked._data) or {}
File "/usr/local/lib/python2.7/site-packages/celery/app/control.py", line 129, in hello
return self._request('hello', from_node=from_node, revoked=revoked)
File "/usr/local/lib/python2.7/site-packages/celery/app/control.py", line 81, in _request
timeout=self.timeout, reply=True,
File "/usr/local/lib/python2.7/site-packages/celery/app/control.py", line 436, in broadcast
limit, callback, channel=channel,
File "/usr/local/lib/python2.7/site-packages/kombu/pidbox.py", line 315, in _broadcast
serializer=serializer)
File "/usr/local/lib/python2.7/site-packages/kombu/pidbox.py", line 290, in _publish
serializer=serializer,
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 187, in _publish
channel = self.channel
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 209, in _get_channel
channel = self._channel = channel()
File "/usr/local/lib/python2.7/site-packages/kombu/utils/functional.py", line 38, in __call__
value = self.__value__ = self.__contract__()
File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 224, in <lambda>
channel = ChannelPromise(lambda: connection.default_channel)
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 819, in default_channel
self.connection
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 802, in connection
self._connection = self._establish_connection()
File "/usr/local/lib/python2.7/site-packages/kombu/connection.py", line 757, in _establish_connection
conn = self.transport.establish_connection()
File "/usr/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
conn.connect()
File "/usr/local/lib/python2.7/site-packages/amqp/connection.py", line 294, in connect
self.transport.connect()
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 120, in connect
self._connect(self.host, self.port, self.connect_timeout)
File "/usr/local/lib/python2.7/site-packages/amqp/transport.py", line 161, in _connect
self.sock.connect(sa)
File "/usr/local/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
[2017-03-17 15:10:28,508: DEBUG/MainProcess] Closed channel #1
[2017-03-17 15:10:28,570: DEBUG/MainProcess] | Consumer: Restarting event loop...
[2017-03-17 15:10:28,572: DEBUG/MainProcess] | Consumer: Restarting Gossip...
[2017-03-17 15:10:28,575: DEBUG/MainProcess] | Consumer: Restarting Heart...
[2017-03-17 15:10:28,648: DEBUG/MainProcess] | Consumer: Restarting Control...
[2017-03-17 15:10:28,655: DEBUG/MainProcess] | Consumer: Restarting Tasks...
[2017-03-17 15:10:28,655: DEBUG/MainProcess] Canceling task consumer...
[2017-03-17 15:10:28,655: DEBUG/MainProcess] | Consumer: Restarting Mingle...
[2017-03-17 15:10:28,655: DEBUG/MainProcess] | Consumer: Restarting Events...
[2017-03-17 15:10:28,672: DEBUG/MainProcess] | Consumer: Restarting Connection...
[2017-03-17 15:10:28,673: DEBUG/MainProcess] | Consumer: Starting Connection
[2017-03-17 15:10:28,947: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbitmq1:5672//: [Errno 111] Connection refused.
Will retry using next failover.
[2017-03-17 15:10:29,345: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL. See http://www.rabbitmq.com/', 'product': 'RabbitMQ', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'capabilities': {'exchange_exchange_bindings': True, 'connection.blocked': True, 'authentication_failure_close': True, 'direct_reply_to': True, 'basic.nack': True, 'per_consumer_qos': True, 'consumer_priorities': True, 'consumer_cancel_notify': True, 'publisher_confirms': True}, 'cluster_name': 'rabbit#rabbitmq1', 'platform': 'Erlang/OTP', 'version': '3.6.6'}, mechanisms: [u'PLAIN', u'AMQPLAIN'], locales: [u'en_US']
[2017-03-17 15:10:29,506: INFO/MainProcess] Connected to amqp://guest:**@rabbitmq2:5672//
[2017-03-17 15:10:29,535: DEBUG/MainProcess] ^-- substep ok
[2017-03-17 15:10:29,569: DEBUG/MainProcess] | Consumer: Starting Events
[2017-03-17 15:10:29,682: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbitmq1:5672//: [Errno 111] Connection refused.
Will retry using next failover.
[2017-03-17 15:10:29,740: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'information': 'Licensed under the MPL. See http://www.rabbitmq.com/', 'product': 'RabbitMQ', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'capabilities': {'exchange_exchange_bindings': True, 'connection.blocked': True, 'authentication_failure_close': True, 'direct_reply_to': True, 'basic.nack': True, 'per_consumer_qos': True, 'consumer_priorities': True, 'consumer_cancel_notify': True, 'publisher_confirms': True}, 'cluster_name': 'rabbit#rabbitmq1', 'platform': 'Erlang/OTP', 'version': '3.6.6'}, mechanisms: [u'PLAIN', u'AMQPLAIN'], locales: [u'en_US']
[2017-03-17 15:10:29,768: DEBUG/MainProcess] ^-- substep ok
[2017-03-17 15:10:29,770: DEBUG/MainProcess] | Consumer: Starting Mingle
[2017-03-17 15:10:29,770: INFO/MainProcess] mingle: searching for neighbors
[2017-03-17 15:10:29,771: DEBUG/MainProcess] using channel_id: 1
[2017-03-17 15:10:29,795: DEBUG/MainProcess] Channel open
[2017-03-17 15:10:29,874: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
[traceback identical to the one above, ending in: error: [Errno 111] Connection refused]
[2017-03-17 15:10:29,887: DEBUG/MainProcess] Closed channel #1
[2017-03-17 15:10:29,907: DEBUG/MainProcess] | Consumer: Restarting event loop...
[2017-03-17 15:10:29,908: DEBUG/MainProcess] | Consumer: Restarting Gossip...
[2017-03-17 15:10:29,908: DEBUG/MainProcess] | Consumer: Restarting Heart...
[2017-03-17 15:10:29,908: DEBUG/MainProcess] | Consumer: Restarting Control...
[2017-03-17 15:10:29,909: DEBUG/MainProcess] | Consumer: Restarting Tasks...
[2017-03-17 15:10:29,910: DEBUG/MainProcess] Canceling task consumer...
[2017-03-17 15:10:29,911: DEBUG/MainProcess] | Consumer: Restarting Mingle...
[2017-03-17 15:10:29,912: DEBUG/MainProcess] | Consumer: Restarting Events...
[2017-03-17 15:10:29,953: DEBUG/MainProcess] | Consumer: Restarting Connection...
[2017-03-17 15:10:29,954: DEBUG/MainProcess] | Consumer: Starting Connection
[2017-03-17 15:10:30,036: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@rabbitmq1:5672//: [Errno 111] Connection refused.
Will retry using next failover.
Unfortunately, the Celery documentation doesn't say much about failover.

It's definitely a bug; I have created an issue on GitHub: https://github.com/celery/celery/issues/3921
Thanks to George Psarakis, I managed to work around the bug by running Celery workers with the --without-mingle flag, e.g.:
celery worker -A app.tasks -l debug --without-mingle
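For completeness, here is a minimal sketch of the failover-related broker settings, assuming the same three hosts and guest credentials as above. broker_failover_strategy and broker_connection_max_retries are standard Celery 4 settings; the --without-mingle workaround itself stays on the worker command line as shown above.
from celery import Celery

app = Celery('app.tasks')
app.conf.update(
    broker_url=[
        'amqp://guest@rabbitmq1:5672',
        'amqp://guest@rabbitmq2:5672',
        'amqp://guest@rabbitmq3:5672',
    ],
    broker_failover_strategy='round-robin',  # how kombu walks the failover list
    broker_connection_max_retries=None,      # keep retrying instead of giving up
)
This does not remove the need for --without-mingle on the affected Celery 4 versions; it just makes the retry behaviour explicit.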

Related

Catch Scrapy exception when crawling from Airflow

I'm trying to catch the exception that occurs in my spider so that I can mark the task instance as failed. Currently the task finishes and is marked as succeeded. I'm calling crawl() from a PythonOperator in Airflow, as follows:
with DAG(
        'MySpider',
        default_args=default_args,
        schedule_interval=None) as dag:
    t1 = python_task = PythonOperator(
        task_id="crawler_task",
        python_callable=run_crawler,
        op_kwargs=dag_kwargs
    )
Here is my run_crawler() method:
def run_crawler(**kwargs):
    project_settings = set_project_settings({
        'FEEDS': {
            f'{kwargs["bucket"]}%(time)s.{kwargs["format"]}': {
                'format': kwargs["format"],
                'encoding': 'utf8',
                'store_empty': kwargs["store_empty"]
            }
        }
    })
    print("Project settings: ")
    pprint(project_settings.attributes.items())
    set_connection("airflow", kwargs["gcs_connection_id"])
    process = CrawlerProcess(project_settings)
    process.crawl(spider.MySpider)
    print("Starting crawler...")
    process.start()
When running, I'm having problems with GCS credentials, which leads to an exception, as follows:
google.auth.exceptions.DefaultCredentialsError: The file /tmp/file_my_credentials.json does not have a valid type. Type is None, expected one of ('authorized_user', 'service_account', 'external_account', 'external_account_authorized_user', 'impersonated_service_account', 'gdch_service_account').
{logging_mixin.py:115} WARNING - [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 21087,
'downloader/request_count': 68,
'downloader/request_method_count/GET': 68,
'downloader/response_bytes': 1863876,
'downloader/response_count': 68,
'downloader/response_status_count/200': 68,
'elapsed_time_seconds': 25.647386,
'feedexport/failed_count/GCSFeedStorage': 1,
'httpcompression/response_bytes': 9212776,
'httpcompression/response_count': 68,
'item_scraped_count': 66,
'log_count/DEBUG': 136,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 3,
'memusage/max': 264441856,
'memusage/startup': 264441856,
'request_depth_max': 1,
'response_received_count': 68,
'scheduler/dequeued': 68,
'scheduler/dequeued/memory': 68,
'scheduler/enqueued': 68,
'scheduler/enqueued/memory': 68,
[2032-13-13, 09:04:28 UTC] {engine.py:389} INFO - Spider closed (finished)
[2032-13-13, 09:04:28 UTC] {logging_mixin.py:115} WARNING -
[scrapy.core.engine] INFO: Spider closed (finished)
[2032-13-13, 09:04:28 UTC] {python.py:173} INFO - Done. Returned value was: None
[2032-13-13, 09:04:28 UTC] {taskinstance.py:1408} INFO - Marking task as SUCCESS. dag_id=MySpider, task_id=crawler_task, execution_date=2032-13-13, start_date=2032-13-13, end_date=2032-13-13
[2032-13-13, 09:04:28 UTC] {local_task_job.py:156} INFO - Task exited with return code 0
[2032-13-13, 09:04:28 UTC] {local_task_job.py:279} INFO - 0 downstream tasks scheduled from follow-on schedule check
As you can see, even with this exception, the task itself is marked as "SUCCESS". Is it possible to catch it in order to mark the task as FAILED, so that we can follow it in the Airflow (Composer) interface?
Thank you
I don't understand why, in this case, the exception doesn't break the task.
You can add a try/except in the run_crawler method and then raise your own exception in the except block:
import logging

def run_crawler(**kwargs):
    class CustomException(Exception):
        pass

    try:
        project_settings = set_project_settings({
            'FEEDS': {
                f'{kwargs["bucket"]}%(time)s.{kwargs["format"]}': {
                    'format': kwargs["format"],
                    'encoding': 'utf8',
                    'store_empty': kwargs["store_empty"]
                }
            }
        })
        print("Project settings: ")
        pprint(project_settings.attributes.items())
        set_connection("airflow", kwargs["gcs_connection_id"])
        process = CrawlerProcess(project_settings)
        process.crawl(spider.MySpider)
        print("Starting crawler...")
        process.start()
    except Exception as err:
        # Use a %s placeholder so the original error actually appears in the log.
        logging.error("Error in the Airflow task: %s", err)
        raise CustomException("Error in the Airflow task !!!!!", err)
In this case, when your custom exception is raised, it will break the Airflow task and mark it as failed.
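If the feed-export error is swallowed inside Scrapy's reactor and never reaches the except block, a variant of the same idea is to inspect the crawl stats after process.start() returns and raise there. This is only a sketch: it reuses the helper names from the question (set_project_settings, set_connection, spider.MySpider) and the stats keys visible in the log above.
import logging
from scrapy.crawler import CrawlerProcess

def run_crawler(**kwargs):
    project_settings = set_project_settings({
        'FEEDS': {
            f'{kwargs["bucket"]}%(time)s.{kwargs["format"]}': {
                'format': kwargs["format"],
                'encoding': 'utf8',
                'store_empty': kwargs["store_empty"]
            }
        }
    })
    set_connection("airflow", kwargs["gcs_connection_id"])

    process = CrawlerProcess(project_settings)
    crawler = process.create_crawler(spider.MySpider)
    process.crawl(crawler)
    process.start()  # blocks until the crawl finishes

    # The failed GCS feed export shows up in the crawl stats even though no
    # exception propagates out of process.start(), so fail the task explicitly.
    stats = crawler.stats.get_stats()
    if stats.get('feedexport/failed_count/GCSFeedStorage') or stats.get('log_count/ERROR'):
        logging.error("Crawl finished with errors: %s", stats)
        raise RuntimeError(f"Crawl finished with errors: {stats}")
Either way, the key point is that run_crawler() must raise before it returns, because Airflow only marks the task as failed when the python_callable raises.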

Celery erroring out on runtime due to rabbitmq password rotation using Hashicorp Vault

I'm trying to use RabbitMQ as a broker for Celery, with HashiCorp Vault handling root credential rotation. How do I use Vault's RabbitMQ secrets engine with Celery?
def get_broker_url():
    url = "http://localhost:8200/v1/rabbitmq/creds/dev-role"
    payload = {}
    headers = {
        'X-Vault-Token': 'VAULT_TOKEN'
    }
    response = requests.get(url, headers=headers, data=payload).json()
    return "amqp://{username}:{password}@127.0.0.1:5672/".format(
        username=response["data"]["username"],
        password=response["data"]["password"],
    )
And while initialising Celery:
app = Celery(app_name, broker=get_broker_url(), backend="rpc://")
I'm getting this error when the lease expires:
[2022-11-24 13:05:54,489: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 332, in start
blueprint.start(self)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 628, in start
c.loop(*c.loop_args())
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/loops.py", line 97, in asynloop
next(loop)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/kombu/asynchronous/hub.py", line 362, in create_loop
cb(*cbargs)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/kombu/transport/base.py", line 235, in on_readable
reader(loop)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/kombu/transport/base.py", line 217, in _read
drain_events(timeout=0)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/connection.py", line 525, in drain_events
while not self.blocking_read(timeout):
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/connection.py", line 531, in blocking_read
return self.on_inbound_frame(frame)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/method_framing.py", line 53, in on_frame
callback(channel, method_sig, buf, None)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/connection.py", line 538, in on_inbound_method
method_sig, payload, content,
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/abstract_channel.py", line 156, in dispatch_method
listener(*args)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/connection.py", line 668, in _on_close
(class_id, method_id), ConnectionError)
amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - user 'root-3a978f79-e2fb-2889-08d4-0c69d2bcbffd' is deleted
[2022-11-24 13:05:54,498: WARNING/MainProcess] /Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py:367: CPendingDeprecationWarning:
In Celery 5.1 we introduced an optional breaking change which
on connection loss cancels all currently executed tasks with late acknowledgement enabled.
These tasks cannot be acknowledged as the connection is gone, and the tasks are automatically redelivered back to the queue.
You can enable this behavior using the worker_cancel_long_running_tasks_on_connection_loss setting.
In Celery 5.1 it is set to False by default. The setting will be set to True by default in Celery 6.0.
warnings.warn(CANCEL_TASKS_BY_DEFAULT, CPendingDeprecationWarning)
[2022-11-24 13:05:54,505: CRITICAL/MainProcess] Unrecoverable error: AccessRefused(403, 'ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.', (0, 0), '')
Traceback (most recent call last):
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/bootsteps.py", line 365, in start
return self.obj.start()
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 332, in start
blueprint.start(self)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/consumer/connection.py", line 21, in start
c.connection = c.connect()
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 428, in connect
conn = self.connection_for_read(heartbeat=self.amqheartbeat)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 435, in connection_for_read
self.app.connection_for_read(heartbeat=heartbeat))
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 462, in ensure_connected
callback=maybe_shutdown,
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/kombu/connection.py", line 381, in ensure_connection
self._ensure_connection(*args, **kwargs)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/kombu/connection.py", line 437, in _ensure_connection
callback, timeout=timeout
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/kombu/utils/functional.py", line 312, in retry_over_time
return fun(*args, **kwargs)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/kombu/connection.py", line 877, in _connection_factory
self._connection = self._establish_connection()
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/kombu/connection.py", line 812, in _establish_connection
conn = self.transport.establish_connection()
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/kombu/transport/pyamqp.py", line 201, in establish_connection
conn.connect()
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/connection.py", line 329, in connect
self.drain_events(timeout=self.connect_timeout)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/connection.py", line 525, in drain_events
while not self.blocking_read(timeout):
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/connection.py", line 531, in blocking_read
return self.on_inbound_frame(frame)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/method_framing.py", line 53, in on_frame
callback(channel, method_sig, buf, None)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/connection.py", line 538, in on_inbound_method
method_sig, payload, content,
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/abstract_channel.py", line 156, in dispatch_method
listener(*args)
File "/Users/vg/celery-mq-demo/venv/lib/python3.7/site-packages/amqp/connection.py", line 668, in _on_close
(class_id, method_id), ConnectionError)
amqp.exceptions.AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN. For details see the broker logfile.
Since the broker root user's lease expires based on the TTL, how do I provide new broker credentials for Celery to use?
Celery version: 5.2.7
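Not a definitive answer, but one workaround sketch: since Celery reads broker_url only once at startup, you can run the worker under a small supervisor that fetches a fresh lease from Vault and restarts the worker shortly before the lease expires. The Vault address, token, role and app module name below are the ones assumed in the question; lease_duration is returned with every Vault lease.
import subprocess
import requests

VAULT_ADDR = "http://localhost:8200"
VAULT_TOKEN = "VAULT_TOKEN"

def get_broker_lease():
    # Ask Vault's RabbitMQ secrets engine for a fresh username/password lease.
    resp = requests.get(
        f"{VAULT_ADDR}/v1/rabbitmq/creds/dev-role",
        headers={"X-Vault-Token": VAULT_TOKEN},
    ).json()
    url = "amqp://{username}:{password}@127.0.0.1:5672/".format(**resp["data"])
    return url, resp["lease_duration"]

while True:
    broker_url, ttl = get_broker_lease()
    # Pass the fresh URL via the top-level -b/--broker option (note: the
    # password is then visible in the process list; adjust if that matters).
    proc = subprocess.Popen(
        ["celery", "-A", "app_name", "-b", broker_url, "worker", "-l", "info"]
    )
    try:
        # Run until shortly before the lease expires, then restart the worker
        # with new credentials; unacked tasks are redelivered by RabbitMQ.
        proc.wait(timeout=max(ttl - 60, 60))
        break  # the worker exited on its own
    except subprocess.TimeoutExpired:
        proc.terminate()
        proc.wait()
Alternatively, configuring the Vault role with a longer TTL simply reduces how often the restart has to happen.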

Airflow 1.10.6 Kubernetes executor on EKS using an S3 connection: task passes test but DAG run fails

I tried:
- configuring the S3 connection from the value.yaml:
connections:
  - id: aws_default
    type: aws
    login: xxxaws_access_key_idxxx
    password: xxxxxxxxxxxxxaws_secret_access_keyxxxxxxxxxxxxxxxxxx
  - id: my_s3
    type: s3
    login: xxxaws_access_key_idxxx
    password: xxxxxxxxxxxxxaws_secret_access_keyxxxxxxxxxxxxxxxxxx
- writing a DAG that uses the S3Hook and writes a string to S3.
- running a test from the scheduler pod:
/entrypoint airflow test dag_id task_id date_before_the_start_data_of_DAG
The file is created and its content is OK.
- activating the DAG in the Airflow UI and running it: the run is queued and then fails.
Any suggestions?
BTW, the DAG args:
args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(0),
    'trigger_rule': 'dummy',
    # 'pool': 'my_workers_pool',
    'catchup': False,  # don't fill back
}
In addition, I added a 5-minute sleep to the task that causes the DAG to fail, and watched pod creation with kubectl, but the task pod started for a few seconds and then disappeared. Any ideas how to debug this issue?
Logs from the task pod:
kubectl logs postgressexamplesababaegozimpostgress-7322d44cb2684a09bef95ad3080b9505 -n airflow-research-p-8482f --tail=200

[2020-08-24 12:04:44,262] {{settings.py:252}} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=1
/usr/local/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")
[2020-08-24 12:04:44,833] {{__init__.py:51}} INFO - Using executor LocalExecutor
[2020-08-24 12:04:44,834] {{dagbag.py:92}} INFO - Filling up the DagBag from /usr/local/airflow/dags/postgress_example.py
[2020-08-24 12:04:44,841] {{dagbag.py:207}} ERROR - Failed to import: /usr/local/airflow/dags/postgress_example.py
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/dagbag.py", line 204, in process_file
m = imp.load_source(mod_name, filepath)
File "/usr/local/lib/python3.7/imp.py", line 171, in load_source
module = _load(spec)
File "<frozen importlib._bootstrap>", line 696, in _load
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/local/airflow/dags/postgress_example.py", line 33, in <module>
import boto3
ModuleNotFoundError: No module named 'boto3'
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 37, in <module>
args.func(args)
File "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 74, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 529, in run
dag = get_dag(args)
File "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 148, in get_dag
'parse.'.format(args.dag_id))
airflow.exceptions.AirflowException: dag_id could not be found: postgress_example. Either the dag did not exist or it failed to parse.

Problem with connecting the Storage Domain (Host host2 cannot access the Storage Domain(s) <UNKNOWN>)

Hello, everyone! I need specialist help, because I'm already desperate. My company has four hosts that are connected to the storage. Each host has its own IP to access the storage: host 1 has the IPs 10.42.0.10 and 10.42.1.10, and host 2 has the IPs 10.42.0.20 and 10.42.0.20 respectively. Host 1 cannot ping the address 10.42.0.20. I have tried to describe the hardware in more detail below.
Host 1 has ovirt node 4.3.9 installed and hosted-engine deployed.
When trying to add host 2 to the cluster, it is installed but not activated. There is an error in the oVirt manager - "Host **host2** cannot access the Storage Domain(s) <UNKNOWN>" - and host 2 goes to "Not operational" status. On host 2, the logs show "connect to 10.42.1.10:3260 failed (No route to host)", repeating indefinitely. I manually connected host 2 to the storage using iscsiadm to the IP 10.42.0.20, but the error does not go away. At the same time, while the host is trying to activate, I can run virtual machines on it until the host shows an error message. VMs that have been started on host 2 continue to run even when the host is in Non-operational status.
I assume that when adding host 2 to the cluster, oVirt tries to connect it to the same storage that host 1 is connected to, via IP 10.42.1.10. Is there a way to get oVirt to connect through another IP address instead of the address used for the first host? I'm attaching logs:
/var/log/ovirt-engine/engine.log
2020-03-31 09:13:03,866+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-90) [7fa128f4] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host2.school34.local cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center DataCenter. Setting Host state to Non-Operational.
2020-03-31 10:40:04,883+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-12) [7a48ebb7] START, ConnectStorageServerVDSCommand(HostName = host2.school34.local, StorageServerConnectionManagementVDSParameters:{hostId='d82c3a76-e417-4fe4-8b08-a29414e3a9c1', storagePoolId='6052cc0a-71b9-11ea-ba5a-00163e10c7e7', storageType='ISCSI', connectionList='[StorageServerConnections:{id='c8a05dc2-f8a2-4354-96ed-907762c29761', connection='10.42.0.10', iqn='iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='0ec6f34e-01c8-4ecc-9bd4-7e2a250d589d', connection='10.42.1.10', iqn='iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0', vfsType='null', mountOptions='null', nfsVersion='null', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]', sendNetworkEventOnFailure='true'}), log id: 2c1a22b5
2020-03-31 10:43:05,061+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-12) [7a48ebb7] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host2.school34.local command ConnectStorageServerVDS failed: Message timeout which can be caused by communication issues
vdsm.log
2020-03-31 09:34:07,264+0300 ERROR (jsonrpc/5) [storage.HSM] Could not connect to storageServer (hsm:2420)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2417, in connectStorageServer
conObj.connect()
File "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 488, in connect
iscsi.addIscsiNode(self._iface, self._target, self._cred)
File "/usr/lib/python2.7/site-packages/vdsm/storage/iscsi.py", line 217, in addIscsiNode
iscsiadm.node_login(iface.name, target.address, target.iqn)
File "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 337, in node_login
raise IscsiNodeError(rc, out, err)
IscsiNodeError: (8, ['Logging in to [iface: default, target: iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0, portal: 10.42.1.10,3260] (multiple)'], ['iscsiadm: Could not login to [iface: default, target: iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0, portal: 10.42.1.10,3260].', 'iscsiadm: initiator reported error (8 - connection timed out)', 'iscsiadm: Could not log into all portals'])
2020-03-31 09:36:01,583+0300 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'port': u'3260', u'connection': u'10.42.0.10', u'iqn': u'iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0', u'user': u'', u'tpgt': u'2', u'ipv6_enabled': u'false', u'password': '********', u'id': u'c8a05dc2-f8a2-4354-96ed-907762c29761'}, {u'port': u'3260', u'connection': u'10.42.1.10', u'iqn': u'iqn.2002-09.com.lenovo:01.array.00c0ff3bfcb0', u'user': u'', u'tpgt': u'1', u'ipv6_enabled': u'false', u'password': '********', u'id': u'0ec6f34e-01c8-4ecc-9bd4-7e2a250d589d'}], u'storagepoolID': u'6052cc0a-71b9-11ea-ba5a-00163e10c7e7', u'domainType': 3}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'64cc0385-3a11-474b-98f0-b0ecaa6c67c8'} at 0x7fe1ac1ff510> timeout=60, duration=60.00 at 0x7fe1ac1ffb10> task#=316 at 0x7fe1f0041ad0>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 260, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 262, in __call__
self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 305, in _serveRequest
response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 345, in _handle_request
res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 194, in _dynamicMethod
result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1102, in connectStorageServer
connectionParams)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
result = ctask.prepare(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
return m(self, *a, **kw)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare
result = self._run(func, *args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
return fn(*args, **kargs)
File: "<string>", line 2, in connectStorageServer
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2417, in connectStorageServer
conObj.connect()
File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 488, in connect
iscsi.addIscsiNode(self._iface, self._target, self._cred)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsi.py", line 217, in addIscsiNode
iscsiadm.node_login(iface.name, target.address, target.iqn)
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 327, in node_login
portal, "-l"])
File: "/usr/lib/python2.7/site-packages/vdsm/storage/iscsiadm.py", line 122, in _runCmd
return misc.execCmd(cmd, printable=printCmd, sudo=True, sync=sync)
File: "/usr/lib/python2.7/site-packages/vdsm/common/commands.py", line 213, in execCmd
(out, err) = p.communicate(data)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
ready = poller.poll(self._remaining_time(endtime)) (executor:363)
Thanks a lot!
I'm struggling trying to figure out your architecture. It seems like you've configured your Storage Domain pointing to 10.42.1.10:3260 as iSCSI portal.
If I've understood correctly, your cluster should be something like:
+-------------+
|HostedEngine |
+------+------+
|
| Management network
+-------+------+ 10.42.0.0/24
| |
+ +
10.42.0.10 10.42.0.20
+--------+ +--------+
| host1 | | host2 |
+--------+ +--------+
10.42.1.10 10.42.1.20
+ +
| |
+-------+------+
|
| Storage network
+---+---+ 10.42.1.0/24
|Storage|
+-------+
Provided my guess is correct, it seems you've configured your iSCSI target on host1 instead of on a proper, external storage device. Otherwise, you've messed up the IP addressing.

RQ Redis : Connection Refused after a successful connection

My Redis server is running under Ubuntu 16.04 and I have RQ Dashboard running to monitor the queue. The Redis server has a password, which I supply for the initial connection. Here's my code:
from rq import Queue, Connection, Worker
from redis import Redis
from dblogger import DbLogger

def _redisCon():
    redis_host = "192.168.1.169"
    redis_port = "6379"
    redis_password = "SecretPassword"
    return Redis(host=redis_host, port=redis_port, password=redis_password)

rcon = _redisCon()
if rcon is not None:
    with Connection(rcon):
        DbLogger.log("rqworker", 0, "Launching Worker", "launching an RQ Worker - default Queue")
        worker = Worker(list(map(Queue, 'default')))  # this works - I see the worker registered in RQ dashboard
        worker.work()  # this eventually fails with the Connection error:
"""
16:28:49 RQ worker 'rq:worker:steve-imac.95379' started, version 0.12.0
16:28:49 *** Listening on default...
16:28:49 Cleaning registries for queue: default
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 177, in _read_from_socket
raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
OSError: Connection closed by server.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/client.py", line 668, in execute_command
return self.parse_response(connection, command_name, **options)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/client.py", line 680, in parse_response
response = connection.read_response()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 624, in read_response
response = self._parser.read_response()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 284, in read_response
response = self._buffer.readline()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 216, in readline
self._read_from_socket()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 191, in _read_from_socket
(e.args,))
redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 489, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 61 connecting to 192.168.1.169:6379. Connection refused.
"""
I've tried removing the password and enabling the unix socket in redis.conf -- neither seemed to help. This seems to be some sort of timeout, since in other tests the worker actually loads a task and executes it before eventually dying with this error.
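Not a confirmed fix, but if something between the worker and the server (the timeout setting in redis.conf, a firewall, or NAT) is dropping idle connections, enabling TCP keepalives and redis-py's periodic health checks (redis-py 3.3+) is worth trying. A minimal sketch using the same host, port and password as above:
from redis import Redis
from rq import Connection, Queue, Worker

rcon = Redis(
    host="192.168.1.169",
    port=6379,
    password="SecretPassword",
    socket_keepalive=True,      # ask the OS to probe idle TCP connections
    health_check_interval=30,   # ping before reusing a connection idle > 30s
)

with Connection(rcon):
    worker = Worker([Queue("default")])
    worker.work()
It is also worth checking the timeout and tcp-keepalive settings in redis.conf on the Ubuntu server, since a non-zero timeout closes idle client connections and can produce exactly the "Connection closed by server." error shown above.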