I am using Celery 3.1.17 with RabbitMQ as the broker and the result backend disabled.
Celery settings:
CELERY_CREATE_MISSING_QUEUES = True
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = True
CELERY_ACKS_LATE = True
CELERY_TASK_RESULT_EXPIRES = 1
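For reference, the same settings as a commented celeryconfig sketch (the module name is an assumption; these are the Celery 3.1-style setting names):

```python
# celeryconfig.py (module name assumed) -- the settings listed above, annotated
CELERY_CREATE_MISSING_QUEUES = True   # auto-declare queues that are referenced but not defined in CELERY_QUEUES
CELERY_IGNORE_RESULT = True           # do not store task results (no result backend in use)
CELERY_DISABLE_RATE_LIMITS = True     # skip rate-limit bookkeeping entirely
CELERY_ACKS_LATE = True               # ack after the task finishes; unacked messages are redelivered if a worker dies
CELERY_TASK_RESULT_EXPIRES = 1        # expire any stored results after 1 second (moot with IGNORE_RESULT)
```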
The broker is hosted on a separate server.
RabbitMQ server configuration: 2 GB RAM, dual core
Codebase server: 8 GB RAM, quad core
Celery worker settings:
Workers: 4
Concurrency: 12
Memory consumption in the codebase server: 2 GB (max)
CPU load in codebase server: 4.5, 2.1, 1.5
Tasks fired: 50,000
Publishing rate: 1,000/s
Problem:
After every 16,000 tasks, the worker halts for about 1-2 minutes and then restarts.
Also, about 200-300 tasks failed.
CPU and memory consumption are NOT bottlenecks in this case.
Is it a ulimits thing?
How do I ensure a constant task execution rate and prevent messages from getting lost?
Problem Statement:
Hive LLAP daemons are not consuming the cluster's vcpu allocation: 80-100 cores are available for the LLAP daemons, but only 16 are being used.
Summary:
I am testing Hive LLAP on Azure using 2 D14_v2 head nodes, 16 D14_v2 worker nodes, and 3 A-series ZooKeeper nodes (D14_v2 = 112 GB RAM / 12 vcpu).
15 of the 16 worker nodes in the cluster are dedicated to LLAP.
The distribution is HDP 2.6.3.2-14.
Currently the cluster has a total of 1.56 TB of RAM available and 128 vcpus. The LLAP daemons are allocated the proper amount of memory, but they only use 16 vcpus in total (1 vcpu per daemon + 1 vcpu for Slider).
Configuration:
My relevant hive configs are as follows:
hive.llap.daemon.num.executors = 10 (10 of the 12 available vcpus per node)
YARN max vcores per container = 8
Other:
I have been load testing the cluster but am unable to get any more vcpus engaged in the process. Any thoughts or insights would be greatly appreciated.
The Resource Manager UI will only show the query coordinators' and Slider's core and memory allocation; each query coordinator in LLAP occupies 1 core and the minimum allotted Tez AM memory (tez.am.resource.memory.mb). To check real-time core usage by the LLAP service on HDP 2.6.3, follow these steps:
Ambari -> Hive -> Quick Links -> Grafana -> Hive LLAP Overview -> Total Execution Slots
With roughly 0.9 million messages in RabbitMQ, the Celery workers stopped processing tasks. On killing Celery and running it again, processing resumed. RabbitMQ never ran out of memory. Nothing suspicious appeared in any logs or statuses except:
** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}
from /var/log/rabbitmq/rabbit.log. Similar symptoms were present before with around 1.6 million messages enqueued.
More info:
Celery concurrency: 4
RAM installed: 4 GB
Swap space: 8 GB
disk_free_limit (RabbitMQ): 8 GB
vm_memory_high_watermark: 2
vm_memory_high_watermark_paging_ratio: 0.75
How can the actual cause of the workers stopping be diagnosed, and how can it be prevented from recurring?
Thanks.
You are probably submitting/consuming messages from the queue too fast.
If you don't need the messages to be durable and can store them in memory only, it will significantly improve RabbitMQ's performance.
http://docs.celeryproject.org/en/latest/userguide/optimizing.html#using-transient-queues
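If that applies, the transient-queue example from that guide looks roughly like this (queue and exchange names are just placeholders):

```python
# Sketch of a transient (in-memory, non-durable) queue, per the linked Celery optimizing guide.
from kombu import Exchange, Queue

CELERY_QUEUES = (
    Queue('celery', routing_key='celery'),                      # regular durable queue
    Queue('transient', Exchange('transient', delivery_mode=1),  # delivery_mode=1: messages are not persisted to disk
          routing_key='transient', durable=False),              # queue itself does not survive a broker restart
)
```

Tasks can then be sent to it with e.g. my_task.apply_async(args, queue='transient'); messages in such a queue are lost on a broker restart, so this only fits work you can afford to re-send.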
I have the following setup: an ELK stack on a CentOS server, and RabbitMQ with mirrored queues. Publishers use the NLog-HAF-RabbitMQ appender to forward logs to the RabbitMQ nodes behind a load balancer.
One publisher is a web application hosted on IIS. Sometimes it stops logging to the ELK stack after the app pool recycle, which happens early in the morning.
Here are my findings:
I enabled the NLog internal log to identify any connection failures.
While recycling, we get warnings from WAS in Event Viewer:
A process serving application pool exceeded time limits during shut down. The process id.
A worker process serving application pool failed to stop a listener channel for protocol 'http' in the allotted time. The data field contains the error number.
IIS shutdown time limit: 3 seconds (default)
IIS startup time limit: 3 seconds (default)
Based on the information above, what could be the reason?
I am testing my Celery implementation across 3 EC2 machines right now. I am pretty confident in my implementation now, but I am having problems with the actual worker execution. My test structure is as follows:
1 EC2 machine is designated as the broker; it also runs a Celery worker.
1 EC2 machine is designated as the client; it runs the client Celery script that enqueues all the tasks using .delay() (sketched below) and also runs a Celery worker.
1 EC2 machine is purely a worker.
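For context, a minimal sketch of the client script described above; the app name, broker address, task body, and task count are assumptions, not the actual code:

```python
# client.py -- enqueue tasks from the client machine (all names/values assumed)
from celery import Celery

app = Celery('proj', broker='amqp://guest@<broker-ec2-ip>//')

@app.task
def work(n):
    return n * n

if __name__ == '__main__':
    for i in range(10000):
        work.delay(i)   # publish a task message; any of the three workers may consume it
```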
All the machines have 1 celery worker running. Before, I was immediately getting the message:
"Substantial drift from celery#[other ec2 ip] may mean clocks are out of sync."
A drift amount in seconds would then be printed, which would increase over time.
I would also get messages: "missed heartbeat from celery#[other ec2 ip]".
The machine would be doing very little work at this point, so my EC2 Auto Scaling config would shut the instance down automatically once CPU utilization got very low (<5%).
So, to try to solve this problem, I attempted to sync all my machines' clocks (although I thought Celery handled this) with the following commands, which were run on startup on all machines:
apt-get -qy install ntp
service ntp start
With this, they all performed well for about 10 minutes with no hitches, after which I started getting missed heartbeats and my EC2 instances stalled and shut down. The weird thing is that the drift sometimes increased and then decreased.
Any idea on why this is happening?
I am using the newest version of Celery (3.1) and RabbitMQ.
EDIT: It should be noted that I am utilizing us-west-1a and us-west-1c availability zones on ec2.
EDIT2: I am starting to think memory problems might be an issue. I am using a t2.micro instance, and even running 3 Celery workers on the same machine (only 1 instance), which is also the broker, still causes heartbeat misses and stalls.
I have a Celery + RabbitMQ setup with a busy Django site; in the Celery settings I have this config:
CELERY_RESULT_BACKEND = "amqp"
CELERY_AMQP_TASK_RESULT_EXPIRES = 5
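For context, the amqp result backend publishes each task's result to its own single-use queue named after the task id, which is what shows up as the GUID queues mentioned below; a minimal sketch, assuming some task results are never consumed and can be skipped (app and task names are made up):

```python
from celery import Celery

# Broker URL is a placeholder; backend='amqp' matches CELERY_RESULT_BACKEND = "amqp" above.
app = Celery('proj', broker='amqp://guest@localhost//', backend='amqp')

@app.task(ignore_result=True)   # the worker publishes no result, so no per-task result queue is created
def digest(payload):
    ...                         # process the submitted information
```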
I am monitoring the queues with the "watch" command. What I have observed is that whilst most of the temporary queues get deleted after a few seconds, some queues (same GUID) do not get deleted, and the list grows slowly, regardless of how many workers are used.
The Django site generates about 60 tasks per second; it accepts various information and uses the tasks to digest it. The whole setup runs on a 16-core server with plenty of RAM. Would this still be caused by a performance issue, or is it a Celery bug?
Cheers
James