Celery workers missing heartbeats and getting substantial drift on EC2 - RabbitMQ

I am testing my Celery implementation across 3 EC2 machines right now. I am fairly confident in the implementation itself, but I am running into problems with the actual worker execution. My test structure is as follows:
1 EC2 machine is designated as the broker; it also runs a Celery worker
1 EC2 machine is designated as the client (it runs the client Celery script that enqueues all the tasks using .delay()); it also runs a Celery worker
1 EC2 machine is purely a worker.
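For reference, a stripped-down sketch of this layout (the broker IP, module name, and task body below are placeholders, not my actual code):

# tasks.py - shared by all three machines; the broker machine's private IP goes in the URL
from celery import Celery
app = Celery('tasks', broker='amqp://guest:guest@10.0.0.1//')

@app.task
def process(item):
    return item * 2  # stand-in for the real work

# client.py - run on the client machine; .delay() just enqueues a message on RabbitMQ
from tasks import process
for i in range(1000):
    process.delay(i)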
Each machine has 1 Celery worker running. At first, I was immediately getting the message:
"Substantial drift from celery@[other ec2 ip] may mean clocks are out of sync."
A drift amount in seconds would then be printed, which would increase over time.
I would also get messages like "missed heartbeat from celery@[other ec2 ip]".
The machines would be doing very little work at this point, so my EC2 Auto Scaling configuration would shut the instances down automatically once CPU utilization dropped very low (<5%).
So to try to solve this problem, I attempted to sync all my machines' clocks (although I thought Celery handled this) with these commands, which were run at startup on every machine:
apt-get -qy install ntp
service ntp start
With this, everything ran without a hitch for about 10 minutes, after which I started getting missed heartbeats again and my EC2 instances stalled and shut down. The weird thing is that the drift sometimes increased and then decreased.
Any idea why this is happening?
I am using the newest version of Celery (3.1) and RabbitMQ.
EDIT: It should be noted that I am using the us-west-1a and us-west-1c availability zones on EC2.
EDIT2: I am starting to think memory might be the issue. I am using a t2.micro instance, and even running 3 Celery workers on a single machine (only 1 instance) that is also the broker still causes heartbeat misses and stalls.
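A memory-conscious worker configuration for a box this small might look roughly like the sketch below (Celery 3.1 setting names; the broker IP and the values are guesses, not a verified fix):

# celeryconfig.py - settings aimed at keeping a t2.micro worker small
BROKER_URL = 'amqp://guest:guest@10.0.0.1//'  # placeholder broker IP
CELERYD_CONCURRENCY = 1             # a single worker process per machine
CELERYD_MAX_TASKS_PER_CHILD = 50    # recycle the child process to limit memory growth
CELERYD_PREFETCH_MULTIPLIER = 1     # don't reserve a large batch of messages up front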

Related

How to debug aws fargate task running out of memory?

I'm running a task on Fargate with CPU set to 2048 and memory set to 8192. After running for some time, the task is stopped with the error:
container was stopped as it ran out of memory.
The thing is, the task does not fail every time. If I run the same task 10 times, it fails 5 times and works 5 times. However, if I take an EC2 machine with 2 vCPUs and 4 GB of memory and run the same container, it runs successfully (in fact, the memory usage on the EC2 instance is very low).
Can somebody please guide me on how to figure out the memory issue while running a Fargate task?
Thanks
The way to start would be enabling memory metrics from Container Insights for your Fargate tasks; correlating the memory usage graph with your application logs should help here.
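If it helps, a rough sketch of pulling that metric with boto3 (this assumes Container Insights is enabled and that the namespace, metric, and dimension names match your setup; the cluster and task family values are placeholders):

# sketch: fetch per-task-family memory usage from Container Insights via boto3
import boto3
from datetime import datetime, timedelta

cw = boto3.client('cloudwatch')
resp = cw.get_metric_statistics(
    Namespace='ECS/ContainerInsights',
    MetricName='MemoryUtilized',
    Dimensions=[
        {'Name': 'ClusterName', 'Value': 'my-cluster'},        # placeholder
        {'Name': 'TaskDefinitionFamily', 'Value': 'my-task'},  # placeholder
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=['Maximum'],
)
for point in sorted(resp['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Maximum'])  # MemoryUtilized is reported in MB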
The difference between running on EC2 vs. Fargate could be due to the fact that when you run a container on ECS Fargate, it runs on AWS's internal EC2 instances, so a noisy-neighbour situation could arise, although the chances of that are pretty low.

How to track Celery and RabbitMQ on a production server

I have installed both Celery and RabbitMQ. Now I would like to track how many messages are in the queue and how they are distributed, and see the list of Celery consumers and the tasks they are executing, etc. This is because I had issues with Celery getting stuck when there is memory pressure. I tried installing the RabbitMQ management plugin for a start, and when I tried to log in at myservr.com:15672 it said it can only be used through localhost. Is there any workaround? Also, is it a good idea to run such monitoring on production servers? Will there be any chance of memory leaks?
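(For what it's worth, the kind of visibility asked about above can be sketched roughly as follows, assuming a dedicated monitoring user has been created - the localhost restriction normally applies only to the default guest account. The host, credentials, and app setup below are placeholders, not a tested recipe:)

# sketch: queue depths via the RabbitMQ management HTTP API, running tasks via celery inspect
import requests
from celery import Celery

resp = requests.get('http://myservr.com:15672/api/queues',
                    auth=('monitor_user', 'monitor_pass'))  # placeholder monitoring user
for q in resp.json():
    print(q['name'], q['messages'], 'messages,', q['consumers'], 'consumers')

app = Celery(broker='amqp://monitor_user:monitor_pass@myservr.com//')
active = app.control.inspect().active() or {}
for worker, tasks in active.items():
    print(worker, 'is running', len(tasks), 'tasks')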

RabbitMQ creates a number of strange processes

I happened to find a number of strange processes created by RabbitMQ on my RabbitMQ server. I run the RabbitMQ server in a Docker container. I recreated the container, and hours later those processes appeared again. There are some consumers connecting to it. Any idea what those processes are for? Thanks!

Multiple broker machine rabbitmq configuration, how does HA work?

I'm trying to figure out how HA (high-availability queues) works.
The current configuration I have is: every machine has multiple Celery workers and points to itself as the broker. Each machine can do this, rather than pointing at one broker machine, because of HA; this way there is less load on any one machine, as all of them are brokers and have copies of the same queue.
My question is: is my above logic correct? Or do all workers need to point to one broker machine regardless of HA?
If you have looked at HA and clustering and have ensured that the queues mirror each other, then what you are doing should be fine, but running a broker on every server where you run your workers may be a tad inefficient.
The other option is to run your queues on a few servers for HA and have the workers on other servers point to them. But since the Celery worker config can only point to one broker URL, you would need to work around that, possibly by using a load balancer that all workers point to. This is the best of what I've come to understand over the past few years about RabbitMQ HA for Celery.
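As a rough illustration of that second setup (the hostname and credentials below are made up), every worker's config would simply reference whatever address fronts the mirrored cluster:

# sketch: all workers point at one address that fronts the mirrored RabbitMQ cluster
BROKER_URL = 'amqp://user:password@rabbit-lb.internal.example:5672//'  # placeholder load balancer / DNS name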

Is it recommended to run redis using Supervisor

Is it a good practice to run Redis in production with Supervisor?
I've googled around but haven't seen many examples of doing so. If not, what is the proper way to run Redis in production?
I personally just use Monit for Redis in production. If Redis crashes, Monit will restart it, but more importantly, Monit can monitor (and alert when a threshold is reached) the amount of RAM that Redis currently uses, which is the biggest issue.
The configuration could be something like this (assuming maxmemory was set to 1 GB in Redis):
check process redis
  with pidfile /var/run/redis.pid
  start program = "/etc/init.d/redis-server start"
  stop program = "/etc/init.d/redis-server stop"
  if 10 restarts within 10 cycles then timeout
  if failed host 127.0.0.1 port 6379 then restart
  if memory is greater than 1GB for 2 cycles then alert
Well... it depends. If I were to use Redis under daemon control, I would use runit. I do use Monit, but only for monitoring; I like to see the green light.
However, to exploit Redis's true power, you don't run Redis as a daemon, especially a master. If a master goes down, you will have to switch a slave to a master. Quite simply, I just shoot the node in the head and have a Chef recipe bring up a new node.
But then again... it also depends on how often you snapshot. I do not snapshot, thus no need for daemon control.
People use Redis for brute-force speed. That means not writing to disk and keeping all data in RAM. If a node goes down and you don't snapshot, data is lost.