AWS Fargate container exiting instead of Autoscaling causing 502 error - load-balancing

I configured Fargate autoscaling with a minimum capacity of 2 and a target CPU utilization of 70%. When we start a load test, only 1 container is running, and after 4 minutes it exits, with autoscaling reporting a peak CPU of 95% and memory of 81.3%. A new container starts after a delay of 2 minutes, and during those 2 minutes we get 502, 503, and 504 errors. What could be the cause?
I tried raising the minimum and desired capacity from the initial 1 container to 2, and also lowering the CPU utilization target from 80% to 70%, but it had little to no effect: the minimum capacity still stayed at 1.
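For reference, the setup described in the question (a minimum of 2 tasks and target tracking on 70% average CPU) could be expressed with boto3's Application Auto Scaling client roughly as follows. This is only a sketch: the cluster name, service name, maximum capacity, and cooldown values are placeholders and do not come from the question.

```python
def build_scalable_target(cluster, service, min_tasks=2, max_tasks=4):
    """Build kwargs for register_scalable_target on an ECS service.

    max_tasks=4 is an assumed placeholder; the question only gives the minimum.
    """
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,
        "MaxCapacity": max_tasks,
    }


def build_cpu_policy(cluster, service, target_cpu=70.0):
    """Build kwargs for put_scaling_policy: target tracking on average CPU."""
    return {
        "PolicyName": f"{service}-cpu-target-tracking",  # hypothetical name
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
            # Cooldowns (seconds) are assumed values, not from the question.
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    }


def apply_autoscaling(cluster, service):
    """Send both requests to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the builders work without boto3

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**build_scalable_target(cluster, service))
    client.put_scaling_policy(**build_cpu_policy(cluster, service))
```

Printing the two dictionaries before calling apply_autoscaling("my-cluster", "my-service") makes it easy to compare them with what the console shows for the service.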

Related

How to debug aws fargate task running out of memory?

I'm running a task on Fargate with CPU set to 2048 and memory set to 8192. After running for some time, the task is stopped with the error:
container was stopped as it ran out of memory.
The thing is that the task does not fail every time. If I run the same task 10 times, it fails 5 times and works 5 times. However, if I take an EC2 machine with 2 vCPUs and 4 GB of memory and run the same container there, it runs successfully. (In fact, the memory usage on the EC2 instance is very low.)
Can somebody please guide me on how to track down the memory issue when running a Fargate task?
Thanks
A good way to start is to enable memory metrics from Container Insights for your Fargate tasks, and then correlate the memory usage graph with your application logs.
The difference between running on EC2 and on Fargate could be due to the fact that a container on ECS Fargate runs on AWS's internal EC2 instances. A noisy-neighbour situation could arise there, although the chances are pretty low.
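As a rough sketch of that first step, once Container Insights is enabled the memory metric can be pulled from CloudWatch with boto3 and the peak timestamp compared against the application logs. The cluster and task family names are placeholders, and the 3-hour window and 60-second period are assumptions; the helper function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone


def build_memory_query(cluster, task_family, hours=3, period=60):
    """Build kwargs for CloudWatch get_metric_statistics (Container Insights)."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "MemoryUtilized",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "TaskDefinitionFamily", "Value": task_family},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Maximum"],
    }


def peak_memory(datapoints):
    """Return (timestamp, value) of the highest 'Maximum' datapoint, or None."""
    if not datapoints:
        return None
    top = max(datapoints, key=lambda point: point["Maximum"])
    return top["Timestamp"], top["Maximum"]


def fetch_peak(cluster, task_family):
    """Query CloudWatch. Needs boto3, credentials, and Container Insights on."""
    import boto3  # imported lazily so the helpers above work without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **build_memory_query(cluster, task_family)
    )
    return peak_memory(response["Datapoints"])
```

The timestamp returned by fetch_peak is the point to look up in the application logs to see what the task was doing when memory peaked.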

How to check what is causing the high CPU usage on EC2 instance

We have an EC2 instance of type m5.xlarge, which has 4 vCPUs, but CPU usage reaches 100% on weekends, when the DB connections are normal.
How can we debug what is causing the high CPU usage? We checked the cron jobs running on the server as well, but everything is normal.
This also causes an increase in DB connections on RDS at a time when there are fewer actual site users.
Please help us find a solution.

Hive LLAP low Vcore allocation

Problem Statement:
Hive LLAP daemons are not consuming the cluster vCPU allocation. 80-100 cores are available for the LLAP daemons, but only 16 are in use.
Summary:
I am testing Hive LLAP on Azure using 2 D14_v2 head nodes, 16 D14_v2 worker nodes, and 3 A-series ZooKeeper nodes. (D14_v2 = 112 GB RAM / 12 vCPUs)
15 nodes of the 16-node cluster are dedicated to LLAP.
The distribution is HDP 2.6.3.2-14.
Currently the cluster has a total of 1.56 TB of RAM and 128 vCPUs available. The LLAP daemons are allocated the proper amount of memory, but they only use 16 vCPUs in total (1 vCPU per daemon + 1 vCPU for Slider).
Configuration:
My relevant hive configs are as follows:
hive.llap.daemon.num.executors = 10 (10 of the 12 available vCPUs per node)
Yarn Max Vcores per container - 8
Other:
I have been load testing the cluster but have been unable to get any more vCPUs engaged in the process. Any thoughts or insights would be greatly appreciated.
The Resource Manager UI will only show you the core and memory allocation of the query coordinators and Slider. Each query coordinator in LLAP occupies 1 core and the minimum allotted Tez AM memory (tez.am.resource.memory.mb). To check real-time core usage by the LLAP service on HDP 2.6.3, follow these steps:
Ambari -> Hive -> Quick Links -> Grafana -> Hive LLAP overview -> Total Execution Slots
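The gap between the two numbers can be put as simple arithmetic. The sketch below assumes one LLAP daemon on each of the 15 dedicated nodes and that every executor counts as one execution slot; both are readings of the figures in the question, not something confirmed for this cluster.

```python
def yarn_visible_vcores(num_daemons, vcores_per_daemon=1, slider_vcores=1):
    """vCores the Resource Manager reports, per the question's own breakdown."""
    return num_daemons * vcores_per_daemon + slider_vcores


def total_execution_slots(num_daemons, executors_per_daemon):
    """Execution slots, assuming each executor contributes exactly one slot."""
    return num_daemons * executors_per_daemon


daemons = 15    # assumption: one daemon per node dedicated to LLAP
executors = 10  # hive.llap.daemon.num.executors from the question

print(yarn_visible_vcores(daemons))               # 16
print(total_execution_slots(daemons, executors))  # 150
```

The first figure matches the 16 vCPUs seen in the Resource Manager. The second is the kind of number to compare against the Grafana panel.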

How fast can ECS fargate boot a container?

What is the minimum/average time for AWS ECS Fargate to boot and run a Docker image?
For argument's sake, take the 45 MB anapsix/alpine-java image.
I would like to investigate using ECS Fargate to speed up the process of building software locally on a slow laptop/PC, by having the software built on a faster remote server.
As such, the boot time of the image is crucial in making the endeavour worthwhile.
I would disagree with the accepted answer given my experience with Fargate.
I have launched 1000's of containers on Fargate, and was even featured in an AWS architecture blog for our usage of Fargate. https://aws.amazon.com/blogs/architecture/building-real-time-ai-with-aws-fargate/
For us, containers in private subnets behind a NAT gateway have launch times no different from containers behind an IGW. If you use single NAT instances, sure, your mileage may vary.
Container launch times in Fargate are entirely determined by how large your container is. Fargate does not cache containers, so every run task results in a docker pull happening. If your images are based on Ubuntu, you will have a bad time.
We have a mix of Go from-scratch containers and Alpine Node containers.
On average, based on the metrics we have aggregated from thousands of launches, from-scratch containers start and are healthy in the target group in 10-15 seconds.
Alpine containers take on average 30-40 seconds to launch and become healthy.
Anything longer than that and your containers are likely too large for Fargate to make any sense, until they offer pre-cached ECR images or something similar.
For your specific example, we have similar-sized containers. If your entrypoint is healthy quickly (i.e. not a 60-second Java start time), your 45 MB container should launch and be ready to go in 30-60 seconds.
I am still waiting for caching in Fargate that is already available in ECS+EC2. This feature request can be tracked here. It is a pain in the ass that containers take such a long time to boot on AWS Fargate. Google Cloud Platform already offers this feature as generally available with a managed Cloud Run (K8s) environment, where containers spin up on the fly (~ 2 seconds) when they receive a request. They go idle after (a configurable) 5 minutes, which causes you to only be billed for those 5 minutes.
AWS Fargate does not offer such a nice "warm containers" feature yet, although I would highly recommend that they add one. It is probably technically difficult to get compute and storage close enough together to accomplish this, and it would require an enormous amount of internal bandwidth to load those containers as fast as Google does.
Nevertheless, below is my experience with Docker containers on AWS Fargate. Boot time is highly correlated with container image size as you can see from the following sample of containers I booted (February 2019):
4000 MB ~ 5 minutes
2400 MB ~ 4 minutes
1000 MB ~ 2 minutes
350 MB ~ 50 seconds
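To get a ballpark figure for an image size between these samples, one option is to interpolate linearly between the neighbouring points. This is only a rough sketch built on the four measurements above (converted to seconds); the function name is made up, and nothing guarantees that boot time is really linear between samples.

```python
# (image size in MB, boot time in seconds), from the February 2019 sample above
SAMPLES = [(350, 50), (1000, 120), (2400, 240), (4000, 300)]


def estimate_boot_seconds(image_mb, samples=SAMPLES):
    """Linearly interpolate boot time between the two nearest samples.

    Raises ValueError outside the sampled range rather than extrapolating.
    """
    points = sorted(samples)
    smallest, largest = points[0][0], points[-1][0]
    if not smallest <= image_mb <= largest:
        raise ValueError(
            f"{image_mb} MB is outside the sampled range {smallest}-{largest} MB"
        )
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= image_mb <= x1:
            return y0 + (y1 - y0) * (image_mb - x0) / (x1 - x0)


print(estimate_boot_seconds(1700))  # 180.0, halfway between 1000 MB and 2400 MB
```

The function deliberately refuses sizes below 350 MB, since the sample says nothing about how small images behave.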
I would recommend building your container image on a lightweight base image, such as Minideb or Alpine. This would make your container image pretty small, ranging from a few tens of MB to a few hundred MB. But then again, when you need a JVM or Python with some additional packages and C libraries, you can easily reach 1000 MB.
I've launched more than 100 containers now in Fargate and on a public VPC they take about 4 mins on average, but I've seen it as long as 7-8 mins on a bad day.
If you launch it on a Private VPC then the timing can go south in a hurry. I've seen it take 2 hours to launch a Fargate container if the NAT instance is overloaded.
Hopefully AWS will speed this up over time. It shouldn't take me longer to launch a Fargate container than it does to upload my docker image to ECR.
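On the EC2 launch type, one could set ECS_IMAGE_PULL_BEHAVIOR=prefer-cached in the ECS agent configuration to reduce start-up times to a great extent, because the agent then reuses an image already cached on the instance instead of pulling it again.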

Celery workers missing heartbeats and getting substantial drift over Ec2

I am testing my celery implementation over 3 ec2 machines right now. I am pretty confident in my implementation now, but I am getting problems with the actual worker execution. My test structure is as follows:
1 ec2 machine is designated as the broker, also runs a celery worker
1 ec2 machine is designated as the client (runs the client celery script that enqueues all the tasks using .delay()), also runs a celery worker
1 ec2 machine is purely a worker.
All the machines have 1 celery worker running. Before, I was immediately getting the message:
"Substantial drift from celery#[other ec2 ip] may mean clocks are out of sync."
A drift amount in seconds would then be printed, which would increase over time.
I would also get messages like: "missed heartbeat from celery#[other ec2 ip]".
The machine would be doing very little work at this point, so my AutoScaling config in ec2 would shut down the instance automatically once its cpu utilization dropped very low (<5%).
So to try to solve this problem, I attempted to sync all my machines' clocks (although I thought celery handled this) with these commands, which were run on start-up on all machines:
apt-get -qy install ntp
service ntp start
With this, they all performed well for about 10 minutes with no hitches, after which I started getting missed heartbeats and my ec2 instances stalled and shut down. The weird thing is, the drift increased and then decreased sometimes.
Any idea on why this is happening?
I am using the newest version of celery (3.1) and rabbitmq
EDIT: It should be noted that I am utilizing us-west-1a and us-west-1c availability zones on ec2.
EDIT2: I am starting to think memory problems might be an issue. I am using a t2.micro instance, and even running 3 celery workers on a single machine (only 1 instance), which is also the broker, still causes heartbeat misses and stalls.
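To check whether the clocks really are out of sync after starting ntp, one option is to measure each machine's offset against an NTP server directly. Below is a minimal SNTP sketch using only the Python standard library. The server name pool.ntp.org and the 2-second timeout are assumptions, and the helper names are made up for this example; it is a quick diagnostic, not a replacement for ntpd.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def _ntp_to_unix(seconds, fraction):
    """Convert an NTP 32.32 fixed-point timestamp to a Unix timestamp."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def parse_server_times(packet):
    """Return (receive_time, transmit_time) from a 48-byte NTP reply."""
    if len(packet) < 48:
        raise ValueError("NTP reply is shorter than 48 bytes")
    rx_sec, rx_frac, tx_sec, tx_frac = struct.unpack("!4I", packet[32:48])
    return _ntp_to_unix(rx_sec, rx_frac), _ntp_to_unix(tx_sec, tx_frac)


def clock_offset(t0, t1, t2, t3):
    """Standard NTP offset: positive means the local clock is behind.

    t0 = local send time, t1 = server receive time,
    t2 = server transmit time, t3 = local receive time.
    """
    return ((t1 - t0) + (t2 - t3)) / 2


def measure_offset(server="pool.ntp.org", timeout=2.0):
    """Query an NTP server over UDP port 123 and return the offset in seconds."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version=3, mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t0 = time.time()
        sock.sendto(request, (server, 123))
        packet, _ = sock.recvfrom(512)
        t3 = time.time()
    t1, t2 = parse_server_times(packet)
    return clock_offset(t0, t1, t2, t3)
```

Running measure_offset() on each of the three machines at roughly the same time and comparing the results shows whether the drift reported by celery matches a real clock difference.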