How to debug an AWS Fargate task running out of memory? - aws-fargate

I'm running a task on Fargate with CPU set to 2048 and memory set to 8192. After running for some time, the task is stopped with the error:
container was stopped as it ran out of memory.
The thing is, the task does not fail every time. If I run the same task 10 times, it fails 5 times and works 5 times. However, if I take an EC2 machine with 2 vCPUs and 4 GB of memory and run the same container, it runs successfully (in fact, the memory usage on the EC2 instance is very low).
Can somebody please guide me on how to figure out the memory issue while running a Fargate task?
Thanks

The way to start would be enabling memory metrics from Container Insights for your Fargate tasks and then correlating the memory-usage graph with your application logs.
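If Container Insights is not already enabled, a minimal sketch of turning it on and pulling the task memory metric with the AWS CLI could look like this (the cluster name my-cluster is a placeholder, and the time window uses GNU date):

# Enable Container Insights on the cluster so Fargate tasks emit memory metrics
aws ecs update-cluster-settings \
    --cluster my-cluster \
    --settings name=containerInsights,value=enabled

# Pull the MemoryUtilized metric that Container Insights publishes for the cluster
aws cloudwatch get-metric-statistics \
    --namespace ECS/ContainerInsights \
    --metric-name MemoryUtilized \
    --dimensions Name=ClusterName,Value=my-cluster \
    --statistics Maximum \
    --period 60 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

Spikes in the Maximum statistic just before the task stops are the place to start correlating with the application logs.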
The difference between running on EC2 vs. Fargate could be due to the fact that when you run a container on ECS Fargate, it runs on AWS's internal EC2 instances. A noisy-neighbour situation could arise there, although the chances are pretty low.

Related

How to check what is causing high CPU usage on an EC2 instance

We have an EC2 instance of type m5.xlarge, which has 4 vCPUs, but the CPU usage is 100% on weekends even though the DB connections are normal.
How do we debug what is causing the high CPU usage? We checked the cron jobs running on the server as well, but everything is normal.
This also causes an increase in DB connections on RDS even when there are fewer actual site users.
Please help in finding a solution.

GCP VM consistently shutting down without warning

I've been using a GCP preemptible VM for a few months without problems, but in the last 4 weeks my instances have consistently shut off anywhere from 10 to 20 minutes into operation.
I'll be in the middle of training, and my notebook will suddenly disconnect. The terminal will show this error:
jupyter@fastai-instance:~$ Connection to 104.154.142.171 closed by remote host.
Connection to 104.154.142.171 closed.
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
I then check the status of my VM, to see that it has shut down.
I searched the terminal traceback and found this thread, which seemed promising: ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255]
When I ran sudo gcloud compute config-ssh, my VM ran for much longer than usual before shutting down, yet shut down in the same way after about an hour. Since then, it's back to the same behavior.
I know preemptible instances can be shut down when the platform needs resources, but my understanding is that this comes with some kind of warning. I've checked the status of GCP's servers after shutdowns and they appear to be fine. This is also happening the same way every time I turn my VM on, which seems too frequent for preempting.
I am not sure where to look for any clues – has anyone else had a problem like this? What's especially puzzling to me is, if it is in fact an SSH problem, why would that cause the VM itself to shutdown, rather than just break the connection?
Thanks very much for any help!
Did you try to set a shutdown script and print something to a file to validate the state of the VM when it goes down?
Try this as a shutdown script:
#!/bin/bash
curl "http://metadata.google.internal/computeMetadata/v1/instance/preempted" -H "Metadata-Flavor: Google" > /tmp/preempted.log
If there is TRUE in the file, it's because the VM has been preempted.
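A rough sketch of wiring that up, assuming the script is saved locally as shutdown.sh and the instance is named fastai-instance (a placeholder):

# Attach the script as the instance's shutdown-script metadata
gcloud compute instances add-metadata fastai-instance \
    --metadata-from-file shutdown-script=shutdown.sh

# After the next shutdown, restart the VM and inspect the marker file
cat /tmp/preempted.log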
If a VM stops and you have an active SSH connection to that VM (via gcloud compute ssh), then it's normal that you receive an error. Since the VM goes down, all connections are closed, including your SSH connection (you cannot connect to a stopped instance). The VM termination causes the SSH error, not the opposite.
When using preemptible instances, Google can reclaim the instance whenever it's needed. Note that (from the docs about preemptible instance limitations):
Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
It means that one day your instance may run for 24 hours without being terminated, but another day your instance may be stopped 30 minutes after being started if Compute Engine needs to reclaim some resources.
A comment on the "continuously shutting down" part:
(I have experienced this as well)
Keep in mind that Google prefers to shut down RECENTLY STARTED preemptible instances over ones started earlier.
The link below (and supplied earlier) has the statement:
Generally, Compute Engine avoids preempting too many instances from a single customer and preempts new instances over older instances whenever possible.
This would generally mean that, yes, if you are preempted and boot up again, it is quite likely that you are going to be preempted again and again until the load in the zone drops.
I'm surprised that Google doesn't simply preclude you from starting the preemptible VM for a while (30-60 minutes?). How much CPU is being wasted bouncing VMs up and down while we cross our fingers?
P.S. There is a dirty trick to work around your frustration: have 2 VMs configured identically except for preemptibility, but with only 1 underlying boot disk. If you are having a bad day with preempts, simply 'move' the boot disk to the non-preemptible VM, boot it, and carry on. It's a couple of simple gcloud commands to achieve this (sketched below), easily scripted and very fast. Don't tell Google I told ya...
https://cloud.google.com/compute/docs/instances/preemptible#limitations
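A rough sketch of that boot-disk shuffle, assuming a shared boot disk named fastai-boot and two otherwise identical VMs fastai-preempt and fastai-standard (all placeholder names):

# Detach the shared boot disk from the stopped preemptible VM
gcloud compute instances detach-disk fastai-preempt --disk=fastai-boot

# Attach it to the non-preemptible twin as its boot disk and start it
gcloud compute instances attach-disk fastai-standard --disk=fastai-boot --boot
gcloud compute instances start fastai-standard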

How fast can ECS Fargate boot a container?

What is the minimum/average time for AWS ECS Fargate to boot and run a Docker image?
For argument's sake, the 45 MB anapsix/alpine-java image.
I would like to investigate using ECS Fargate to speed up the process of building software locally on a slow laptop/PC, by having the software built on a faster remote server.
As such, the boot-up time of the image is crucial in making the endeavour worthwhile.
I would disagree with the accepted answer given my experience with Fargate.
I have launched thousands of containers on Fargate, and our usage of Fargate was even featured in an AWS architecture blog: https://aws.amazon.com/blogs/architecture/building-real-time-ai-with-aws-fargate/
Private subnets behind a NAT gateway have no different launch times for us than containers behind an IGW. If you use single NAT instances, then sure, your mileage may vary.
Container launch times in Fargate are entirely determined by how large your container is. Fargate does not cache containers, so every RunTask call results in a docker pull. If your images are based on Ubuntu, you will have a bad time.
We have a mix of Go from-scratch containers and Alpine Node containers.
On average, based on the metrics we have aggregated from thousands of launches, from-scratch containers start and are healthy in the target group in 10-15 seconds.
Alpine containers take on average 30-40 seconds to launch and become healthy.
Anything longer than that and your containers are likely too large for Fargate to make sense, until they offer pre-cached ECR images or something similar.
For your specific example, we have similarly sized containers; if your entrypoint becomes healthy quickly (i.e. not a 60-second Java start time), your 45 MB container should launch and be ready to go in 30-60 seconds.
I am still waiting for caching in Fargate that is already available in ECS+EC2. This feature request can be tracked here. It is a pain in the ass that containers take such a long time to boot on AWS Fargate. Google Cloud Platform already offers this feature as generally available with a managed Cloud Run (K8s) environment, where containers spin up on the fly (~ 2 seconds) when they receive a request. They go idle after (a configurable) 5 minutes, which causes you to only be billed for those 5 minutes.
AWS Fargate does not offer such a nice "warm containers" feature yet, although I would highly recommend that they do. It is probably technically difficult to get compute and storage close enough together to accomplish this; it would require an enormous amount of internal bandwidth to load those containers as fast as Google does.
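For comparison, the Cloud Run behaviour described above comes from a single deploy command, roughly like this (project, image and service names are placeholders):

# Deploy a container that scales to zero when idle and cold-starts on request
gcloud run deploy my-service \
    --image gcr.io/my-project/my-image \
    --platform managed \
    --region us-central1 \
    --allow-unauthenticated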
Nevertheless, below is my experience with Docker containers on AWS Fargate. Boot time is highly correlated with container image size as you can see from the following sample of containers I booted (February 2019):
4000 MB ~ 5 minutes
2400 MB ~ 4 minutes
1000 MB ~ 2 minutes
350 MB ~ 50 seconds
I would recommend building your container image on a lightweight base image, such as Minideb or Alpine. This would keep your container image pretty small, ranging from a few tens of MBs to a few hundred MBs. But then again, when you need a JVM or Python with some additional packages and C libs, you can easily reach 1000 MB.
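As a rough sketch of how small an Alpine-based image can stay (the app.jar and the exact Alpine package name are placeholder assumptions):

# Build the same app on an Alpine base and compare the resulting image size
cat > Dockerfile.alpine <<'EOF'
FROM alpine:3.12
RUN apk add --no-cache openjdk11-jre-headless
COPY app.jar /app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]
EOF
docker build -f Dockerfile.alpine -t myapp:alpine .
docker images myapp:alpine   # compare the SIZE column against an Ubuntu-based build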
I've launched more than 100 containers now in Fargate and on a public VPC they take about 4 mins on average, but I've seen it as long as 7-8 mins on a bad day.
If you launch it on a Private VPC then the timing can go south in a hurry. I've seen it take 2 hours to launch a Fargate container if the NAT instance is overloaded.
Hopefully AWS will speed this up over time. It shouldn't take me longer to launch a Fargate container than it does to upload my docker image to ECR.
One could use ECS_IMAGE_PULL_BEHAVIOR=prefer-cached on the EC2 launch type to reduce start-up times to a great extent, since the agent reuses a locally cached image instead of pulling it for every task.
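On an ECS-optimized Amazon Linux 2 instance that is a one-line agent setting, roughly (the restart command may differ on other AMIs):

# Tell the ECS agent to reuse a locally cached image instead of pulling on every task start
echo "ECS_IMAGE_PULL_BEHAVIOR=prefer-cached" | sudo tee -a /etc/ecs/ecs.config

# Restart the agent so the setting takes effect
sudo systemctl restart ecs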

spring-data-redis cluster recovery issue

We're running a 7-node redis cluster, with all nodes as masters (no slave replication). We're using this as an in-memory cache, so we've commented out all saves in redis.conf, and we've got the following other non-defaults in redis.conf:
maxmemory 30gb
maxmemory-policy allkeys-lru
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
cluster-require-full-coverage no
The client for this cluster is a Spring Boot REST API application, using spring-data-redis with Jedis as the driver. We mainly use the Spring caching annotations.
We had an issue the other day where one of the masters went down for a while. With a single master down in a 7-node cluster we noted a marked increase in the average response time for api calls involving redis, which I would expect.
When the down master was brought back online and re-joined the cluster, we had a massive spike in response time. Via newrelic I can see that the app started making a ton of redis cluster calls (newrelic doesn't tell me which cluster subcommand was being used). Our normal avg response time is around 5ms; during this time it went up to 800ms and we had a few slow sample transactions that took > 70sec. On all app jvms I see the number of active threads jump from a normal 8-9 up to around 300 during this time. We have configured the tomcat http thread pool to allow 400 threads max. After about 3 minutes, the problem cleared itself up, but I now have people questioning the stability of the caching solution we chose. Newrelic doesn't give any insight into where the additional time on the long requests is being spent (it's apparently in an area that Newrelic doesn't instrument).
I've made some attempt to reproduce by running some jmeter load tests against a development environment, and while I see some moderate response time spikes when re-attaching a redis-cluster master, I don't see anything near what we saw in production. I've also run across https://github.com/xetorthio/jedis/issues/1108, but I'm not gaining any useful insight from that. I tried reducing spring.redis.cluster.max-redirects from the default 5 to 0, which didn't seem to have much effect on my load test results. I'm also not sure how appropriate a change that is for my use case.
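One low-tech way to correlate the response-time spike with cluster state while re-attaching a master in such a test, as a minimal sketch (the node address is a placeholder):

# Poll cluster health once a second during the master re-join,
# to see how long the cluster reports an unstable state
watch -n 1 'redis-cli -h 10.0.0.11 -p 6379 cluster info | egrep "cluster_state|cluster_slots_ok|cluster_known_nodes"'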

Celery workers missing heartbeats and getting substantial drift over EC2

I am testing my celery implementation over 3 EC2 machines right now. I am pretty confident in my implementation now, but I am having problems with the actual worker execution. My test structure is as follows:
1 EC2 machine is designated as the broker; it also runs a celery worker.
1 EC2 machine is designated as the client (it runs the client celery script that enqueues all the tasks using .delay()); it also runs a celery worker.
1 EC2 machine is purely a worker.
All the machines have 1 celery worker running. Before, I was immediately getting the message:
"Substantial drift from celery#[other ec2 ip] may mean clocks are out of sync."
A drift amount in seconds would then be printed, which would increase over time.
I would also get messages: "missed heartbeat from celery@[other ec2 ip]".
The machine would be doing very little work at this point, so my Auto Scaling config in EC2 would shut down the instance automatically once CPU utilization got very low (<5%).
So to try to solve this problem, I attempted to sync all my machines' clocks (although I thought celery handled this) with these commands, which were run at startup on all machines:
apt-get -qy install ntp
service ntp start
With this, they all performed well for about 10 minutes with no hitches, after which I started getting missed heartbeats and my EC2 instances stalled and shut down. The weird thing is, the drift sometimes increased and then decreased.
Any idea on why this is happening?
I am using the newest version of celery (3.1) and RabbitMQ.
EDIT: It should be noted that I am utilizing the us-west-1a and us-west-1c availability zones on EC2.
EDIT 2: I am starting to think memory might be the issue. I am using a t2.micro instance, and running 3 celery workers on the same machine (only 1 instance), which is also the broker, still causes heartbeat misses and stalls.
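A minimal sketch of what could be checked and trimmed on each node, assuming Celery 3.1's --without-gossip/--without-mingle/--without-heartbeat worker flags (the app name and concurrency are placeholders):

# Confirm NTP is actually syncing and the offset is small on every machine
ntpq -p

# Run a lean worker on the t2.micro: one process, no gossip/mingle/heartbeat chatter
celery -A myapp worker \
    --concurrency=1 \
    --without-gossip --without-mingle --without-heartbeat \
    --loglevel=info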