How fast can ECS fargate boot a container? - aws-fargate

What the the minimum/average time for AWS ECS Fargate to boot and run a docker image?
For arguments sake, the 45MB anapsix/alpine-java image.
I would like to investigate using ECS Fargate to speed up the process of building software locally on a slow laptop/pc, by having the software built on a faster remote server.
As such the boot up time of the image is crucial in making the endevour worth while.

I would disagree with the accepted answer given my experience with Fargate.
I have launched 1000's of containers on Fargate, and was even featured in an AWS architecture blog for our usage of Fargate. https://aws.amazon.com/blogs/architecture/building-real-time-ai-with-aws-fargate/
Private subnets, behind a NAT gateway have no different launch times for us than containers behind an IGW. If you use single NAT instances sure, your mileage may vary.
Container launch times in Fargate are entirely determined by how large your container is. Fargate does not cache containers, so every run task results in a docker pull happening. If your images are based on Ubuntu, you will have a bad time.
We have a mix of GO from scratch containers and Alpine node containers.
On average based on the metrics we have aggregated from 1000's of launches, From scratch containers start and are healthy in the target group in 10-15 seconds.
Alpine containers take on average 30-40 seconds to launch and become healthy.
Anything longer than that and your containers are likely too large for Fargate to make any sense until they offer pre cached ecr or something similar.
For your specific example, we have similar sized containers, if your entrypoint is healthy quickly (Ie not a 60 second java start time), your container of 45mb should launch and be ready to go in 30-60 seconds.

I am still waiting for caching in Fargate that is already available in ECS+EC2. This feature request can be tracked here. It is a pain in the ass that containers take such a long time to boot on AWS Fargate. Google Cloud Platform already offers this feature as generally available with a managed Cloud Run (K8s) environment, where containers spin up on the fly (~ 2 seconds) when they receive a request. They go idle after (a configurable) 5 minutes, which causes you to only be billed for those 5 minutes.
AWS Fargate does not offer such a nice feature of "warm containers" yet, although I would highly recommend them in doing so. It is probably technically difficult in getting compute and storage close together to accomplish this, it would require an enormous amount of internal bandwidth to load those containers as fast as Google does.
Nevertheless, below is my experience with Docker containers on AWS Fargate. Boot time is highly correlated with container image size as you can see from the following sample of containers I booted (February 2019):
4000 MB ~ 5 minutes
2400 MB ~ 4 minutes
1000 MB ~ 2 minutes
350 MB ~ 50 seconds
I would recommend building your container image on a light-weight base image, such as Minideb or Alpine. This would make your container image pretty small, ranging from a few 10MBs to a few 100MBs. But then again, when you need a JVM or Python with some additional packages and c-libs, you would easily go to 1000 MB.

I've launched more than 100 containers now in Fargate and on a public VPC they take about 4 mins on average, but I've seen it as long as 7-8 mins on a bad day.
If you launch it on a Private VPC then the timing can go south in a hurry. I've seen it take 2 hours to launch a Fargate container if the NAT instance is overloaded.
Hopefully AWS will speed this up over time. It shouldn't take me longer to launch a Fargate container than it does to upload my docker image to ECR.

One could use ECS_IMAGE_PULL_BEHAVIOR = prefer-cached on EC2 launch type to reduce agent start up timings to great extent.

Related

HTTPD response time is increasing after 90 TPS

I am doing load test to tune my apache to server maximum concurrent https request. Below is the details of my test.
System
I dockerized my httpd and deployed in openshift with pod configuration is 4CPU, 8GB RAM.
Running load from Jmeter with 200 thread, 600sec ramup time, loop is for infinite. duration is long run (Jmeter is running in same network with VM configuration 16CPU, 32GB RAM ).
I compiled by setting module with worker and deployed in openshift.
Issue
Httpd is not scaling more than 90TPS, even after tried multiple mpm worker configuration (no difference with default and higher configuration)
2.Issue which i'am facing after 90TPS, average time is increasing and TPS is dropping.
Please let me know what could be the issue, if any information is required further suggestions.
I don't have the answer, but I do have questions.
1/ What does your Dockerfile look like?
2/ What does your OpenShift cluster look like? How many nodes? Separate control plane and workers? What version?
2b/ Specifically, how is traffic entering the pod (if you are going in via a route, you'll want to look at your load balancer; if you want to exclude OpenShift from the equation then for the short term, expose a NodePort and have Jmeter hit that directly)
3/ Do I read correctly that your single pod was assigned 8G ram limit? Did you mean the worker node has 8G ram?
4/ How did you deploy the app -- raw pod, deployment config? Any cpu/memory limits set, or assumed? Assuming a deployment, how many pods does it spawn? What happens if you double it? Doubled TPS or not - that'll help point to whether the problem is inside httpd or inside the ingress route.
5/ What's the nature of the test request? Does it make use of any files stored on the network, or "local" files provisioned in a network PV.
And,
6/ What are you looking to achieve? Maximum concurrent requests in one container, or maximum requests in the cluster? If you've not already look to divide and conquer -- more pods on more nodes.
Most likely you have run into a bottleneck/limitation at the SUT. See the following post for a detailed answer:
JMeter load is not increasing when we increase the threads count

How to debug aws fargate task running out of memory?

I'm running a task at fargate with CPU as 2048 and memory as 8192. Task after running some time is stopped with error
container was stopped as it ran out of memory.
Thing is that task does not fails every time. If I run the same task 10 time it fails 5 times and works 5 times. However If I take an ec2 machine with 2 vcpu and 4GB memory and try to run the same container it runs successfully.(Infact the memory usage on ec2 instance is very low).
Can somebody please guide me how to figure out the memory issue while running a fargate task?
Thanks
The way to start would be enabling memory metrics from container insights for your fargate tasks and Further correlating the Memory Usage graph with Application logs. help here
The difference between running on EC2 vs Fargate could probably be due to the fact that when you run a container on ECS Fargate, it runs on AWS's internal EC2 Instances. Now, here could possibly arise a Noisy Neighbour Situation although the chances would be pretty low.

Is AWS Fargate useful for deploying a "web" service stack?

I see Fargate as a good service for deploying a Docker Compose based stack, but I was wondering if it is any good for "long-running" web services, not just ones where you need dynamic scaling and undeterminate workloads (e.g. containers that are created and die on demand).
That depends on your use-case. ECS lets you quickly deploy containerized applications. With Fargate we don't need to manage the underlying infrastructure (say server-less approach for containers!). Fargate is suitable for long-running apps, microservices, and batch jobs.
Few of my observations on Fargate are:
Fargate storage is ephemeral - We cannot store container data in disk such as volumes. (although Fargate provides 10 GB of volume mounts that is nonpersistent empty storage.)
Logs can be sent to Cloudwatch using awslogs driver. Recently AWS announced the support for Splunk log driver.
Fargate uses only awavpc network mode.
Fargate supports environment variables. Environment variables are the only way to pass parameters to the container.

spring-data-redis cluster recovery issue

We're running a 7-node redis cluster, with all nodes as masters (no slave replication). We're using this as an in-memory cache, so we've commented out all saves in redis.conf, and we've got the following other non-defaults in redis.conf:
maxmemory: 30gb
maxmemory-policy allkeys-lru
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
cluster-require-full-coverage no
The client for this cluster is a spring-boot rest api application, using spring-data-redis with jedis as the driver. We mainly use the spring caching annotations.
We had an issue the other day where one of the masters went down for a while. With a single master down in a 7-node cluster we noted a marked increase in the average response time for api calls involving redis, which I would expect.
When the down master was brought back online and re-joined the cluster, we had a massive spike in response time. Via newrelic I can see that the app started making a ton of redis cluster calls (newrelic doesn't tell me which cluster subcommand was being used). Our normal avg response time is around 5ms; during this time it went up to 800ms and we had a few slow sample transactions that took > 70sec. On all app jvms I see the number of active threads jump from a normal 8-9 up to around 300 during this time. We have configured the tomcat http thread pool to allow 400 threads max. After about 3 minutes, the problem cleared itself up, but I now have people questioning the stability of the caching solution we chose. Newrelic doesn't give any insight into where the additional time on the long requests is being spent (it's apparently in an area that Newrelic doesn't instrument).
I've made some attempt to reproduce by running some jmeter load tests against a development environment, and while I see some moderate response time spikes when re-attaching a redis-cluster master, I don't see anything near what we saw in production. I've also run across https://github.com/xetorthio/jedis/issues/1108, but I'm not gaining any useful insight from that. I tried reducing spring.redis.cluster.max-redirects from the default 5 to 0, which didn't seem to have much effect on my load test results. I'm also not sure how appropriate a change that is for my use case.

Running Redis with Docker (performance issue)

Has anyone else seen performance issues with running Redis in a Docker container environment?
Here's what I've noticed...
Setup A: Local machine, traditional Redis install
Setup B: Local machine, using canonical Redis image https://registry.hub.docker.com/_/redis/
I've got an identical HTTP server on my local machine that fires as fast as the request/response cycle will allow.
Observations:
- A can sustain approximately 2X the throughput of B.
- B performs identical to A when you benchmark (from within the container)
So, this leads me to believe that B is slower than A because of a networking issue: i.e. the networking relays introduced by running software in a virtualized environment are creating significant performance issues...
Just wondering if anyone else has noticed anything like this?
Docker's default networking option, --net=bridge introduces overhead due to NAT packet rewriting, noticeable with high packet rates.
Network performance can be improved by using --net=host, instructing Docker to not create a separate network stack for the container, allowing full access to the host network interfaces.
This option should be used carefully though, as it lets container processes open low-numbered ports like any other root process, and access local network services like D-bus, which can lead to processes in the container being able to do unexpected things.
In short: If you know what you are running inside the container it is safe. If you suspect unwanted or aggressive behavior - do not do it.