EKS can only run 2 pod of Fargate concurrently

EKS can only run 2 pod of Fargate concurrently - amazon-eks

this is my situation:
I follow this guide to build my cluster on EKS (with fargate profile): https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html. But when i installing my helm chart, only 2 pods can run properly, another ones got pending status with this message: Your AWS account has reached the limit on the number of Fargate pods it can run concurrently.
Anyone could help my fix this issue? Thanks.

Your region has not yet been activated. You have to launch an EC2 instance and terminate it to get past this error.
https://forums.aws.amazon.com/thread.jspa?threadID=292840

I've found the solution: provide ec2FullAccess for IAM user.

I have found that even though the AWS Default Quota for Fargate tasks is high, the Applied Quota for new accounts is 2, which is stunningly low. I am in the same situation, and I have requested a quota increase.

Related

GitLab CI stuck at "Waiting Fargate task to be ready" - but Fargate task is in fact running, but never completes

Having set up GitLab CI and AWS Fargate resources as described in the documentation, we have a situation where the runner can trigger the Fargate task, which goes into RUNNING state, but the master runner never seems to realize this.
Running with gitlab-runner 14.7.0 (98daeee0)
on gitlab-fargate-master DyE5BsVA
Preparing the "custom" executor
INFO[2022-01-27T13:54:49Z] Starting fargate PID=1447 version="0.2.0 (933d940)"
INFO[2022-01-27T13:54:49Z] Executing the command PID=1447 command=config_exec
Using Custom executor with driver fargate 0.2.0 (933d940)...
INFO[2022-01-27T13:54:49Z] Starting fargate PID=1452 version="0.2.0 (933d940)"
INFO[2022-01-27T13:54:49Z] Executing the command PID=1452 command=prepare_exec
INFO[2022-01-27T13:54:56Z] Starting new Fargate task PID=1452 command=prepare_exec
INFO[2022-01-27T13:54:58Z] Persisting data that will be used by other commands PID=1452 command=prepare_exec taskARN="arn:aws:ecs:us-east-1:558517226390:task/gitlab-ci-cluster/ee488fa1d7d7475fab9be01d5bad180e"
INFO[2022-01-27T13:54:58Z] Waiting Fargate task to be ready PID=1452 command=prepare_exec taskARN="arn:aws:ecs:us-east-1:558517226390:task/gitlab-ci-cluster/ee488fa1d7d7475fab9be01d5bad180e"
Within AWS, the task has created its Log Stream in Cloudwatch, but there are no events in that log. It's unclear what is actually happening.
What can be done to find out?
We have reverted to using a vanilla Docker container from the GitLab documentation registry.gitlab.com/tmaczukin-test-projects/fargate-driver-debian:latest but exactly same happens.

Solved - problem was missing AWS permission ECS:DescribeTasks, which for some reason was not causing an error message in the Runner.
(I had mistakenly added AmazonEC2_FullAccess, not AmazonECS_FullAccess as described in the docs)
Having run a "Generate Policy" in AWS based on CloudTrail Events (awesome new feature!), I can now confirm the permissions actually being used are:
EC2: DescribeNetworkInterfaces.
ECS: StopTask, DescribeTasks, RunTask
Note the EC2 permission, which is missing from the docs.

Not sure if you have solved your problem but I noticed this question as I had the exact same issue yesterday. For me this was caused as my gitlab manager task was using an IAM role which was limited to start and stop tasks but it was apparently missing permissions to check weather a task is in the RUNNING state. So I fixed my ecs execution role and then it started working for me.

S3 - Kubernetes probe

I have the following situation:
Application uses S3 to store data in Amazon. Application is deployed as a pod in kubernetes. Sometimes some of developers messes with access data for S3 (eg. user/password) and application fails to connect to S3 - but pod starts normally and kills previous pod version that worked OK (since all readiness and aliveness probes are OK). I thought of adding S3 probe to readiness - in order to execute HeadBucketRequest on S3 and if this one succeeds it is able to connect to S3. The problem here is that these requests cost money, and I really need them only on start of the pod.
Are there any best-practices related to this one?

If you (quote) "... really need them [the probes] only on start of the pod" then look into adding a startup probe.
In addition to what startup probes help with - pods that take longer time to start - a startup probe will make it possible to verify a condition only at pod startup time.

Readiness and liveness prove as for checking the health of POD or container while running. You scenario is quite wired but with Readiness & liveness probe it wont work as it fire on internal and which cost money.
in this case you might can use the lifecycle hook :
containers:
- image: MAGE_NAME
lifecycle:
postStart:
exec:
command: ["/bin/sh", "-c", "script.sh"]
which will run the hook at starting of the container you can keep shell file inside the POD or image.
inside shell file you can right logic if 200 response move a head and container get started.
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/

AWS EKS node group migration stopped sending logs to Kibana

I encounter a problem while using EKS with fluent bit and I will be grateful for the community help, first I'll describe the cluster.
We are running EKS cluster in a VPC that had an unmanaged node group.
The EKS cluster network configuration is marked as "public and private" and
using fluent-bit with Elasticsearch service we show logs in Kibana.
We've decided that we want to move to managed node group in that cluster and therefore migrated from the unmanaged node group to a managed node group successfully.
Since our migration we cannot see any logs in Kibana, when getting the logs manually from the fluent bit pods there are no errors.
I toggled debug level logs for fluent bit to get better look at it.
I can see that fluent-bit gathers all the log files and then I saw that we get messages:
[debug] [out_es] HTTP Status=403 URI=/_bulk
[debug] [retry] re-using retry for task_id=63 attemps=3
[debug] [sched] retry=0x7ff56260a8e8 63 in 321 seconds
Furthermore, we have managed node group in other EKS clusters but we did not migrate to them they were created with managed node group.
The created managed node group were created from the same template we have from working managed node group with the only difference is the compute power.
The template has nothing special in it except auto scale.
I compared between the node group IAM role of working node group logs and my non working node group and the Roles seems to be the same.
As far for my fluent bit configuration I have the same configuration in few EKS clusters and it works so I don't think that the root cause but if anyone thinks something else I can add it if requested.
Someone had that kind of problem? why node group migration could cause such issue?
Thanks in advance!

Lesson learned, always look at the access policy of the resource you are having issue with, maybe it does not match your node group role

Spinnaker Pipeline is failing after starting it manually

Spinnaker on Gke :
Pipeline is getting failed after starting manually
the first issue is the .spin/config didnt get created, I created that manually as missing in the steps https://cloud.google.com/solutions/continuous-delivery-spinnaker-kubernetes-engine
then when i started pipeline manually , it is giving me an error on production stage
Exception ( Wait For Manifest To Stabilize )
WaitForManifestStableTask of stage Deploy Production Backend timed out after 30 minutes 4 seconds. pausedDuration: 0 seconds, elapsedTime: 30 minutes 4 seconds,timeoutValue: 30 minutes

Many times this is caused by Kubernetes being unable to schedule the pods. Just had the same problem this morning and it was due to lack of resources. That would be the first thing I'd check. You can find the pod in the "Server Group" under the "Infrastructure" tab and view the details of why it's in the state it is.

I find this Spinnaker GitHub page insightful. I wanted to add it as a comment because I still don't have enough reputation.

ERROR: The overall deployment failed because too many individual instances failed deployment

I'm trying to deploy using CircleCI -> S3 -> CodeDeploy -> EC2.
I was able to upload deploy image onto S3 from CircleCI, but unable to deploy S3 to EC2 instance. Here's the error.
The overall deployment failed because too many individual instances
failed deployment, too few healthy instances are available for
deployment, or some instances in your deployment group are
experiencing problems. (Error code: HEALTH_CONSTRAINTS)
The error was provided from CodeDeploy. I can't figure out why and how.
I'd appreciate if you give some advise.

If you are running on Ubuntu there might be plenty of reasons, here is a checklist can verify
Check code-deploy agent is installed on your EC2 Instance. Please refer this document to install code deploy agent.
https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent-operations-install-ubuntu.html
$ sudo service codedeploy-agent status
In case if you are running Ubuntu release 20.x and you get this error
./install:22:in block in method_missing': undefined method path' for
#<IO:> (NoMethodError)
try running the install file via this script
sudo ./install auto > /tmp/logfile
Check you have EC2 Instance Code Deploy Role -> Create a code deployment role and assign it to the Instance, https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-service-role.html.
In case if you assign the EC2 Role after initiate, restart the server.
Check your appsec.yml file placement as per the top answer, try to avoid any long timeout in it.
Log into your instance check your error log
$ tail -f /var/log/aws/codedeploy-agent/codedeploy-agent.log

You should be able to figure out what caused the individual instances to fail by digging into the deployment instance details:
http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-view-instance-details.html
These should contain more detailed information about why your application was unable to be deployed.

This error is commonly due to problems in the configuration of the appSpec.yml or appSpec.json file (It depends on the format you are using).
"If you have any Hook I recommend that you remove them, check if it works, then you can add one by one (the Hooks) and so you can identify the error"
The appspec.yml file should be located at the root of your project:
│-- appspec.yml
│-- index.html
└-- scripts
│-- install_dependencies
│-- start_server
└-- stop_server
In the scripts folder you will have to place the processes that you want to be executed according to the Hook
Here is an example of the appspec.yml file
version: 0.0
os: linux
files:
- source: /index.html
destination: /var/www/html/
hooks:
BeforeInstall:
- location: scripts/install_dependencies
timeout: 300
runas: root
- location: scripts/start_server
timeout: 300
runas: root
ApplicationStop:
- location: scripts/stop_server
timeout: 300
runas: root
I hope I can help you 😃👻🕺🏾

Make sure the CodeDeploy Host Agent Service is running in your target EC2 instance.

The error you are facing is a generic error message thrown on any of the event failure which could be beforeblockTraffic, blockTraffic, ApplicationStop etc.
The first step in this case would be check whether code deploy agent is running or not if first event i.e. BeforeBlockTraffic event is failed.
As you can see in the screenshot below, the event failure message would tell you the exact error behind.

From the failed deployments, I can see all lifecycle events were skipped. Instance i-0bcc36e73851297f2 is currently in Stopped state but I can see the IAM instance profile is missing. Your Amazon EC2 instances need permission to access the Amazon S3 buckets or GitHub repositories where the applications that will be deployed by AWS CodeDeploy are stored. To launch Amazon EC2 instances that are compatible with AWS CodeDeploy, you must create an additional IAM role, an instance profile. 1
For such failures, you can always begin with a general troubleshooting checklist for a failed deployment 2 and then look for troubleshooting guides on Deployment Issues and Instance issues3.
1[http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-create-iam-instance-profile.html]1
2 [http://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting-general.html]2
3 [http://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting.html]3

Check the status of the Code Deploy Agent. In my case, the agent wasn't up.

Please check the role given to the ec2 machine(where the agent is running). It should have s3 access as well. This resolved my issue.

"The CodeDeploy agent did not find an AppSpec file within the unpacked revision directory at revision-relative path 'appspec.yml'"
Please place your appspec.yml file in your root folder to solve this error
To access your after script and before script

The overall deployment failed because too many individual instances failed deployment, too few healthy instances are available for deployment, or some instances in your deployment group are experiencing problems.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas