S3 - Kubernetes probe - amazon-s3

I have the following situation:
Application uses S3 to store data in Amazon. Application is deployed as a pod in kubernetes. Sometimes some of developers messes with access data for S3 (eg. user/password) and application fails to connect to S3 - but pod starts normally and kills previous pod version that worked OK (since all readiness and aliveness probes are OK). I thought of adding S3 probe to readiness - in order to execute HeadBucketRequest on S3 and if this one succeeds it is able to connect to S3. The problem here is that these requests cost money, and I really need them only on start of the pod.
Are there any best-practices related to this one?

If you (quote) "... really need them [the probes] only on start of the pod" then look into adding a startup probe.
In addition to what startup probes help with - pods that take longer time to start - a startup probe will make it possible to verify a condition only at pod startup time.

Readiness and liveness prove as for checking the health of POD or container while running. You scenario is quite wired but with Readiness & liveness probe it wont work as it fire on internal and which cost money.
in this case you might can use the lifecycle hook :
containers:
- image: MAGE_NAME
lifecycle:
postStart:
exec:
command: ["/bin/sh", "-c", "script.sh"]
which will run the hook at starting of the container you can keep shell file inside the POD or image.
inside shell file you can right logic if 200 response move a head and container get started.
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/

Related

GitLab CI stuck at "Waiting Fargate task to be ready" - but Fargate task is in fact running, but never completes

Having set up GitLab CI and AWS Fargate resources as described in the documentation, we have a situation where the runner can trigger the Fargate task, which goes into RUNNING state, but the master runner never seems to realize this.
Running with gitlab-runner 14.7.0 (98daeee0)
on gitlab-fargate-master DyE5BsVA
Preparing the "custom" executor
INFO[2022-01-27T13:54:49Z] Starting fargate PID=1447 version="0.2.0 (933d940)"
INFO[2022-01-27T13:54:49Z] Executing the command PID=1447 command=config_exec
Using Custom executor with driver fargate 0.2.0 (933d940)...
INFO[2022-01-27T13:54:49Z] Starting fargate PID=1452 version="0.2.0 (933d940)"
INFO[2022-01-27T13:54:49Z] Executing the command PID=1452 command=prepare_exec
INFO[2022-01-27T13:54:56Z] Starting new Fargate task PID=1452 command=prepare_exec
INFO[2022-01-27T13:54:58Z] Persisting data that will be used by other commands PID=1452 command=prepare_exec taskARN="arn:aws:ecs:us-east-1:558517226390:task/gitlab-ci-cluster/ee488fa1d7d7475fab9be01d5bad180e"
INFO[2022-01-27T13:54:58Z] Waiting Fargate task to be ready PID=1452 command=prepare_exec taskARN="arn:aws:ecs:us-east-1:558517226390:task/gitlab-ci-cluster/ee488fa1d7d7475fab9be01d5bad180e"
Within AWS, the task has created its Log Stream in Cloudwatch, but there are no events in that log. It's unclear what is actually happening.
What can be done to find out?
We have reverted to using a vanilla Docker container from the GitLab documentation registry.gitlab.com/tmaczukin-test-projects/fargate-driver-debian:latest but exactly same happens.
Solved - problem was missing AWS permission ECS:DescribeTasks, which for some reason was not causing an error message in the Runner.
(I had mistakenly added AmazonEC2_FullAccess, not AmazonECS_FullAccess as described in the docs)
Having run a "Generate Policy" in AWS based on CloudTrail Events (awesome new feature!), I can now confirm the permissions actually being used are:
EC2: DescribeNetworkInterfaces.
ECS: StopTask, DescribeTasks, RunTask
Note the EC2 permission, which is missing from the docs.
Not sure if you have solved your problem but I noticed this question as I had the exact same issue yesterday. For me this was caused as my gitlab manager task was using an IAM role which was limited to start and stop tasks but it was apparently missing permissions to check weather a task is in the RUNNING state. So I fixed my ecs execution role and then it started working for me.

Managing the health and well being of multiple pods with dependencies

We have several pods (as service/deployments) in our k8s workflow that are dependent on each other, such that if one goes into a CrashLoopBackOff state, then all these services need to be redeployed.
Instead of having to manually do this, is there a programatic way of handling this?
Of course we are trying to figure out why the pod in question is crashing.
If these are so tightly dependant on each other, I would consider these options
a) Rearchitect your system to be more resilient towards failure and tolerate, if a pod is temporary unavailable
b) Put all parts into one pod as separate containers, making the atomic design more explicit
If these don't fit your needs, you can use the Kubernetes API to create a program that automates the task of restarting all dependent parts. There are client libraries for multiple languages and integration is quite easy. The next step would be a custom resource definition (CRD) so you can manage your own system using an extension to the Kubernetes API.
First thing to do is making sure that pods are started in correct sequence. This can be done using initContainers like that:
spec:
initContainers:
- name: waitfor
image: jwilder/dockerize
args:
- -wait
- "http://config-srv/actuator/health"
- -wait
- "http://registry-srv/actuator/health"
- -wait
- "http://rabbitmq:15672"
- -timeout
- 600s
Here your pod will not start until all the services in a list are responding to HTTP probes.
Next thing you may want to define liveness probe that periodically executes curl to the same services
spec:
livenessProbe:
exec:
command:
- /bin/sh
- -c
- curl http://config-srv/actuator/health &&
curl http://registry-srv/actuator/health &&
curl http://rabbitmq:15672
Now if any of those services fail - you pod will fail liveness probe, be restarted and wait for services to become back online.
That's just an example how it can be done. In your case checks can be different of course.

Build spinnaker with docker-compose, redirect to localhost

i build spinnaker using docker-compose follow here
but it always redirect to localhost, how can i fix this.
e.g.
http://localhost:8084/auth/redirect?to=http%3A%2F%2F192.168.99.100%3A9000%2F%23%2Finfrastructure
i set the host:0.0.0.0 in spinnaker-local.yml and configured deck apache2 with proxyPreserve=On, it's not working.
where is the configuration about 'redirect'?
All containers running well but fiat gets error mesages, like this:
WARN 1 --- [ecutionAction-1] c.n.s.fiat.roles.UserRolesSyncer : [] User permission sync failed. Server status is DOWN. Trying again in 10000 ms. Cause:(Provider: DefaultServiceAccountProvider) retrofit.RetrofitError: unexpected url: front50/serviceAccounts
i'm sure set fiat false, is this matter?
thanks.
The docker-compose link project is not available anymore. That deployment type is not supported anymore.
The easiest way i suggest for people to get started quick is by using Armory Open source Minnaker. It runs on top of a K3S small cluster and contains a functional spinnaker deployment.
Great way to get started.
I tried the debian local deployment and it failed all the time.
Enjoy your CD operations.

YARN Architecture of Hadoop 2.0

From below link of Apache Hadoop site, I learn that
ApplicationMaster has the responsibility of negotiating appropriate
resource containers from the Scheduler (ResourceManager)
and also learn that
ApplicationsManager negotiating the first container for executing the
ApplicationMaster
Link : http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
So here is my confusion.
If ApplicationMaster has the responsilibility to request ResourceManager for Container, then Who is creating the first container and what is the process to create the first container for executing the ApplicationMaster?
Is there anyone giving and request to create the first container?
What are the resonsibilities of the first Container? First Container only executes the ApplicationMaster or it is also behaving like other Resource Container?
Please let me know if anyone has the idea regarding this.
First of all, you are confusing the terms ApplicationManager and ApplicationMaster. They are not the same, have a look at my answer to understand difference between Application Manager and Application Master in YARN.
Answers to your questions are given below:
YarnClient has the responsibility to submit the application to ResourceManager, it sends an ApplicationSubmissionContext object to ResourceManager, which represents all of the information needed by the ResourceManager to launch the ApplicationMaster for an application.
Yes, YarnClient does that!
First Container is the Application Master, its job is to request the resources(containers) from ResourceManager and make application level decisions. If a sufficient number of containers (defined by the logic in your ApplicationMaster) are provided by the ResourceManager, then ApplicationMaster can go ahead and launch the application code on containers. FurtherMore, ApplicationMaster keeps track of failed containers and relauch them or terminates the application(kills all other containers), again based on the logic of your ApplicationMaster.
To understand the internals of Hadoop YARN, i would suggest you to read YARN paper or if you have more time you can read a book on Hadoop YARN.

ERROR: The overall deployment failed because too many individual instances failed deployment

I'm trying to deploy using CircleCI -> S3 -> CodeDeploy -> EC2.
I was able to upload deploy image onto S3 from CircleCI, but unable to deploy S3 to EC2 instance. Here's the error.
The overall deployment failed because too many individual instances
failed deployment, too few healthy instances are available for
deployment, or some instances in your deployment group are
experiencing problems. (Error code: HEALTH_CONSTRAINTS)
The error was provided from CodeDeploy. I can't figure out why and how.
I'd appreciate if you give some advise.
If you are running on Ubuntu there might be plenty of reasons, here is a checklist can verify
Check code-deploy agent is installed on your EC2 Instance. Please refer this document to install code deploy agent.
https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent-operations-install-ubuntu.html
$ sudo service codedeploy-agent status
In case if you are running Ubuntu release 20.x and you get this error
./install:22:in block in method_missing': undefined method path' for
#<IO:> (NoMethodError)
try running the install file via this script
sudo ./install auto > /tmp/logfile
Check you have EC2 Instance Code Deploy Role -> Create a code deployment role and assign it to the Instance, https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-service-role.html.
In case if you assign the EC2 Role after initiate, restart the server.
Check your appsec.yml file placement as per the top answer, try to avoid any long timeout in it.
Log into your instance check your error log
$ tail -f /var/log/aws/codedeploy-agent/codedeploy-agent.log
You should be able to figure out what caused the individual instances to fail by digging into the deployment instance details:
http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-view-instance-details.html
These should contain more detailed information about why your application was unable to be deployed.
This error is commonly due to problems in the configuration of the appSpec.yml or appSpec.json file (It depends on the format you are using).
"If you have any Hook I recommend that you remove them, check if it works, then you can add one by one (the Hooks) and so you can identify the error"
The appspec.yml file should be located at the root of your project:
│-- appspec.yml
│-- index.html
└-- scripts
│-- install_dependencies
│-- start_server
└-- stop_server
In the scripts folder you will have to place the processes that you want to be executed according to the Hook
Here is an example of the appspec.yml file
version: 0.0
os: linux
files:
- source: /index.html
destination: /var/www/html/
hooks:
BeforeInstall:
- location: scripts/install_dependencies
timeout: 300
runas: root
- location: scripts/start_server
timeout: 300
runas: root
ApplicationStop:
- location: scripts/stop_server
timeout: 300
runas: root
I hope I can help you 😃👻🕺🏾
Make sure the CodeDeploy Host Agent Service is running in your target EC2 instance.
The error you are facing is a generic error message thrown on any of the event failure which could be beforeblockTraffic, blockTraffic, ApplicationStop etc.
The first step in this case would be check whether code deploy agent is running or not if first event i.e. BeforeBlockTraffic event is failed.
As you can see in the screenshot below, the event failure message would tell you the exact error behind.
From the failed deployments, I can see all lifecycle events were skipped. Instance i-0bcc36e73851297f2 is currently in Stopped state but I can see the IAM instance profile is missing. Your Amazon EC2 instances need permission to access the Amazon S3 buckets or GitHub repositories where the applications that will be deployed by AWS CodeDeploy are stored. To launch Amazon EC2 instances that are compatible with AWS CodeDeploy, you must create an additional IAM role, an instance profile. 1
For such failures, you can always begin with a general troubleshooting checklist for a failed deployment 2 and then look for troubleshooting guides on Deployment Issues and Instance issues3.
1[http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-create-iam-instance-profile.html]1
2 [http://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting-general.html]2
3 [http://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting.html]3
Check the status of the Code Deploy Agent. In my case, the agent wasn't up.
Please check the role given to the ec2 machine(where the agent is running). It should have s3 access as well. This resolved my issue.
"The CodeDeploy agent did not find an AppSpec file within the unpacked revision directory at revision-relative path 'appspec.yml'"
Please place your appspec.yml file in your root folder to solve this error
To access your after script and before script
The overall deployment failed because too many individual instances failed deployment, too few healthy instances are available for deployment, or some instances in your deployment group are experiencing problems.