Running AWS Log Agent from inside a Fargate container - amazon-cloudwatch

I'm trying to run the AWS Logs Agent inside a Docker container on AWS ECS Fargate.
This has been working fine under EC2 for several years. Under Fargate, it does not seem to be able to resolve the task role being passed to it.
Permissions on the Task Role should be good... I've even tried giving it full CloudWatch permissions to eliminate that as a cause.
I managed to hack the Python-based launcher script to add a --debug flag, which gave me this in the log:
Caught retryable HTTP exception while making metadata service request to http://169.254.169.254/latest/meta-data/iam/security-credentials
It does not appear to be properly resolving the credentials that are passed into the task as the 'Task Role'.

I managed to find a workaround that may illustrate what I believe to be a bug or inadequacy in the agent. I had to patch the launcher script using sed as follows:
sed -i "s|HTTPS_PROXY|AWS_CONTAINER_CREDENTIALS_RELATIVE_URI=$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI HTTPS_PROXY|" /var/awslogs/bin/awslogs-agent-launcher.sh
This essentially de-references the ENV variable holding the URI for retrieving the task role and passes it to the agent's launcher.
It results in something like this:
/usr/bin/env -i AWS_CONTAINER_CREDENTIALS_RELATIVE_URI=/v2/credentials/f4ca7e30-b73f-4919-ae14-567b1262b27b (etc...)
With this in place, I restart the log agent and it works as expected.
Note that you can use a similar sed edit to add the --debug flag to the launcher, which was very helpful in figuring out where it went astray.
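For context on why the workaround helps: on ECS, task role credentials are served from the container credentials endpoint at 169.254.170.2, under the path held in AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, rather than from the EC2 instance metadata address seen in the debug log. Since the launcher starts the agent through /usr/bin/env -i, that variable is dropped unless it is passed explicitly. Here is a minimal Python sketch of the lookup; the function name is hypothetical and this is not the agent's actual code.

```python
import os

# Link-local address of the ECS container credentials endpoint.
ECS_CREDENTIALS_HOST = "http://169.254.170.2"


def task_role_credentials_url(environ=None):
    """Return the URL an SDK would query for task role credentials, or None."""
    env = os.environ if environ is None else environ
    relative_uri = env.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
    if not relative_uri:
        # Variable was stripped (for example by `env -i`), so a client would
        # fall back to the instance metadata address that fails in the log above.
        return None
    return ECS_CREDENTIALS_HOST + relative_uri


if __name__ == "__main__":
    print(task_role_credentials_url())
```

Running this inside the task before and after the sed change should show None turning into a full URL, assuming the variable is set in the container's environment.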

Related

GitLab CI stuck at "Waiting Fargate task to be ready" - but Fargate task is in fact running, but never completes

Having set up GitLab CI and AWS Fargate resources as described in the documentation, we have a situation where the runner can trigger the Fargate task, which goes into RUNNING state, but the master runner never seems to realize this.
Running with gitlab-runner 14.7.0 (98daeee0)
on gitlab-fargate-master DyE5BsVA
Preparing the "custom" executor
INFO[2022-01-27T13:54:49Z] Starting fargate PID=1447 version="0.2.0 (933d940)"
INFO[2022-01-27T13:54:49Z] Executing the command PID=1447 command=config_exec
Using Custom executor with driver fargate 0.2.0 (933d940)...
INFO[2022-01-27T13:54:49Z] Starting fargate PID=1452 version="0.2.0 (933d940)"
INFO[2022-01-27T13:54:49Z] Executing the command PID=1452 command=prepare_exec
INFO[2022-01-27T13:54:56Z] Starting new Fargate task PID=1452 command=prepare_exec
INFO[2022-01-27T13:54:58Z] Persisting data that will be used by other commands PID=1452 command=prepare_exec taskARN="arn:aws:ecs:us-east-1:558517226390:task/gitlab-ci-cluster/ee488fa1d7d7475fab9be01d5bad180e"
INFO[2022-01-27T13:54:58Z] Waiting Fargate task to be ready PID=1452 command=prepare_exec taskARN="arn:aws:ecs:us-east-1:558517226390:task/gitlab-ci-cluster/ee488fa1d7d7475fab9be01d5bad180e"
Within AWS, the task has created its log stream in CloudWatch, but there are no events in that log. It's unclear what is actually happening.
What can be done to find out?
We have reverted to using the vanilla Docker container from the GitLab documentation, registry.gitlab.com/tmaczukin-test-projects/fargate-driver-debian:latest, but exactly the same thing happens.
Solved: the problem was a missing AWS permission, ecs:DescribeTasks, which for some reason did not cause an error message in the Runner.
(I had mistakenly added AmazonEC2FullAccess, not AmazonECS_FullAccess as described in the docs.)
Having run a "Generate Policy" in AWS based on CloudTrail Events (awesome new feature!), I can now confirm the permissions actually being used are:
EC2: DescribeNetworkInterfaces
ECS: StopTask, DescribeTasks, RunTask
Note the EC2 permission, which is missing from the docs.
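The four actions above could be written as an IAM policy document roughly as follows. This is only a sketch: the "*" resource scope and the function and variable names are assumptions made for illustration, and your setup may need tighter scoping or additional actions.

```python
import json

# Actions reported by "Generate Policy" in the answer above.
RUNNER_ACTIONS = [
    "ec2:DescribeNetworkInterfaces",
    "ecs:RunTask",
    "ecs:DescribeTasks",
    "ecs:StopTask",
]


def build_runner_policy(actions=RUNNER_ACTIONS, resource="*"):
    """Build a minimal IAM policy document allowing the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": resource}
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the IAM policy editor.
    print(json.dumps(build_runner_policy(), indent=2))
```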
Not sure if you have solved your problem, but I noticed this question because I had the exact same issue yesterday. For me, it was caused by my GitLab manager task using an IAM role that was limited to starting and stopping tasks; it was apparently missing the permission to check whether a task is in the RUNNING state. So I fixed my ECS execution role and then it started working for me.

Azure Container Instance is immediately killed on Startup

I am trying to run an azure container instance but it appears to be getting killed off the second I run it. This works fine in 2 other resource groups but not my production resource group where I see the following:
In events I see 'Successfully pulled image
selenium/standalone-chrome:latest' with count 1 and then 'Started
container' and then 'Killing container' with count 31. The times for
started and killed are the same.
In logs, it just says 'No logs available'
The metrics for CPU and memory on the container never show any change from zero.
I looked at this article, but the proposed solution didn't work: Azure Container Group Instance. I have tried adding both an empty directory volume and 2 GB of RAM as advised here: https://github.com/SeleniumHQ/docker-selenium but nothing works.
This is the code I am using to create the container:
containerGroup = await azure.ContainerGroups.Define(containerName)
    .WithRegion("West Europe")
    .WithExistingResourceGroup(configuration.ContainerResourceGroup)
    .WithLinux()
    .WithPublicImageRegistryOnly()
    .WithEmptyDirectoryVolume("devshm")
    .DefineContainerInstance(containerName)
        .WithImage("selenium/standalone-chrome")
        .WithExternalTcpPorts(4444)
        .WithVolumeMountSetting("devshm", "/dev/shm")
        .WithMemorySizeInGB(2)
        .Attach()
    .WithDnsPrefix(configuration.AppServiceName + "container")
    .WithRestartPolicy(ContainerGroupRestartPolicy.OnFailure)
    .CreateAsync(cancellationToken);
How do I debug what is going wrong?
What is wrong with the container?
In case this helps someone: I renamed the "containerName" parameter in the above example from myinstance to myinstance1 and changed the region from West Europe to UK South. This fixed the issue. I can only think that Azure caches instances somehow to reduce startup times and the cached image I was using was poisoned somehow.
One issue could be the restart policy. Have a look at the restart policy section of Microsoft's ACI troubleshooting page. Under the header "Container continually exits and restarts (no long-running process)", it says:
Container groups default to a restart policy of Always, so containers
in the container group always restart after they run to completion.
You may need to change this to OnFailure or Never if you intend to run
task-based containers. If you specify OnFailure and still see
continual restarts, there might be an issue with the application or
script executed in your container.
In your case, you may need to adjust the code as follows, using WithStartingCommandLine:
containerGroup = await azure.ContainerGroups.Define(containerName)
    .WithRegion("West Europe")
    .WithExistingResourceGroup(configuration.ContainerResourceGroup)
    .WithLinux()
    .WithPublicImageRegistryOnly()
    .WithEmptyDirectoryVolume("devshm")
    .DefineContainerInstance(containerName)
        .WithImage("selenium/standalone-chrome")
        .WithExternalTcpPorts(4444)
        .WithVolumeMountSetting("devshm", "/dev/shm")
        .WithMemorySizeInGB(2)
        .WithStartingCommandLine("tail")
        .WithStartingCommandLine("-f")
        .WithStartingCommandLine("/dev/null")
        .Attach()
    .WithDnsPrefix(configuration.AppServiceName + "container")
    .WithRestartPolicy(ContainerGroupRestartPolicy.OnFailure)
    .CreateAsync(cancellationToken);
This link is helpful for this issue.
--command-line
linux => "tail -f /dev/null"
windows => "ping -t localhost"
# .yml
command: tail -f /dev/null
It will keep your Azure instance running, so that Azure now has an endpoint on which you can connect to and analyze the process.

Azure Container Instances stuck in "Creating" state

Whether I have the Azure agent plugin for Jenkins create my container or I do it manually, it never enters a running state.
az container create \
--os-type Windows \
--location eastus \
--registry-login-server SERVER.azurecr.io \
--registry-password PASSWORD \
--registry-username USERNAME \
--image namespace/image \
--name jenkins-permanent \
--resource-group devops-aci \
--cpu 2 \
--memory 3.5 \
--restart-policy Always \
--command-line "-jnlpUrl http://host:8080/computer/NAME/slave-agent.jnlp -secret SECRET -workDir \"C:\\jenkins\""
I've gone through all the troubleshooting steps that apply, tried a different region, but to no avail.
Here's a current event that I got which seems to be the most progress I've had yet:
{
"count": 1,
"firstTimestamp": "2017-12-07T03:02:56+00:00",
"lastTimestamp": "2017-12-07T03:02:56+00:00",
"message": "Failed to pull image \"MYREPO.azurecr.io/my-company/windows-agent:latest\": Error response from daemon: {\"message\":\"Get https://MYREPO.azurecr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\"}",
"name": "Failed",
"type": "Warning"
}
The funny thing is, this event happens before and after one case of the instance working (but unfortunately my entrypoint command was wrong, so it never started).
I really feel like Azure is punting on this and I just have no way to change the order I do anything. It's simply one command.
Alexander, here's a lead for checking what could be causing the delay, or whether the deployment has failed in the background. This information would be critical to narrowing down the issue: https://learn.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-troubleshoot-tips#determine-error-code
From the article above check on deployment logs:
Enable debug logging:
PowerShell
In PowerShell, set the DeploymentDebugLogLevel parameter to All, ResponseContent, or RequestContent.
New-AzureRmResourceGroupDeployment -ResourceGroupName examplegroup -TemplateFile c:\Azure\Templates\storage.json -DeploymentDebugLogLevel All
or Azure CLI:
az group deployment operation list --resource-group ExampleGroup --name vmlinux
Also check the deployment sequence:
Many deployment errors happen when resources are deployed in an unexpected sequence. These errors arise when dependencies are not correctly set. When you are missing a needed dependency, one resource attempts to use a value for another resource but the other does not yet exist.
The above link contains more details. Let me know if this helps.
Figured it out: it turns out the backslashes in the executable path in my command were not having their escapes honoured, either because I was calling az from bash, or because something on the Azure side isn't handling the escaping correctly or isn't escaping them itself.
My solution has been to just use forward slashes in the paths. Windows seems to be handling them correctly, and I prefer to not be bothered with its odd preference for backslashes.
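If the command line is generated by a script, the forward-slash approach could be applied automatically before the value is handed to az. A minimal Python sketch follows; both function names are hypothetical, and the arguments simply mirror the --command-line value from the question.

```python
def to_forward_slashes(windows_path):
    """Replace backslashes with forward slashes so no escaping is needed."""
    return windows_path.replace("\\", "/")


def agent_command_line(jnlp_url, secret, work_dir):
    """Assemble the agent arguments used in the question's --command-line value."""
    return '-jnlpUrl {} -secret {} -workDir "{}"'.format(
        jnlp_url, secret, to_forward_slashes(work_dir)
    )


if __name__ == "__main__":
    print(agent_command_line(
        "http://host:8080/computer/NAME/slave-agent.jnlp", "SECRET", "C:\\jenkins"
    ))
```

The point of normalizing in one place is that neither bash nor the Azure side has any backslash left to interpret.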
Related to my issue is that the speed of the service makes troubleshooting very difficult. It takes a long time to go round trip with any fixes. So if you're using Azure Container Instances and want better performance, go upvote this feedback item that I've created.
How big is your image? You can always debug in two steps.
Run az container show -g devops-aci -n jenkins-permanent. The output should contain a list of container events in the container JSON object. The event message should give you a hint about what's going on.
Run az container logs -g devops-aci -n jenkins-permanent. It should give you the logs of your container. If the problem is within your image, you should be able to see some error output.

Build spinnaker with docker-compose, redirect to localhost

I built Spinnaker using docker-compose, following the instructions here,
but it always redirects to localhost. How can I fix this?
e.g.
http://localhost:8084/auth/redirect?to=http%3A%2F%2F192.168.99.100%3A9000%2F%23%2Finfrastructure
I set host: 0.0.0.0 in spinnaker-local.yml and configured deck's apache2 with proxyPreserve=On, but it's not working.
Where is the redirect configured?
All containers are running well, but fiat gets error messages like this:
WARN 1 --- [ecutionAction-1] c.n.s.fiat.roles.UserRolesSyncer : [] User permission sync failed. Server status is DOWN. Trying again in 10000 ms. Cause:(Provider: DefaultServiceAccountProvider) retrofit.RetrofitError: unexpected url: front50/serviceAccounts
I'm sure I set fiat to false. Does this matter?
Thanks.
The docker-compose project linked above is no longer available, and that deployment type is no longer supported.
The easiest way I can suggest for people to get started quickly is Armory's open source Minnaker. It runs on top of a small K3s cluster and contains a functional Spinnaker deployment.
It's a great way to get started.
I tried the Debian local deployment and it failed every time.
Enjoy your CD operations.

ERROR: The overall deployment failed because too many individual instances failed deployment

I'm trying to deploy using CircleCI -> S3 -> CodeDeploy -> EC2.
I was able to upload the deploy image to S3 from CircleCI, but I'm unable to deploy from S3 to the EC2 instance. Here's the error:
The overall deployment failed because too many individual instances
failed deployment, too few healthy instances are available for
deployment, or some instances in your deployment group are
experiencing problems. (Error code: HEALTH_CONSTRAINTS)
The error came from CodeDeploy. I can't figure out why or how it happens.
I'd appreciate any advice.
If you are running on Ubuntu, there can be plenty of reasons. Here is a checklist you can verify:
Check that the CodeDeploy agent is installed on your EC2 instance. Please refer to this document to install the CodeDeploy agent:
https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent-operations-install-ubuntu.html
$ sudo service codedeploy-agent status
If you are running Ubuntu release 20.x and you get this error:
./install:22:in `block in method_missing': undefined method `path' for #<IO:> (NoMethodError)
try running the install file like this:
sudo ./install auto > /tmp/logfile
Check that the EC2 instance has a CodeDeploy role: create a code deployment role and assign it to the instance, https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-service-role.html.
If you assign the EC2 role after the instance was launched, restart the server.
Check your appspec.yml file placement as per the top answer, and try to avoid any long timeouts in it.
Log into your instance and check your error log:
$ tail -f /var/log/aws/codedeploy-agent/codedeploy-agent.log
You should be able to figure out what caused the individual instances to fail by digging into the deployment instance details:
http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-view-instance-details.html
These should contain more detailed information about why your application was unable to be deployed.
This error is commonly due to problems in the configuration of the appspec.yml or appspec.json file (depending on the format you are using).
If you have any hooks, I recommend that you remove them and check whether the deployment works; then add the hooks back one by one so you can identify the one causing the error.
The appspec.yml file should be located at the root of your project:
│-- appspec.yml
│-- index.html
└-- scripts
    │-- install_dependencies
    │-- start_server
    └-- stop_server
In the scripts folder, place the scripts that you want executed for each hook.
Here is an example of the appspec.yml file:
version: 0.0
os: linux
files:
  - source: /index.html
    destination: /var/www/html/
hooks:
  BeforeInstall:
    - location: scripts/install_dependencies
      timeout: 300
      runas: root
    - location: scripts/start_server
      timeout: 300
      runas: root
  ApplicationStop:
    - location: scripts/stop_server
      timeout: 300
      runas: root
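A quick local check along these lines can catch a misplaced appspec.yml or a missing hook script before the revision is uploaded. It is only a sketch: check_revision is a hypothetical helper, and it finds "location:" entries with a simple regular expression instead of a real YAML parser, so it assumes the file is laid out like the example above.

```python
import os
import re

# Matches hook entries such as "- location: scripts/start_server".
LOCATION_RE = re.compile(r"^\s*-\s*location:\s*(\S+)\s*$")


def check_revision(root):
    """Return a list of problems found in a CodeDeploy revision directory."""
    appspec = os.path.join(root, "appspec.yml")
    if not os.path.isfile(appspec):
        return ["appspec.yml not found at revision root"]
    problems = []
    with open(appspec, encoding="utf-8") as handle:
        for line in handle:
            match = LOCATION_RE.match(line)
            # Hook locations are relative to the revision root.
            if match and not os.path.isfile(os.path.join(root, match.group(1))):
                problems.append("missing hook script: " + match.group(1))
    return problems


if __name__ == "__main__":
    import sys
    for problem in check_revision(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(problem)
```

An empty result means the file and every referenced script were found; it says nothing about whether the scripts themselves succeed on the instance.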
I hope I can help you 😃👻🕺🏾
Make sure the CodeDeploy Host Agent Service is running in your target EC2 instance.
The error you are facing is a generic error message thrown when any lifecycle event fails, which could be BeforeBlockTraffic, BlockTraffic, ApplicationStop, etc.
If the first event, i.e. BeforeBlockTraffic, has failed, the first step is to check whether the CodeDeploy agent is running.
As you can see in the screenshot below, the event failure message would tell you the exact error behind.
From the failed deployments, I can see all lifecycle events were skipped. Instance i-0bcc36e73851297f2 is currently in Stopped state but I can see the IAM instance profile is missing. Your Amazon EC2 instances need permission to access the Amazon S3 buckets or GitHub repositories where the applications that will be deployed by AWS CodeDeploy are stored. To launch Amazon EC2 instances that are compatible with AWS CodeDeploy, you must create an additional IAM role, an instance profile [1].
For such failures, you can always begin with a general troubleshooting checklist for a failed deployment [2] and then look for troubleshooting guides on deployment issues and instance issues [3].
[1] http://docs.aws.amazon.com/codedeploy/latest/userguide/how-to-create-iam-instance-profile.html
[2] http://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting-general.html
[3] http://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting.html
Check the status of the CodeDeploy agent. In my case, the agent wasn't up.
Please check the role given to the EC2 machine (where the agent is running). It should have S3 access as well. This resolved my issue.
"The CodeDeploy agent did not find an AppSpec file within the unpacked revision directory at revision-relative path 'appspec.yml'"
Please place your appspec.yml file in your root folder to solve this error and to give the agent access to your after and before scripts.
The overall deployment failed because too many individual instances failed deployment, too few healthy instances are available for deployment, or some instances in your deployment group are experiencing problems.