How can I be alerted if a Fargate RunTask triggered by EventBridge fails - aws-fargate

We have very bursty load and use EventBridge to trigger tasks. Sometimes this fails silently. There is no failed invocations in the EventBridge rule. CloudTrail shows RunTask is executed. There is no corresponding CreateLogStream (or for that matter log in CloudWatch) so it seems like starting the task fails. I don't see any error anywhere.
In the AWS console, I see that the tasks has stopped reason "ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message".
Is there a way to detect this or automatically retry?

Related

Enter Key throws error when running WDIO on Browserstack

I'm trying to run some WDIO tests on Browserstack (#wdio/browserstack-service#7.24.0). Using chrome browser and Samsung Galaxy S20 capabilities. There is a point where I need to send Enter key, I'm using .keys('\uE007') and the action is performed on the device as expected, but there is an error coming from BS. ERROR webdriver: Request failed with status 404 due to Error: The URL '/wd/hub/session/0296501ef9a8127fa1f5663921d3b1e52463eaf1/actions' did not map to a valid resource
I tried to use try/catch to prevent that error marking my execution as failed, but try/catch does not stop it. My guess, it is because of async logic during they send key.
so, I need help to prevent that error or find a way to catch it properly.
any suggestion?

GitLab CI stuck at "Waiting Fargate task to be ready" - but Fargate task is in fact running, but never completes

Having set up GitLab CI and AWS Fargate resources as described in the documentation, we have a situation where the runner can trigger the Fargate task, which goes into RUNNING state, but the master runner never seems to realize this.
Running with gitlab-runner 14.7.0 (98daeee0)
on gitlab-fargate-master DyE5BsVA
Preparing the "custom" executor
INFO[2022-01-27T13:54:49Z] Starting fargate PID=1447 version="0.2.0 (933d940)"
INFO[2022-01-27T13:54:49Z] Executing the command PID=1447 command=config_exec
Using Custom executor with driver fargate 0.2.0 (933d940)...
INFO[2022-01-27T13:54:49Z] Starting fargate PID=1452 version="0.2.0 (933d940)"
INFO[2022-01-27T13:54:49Z] Executing the command PID=1452 command=prepare_exec
INFO[2022-01-27T13:54:56Z] Starting new Fargate task PID=1452 command=prepare_exec
INFO[2022-01-27T13:54:58Z] Persisting data that will be used by other commands PID=1452 command=prepare_exec taskARN="arn:aws:ecs:us-east-1:558517226390:task/gitlab-ci-cluster/ee488fa1d7d7475fab9be01d5bad180e"
INFO[2022-01-27T13:54:58Z] Waiting Fargate task to be ready PID=1452 command=prepare_exec taskARN="arn:aws:ecs:us-east-1:558517226390:task/gitlab-ci-cluster/ee488fa1d7d7475fab9be01d5bad180e"
Within AWS, the task has created its Log Stream in Cloudwatch, but there are no events in that log. It's unclear what is actually happening.
What can be done to find out?
We have reverted to using a vanilla Docker container from the GitLab documentation registry.gitlab.com/tmaczukin-test-projects/fargate-driver-debian:latest but exactly same happens.
Solved - problem was missing AWS permission ECS:DescribeTasks, which for some reason was not causing an error message in the Runner.
(I had mistakenly added AmazonEC2_FullAccess, not AmazonECS_FullAccess as described in the docs)
Having run a "Generate Policy" in AWS based on CloudTrail Events (awesome new feature!), I can now confirm the permissions actually being used are:
EC2: DescribeNetworkInterfaces.
ECS: StopTask, DescribeTasks, RunTask
Note the EC2 permission, which is missing from the docs.
Not sure if you have solved your problem but I noticed this question as I had the exact same issue yesterday. For me this was caused as my gitlab manager task was using an IAM role which was limited to start and stop tasks but it was apparently missing permissions to check weather a task is in the RUNNING state. So I fixed my ecs execution role and then it started working for me.

Google Cloud Platform - Catch and log errors and automatically terminate VM

I am running a workflow on a n1-ultramem-40 instance that will run for several days. If an error occurs, I would like to catch and log the error, be notified, and automatically terminate the Virtual Machine. Could I use StackDriver and gcloud logging to achieve this? How could I automatically terminate the VM using these tools? Thanks!
Let's break the puzzle into two parts. The first is logging an error to Stackdriver and the second is performing an external action automatically when such an error is detected.
Stackdriver provides a wide variety of language bindings and package integrations that result in log messages being written. You could include such API calls in your application which detects the error. If you don't have access to the source code of your application but it instead logs to an external file, you could use the Stackdriver agents to monitor log files and relay the log messages to Stackdriver.
Once you have the error messages being sent to Stackdriver, the next task would be defining a Stackdriver log export definition. This is the act of defining a "filter" that looks for the specific log entry message(s) that you are interested in acting upon. Associated with this export definition and filter would be a PubSub topic. A pubsub message would then be written to this topic when an Stackdriver log entry is made.
Finally, we now have our trigger to perform your action. We could use a Cloud Function triggered from a PubSub message to execute arbitrary API logic. This could be code that performs an API request to GCP to terminate the VM.

How to log recovery by redelivery in Apache-Camel error handler?

How can I find when a Camel route redelivery error handler successfully recovered one error case.
I would like to be able to get metrics around successful redelivery by a camel error handler retry.
I would like to know how many message exchange instances that happened to have a network error performing file transfer where successful recovered after retry.

Troubleshooting Web App process restarting

Our web app process is restarting regularly and we are unable to determine the reason.
When looking into Application Events (using the 'Diagnostics and solve problems' blade in the Azure Portal), there exists a bunch of the following Info logs by 'IIS AspNetCore Module'
Event ID 1005:
Failed to gracefully shutdown process '14040'.
Event ID 1001:
Application 'MACHINE/WEBROOT/APPHOST/myapplication__xxxx' started process '31628' successfully and is listening on port '17663'.
There is nothing fishy with general resource usage and nothing in our application logs.
What is the best way to troubleshoot the reason behind these process restarts?
EDIT 1:
After fiddling around with web logging in the Web App's Diagnostic Logs, I now get an error logged from W3SVC-WP after each restart, but the message is nonsense:
1<br/>5<br/>50000780
EDIT 2:
Event Id 2284 refers to this:
FailedRequestTracing module failed to write buffered events to log
file for the request that matched failure definition. No logs will be
generated until this condition is corrected. The problem happened at
least %1 times in the last %2 minutes. The data is the error.
I'm not sure if this could be related to our Diagnostic Logs configuration, but seems unlikely.
EDIT 3:
As per Brando Zhang's suggestion, I've used the Web App Crash Diagnoser extension and tried monitoring 2nd Chance Unhandled Exceptions on both my application process AND on w3wp, but nothing is dumped.
From how I understand it, 1st Chance Exceptions will not crash the process, so no need to monitor these.
Very likely application is crashing due to fatal exception and causing the restarts.
On Azure App Service platform.You can use the Diagnostics as a
Service (DaaS) to troubleshoot this
It can also do an analysis and tell you the root cause most of the time.More step by step infofrmation can be found on this msdn blog .Also refer tips for using crash diagnoser