How to monitor whether an EMR cluster is terminated using CloudWatch - amazon-emr

I want to set an alarm for when any EMR cluster is terminated (caused by internal errors). I know there is an "IsIdle" metric, but my EMR clusters are designed to be persistent, so "IsIdle" does not really fit my case. Is there a health-check metric that I can use?

You can configure Amazon CloudWatch to send a "State Change" event to another service such as an AWS Lambda function or an Amazon SNS topic.
To achieve this, open the CloudWatch console and, in the navigation pane, click Rules > Create rule, then set:
Service Name: EMR
Event Type: State Change
Specific detail type(s): EMR Cluster State Change
Specific State: TERMINATED and TERMINATED_WITH_ERRORS
Targets: the receiving service of your choice.
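If you prefer to define the rule as a JSON event pattern (for the CLI or infrastructure-as-code), a minimal sketch of the equivalent pattern would look like this (the state names match the console options above):
{
  "source": ["aws.emr"],
  "detail-type": ["EMR Cluster State Change"],
  "detail": {
    "state": ["TERMINATED", "TERMINATED_WITH_ERRORS"]
  }
}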
Here's an example of such an event:
{
  "version": "0",
  "id": "8535abb0-f87e-4640-b7b6-8de000dfc30a",
  "detail-type": "EMR Cluster State Change",
  "source": "aws.emr",
  "account": "123456789012",
  "time": "2016-12-16T21:00:23Z",
  "region": "us-east-1",
  "resources": [],
  "detail": {
    "severity": "INFO",
    "stateChangeReason": "{\"code\":\"USER_REQUEST\",\"message\":\"Terminated by user request\"}",
    "name": "Development Cluster",
    "clusterId": "j-1YONHTCP3YZKC",
    "state": "TERMINATED",
    "message": "Amazon EMR Cluster j-1YONHTCP3YZKC (Development Cluster) has terminated at 2016-12-16 21:00 UTC with a reason of USER_REQUEST."
  }
}

Related

Make CloudWatch Event pass an integer timestamp instead of a UTC time string

I'm invoking a scheduled Step Function with a CloudWatch event. The input of the first batch job in the Step Functions state machine looks like the following:
{
  "version": "0",
  "id": "sdjlafgdf05-7c32-435hf-aa3b5a8sfade815",
  "detail-type": "Scheduled Event",
  "source": "aws.events",
  "account": "xxxxxxxx",
  "time": "2022-01-14T19:46:49Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:events:us-east-1:xxxxxxxxxxxx:rule/adfnwelkqnlngqrej-SAFFJKHF734"
  ],
  "detail": {}
}
I want the "time" field can give me the integer. Specifically, instead of "2022-01-14T19:46:49Z", I want "1642189609" (epoch in seconds), so I don't need to parse it in my batch job code. I'm using CDK to build the infrastructure. Is there any way of doing this?
At the moment this is not possible directly with native Step Functions "Intrinsic Functions". So you have two options:
Do it via a Lambda as explained in this other answer (a sketch follows below)
Pass it first to EventBridge and then create a rule for EventBridge to forward it to CloudWatch
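For the first option, here is a minimal sketch of such a Lambda, assuming a Python runtime (the handler name and the pass-through of the whole event are my own illustration, not taken from the linked answer):
import datetime

def handler(event, context):
    # Parse the ISO-8601 "time" field set by the scheduled event.
    parsed = datetime.datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    # Replace it with epoch seconds so downstream states receive an integer.
    event["time"] = int(parsed.replace(tzinfo=datetime.timezone.utc).timestamp())
    return event
You would put this Lambda as the first state of the state machine, so the batch job receives "time" already converted (e.g. "2022-01-14T19:46:49Z" becomes 1642189609).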

How to save queries executed by Athena in a CloudWatch log group

I want to save the queries executed by Athena in a log group of the CloudWatch service.
In CloudWatch, I created this rule:
{
  "source": [
    "aws.athena"
  ],
  "detail-type": [
    "Athena Query State Change"
  ],
  "detail": {
    "currentState": [
      "QUEUED",
      "RUNNING",
      "SUCCEEDED",
      "FAILED",
      "CANCELLED"
    ]
  }
}
And I attached the rule to a CloudWatch log group as its target (screenshot of the log group omitted).
I managed to register logs in CloudWatch -> Log groups -> /aws/events/TestAthena, but I don't have the information I want:
{
  "version": "0",
  "id": "a8bad43b-1b9a-da7e-c004-f3c920e1bddd",
  "detail-type": "Athena Query State Change",
  "source": "aws.athena",
  "account": "<account_id>",
  "time": "2021-08-23T15:54:13Z",
  "region": "eu-west-3",
  "resources": [],
  "detail": {
    "currentState": "RUNNING",
    "previousState": "QUEUED",
    "queryExecutionId": "b0fe7373-676d-43d5-b866-19d701c9dc56",
    "sequenceNumber": "2",
    "statementType": "DML",
    "versionId": "0",
    "workgroupName": "dev-Connect-CardBulk"
  }
}
I would like to have:
The query executed
The time the query was executed
The user who executed the query
Is it possible to get this with CloudWatch?
Thank you in advance for your help.
Out of the box, you can get metrics such as QueryPlanningTime and QueryQueueTime.
Nonetheless, you need CloudTrail to track who executed a query; see the sketch after the links below.
Refer to these links:
List of CloudWatch Metrics and Dimensions for Athena
Monitoring Amazon Athena Queries using Amazon CloudWatch
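If you want the query text and the caller identity in the event itself, a possible approach (my sketch, not from the linked docs) is to match the CloudTrail record of the Athena StartQueryExecution API call instead of the state-change event, since the CloudTrail record carries requestParameters.queryString and userIdentity:
{
  "source": ["aws.athena"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["athena.amazonaws.com"],
    "eventName": ["StartQueryExecution"]
  }
}
This requires a CloudTrail trail that records Athena management events in the region.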

Terraform data source of an existing S3 bucket fails the plan stage attempting a GetBucketWebsite request, which returns NoSuchWebsiteConfiguration

I'm trying to use a data source of an existing S3 bucket like this:
data "aws_s3_bucket" "src-config-bucket" {
bucket = "single-word-name" }
And Terraform always fails the plan stage with the message:
Error: UnauthorizedOperation: You are not authorized to perform this operation.
status code: 403, request id: XXXXX
The failing requests show up in CloudTrail with the following info:
{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "ANONYMIZED",
    "arn": "arn:aws:iam::1234567890:user/terraformops",
    "accountId": "123456789012",
    "accessKeyId": "XXXXXXXXXXXXXXXXXX",
    "userName": "terraformops"
  },
  "eventTime": "2021-02-02T18:12:19Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "GetBucketWebsite",
  "awsRegion": "eu-west-1",
  "sourceIPAddress": "X.Y.Z.W",
  "userAgent": "[aws-sdk-go/1.36.28 (go1.15.5; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.14.4 (+https://www.terraform.io)]",
  "errorCode": "NoSuchWebsiteConfiguration",
  "errorMessage": "The specified bucket does not have a website configuration",
  "requestParameters": {
    "bucketName": "s3-bucket-name",
    "website": "",
    "Host": "s3-bucket-name.s3.eu-west-1.amazonaws.com"
  }
}
Why can't I use an existing S3 bucket as a data source within Terraform? I don't treat it as a website anywhere in the Terraform project, so I don't know why it issues the GetBucketWebsite call and fails. Hope someone can help.
Thanks.
I don't know why it asks the server the GetBucketWebsite call and fail.
It calls GetBucketWebsite because the aws_s3_bucket data source returns this information through its website_endpoint and website_domain attributes.
So you need permission to call this action on the bucket. The error message suggests that the IAM user/role you use to query the bucket does not have all the permissions needed to read this information; a sketch of a matching policy statement follows.
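A minimal sketch of a policy statement that would allow the call, assuming the bucket name from the question (the aws_s3_bucket data source reads several bucket attributes, so broader read permissions such as s3:GetBucket* may be needed in practice):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetBucketWebsite",
      "Resource": "arn:aws:s3:::single-word-name"
    }
  ]
}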

CloudWatch event for out-of-region creation

I am trying to create an auto-remediation process that will stop/delete any VPC, CloudFormation stack, Lambda function, internet gateway, or EC2 instance created outside of the eu-central-1 region. My first step is to configure a CloudWatch event rule to detect any of the previously mentioned events.
{
  "source": [
    "aws.cloudtrail"
  ],
  "detail-type": [
    "AWS API Call via CloudTrail"
  ],
  "detail": {
    "eventSource": [
      "ec2.amazonaws.com",
      "cloudformation.amazonaws.com",
      "lambda.amazonaws.com"
    ],
    "eventName": [
      "CreateStack",
      "CreateVpc",
      "CreateFunction20150331",
      "CreateInternetGateway",
      "RunInstances"
    ],
    "awsRegion": [
      "us-east-1",
      "us-east-2",
      "us-west-1",
      "us-west-2",
      "ap-northeast-1",
      "ap-northeast-2",
      "ap-south-1",
      "ap-southeast-1",
      "ap-southeast-2",
      "ca-central-1",
      "eu-west-1",
      "eu-west-2",
      "eu-west-3",
      "sa-east-1"
    ]
  }
}
For now, the event should only trigger an SNS topic that will send me an email, but in the future a Lambda function will do the remediation.
Unfortunately, when I create an internet gateway in another region (let's say eu-west-1), no notification occurs. The event does not appear if I want to set an alarm on it either, while it does appear in CloudWatch Events.
Any idea what could be wrong with my event config?
OK, I figured it out. The source of the event is the originating service itself, even though the notification comes via CloudTrail. The "source" parameter should therefore be:
"source": [
  "aws.cloudtrail",
  "aws.ec2",
  "aws.cloudformation",
  "aws.lambda"
]
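For reference, a sketch of the full corrected pattern, merging the fixed "source" list into the rule from the question (eventSource, eventName, and awsRegion unchanged):
{
  "source": ["aws.cloudtrail", "aws.ec2", "aws.cloudformation", "aws.lambda"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["ec2.amazonaws.com", "cloudformation.amazonaws.com", "lambda.amazonaws.com"],
    "eventName": ["CreateStack", "CreateVpc", "CreateFunction20150331", "CreateInternetGateway", "RunInstances"],
    "awsRegion": ["us-east-1", "us-east-2", "us-west-1", "us-west-2",
                  "ap-northeast-1", "ap-northeast-2", "ap-south-1",
                  "ap-southeast-1", "ap-southeast-2", "ca-central-1",
                  "eu-west-1", "eu-west-2", "eu-west-3", "sa-east-1"]
  }
}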

My AKS Cluster was brought down, how can I recover?

I have been playing around with load-testing my application on a single agent cluster in AKS. During the testing, the connection to the dashboard stalled and never resumed. My application seems down as well, so I am assuming the cluster is in a bad state.
The API server is restate-f4cbd3d9.hcp.centralus.azmk8s.io
kubectl cluster-info dump shows the following error:
{
  "name": "kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
  "namespace": "kube-system",
  "selfLink": "/api/v1/namespaces/kube-system/events/kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
  "uid": "47f57d3c-d577-11e7-88d4-0a58ac1f0249",
  "resourceVersion": "185572",
  "creationTimestamp": "2017-11-30T02:36:34Z",
  "InvolvedObject": {
    "Kind": "Pod",
    "Namespace": "kube-system",
    "Name": "kube-dns-v20-6c8f7f988b-9wpx9",
    "UID": "9d2b20f2-d3f5-11e7-88d4-0a58ac1f0249",
    "APIVersion": "v1",
    "ResourceVersion": "299",
    "FieldPath": "spec.containers{kubedns}"
  },
  "Reason": "Unhealthy",
  "Message": "Liveness probe failed: Get http://10.244.0.4:8080/healthz-kubedns: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
  "Source": {
    "Component": "kubelet",
    "Host": "aks-agentpool-34912234-0"
  },
  "FirstTimestamp": "2017-11-30T02:23:50Z",
  "LastTimestamp": "2017-11-30T02:59:00Z",
  "Count": 6,
  "Type": "Warning"
}
There are also some pod sync errors in kube-system.
Example of the issue:
az aks browse -g REstate.Server -n REstate
Merged "REstate" as current context in C:\Users\User\AppData\Local\Temp\tmp29d0conq
Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out
You'll probably need to SSH to the node to see if the kubelet service is running. For the future, you can set resource quotas to keep workloads from exhausting all resources on the cluster nodes; a sketch follows the link.
Resource Quotas - https://kubernetes.io/docs/concepts/policy/resource-quotas/
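A minimal sketch of such a quota (the name, namespace, and limits are illustrative; a JSON manifest is used here, which kubectl accepts just like YAML):
{
  "apiVersion": "v1",
  "kind": "ResourceQuota",
  "metadata": {
    "name": "compute-quota",
    "namespace": "default"
  },
  "spec": {
    "hard": {
      "requests.cpu": "1",
      "requests.memory": "1Gi",
      "limits.cpu": "2",
      "limits.memory": "2Gi"
    }
  }
}
Apply it with kubectl apply -f quota.json.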