Mapping EMR steps to YARN applications

I am aggregating EMR YARN application logs to S3 using the YARN configuration below:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation-enable": "true",
      "yarn.log-aggregation.retain-seconds": "-1",
      "yarn.nodemanager.remote-app-log-dir": "s3://mybucket/logs"
    }
  }
]
The logs are grouped by YARN application ID in S3:
s3://mybucket/logs/application_id_001/
s3://mybucket/logs/application_id_002/
s3://mybucket/logs/application_id_003/
I want to map EMR step IDs to YARN application IDs so that, given a step ID, I can fetch its logs.
I need this because I am using Apache Airflow for orchestration and would like to fetch the logs and show them in Airflow. My Airflow DAG looks like this:
create-cluster
-> add-step-1 -> watch-step-1
-> add-step-2 -> watch-step-2
-> add-step-3 -> watch-step-3
At the end of every watch-step-n task I would like to fetch the logs for that step from S3 and print them. Since none of the tasks in the DAG are aware of the YARN application ID, I am trying to find a way to get the application ID for a given step ID.
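For illustration, if I already knew the application ID for a step, the watch task could pull and print the aggregated files with something as simple as the AWS CLI (bucket and prefix as in the configuration above; the application ID here is exactly the piece I'm missing):
aws s3 ls s3://mybucket/logs/application_id_001/
aws s3 cp --recursive s3://mybucket/logs/application_id_001/ ./step-logs/
Depending on the log-aggregation file format, the downloaded files may not be plain text.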
EDIT
I couldn't find a way to map an EMR step ID to a YARN application ID. However, I was able to group the logs by cluster ID. For example:
s3://mybucket/logs/cluster-id-0/application_id_001/
s3://mybucket/logs/cluster-id-0/application_id_002/
s3://mybucket/logs/cluster-id-1/application_id_001/
I found a way to get the cluster ID from within the EMR nodes:
cat /mnt/var/lib/info/job-flow.json | jq -r '.jobFlowId'
I injected the cluster ID (as a property named JOB_FLOW_ID) into YARN_OPTS via the yarn-env configuration, and set yarn.nodemanager.remote-app-log-dir-suffix to reference it in the yarn-site configuration.
yarn-env configuration:
{
  "Classification": "yarn-env",
  "Properties": {},
  "Configurations": [
    {
      "Classification": "export",
      "Properties": {
        "JOB_FLOW_ID": "\"$(cat /mnt/var/lib/info/job-flow.json | jq -r '.jobFlowId')\"",
        "YARN_OPTS": "\"$YARN_OPTS -Djob_flow_id=$JOB_FLOW_ID\""
      },
      "Configurations": []
    }
  ]
}
yarn-site configuration:
{
  "Classification": "yarn-site",
  "Properties": {
    "yarn.log-aggregation.retain-seconds": "-1",
    "yarn.log-aggregation-enable": "true",
    "yarn.nodemanager.remote-app-log-dir": "s3://mybucket/logs",
    "yarn.nodemanager.remote-app-log-dir-suffix": "${job_flow_id}"
  }
}
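With this layout, a task that knows the cluster ID (which the downstream Airflow tasks already need in order to add steps) can at least enumerate the applications that ran on that cluster and pull their logs, along these lines (j-XXXXXXXXXXXXX is a placeholder for the actual job flow ID):
aws s3 ls s3://mybucket/logs/j-XXXXXXXXXXXXX/
aws s3 cp --recursive s3://mybucket/logs/j-XXXXXXXXXXXXX/ ./cluster-logs/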

Related

Amazon Cloudwatch only receiving mem_used_percent and nothing else, despite numerous other metrics specified in config

I am trying to get CloudWatch running properly on my Lightsail instance, which I appear to have achieved with only partial success.
I have run the wizard using sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard, which produced a config file outlining numerous metrics including CPU, memory and disk usage as outlined here. The service loads and starts the config file, and doesn't complain about invalid JSON (this did happen a few times, but I fixed it).
I can stop the service with sudo amazon-cloudwatch-agent-ctl -a stop
I then reload the config with sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -s -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
I then verify the service is running: sudo amazon-cloudwatch-agent-ctl -a status
Which outputs this:
{
  "status": "running",
  "starttime": "2022-01-10T21:53:12+00:00",
  "configstatus": "configured",
  "cwoc_status": "stopped",
  "cwoc_starttime": "",
  "cwoc_configstatus": "not configured",
  "version": "1.247349.0b251399"
}
Logging into my CloudWatch console, I can see the data being received, and the single line appearing on the graph there corresponds to the times that I started and stopped the service-- so it's definitely doing something. And yet... the only metric that appears on that graph is mem_used_percent... why? Why only this one metric? Where is the rest of my data pertaining to cpu, etc? What am I doing wrong?
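For reference, the set of metrics the agent is actually publishing can be listed from the CLI, assuming the default CWAgent namespace (I haven't overridden the namespace in my config):
aws cloudwatch list-metrics --namespace CWAgent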
Here is my config.json, which as I said, is being loaded by the service without issue.
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "metrics": {
    "append_dimensions": {
      "ImageID": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "cpu": {
        "resources": [
          "*"
        ],
        "measurement": [
          "cpu_usage_active"
        ],
        "metrics_collection_interval": 60,
        "totalcpu": false
      },
      "disk": {
        "measurement": [
          "free",
          "total",
          "used",
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_active",
          "mem_available",
          "mem_available_percent",
          "mem_free",
          "mem_total",
          "mem_used",
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      },
      "netstat": {
        "measurement": [
          "tcp_established",
          "udp_socket"
        ]
      }
    }
  }
}
Any help greatly appreciated here. TIA.
You likely haven't fetched the configuration yet.
Check the logfile, i.e. /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log, to see which inputs are loaded:
2022-05-18T10:18:57Z I! Loaded inputs: mem disk
To fetch the configuration, do as follows (you'll need to adapt this to your environment - this is for systemd, on-premise, without SSM):
sudo amazon-cloudwatch-agent-ctl -a fetch-config -m onPremise -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json
sudo systemctl restart amazon-cloudwatch-agent.service
After the restart:
2022-05-18T11:45:05Z I! Loaded inputs: mem net netstat swap cpu disk diskio
Maybe you are facing the same issue I did. In my case, two configuration JSON files
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json
/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
were being merged.
The files are then translated to /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml.
When I checked that file, only the mem definition from /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_config.json had been taken. So I deleted that file and restarted the service.
sudo systemctl restart amazon-cloudwatch-agent
After the restart, the toml file contained what I expected and the metrics were in place.
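If you want to verify what ended up in the translated file, grepping the toml for the input plugin sections works as a quick check (path as above; the exact section names may differ between agent versions):
grep -n inputs /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml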

How to use input transformer for ECS Fargate launch type with Terraform CloudWatch event trigger

I'm using Terraform to create a CloudWatch Event trigger for an ECS Fargate launch type where the event source is S3. When I use the input_transformer field to pass the bucket and key into the ECS task, my event rule results in a failed invocation.
This is the aws_cloudwatch_event_rule:
resource "aws_cloudwatch_event_rule" "event_rule" {
name = "dev-gnss-source-put-rule-tf"
description = "Capture S3 events on uploads bucket"
event_pattern = <<PATTERN
{
"source": [
"aws.s3"
],
"detail-type": [
"AWS API Call via CloudTrail"
],
"detail": {
"eventSource": [
"s3.amazonaws.com"
],
"eventName": [
"PutObject"
],
"requestParameters": {
"bucketName": [
"example-bucket-name"
]
}
}
}
PATTERN
}
This is the aws_cloudwatch_event_target:
resource "aws_cloudwatch_event_target" "event_target" {
target_id = "dev-gnss-upload-event-target-tf"
arn = "example-cluster-arn"
rule = aws_cloudwatch_event_rule.event_rule.name
role_arn = aws_iam_role.uploads_events.arn
ecs_target {
launch_type = "FARGATE"
task_count = 1 # Launch one container / event
task_definition_arn = "example-task-definition-arn"
network_configuration {
subnets = ["example-subnet"]
security_groups = []
}
}
input_transformer {
input_paths = {
s3_bucket = "$.detail.requestParameters.bucketName"
s3_key = "$.detail.requestParameters.key"
}
input_template = <<TEMPLATE
{
"containerOverrides": [
{
"name": "myproject-task",
"environment": [
{ "name": "S3_BUCKET", "value": <s3_bucket> },
{ "name": "S3_KEY", "value": <s3_key> }
]
}
]
}
TEMPLATE
}
}
If I remove the input_transformer section, it works fine, but I need to pass in the S3 bucket and key to process the particular file.
My rationale for doing this is to remove the need for an intermediary Lambda; I was guided by this Medium post: https://medium.com/@bowbaq/trigger-an-ecs-job-when-an-s3-upload-completes-3559c44c37d1
Any advice is appreciated.
After hours of going in circles, I found an answer!
So the first step is to check what the cause of the failed invocation is. You can do this by checking the CloudTrail logs: navigate to CloudTrail > Event history and type RunTask in the Event name search box. You should see a series of events from the event source ecs.amazonaws.com. Find the one that relates to the failed invocation you experienced.
When you click into the event, you can see under the Event record section an errorMessage. In my case, it was the following:
"errorCode": "InvalidParameterException",
"errorMessage": "Override for container named myproject-task is not a container in the TaskDefinition.",
This may be different for you. For me, it was because my containerOverride name was incorrect. This field is "The name of the container that receives the override. This parameter is required if any override is specified." (ref: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerOverride.html)
Correcting this field fixed my issue.
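If you prefer the CLI over the console, roughly the same CloudTrail lookup can be done like this (same RunTask event name; adjust --max-results as needed):
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=RunTask --max-results 10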

Fargate environment variable redis.yaml

I have a microservice and I need to pass in a file, redis.yaml, to configure ElastiCache for Redis.
Assume I have a file called redis.yaml with contents:
clusterServersConfig:
  idleConnectionTimeout: 10000
  pingTimeout: 1000
  connectTimeout: 10000
  timeout: 60000
  retryAttempts: 3
  retryInterval: 60000
And in my application.properties I use:
redis.config.location=file:/opt/usr/conf/redis.yaml
In Kubernetes, I can just create a secret with --from-file redis.yaml and the application runs properly.
I do not know how to do the same with AWS Fargate. I believe it could be done with AWS SSM but any help/steps on how to do it would be appreciated.
For externalized configuration, Fargate supports environment variables, which can be passed in the task definition.
"environment": [
{ "name": "env_name1", "value": "value1" },
{ "name": "env_name2", "value": "value2" }
]
If it's sensitive information, store it in the AWS SSM Parameter Store (you can use KMS) and specify the parameter in the task definition.
{
  "containerDefinitions": [{
    "secrets": [{
      "name": "environment_variable_name",
      "valueFrom": "arn:aws:ssm:region:aws_account_id:parameter/parameter_name"
    }]
  }]
}
In your case, you can convert your YAML to JSON, store it in the Parameter Store, and reference it in the task definition.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/specifying-sensitive-data.html
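As a rough sketch, pushing the converted file into the Parameter Store could look like this (the parameter name and redis.json file name are placeholders; file:// tells the CLI to read the value from the file):
aws ssm put-parameter --name /myservice/redis-config --type SecureString --value file://redis.json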

Can I have multiple outputs in an OpenShift Origin build?

I'm building several base images for our infrastructure and would like to mimic the Docker Hub nomenclature for image tags. For example, the Java image on Docker Hub includes several aliases for the same image, e.g. 8 and latest are the same image.
If I were to replicate this system in ImageStreams, I would need to create a BuildConfig with an output specification like this:
"output": {
"to": {
"kind": "ImageStreamTag"
"name": "jdk:8"
}
}
Obviously, this only includes one tag, so even if I were to write
"output": {
"to": {
"kind": "ImageStreamTag"
"name": "jdk:8"
},
"to": {
"kind": "ImageStreamTag"
"name": "jdk:latest"
}
}
only the latest definition would actually be executed.
Is there any proper way to push the same image into different tags apart from creating a different BuildConfig (which would probably "build" from Docker image to Docker image)?
There is a card on the Trello board to do this: https://trello.com/c/nOX8FTRq/686-5-support-multiple-tags-for-a-build-output.
You should also be able to do this using oc tag to avoid having to run the same build twice.
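Using the image stream from your example, that would look something like:
oc tag jdk:8 jdk:latest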

AWS data pipeline activity with multiple inputs

As part of an AWS Data Pipeline, I have a Hive activity that uses two unstaged S3 data nodes as input. What I want is to be able to set two script variables on the activity, each pointing to an input data node, but I can't get the syntax right. With a single input, I could write the following and it would work just fine:
INPUT_FOO=#{input.directoryPath}
When I add the second input, I run into a problem of how to reference them since they are now an array of inputs, as you can see in the pipeline definition below. Essentially, I want to achieve the following, but can't figure out the correct syntax:
INPUT_FOO=#{input[1].directoryPath}
INPUT_BAR=#{input[2].directoryPath}
Here's the activity portion of the pipeline definition:
{
  "id": "ActivityId_7u1sR",
  "input": [
    {
      "ref": "DataNodeId_iYnxf"
    },
    {
      "ref": "DataNodeId_162Ka"
    }
  ],
  "schedule": {
    "ref": "DefaultSchedule"
  },
  "scriptUri": "#{myS3ScriptLocation}calculate-results.q",
  "name": "Perform Calculations",
  "runsOn": {
    "ref": "EmrClusterId_jHeiV"
  },
  "scriptVariable": [
    "INPUT_SOURCE1=#{input[1].directoryPath}",
    "OUTPUT=#{output.directoryPath}Results/",
    "INPUT_SOURCE2=#{input[2].directoryPath}"
  ],
  "output": {
    "ref": "DataNodeId_2jY6v"
  },
  "type": "HiveActivity",
  "stage": "false"
}
I plan to keep the tables unstaged and take care of table creation in the hive script so that it's easier to run each Hive activity in isolation as well as in the pipeline itself.
Here's the error I see when using array syntax:
Unable to resolve input[1].directoryPath for object ActivityId_7u1sR'
As it stands now, this scenario is not supported, but a feature request was added to support it in the future.