Can't get Apache Airflow to write to S3 using EMR Operators

I am using the Airflow EMR operators to create an AWS EMR cluster that runs a JAR file stored in S3 and then writes the output back to S3. The job runs fine using the JAR file from S3, but I cannot get it to write the output to S3. It does write the output to S3 when I run it as an AWS EMR CLI Bash command, but I need to do it through the Airflow EMR operators. I have the S3 output directory set both in the Airflow step config and in the environment config in the JAR file, and I still cannot get the operators to write to it.
Here is the code I have for my Airflow DAG:
from datetime import datetime, timedelta
import airflow
from airflow import DAG
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.s3_file_transform_operator import S3FileTransformOperator
DEFAULT_ARGS = {
    'owner': 'AIRFLOW_USER',
    'depends_on_past': False,
    'start_date': datetime(2019, 9, 9),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False
}
RUN_STEPS = [
    {
        "Name": "run-custom-create-emr",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster", "--master", "yarn", "--conf",
                "spark.yarn.submit.waitAppCompletion=false", "--class", "CLASSPATH",
                "s3://INPUT_JAR_FILE",
                "s3://OUTPUT_DIR"
            ]
        }
    }
]
JOB_FLOW_OVERRIDES = {
    "Name": "JOB_NAME",
    "LogUri": "s3://LOG_DIR/",
    "ReleaseLabel": "emr-5.23.0",
    "Instances": {
        "Ec2KeyName": "KP_USER_NAME",
        "Ec2SubnetId": "SUBNET",
        "EmrManagedMasterSecurityGroup": "SG-ID",
        "EmrManagedSlaveSecurityGroup": "SG-ID",
        "InstanceGroups": [
            {
                "Name": "Master nodes",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m4.large",
                "InstanceCount": 1
            },
            {
                "Name": "Slave nodes",
                "Market": "ON_DEMAND",
                "InstanceRole": "CORE",
                "InstanceType": "m4.large",
                "InstanceCount": 1
            }
        ],
        "TerminationProtected": True,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "Applications": [
        {"Name": "Spark"},
        {"Name": "Ganglia"},
        {"Name": "Hadoop"},
        {"Name": "Hive"}
    ],
    "JobFlowRole": "ROLE_NAME",
    "ServiceRole": "ROLE_NAME",
    "ScaleDownBehavior": "TERMINATE_AT_TASK_COMPLETION",
    "EbsRootVolumeSize": 10,
    "Tags": [
        {
            "Key": "Country",
            "Value": "us"
        },
        {
            "Key": "Environment",
            "Value": "dev"
        }
    ]
}
dag = DAG(
    'AWS-EMR-JOB',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=2),
    schedule_interval=None
)

cluster_creator = EmrCreateJobFlowOperator(
    task_id='create_job_flow',
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id='aws_default',
    emr_conn_id='emr_connection_CustomCreate',
    dag=dag
)

step_adder = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=RUN_STEPS,
    dag=dag
)

step_checker = EmrStepSensor(
    task_id='watch_step',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull('add_steps', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag
)

cluster_remover = EmrTerminateJobFlowOperator(
    task_id='remove_cluster',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag
)

cluster_creator.set_downstream(step_adder)
step_adder.set_downstream(step_checker)
step_checker.set_downstream(cluster_remover)
Does anyone have any ideas on how I can solve this problem? Any help would be appreciated.

I believe that I just solved my problem. After really digging into the local Airflow logs and the EMR logs in S3, I found a Hadoop memory exception, so I increased the number of cores the EMR cluster runs on and it seems to work now.
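For anyone who hits the same thing, the relevant change is in the Instances section of JOB_FLOW_OVERRIDES. The instance type and count below are only illustrative, not the exact values I used; scaling up the core group looks roughly like this:
# Illustrative only: give the CORE group more cores and memory to work with.
# The original DAG used a single m4.large core node; the type and count below
# are example values, not a recommendation.
JOB_FLOW_OVERRIDES["Instances"]["InstanceGroups"] = [
    {
        "Name": "Master nodes",
        "Market": "ON_DEMAND",
        "InstanceRole": "MASTER",
        "InstanceType": "m4.large",
        "InstanceCount": 1
    },
    {
        "Name": "Slave nodes",
        "Market": "ON_DEMAND",
        "InstanceRole": "CORE",
        "InstanceType": "m4.xlarge",  # larger instance -> more memory per node
        "InstanceCount": 2            # more core nodes -> more total cores
    }
]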

Related

Set Subnet ID and EC2 Key Name in EMR Cluster Config via Step Functions

As of November 2019, AWS Step Functions has native support for orchestrating EMR clusters, so we are trying to configure a cluster and run some jobs on it.
We could not find any documentation on how to set the SubnetId or the key name used for the EC2 instances in the cluster. Is this possible?
As of now, our create-cluster step looks as follows:
"States": {
"Create an EMR cluster": {
"Type": "Task",
"Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
"Parameters": {
"Name": "TestCluster",
"VisibleToAllUsers": true,
"ReleaseLabel": "emr-5.26.0",
"Applications": [
{ "Name": "spark" }
],
"ServiceRole": "SomeRole",
"JobFlowRole": "SomeInstanceProfile",
"LogUri": "s3://some-logs-bucket/logs",
"Instances": {
"KeepJobFlowAliveWhenNoSteps": true,
"InstanceFleets": [
{
"Name": "MasterFleet",
"InstanceFleetType": "MASTER",
"TargetOnDemandCapacity": 1,
"InstanceTypeConfigs": [
{
"InstanceType": "m3.2xlarge"
}
]
},
{
"Name": "CoreFleet",
"InstanceFleetType": "CORE",
"TargetSpotCapacity": 2,
"InstanceTypeConfigs": [
{
"InstanceType": "m3.2xlarge",
"BidPriceAsPercentageOfOnDemandPrice": 100 }
]
}
]
}
},
"ResultPath": "$.cluster",
"End": "true"
}
}
As soon as we try to add a "SubnetId" key to any of the sub-objects in Parameters, or to Parameters itself, we get the error:
Invalid State Machine Definition: 'SCHEMA_VALIDATION_FAILED: The field "SubnetId" is not supported by Step Functions at /States/Create an EMR cluster/Parameters' (Service: AWSStepFunctions; Status Code: 400; Error Code: InvalidDefinition;
Referring to the Step Functions docs on the EMR integration, we can see that createCluster.sync uses the EMR API RunJobFlow. In RunJobFlow we can specify Ec2KeyName and Ec2SubnetId at the paths $.Instances.Ec2KeyName and $.Instances.Ec2SubnetId.
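For comparison, here is a minimal boto3 sketch (all values are placeholders) showing that RunJobFlow expects both fields directly under Instances, which is exactly where they go in the state machine's Parameters:
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Placeholder values throughout; the point is the position of Ec2KeyName and
# Ec2SubnetId inside the Instances structure of RunJobFlow.
response = emr.run_job_flow(
    Name="TestCluster",
    ReleaseLabel="emr-5.26.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="SomeRole",
    JobFlowRole="SomeInstanceProfile",
    LogUri="s3://some-logs-bucket/logs",
    Instances={
        "Ec2KeyName": "ENTER_EC2KEYNAME_HERE",
        "Ec2SubnetId": "ENTER_EC2SUBNETID_HERE",
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {
                "Name": "Master",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m3.2xlarge",
                "InstanceCount": 1,
            }
        ],
    },
)
print(response["JobFlowId"])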
With that said, I managed to create a state machine with the following definition (on a side note, your definition had a syntax error: "End": "true" should be "End": true):
{
"Comment": "A Hello World example of the Amazon States Language using Pass states",
"StartAt": "Create an EMR cluster",
"States": {
"Create an EMR cluster": {
"Type": "Task",
"Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
"Parameters": {
"Name": "TestCluster",
"VisibleToAllUsers": true,
"ReleaseLabel": "emr-5.26.0",
"Applications": [
{
"Name": "spark"
}
],
"ServiceRole": "SomeRole",
"JobFlowRole": "SomeInstanceProfile",
"LogUri": "s3://some-logs-bucket/logs",
"Instances": {
"Ec2KeyName": "ENTER_EC2KEYNAME_HERE",
"Ec2SubnetId": "ENTER_EC2SUBNETID_HERE",
"KeepJobFlowAliveWhenNoSteps": true,
"InstanceFleets": [
{
"Name": "MasterFleet",
"InstanceFleetType": "MASTER",
"TargetOnDemandCapacity": 1,
"InstanceTypeConfigs": [
{
"InstanceType": "m3.2xlarge"
}
]
},
{
"Name": "CoreFleet",
"InstanceFleetType": "CORE",
"TargetSpotCapacity": 2,
"InstanceTypeConfigs": [
{
"InstanceType": "m3.2xlarge",
"BidPriceAsPercentageOfOnDemandPrice": 100
}
]
}
]
}
},
"ResultPath": "$.cluster",
"End": true
}
}
}

AWS CodeBuild - cached DOWNLOAD_SOURCE taking too long

I am trying to improve the build speed of one of my projects on CodeBuild. The project uses a GitHub source provider, and source caching of type "local" is enabled.
The first time I ran the build, it took 103 seconds. I ran it again immediately after the first one finished, expecting it to complete in a few seconds thanks to source caching, but it took 60 seconds.
What am I missing here? Is the cache not working? If it is working, why does the second run take that long?
Thanks
Project Details:
{
"projectsNotFound": [],
"projects": [
{
"environment": {
"computeType": "BUILD_GENERAL1_LARGE",
"imagePullCredentialsType": "SERVICE_ROLE",
"privilegedMode": true,
"image": "111669150171.dkr.ecr.us-east-1.amazonaws.com/***********/ep-build-env:latest",
"environmentVariables": [
{
"type": "PLAINTEXT",
"name": "NEXUS_URI",
"value": "http://***************"
},
{
"type": "PLAINTEXT",
"name": "REGISTRY",
"value": "111669150171.dkr.ecr.us-east-1.amazonaws.com/*********"
}
],
"type": "LINUX_CONTAINER"
},
"timeoutInMinutes": 60,
"name": "StorefrontApi",
"serviceRole": "arn:aws:iam::111669150171:role/CodeBuild-ECRReadOnly",
"tags": [],
"artifacts": {
"type": "NO_ARTIFACTS"
},
"lastModified": 1571227097.581,
"cache": {
"type": "LOCAL",
"modes": [
"LOCAL_DOCKER_LAYER_CACHE",
"LOCAL_SOURCE_CACHE",
"LOCAL_CUSTOM_CACHE"
]
},
"vpcConfig": {
"subnets": [
"subnet-fd7f958b"
],
"vpcId": "vpc-71e3f414",
"securityGroupIds": [
"sg-19b65e6c",
"sg-9e28e9f9"
]
},
"created": 1571082681.262,
"sourceVersion": "refs/heads/ep-mysql",
"source": {
"buildspec": "version: 0.2\n\nphases:\n build:\n commands:\n - env\n - cd extensions\n - mvn --settings $CODEBUILD_SRC_DIR_DEVOPS_WINE/pipelines/storefront/build-war/settings.xml --projects storefront/ext-storefront-webapp -am -DskipAllTests clean install\n\nartifacts:\n secondary-artifacts:\n storefront-war:\n base-directory: $CODEBUILD_SRC_DIR/extensions/storefront/ext-storefront-webapp/target\n files:\n - \"*.war\"\n\ncache:\n paths:\n - '/root/.m2/**/*'\n - '/root/.npm/**/*'",
"insecureSsl": false,
"gitSubmodulesConfig": {
"fetchSubmodules": false
},
"location": "https://github.com/*****************.git",
"gitCloneDepth": 1,
"type": "GITHUB",
"reportBuildStatus": false
},
"badge": {
"badgeEnabled": false
},
"queuedTimeoutInMinutes": 480,
"secondaryArtifacts": [],
"logsConfig": {
"s3Logs": {
"status": "DISABLED",
"encryptionDisabled": false
},
"cloudWatchLogs": {
"status": "ENABLED"
}
},
"secondarySources": [
{
"insecureSsl": false,
"gitSubmodulesConfig": {
"fetchSubmodules": false
},
"location": "https://github.com/*****************.git",
"sourceIdentifier": "DEVOPS_WINE",
"gitCloneDepth": 1,
"type": "GITHUB",
"reportBuildStatus": false
}
],
"encryptionKey": "arn:aws:kms:us-east-1:111669150171:alias/aws/s3",
"arn": "arn:aws:codebuild:us-east-1:111669150171:project/StorefrontApi",
"secondarySourceVersions": [
{
"sourceVersion": "refs/heads/staging",
"sourceIdentifier": "DEVOPS_WINE"
}
]
}
]
}
Apparently, at the time of writing, CodeBuild does not use the native git client to fetch the source from GitHub. I understand that the CodeBuild internal teams have an internal feature request to move from whatever they're using to the native git client to improve performance.
Does your repository, by chance, have lots of large files in its history? The sketch below shows one way to analyze the repository.
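This is a rough Python wrapper (run from inside a clone of the repository) around the usual git plumbing recipe for listing the largest blobs in history:
import subprocess

# List every object reachable from any ref, then ask git for each object's
# type and size; keep blobs and print the 20 largest with their paths.
rev_list = subprocess.run(
    ["git", "rev-list", "--objects", "--all"],
    capture_output=True, text=True, check=True,
).stdout

batch_check = subprocess.run(
    ["git", "cat-file",
     "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
    input=rev_list, capture_output=True, text=True, check=True,
).stdout

blobs = []
for line in batch_check.splitlines():
    parts = line.split(maxsplit=3)
    if parts and parts[0] == "blob":
        size = int(parts[2])
        path = parts[3] if len(parts) > 3 else ""
        blobs.append((size, parts[1], path))

for size, sha, path in sorted(blobs, reverse=True)[:20]:
    print(f"{size / 1_000_000:8.1f} MB  {sha[:10]}  {path}")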
If you have lots of large files in your history and you're able to remove them, you can then use a tool like BFG Repo Cleaner to rewrite history. That should speed up the DOWNLOAD_SOURCE phase.
Also, if you have a dedicated support plan with AWS, you should reach out to your TAM to upvote the feature request to move to native git for GitHub source downloads.

Terraform cannot read from remote state

Terraform version
v0.12.1
AWS provider version
v2.16.0
I have a Terraform workspace configured. As of now my workspace points to dev, where I have one tfstate file for my VPCs and subnets and a different one for my security groups. However, when I try to reference vpc_id from my VPC remote tfstate in my security group, I get the error message below:
No stored state was found for the given workspace in the given backend.
My S3 bucket looks like this:
nonprod-us-east-1
|-- env
    |-- dev
        |-- vpc_subnet/tfstate
        |-- security_group/tfstate
Terraform Configuration Files
Security-Group tf config
terraform {
backend "s3"{
# Configuration will be injected by environment variables.
}
}
provider "aws" {
region = "${var.region}"
}
data "terraform_remote_state" "vpc_subnet" {
backend = "s3"
config = {
bucket = "nonprod-us-east-1"
key = "vpc_subnet/tfstate"
region = "us-east-1"
}
}
vpc_id = "${data.terraform_remote_state.vpc_subnet.outputs.vpc_id}"
And I've verified that my vpc_subnet/tfstate outputs include vpc_id.
Outputs from the VPC subnet tfstate:
outputs": {
"private_subnet_cidr_blocks": {
"value": [
"10.0.3.0/24",
"10.0.4.0/24",
"10.0.5.0/24"
],
"type": [
"tuple",
[
"string",
"string",
"string"
]
]
},
"private_subnet_ids": {
"value": [
"subnet-042a16dd291e90add",
"subnet-02e8322d996968a3f",
"subnet-078f525c24015b364"
],
"type": [
"tuple",
[
"string",
"string",
"string"
]
]
},
"public_subnet_cidr_blocks": {
"value": [
"10.0.0.0/24",
"10.0.1.0/24",
"10.0.2.0/24"
],
"type": [
"tuple",
[
"string",
"string",
"string"
]
]
},
"public_subnet_ids": {
"value": [
"subnet-0ba92a28f6e8ddd95",
"subnet-08efcb80bed22f4e2",
"subnet-0b641797bfe207a0b"
],
"type": [
"tuple",
[
"string",
"string",
"string"
]
]
},
"vpc_id": {
"value": "vpc-0bb7595ff05fed581",
"type": "string"
}
}
Expected Behavior
It should be able to read vpc_id from remote tf state location.
Actual Behavior
Failing to read output from remote tf state
Finally sorted it out: it turned out to be an issue with the bucket key. Since I'm using Terraform workspaces, the tfstate files are created under env:/dev/vpc_subnet/tfstate. After correcting the bucket key, Terraform is able to resolve the tfstate files.
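For anyone else hitting this, a quick sanity check is to list what is actually in the state bucket; with the S3 backend, non-default workspaces store state under an env:/<workspace>/ prefix. A minimal boto3 sketch (bucket name taken from the question above):
import boto3

# List the state objects under the workspace prefix to confirm the real key,
# e.g. "env:/dev/vpc_subnet/tfstate" rather than just "vpc_subnet/tfstate".
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="nonprod-us-east-1", Prefix="env:/dev/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])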

How to troubleshoot logging to cloudwatch from docker in an ECS container?

I'm tearing my hair out over this.
Here's my setup:
An ECS cluster on an EC2 instance that contains an
ECS task definition that runs a
Docker container with a Python script that logs to stderr
I see a CloudWatch log group for the ECS task get created, but nothing I print to stderr appears.
My ECS container instance has the default ecsInstanceRole.
The ecsInstanceRole has the AmazonEC2ContainerServiceforEC2Role policy which is as follows:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecs:CreateCluster",
"ecs:DeregisterContainerInstance",
"ecs:DiscoverPollEndpoint",
"ecs:Poll",
"ecs:RegisterContainerInstance",
"ecs:StartTelemetrySession",
"ecs:UpdateContainerInstancesState",
"ecs:Submit*",
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}
Task definition
{
"containerDefinitions": [
{
"dnsSearchDomains": [],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "torres",
"awslogs-region": "us-west-2",
"awslogs-stream-prefix": "torres"
}
},
"command": [
"sleep",
"infinity"
],
"cpu": 100,
"mountPoints": [
{
"readOnly": null,
"containerPath": "/mnt/efs",
"sourceVolume": "efs"
}
],
"memoryReservation": 512,
"image": "X.dkr.ecr.us-west-2.amazonaws.com/torres-docker:latest",
"interactive": true,
"essential": true,
"pseudoTerminal": true,
"readonlyRootFilesystem": false,
"privileged": false,
"name": "torres"
}
],
"family": "torres",
"volumes": [
{
"name": "efs",
"host": {
"sourcePath": "/mnt/efs"
}
}
]
}
I figured out the issue, but I don't know why it solves it.
When I changed my command from sleep infinity to just running the script (python script.py), the logging in CloudWatch started working.
Before, I was launching the container with sleep infinity, ssh-ing into the container, and launching script.py from the shell, and that was NOT logging (presumably because the awslogs driver only forwards the stdout/stderr of the container's main process, not output from a separate shell session).
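For reference, a minimal sketch of what script.py's logging could look like so that everything goes to the main process's stderr and is picked up by the awslogs driver (log group name as in the task definition above; the heartbeat loop is just an illustration):
import logging
import sys
import time

# Write to stderr of the container's main process; the awslogs driver forwards
# this to the "torres" log group configured in the task definition.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("torres")

while True:
    log.info("heartbeat from script.py")
    time.sleep(30)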

Create env fails when using a daemonset to create processes in Kubernetes

I want to deploy a piece of software to nodes with a DaemonSet, but it is not a Docker app. I created a DaemonSet JSON like this:
"template": {
"metadata": {
"creationTimestamp": null,
"labels": {
"app": "uniagent"
},
"annotations": {
"scheduler.alpha.kubernetes.io/tolerations": "[{\"key\":\"beta.k8s.io/accepted-app\",\"operator\":\"Exists\", \"effect\":\"NoSchedule\"}]"
},
"enable": true
},
"spec": {
"restartPolicy": "Always",
"terminationGracePeriodSeconds": 30,
"dnsPolicy": "ClusterFirst",
"securityContext": {},
"processes": [
{
"name": "foundation",
"package": "xxxxx",
"resources": {
"limits": {
"cpu": "100m",
"memory": "1Gi"
}
},
"lifecyclePlan": {
"kind": "ProcessLifecycle",
"namespace": "engb",
"name": "app-plc"
},
"env": [
{
"name": "SECRET_USERNAME",
"valueFrom": {
"secretKeyRef": {
"name": "key-secret",
"key": "uniagentuser"
}
}
},
{
"name": "SECRET_PASSWORD",
"valueFrom": {
"secretKeyRef": {
"name": "key-secret",
"key": "uniagenthash"
}
}
}
]
},
When the app deploys successfully, the env variables do not exist at all.
What should I do to solve this problem?
Thanks
DaemonSets have to run Docker containers; you can't run non-containerized programs as DaemonSets. https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/ Kubernetes only launches containers.
Also, in your manifest file I see a "processes" key, and I have reason to believe it's not a valid manifest, so I doubt you deployed it successfully.
You have not pasted the "full" YAML file, but I'm guessing the "template" key at the beginning is the spec.template key of the file.
Run kubectl explain daemonset.spec.template.spec and you'll see that there is no "processes" field.
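As a rough illustration (using the official kubernetes Python client; the secret and label names come from your manifest, the image is a placeholder since the software has to be packaged as a container image first), the same secret-backed env vars belong on a container in the DaemonSet's pod template:
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

# Env vars come from the Secret via secretKeyRef, attached to a container --
# DaemonSets can only run containers, not bare processes.
container = client.V1Container(
    name="uniagent",
    image="REPLACE_WITH_AN_IMAGE",  # placeholder: package the agent as an image
    env=[
        client.V1EnvVar(
            name="SECRET_USERNAME",
            value_from=client.V1EnvVarSource(
                secret_key_ref=client.V1SecretKeySelector(
                    name="key-secret", key="uniagentuser")),
        ),
        client.V1EnvVar(
            name="SECRET_PASSWORD",
            value_from=client.V1EnvVarSource(
                secret_key_ref=client.V1SecretKeySelector(
                    name="key-secret", key="uniagenthash")),
        ),
    ],
)

daemonset = client.V1DaemonSet(
    metadata=client.V1ObjectMeta(name="uniagent"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels={"app": "uniagent"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "uniagent"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_daemon_set(namespace="default", body=daemonset)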