Is there a way to get Step Functions input values into EMR step Args

We are running batch spark jobs using AWS EMR clusters. Those jobs run periodically and we would like to orchestrate those via AWS Step Functions.
As of November 2019, Step Functions natively supports EMR. When adding a step to the cluster we can use the following config:
"Some Step": {
"Type": "Task",
"Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
"Parameters": {
"ClusterId.$": "$.cluster.ClusterId",
"Step": {
"Name": "FirstStep",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit",
"--class",
"com.some.package.Class",
"JarUri",
"--startDate",
"$.time",
"--daysToLookBack",
"$.daysToLookBack"
]
}
}
},
"Retry" : [
{
"ErrorEquals": [ "States.ALL" ],
"IntervalSeconds": 1,
"MaxAttempts": 1,
"BackoffRate": 2.0
}
],
"ResultPath": "$.firstStep",
"End": true
}
Within the Args list of the HadoopJarStep we would like to set arguments dynamically. For example, if the input of the state machine execution is:
{
  "time": "2020-01-08",
  "daysToLookBack": 2
}
The strings in the config starting with "$." should be replaced accordingly when executing the State Machine, and the step on the EMR cluster should run command-runner.jar spark-submit --class com.some.package.Class JarUri --startDate 2020-01-08 --daysToLookBack 2. But instead it runs command-runner.jar spark-submit --class com.some.package.Class JarUri --startDate $.time --daysToLookBack $.daysToLookBack.
Does anyone know if there is a way to do this?

Parameters let you define key-value pairs. Since the value of the "Args" key is an array, you won't be able to dynamically reference a specific element inside that array; you would need to reference the whole array instead, for example "Args.$": "$.Input.ArgsArray".
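In the EMR step this would look roughly like the following (a sketch, assuming the whole argument list is supplied in the execution input under Input.ArgsArray):
"HadoopJarStep": {
  "Jar": "command-runner.jar",
  "Args.$": "$.Input.ArgsArray"
}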
So for your use case the best way to achieve this would be to add a pre-processing state before calling this state. In the pre-processing state you can either call a Lambda function and format your input/output through code, or, for something as simple as adding a dynamic value to an array, use a Pass state to reformat the data; then, inside your task state's Parameters, you can use JSONPath to get the array which you defined in the pre-processor. Here's an example:
{
  "Comment": "A Hello World example of the Amazon States Language using Pass states",
  "StartAt": "HardCodedInputs",
  "States": {
    "HardCodedInputs": {
      "Type": "Pass",
      "Parameters": {
        "cluster": {
          "ClusterId": "ValueForClusterIdVariable"
        },
        "time": "ValueForTimeVariable",
        "daysToLookBack": "ValueFordaysToLookBackVariable"
      },
      "Next": "Pre-Process"
    },
    "Pre-Process": {
      "Type": "Pass",
      "Parameters": {
        "FormattedInputsForEmr": {
          "ClusterId.$": "$.cluster.ClusterId",
          "Args": [
            { "Arg1": "spark-submit" },
            { "Arg2": "--class" },
            { "Arg3": "com.some.package.Class" },
            { "Arg4": "JarUri" },
            { "Arg5": "--startDate" },
            { "Arg6.$": "$.time" },
            { "Arg7": "--daysToLookBack" },
            { "Arg8.$": "$.daysToLookBack" }
          ]
        }
      },
      "Next": "Some Step"
    },
    "Some Step": {
      "Type": "Pass",
      "Parameters": {
        "ClusterId.$": "$.FormattedInputsForEmr.ClusterId",
        "Step": {
          "Name": "FirstStep",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args.$": "$.FormattedInputsForEmr.Args[*][*]"
          }
        }
      },
      "End": true
    }
  }
}

You can use the States.Array() intrinsic function. Your Parameters block becomes:
"Parameters": {
"ClusterId.$": "$.cluster.ClusterId",
"Step": {
"Name": "FirstStep",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args.$": "States.Array('spark-submit', '--class', 'com.some.package.Class', 'JarUri', '--startDate', $.time, '--daysToLookBack', '$.daysToLookBack')"
}
}
}
Intrinsic functions are documented here, but I don't think the documentation explains the usage very well. The code snippets provided in the Step Functions console are more useful.
Note that you can also do string formatting on the args using States.Format(). For example, you could construct a path using an input variable as the final path segment:
"Args.$": "States.Array('mycommand', '--path', States.Format('my/base/path/{}', $.someInputVariable))"

Related

Can't make PUT /raylight/v1/documents/id/parameter/id work properly

I need to update document parameters via the REST API.
I've tried using the following:
PUT .../raylight/v1/documents/33903/parameters/3
with the following JSON payload:
{
  "parameters": {
    "parameter": {
      "id": 3,
      "answer": {
        "values": {
          "value": [
            "2019/9"
          ]
        }
      }
    }
  }
}
But the returned answer shows unmodified parameters:
{
  "parameter": {
    "#optional": "false",
    "#type": "prompt",
    ...
    "id": 3,
    ...
    "answer": {
      ...
      "info": {
        ...
        "previous": {
          "value": [
            "2015\/12"
          ]
        }
      },
      "values": {
        "value": [
          "2015\/12"
        ]
      }
    }
  }
}
How can I properly set new prompt parameters?
Do:
PUT .../raylight/v1/documents/33903/parameters
instead of:
PUT .../raylight/v1/documents/33903/parameters/3
Adding a parameter ID at the end performs a different function: it returns the list of parameters that are dependent upon the one provided. You have only one in this case, and it's returning itself. Leave the ID off to refresh the document.

How to pass AWS Lambda error in AWS SNS notification through AWS Step Functions?

I have created an AWS Step Functions state machine which triggers a Lambda Python function: it terminates without error if the Lambda succeeds, and otherwise publishes to an SNS topic to notify the subscribed users that the Lambda failed. It is running, but the notification message is a fixed string. The Step Function JSON is as follows:
{
  "StartAt": "Lambda Trigger",
  "States": {
    "Lambda Trigger": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-2:xxxxxxxxxxxx:function:helloworldTest",
      "End": true,
      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "ResultPath": "$.error",
          "Next": "Notify Failure"
        }
      ]
    },
    "Notify Failure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "Message": "Batch job submitted through Step Functions failed with the following error, $.error",
        "TopicArn": "arn:aws:sns:us-east-2:xxxxxxxxxxxx:lambda-execution-failure"
      },
      "End": true
    }
  }
}
The only thing is, I want to append the failure error message to my message string, which I tried, but it is not working as expected: the mail I receive contains the literal "$.error" rather than the actual error. How do I go about it?
I could solve the problem using "Error.$": "$.Cause".
The following is a working example of the failure portion of the state machine:
"Job Failure": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"Subject": "Lambda Job Failed",
"Message": {
"Alarm": "Lambda Job Failed",
"Error.$": "$.Cause"
},
"TopicArn": "arn:aws:sns:us-east-2:xxxxxxxxxxxx:Job-Run-Notification"
},
"End": true
}
Hope this helps!
Here is the full version of the code
{
  "Comment": "A Hello World example of the Amazon States Language using an AWS Lambda function",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:XXXXXXXXXXXXX:function:StepFunctionTest",
      "End": true,
      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "NotifyFailure"
        }
      ]
    },
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "Subject": "[ERROR]: Task failed",
        "Message": {
          "Alarm": "Batch job submitted through Step Functions failed with the following error",
          "Error.$": "$.Cause"
        },
        "TopicArn": "arn:aws:sns:us-east-1:XXXXXXXXXXXXX:Notificaiton"
      },
      "End": true
    }
  }
}
This line is already appending the exception object to the 'error' path:
"ResultPath": "$.error"
We just need to pass '$' as Message.$ in the SNS task; both the input and the error details will then be sent to SNS.
{
  "TopicArn": "${SnsTopic}",
  "Message.$": "$"
}
If we don't want the input to the Lambda to be appended to the email, we should skip ResultPath or use just '$' as the ResultPath; the input object is then ignored.
"ResultPath": "$"

Set Subnet ID and EC2 Key Name in EMR Cluster Config via Step Functions

As of November 2019, AWS Step Functions has native support for orchestrating EMR clusters, so we are trying to configure a cluster and run some jobs on it.
We could not find any documentation on how to set the SubnetId or the EC2 key name used for the instances in the cluster. Is this possible?
As of now our create cluster step looks as follows:
"States": {
"Create an EMR cluster": {
"Type": "Task",
"Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
"Parameters": {
"Name": "TestCluster",
"VisibleToAllUsers": true,
"ReleaseLabel": "emr-5.26.0",
"Applications": [
{ "Name": "spark" }
],
"ServiceRole": "SomeRole",
"JobFlowRole": "SomeInstanceProfile",
"LogUri": "s3://some-logs-bucket/logs",
"Instances": {
"KeepJobFlowAliveWhenNoSteps": true,
"InstanceFleets": [
{
"Name": "MasterFleet",
"InstanceFleetType": "MASTER",
"TargetOnDemandCapacity": 1,
"InstanceTypeConfigs": [
{
"InstanceType": "m3.2xlarge"
}
]
},
{
"Name": "CoreFleet",
"InstanceFleetType": "CORE",
"TargetSpotCapacity": 2,
"InstanceTypeConfigs": [
{
"InstanceType": "m3.2xlarge",
"BidPriceAsPercentageOfOnDemandPrice": 100 }
]
}
]
}
},
"ResultPath": "$.cluster",
"End": "true"
}
}
As soon as we try to add a "SubnetId" key in any of the sub-objects of Parameters, or in Parameters itself, we get the error:
Invalid State Machine Definition: 'SCHEMA_VALIDATION_FAILED: The field "SubnetId" is not supported by Step Functions at /States/Create an EMR cluster/Parameters' (Service: AWSStepFunctions; Status Code: 400; Error Code: InvalidDefinition;
Referring to the Step Functions docs on the EMR integration, we can see that createCluster.sync uses the EMR API RunJobFlow. In RunJobFlow we can specify Ec2KeyName and Ec2SubnetId, located at the paths $.Instances.Ec2KeyName and $.Instances.Ec2SubnetId.
With that said, I managed to create a state machine with the following definition (on a side note, your definition had a syntax error: "End": "true" should be "End": true):
{
  "Comment": "A Hello World example of the Amazon States Language using Pass states",
  "StartAt": "Create an EMR cluster",
  "States": {
    "Create an EMR cluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
      "Parameters": {
        "Name": "TestCluster",
        "VisibleToAllUsers": true,
        "ReleaseLabel": "emr-5.26.0",
        "Applications": [
          {
            "Name": "spark"
          }
        ],
        "ServiceRole": "SomeRole",
        "JobFlowRole": "SomeInstanceProfile",
        "LogUri": "s3://some-logs-bucket/logs",
        "Instances": {
          "Ec2KeyName": "ENTER_EC2KEYNAME_HERE",
          "Ec2SubnetId": "ENTER_EC2SUBNETID_HERE",
          "KeepJobFlowAliveWhenNoSteps": true,
          "InstanceFleets": [
            {
              "Name": "MasterFleet",
              "InstanceFleetType": "MASTER",
              "TargetOnDemandCapacity": 1,
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m3.2xlarge"
                }
              ]
            },
            {
              "Name": "CoreFleet",
              "InstanceFleetType": "CORE",
              "TargetSpotCapacity": 2,
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m3.2xlarge",
                  "BidPriceAsPercentageOfOnDemandPrice": 100
                }
              ]
            }
          ]
        }
      },
      "ResultPath": "$.cluster",
      "End": true
    }
  }
}

How to pass a variable to EMR addStep in AWS StepFunctions

AWS Step Functions recently added EMR integration, which is cool, but I couldn't find a way to pass a variable from Step Functions into the addStep args.
For example, I would like to pass the "$.dayid" variable into "Parameters" > "Step" > "HadoopJarStep" > Args, similar to "ClusterId.$": "$.ClusterId" (this cluster ID variable works).
{
  "Step_One": {
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
    "Parameters": {
      "ClusterId.$": "$.ClusterId",
      "Step": {
        "Name": "The first step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
          "Jar": "command-runner.jar",
          "Args": [
            "hive-script",
            "--run-hive-script",
            "--args",
            "-f",
            "s3://<region>.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
            "-d",
            "INPUT=s3://<region>.elasticmapreduce.samples",
            "-d",
            "OUTPUT=s3://<mybucket>/MyHiveQueryResults/$.dayid"
          ]
        }
      }
    },
    "End": true
  }
}
Parameters let you define key-value pairs. Since the value of the "Args" key is an array, you won't be able to dynamically reference a specific element inside that array; you would need to reference the whole array instead, for example "Args.$": "$.Input.ArgsArray". That said, you also won't be able to substitute a value inside a string like you are trying to do in "OUTPUT=s3://<mybucket>/MyHiveQueryResults/$.dayid".
So for your use case the best way to achieve this would be to add a pre-processing state before calling this state. In the pre-processing state I would recommend you call a Lambda function to construct the "OUTPUT=s3://<mybucket>/MyHiveQueryResults/<dayid>" string as well as the full array you send to Args (a sketch of such a Lambda's output follows the definition below).
{
  "StartAt": "Pre-Process",
  "States": {
    "Pre-Process": {
      "Type": "Task",
      "Resource": "<Lambda function to generate the string OUTPUT=s3://<mybucket>/MyHiveQueryResults/$.dayid and output the Args array>",
      "Next": "Step_One"
    },
    "Step_One": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "Step": {
          "Name": "The first step",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args.$": "$.ArgsGeneratedByPreProcessingState"
          }
        }
      },
      "End": true
    }
  }
}
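For illustration, the Pre-Process Lambda could return something like the following, which Step_One then reads via $.ClusterId and $.ArgsGeneratedByPreProcessingState (a hypothetical sketch; the cluster ID and the resolved dayid value are placeholders):
{
  "ClusterId": "j-EXAMPLECLUSTERID",
  "ArgsGeneratedByPreProcessingState": [
    "hive-script",
    "--run-hive-script",
    "--args",
    "-f",
    "s3://<region>.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
    "-d",
    "INPUT=s3://<region>.elasticmapreduce.samples",
    "-d",
    "OUTPUT=s3://<mybucket>/MyHiveQueryResults/20200101"
  ]
}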
Step Functions now has intrinsic functions that can help in this situation.
"PayloadString.$": "States.Format('[[{}]]', States.JsonToString($.in.summary))",
"CmdLine.$": "States.Array('--maxp', $.params.maxpr, '--minp', $.params.minpr)"
Can't believe it took this long for these functions to become available.
See the documentation.
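Applied to the question above, the whole HadoopJarStep argument list can be built inline, nesting States.Format inside States.Array to construct the OUTPUT path (a sketch using the question's placeholder region and bucket names):
"HadoopJarStep": {
  "Jar": "command-runner.jar",
  "Args.$": "States.Array('hive-script', '--run-hive-script', '--args', '-f', 's3://<region>.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q', '-d', 'INPUT=s3://<region>.elasticmapreduce.samples', '-d', States.Format('OUTPUT=s3://<mybucket>/MyHiveQueryResults/{}', $.dayid))"
}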

Azure Data Factory v2 If activity always fails

I'm currently struggling with the Azure Data Factory v2 If activity, which always fails with this error message: "Activity failed: Activity failed because an inner activity failed."
I've designed two separate pipelines: one takes the full snapshot of the data (1333 records) from the on-premises SQL Server and loads it into the Azure SQL Database, and the other just takes the delta from the same source.
Both pipelines work fine when executed independently.
I then decided to wrap these two pipelines into one parent pipeline which would do this:
1. Execute a Lookup activity to check whether the target table in the Azure SQL Database has any records, a basic Select Count(Request_ID) As record_count From target_table - the activity works fine and I can preview the returned record count.
2. Pass the output of the Lookup activity to the If activity, with the condition that if record_count = 0 the parent pipeline invokes the full load pipeline, otherwise it invokes the delta load pipeline.
This is the actual expression:
{#activity('lookup_sites_record_count').output.firstRow.record_count}==0
Whenever I try to execute this parent pipeline, it fails with the above message of "Activity failed: Activity failed because an inner activity failed."
Both inner activities, that is, full load and delta load pipelines, work just fine when triggered independently.
What am I missing?
Many thanks in advance :).
mikhailg
Pipeline's JSON definition below:
{
  "name": "pl_remedyreports_load_rs_sites",
  "properties": {
    "activities": [
      {
        "name": "lookup_sites_record_count",
        "type": "Lookup",
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30,
          "secureOutput": false
        },
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "Select Count(Request_ID) As record_count From mdp.RS_Sites;"
          },
          "dataset": {
            "referenceName": "ds_azure_sql_db_sites",
            "type": "DatasetReference"
          }
        }
      },
      {
        "name": "If_check_site_record_count",
        "type": "IfCondition",
        "dependsOn": [
          {
            "activity": "lookup_sites_record_count",
            "dependencyConditions": [
              "Succeeded"
            ]
          }
        ],
        "typeProperties": {
          "expression": {
            "value": "{#activity('lookup_sites_record_count').output.firstRow.record_count}==0",
            "type": "Expression"
          },
          "ifFalseActivities": [
            {
              "name": "pl_remedyreports_invoke_load_sites_inc",
              "type": "ExecutePipeline",
              "typeProperties": {
                "pipeline": {
                  "referenceName": "pl_remedyreports_load_sites_inc",
                  "type": "PipelineReference"
                }
              }
            }
          ],
          "ifTrueActivities": [
            {
              "name": "pl_remedyreports_invoke_load_sites_full",
              "type": "ExecutePipeline",
              "typeProperties": {
                "pipeline": {
                  "referenceName": "pl_remedyreports_load_sites_full",
                  "type": "PipelineReference"
                }
              }
            }
          ]
        }
      }
    ],
    "folder": {
      "name": "Load Remedy Reference Data"
    }
  }
}
Your expression should be:
@equals(activity('lookup_sites_record_count').output.firstRow.record_count, 0)
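In the pipeline JSON above, the If activity's expression block then becomes (only the value changes):
"expression": {
  "value": "@equals(activity('lookup_sites_record_count').output.firstRow.record_count, 0)",
  "type": "Expression"
}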