AWS data pipeline activity with multiple inputs - variables

As part of an Amazon AWS data pipeline, I have a hive activity using two unstaged S3 data nodes as input. What I want is to be able to set two script variables on the activity, each pointing to an input data node, but I can't get the syntax right. With the single input, I could write the following and it would work just fine:
INPUT_FOO=#{input.directoryPath}
When I add the second input, I run into a problem of how to reference them since they are now an array of inputs, as you can see in the pipeline definition below. Essentially, I want to achieve the following, but can't figure out the correct syntax:
INPUT_FOO=#{input[1].directoryPath}
INPUT_BAR=#{input[2].directoryPath}
Here's the activity portion of the pipeline definition:
{
"id": "ActivityId_7u1sR",
"input": [
{
"ref": "DataNodeId_iYnxf"
},
{
"ref": "DataNodeId_162Ka"
}
],
"schedule": {
"ref": "DefaultSchedule"
},
"scriptUri": "#{myS3ScriptLocation}calculate-results.q",
"name": "Perform Calculations",
"runsOn": {
"ref": "EmrClusterId_jHeiV"
},
"scriptVariable": [
"INPUT_SOURCE1=#{input[1].directoryPath}",
"OUTPUT=#{output.directoryPath}Results/",
"INPUT_SOURCE2=#{input[2].directoryPath}"
],
"output": {
"ref": "DataNodeId_2jY6v"
},
"type": "HiveActivity",
"stage": "false"
}
I plan to keep the tables unstaged and take care of table creation in the hive script so that it's easier to run each Hive activity in isolation as well as in the pipeline itself.
Here's the error I see when using array syntax:
Unable to resolve input[1].directoryPath for object ActivityId_7u1sR'

As it stands now, this scenario is not supported, but a feature request was added to support it in the future.

Generate "Instances" definition programmatically to create EMR cluster in StepFunctions

I have a case where I want to dynamically create an EMR cluster based on a user-defined configuration and execute a sequence of steps on it using AWS Step Functions.
For this, I am planning to provide the instance configuration as an input to the step functions workflow.
Based on the StepFunctions-EMR Integration Documentation, the definition is the same as that of the RunJobFlow API.
However, when I try to generate the definition by serializing an instance of JobFlowInstancesConfig to JSON and pass it to the StateMachine as an input, it throws an error saying:
The field 'Instances.KeepJobFlowAliveWhenNoSteps' is required but was missing
Here is the JSON generated post serialization:
{
"instanceFleets": [
{
"instanceFleetType": "MASTER",
"targetOnDemandCapacity": 1,
"instanceTypeConfigs": [
{
"instanceType": "m5.xlarge"
}
]
},
{
"instanceFleetType": "CORE",
"targetOnDemandCapacity": 1,
"instanceTypeConfigs": [
{
"instanceType": "c5.2xlarge"
}
]
}
],
"keepJobFlowAliveWhenNoSteps": true
}
I am passing this in the input, and accessing it in my StepFunctions definition in the below Task (where I expect the above definition to be replacing $.jobFlowInstancesConfig):
...
"GetCluster": {
"Type": "Task",
"Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
"Parameters": {
"Name.$": "$.clusterName",
"VisibleToAllUsers": true,
"ReleaseLabel": "emr-5.30.0",
"Applications": [
{
"Name": "Spark"
}
],
"ServiceRole": "EMR_DefaultRole",
"JobFlowRole": "EMR_EC2_DefaultRole",
"LogUri": "s3://my-aws-logs/elasticmapreduce/",
"Instances.$": "$.jobFlowInstancesConfig"
}
}
...
My suspicion is that this is failing because StepFunctions expects the field names to start with upper case.
Question: How do I programmatically generate the appropriate definition without having to play around with Strings for generating the JSON? Is there a straightforward way to serialize the above definition to one that will work with StepFunctions?
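One way to test the upper-case suspicion without hand-building strings is to rename the keys of the serialized JSON recursively. The following is a minimal Python sketch, assuming the only mismatch is the case of the first letter of each field name; the helper name pascal_case_keys and the cluster name example-cluster are made up for illustration, and the sample is shortened to one fleet.

```python
import json


def pascal_case_keys(value):
    """Return a copy of value with the first letter of every dict key upper-cased."""
    if isinstance(value, dict):
        return {
            key[:1].upper() + key[1:]: pascal_case_keys(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [pascal_case_keys(item) for item in value]
    # Strings, numbers, booleans and None are returned unchanged.
    return value


# Shortened version of the serialized JobFlowInstancesConfig from the question.
serialized = """
{
  "instanceFleets": [
    {
      "instanceFleetType": "CORE",
      "targetOnDemandCapacity": 1,
      "instanceTypeConfigs": [{"instanceType": "c5.2xlarge"}]
    }
  ],
  "keepJobFlowAliveWhenNoSteps": true
}
"""

instances = pascal_case_keys(json.loads(serialized))

# Hypothetical state machine input; "example-cluster" is a placeholder name.
state_input = {"clusterName": "example-cluster", "jobFlowInstancesConfig": instances}
print(json.dumps(state_input, indent=2))
```

Only keys are renamed; values such as "CORE" or "c5.2xlarge" pass through untouched. If the serializer in use can be configured with an upper-camel-case naming strategy, that would avoid the post-processing step altogether.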

Error loading multiple files to bigquery too many positional args

Edit: I'm running the bq command-line tool from a VM instance in Google Compute Engine.
I've been trying to load multiple CSV files into BigQuery using the bq command line, and I keep getting this error:
Too many positional args, still have ['/home/username/csvschema.json']
All my files have the same schema, since I copied one file and renamed the copies for testing purposes, so I'm not sure why I keep getting this error. [testFiles_1.csv, testFiles_2.csv, testFiles_3.csv]
These are the steps I took:
1. Created my BigQuery table and manually inserted one file, so that the schema was auto-detected and I didn't need to add it manually.
2. Then, I type this command:
bq load --skip_leading_rows=1 gcstransfer.testFile /home/username/testfile_*.csv /home/username/csvschema.json
I got my schema by running bq show --format=prettyjson dataset.table:
[
{
"mode": "NULLABLE",
"name": "Channel",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Date",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "ID",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Referral",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Browser",
"type": "STRING"
}
]
I tried omitting the JSON part, but I get this error instead:
BigQuery error in load operation: Error decoding JSON schema from file /home/username/testfile_2.csv: No JSON object could be decoded
To specify a one-column schema, use "name:string".
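It looks like you cannot use wildcards when loading from a local data source. The shell expands testfile_*.csv into several file names, so bq receives more positional arguments than it accepts; when the schema file is omitted, it treats the second CSV file as the schema, which explains the second error. Instead, you can upload the files to a GCS bucket and load them from there. See the Limitations paragraph in the docs: https://cloud.google.com/bigquery/docs/loading-data-local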
Wildcards and comma separated lists are not supported when you load
files from a local data source. Files must be loaded individually.
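If the files have to stay on the local disk, loading them one at a time can be scripted. Below is a minimal Python sketch that only builds and prints one bq load command per file; the helper name build_load_commands is made up for illustration, and the table and paths are the ones from the question.

```python
import glob
import shlex


def build_load_commands(table, csv_files, schema_file):
    """Build one `bq load` argument list per local CSV file."""
    return [
        [
            "bq", "load",
            "--source_format=CSV",
            "--skip_leading_rows=1",
            table, csv_file, schema_file,
        ]
        for csv_file in sorted(csv_files)
    ]


if __name__ == "__main__":
    # The wildcard is expanded here, in Python, and each file gets its own command.
    files = glob.glob("/home/username/testfile_*.csv")
    commands = build_load_commands(
        "gcstransfer.testFile", files, "/home/username/csvschema.json"
    )
    for command in commands:
        print(shlex.join(command))
```

Each printed command has exactly three positional arguments (table, one file, schema). To execute them instead of printing, each argument list could be passed to subprocess.run.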

ADF v2 - Web Activity- POST output not retrievable

With Azure Data Factory v2, I created a Web Activity using the POST method and got the desired response output.
But I can't get the row data from the output response in the next activity.
How do I reference columns in the rows in this output?
The data in the rows doesn't have any headers.
{
"Tables": [
{
"TableName": "Table_0",
"Columns": [
{
"ColumnName": "MyFieldA",
"DataType": "String",
"ColumnType": "string"
},
{
"ColumnName": "MyFieldB",
"DataType": "String",
"ColumnType": "string"
}
],
"Rows": [
[
"ABCDEF",
"AAAABBBBBCCCDDDDD"
],
[
"CCCCCCC",
"CCCCCCC"
],
I can't reference the values in the rows.
I've tried numerous things,
e.g. @activity('WebActivity').output.Rows
Nothing seems to work.
What's the point of getting a response from a web activity and then not being able to reference the output in data factory?
Thanks Pacodel!!! You've helped me out.
To use this in a ForEach loop over the array, I pass the Rows to my Execute Pipeline activity with @activity('WebActivity').output.Tables[0].Rows:
[
[
"ABCDEF",
"AAAABBBBBCCCDDDDD"
],
[
"CCCCCCC",
"CCCCCCC"
]
]
I can use the following to reference the rows:
@{item()[0]}
@{item()[1]}
I use @item() to populate parameters in a stored procedure activity, which loads my table.
Thanks
For your example,
@activity('WebActivity').output.Tables[0].Rows
will return the following:
[
[
"ABCDEF",
"AAAABBBBBCCCDDDDD"
],
[
"CCCCCCC",
"CCCCCCC"
]
]
If you want to access even deeper, you just need to specify the index.
@activity('WebActivity').output.Tables[0].Rows[0][0] will return ABCDEF
If you need to automate this, you can use a ForEach with an Execute Pipeline activity inside, passing an array as a parameter at each level until you reach the properties that you need.
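To see what those index expressions select, the same navigation can be reproduced outside Data Factory. This is a Python illustration only, not ADF expression syntax; the response in the question is truncated, so the closing brackets here are assumed.

```python
import json

# Response body from the question, with the truncated tail closed (assumption).
response = json.loads("""
{
  "Tables": [
    {
      "TableName": "Table_0",
      "Columns": [
        {"ColumnName": "MyFieldA", "DataType": "String", "ColumnType": "string"},
        {"ColumnName": "MyFieldB", "DataType": "String", "ColumnType": "string"}
      ],
      "Rows": [
        ["ABCDEF", "AAAABBBBBCCCDDDDD"],
        ["CCCCCCC", "CCCCCCC"]
      ]
    }
  ]
}
""")

table = response["Tables"][0]   # mirrors output.Tables[0]
rows = table["Rows"]            # mirrors output.Tables[0].Rows
first_value = rows[0][0]        # mirrors output.Tables[0].Rows[0][0]

# The rows carry no headers, so pair each value with its column name by position.
column_names = [column["ColumnName"] for column in table["Columns"]]
named_rows = [dict(zip(column_names, row)) for row in rows]

print(first_value)
print(named_rows[0])
```

The point is that Rows sits under Tables[0], not directly under output, and that each row is a plain array, so its values are addressed by position, in the same order as the Columns list.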

Writing failed row inserts in a streaming job to bigquery using apache beam JAVA SDK?

While running a streaming job, it's always good to keep a log of the rows that could not be inserted into BigQuery. Catching those rows and writing them into another BigQuery table gives an idea of what went wrong.
Below are the steps you can try to achieve this.
Pre-requisites:
apache-beam 2.10.0 or later
Using the getFailedInsertsWithErr() function available in the SDK, you can easily catch the failed inserts and push them to another table for root cause analysis (RCA). This is an important feature for debugging streaming pipelines, which run indefinitely.
BigQueryInsertError is the error object returned for a TableRow that BigQuery failed to insert. It contains the following:
Row.
Error stacktrace and error message payload.
Table reference object.
The above parameters can be captured and pushed into another BigQuery table. An example schema for the error records:
{
"fields": [{
"name": "timestamp",
"type": "TIMESTAMP",
"mode": "REQUIRED"
},
{
"name": "payloadString",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "errorMessage",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "stacktrace",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
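The mapping from a failed insert to a record with this schema can be sketched as a plain function. The sketch below is in Python and is independent of Beam, so it is not the Java SDK API; the function name to_error_record and its parameters are made up for illustration. In the Java pipeline, the same mapping would be applied to each BigQueryInsertError before writing to the error table.

```python
import json
from datetime import datetime, timezone


def to_error_record(row, error_message, stacktrace=None, now=None):
    """Map a failed row and its error details onto the error-table schema."""
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(),                      # REQUIRED
        "payloadString": json.dumps(row, sort_keys=True),  # REQUIRED
        "errorMessage": error_message,                     # NULLABLE
        "stacktrace": stacktrace,                          # NULLABLE
    }


record = to_error_record(
    {"id": 42, "name": "example"},
    "no such field: name",
    now=datetime(2020, 1, 1, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

Storing the original row as a JSON string in payloadString keeps the error table's schema fixed regardless of what the failed rows look like.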

AWS Data pipeline CSV data from S3 to DynamoDB

I am trying to transfer CSV data from an S3 bucket to DynamoDB using AWS Data Pipeline. The following is my pipeline script, which is not working properly.
CSV file structure:
Name, Designation,Company
A,TL,C1
B,Prog, C2
DynamoDB table: N_Table, with Name as the hash key
{
"objects": [
{
"id": "Default",
"scheduleType": "cron",
"name": "Default",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DynamoDBDataNodeId635",
"schedule": {
"ref": "ScheduleId639"
},
"tableName": "N_Table",
"name": "MyDynamoDBData",
"type": "DynamoDBDataNode"
},
{
"emrLogUri": "s3://onlycsv/error",
"id": "EmrClusterId636",
"schedule": {
"ref": "ScheduleId639"
},
"masterInstanceType": "m1.small",
"coreInstanceType": "m1.xlarge",
"enableDebugging": "true",
"installHive": "latest",
"name": "ImportCluster",
"coreInstanceCount": "1",
"logUri": "s3://onlycsv/error1",
"type": "EmrCluster"
},
{
"id": "S3DataNodeId643",
"schedule": {
"ref": "ScheduleId639"
},
"directoryPath": "s3://onlycsv/data.csv",
"name": "MyS3Data",
"dataFormat": {
"ref": "DataFormatId1"
},
"type": "S3DataNode"
},
{
"id": "ScheduleId639",
"startDateTime": "2013-08-03T00:00:00",
"name": "ImportSchedule",
"period": "1 Hours",
"type": "Schedule",
"endDateTime": "2013-08-04T00:00:00"
},
{
"id": "EmrActivityId637",
"input": {
"ref": "S3DataNodeId643"
},
"schedule": {
"ref": "ScheduleId639"
},
"name": "MyImportJob",
"runsOn": {
"ref": "EmrClusterId636"
},
"maximumRetries": "0",
"myDynamoDBWriteThroughputRatio": "0.25",
"attemptTimeout": "24 hours",
"type": "EmrActivity",
"output": {
"ref": "DynamoDBDataNodeId635"
},
"step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{output.tableName},-d,S3_INPUT_BUCKET=#{input.directoryPath},-d,DYNAMODB_WRITE_PERCENT=#{myDynamoDBWriteThroughputRatio},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
},
{
"id": "DataFormatId1",
"name": "DefaultDataFormat1",
"column": [
"Name",
"Designation",
"Company"
],
"columnSeparator": ",",
"recordSeparator": "\n",
"type": "Custom"
}
]
}
Of the four steps in the pipeline execution, two finish, but the pipeline does not run to completion.
Currently (2015-04), the default import pipeline template does not support importing CSV files.
If your CSV file is not too big (under 1 GB or so), you can create a ShellCommandActivity to convert the CSV to the DynamoDB JSON format first, and then feed that to an EmrActivity that imports the resulting JSON file into your table.
As a first step, you can create a sample DynamoDB table including all the field types you need, populate it with dummy values, and then export the records using a pipeline (Export/Import button in the DynamoDB console). This will give you an idea of the format that the Import pipeline expects. The type names are not obvious, and the Import activity is very sensitive to the correct case (e.g. you should have bOOL for a boolean field).
Afterwards, it should be easy to create an awk script (or any other text converter; with awk, at least, you can use the default AMI image for your shell activity), which you can feed to your ShellCommandActivity. Don't forget to enable the "staging" flag, so that your output is uploaded back to S3 for the Import activity to pick up.
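As a rough sketch of such a converter, here it is in Python rather than awk. It assumes the exported sample shows one JSON object per line, with every attribute wrapped in a lower-case type tag such as {"s": "..."} for strings. Both the layout and the tag are assumptions to verify against your own export, and the function name csv_to_dynamodb_json is made up for illustration.

```python
import csv
import io
import json


def csv_to_dynamodb_json(csv_text):
    """Convert CSV text with a header row into one JSON line per record."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    lines = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        # Every column is treated as a string attribute (assumed "s" type tag).
        item = {name: {"s": value.strip()} for name, value in zip(header, record)}
        lines.append(json.dumps(item))
    return lines


# Sample data from the question, including the stray spaces after some commas.
sample = "Name, Designation,Company\nA,TL,C1\nB,Prog, C2\n"
for line in csv_to_dynamodb_json(sample):
    print(line)
```

Stripping whitespace matters here because the sample file has spaces after some commas, which would otherwise end up in attribute names and values.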
If you are using the template data pipeline for importing data from S3 to DynamoDB, these data formats won't work. Instead, store the input S3 data file in the format described at the link below: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-pipelinejson-verifydata2.html
This is the format of the output file generated by the template data pipeline that exports data from DynamoDB to S3.
Hope that helps.
I would recommend using the CSV data format provided by Data Pipeline instead of a custom one.
For debugging errors on the cluster, you can look up the job flow in the EMR console and look at the log files for the tasks that failed.
See the link below for a solution that works (in the question section), albeit for EMR 3.x. Just change the delimiter to "columnSeparator": ",". Personally, I wouldn't use CSV unless you are certain the data is sanitized correctly.
How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?