ADF v2 - Web Activity - POST output not retrievable

With Azure Data Factory v2, I created a Web Activity using the POST method and got the desired response output.
But I can't get the row data from that response output in the next activity.
How do I reference the columns in the rows of this output?
The data in the rows doesn't have any headers.
{
  "Tables": [
    {
      "TableName": "Table_0",
      "Columns": [
        {
          "ColumnName": "MyFieldA",
          "DataType": "String",
          "ColumnType": "string"
        },
        {
          "ColumnName": "MyFieldB",
          "DataType": "String",
          "ColumnType": "string"
        }
      ],
      "Rows": [
        [
          "ABCDEF",
          "AAAABBBBBCCCDDDDD"
        ],
        [
          "CCCCCCC",
          "CCCCCCC"
        ],
I can't reference the values in the rows.
I've tried numerous things, e.g. @activity('WebActivity').output.Rows, but nothing seems to work.
What's the point of getting a response from a Web Activity if you then can't reference the output in Data Factory?

Thanks Pacodel!!! You've helped me out.
To use this in a ForEach loop over the array, I pass the rows into my Execute Pipeline activity with @activity('WebActivity').output.Tables[0].Rows:
[
  [
    "ABCDEF",
    "AAAABBBBBCCCDDDDD"
  ],
  [
    "CCCCCCC",
    "CCCCCCC"
  ]
]
I can then use the following to reference the row values:
@{item()[0]}
@{item()[1]}
I use @item() to populate parameters in a Stored Procedure activity which loads my table.
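A rough sketch of how such a Stored Procedure activity inside the ForEach could look (the activity, linked service, procedure, and parameter names here are made up for illustration, not taken from my actual pipeline):
{
  "name": "LoadRow",
  "type": "SqlServerStoredProcedure",
  "linkedServiceName": {
    "referenceName": "MyAzureSqlDatabase",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "storedProcedureName": "usp_LoadMyTable",
    "storedProcedureParameters": {
      "MyFieldA": { "value": "@{item()[0]}", "type": "String" },
      "MyFieldB": { "value": "@{item()[1]}", "type": "String" }
    }
  }
}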
Thanks

For your example,
@activity('WebActivity').output.Tables[0].Rows
will return the following:
[
  [
    "ABCDEF",
    "AAAABBBBBCCCDDDDD"
  ],
  [
    "CCCCCCC",
    "CCCCCCC"
  ]
]
If you want to access values even deeper in the structure, you just need to specify the index:
@activity('WebActivity').output.Tables[0].Rows[0][0] will return ABCDEF.
If you need to automate this, you can use a pattern of a ForEach with an Execute Pipeline inside, passing an array as a parameter, until you get to the properties that you need.
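For instance, a ForEach that iterates the Rows array and hands each row to a child pipeline might look roughly like this (activity, pipeline, and parameter names are illustrative):
{
  "name": "ForEachRow",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "WebActivity", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('WebActivity').output.Tables[0].Rows",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "ProcessRow",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": {
            "referenceName": "ChildPipeline",
            "type": "PipelineReference"
          },
          "parameters": {
            "row": "@item()"
          }
        }
      }
    ]
  }
}
Inside the loop you can address each value by index (@{item()[0]}), or pass @item() on to the child pipeline as an array parameter as shown.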

Related

Generate "Instances" definition programmatically to create EMR cluster in StepFunctions

I have a case where I want to dynamically create an EMR cluster based on a user-defined configuration and execute a sequence of steps on it using AWS Step Functions.
For this, I am planning to provide the instance configuration as an input to the step functions workflow.
Based on the StepFunctions-EMR Integration Documentation, the definition is the same as that of the RunJobFlow API.
However, when I try to generate the definition by serializing an instance of JobFlowInstancesConfig to JSON and passing it to the state machine as an input, it throws an error saying:
The field 'Instances.KeepJobFlowAliveWhenNoSteps' is required but was missing
Here is the JSON generated after serialization:
{
  "instanceFleets": [
    {
      "instanceFleetType": "MAIN",
      "targetOnDemandCapacity": 1,
      "instanceTypeConfigs": [
        {
          "instanceType": "m5.xlarge"
        }
      ]
    },
    {
      "instanceFleetType": "CORE",
      "targetOnDemandCapacity": 1,
      "instanceTypeConfigs": [
        {
          "instanceType": "c5.2xlarge"
        }
      ]
    }
  ],
  "keepJobFlowAliveWhenNoSteps": true
}
I am passing this in the input and accessing it in my Step Functions definition in the Task below (where I expect the above definition to replace $.jobFlowInstancesConfig):
...
"GetCluster": {
  "Type": "Task",
  "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
  "Parameters": {
    "Name.$": "$.clusterName",
    "VisibleToAllUsers": true,
    "ReleaseLabel": "emr-5.30.0",
    "Applications": [
      {
        "Name": "Spark"
      }
    ],
    "ServiceRole": "EMR_DefaultRole",
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "LogUri": "s3://my-aws-logs/elasticmapreduce/",
    "Instances.$": "$.jobFlowInstancesConfig"
  }
}
...
My suspicion is that this is failing because Step Functions expects the field names to start with an upper-case letter.
Question: How do I programmatically generate the appropriate definition without having to play around with Strings for generating the JSON? Is there a straightforward way to serialize the above definition to one that will work with StepFunctions?
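If that suspicion is right, the value substituted for $.jobFlowInstancesConfig would need the PascalCase field names of the RunJobFlow API (which also lists MASTER, CORE, and TASK as the InstanceFleetType values). A hand-written, untested sketch of the expected shape, rather than output from the SDK model:
{
  "InstanceFleets": [
    {
      "InstanceFleetType": "MASTER",
      "TargetOnDemandCapacity": 1,
      "InstanceTypeConfigs": [
        { "InstanceType": "m5.xlarge" }
      ]
    },
    {
      "InstanceFleetType": "CORE",
      "TargetOnDemandCapacity": 1,
      "InstanceTypeConfigs": [
        { "InstanceType": "c5.2xlarge" }
      ]
    }
  ],
  "KeepJobFlowAliveWhenNoSteps": true
}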

JSON Schema Array Validation Woes Using oneOf

Hope I might find some help with this validation issue: I have a JSON array that can have multiple object types (video, image). Within that array, the objects have a rel field value. The case I'm working on is that there can only be one object with "rel": "primaryMedia" allowed — either a video or an image.
Here are the object representations for the image and video case, both showing the "rel": "primaryMedia".
{
  "media": [
    {
      "caption": "Caption goes here",
      "id": "ncim87659842",
      "rel": "primaryMedia",
      "type": "image"
    },
    {
      "description": "Shaima Swileh arrived in San Francisco after fighting for 17 months to get a waiver from the U.S. government to be allowed into the country to visit her son.",
      "headline": "Yemeni mother arrives in U.S. to be with dying 2-year-old son",
      "id": "mmvo1402810947621",
      "rel": "primaryMedia",
      "type": "video"
    }
  ]
}
Here's a stripped-down version of the schema I created to validate this using oneOf (assuming this would handle my case). It doesn't, however, work as intended.
{
  "$id": "http://example.com/schema/rockcms/article.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "definitions": {},
  "properties": {
    "media": {
      "type": "array",
      "items": {
        "oneOf": [
          {
            "type": "object",
            "additionalProperties": false,
            "properties": {
              "caption": { "type": "string" },
              "id": { "type": "string" },
              "rel": { "type": "string", "enum": [ "primaryMedia" ] },
              "type": { "type": "string", "enum": [ "image" ] }
            },
            "required": [ "caption", "id", "rel", "type" ]
          },
          {
            "type": "object",
            "additionalProperties": false,
            "properties": {
              "description": { "type": "string" },
              "headline": { "type": "string" },
              "id": { "type": "string" },
              "rel": { "type": "string", "enum": [ "primaryMedia" ] },
              "type": { "type": "string", "enum": [ "video" ] }
            },
            "required": [ "description", "headline", "id", "rel", "type" ]
          }
        ]
      }
    }
  }
}
Using the JSON Schema validator at https://www.jsonschemavalidator.net, the schema validates when the data is correctly presented, but it doesn't work when trying to catch errors.
In the case below, headline is added for a video, and it's missing id. This should fail because headline isn't allowed on video, and id is required.
{
  "media": [
    {
      "headline": "Yemeni mother arrives in U.S. to be with dying 2-year-old son",
      "rel": "primaryMedia",
      "type": "video"
    }
  ]
}
The results I get from the validator, however, aren't entirely what I expected. It seems to conflate the two object schemas in its response.
In addition, I've separately found that the schema will allow BOTH a video and an image object in media, which isn't expected.
I've been trying to figure out what I've done wrong, but I'm stumped. I'd very much appreciate some feedback, if anyone has any to offer. Thanks in advance!
Validator Response:
Message: JSON is valid against no schemas from 'oneOf'.
Schema path: #/properties/media/items/oneOf
Message: Property 'headline' has not been defined and the schema does not allow additional properties.
Schema path: #/properties/media/items/oneOf/0/additionalProperties
Message: Value "video" is not defined in enum.
Schema path: #/properties/media/items/oneOf/0/properties/type/enum
Message: Required properties are missing from object: description, id.
Schema path: #/properties/media/items/oneOf/1/required
Message: Required properties are missing from object: caption, id.
Schema path: #/properties/media/items/oneOf/0/required
Your question is formed of three parts, but the first two are linked, so I'll address those together, although they don't have a "solution" as such.
How can I validate that only one object in an array has a specific key, and the others do not?
You cannot do this with JSON Schema.
The set of JSON Schema keywords applicable to arrays has no means of expressing "one of the values must match a schema"; the keywords apply either to ALL of the items in the array, or to A SPECIFIC item in the array (if items is an array as opposed to an object).
https://datatracker.ietf.org/doc/html/draft-handrews-json-schema-validation-01#section-6.4
The validation output is not what I expect. I expect to only see the failing branch of oneOf which relates to my object type, which is defined by the type key.
The JSON Schema specification (as of draft-7) does not specify any format for returning errors; however, the error structure you get is pretty "complete" in terms of what it's telling you (and is similar to how we are specifying errors should be returned for draft-8).
Consider that the validator knows nothing about your schema or your JSON instance in terms of your business logic.
When validating a JSON instance, a validator may step through all values and test the validation rules against all applicable subschemas. Looking at your errors, both of the schemas in oneOf are applicable to every item in your array, so all of them are tested. If one does not satisfy the condition, the others are tested as well. The validator cannot know, when using a oneOf, what your intent was.
You MAY be able to get around this issue by using an if / then combination. If your if schema is simply a const of the type, and your then schema is the full object type, you may get a cleaner error response, but I haven't tested this theory.
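An untested sketch of that idea, replacing the current items schema: one if / then pair per media type, wrapped in an allOf applied to every item (only the video branch is shown; the image branch would mirror it):
{
  "items": {
    "allOf": [
      {
        "if": {
          "properties": { "type": { "const": "video" } },
          "required": [ "type" ]
        },
        "then": {
          "additionalProperties": false,
          "properties": {
            "description": { "type": "string" },
            "headline": { "type": "string" },
            "id": { "type": "string" },
            "rel": { "type": "string", "enum": [ "primaryMedia" ] },
            "type": { "type": "string", "enum": [ "video" ] }
          },
          "required": [ "description", "headline", "id", "rel", "type" ]
        }
      }
    ]
  }
}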
I want ALL items in the array to be one of the types. What's going on here?
From the spec...
items:
If "items" is a schema, validation succeeds if all elements in the array successfully validate against that schema.
oneOf:
An instance validates successfully against this keyword if it validates successfully against exactly one schema defined by this keyword's value.
What you have done is say: each item in the array should be valid according to [schemaA]. SchemaA: the object should be valid according to one of these schemas: [the schemas inside the oneOf].
I can see how this is confusing when you think "items must be one of the following", but items applies its value schema to each of the items in the array independently.
To make it do what you mean, move the oneOf above items, and give each media type its own items schema.
Here's a pseudo schema:
oneOf: [items: properties: type: const: image], [items: properties: type: const: video]
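Fleshed out (and trimmed to just the type discriminator; the full property lists from your original branches would go inside each items schema), the media property would look something like:
"media": {
  "type": "array",
  "oneOf": [
    {
      "items": {
        "properties": { "type": { "const": "image" } }
      }
    },
    {
      "items": {
        "properties": { "type": { "const": "video" } }
      }
    }
  ]
}
This says every item in the array is an image, or every item is a video, which matches the "ALL items should be one of the types" reading.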
Let me know if any of this isn't clear or you have any follow up questions.

Apache Druid segment merge task submission failure

I am using Druid 0.9.1.1 and trying to merge all the segments of a datasource per day into a single segment. However, the merge task initiation fails with the error:
{"error":"Instantiation of [simple type, class io.druid.timeline.DataSegment] value failed: null (through reference chain: java.util.ArrayList[0])"}
I got the segment details from a segment metadata query. The Druid documentation is of no help here, as it only specifies the raw structure of the overall query, not the required structure of the segment details (below is what the Druid documentation suggests):
{
  "type": "merge",
  "id": <task_id>,
  "dataSource": <task_datasource>,
  "aggregations": <list of aggregators>,
  "segments": <JSON list of DataSegment objects to merge>
}
Example query:
{
  "type": "merge",
  "id": "envoy_merge_task",
  "dataSource": "dcap.envoy.diskmounts.kafka",
  "segments": [
    {
      "id": "dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z",
      "intervals": ["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],
      "columns": {},
      "size": 5460959,
      "numRows": 41577,
      "aggregators": null,
      "queryGranularity": null
    },
    {
      "id": "dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_1",
      "intervals": ["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],
      "columns": {},
      "size": 5448881,
      "numRows": 41577,
      "aggregators": null,
      "queryGranularity": null
    },
    {
      "id": "dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_2",
      "intervals": ["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],
      "columns": {},
      "size": 5454452,
      "numRows": 41571,
      "aggregators": null,
      "queryGranularity": null
    },
    {
      "id": "dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_3",
      "intervals": ["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],
      "columns": {},
      "size": 5456267,
      "numRows": 41569,
      "aggregators": null,
      "queryGranularity": null
    }
  ]
}
I have tried different structures for the "segments" key, with the same error each time.
Example:
"segments": [
  { "id": "dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z" },
  { "id": "dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_1" },
  { "id": "dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_2" },
  { "id": "dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_3" }
]
What is the right structure for segment-merge tasks?
The format I used for segments is:
"segments":[
{
"dataSource": "wikiticker88",
"interval": "2015-09-12T02:00:00.000Z/2015-09-12T03:00:00.000Z",
"version": "2018-01-16T07:23:16.425Z",
"loadSpec": {
"type": "local",
"path": "/home/linux/druid-0.11.0/var/druid/segments/wikiticker88/2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z/2018-01-16T07:23:16.425Z/0/index.zip"
},
"dimensions": "channel,cityName,comment,countryIsoCode,countryName,isAnonymous,isMinor,isNew,isRobot,isUnpatrolled,metroCode,namespace,page,regionIsoCode,regionName,user",
"metrics": "count,added,deleted,delta,user_unique",
"shardSpec": {
"type": "none"
},
"binaryVersion": 9,
"size": 198267,
"identifier": "wikiticker88_2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z_2018-01-16T07:23:16.425Z"
},
]
Use this endpoint to get the metadata for your segments:
/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full
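Putting the two pieces together, a full merge task submission would then look roughly like the sketch below. This is illustrative only: the task id and the longSum aggregator are assumptions, and each entry in segments should be the full object returned by the coordinator endpoint above (a real merge would list two or more segments):
{
  "type": "merge",
  "id": "wikiticker88_merge_task",
  "dataSource": "wikiticker88",
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ],
  "segments": [
    {
      "dataSource": "wikiticker88",
      "interval": "2015-09-12T02:00:00.000Z/2015-09-12T03:00:00.000Z",
      "version": "2018-01-16T07:23:16.425Z",
      "loadSpec": {
        "type": "local",
        "path": "/home/linux/druid-0.11.0/var/druid/segments/wikiticker88/2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z/2018-01-16T07:23:16.425Z/0/index.zip"
      },
      "dimensions": "channel,cityName,comment,countryIsoCode,countryName,isAnonymous,isMinor,isNew,isRobot,isUnpatrolled,metroCode,namespace,page,regionIsoCode,regionName,user",
      "metrics": "count,added,deleted,delta,user_unique",
      "shardSpec": { "type": "none" },
      "binaryVersion": 9,
      "size": 198267,
      "identifier": "wikiticker88_2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z_2018-01-16T07:23:16.425Z"
    }
  ]
}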

ARM - How can I get the access key from a storage account to use in AppSettings later in the template?

I'm creating an Azure Resource Manager template that instantiates multiple resources, including an Azure storage account and an Azure App Service with a Web App.
I'd like to be able to capture the primary access key (or the full connection string, either way is fine) from the newly-created storage account, and use that as a value for one of the AppSettings for the Web App.
Is that possible?
Use the listKeys helper function:
"appSettings": [
{
"name": "STORAGE_KEY",
"value": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value]"
}
]
This quickstart does something similar:
https://azure.microsoft.com/en-us/documentation/articles/cache-web-app-arm-with-redis-cache-provision/
The syntax has changed since the other answer was accepted. The error you will now hit is: "Template language expression property 'key1' doesn't exist, available properties are 'keys'".
Keys are now represented as an array of keys, and the syntax is now:
"StorageAccount": "[Concat('DefaultEndpointsProtocol=https;AccountName=',variables('StorageAccountName'),';AccountKey=',listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('StorageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value)]",
See: http://samcogan.com/retrieve-azure-storage-key-in-arm-script/
I have faced this issue twice, first in 2015 and again in May 2017.
I need to add connection strings to the Web App, and I want to populate them automatically from the resources generated during deployment of the ARM template, so the values don't have to be added manually later.
The first time, I used the old version of the listKeys function (it looks like the old version returned the result as a value rather than as an object):
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), '2015-05-01-preview').key1)]"
},
Today, the latest working version of the template is:
"resources": [
{
"apiVersion": "2015-08-01",
"type": "config",
"name": "connectionstrings",
"dependsOn": [
"[resourceId('Microsoft.Web/Sites/', parameters('webSiteName'))]"
],
"properties": {
"DefaultConnection": {
"value": "[concat('Data Source=tcp:', reference(resourceId('Microsoft.Sql/servers/', parameters('sqlserverName'))).fullyQualifiedDomainName, ',1433;Initial Catalog=', parameters('databaseName'), ';User Id=', parameters('administratorLogin'), '#', parameters('sqlserverName'), ';Password=', parameters('administratorLoginPassword'), ';')]",
"type": "SQLServer"
},
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
},
"AzureWebJobsDashboard": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
}
}
},
Thanks.
Below is an example of adding a storage account to ADLA (Azure Data Lake Analytics):
"storageAccounts": [
{
"name": "[parameters('DataLakeAnalyticsStorageAccountname')]",
"properties": {
"accessKey": "[listKeys(variables('storageAccountid'),'2015-05-01-preview').key1]"
}
}
],
In variables you can keep:
"variables": {
"apiVersion": "[providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]]",
"storageAccountid": "[concat(resourceGroup().id,'/providers/','Microsoft.Storage/storageAccounts/', parameters('DataLakeAnalyticsStorageAccountname'))]"
},

AWS data pipeline activity with multiple inputs

As part of an Amazon AWS data pipeline, I have a hive activity using two unstaged S3 data nodes as input. What I want is to be able to set two script variables on the activity, each pointing to an input data node, but I can't get the syntax right. With the single input, I could write the following and it would work just fine:
INPUT_FOO=#{input.directoryPath}
When I add the second input, I run into a problem of how to reference them since they are now an array of inputs, as you can see in the pipeline definition below. Essentially, I want to achieve the following, but can't figure out the correct syntax:
INPUT_FOO=#{input[1].directoryPath}
INPUT_BAR=#{input[2].directoryPath}
Here's the activity portion of the pipeline definition:
{
  "id": "ActivityId_7u1sR",
  "input": [
    {
      "ref": "DataNodeId_iYnxf"
    },
    {
      "ref": "DataNodeId_162Ka"
    }
  ],
  "schedule": {
    "ref": "DefaultSchedule"
  },
  "scriptUri": "#{myS3ScriptLocation}calculate-results.q",
  "name": "Perform Calculations",
  "runsOn": {
    "ref": "EmrClusterId_jHeiV"
  },
  "scriptVariable": [
    "INPUT_SOURCE1=#{input[1].directoryPath}",
    "OUTPUT=#{output.directoryPath}Results/",
    "INPUT_SOURCE2=#{input[2].directoryPath}"
  ],
  "output": {
    "ref": "DataNodeId_2jY6v"
  },
  "type": "HiveActivity",
  "stage": "false"
}
I plan to keep the tables unstaged and take care of table creation in the hive script so that it's easier to run each Hive activity in isolation as well as in the pipeline itself.
Here's the error I see when using array syntax:
Unable to resolve input[1].directoryPath for object ActivityId_7u1sR
As it stands now, this scenario is not supported, but a feature request was added to support it in the future.