Generate "Instances" definition programmatically to create EMR cluster in StepFunctions

I have a case where I want to dynamically create an EMR cluster based on a user-defined configuration and execute a sequence of steps on it using AWS Step Functions.
For this, I am planning to provide the instance configuration as an input to the step functions workflow.
Based on the StepFunctions-EMR Integration Documentation, the definition is the same as that of the RunJobFlow API.
However, when I try to generate the definition by serializing an instance of JobFlowInstancesConfig to JSON and pass it to the StateMachine as an input, it throws an error saying:
The field 'Instances.KeepJobFlowAliveWhenNoSteps' is required but was missing
Here is the JSON generated post serialization:
{
"instanceFleets": [
{
"instanceFleetType": "MASTER",
"targetOnDemandCapacity": 1,
"instanceTypeConfigs": [
{
"instanceType": "m5.xlarge"
}
]
},
{
"instanceFleetType": "CORE",
"targetOnDemandCapacity": 1,
"instanceTypeConfigs": [
{
"instanceType": "c5.2xlarge"
}
]
}
],
"keepJobFlowAliveWhenNoSteps": true
}
I pass this in the input and access it in my Step Functions definition in the Task below, where I expect $.jobFlowInstancesConfig to resolve to the definition above:
...
"GetCluster": {
"Type": "Task",
"Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
"Parameters": {
"Name.$": "$.clusterName",
"VisibleToAllUsers": true,
"ReleaseLabel": "emr-5.30.0",
"Applications": [
{
"Name": "Spark"
}
],
"ServiceRole": "EMR_DefaultRole",
"JobFlowRole": "EMR_EC2_DefaultRole",
"LogUri": "s3://my-aws-logs/elasticmapreduce/",
"Instances.$": "$.jobFlowInstancesConfig"
}
}
...
My suspicion is that this is failing because StepFunctions expects the field names to start with upper case.
Question: How do I programmatically generate the appropriate definition without having to play around with Strings for generating the JSON? Is there a straightforward way to serialize the above definition to one that will work with StepFunctions?
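One way to avoid string manipulation is to rename the keys recursively on the parsed structure before it is passed as input. The following is a minimal sketch in Python, assuming the only mismatch is the case of the first letter of each key; the helper names upper_first and to_pascal_keys are hypothetical. If the serializer in use offers an upper-camel-case naming strategy, configuring that would be the equivalent fix at the source.

```python
from typing import Any


def upper_first(name: str) -> str:
    """Capitalize only the first character, leaving the rest untouched."""
    return name[:1].upper() + name[1:]


def to_pascal_keys(value: Any) -> Any:
    """Recursively rename dict keys from lowerCamelCase to UpperCamelCase."""
    if isinstance(value, dict):
        return {upper_first(k): to_pascal_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_pascal_keys(item) for item in value]
    # Scalars (strings, numbers, booleans, None) are returned unchanged.
    return value


# Trimmed version of the serialized configuration from the question.
serialized = {
    "instanceFleets": [
        {
            "instanceFleetType": "CORE",
            "targetOnDemandCapacity": 1,
            "instanceTypeConfigs": [{"instanceType": "c5.2xlarge"}],
        }
    ],
    "keepJobFlowAliveWhenNoSteps": True,
}

instances = to_pascal_keys(serialized)
```

Note that this renames every key it finds, so any free-form map whose keys are user data rather than API field names would need to be excluded.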

Related

OPA authorization policies with scopes and roles

I'm using Open Policy Agent as an authorization component together with OIDC enabled apps.
I have input from the apps in the format:
{
"token": {
"scopes": [
"read:books",
"write:books"
]
},
"principal": {
"roles": [
"user",
"moderator"
]
},
"context": {
"action": "read",
"resource": "books"
}
}
Then I have data with access mapping in the format:
{
"user": [
"read:books"
],
"moderator": [
"read:books",
"write:books"
],
"administrator": [
"read:books",
"write:books",
"read:store",
"write:store"
]
}
And the policy currently looks like this:
package whatever.authz
context_scope := concat(":", [input.context.action, input.context.resource])
default allow = false
allow {
token_has_context_scope
principal_has_resource_access
}
token_has_context_scope {
context_scope == input.token.scopes[_]
}
principal_has_resource_access {
principal_role := input.principal.roles[_]
context_scope == data[principal_role][_]
}
This produces the following error:
2 errors occurred:
policy.rego:16: rego_recursion_error: rule principal_has_resource_access is recursive: principal_has_resource_access -> principal_has_resource_access
policy.rego:7: rego_recursion_error: rule allow is recursive: allow -> principal_has_resource_access -> allow
It is the recursive lookup in the principal_has_resource_access function that is causing the error.
I need to check whether one of the principal's roles is allowed to access the resource specified by the context. Since roles is an array, I need to find the union of all access scopes in the data and see if one of them matches the context scope. What am I doing wrong in the policy?
The snippet can be found in the Rego Playground https://play.openpolicyagent.org/p/KhovLRgMup

OPA stores all data under the data path, including policies and rules. The compiler has no way of knowing that the role names you supply in the input won't reference the policy itself (i.e. data["whatever"]), which would be recursive. The easiest way to work around this is to put your data under a top-level attribute that differs from your policy's package name, like this:
{
"attributes": {
"user": [
"read:books"
],
"moderator": [
"read:books",
"write:books"
],
"administrator": [
"read:books",
"write:books",
"read:store",
"write:store"
]
}
}
And update your policy to reference this:
context_scope == data["attributes"][principal_role][_]
Since data.attributes != data.whatever.authz there is no risk of recursion, and the compiler won't complain. You might want a better name than "attributes", but I'll leave that to you :)
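The intended check can be restated outside of Rego to make the logic explicit. This is a minimal Python sketch, not OPA itself; the function name allow and the dictionary layout mirror the input and the "attributes" data document shown above, and everything else is an assumption for illustration.

```python
def allow(request: dict, data: dict) -> bool:
    """Return True when the token and at least one role grant the scope."""
    ctx = request["context"]
    context_scope = f'{ctx["action"]}:{ctx["resource"]}'

    # Equivalent of token_has_context_scope.
    token_ok = context_scope in request["token"]["scopes"]

    # Equivalent of principal_has_resource_access, reading the role mapping
    # from the dedicated "attributes" key rather than the data root.
    role_scopes = data["attributes"]
    principal_ok = any(
        context_scope in role_scopes.get(role, [])
        for role in request["principal"]["roles"]
    )
    return token_ok and principal_ok


data = {
    "attributes": {
        "user": ["read:books"],
        "moderator": ["read:books", "write:books"],
        "administrator": ["read:books", "write:books", "read:store", "write:store"],
    }
}

request = {
    "token": {"scopes": ["read:books", "write:books"]},
    "principal": {"roles": ["user", "moderator"]},
    "context": {"action": "read", "resource": "books"},
}

granted = allow(request, data)
```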

Grafana-LogQL: HowTo extract labels from key-value objects in json array

I am working with ASP.NET 5.0 json logger and logging scopes. I want to populate the scope key-values as labels.
The json produced is of the following format (excerpt):
{
"LogLevel": "Information",
"Scopes": [
{
"Message": "System.Collections.Generic.Dictionary\u00602[System.String,System.Object]",
"MsgId": "c08e834e8edb4287ab8abf0b5510bb53"
},
{
"Message": "System.Collections.Generic.Dictionary\u00602[System.String,System.Object]",
"EventId": "03ec8be0-9975-482e-95b9-2ba6185a4ed4",
"EventName": "someEvent",
"EntityKeyValue": "someNonTechId"
}
]
}
The only way I found was to do
| json MsgId="Scopes[0].MsgId", EventName="Scopes[1].EventName" etc. ...
The problem is that not all scopes are present at all times, so the indices can change.
Is there any solution for that?
BTW we operate on a managed cluster, so custom plugins won't work...

ADF v2 - Web Activity- POST output not retrievable

With Azure Data Factory v2, I created Web Activity using the POST method and got the desired response output.
But I can't get the row data from the output response in the next activity.
How do I reference columns in the rows in this output?
The data in the rows doesn't have any headers.
{
"Tables": [
{
"TableName": "Table_0",
"Columns": [
{
"ColumnName": "MyFieldA",
"DataType": "String",
"ColumnType": "string"
},
{
"ColumnName": "MyFieldB",
"DataType": "String",
"ColumnType": "string"
}
],
"Rows": [
[
"ABCDEF",
"AAAABBBBBCCCDDDDD"
],
[
"CCCCCCC",
"CCCCCCC"
],
I can't reference the values in the rows.
I've tried numerous things, e.g. @activity('WebActivity').output.Rows
Nothing seems to work.
What's the point of getting a response from a web activity and then not being able to reference the output in data factory?

Thanks Pacodel!!! You've helped me out.
To use this in a ForEach loop, I pass the Rows array to my Execute Pipeline activity with @activity('WebActivity').output.Tables[0].Rows:
[
[
"ABCDEF",
"AAAABBBBBCCCDDDDD"
],
[
"CCCCCCC",
"CCCCCCC"
]
]
I can use the following to reference the rows:
@{item()[0]}
@{item()[1]}
I use @item() to populate the parameters of a Stored Procedure activity, which loads my table.
Thanks

For your example,
@activity('WebActivity').output.Tables[0].Rows
will return the following:
[
[
"ABCDEF",
"AAAABBBBBCCCDDDDD"
],
[
"CCCCCCC",
"CCCCCCC"
]
]
If you want to go deeper, you just need to specify the index.
@activity('WebActivity').output.Tables[0].Rows[0][0] will return ABCDEF
If you need to automate this, you can use a ForEach activity with an Execute Pipeline activity inside it, passing the array down as a parameter until you reach the properties that you need.
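The indexing above can be mirrored on the same JSON outside of Data Factory. The following Python sketch is only an illustration of the path Tables[0].Rows[0][0]; it also pairs each headerless row with the names in Columns, which is an assumption about how one might recover field names rather than something Data Factory does automatically.

```python
# Trimmed copy of the Web Activity output shown in the question.
output = {
    "Tables": [
        {
            "TableName": "Table_0",
            "Columns": [
                {"ColumnName": "MyFieldA", "DataType": "String"},
                {"ColumnName": "MyFieldB", "DataType": "String"},
            ],
            "Rows": [
                ["ABCDEF", "AAAABBBBBCCCDDDDD"],
                ["CCCCCCC", "CCCCCCC"],
            ],
        }
    ]
}

table = output["Tables"][0]
rows = table["Rows"]

# Same path as Tables[0].Rows[0][0] in the expression above.
first_value = rows[0][0]

# Rows carry no headers, so column names come from the Columns list,
# matched to row values by position.
column_names = [column["ColumnName"] for column in table["Columns"]]
records = [dict(zip(column_names, row)) for row in rows]
```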

Substitute parts of a typed array in ASP.NET core appsettings.json from secrets/environment variables?

We have an ASP.NET Core web app with this appsettings.json:
{
"Subscriptions": [
{
"Name": "Production",
"PublishSettings": "<PublishData>SECRET</PublishData>",
"Environments": [
{
"Name": "Prod",
"DeploymentServiceNames": [
"api1",
"api2",
"api3"
]
}
]
},
{
"Name": "Test",
"PublishSettings": "<PublishData>SECRET</PublishData>",
"Environments": [
{
"Name": "Test1",
"DeploymentServiceNames": [
"api1",
"api2"
]
},
{
"Name": "Test2",
"DeploymentServiceNames": [
"api1",
"api2"
]
}
]
}
]
}
The PublishSettings values are secret so I want these in my local user secrets file, and in environment variables for my deployments. But, because Subscriptions is an array I'm not sure how. I don't particularly want to swap in the entire Subscriptions section. Is there a way to swap in a single property for each item in such an array, perhaps by defining a key property on the strongly typed subscription model?

When you load configuration in .NET Core, under the hood it's represented as a set of key-value pairs (both key and value have string type) supplied by added configuration providers.
For example, appsettings.json will be represented by JsonConfigurationProvider as the following settings list:
{Subscriptions:0:Environments:0:DeploymentServiceNames:0, api1}
{Subscriptions:0:Environments:0:DeploymentServiceNames:1, api2}
{Subscriptions:0:Environments:0:DeploymentServiceNames:2, api3}
{Subscriptions:0:Environments:0:Name, Prod}
{Subscriptions:0:Name, Production}
{Subscriptions:0:PublishSettings, <PublishData>SECRET</PublishData>}
{Subscriptions:1:Environments:0:DeploymentServiceNames:0, api1}
{Subscriptions:1:Environments:0:DeploymentServiceNames:1, api2}
{Subscriptions:1:Environments:0:Name, Test1}
{Subscriptions:1:Environments:1:DeploymentServiceNames:0, api1}
{Subscriptions:1:Environments:1:DeploymentServiceNames:1, api2}
{Subscriptions:1:Environments:1:Name, Test2}
{Subscriptions:1:Name, Test}
{Subscriptions:1:PublishSettings, <PublishData>SECRET</PublishData>}
As you can see, the JSON structure is flattened and keys are built by joining the nested section names with a colon. Array elements are added with their index as the name.
If you add another configuration source, e.g. environment variables or another secrets JSON file, that contains settings with the same keys, its values overwrite the earlier ones.
So if you want to add or overwrite PublishSettings, you could add either another JSON file as configuration source:
{
"Subscriptions": [
{
"PublishSettings": "<PublishData>SECRET</PublishData>"
},
{
"PublishSettings": "<PublishData>SECRET</PublishData>"
}
]
}
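Or add it as environment variables with the following keys (on platforms where a colon is not allowed in a variable name, use a double underscore instead, e.g. Subscriptions__0__PublishSettings):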
Subscriptions:0:PublishSettings
Subscriptions:1:PublishSettings
Such an override (or addition) is transparent to the .NET Core configuration binder. The settings POCO will contain the value of PublishSettings from the last configuration source that provides it.
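The flattening and last-source-wins behaviour described above can be modelled in a few lines. This is a simplified Python sketch of the idea, not the .NET implementation: the flatten helper and the sample secret values are hypothetical, and it ignores details such as the case-insensitive key comparison that .NET configuration uses.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into colon-joined string keys."""
    pairs = {}
    if isinstance(value, dict):
        for key, child in value.items():
            pairs.update(flatten(child, f"{prefix}:{key}" if prefix else key))
    elif isinstance(value, list):
        # Array elements use their index as the section name.
        for index, child in enumerate(value):
            pairs.update(flatten(child, f"{prefix}:{index}" if prefix else str(index)))
    else:
        pairs[prefix] = str(value)
    return pairs


appsettings = {
    "Subscriptions": [
        {"Name": "Production", "PublishSettings": "<PublishData>SECRET</PublishData>"},
        {"Name": "Test", "PublishSettings": "<PublishData>SECRET</PublishData>"},
    ]
}

# Second source containing only the values to override (placeholder secrets).
secrets = {
    "Subscriptions": [
        {"PublishSettings": "real-secret-1"},
        {"PublishSettings": "real-secret-2"},
    ]
}

# Later sources win when keys collide; untouched keys are preserved.
settings = {**flatten(appsettings), **flatten(secrets)}
```

Because the merge happens per flattened key, overriding Subscriptions:1:PublishSettings leaves Subscriptions:1:Name and every other key from the first source intact.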

AWS data pipeline activity with multiple inputs

As part of an Amazon AWS data pipeline, I have a hive activity using two unstaged S3 data nodes as input. What I want is to be able to set two script variables on the activity, each pointing to an input data node, but I can't get the syntax right. With the single input, I could write the following and it would work just fine:
INPUT_FOO=#{input.directoryPath}
When I add the second input, I run into a problem of how to reference them since they are now an array of inputs, as you can see in the pipeline definition below. Essentially, I want to achieve the following, but can't figure out the correct syntax:
INPUT_FOO=#{input[1].directoryPath}
INPUT_BAR=#{input[2].directoryPath}
Here's the activity portion of the pipeline definition:
{
"id": "ActivityId_7u1sR",
"input": [
{
"ref": "DataNodeId_iYnxf"
},
{
"ref": "DataNodeId_162Ka"
}
],
"schedule": {
"ref": "DefaultSchedule"
},
"scriptUri": "#{myS3ScriptLocation}calculate-results.q",
"name": "Perform Calculations",
"runsOn": {
"ref": "EmrClusterId_jHeiV"
},
"scriptVariable": [
"INPUT_SOURCE1=#{input[1].directoryPath}",
"OUTPUT=#{output.directoryPath}Results/",
"INPUT_SOURCE2=#{input[2].directoryPath}"
],
"output": {
"ref": "DataNodeId_2jY6v"
},
"type": "HiveActivity",
"stage": "false"
}
I plan to keep the tables unstaged and take care of table creation in the hive script so that it's easier to run each Hive activity in isolation as well as in the pipeline itself.
Here's the error I see when using array syntax:
Unable to resolve input[1].directoryPath for object ActivityId_7u1sR'

As it stands now, this scenario is not supported, but a feature request was added to support it in the future.
