Can I customize partitioning in Kinesis Firehose before delivering to S3? - amazon-s3

I have a Firehose stream that is intended to ingest millions of events from different sources and of different event types. The stream should deliver all data to one S3 bucket as a store of raw/unaltered data.
I was thinking of partitioning this data in S3 based on metadata embedded within the event message, like event-source, event-type and event-date.
However, Firehose follows its default partitioning based on record arrival time. Is it possible to customize this partitioning behavior to fit my needs?
Update: The accepted answer has been changed, as a new answer indicates the feature has been available since September 2021.

As of writing this, the dynamic partitioning feature Vlad has mentioned is still pretty new. I needed it to be part of a CloudFormation template, which was still not properly documented. I had to add DynamicPartitioningConfiguration to get it working properly. The MetadataExtractionQuery syntax was also not properly documented.
MyKinesisFirehoseStream:
  Type: AWS::KinesisFirehose::DeliveryStream
  ...
  Properties:
    ExtendedS3DestinationConfiguration:
      Prefix: "clients/client_id=!{partitionKeyFromQuery:client_id}/dt=!{timestamp:yyyy-MM-dd}/"
      ErrorOutputPrefix: "errors/!{firehose:error-output-type}/"
      DynamicPartitioningConfiguration:
        Enabled: "true"
        RetryOptions:
          DurationInSeconds: "300"
      ProcessingConfiguration:
        Enabled: "true"
        Processors:
          - Type: AppendDelimiterToRecord
          - Type: MetadataExtraction
            Parameters:
              - ParameterName: MetadataExtractionQuery
                ParameterValue: "{client_id:.client_id}"
              - ParameterName: JsonParsingEngine
                ParameterValue: JQ-1.6

Since September 1st, 2021, AWS Kinesis Firehose supports this feature. Read the announcement blog post here.
From the documentation:
You can use the Key and Value fields to specify the data record parameters to be used as dynamic partitioning keys and jq queries to generate dynamic partitioning key values. ...
Here is how it looks in the console UI (screenshot not included).
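As a rough sketch of the equivalent configuration (the event_type field name is only an illustration, not from the original answer), you extract a partition key with a jq query such as {event_type: .event_type} and then reference it in the S3 bucket prefix:

Prefix: "event_type=!{partitionKeyFromQuery:event_type}/dt=!{timestamp:yyyy-MM-dd}/"
ErrorOutputPrefix: "errors/!{firehose:error-output-type}/"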

No. You cannot 'partition' based upon event content.
Some options are:
Send to separate Firehose streams
Send to a Kinesis Data Stream (instead of Firehose) and write your own custom Lambda function to process and save the data (See: AWS Developer Forums: Athena and Kinesis Firehose)
Use Kinesis Analytics to process the message and 'direct' it to different Firehose streams
If you are going to use the output with Amazon Athena or Amazon EMR, you could also consider converting it into Parquet format, which has much better performance. This would require post-processing of the data in S3 as a batch rather than converting the data as it arrives in a stream.

To build on John's answer, if you don't have near-real-time streaming requirements, we've found batch processing with Athena to be a simple solution.
Kinesis streams to a table, unpartitioned_event_data, which can make use of the native record-arrival-time partitioning.
We define another Athena table, partitioned_event_table, with custom partition keys, and make use of Athena's INSERT INTO capabilities. Athena will automatically repartition your data in the format you want without requiring any custom consumers or new infrastructure to manage. This can be scheduled with a cron job, SNS, or something like Airflow.
What's cool is you can create a view that does a UNION of the two tables to query historical and real-time data in one place.
We actually dealt with this problem at Radar and talk about more trade-offs in this blog post.

To expand on Murali's answer, we have implemented it in CDK:
Our incoming JSON data looks something like this:
{
  "data": {
    "timestamp": 1633521266990,
    "defaultTopic": "Topic",
    "data": {
      "OUT1": "Inactive",
      "Current_mA": 3.92
    }
  }
}
The CDK code looks as follows:
const DeliveryStream = new CfnDeliveryStream(this, 'deliverystream', {
  deliveryStreamName: 'deliverystream',
  extendedS3DestinationConfiguration: {
    cloudWatchLoggingOptions: {
      enabled: true,
    },
    bucketArn: Bucket.bucketArn,
    roleArn: deliveryStreamRole.roleArn,
    prefix: 'defaultTopic=!{partitionKeyFromQuery:defaultTopic}/!{timestamp:yyyy/MM/dd}/',
    errorOutputPrefix: 'error/!{firehose:error-output-type}/',
    bufferingHints: {
      intervalInSeconds: 60,
    },
    dynamicPartitioningConfiguration: {
      enabled: true,
    },
    processingConfiguration: {
      enabled: true,
      processors: [
        {
          type: 'MetadataExtraction',
          parameters: [
            {
              // the extracted key name must match the one used in
              // partitionKeyFromQuery above
              parameterName: 'MetadataExtractionQuery',
              parameterValue: '{defaultTopic: .data.defaultTopic}',
            },
            {
              parameterName: 'JsonParsingEngine',
              parameterValue: 'JQ-1.6',
            },
          ],
        },
        {
          type: 'AppendDelimiterToRecord',
          parameters: [
            {
              parameterName: 'Delimiter',
              parameterValue: '\\n',
            },
          ],
        },
      ],
    },
  },
})
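For illustration (this is my reading of the configuration, not part of the original answer), the sample record above would be delivered under a key along these lines; note that !{timestamp:...} uses the approximate arrival time, not the timestamp field inside the payload:

s3://<bucket>/defaultTopic=Topic/<yyyy>/<MM>/<dd>/<delivered-object-name>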

My scenario is: Firehose needs to deliver data to S3, tied to a Glue table, with Parquet as the output format and dynamic partitioning enabled, because I want to partition by the year, month, and day contained in the data I push to Firehose instead of the default arrival-time partitioning.
Below is the working code:
rawdataFirehose:
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamName: !Join ["-", [rawdata, !Ref AWS::StackName]]
    DeliveryStreamType: DirectPut
    ExtendedS3DestinationConfiguration:
      BucketARN: !GetAtt rawdataS3bucket.Arn
      Prefix: parquetdata/year=!{partitionKeyFromQuery:year}/month=!{partitionKeyFromQuery:month}/day=!{partitionKeyFromQuery:day}/
      BufferingHints:
        IntervalInSeconds: 300
        SizeInMBs: 128
      ErrorOutputPrefix: errors/
      RoleARN: !GetAtt FirehoseRole.Arn
      DynamicPartitioningConfiguration:
        Enabled: true
      ProcessingConfiguration:
        Enabled: true
        Processors:
          - Type: MetadataExtraction
            Parameters:
              - ParameterName: MetadataExtractionQuery
                ParameterValue: "{year:.year,month:.month,day:.day}"
              - ParameterName: "JsonParsingEngine"
                ParameterValue: "JQ-1.6"
      DataFormatConversionConfiguration:
        Enabled: true
        InputFormatConfiguration:
          Deserializer:
            HiveJsonSerDe: {}
        OutputFormatConfiguration:
          Serializer:
            ParquetSerDe: {}
        SchemaConfiguration:
          CatalogId: !Ref AWS::AccountId
          RoleARN: !GetAtt FirehoseRole.Arn
          DatabaseName: !Ref rawDataDB
          TableName: !Ref rawDataTable
          Region:
            Fn::ImportValue: AWSRegion
          VersionId: LATEST

FirehoseRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service: firehose.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: !Sub firehose-glue-${Envname}
        PolicyDocument: |
          {
            "Version": "2012-10-17",
            "Statement": [
              {
                "Effect": "Allow",
                "Action": [
                  "glue:*",
                  "iam:ListRolePolicies",
                  "iam:GetRole",
                  "iam:GetRolePolicy",
                  "tag:GetResources",
                  "s3:*",
                  "cloudwatch:*",
                  "ssm:*"
                ],
                "Resource": "*"
              }
            ]
          }
Note:
rawDataDB is a reference to the Glue database
rawDataTable is a reference to the Glue table
rawdataS3bucket is a reference to the S3 bucket
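Since the prefix partitions by year, month, and day, the referenced Glue table needs matching partition keys and a Parquet storage descriptor. A minimal sketch of such a table follows; the table name and column list are assumptions for illustration, not part of the original answer:

rawDataTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref rawDataDB
    TableInput:
      Name: rawdata
      TableType: EXTERNAL_TABLE
      PartitionKeys:
        # must match the partitionKeyFromQuery names used in the Firehose prefix
        - Name: year
          Type: string
        - Name: month
          Type: string
        - Name: day
          Type: string
      StorageDescriptor:
        Location: !Sub s3://${rawdataS3bucket}/parquetdata/
        InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
        Columns:
          # replace with the actual schema of the records you push to Firehose
          - Name: payload
            Type: string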

Related

Define a condition in cloudformation template with alarms

How do I define/declare a condition to create an alarm only in prod?
Would using Condition: IsProd work to create the alarm in prod?
Would the configuration below work, and how should the condition be defined?
LambdaInvocationsAlarm:
  Condition: IsProd
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Lambda invocations
    AlarmName: LambdaInvocationsAlarm
    ComparisonOperator: LessThanLowerOrGreaterThanUpperThreshold
    EvaluationPeriods: 1
    Metrics:
      - Expression: ANOMALY_DETECTION_BAND(m1, 2)
        Id: ad1
      - Id: m1
        MetricStat:
          Metric:
            MetricName: Invocations
            Namespace: AWS/Lambda
          Period: !!int 86400
          Stat: Sum
    ThresholdMetricId: ad1
    TreatMissingData: breaching
As @Marcin said, you should explain more precisely what you have tried and what is blocking you.
But what you suggest could work, yes: you can define a Condition named isProd and use it to create (or not create) resources. Regarding this condition: AWS does not know what a production stage is in your environment, so you need to specify that. Does your production stage match an account? Does it match a region? Something else?
As an example, if we assume that your production stage matches a specific AWS account, you could define the condition as below (it's JSON, feel free to convert to YAML):
{
  "Parameters": {
    "ProdAccountParameter": {
      "Type": "String",
      "Description": "Enter the production account identifier."
    }
  },
  "Conditions": {
    "isProd": {
      "Fn::Equals": [
        { "Ref": "ProdAccountParameter" },
        { "Ref": "AWS::AccountId" }
      ]
    }
  },
  ...
}
(Then, when deploying the template, you'll need to provide your AWS production account).
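For reference, a YAML conversion of the same parameter and condition might look like the following; note that the condition name has to match what the alarm references (IsProd in the question above):

Parameters:
  ProdAccountParameter:
    Type: String
    Description: Enter the production account identifier.

Conditions:
  # true only when the stack is deployed into the production account
  IsProd: !Equals [!Ref ProdAccountParameter, !Ref AWS::AccountId]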

Fargate environment variable redis.yaml

I have a microservice and I need to pass in a file, redis.yaml, to configure ElastiCache for Redis.
Assume I have a file called redis.yaml with contents:
clusterServersConfig:
  idleConnectionTimeout: 10000
  pingTimeout: 1000
  connectTimeout: 10000
  timeout: 60000
  retryAttempts: 3
  retryInterval: 60000
And in my application.properties I use:
redis.config.location=file:/opt/usr/conf/redis.yaml
In Kubernetes, I can just create a secret with --from-file redis.yaml and the application runs properly.
I do not know how to do the same with AWS Fargate. I believe it could be done with AWS SSM but any help/steps on how to do it would be appreciated.
For externalized configuration, Fargate supports environment variables, which can be passed in the task definition.
"environment": [
{ "name": "env_name1", "value": "value1" },
{ "name": "env_name2", "value": "value2" }
]
If it's sensitive information, store it in the AWS SSM Parameter Store (you can use KMS) and specify the parameter key in the task definition.
{
  "containerDefinitions": [{
    "secrets": [{
      "name": "environment_variable_name",
      "valueFrom": "arn:aws:ssm:region:aws_account_id:parameter/parameter_name"
    }]
  }]
}
In your case, you can convert your YAML to JSON, store it in the Parameter Store, and refer to it in the task definition.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/specifying-sensitive-data.html
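For illustration only (this conversion is mine, not part of the original answer), the redis.yaml above expressed as JSON, which could be stored as the parameter value, would be:

{
  "clusterServersConfig": {
    "idleConnectionTimeout": 10000,
    "pingTimeout": 1000,
    "connectTimeout": 10000,
    "timeout": 60000,
    "retryAttempts": 3,
    "retryInterval": 60000
  }
}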

How to define CF resource as function event source in serverless framework

I'm trying to create an AWS Lambda function with the Serverless Framework. The Lambda is triggered through an AWS IoT Topic Rule. In case the execution of the rule fails, I want an error action to be executed. The entire configuration should take place within the serverless.yml.
As far as I can tell from the documentation there is no option to describe an errorAction for an iot event:
functions:
  foobar:
    events:
      - iot:
          errorAction: ?
It is possible, though, to define a CloudFormation resource with an ErrorAction inside the serverless.yml:
resources:
  Resources:
    FoobarIotTopicRule1:
      Type: AWS::IoT::TopicRule
      Properties:
        ErrorAction:
          Republish:
            RoleArn: arn:aws:iam::1234567890:role/service-role/iot_execution_role
            Topic: FAILURE
But then I don't know how to link the resource to act as a trigger of the Lambda function.
functions:
  foobar:
    handler: index.handler
    events:
      - iot:
          name: iot_magic_rule
          sql: "SELECT * FROM 'my/dedicated/topic'"
          enabled: true
          sqlVersion: '2016-03-23'

resources:
  Resources:
    FoobarIotTopicRule1:
      Type: AWS::IoT::TopicRule
      Properties:
        RuleName: iot_magic_rule
        TopicRulePayload:
          AwsIotSqlVersion: '2016-03-23'
          RuleDisabled: false
          Sql: "SELECT * FROM 'my/dedicated/topic'"
          ErrorAction:
            Republish:
              RoleArn: arn:aws:iam::1234567890:role/service-role/iot_execution_role
              Topic: FAILURE
With the above configuration, deploying to AWS fails because CloudFormation tries to create the AWS IoT Topic Rule twice: once for the definition in events and once as the defined resource FoobarIotTopicRule1.
EDIT1
Defining the Lambda action inside the IoT Topic Rule resource creates the rule as intended, with the Lambda action and the error action. Unfortunately, the rule does not show up as a trigger within the Lambda.
To be able to define an AWS IoT Topic Rule with an ErrorAction that will also show up as a trigger event on AWS Lambda, the configuration should look somewhat like this:
functions:
  foobar:
    handler: index.handler

resources:
  Resources:
    FoobarIotTopicRule1:
      Type: AWS::IoT::TopicRule
      Properties:
        RuleName: iot_magic_rule
        TopicRulePayload:
          AwsIotSqlVersion: '2016-03-23'
          RuleDisabled: false
          Sql: "SELECT * FROM 'my/dedicated/topic'"
          Actions:
            - Lambda:
                FunctionArn: { "Fn::GetAtt": ['FoobarLambdaFunction', 'Arn'] }
          ErrorAction:
            Republish:
              RoleArn: arn:aws:iam::1234567890:role/service-role/iot_execution_role
              Topic: FAILURE
    FoobarLambdaPermissionIotTopicRule1:
      Type: AWS::Lambda::Permission
      Properties:
        FunctionName: { "Fn::GetAtt": ["FoobarLambdaFunction", "Arn"] }
        Action: lambda:InvokeFunction
        Principal: { "Fn::Join": ["", ["iot.", { "Ref": "AWS::URLSuffix" }]] }
        SourceArn:
          Fn::Join:
            - ""
            - - "arn:"
              - "Ref": "AWS::Partition"
              - ":iot:"
              - "Ref": "AWS::Region"
              - ":"
              - "Ref": "AWS::AccountId"
              - ":rule/"
              - "Ref": "FoobarIotTopicRule1"

splitting swagger definition across many files

Question: how can I split swagger definition across files? What are the possibilities in that area? The question details are described below:
example of what I want - in RAML
I do have experience in RAML and what I do is, for example:
/settings:
  description: |
    This resource defines application & components configuration
  get:
    is: [ includingCustomHeaders ]
    description: |
      Fetch entire configuration
    responses:
      200:
        body:
          example: !include samples/settings.json
          schema: !include schemas/settings.json
The last two lines are important here (the ones with !include <filepath>): in RAML I can split my entire contract into many files that just get included dynamically by the RAML parser (and the RAML parser is used by all tools based on RAML).
My benefit from this is that:
my contract is clearer and easier to maintain, because schemas are not inline
and, most importantly, I can reuse the schema files with other tools to do validation, mock generation, stubs, test generation, etc. In other words, this way I can reuse the schema files both within the contract (RAML, in this case) and with other tools (non-RAML, non-Swagger, just JSON Schema-based ones).
back to Swagger
As far as I have read, Swagger supports the $ref keyword, which allows loading external files. But are those files fetched through HTTP/AJAX, or can they just be local files?
And is that supported by the whole specification, or do only some tools support it while others don't?
What I found here is that the input for swagger has to be one file. And this is extremely inconvenient for big projects:
because of size
and because I can't reuse the schema if I want to use something non-swagger
Or, in other words, can I achieve the same with swagger, that I can with RAML - in terms of splitting files?
The specification allows for references in multiple locations, but not everywhere. These references are resolved depending on where the specification is being hosted, and on what you're trying to do.
For something like rendering a dynamic user interface, then yes, you do need to eventually load the entire definition into "a single object", which may be composed from many files. If performing code generation, the definitions may be loaded directly from the file system. But ultimately there are Swagger parsers doing the resolution, which is much more fine-grained and controllable in Swagger than in other definition formats.
In your case, you would use a JSON pointer to the schema reference:
responses:
  200:
    description: the response
    schema:
      # via local reference
      $ref: '#/definitions/myModel'

      # via absolute reference
      $ref: 'http://path/to/your/resource'

      # via relative reference, which would be 'relative to where this doc is loaded'
      $ref: 'resource.json#/myModel'

      # via inline definition
      type: object
      properties:
        id:
          type: string
When I split OpenAPI V3 files using references, I try to avoid the sock drawer anti-pattern and instead use functional groupings for the YAML files.
I also make it so that each YAML file itself is a valid OpenAPI V3 spec.
I start out with the openapi.yaml file.
openapi: 3.0.3
info:
  title: MyAPI
  description: |
    This is the public API for my stuff.
  version: "3"
tags:
  # NOTE: the name is needed as the info block uses `title` rather than name
  - name: Authentication
    $ref: 'authn.yaml#/info'
paths:
  # NOTE: here are the references to the other OpenAPI files
  # from the path. Note because OpenAPI requires paths to
  # start with `/` and that is already used as a separator
  # replace the `/` with `%2F` for the path reference.
  '/authn/start':
    $ref: 'authn.yaml#/paths/%2Fstart'
Then in the functional group:
openapi: 3.0.3
info:
  title: Authentication
  description: |
    This is the authentication module.
  version: "3"
paths:
  # NOTE: don't include the `/authn` prefix here; that top level grouping is
  # in the `openapi.yaml` file.
  '/start':
    get:
      responses:
        "200":
          description: OK
By doing this separation you can independently test each file or the whole API as a group.
There may be points where you repeat yourself, but by doing this you limit the chance of breaking changes to other API endpoints when using a "common" library.
However, you should still have a common definition library for some things such as:
errors
security
There is a limitation to this approach, and that is discriminators (it may be a ReDoc issue, though): if you have types with discriminators outside of the openapi.yaml, ReDoc fails to render them correctly.
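As a rough sketch (the file and schema names are illustrative, not from the original answer), such a shared definitions file and a reference to it from one of the per-module files could look like this:

# common.yaml - shared definitions used by the per-module files
openapi: 3.0.3
info:
  title: Common definitions
  version: "3"
paths: {}
components:
  schemas:
    Error:
      type: object
      properties:
        code:
          type: string
        message:
          type: string

# in authn.yaml, reusing the shared Error schema for an error response
responses:
  "400":
    description: Bad request
    content:
      application/json:
        schema:
          $ref: 'common.yaml#/components/schemas/Error'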
See this answer for details on how to split your Swagger documentation across many files. This is done using JSON, but the same concept can apply to RAML.
EDIT: Adding content of link here
The basic structure of your Swagger JSON should look something like this:
{
  "swagger": "2.0",
  "info": {
    "title": "",
    "version": "version number here"
  },
  "basePath": "/",
  "host": "host goes here",
  "schemes": [
    "http"
  ],
  "produces": [
    "application/json"
  ],
  "paths": {},
  "definitions": {}
}
The paths and definitions are where you need to insert the paths that your API supports and the model definitions describing your response objects. You can populate these objects dynamically. One way of doing this could be to have a separate file for each entity's paths and models.
Let's say one of the objects in your API is a "car".
Path:
{
  "paths": {
    "/cars": {
      "get": {
        "tags": [
          "Car"
        ],
        "summary": "Get all cars",
        "description": "Returns all of the cars.",
        "responses": {
          "200": {
            "description": "An array of cars",
            "schema": {
              "type": "array",
              "items": {
                "$ref": "#/definitions/car"
              }
            }
          },
          "404": {
            "description": "error fetching cars",
            "schema": {
              "$ref": "#/definitions/error"
            }
          }
        }
      }
    }
  }
}
Model:
{
  "car": {
    "properties": {
      "_id": {
        "type": "string",
        "description": "car unique identifier"
      },
      "make": {
        "type": "string",
        "description": "Make of the car"
      },
      "model": {
        "type": "string",
        "description": "Model of the car."
      }
    }
  }
}
You could then put each of these in their own files. When you start your server, you could grab these two JSON objects, and append them to the appropriate object in your base swagger object (either paths or definitions) and serve that base object as your Swagger JSON object.
You could also further optimize this by only doing the appending once, when the server is started (since the API documentation will not change while the server is running). Then, when the "serve Swagger docs" endpoint is hit, you can just return the cached Swagger JSON object that you created when the server was started.
The "serve Swagger docs" endpoint can be intercepted by catching a request to /api-docs like below:
app.get('/api-docs', function(req, res) {
// return the created Swagger JSON object here
});
You can use $ref, but it does not give you much flexibility. I suggest processing the YAML with an external tool like 'Yamlinc', which merges multiple files into one using a '$include' tag.
read more: https://github.com/javanile/yamlinc

AWS data pipeline activity with multiple inputs

As part of an AWS Data Pipeline, I have a Hive activity using two unstaged S3 data nodes as input. What I want is to be able to set two script variables on the activity, each pointing to an input data node, but I can't get the syntax right. With a single input, I could write the following and it would work just fine:
INPUT_FOO=#{input.directoryPath}
When I add the second input, I run into a problem of how to reference them since they are now an array of inputs, as you can see in the pipeline definition below. Essentially, I want to achieve the following, but can't figure out the correct syntax:
INPUT_FOO=#{input[1].directoryPath}
INPUT_BAR=#{input[2].directoryPath}
Here's the activity portion of the pipeline definition:
{
  "id": "ActivityId_7u1sR",
  "input": [
    {
      "ref": "DataNodeId_iYnxf"
    },
    {
      "ref": "DataNodeId_162Ka"
    }
  ],
  "schedule": {
    "ref": "DefaultSchedule"
  },
  "scriptUri": "#{myS3ScriptLocation}calculate-results.q",
  "name": "Perform Calculations",
  "runsOn": {
    "ref": "EmrClusterId_jHeiV"
  },
  "scriptVariable": [
    "INPUT_SOURCE1=#{input[1].directoryPath}",
    "OUTPUT=#{output.directoryPath}Results/",
    "INPUT_SOURCE2=#{input[2].directoryPath}"
  ],
  "output": {
    "ref": "DataNodeId_2jY6v"
  },
  "type": "HiveActivity",
  "stage": "false"
}
I plan to keep the tables unstaged and take care of table creation in the hive script so that it's easier to run each Hive activity in isolation as well as in the pipeline itself.
Here's the error I see when using array syntax:
Unable to resolve input[1].directoryPath for object ActivityId_7u1sR'
As it stands now, this scenario is not supported, but a feature request was added to support it in the future.