Syncing Kafka with AWS S3 with a different directory structure - amazon-s3

We have events coming into Kafka, and we are syncing these events to AWS S3 using Kafka Connect.
The data is visible in S3 in the following directory structure:
bucket_name/sub_folder/
Partition=0/events.json
Partition=1/events.json
Partition=2/events.json
Is there a way to store it in one of the following directory structures instead?
bucket_name/sub_folder/date=today_date/events.json or
bucket_name/sub_folder/Partition=0..2/date=today_date/events.json
The motivation is to store each day's events in that day's directory. I searched the web but could not find a way to do this.
Thanks in advance.

You can use the TimeBasedPartitioner, which partitions data according to ingestion time.
For example, for hourly partitioning:
[…]
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "US",
"timezone": "UTC",
"partition.duration.ms": "3600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "my_record_field_with_timestamp_in",
[…]
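To get the daily layout asked about in the question, a full sink configuration might look roughly like the sketch below (the topic, region and bucket values are placeholders, and the topic name will still appear as a directory level between topics.dir and the date folder):
{
    "name": "s3-sink-daily",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "topics": "<your_topic>",
        "s3.region": "<your_region>",
        "s3.bucket.name": "bucket_name",
        "topics.dir": "sub_folder",
        "flush.size": "1000",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "path.format": "'date'=YYYY-MM-dd",
        "partition.duration.ms": "86400000",
        "locale": "US",
        "timezone": "UTC",
        "timestamp.extractor": "Record"
    }
}
With timestamp.extractor=Record the date folder is derived from the Kafka record timestamp; Wallclock uses ingestion time instead, and RecordField together with timestamp.field uses a field inside the message.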

Related

Azure Datalake Analytics U-SQL with Azure Datalake Storage Gen 2

Question: what is the path forward for using ADLA (U-SQL) with ADLS (Gen2)?
I have been running Azure Data Lake Analytics (U-SQL) jobs via Azure Data Factory (ADF v2) against Azure Data Lake Store Gen1 in East US 2 for quite a while now.
I was planning to deploy another instance to serve Canadian clients and wanted to set up Azure Data Lake Store Gen1 there.
What I tried:
I was not able to create an Azure Datalake Storage Gen 1 account in Central Canada (or any Canadian region for that matter)
I tried to move to Azure Datalake Storage Gen2 but then ran into an issue where Azure Data Factory - U-SQL activity could not be linked with Gen2 Storage linked service to pick up U-SQL script
I stumbled upon multiple links about this topic:
https://feedback.azure.com/forums/327234-data-lake/suggestions/36445702-add-support-for-adls-gen2-to-adla
https://social.msdn.microsoft.com/Forums/en-US/5ce97eef-8940-4591-a19c-934f71825e7d/connect-data-lake-analytics-to-adls-gen-2
which essentially say that U-SQL / ADLA won't be supporting ADLS Gen2.
I am a bit confused, since there is no official documentation on ADLA's direction.
Update:
This is the structure of my U-SQL activity; it runs and processes successfully. (You can try creating a new JSON definition for the U-SQL activity to replace your existing one.)
{
    "name": "pipeline4",
    "properties": {
        "activities": [
            {
                "name": "U-SQL1",
                "type": "DataLakeAnalyticsU-SQL",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "scriptPath": "test1/u-sql.txt",
                    "scriptLinkedService": {
                        "referenceName": "LinkTo0730",
                        "type": "LinkedServiceReference"
                    }
                },
                "linkedServiceName": {
                    "referenceName": "AzureDataLakeAnalytics1",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    }
}
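For completeness, a hypothetical sketch of what the scriptLinkedService ("LinkTo0730") could look like if it points at an ADLS Gen2 account; the account URL and key are placeholders, and account-key authentication is only one of the options:
{
    "name": "LinkTo0730",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<storage-account>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<account-key>"
            }
        }
    }
}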
Original Answer:
I was not able to create an Azure Datalake Storage Gen 1 account in
Central Canada (or any Canadian region for that matter)
On my side, I also cannot create a Data Lake Gen1 account in the Central Canada region; that is a limitation of my subscription. But you can check the resource provider on your side, maybe you can. (The Azure Data Lake Gen1 resource provider is 'Microsoft.DataLakeStore'.)
Resource Manager is supported in all regions, but the resources you deploy might not be supported in all regions. In addition, there may be limitations on your subscription that prevent you from using some regions that support the resource. The resource explorer displays valid locations for the resource type.
Please take a look at this document:
https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-providers-and-types
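One way to run that check with the Azure CLI (assuming the CLI is installed and you are logged into the relevant subscription) is to list the locations the provider reports for Data Lake Store accounts:
az provider show --namespace Microsoft.DataLakeStore \
    --query "resourceTypes[?resourceType=='accounts'].locations" \
    --output json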
I tried to move to Azure Datalake Storage Gen2 but then ran into an
issue where Azure Data Factory - U-SQL activity could not be linked
with Gen2 Storage linked service to pick up U-SQL script
On my side, it seems to be reading the U-SQL script from Gen2. Did you get an error?

azure stream analytics implementation or the best approach

I am new to Stream Analytics and I need help here to achieve a specific task.
I have telemetry data coming from IoT Hub in the format below.
Basically, I will be getting machine telemetry data, along with the stage of the operation on that machine, streamed to IoT Hub.
The stages are indicated with a tag, e.g. "stageid":"stage1". I need to calculate the time taken for each stage using Stream Analytics, based on the timestamp and the stage tag.
Example packets:
[{
    "Payload": {
        "devid": "01",
        "locid": "loc01",
        "machid": "mac01",
        "stageid": "stage1",
        "timestamp": "2020-01-24T09:22:00.3270000Z"
    },
    "Payload": {
        "devid": "02",
        "locid": "loc01",
        "machid": "mac01",
        "stageid": "stage1",
        "timestamp": "2020-01-24T09:22:00.3270000Z"
    }
}]
[{
    "Payload": {
        "devid": "01",
        "locid": "loc01",
        "machid": "mac01",
        "stageid": "stage2",
        "timestamp": "2020-01-24T09:26:00.3270000Z"
    },
    "Payload": {
        "devid": "02",
        "locid": "loc01",
        "machid": "mac01",
        "stageid": "stage2",
        "timestamp": "2020-01-26T09:24:00.3270000Z"
    }
}]
Please help me: can we achieve this with a query, and if so, what would the query be, or what would be the best alternative approach?
Thanks,
To my knowledge, your needs can't be met with ASA's built-in features. ASA is a real-time data collection and analytics service; in other words, data needs to be processed in real time, and the current event can't wait for a later one to do calculations or merges. Even if you could use window functions and GROUP BY, I believe the frequency of the messages pushed by the devices is also variable.
As a workaround, my idea is to use an Azure Function with an IoT Hub trigger. Inside the trigger you could parse the message and save the key columns (stageid, timestamp, devid) into some storage, maybe Azure Table Storage. Before every insert, grab the latest row for the current device and calculate the time taken relative to the current message, then store that duration somewhere else. Finally, update the latest row for each device. A sketch of this is shown below.
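A minimal sketch of that workaround in Python, assuming an Azure Function bound to the IoT Hub's Event Hubs-compatible endpoint (cardinality "one") and two hypothetical Azure Table Storage tables, "StageLatest" and "StageDurations":
import json
import os
from datetime import datetime

import azure.functions as func
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient


def parse_ts(value: str) -> datetime:
    # The sample timestamps carry 7 fractional digits; trim the fraction
    # to keep parsing simple.
    return datetime.strptime(value.split(".")[0], "%Y-%m-%dT%H:%M:%S")


def main(event: func.EventHubEvent):
    # Assumes one "Payload" object per array element in the incoming message.
    items = json.loads(event.get_body().decode("utf-8"))

    conn = os.environ["TABLES_CONNECTION"]  # hypothetical app setting
    latest = TableClient.from_connection_string(conn, table_name="StageLatest")
    durations = TableClient.from_connection_string(conn, table_name="StageDurations")

    for item in items:
        p = item["Payload"]
        devid, stageid, ts = p["devid"], p["stageid"], parse_ts(p["timestamp"])

        try:
            # Grab the latest row for this device and compute the time spent
            # since its previous stage message.
            prev = latest.get_entity(partition_key=devid, row_key="latest")
            elapsed = (ts - parse_ts(prev["timestamp"])).total_seconds()
            durations.upsert_entity({
                "PartitionKey": devid,
                "RowKey": f"{prev['stageid']}-{prev['timestamp']}",
                "seconds": elapsed,
            })
        except ResourceNotFoundError:
            pass  # first message for this device, nothing to compare against

        # Remember the current message as the new "latest" row for the device.
        latest.upsert_entity({
            "PartitionKey": devid,
            "RowKey": "latest",
            "stageid": stageid,
            "timestamp": p["timestamp"],
        })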

How is data in kinesis decrypted before hitting s3

I currently have an architecture where data flows from Kinesis -> Kinesis Firehose -> S3.
I am creating records directly in kinesis using:
aws kinesis put-record --stream-name <some_kinesis_stream> --partition-key 123 --data testdata --profile sandbox
The data when I run:
aws kinesis get-records --shard-iterator --profile sandbox
looks like this:
{
"SequenceNumber": "49597697038430366340153578495294928515816248592826368002",
"ApproximateArrivalTimestamp": 1563835989.441,
"Data": "eyJrZXkiOnsiZW1wX25vIjo1Mjc2OCwiZGVwdF9ubyI6ImQwMDUifSwidmFsdWUiOnsiYmVmb3JlIjpudWxsLCJhZnRlciI6eyJlbXBfbm8iOjUyNzY4LCJkZXB0X25vIjoiZDAwNSIsImZyb21fZGF0ZSI6Nzk2NSwidG9fZGF0ZSI6MjkzMjUzMX0sInNvdXJjZSI6eyJ2ZXJzaW9uIjoiMC45LjUuRmluYWwiLCJjb25uZWN0b3IiOiJteXNxbCIsIm5hbWUiOiJraW5lc2lzIiwic2VydmVyX2lkIjowLCJ0c19zZWMiOjAsImd0aWQiOm51bGwsImZpbGUiOiJteXNxbC1iaW4tY2hhbmdlbG9nLjAwMDAwMiIsInBvcyI6MTU0LCJyb3ciOjAsInNuYXBzaG90Ijp0cnVlLCJ0aHJlYWQiOm51bGwsImRiIjoiZW1wbG95ZWVzIiwidGFibGUiOiJkZXB0X2VtcCIsInF1ZXJ5IjpudWxsfSwib3AiOiJjIiwidHNfbXMiOjE1NjM4MzEzMTI2Njh9fQ==",
"PartitionKey": "-591791328"
}
but in S3, it looks like:
`testdatatestdatatestdatatestdatatestdatatestdatatestdatatestdata`
because I ran put-record several times.
So what is going on? When I run get-records, what records am I obtaining? What is that data? How is that data then decrypted into my original string?
15 days old now, so hopefully you found the answer already.
If not, it seems the reason you see a mismatch between get-records and what is in S3 comes down to how you performed the aws kinesis get-records --shard-iterator --profile sandbox call: you didn't explicitly provide a shard iterator value.
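For reference, a typical sequence for reading a specific shard looks roughly like this (the shard id and iterator type below are assumptions; adjust them for your stream):
SHARD_ITERATOR=$(aws kinesis get-shard-iterator \
    --stream-name <some_kinesis_stream> \
    --shard-id shardId-000000000000 \
    --shard-iterator-type TRIM_HORIZON \
    --query 'ShardIterator' --output text --profile sandbox)

aws kinesis get-records --shard-iterator "$SHARD_ITERATOR" --profile sandbox

# Each Data field is base64-encoded, so "testdata" would come back as "dGVzdGRhdGE=":
echo "dGVzdGRhdGE=" | base64 --decode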
What you saw in S3 is correct and expected based on your --data testdata put-record calls.
testdatatestdatatestdatatestdatatestdatatestdatatestdatatestdata
What you saw in Kinesis is base64 encoded:
"Data": "eyJrZXkiOnsiZW1wX25vIjo1Mjc2OCwiZGVwdF9ubyI6ImQwMDUifSwidmFsdWUiOnsiYmVmb3JlIjpudWxsLCJhZnRlciI6eyJlbXBfbm8iOjUyNzY4LCJkZXB0X25vIjoiZDAwNSIsImZyb21fZGF0ZSI6Nzk2NSwidG9fZGF0ZSI6MjkzMjUzMX0sInNvdXJjZSI6eyJ2ZXJzaW9uIjoiMC45LjUuRmluYWwiLCJjb25uZWN0b3IiOiJteXNxbCIsIm5hbWUiOiJraW5lc2lzIiwic2VydmVyX2lkIjowLCJ0c19zZWMiOjAsImd0aWQiOm51bGwsImZpbGUiOiJteXNxbC1iaW4tY2hhbmdlbG9nLjAwMDAwMiIsInBvcyI6MTU0LCJyb3ciOjAsInNuYXBzaG90Ijp0cnVlLCJ0aHJlYWQiOm51bGwsImRiIjoiZW1wbG95ZWVzIiwidGFibGUiOiJkZXB0X2VtcCIsInF1ZXJ5IjpudWxsfSwib3AiOiJjIiwidHNfbXMiOjE1NjM4MzEzMTI2Njh9fQ==",
So decoding gets you:
{
    "key": {
        "emp_no": 52768,
        "dept_no": "d005"
    },
    "value": {
        "before": null,
        "after": {
            "emp_no": 52768,
            "dept_no": "d005",
            "from_date": 7965,
            "to_date": 2932531
        },
        "source": {
            "version": "0.9.5.Final",
            "connector": "mysql",
            "name": "kinesis",
            "server_id": 0,
            "ts_sec": 0,
            "gtid": null,
            "file": "mysql-bin-changelog.000002",
            "pos": 154,
            "row": 0,
            "snapshot": true,
            "thread": null,
            "db": "employees",
            "table": "dept_emp",
            "query": null
        },
        "op": "c",
        "ts_ms": 1563831312668
    }
}
The reason it didn't match your "testdata" is that you were looking at the wrong shard iterator, possibly on the wrong shard; I'm unsure what your Kinesis setup is exactly.
Give this article a once-over: https://docs.aws.amazon.com/streams/latest/dev/fundamental-stream.html. It should give you the steps to test this workflow.
It seems that you've configured your Firehose to enable server-side data encryption. If this is the case, then the following applies:
When you configure a Kinesis data stream as the data source of a Kinesis Data Firehose delivery stream, Kinesis Data Firehose no longer stores the data at rest. Instead, the data is stored in the data stream.
When you send data from your data producers to your data stream, Kinesis Data Streams encrypts your data using an AWS Key Management Service (AWS KMS) key before storing the data at rest. When your Kinesis Data Firehose delivery stream reads the data from your data stream, Kinesis Data Streams first decrypts the data and then sends it to Kinesis Data Firehose. Kinesis Data Firehose buffers the data in memory based on the buffering hints that you specify. It then delivers it to your destinations without storing the unencrypted data at rest.
Find out more at: https://docs.aws.amazon.com/firehose/latest/dev/encryption.html

Exporting Cloudwatch logs in original format

I am looking for a way to export CloudWatch logs in their original form to S3. I used the console to export a day's worth of logs from a log group, and it seems that a timestamp was prepended to each line, breaking the original JSON formatting. I want to import this into Glue as a JSON file for a test transformation script. The original data is formatted as a normal JSON string when it is ingested into CloudWatch, and it normally looks like:
{ "a": 123, "b": "456", "c": 789 }
After exporting and decompressing, the data looks like:
2019-06-28T00:00:00.099Z { "a": 123, "b": "456", "c": 789 }
This breaks reading the line as a JSON string, since it is no longer in a standard format.
The dataset is fairly large (100 GB+) for this run and will possibly grow larger in the future, so running a CLI command and processing each line locally isn't feasible in my opinion. Is there any known way to do what I am looking to do?
Thank you
Timestamps are automatically added when you push logs to CloudWatch; every log event in CloudWatch has a timestamp.
You can create a subscription filter to Kinesis Firehose, and in Firehose use a Lambda transformation function to reformat the log events (remove the timestamp) before storing the logs in S3; see the sketch after the link below.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html
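A minimal sketch of such a transformation function in Python, assuming the standard Kinesis Data Firehose data-transformation interface with a CloudWatch Logs subscription as the source (records arrive base64-encoded and gzip-compressed, and each log event's "message" field holds the original line without the timestamp prefix):
import base64
import gzip
import json


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))

        if payload.get("messageType") != "DATA_MESSAGE":
            # Drop control messages (e.g. subscription health checks).
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue

        # Keep only the raw messages, one JSON document per line.
        lines = "\n".join(e["message"] for e in payload["logEvents"]) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(lines.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}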

Properly Configuring Kafka Connect S3 Sink TimeBasedPartitioner

I am trying to use the TimeBasedPartitioner of the Confluent S3 sink. Here is my config:
{
    "name": "s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "file": "test.sink.txt",
        "topics": "xxxxx",
        "s3.region": "yyyyyy",
        "s3.bucket.name": "zzzzzzz",
        "s3.part.size": "5242880",
        "flush.size": "1000",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
        "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "timestamp.extractor": "Record",
        "timestamp.field": "local_timestamp",
        "path.format": "YYYY-MM-dd-HH",
        "partition.duration.ms": "3600000",
        "schema.compatibility": "NONE"
    }
}
The data is binary and I use an Avro schema for it. I want to use the actual record field "local_timestamp", which is a UNIX timestamp, to partition the data, say into hourly files.
I start the connector with the usual REST API call
curl -X POST -H "Content-Type: application/json" --data #s3-config.json http://localhost:8083/connectors
Unfortunately, the data is not partitioned as I wish. I also tried removing the flush size because it might interfere, but then I got the error:
{"error_code":400,"message":"Connector configuration is invalid and contains the following 1 error(s):\nMissing required configuration \"flush.size\" which has no default value.\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"}%
Any idea how to properly set the TimeBasedPartitioner? I could not find a working example.
Also how can one debug such a problem or gain further insight what the connector is actually doing?
Greatly appreciate any help or further suggestions.
After studying the code at TimeBasedPartitioner.java and the logs with
confluent log connect tail -f
I realized that both timezone and locale are mandatory, although this is not specified as such in the Confluent S3 Connector documentation. The following config fields solve the problem and let me upload the records properly partitioned to S3 buckets:
"flush.size": "10000",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "US",
"timezone": "UTC",
"partition.duration.ms": "3600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "local_timestamp",
Note two more things: first, a value for flush.size is also necessary; files are eventually partitioned into smaller chunks, never larger than what flush.size specifies. Second, path.format is better chosen as displayed above, so that a proper directory tree structure is generated.
I am still not 100% sure whether the record field local_timestamp is really used to partition the records.
Any comments or improvements are greatly welcome.
Indeed your amended configuration seems correct.
Specifically, setting timestamp.extractor to RecordField allows you to partition your files based on the timestamp field that your records have and which you identify by setting the property timestamp.field.
When instead one sets timestamp.extractor=Record, then a time-based partitioner will use the Kafka timestamp for each record.
Regarding flush.size, setting this property to a high value (e.g. Integer.MAX_VALUE) will be practically synonymous with ignoring it.
Finally, schema.generator.class is no longer required in the most recent versions of the connector.
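As for the debugging part of the question: besides tailing the worker log, the Kafka Connect REST interface can report whether the connector and its tasks are running or have failed, and which configuration was actually applied, e.g.:
curl http://localhost:8083/connectors/s3-sink/status
curl http://localhost:8083/connectors/s3-sink/config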