I currently have an architecture of Kinesis -> Kinesis Firehose -> S3.
I am creating records directly in Kinesis using:
aws kinesis put-record --stream-name <some_kinesis_stream> --partition-key 123 --data testdata --profile sandbox
The data when I run:
aws kinesis get-records --shard-iterator --profile sandbox
looks like this:
{
"SequenceNumber": "49597697038430366340153578495294928515816248592826368002",
"ApproximateArrivalTimestamp": 1563835989.441,
"Data": "eyJrZXkiOnsiZW1wX25vIjo1Mjc2OCwiZGVwdF9ubyI6ImQwMDUifSwidmFsdWUiOnsiYmVmb3JlIjpudWxsLCJhZnRlciI6eyJlbXBfbm8iOjUyNzY4LCJkZXB0X25vIjoiZDAwNSIsImZyb21fZGF0ZSI6Nzk2NSwidG9fZGF0ZSI6MjkzMjUzMX0sInNvdXJjZSI6eyJ2ZXJzaW9uIjoiMC45LjUuRmluYWwiLCJjb25uZWN0b3IiOiJteXNxbCIsIm5hbWUiOiJraW5lc2lzIiwic2VydmVyX2lkIjowLCJ0c19zZWMiOjAsImd0aWQiOm51bGwsImZpbGUiOiJteXNxbC1iaW4tY2hhbmdlbG9nLjAwMDAwMiIsInBvcyI6MTU0LCJyb3ciOjAsInNuYXBzaG90Ijp0cnVlLCJ0aHJlYWQiOm51bGwsImRiIjoiZW1wbG95ZWVzIiwidGFibGUiOiJkZXB0X2VtcCIsInF1ZXJ5IjpudWxsfSwib3AiOiJjIiwidHNfbXMiOjE1NjM4MzEzMTI2Njh9fQ==",
"PartitionKey": "-591791328"
}
but in s3, it looks like:
`testdatatestdatatestdatatestdatatestdatatestdatatestdatatestdata`
because I ran put-record several times.
So what is going on? When I run get-records, which records am I getting back? What is that data, and how is it decoded into my original string?
15 days old now, so hopefully you found the answer already.
If not, the mismatch between get-records and what you see in S3 most likely comes from how you ran the aws kinesis get-records --shard-iterator --profile sandbox call: you didn't explicitly provide a shard iterator value.
What you saw in S3 is correct and expected based on your --data testdata put-record calls.
testdatatestdatatestdatatestdatatestdatatestdatatestdatatestdata
What you saw in Kinesis is base64 encoded:
"Data": "eyJrZXkiOnsiZW1wX25vIjo1Mjc2OCwiZGVwdF9ubyI6ImQwMDUifSwidmFsdWUiOnsiYmVmb3JlIjpudWxsLCJhZnRlciI6eyJlbXBfbm8iOjUyNzY4LCJkZXB0X25vIjoiZDAwNSIsImZyb21fZGF0ZSI6Nzk2NSwidG9fZGF0ZSI6MjkzMjUzMX0sInNvdXJjZSI6eyJ2ZXJzaW9uIjoiMC45LjUuRmluYWwiLCJjb25uZWN0b3IiOiJteXNxbCIsIm5hbWUiOiJraW5lc2lzIiwic2VydmVyX2lkIjowLCJ0c19zZWMiOjAsImd0aWQiOm51bGwsImZpbGUiOiJteXNxbC1iaW4tY2hhbmdlbG9nLjAwMDAwMiIsInBvcyI6MTU0LCJyb3ciOjAsInNuYXBzaG90Ijp0cnVlLCJ0aHJlYWQiOm51bGwsImRiIjoiZW1wbG95ZWVzIiwidGFibGUiOiJkZXB0X2VtcCIsInF1ZXJ5IjpudWxsfSwib3AiOiJjIiwidHNfbXMiOjE1NjM4MzEzMTI2Njh9fQ==",
So decoding gets you:
{
"key":
{
"emp_no": 52768,
"dept_no": "d005"
},
"value":
{
"before": null,
"after":
{
"emp_no": 52768,
"dept_no": "d005",
"from_date": 7965,
"to_date": 2932531
},
"source":
{
"version": "0.9.5.Final",
"connector": "mysql",
"name": "kinesis",
"server_id": 0,
"ts_sec": 0,
"gtid": null,
"file": "mysql-bin-changelog.000002",
"pos": 154,
"row": 0,
"snapshot": true,
"thread": null,
"db": "employees",
"table": "dept_emp",
"query": null
},
"op": "c",
"ts_ms": 1563831312668
}
}
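A minimal Python sketch of that decoding step, in case it helps. It round-trips the "testdata" payload from the question rather than the long string above, and the helper name decode_record_data is only illustrative:

```python
import base64
import json


def decode_record_data(data_b64: str) -> str:
    """Decode the base64 "Data" field of a get-records result into text."""
    return base64.b64decode(data_b64).decode("utf-8")


# Round trip with the payload used in the put-record call from the question.
encoded = base64.b64encode(b"testdata").decode("ascii")
print(encoded)                      # the form the CLI shows in "Data"
print(decode_record_data(encoded))  # back to the original string

# JSON payloads decode the same way and can then be parsed as usual.
sample = base64.b64encode(json.dumps({"op": "c"}).encode("utf-8")).decode("ascii")
print(json.loads(decode_record_data(sample)))
```

So base64 here is only an encoding of the record bytes for display in JSON; it is not encryption.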
It didn't match your "testdata" because you were reading from the wrong shard iterator, possibly on the wrong shard. I'm unsure what your Kinesis setup is exactly.
Give this article a once-over: https://docs.aws.amazon.com/streams/latest/dev/fundamental-stream.html . It should give you the steps to test this workflow.
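The same flow (list the shards, get an iterator for each shard, then read records with it) can be sketched with boto3 roughly as below. This is an untested sketch under assumptions: the function name is made up, the client is passed in by the caller, and only the first batch per shard is read.

```python
def read_stream_from_start(kinesis, stream_name, limit=100):
    """Return the raw payloads of the first batch of records in every shard.

    `kinesis` is assumed to be a boto3 Kinesis client, for example
    boto3.Session(profile_name="sandbox").client("kinesis").
    boto3 returns each record's Data as already-decoded bytes.
    """
    payloads = []
    shards = kinesis.list_shards(StreamName=stream_name)["Shards"]
    for shard in shards:
        # TRIM_HORIZON starts at the oldest record still retained in the shard.
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        response = kinesis.get_records(ShardIterator=iterator, Limit=limit)
        payloads.extend(record["Data"] for record in response["Records"])
    return payloads


# Example wiring (needs boto3 and credentials, so it is left commented out):
# import boto3
# client = boto3.Session(profile_name="sandbox").client("kinesis")
# print(read_stream_from_start(client, "<some_kinesis_stream>"))
```

A single get_records call can come back empty even when the shard has data, so a real reader would keep following NextShardIterator instead of stopping after one call.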
It seems that you've configured your firehose to enable server-side data encryption. If this is the case then the following applies:
When you configure a Kinesis data stream as the data source of a Kinesis Data Firehose delivery stream, Kinesis Data Firehose no longer stores the data at rest. Instead, the data is stored in the data stream.
When you send data from your data producers to your data stream, Kinesis Data Streams encrypts your data using an AWS Key Management Service (AWS KMS) key before storing the data at rest. When your Kinesis Data Firehose delivery stream reads the data from your data stream, Kinesis Data Streams first decrypts the data and then sends it to Kinesis Data Firehose. Kinesis Data Firehose buffers the data in memory based on the buffering hints that you specify. It then delivers it to your destinations without storing the unencrypted data at rest.
Find out more at: https://docs.aws.amazon.com/firehose/latest/dev/encryption.html
I am new to Stream Analytics and I need help achieving a specific task.
I have telemetry data coming from IoT Hub in the format below.
Basically, I will be getting machine telemetry data, including the stage of the operation on each machine, streamed to IoT Hub.
The stages are indicated with a tag, e.g. "stageid":"stage1". I need to calculate the time taken for each stage using Stream Analytics, based on the timestamp and the stage tag.
Packet example:
[{
"Payload": {
"devid": "01",
"locid": "loc01",
"machid": "mac01",
"stageid": "stage1",
"timestamp": "2020-01-24T09:22:00.3270000Z"
},
"Payload": {
"devid": "02",
"locid": "loc01",
"machid": "mac01",
"stageid": "stage1",
"timestamp": "2020-01-24T09:22:00.3270000Z"
}
}]
[{
"Payload": {
"devid": "01",
"locid": "loc01",
"machid": "mac01",
"stageid": "stage2",
"timestamp": "2020-01-24T09:26:00.3270000Z"
},
"Payload": {
"devid": "02",
"locid": "loc01",
"machid": "mac01",
"stageid": "stage2",
"timestamp": "2020-01-26T09:24:00.3270000Z"
}
}]
Please help: can this be achieved with a query, and if so, what would the query be? Otherwise, what is the best alternative approach?
Thanks,
To my knowledge, your requirement can't be implemented with ASA built-in features. ASA is a real-time data collection and analytics service. In other words, data needs to be processed in real time; the current data can't wait for the next dataset in order to do calculations or merges. Even if you could use window functions and GROUP BY, I believe the frequency of messages pushed by the device is also variable.
As a workaround, my idea is to use an Azure Function with an IoT Hub trigger. Inside the function, you could parse the message and save the key columns (stageid, timestamp, devid) into some storage, maybe Azure Table storage. Before every insert, grab the latest row for the current device and calculate the time elapsed between it and the current message, so that you can store that duration elsewhere. Finally, update the latest row for that device.
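The bookkeeping part of that workaround could look roughly like the Python sketch below. It is only an illustration: an in-memory dict stands in for Azure Table storage, the class and function names are made up, and the trigger wiring is left out.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse timestamps like 2020-01-24T09:22:00.3270000Z (7 fractional digits)."""
    base, _, frac = ts.rstrip("Z").partition(".")
    frac = (frac + "000000")[:6]  # strptime's %f accepts at most 6 digits
    parsed = datetime.strptime(f"{base}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


class StageTracker:
    """Keeps the latest (stageid, timestamp) per device; stand-in for a table."""

    def __init__(self):
        self.latest = {}

    def record(self, payload: dict):
        """Store the message; return (previous_stage, seconds_spent) or None."""
        devid = payload["devid"]
        ts = parse_ts(payload["timestamp"])
        previous = self.latest.get(devid)
        self.latest[devid] = (payload["stageid"], ts)
        if previous is None:
            return None  # first message seen for this device
        prev_stage, prev_ts = previous
        return prev_stage, (ts - prev_ts).total_seconds()


tracker = StageTracker()
tracker.record({"devid": "01", "stageid": "stage1",
                "timestamp": "2020-01-24T09:22:00.3270000Z"})
print(tracker.record({"devid": "01", "stageid": "stage2",
                      "timestamp": "2020-01-24T09:26:00.3270000Z"}))
```

With the two sample packets for device 01, the second call reports that stage1 lasted 240 seconds.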
I am looking for a way to export CloudWatch (CW) logs in their original form to S3. I used the console to export a day's worth of logs from a log group, and it seems that a timestamp was prepended to each line, breaking the original JSON formatting. I was looking to import this into Glue as a JSON file for a test transformation script. The original data is formatted as a normal JSON string when imported into CloudWatch, and normally looks like:
{ "a": 123, "b": "456", "c": 789 }
After exporting and decompressing the data, it looks like:
2019-06-28T00:00:00.099Z { "a": 123, "b": "456", "c": 789 }
This breaks reading the line as a JSON string, since it's no longer in a standard format.
The dataset is fairly large (100GB+) for this run and will possibly grow larger in the future, so running a CLI command and processing each line locally isn't feasible in my opinion. Is there any known way to do what I am looking to do?
Thank you
Timestamps are automatically added when you push logs to CloudWatch.
Every log event in CloudWatch has a timestamp.
You can create a subscription filter to Kinesis Firehose and, in Firehose, use a Lambda function to format the log events (remove the timestamp) before storing the logs in S3.
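A rough Python sketch of such a transformation Lambda follows. It assumes the usual Firehose data-transformation contract (records in, records with recordId/result/data out) and the gzip-compressed CloudWatch Logs subscription payload with a logEvents list. In that payload the timestamp is a separate field, so keeping only "message" yields the original line. Treat the handler as a starting point, not a tested deployment.

```python
import base64
import gzip
import json


def handler(event, context):
    """Firehose transform: keep only the original log message of each event."""
    output = []
    for record in event["records"]:
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))
        if payload.get("messageType") != "DATA_MESSAGE":
            # Control messages carry no log data, so drop them.
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        # One original log line per row, newline-delimited for downstream readers.
        lines = "".join(
            log_event["message"].rstrip("\n") + "\n"
            for log_event in payload["logEvents"]
        )
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(lines.encode("utf-8")).decode("ascii"),
        })
    return {"records": output}
```

The output objects in S3 would then contain one JSON document per line, which is the shape the question wanted to feed into Glue.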
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html
We have events coming into Kafka, and we are using Kafka Connect to sync these events to AWS S3.
The data is visible in S3 in the directory structure below:
bucket_name/sub_folder/
Partition=0/events.json
Partition=1/events.json
Partition=2/events.json
Is there a way to store it in the directory structure below instead?
Bucket_name/sub_folder/date=today_date/events.json or Partition=0..2/date=today/events.json
The motivation is to store each day's events in that day's directory. I searched the web but could not find a way to do this.
Thanks in advance.
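You can use the TimeBasedPartitioner, which partitions data by time (wall-clock ingestion time by default, or a timestamp taken from the record, depending on timestamp.extractor).
e.g. for hourly partitioning, using a timestamp field from the record: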
[…]
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
"locale": "US",
"timezone": "UTC",
"partition.duration.ms": "3600000",
"timestamp.extractor": "RecordField",
"timestamp.field": "my_record_field_with_timestamp_in",
[…]
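For the daily layout asked about in the question, the same properties could be adapted roughly as in the Python sketch below, which only builds and prints the relevant config fragment. The 'date'=YYYY-MM-dd path format, the Wallclock extractor and the function name are my assumptions, not something from the original answer.

```python
import json

MS_PER_DAY = 24 * 60 * 60 * 1000  # one partition per day


def daily_partitioner_config(timezone: str = "UTC") -> dict:
    """Partitioner-related properties for one directory per day (sketch)."""
    return {
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "path.format": "'date'=YYYY-MM-dd",  # assumed layout: date=2020-01-24
        "locale": "US",
        "timezone": timezone,
        "partition.duration.ms": str(MS_PER_DAY),
        "timestamp.extractor": "Wallclock",  # ingestion time; assumption
    }


print(json.dumps(daily_partitioner_config(), indent=2))
```

The rest of the connector configuration (topics, bucket, format class and so on) stays as it was; only these keys change.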
I'm taking my first experimental steps with Google's pre-built templates, specifically the Cloud Pub/Sub to BigQuery template.
As a milestone toward my final goal (having physical gadgets report a data stream to Google Cloud Pub/Sub), I would like to achieve something like this:
POSTMAN (make an authenticated POST request with a JSON message to a Google Cloud Platform, GCP, endpoint) --> GCP Pub/Sub --> GCP Dataflow --> GCP BigQuery.
Right now I am following the tutorial in Executing Templates, https://cloud.google.com/dataflow/docs/templates/executing-templates, "Example 2: Custom template, streaming job". This section states:
...This example projects.templates.launch request creates a streaming job
from a template that reads from a Pub/Sub topic and writes to a
BigQuery table. The BigQuery table must already exist with the
appropriate schema. If successful, the response body contains an
instance of LaunchTemplateResponse. ...
and furthermore how to do a POST:
https://dataflow.googleapis.com/v1b3/projects/[YOUR_PROJECT_ID]/templates:launch?gcsPath=gs://[YOUR_BUCKET_NAME]/templates/TemplateName
{
"jobName": "[JOB_NAME]",
"parameters": {
"topic": "projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME]",
"table": "[YOUR_PROJECT_ID]:[YOUR_DATASET].[YOUR_TABLE_NAME]"
},
"environment": {
"tempLocation": "gs://[YOUR_BUCKET_NAME]/temp",
"zone": "us-central1-f"
}
}
There are two things that confuse me. For the sake of a simple example, let's say that I have multiple vehicles that should continuously report their current status. I have already created my MQTT topic: VEHICLE_STATUS. Each of my vehicles should be able to report its:
Position [String]
Speed [Float]
Time [String]
VehicleID [Integer]
=======
I'm aware of the prototype for a PubsubMessage:
{
"data": string,
"attributes": {
string: string,
...
},
"messageId": string,
"publishTime": string,
}
My questions:
How should my BigQuery table schema look (which columns do I need to create)?
How should the entire corresponding JSON message look? What should my vehicle report to the endpoint each time?
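One possible shape, offered as a sketch rather than a confirmed answer: it assumes the template reads JSON from the message data and maps its keys to same-named BigQuery columns, so the table would have columns Position (STRING), Speed (FLOAT), Time (STRING) and VehicleID (INTEGER). The Pub/Sub REST publish call expects the data field to be base64 encoded; the function name and the sample values below are made up.

```python
import base64
import json


def build_publish_body(position: str, speed: float, time: str, vehicle_id: int) -> dict:
    """Build the JSON body for a Pub/Sub topics.publish REST request."""
    # Assumption: keys match the BigQuery column names one to one.
    row = {"Position": position, "Speed": speed, "Time": time, "VehicleID": vehicle_id}
    data = base64.b64encode(json.dumps(row).encode("utf-8")).decode("ascii")
    return {"messages": [{"data": data}]}


body = build_publish_body("57.70,11.97", 42.5, "2020-01-24T09:22:00Z", 7)
print(json.dumps(body, indent=2))
```

The printed body is what would be POSTed to the topic's publish endpoint; messageId and publishTime are assigned by Pub/Sub and are not sent by the client.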
So I am trying to create a simple pipeline in AWS. I want to execute a Step Functions state machine using data generated by a stream, which triggers the first Lambda of the state machine.
What I want to do is the following:
Input data is streamed by AWS Kinesis.
This Kinesis stream is used as a trigger for lambda1, which executes and writes to an S3 bucket.
This would trigger (using Step Functions) lambda2, which would read the content from the given bucket and write it to another bucket.
Now I want to implement a state machine using AWS Step Functions. I have created the state machine, which is quite straightforward:
{
"Comment": "Linear step function test",
"StartAt": "lambda1",
"States": {
"lambda1": {
"Type": "Task",
"Resource": "arn:....",
"Next": "lambda2"
},
"lambda2": {
"Type": "Task",
"Resource": "arn:...",
"End": true
}
}
}
What I want is for Kinesis to trigger the first Lambda and, once it has executed, for the step function to execute lambda2. This does not seem to happen: the step function does nothing, even though lambda1 is triggered from the stream and writes to the S3 bucket. I have the option to manually start a new execution and pass JSON as input, but that is not the workflow I am looking for.
You are kicking off the state machine the wrong way.
You need to add another starter Lambda function that uses the SDK to start the state machine. The process looks like this:
kinesis -> starter (Lambda) -> state machine (runs Lambda 1 and Lambda 2)
The problem with Step Functions is its lack of triggers. There are only three: CloudWatch Events, the SDK, or API Gateway.
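A starter Lambda along those lines might look like the Python sketch below. The factory name, the {"data": ...} input shape and the ARN placeholder are assumptions; the client is expected to be a boto3 Step Functions client, passed in so the function can be exercised without AWS.

```python
import base64
import json


def make_starter_handler(sfn_client, state_machine_arn):
    """Build a Lambda handler that starts one execution per Kinesis record.

    `sfn_client` is assumed to be boto3.client("stepfunctions").
    """
    def handler(event, context):
        started = []
        for record in event["Records"]:
            # Kinesis delivers each record's payload to Lambda base64 encoded.
            payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
            response = sfn_client.start_execution(
                stateMachineArn=state_machine_arn,
                input=json.dumps({"data": payload}),  # input must be a JSON string
            )
            started.append(response["executionArn"])
        return {"started": started}

    return handler


# Example wiring inside the Lambda module (left commented out; needs boto3):
# import boto3
# handler = make_starter_handler(boto3.client("stepfunctions"), "arn:....")
```

The starter function's execution role would also need permission to call states:StartExecution on the state machine.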