How to store streaming data from Amazon Kinesis Data Firehose in an S3 bucket - amazon-s3

I want to improve my current application. I am using Redis via ElastiCache in AWS to store some user data from my website.
This solution is not scalable, so I want to scale it using Amazon Kinesis Data Firehose for auto-scaled streaming output, AWS Lambda to modify my input data, store it in an S3 bucket, and access it using AWS Athena.
I have been googling for several days, but I really don't understand how Amazon Kinesis Data Firehose stores the data in S3.
Is Firehose going to store the data as a single file per record that it processes, or is there a way to append the data to the same CSV, or group the data into different CSVs?

Amazon Kinesis Data Firehose will group data into a file based on:
Size of data (eg 5MB)
Duration (eg every 5 minutes)
Whichever one hits the limit first will trigger the data storage in Amazon S3.
Therefore, if you need near-real-time reporting, go for a short duration. Otherwise, go for larger files.
Once a file is written in Amazon S3, it is immutable and Kinesis will not modify its contents. (No appending or modification of objects.)
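For reference, those size and time thresholds are set through the delivery stream's buffering hints. A minimal boto3 sketch, assuming placeholder stream name, role ARN and bucket ARN:

```python
import boto3

firehose = boto3.client("firehose")

# Create a delivery stream that flushes to S3 every 5 MB or every 300 seconds,
# whichever comes first. Stream name, role ARN and bucket ARN are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="user-data-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-user-data-bucket",
        "Prefix": "userdata/",
        "BufferingHints": {
            "SizeInMBs": 5,           # size threshold
            "IntervalInSeconds": 300,  # time threshold
        },
    },
)
```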

Related

AWS Kinesis Data Firehose and Lambda

I have different data sources and I need to publish them to S3 in real time. I also need to process and validate the data before delivering it to the S3 buckets, so I have to use AWS Lambda to validate the data. The question is: what is the difference between using AWS Kinesis Data Firehose and using AWS Lambda to store data directly in an S3 bucket? In other words, what are the advantages of using Kinesis Data Firehose, given that we can use AWS Lambda to put records directly into S3?
We might want to clarify what "near real time" means; for me, it is below 1 second.
Kinesis Firehose in this case will batch the items before delivering them to S3. This will result in more items per S3 object.
You can configure how often you want the data to be stored. (You can also connect a Lambda to Firehose, so you can process the data before delivering it to S3.) Kinesis Firehose will scale automatically.
Note that each PUT to S3 has a cost associated with it.
If you connect your data source to AWS Lambda, then each event will trigger the Lambda (unless you have a batching mechanism in place, which you didn't mention), and for each event you will make a PUT request to S3. This will result in a lot of small objects in S3 and therefore a lot of S3 PUT API calls.
Also, depending on the number of items received per second, Lambda might not be able to scale, and the associated cost will increase.
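If you do attach a transformation Lambda to Firehose, it receives batched records and must return a status per record. A minimal sketch of such a handler; the validation rule (requiring a user_id field) is just an illustration:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler: validate each batched record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Illustrative validation rule: drop records without a user_id field.
        if "user_id" not in payload:
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
            continue

        transformed = json.dumps(payload) + "\n"  # newline-delimit for easier querying
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode()).decode(),
        })
    return {"records": output}
```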

How to have EMRFS consistent view on S3 buckets with retention policy?

I am using an AWS EMR compute cluster (version 5.27.0), which uses S3 for data persistence.
This cluster both reads and writes to S3.
S3 has an eventual-consistency issue: after writing data, it cannot be immediately listed. Because of this, I use EMRFS with DynamoDB to store newly written paths for immediate listing.
The problem now is that I have to set a retention policy on S3, so data more than a month old will get deleted from S3. However, the corresponding entries do not get deleted from the EMRFS DynamoDB table, leading to consistency issues.
My question is: how can I ensure that when the retention policy deletes objects in S3, the same paths get deleted from the DynamoDB table?
One naive solution I have come up with is to define a Lambda, which fires periodically and manually sets a TTL of, say, 1 day on the DynamoDB records. Is there a better approach than this?
You can configure DynamoDB with the same expiration policy as your S3 objects, using DynamoDB Time to Live (TTL):
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
In this case, you ensure both DynamoDB and S3 hold the same set of existing objects.
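A hedged sketch of enabling TTL on the metadata table with boto3; the table name and TTL attribute name are assumptions, and since the EMRFS metadata table doesn't carry an expiry attribute out of the box, something (for example the periodic Lambda mentioned above) would still need to stamp an epoch-seconds value onto each item:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL on the EMRFS metadata table so expired items are deleted automatically.
# Table name and attribute name are assumptions; items still need an
# epoch-seconds expiry value stored in that attribute.
dynamodb.update_time_to_live(
    TableName="EmrFSMetadata",
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "expiry_epoch",
    },
)
```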

Loading incremental data from Dynamo DB to S3 in AWS Glue

Is there any option to load data incrementally from Amazon DynamoDB to Amazon S3 in AWS Glue? The bookmark option is enabled, but it is not working: it loads the complete data. Is the bookmark option not applicable when loading from DynamoDB?
It looks like Glue doesn't support job bookmarks for a DynamoDB source; they only work with an S3 source :/
To load DynamoDB data incrementally, you might use DynamoDB Streams to process only the new data.
Enable DynamoDB Streams for your table and use a Lambda function to save those stream records to S3, as in the sketch below. This will give you more control over your data.
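A minimal sketch of such a stream-triggered Lambda, assuming a placeholder bucket name and that the table's stream is configured to include new images:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dynamodb-export-bucket"  # assumed bucket name

def lambda_handler(event, context):
    """Write each DynamoDB stream record to S3 as an individual JSON object."""
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"].get("NewImage", {})
        key = f"dynamodb-stream/{record['eventID']}.json"
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=json.dumps(new_image).encode("utf-8"),
        )
```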

S3 bucket does not append new data objects

I'm trying to send all my AWS IoT incoming sensor value messages to the same S3 bucket, but despite turning on versioning in my bucket, the file keeps getting overwritten and shows only the last input sensor value rather than all of them. I'm using the "Store messages in an Amazon S3 bucket" action directly from the AWS IoT console. Is there any easy way to solve this problem?
So after further research and speaking with Amazon developer support, it turns out you actually can't append records to the same file in S3 from the IoT console directly. I mentioned this was a feature most IoT developers would want as a default, and he said it would likely be possible soon, but there is no way to do it now. Anyway, the simplest workaround I tested is to set up a Kinesis stream with a Firehose delivery to an S3 bucket. This is constrained by an adjustable data size and stream duration, but it works well otherwise. It also allows you to insert a Lambda function for data transformation if needed.
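A rough sketch of wiring an IoT topic rule to a Firehose delivery stream with boto3; the rule name, topic filter, delivery stream name and role ARN are all placeholders:

```python
import boto3

iot = boto3.client("iot")

# Route every message matching the sensor topic to a Firehose delivery stream,
# which then batches the records into S3 objects. Names and ARNs are placeholders.
iot.create_topic_rule(
    ruleName="sensor_to_firehose",
    topicRulePayload={
        "sql": "SELECT * FROM 'sensors/+/data'",
        "actions": [
            {
                "firehose": {
                    "roleArn": "arn:aws:iam::123456789012:role/iot-firehose-role",
                    "deliveryStreamName": "sensor-data-stream",
                    "separator": "\n",  # newline-delimit records inside each S3 object
                }
            }
        ],
    },
)
```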

Stream data from S3 bucket to redshift periodically

I have some data stored in S3. I need to clone/copy this data periodically from S3 to a Redshift cluster. For a bulk copy, I can use the COPY command to copy from S3 to Redshift.
Similarly, is there any trivial way to copy data from S3 to Redshift periodically?
Thanks
Try using AWS Data Pipeline, which has various templates for moving data from one AWS service to another. The "Load data from S3 into Redshift" template copies data from an Amazon S3 folder into a Redshift table. You can load the data into an existing table or provide a SQL query to create the table. The Redshift table must have the same schema as the data in Amazon S3.
Data Pipeline supports running pipelines on a schedule; it provides a cron-style editor for scheduling.
AWS Lambda Redshift Loader is a good solution: it runs a COPY command on Redshift whenever a new file appears in a pre-configured location on Amazon S3.
Links:
https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
https://github.com/awslabs/aws-lambda-redshift-loader
I believe Kinesis Firehose is the simplest way to get this done. Simply create a Kinesis Firehose delivery stream, point it at a specific table in your Redshift cluster, write data to the stream, done :)
Full setup procedure here:
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
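A sketch of the delivery-stream side of that setup with boto3; all names, ARNs and credentials below are placeholders, and Firehose must be able to reach the cluster:

```python
import boto3

firehose = boto3.client("firehose")

# Firehose stages the data in S3 and then issues a COPY into the Redshift table.
# All identifiers, ARNs and credentials here are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="s3-to-redshift-stream",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-redshift-role",
        "ClusterJDBCURL": "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb",
        "CopyCommand": {
            "DataTableName": "public.events",
            "CopyOptions": "json 'auto'",
        },
        "Username": "firehose_user",
        "Password": "REPLACE_ME",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-redshift-role",
            "BucketARN": "arn:aws:s3:::my-staging-bucket",
        },
    },
)
```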
The Kinesis option works only if Redshift is publicly accessible.
You can use the COPY command with Lambda. You can configure two Lambdas: one creates a manifest file for your upcoming new data, and another reads from that manifest and loads it into Redshift with the Redshift Data API.
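A minimal sketch of the loading half via the Redshift Data API with boto3; the cluster, database, user, table, manifest path and IAM role are assumptions:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Kick off a COPY that loads the files listed in a manifest into a Redshift table.
# Cluster, database, user, table, manifest path and role ARN are placeholders.
redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="mydb",
    DbUser="loader",
    Sql=(
        "COPY public.events "
        "FROM 's3://my-bucket/manifests/latest.manifest' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
        "MANIFEST FORMAT AS JSON 'auto';"
    ),
)
```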