How to have an EMRFS consistent view on S3 buckets with a retention policy? - amazon-s3

I am using an AWS EMR compute cluster (version 5.27.0), which uses S3 for data persistence.
This cluster both reads and writes to S3.
S3 is eventually consistent, which means that newly written data cannot always be listed immediately. For this reason I use EMRFS consistent view, which records newly written paths in a DynamoDB table so they can be listed right away.
The problem is that I now have to set a retention policy on S3, under which data older than a month is deleted from S3. However, the corresponding entries do not get deleted from the EMRFS DynamoDB table, which leads to consistency issues.
My question is: how can I ensure that when the retention policy deletes paths in S3, the same paths also get deleted from the DynamoDB table?
One naive solution I have come up with is to define a Lambda that fires periodically and manually sets a TTL of, say, 1 day on the DynamoDB records. Is there a better approach than this?

You can configure DynamoDB Time to Live (TTL) with the same expiration policy that your S3 objects have:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
That way you ensure DynamoDB and S3 hold the same set of existing objects.
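A minimal sketch (Python/boto3) of what enabling TTL could look like, assuming the default EMRFS metadata table name EmrFSMetadata and a hypothetical epoch attribute expirationTime that you would have to add to the items yourself, since EMRFS does not write such an attribute by default:

    # Minimal sketch: enable TTL on the EMRFS metadata table.
    # "EmrFSMetadata" is the default table name; "expirationTime" is a
    # hypothetical attribute you would populate with an epoch timestamp.
    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # DynamoDB will delete an item once the epoch timestamp stored in
    # "expirationTime" has passed.
    dynamodb.update_time_to_live(
        TableName="EmrFSMetadata",
        TimeToLiveSpecification={
            "Enabled": True,
            "AttributeName": "expirationTime",
        },
    )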

Related

AWS Kinesis Data Firehose and Lambda

I have different data sources and I need to publish them to S3 in real time. I also need to process and validate the data before delivering it to the S3 buckets, so I have to use AWS Lambda for validation. The question is: what is the difference between AWS Kinesis Data Firehose and using AWS Lambda to store data directly in an S3 bucket? In other words, what are the advantages of using Kinesis Data Firehose, given that we can use AWS Lambda to put records directly into S3?
We might want to clarify what "near real time" means; for me, it is below 1 second.
In this case, Kinesis Data Firehose will batch the items before delivering them to S3. This results in more records per S3 object.
You can configure how often you want the data to be stored. (You can also connect a Lambda function to Firehose, so you can process the data before delivering it to S3.) Kinesis Data Firehose will scale automatically.
Note that each PUT to S3 has a cost associated with it.
If you connect your data source directly to AWS Lambda, then each event will trigger the Lambda (unless you have a batching mechanism in place, which you didn't mention), and for each event you will make a PUT request to S3. This results in a lot of small objects in S3 and therefore a lot of S3 PUT API calls.
Also, depending on the number of items received per second, Lambda might not be able to scale, and the associated cost will increase.
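To make the trade-off concrete, here is a minimal sketch (Python/boto3) of a producer writing to a hypothetical delivery stream named my-delivery-stream; Firehose buffers these records and writes them to S3 in batches instead of issuing one PUT per event:

    # Minimal sketch: send individual events to Firehose and let it batch them.
    # "my-delivery-stream" is a hypothetical delivery stream name.
    import json
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    def publish(event: dict) -> None:
        # Each record is appended to the current buffer; Firehose flushes the
        # buffer to S3 when the configured size or time threshold is reached.
        firehose.put_record(
            DeliveryStreamName="my-delivery-stream",
            Record={"Data": json.dumps(event).encode("utf-8") + b"\n"},
        )

    publish({"source": "sensor-1", "value": 42})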

How to store streaming data from Amazon Kinesis Data Firehose to an S3 bucket

I want to improve my current application. I am using Redis via ElastiCache in AWS in order to store some user data from my website.
This solution is not scalable, and I want to scale it by using Amazon Kinesis Data Firehose for auto-scaling streaming output, AWS Lambda to modify my input data, an S3 bucket to store it, and AWS Athena to access it.
I have been googling for several days but I really don't understand how Amazon Kinesis Data Firehose stores the data in S3.
Is Firehose going to store the data as a single file for each record it processes, or is there a way to append the data to the same CSV or group the data into different CSVs?
Amazon Kinesis Data Firehose will group data into a file based on:
Size of data (e.g. 5 MB)
Duration (e.g. every 5 minutes)
Whichever one hits the limit first will trigger the data storage in Amazon S3.
Therefore, if you need near-realtime reporting, go for a short duration. Otherwise, go for larger files.
Once a file is written in Amazon S3, it is immutable and Kinesis will not modify its contents. (No appending or modification of objects.)
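As a rough illustration, here is a minimal sketch (Python/boto3) of creating a delivery stream with explicit buffering hints; the stream name, bucket and IAM role are hypothetical placeholders:

    # Minimal sketch: a Firehose delivery stream that flushes to S3 when either
    # buffering threshold (size or time) is reached, whichever comes first.
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    firehose.create_delivery_stream(
        DeliveryStreamName="my-delivery-stream",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-analytics-bucket",
            "Prefix": "events/",
            "BufferingHints": {
                "SizeInMBs": 5,            # flush after 5 MB ...
                "IntervalInSeconds": 300,  # ... or after 5 minutes
            },
        },
    )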

AWS DMS Redshift as target

I am planning to do a continuous migration of RDS to Redshift using DMS. The docs state that if the target is Redshift, DMS uses an S3 bucket to temporarily store the data before copying it to Redshift. I could not find any document confirming whether this S3 bucket is temporary (used only for the initial copy) and is deleted once the copying is done. (https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html)
Any thoughts on this?
Probably you've figured out the answer already. If not: yes, DMS does create a bucket; its contents are deleted, but the bucket itself is not deleted. Generally the name of the bucket starts with dms-.
There is also an option to provide a custom bucket.
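For reference, a minimal sketch (Python/boto3) of what providing a custom staging bucket could look like when creating the Redshift target endpoint; the endpoint, cluster, role and bucket names are hypothetical placeholders:

    # Minimal sketch: a DMS Redshift target endpoint that stages intermediate
    # COPY files in a bucket you own instead of the auto-created dms-* bucket.
    import boto3

    dms = boto3.client("dms", region_name="us-east-1")

    dms.create_endpoint(
        EndpointIdentifier="redshift-target",
        EndpointType="target",
        EngineName="redshift",
        ServerName="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        Port=5439,
        DatabaseName="analytics",
        Username="dms_user",
        Password="REPLACE_ME",
        RedshiftSettings={
            # Custom staging location for the intermediate files.
            "BucketName": "my-dms-staging-bucket",
            "BucketFolder": "dms-staging/",
            "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
        },
    )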

How can I search the changes made on an `s3` bucket between two timestamps?

I am using an S3 bucket to store my data, and I keep pushing data to this bucket every single day. I wonder whether there is a feature that lets me compare the differences between the files in my bucket at two dates. If not, is there a way for me to build one via the AWS CLI or SDK?
The reason I want to check this is that I have an S3 bucket and my clients keep pushing data to it. I want to see how much data they have pushed since the last time I loaded it. Is there a pattern in AWS that supports this query, or do I have to create rules on the S3 bucket to analyse it?
Listing from Amazon S3
You can activate Amazon S3 Inventory, which can provide a daily file listing the contents of an Amazon S3 bucket. You could then compare differences between two inventory files.
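A minimal sketch (Python/boto3) of enabling a daily inventory report, with hypothetical source and destination bucket names:

    # Minimal sketch: enable a daily S3 Inventory report in CSV format.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_inventory_configuration(
        Bucket="my-data-bucket",
        Id="daily-inventory",
        InventoryConfiguration={
            "Id": "daily-inventory",
            "IsEnabled": True,
            "IncludedObjectVersions": "Current",
            "Schedule": {"Frequency": "Daily"},
            "Destination": {
                "S3BucketDestination": {
                    "Bucket": "arn:aws:s3:::my-inventory-bucket",
                    "Format": "CSV",
                    "Prefix": "inventory/",
                }
            },
            "OptionalFields": ["Size", "LastModifiedDate"],
        },
    )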
List it yourself and store it
Alternatively, you could list the contents of a bucket and look for objects dated since the last listing. However, if objects are deleted, you will only know this if you keep a list of objects that were previously in the bucket. It's probably easier to use S3 inventory.
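If you do list it yourself, a minimal sketch (Python/boto3) of keeping only the objects modified between two timestamps, assuming a hypothetical bucket name:

    # Minimal sketch: list a bucket and keep only objects whose LastModified
    # falls between two timestamps, summing their sizes.
    from datetime import datetime, timezone
    import boto3

    s3 = boto3.client("s3")
    start = datetime(2019, 9, 1, tzinfo=timezone.utc)
    end = datetime(2019, 10, 1, tzinfo=timezone.utc)

    paginator = s3.get_paginator("list_objects_v2")
    total_bytes = 0
    for page in paginator.paginate(Bucket="my-data-bucket"):
        for obj in page.get("Contents", []):
            if start <= obj["LastModified"] < end:
                total_bytes += obj["Size"]
                print(obj["Key"], obj["Size"], obj["LastModified"])

    print(f"{total_bytes} bytes uploaded between {start} and {end}")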
Process it in real-time
Instead of thinking about files in batches, you could configure Amazon S3 Events to trigger something whenever a new file is uploaded to the Amazon S3 bucket. The event can:
Trigger a notification via Amazon Simple Notification Service (SNS), such as an email
Invoke an AWS Lambda function to run some code you provide. For example, the code could process the file and send it somewhere.
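For example, a minimal Lambda handler for S3 "ObjectCreated" events could look like this sketch, which simply logs each new key and its size but could forward the details anywhere you like:

    # Minimal sketch: Lambda handler wired to S3 ObjectCreated notifications.
    def lambda_handler(event, context):
        records = event.get("Records", [])
        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            size = record["s3"]["object"].get("size", 0)
            print(f"New object: s3://{bucket}/{key} ({size} bytes)")
        return {"processed": len(records)}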

What's the use of periodically scheduling an AWS Glue crawler? Running it once seems to be enough

I've created an AWS Glue table based on the contents of an S3 bucket. This allows me to query the data in this S3 bucket using AWS Athena. I've defined an AWS Glue crawler and run it once to auto-determine the schema of the data. This all works nicely.
Afterwards, all newly uploaded data in the S3 bucket is nicely reflected in the table (verified by doing a select count(*) ... in Athena).
Why then would I need to periodically run (i.e. schedule) an AWS Glue crawler? After all, as said, updates to the S3 bucket seem to be properly reflected in the table. Is it to update statistics on the table so the query planner can be optimized, or something?
A crawler is needed to register new data partitions in the Data Catalog. For example, suppose your data is located in the folder /data and partitioned by date (/data/year=2018/month=9/day=11/<data-files>). Each day, files arrive in a new folder (day=12, day=13, etc.). To make the new data available for querying, these partitions must be registered in the Data Catalog, which can be done by running a crawler. An alternative is to run 'MSCK REPAIR TABLE {table-name}' in Athena, as sketched below.
Besides that, a crawler can detect a change in schema and take appropriate action depending on your configuration.
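A minimal sketch (Python/boto3) of registering new partitions without a crawler by running MSCK REPAIR TABLE through Athena; the database, table and results bucket names are hypothetical placeholders:

    # Minimal sketch: register newly arrived partitions by running
    # MSCK REPAIR TABLE through the Athena API.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE my_table",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )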