I have a NiFi flow which is supposed to delete any files from S3 that are older than 7 days. I have used the following setup to get it done.
My UpdateAttribute processor has an epoch_now attribute that holds the current epoch time.
On my RouteOnAttribute processor I use the following expression to route only files that are at least 7 days old: ${epoch_now:minus(${s3.lastModified}):ge(604800000)}
The problem is that the ListS3 processor maintains state, so it won't re-list all the files on the next run to check whether any files have expired and need to be deleted. I looked around but could not find something like a Get* processor that does not maintain state. How do I fix this flow so that it runs periodically and keeps deleting files that are 7 days old?
You are correct, NiFi does not currently have a processor to query S3 that way.
This might be a better fit for an S3 Lifecycle Rule. You can configure a rule for specific key prefixes, so S3 will automagically delete objects after 7 days. From the S3 console:
Select your bucket
Select Properties
Expand the Lifecycle section
Click Add rule
There is a wizard-style interface to walk you through the configuration.
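If you'd rather define the rule programmatically than click through the console, the same thing can be expressed with boto3 - a minimal sketch, assuming a placeholder bucket name and key prefix:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefix; adjust to your own layout.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-after-7-days",
                "Filter": {"Prefix": "incoming/"},   # rule applies only to this key prefix
                "Status": "Enabled",
                "Expiration": {"Days": 7},           # delete objects 7 days after creation
            }
        ]
    },
)
```

Expiration is based on the object's creation date, and S3 removes expired objects asynchronously in the background, so no NiFi state handling is needed at all.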
I have files uploaded to my S3 bucket - always 6, twice per day. I am running a fairly standard setup:
S3 -> SNS -> SQS -> lambda
Each lambda processes the newly uploaded file and then deletes it. I know the exact number of files in each batch, but I cannot require the client to perform any other action (e.g. message on SNS).
Question: what's the easiest/best way to perform a certain action after processing all files?
My ideas:
Step Functions - not sure how?
Simply check in each lambda whether the S3 object count is zero (or check the SQS queue size?) - I'm not sure whether there could be a race condition with a delete that happened immediately before (is it always consistent?), or similar issues.
CloudWatch alarm when SQS queue depth is zero -> SNS -> lambda - I guess it should work, but I'm not sure about the correct metric?
I would appreciate info on the best/simplest way to achieve it.
If you are sure that all 6 of your files will have been processed by a certain time, you can simply create a CloudWatch scheduled rule (say at 11:50 PM) and, based on your validation, just delete the files.
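As a rough illustration, that schedule could be an EventBridge/CloudWatch Events rule (e.g. cron(50 23 * * ? *)) triggering a small cleanup Lambda like this - a sketch, with placeholder bucket and prefix names:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-upload-bucket"  # placeholder
PREFIX = "incoming/"         # placeholder

def handler(event, context):
    # Runs on the nightly schedule: validate as needed, then delete everything under the prefix.
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```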
You could use the number of files in the S3 bucket location to capture "file count" state. The processing lambda runs on each file-add, but conditionally initiates the delete and post-processing steps only when objectCount === 6.
How can we use S3 to keep track of file count? Lots of possibilities, here are two:
Option 1: defer processing until all 6 files have arrived
When triggered on OBJECT_CREATED, the lambda counts the S3 objects. If objectCount < 6 the lambda exits without further action. If 6 files exist, process all 6, delete the files and perform the post-processing action.
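A minimal Lambda sketch of this option (the bucket name, prefix, and processing/post-processing functions are placeholders):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-upload-bucket"  # placeholder
PREFIX = "incoming/"         # placeholder
EXPECTED_COUNT = 6

def process_file(bucket, key):
    # Placeholder for your per-file processing logic.
    print(f"processing s3://{bucket}/{key}")

def post_process():
    # Placeholder for the action to run once all files are handled.
    print("batch complete")

def handler(event, context):
    # Count the objects currently sitting under the prefix.
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    if len(objects) < EXPECTED_COUNT:
        return  # not all files have arrived yet; do nothing until the next trigger

    # All 6 files are present: process them, delete them, then post-process.
    for obj in objects:
        process_file(BUCKET, obj["Key"])
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
    post_process()
```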
Option 2: use S3 tags to indicate PROCESSED status
When triggered on OBJECT_CREATED, the lambda processes the new file and adds a PROCESSED tag to the S3 Object. The lambda then counts the S3 objects with PROCESSED tags. If 6, delete the files and perform the post-processing action.
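And a sketch of the tag-based option (again with placeholder names; note that keys in S3 event records are URL-encoded):

```python
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
BUCKET = "my-upload-bucket"  # placeholder
PREFIX = "incoming/"         # placeholder
EXPECTED_COUNT = 6

def handler(event, context):
    # Key of the newly created object, taken from the S3 event record.
    key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])

    # ... process the new file here, then mark it as processed.
    s3.put_object_tagging(
        Bucket=BUCKET, Key=key,
        Tagging={"TagSet": [{"Key": "status", "Value": "PROCESSED"}]},
    )

    # Count objects under the prefix that carry the PROCESSED tag.
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    processed = 0
    for obj in objects:
        tags = s3.get_object_tagging(Bucket=BUCKET, Key=obj["Key"])["TagSet"]
        if any(t["Key"] == "status" and t["Value"] == "PROCESSED" for t in tags):
            processed += 1

    if processed == EXPECTED_COUNT:
        for obj in objects:
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
        # ... run the post-processing action here.
```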
In any case, think through race conditions and other error states. This is often where "best" and "simplest" conflict.
N.B. Step Functions could be used to chain together the processing steps, but they don't offer a different way to keep track of file count state.
I'm configuring S3 Lifecycle Management for my application, but from Amazon's documentation alone I couldn't tell whether the date considered before moving the data to a lower tier is the upload date or the date of the last operation performed on the data.
For example, let's say I configured Lifecycle Management to lower the tier of the data after 30 days. I have data that was uploaded 40 days ago, but I handle it every day. Will S3 consider the time I uploaded the data or the last time I handled it?
If it uses the upload date by default, is there a way to configure S3 to consider the last access date instead?
It uses the object's "upload date" / "creation date" / "last modified date". Configuring S3 Lifecycle Management to use the last access date is not currently supported.
You might be able to get the last access date by doing some SQL queries on S3 access logs, and then perform the transition manually. However, this seems pretty painful.
A better way to do this might be to maintain a DynamoDB table which records the last access time. Then you can create a cron job to perform the transition periodically.
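A rough sketch of that cron job, assuming a hypothetical DynamoDB table keyed by object key with a last_access epoch-seconds attribute:

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

TABLE = dynamodb.Table("s3-last-access")  # hypothetical table: {key, last_access (epoch seconds)}
BUCKET = "my-bucket"                      # placeholder
CUTOFF = time.time() - 30 * 24 * 3600     # "not accessed in the last 30 days"

def transition_cold_objects():
    # Scan the access-tracking table (pagination omitted for brevity).
    for item in TABLE.scan()["Items"]:
        if float(item["last_access"]) < CUTOFF:
            # Copy the object onto itself with a colder storage class.
            s3.copy_object(
                Bucket=BUCKET,
                Key=item["key"],
                CopySource={"Bucket": BUCKET, "Key": item["key"]},
                StorageClass="STANDARD_IA",
                MetadataDirective="COPY",
            )
```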
Another, simpler approach is to use S3 Intelligent-Tiering, which moves objects between access tiers based on access patterns. Feel free to take a look; it might fit your needs.
I have a process which is uploading files to S3. The rate at which these files are pumped to S3 is not constant. Another process needs to look at the latest files uploaded to this bucket and update, say, a watermark. We need a best-effort strategy to make this "latest file" information available as soon as possible.
S3 has event notification integration with SNS/SQS. Since I don't need a fan-out, I thought I could simply do an S3 -> SQS integration. But on digging deeper into SQS, I see that although there is no limit on the number of SQS queues you can have per account (I would need quite a lot of queues if I were to assign one queue per partition in S3), there is a limit on the maximum number of messages you can receive per call: 10.
Though I can set up an SQS queue per partition, i.e. Q1 for root/child1, Q2 for root/child2, etc., the number of files getting pumped into each child folder could itself be massive. In that case, instead of trying to drain everything in the queue just to get the latest file in the child directory, is there any other mechanism I could apply?
Note: I am not 100% done with my POC and I certainly don't have the metrics yet. But given long polling (the longer you wait, the longer the delay in getting the latest file information out, so short polling is probably what I should be using - but then there is a chance that short polling does not query all of the SQS servers, so I would need multiple calls to get the latest event out of SQS; I need to find a balance there), the 10-messages-per-call limit, etc., I doubt whether I am using the right tool for this problem. Am I missing something, or am I terribly wrong about SQS?
I am yet to experiment with SNS - does it do any rate limiting for events? Something like "if there are 10,000 events per minute I will only send you the latest one"?
Please let me know what is the best way to get the latest file uploaded in S3 when the rate of files uploaded is high.
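For reference, the polling pattern described above looks roughly like this (a sketch; the queue URL is a placeholder and the queue is assumed to be subscribed to the bucket's s3:ObjectCreated:* notifications):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"  # placeholder

def newest_key_in_batch():
    # Each ReceiveMessage call returns at most 10 messages, so draining a busy
    # queue just to learn the newest key can take a large number of calls.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1
    )
    latest = None
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):
            stamp = record["eventTime"]        # ISO-8601, sorts lexicographically
            key = record["s3"]["object"]["key"]
            if latest is None or stamp > latest[0]:
                latest = (stamp, key)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    return latest  # (eventTime, key) of the newest object seen in this batch, or None
```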
I'm really interested in the TTL feature for files in SeaweedFS. The only part missing in my understanding is how it interacts with S3 and at the volume level.
Currently my app uses S3 libraries to interact with the storage. It writes a lot of small files, which are never rewritten but are accessed from time to time. I also need to keep only x days of written data, while older entries should be removed. SeaweedFS looks like a perfect solution to me. But as far as I understand the docs, I can only set the TTL using its own API.
So here are my two questions:
Can I somehow set the TTL for a file using S3 abstraction layer?
If not, can I have a default TTL value per bucket (or volume, I guess) or per cluster?
Currently, no. This could be a good feature to add.
We have a huge S3 bucket with some billions of items inside. For now it doesn't have any item lifecycle policy applied, so items are stored forever. We want to purge items which are older than half a year and keep doing so on a regular basis. With an S3 lifecycle policy such a task seems easy, but is it safe (in terms of performance and availability) to apply a policy to an existing huge bucket, or should lifecycle policies only be applied to new buckets?
This should be a safe operation, as far as performance is concerned, regardless of bucket size or number of objects impacted, because lifecycle policies are handled asynchronously, in the background, in stages.
Expiring Objects: General Considerations
When an object reaches the end of its lifetime, Amazon S3 queues it for removal and removes it asynchronously. There may be a delay between the expiration date and the date at which Amazon S3 removes an object. You are not charged for storage time associated with an object that has expired.
http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
I have experienced performance issues, especially with EMR, JetS3t, and the S3 console, when using lifecycles to delete billions of objects on versioning-enabled buckets. When a current version is deleted, a delete marker object is added and the object becomes a previous (noncurrent) version; the previous version is then cleaned up later by the lifecycle's previous-version delete setting. The delete marker is yet another object that must be filtered out in the console or in bucket listings, so we end up with twice the number of objects until they are all deleted. We also pay standard storage charges for the previous versions until they are deleted.
Any application that lists objects may need to filter current versions, delete markers, and previous versions. I found the S3 console would hang until I selected "Versions: Show". In some cases Hadoop 1.x and JetS3t would time out until most delete markers and previous versions were cleaned up, especially from the bucket root.
For versioning-enabled buckets I recommend you start deleting gradually by setting the lifecycle interval large and then reducing it over time. I also found that lifecycles are not instantaneous; it takes several days (at least) for large numbers of objects to be removed.
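For reference, a versioned-bucket lifecycle along those lines can be set with boto3 roughly like this (a sketch; the bucket name and day counts are placeholders for the "start large, then reduce" approach):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-versioned-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-objects",
                "Filter": {"Prefix": ""},  # whole bucket
                "Status": "Enabled",
                # Current versions get a delete marker after this many days...
                "Expiration": {"Days": 365},
                # ...and the noncurrent (previous) versions are removed later.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ]
    },
)
```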
I use http://www.insight4storage.com/ from AWS Marketplace to monitor my storage types and versions per bucket and prefix. It helped me track lifecycle progress over time by showing the current and previous versions, IA, RRS, and Glacier storage by prefix and per bucket.