Perform action after all s3 files processed - amazon-s3

I have files uploaded onto my s3 bucket - always 6, twice per day. I am running a fairly standard setup:
S3 -> SNS -> SQS -> lambda
Each lambda processes the newly uploaded file and then deletes it. I know the exact number of files in each batch, but I cannot require the client to perform any other action (e.g. message on SNS).
Question: what's the easiest/best way to perform a certain action after processing all files?
My ideas:
Step Functions - not sure how?
Simply check in each lambda whether the S3 item count is zero (or check the SQS queue size?) - not sure whether there could be a race condition against a delete that happened immediately before (is it always consistent?) or similar issues?
CloudWatch alarm when SQS queue depth is zero -> SNS -> lambda - I guess it should work, not sure about the correct metric?
I would appreciate info on the best/simplest way to achieve it.

If you are sure that all 6 of your files will have been processed by a certain time, you can simply create a CloudWatch Events rule scheduled at, say, 11:50 PM and, based on your validation, just delete the files.
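For example, the schedule could be wired up with boto3 roughly like this (a sketch only; the rule name, function name and ARN below are placeholders, not from your setup):

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Scheduled rule firing at 11:50 PM UTC every day.
rule = events.put_rule(
    Name="daily-file-cleanup",
    ScheduleExpression="cron(50 23 * * ? *)",
    State="ENABLED",
)

# Allow CloudWatch Events to invoke the function, then attach it as a target.
lambda_client.add_permission(
    FunctionName="cleanup-validator",          # placeholder function name
    StatementId="allow-cloudwatch-schedule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(
    Rule="daily-file-cleanup",
    Targets=[{
        "Id": "cleanup-validator",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:cleanup-validator",  # placeholder ARN
    }],
)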

You could use the number of files in the S3 bucket location to capture "file count" state. The processing lambda runs on each file-add, but conditionally initiates the delete and post-processing steps only when objectCount === 6.
How can we use S3 to keep track of file count? Lots of possibilities, here are two:
Option 1: defer processing until all 6 files have arrived
When triggered on OBJECT_CREATED, the lambda counts the S3 objects. If objectCount < 6 the lambda exits without further action. If 6 files exist, process all 6, delete the files and perform the post-processing action.
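A minimal sketch of that handler in Python/boto3, assuming the six files land under a single known bucket and prefix (both names below are made up), with process_file and post_process standing in for your own logic:

import boto3

s3 = boto3.client("s3")
BUCKET = "my-upload-bucket"   # assumed bucket name
PREFIX = "incoming/"          # assumed prefix
EXPECTED_COUNT = 6

def handler(event, context):
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    if len(objects) < EXPECTED_COUNT:
        return  # not all 6 files have arrived yet; exit without further action

    for obj in objects:
        process_file(BUCKET, obj["Key"])                 # per-file processing
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])  # delete after processing

    post_process()                                       # the "after all files" action

def process_file(bucket, key): ...
def post_process(): ...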
Option 2: use S3 tags to indicate PROCESSED status
When triggered on OBJECT_CREATED, the lambda processes the new file and adds a PROCESSED tag to the S3 Object. The lambda then counts the S3 objects with PROCESSED tags. If 6, delete the files and perform the post-processing action.
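A rough sketch of that variant, under the same assumed bucket/prefix names as above; the tag-counting loop issues one get_object_tagging call per object, which is fine for 6 files:

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
BUCKET = "my-upload-bucket"   # assumed bucket name
PREFIX = "incoming/"          # assumed prefix
EXPECTED_COUNT = 6

def handler(event, context):
    key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])
    process_file(BUCKET, key)                            # per-file processing
    s3.put_object_tagging(
        Bucket=BUCKET, Key=key,
        Tagging={"TagSet": [{"Key": "PROCESSED", "Value": "true"}]},
    )

    # Count the objects under the prefix that already carry the PROCESSED tag.
    processed = []
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
        tags = s3.get_object_tagging(Bucket=BUCKET, Key=obj["Key"])["TagSet"]
        if any(t["Key"] == "PROCESSED" for t in tags):
            processed.append(obj["Key"])

    if len(processed) == EXPECTED_COUNT:
        for k in processed:
            s3.delete_object(Bucket=BUCKET, Key=k)
        post_process()                                   # the "after all files" action

def process_file(bucket, key): ...
def post_process(): ...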
In any case, think through race conditions and other error states. This is where "best" and "simplest" often conflict.
N.B. Step Functions could be used to chain together the processing steps, but they don't offer a different way to keep track of file count state.

Related

What is the best way to know the latest file in s3 bucket?

I have a process which is uploading files to S3. The rate at which these files are pumped into S3 is not constant. Another process needs to look at the latest files uploaded to this bucket and update, say, a watermark. We need a best-effort strategy to make this "latest file" information available as soon as possible.
S3 has event notification integration with SNS/SQS. Since I don't need a fan-out, I thought I could simply do an S3 -> SQS integration. But on digging deeper into SQS, I see that although there is no limit on the number of SQS queues you can have per account (I would need quite a lot of queues if I were to assign one SQS queue per partition in S3), there is a limit on the maximum number of messages you can receive per call: 10.
Though I can set up one SQS queue per partition, i.e. Q1 for root/child1, Q2 for root/child2, etc., the number of files getting pumped into these child folders could itself be massive. In that case, instead of trying to drain everything in the queue just to get the latest file in the child directory, is there any other mechanism I could apply?
Note that I am not 100% done with my POC and I certainly don't have metrics yet. But given long polling (the longer you wait, the later you get the latest-file information, so short polling is probably what I should be using - but then there is a possibility that it does not send the request to all SQS servers, so I would need multiple calls to get the latest event out of SQS; I need to find a balance there), the 10-messages-per-call limit, etc., I doubt I am using the right tool for the problem here. Am I missing something, or am I terribly wrong about SQS?
I have yet to experiment with SNS - does it do any rate limiting for events? Something like "if there are 10,000 events per minute I will only send you the latest one"?
Please let me know the best way to get the latest file uploaded to S3 when the rate of file uploads is high.

Notification Service on AWS S3 bucket (prefix) size

I have a specific use case where a huge amount of data is continuously streamed into an AWS S3 bucket.
We want a notification service for a specific folder (prefix) in the bucket: if the folder reaches a specific size (for example 100 TB), a cleaning service should be triggered (via SNS and AWS Lambda).
I have checked the AWS documentation and did not find any direct support from AWS for this:
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
We are planning to have a script that runs periodically, checks the size of the S3 objects and kicks off an AWS Lambda.
Is there a more elegant way to handle a case like this? Any suggestion or opinion is really appreciated.
Attach an S3 trigger event to a Lambda function, so that it gets triggered whenever a file is added to the S3 bucket.
Then, in the Lambda function, check the file size. This eliminates the need to run a script periodically to check the size.
Below is a sample configuration (Serverless Framework syntax) for adding an S3 trigger to a Lambda function.
s3_trigger:
  handler: lambda/lambda.s3handler
  timeout: 900
  events:
    - s3:
        bucket: ${self:custom.sagemakerBucket}
        event: s3:ObjectCreated:*
        existing: true
        rules:
          - prefix: csv/
          - suffix: .csv
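A sketch of what the handler referenced above (lambda/lambda.s3handler) could do: read the size of the newly created object straight from the S3 event record. Note this is the per-object size only, not the folder total, which is what the next answer addresses.

SIZE_THRESHOLD_BYTES = 100 * 1024 ** 4   # 100 TB, as in the question

def s3handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]   # size of this one object only
        print(f"New object {key} in {bucket}: {size} bytes")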
There is no direct method for obtaining the size of a folder in Amazon S3 (because folders do not actually exist).
Here's a few ideas...
Periodic Lambda function to calculate total
Create an Amazon CloudWatch Event to trigger an AWS Lambda function at specific intervals. The Lambda function would list all objects with the given Prefix (effectively a folder) and total the sizes. If it exceeds 100TB, the Lambda function could trigger the cleaning process.
However, if there are thousands of files in that folder, this would be somewhat slow. Each API call can only retrieve 1000 objects. Thus, it might take many calls to count the total, and this would be done every checking interval.
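A sketch of what that periodic Lambda might look like (bucket and prefix names are made up; trigger_cleaning stands in for whatever kicks off the cleaning process):

import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"        # assumed bucket name
PREFIX = "stream/"               # assumed "folder"
LIMIT_BYTES = 100 * 1024 ** 4    # 100 TB

def handler(event, context):
    total = 0
    paginator = s3.get_paginator("list_objects_v2")   # each page holds up to 1000 keys
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        total += sum(obj["Size"] for obj in page.get("Contents", []))
    if total > LIMIT_BYTES:
        trigger_cleaning()       # e.g. publish to SNS or invoke the cleaning Lambda

def trigger_cleaning(): ...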
Keep a running total
Configure Amazon S3 Events to trigger an AWS Lambda function whenever a new object is created with that Prefix. The Lambda function can then increment a running total in a database. If the total exceeds 100TB, the Lambda function could trigger the cleaning process.
Which database to use? Amazon DynamoDB would be the quickest and it supports an 'increment' function, but you could be sneaky and just use AWS Systems Manager Parameter Store. This might cause a problem if new objects are created quickly because there's no locking. So, if files are coming in every few seconds or faster, definitely use DynamoDB.
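A sketch of the DynamoDB variant, assuming a table keyed on a "prefix" attribute (table name, attribute names and prefix below are made up):

import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "prefix-size-totals"     # assumed table with partition key "prefix"
PREFIX = "stream/"               # assumed prefix being tracked
LIMIT_BYTES = 100 * 1024 ** 4    # 100 TB

def handler(event, context):
    for record in event["Records"]:
        size = record["s3"]["object"]["size"]
        resp = dynamodb.update_item(
            TableName=TABLE,
            Key={"prefix": {"S": PREFIX}},
            UpdateExpression="ADD total_bytes :s",   # atomic increment, no locking needed
            ExpressionAttributeValues={":s": {"N": str(size)}},
            ReturnValues="UPDATED_NEW",
        )
        total = int(resp["Attributes"]["total_bytes"]["N"])
        if total > LIMIT_BYTES:
            trigger_cleaning()   # e.g. publish to SNS or invoke the cleaning Lambda

def trigger_cleaning(): ...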
Slow motion
You did not indicate how often this 100TB limit is likely to be triggered. If it only happens after a few days, you could use Amazon S3 Inventory, which provides a daily CSV containing a listing of objects in the bucket. This solution, of course, would not be applicable if the 100TB limit is hit in less than a day.

Lambda function and triggered S3 events

I have a lambda function that gets triggered every time a file is written to an S3 bucket. My understanding is that every time a single file comes in (this is the likely scenario, rather than a batch of files being sent), an invocation is fired and that means I am charged. My question is: can I batch multiple files so that the function is only invoked when, for example, I have a batch of 10 files? Is this good practice? I should not end up with a processing time greater than 15 minutes, so using Lambda is still fine.
Thank you
You can use SQS to decouple this scenario: make SQS the Lambda's triggering point, and there you can set whatever batch size you want.
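For example, the event source mapping could be created with boto3 roughly like this (queue ARN and function name are placeholders):

import boto3

boto3.client("lambda").create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:file-events-queue",
    FunctionName="file-processor",
    BatchSize=10,                        # up to 10 S3 event messages per invocation
    MaximumBatchingWindowInSeconds=60,   # wait up to 60s to fill a batch
)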
1 - One solution is to group your files into an archive (e.g. a rar) and put that into S3. That way, for multiple files, your Lambda is triggered only once.
2 - The other solution, as said by kamprasad, is to use SQS.
3 - One last solution that I can think of is to use a cron job to trigger the Lambda as per your requirement. Inside your Lambda, do the processing using threads to get the task done faster. Keep in mind you have to choose memory and timeout carefully in this scenario.
I've personally used the last solution quite frequently.

Deleting events published by AWS S3 buckets which are still in queue to be processed by lambda

My architecture is:
1. Drop multiple files into an AWS S3 bucket
2. Lambda picks up the files one by one and starts processing them
The problem is:
I am not able to stop the Lambda from processing the remaining files midway. Even if I disable the Lambda and re-enable it, it picks up from where it left off.
Is there a way to achieve this?
You have no control over the events pushed by S3. You'll be better off if you just cancel the Lambda subscription if you want to stop it for good, but I am afraid that already emitted events will be processed as long as your Lambda is active.
What exactly are you trying to achieve?
If you want to limit the number of files your Lambda function can process at once, you can limit the concurrent executions of your function to 1, so it won't auto-scale based on demand.
Simply go to the Concurrency settings of the function in the Lambda console, set the reserved concurrency to 1 and save it.
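If you prefer the API over the console, the same setting can be applied with boto3 (the function name below is a placeholder):

import boto3

boto3.client("lambda").put_function_concurrency(
    FunctionName="file-processor",
    ReservedConcurrentExecutions=1,   # at most one concurrent execution
)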
Detach the S3 trigger from the Lambda and attach it again.
This way only new events will be picked up, not the old ones.

Nifi: Delete files from S3 when they are X days old

I have Nifi flow which is supposed to delete any files from S3 that are older than 7 days. I have used the following setup to get it done.
My UpdateAttribute processor has an epoch_now attribute that gets the current epoch time.
On my RouteOnAttribute I have the following logic to filter out files that are younger than 7 days using this expression: ${epoch_now:minus(${s3.lastModified}):ge(604800000)}
The problem is that the ListS3 processor maintains state and won't re-list all the files the next time, so it cannot determine whether any files have expired and need to be deleted. I looked around but could not find a Get*-style processor that does not maintain state. How do I fix this flow so that it runs periodically and keeps deleting files that are 7 days old?
You are correct, NiFi does not currently have a processor to query S3 that way.
This might be a better fit for an S3 Lifecycle Rule. You can configure a rule for specific key prefixes, so S3 will automagically delete objects after 7 days. From the S3 console:
Select your bucket
Select Properties
Expand the Lifecycle section
Click Add rule
There is a wizard-style interface to walk you through the configuration.
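If you would rather set it up with code than the wizard, the same rule can be applied with boto3 (bucket name and prefix below are placeholders):

import boto3

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-nifi-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-after-7-days",
                "Filter": {"Prefix": "data/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},   # delete objects 7 days after creation
            }
        ]
    },
)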