Lambda function and triggered S3 events - amazon-s3

I have a Lambda function that gets triggered every time a file is written to an S3 bucket. My understanding is that every time a single file arrives (this is a possible scenario, as opposed to files being sent in batches), an invocation is fired and I am charged for it. My question is: can I batch multiple files so that the function is only invoked when, for example, 10 files have accumulated? Is this good practice? Processing should never take longer than 15 minutes, so Lambda remains a fit.
Thank you

You can use SQS to decouple this scenario: the Lambda trigger becomes the SQS queue, and there you can set whatever batch size you want.
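As a minimal sketch of that wiring with boto3 (the queue ARN and function name below are hypothetical): the bucket's event notification publishes to the SQS queue, and the event source mapping hands the function up to 10 queued messages per invocation.

```python
import boto3

lambda_client = boto3.client("lambda")

# The S3 bucket's notification configuration sends events to this queue;
# the mapping then delivers up to 10 queued events per Lambda invocation.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:incoming-files",  # hypothetical queue
    FunctionName="process-files",                                        # hypothetical function
    BatchSize=10,    # up to 10 S3 notification messages per invocation
    Enabled=True,
)
```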

1 - One solution is to group your files into an archive (e.g. a rar) and put that into S3. That way your Lambda is triggered only once for multiple files.
2 - The other solution, as kamprasad said, is to use SQS.
3 - One last solution I can think of is to use a cron job to trigger the Lambda as per your requirement. Inside your Lambda, do the processing with threads to finish the task faster; see the sketch below. Keep in mind that you have to choose memory and timeout carefully in this scenario.
I've personally used the last solution quite frequently.
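Here is a rough sketch of that last option (the bucket, prefix and per-file work are hypothetical): a scheduled Lambda lists whatever has accumulated and fans the work out over a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "incoming-bucket"   # hypothetical
PREFIX = "pending/"          # hypothetical

def process_object(key):
    # Placeholder for the real per-file work.
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    print(f"processed {key}: {len(body)} bytes")

def handler(event, context):
    # List whatever has accumulated since the last scheduled run.
    contents = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    keys = [obj["Key"] for obj in contents]
    # The work is mostly I/O-bound, so threads help; size memory and timeout accordingly.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(process_object, keys))
```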

Related

Perform action after all s3 files processed

I have files uploaded to my S3 bucket - always 6, twice per day. I am running a fairly standard setup:
S3 -> SNS -> SQS -> lambda
Each Lambda invocation processes the newly uploaded file and then deletes it. I know the exact number of files in each batch, but I cannot require the client to perform any other action (e.g. publish a message on SNS).
Question: what's the easiest/best way to perform a certain action after processing all files?
My ideas:
Step Functions - not sure how?
Simply check in each Lambda whether the S3 object count is zero (or check the SQS queue size?) - but I'm not sure whether there could be a race condition with a delete that happened just before (is it always consistent?), or similar issues.
CloudWatch alarm when SQS queue depth is zero -> SNS -> lambda - I guess it should work, not sure about the correct metric?
I would appreciate info on the best/simplest way to achieve it.
If you are sure that all 6 of your files will have been processed by x o'clock, then you can simply create a CloudWatch scheduled rule (for example at 11:50 PM) and, based on your own validation, just delete the files.
You could use the number of files in the S3 bucket location to capture "file count" state. The processing lambda runs on each file-add, but conditionally initiates the delete and post-processing steps only when objectCount === 6.
How can we use S3 to keep track of file count? Lots of possibilities, here are two:
Option 1: defer processing until all 6 files have arrived
When triggered on OBJECT_CREATED, the lambda counts the S3 objects. If objectCount < 6 the lambda exits without further action. If 6 files exist, process all 6, delete the files and perform the post-processing action.
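A minimal sketch of Option 1 (the bucket, prefix and the process()/post_process() helpers are hypothetical stand-ins):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "upload-bucket"   # hypothetical
PREFIX = "batch/"          # hypothetical
EXPECTED = 6

def process(key):
    """Hypothetical per-file processing."""

def post_process():
    """Hypothetical action to run once after all files are handled."""

def handler(event, context):
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    if len(objects) < EXPECTED:
        return  # not all 6 files have arrived yet; exit without further action
    for obj in objects:
        process(obj["Key"])
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
    post_process()
```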
Option 2: use S3 tags to indicate PROCESSED status
When triggered on OBJECT_CREATED, the lambda processes the new file and adds a PROCESSED tag to the S3 Object. The lambda then counts the S3 objects with PROCESSED tags. If 6, delete the files and perform the post-processing action.
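A minimal sketch of Option 2 under the same hypothetical names; note it assumes the handler receives a plain S3 event, so in the S3 -> SNS -> SQS chain you would first unwrap the SQS and SNS envelopes:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "upload-bucket"   # hypothetical
PREFIX = "batch/"          # hypothetical
EXPECTED = 6

def is_processed(key):
    tags = s3.get_object_tagging(Bucket=BUCKET, Key=key)["TagSet"]
    return any(t["Key"] == "status" and t["Value"] == "PROCESSED" for t in tags)

def handler(event, context):
    key = event["Records"][0]["s3"]["object"]["key"]  # plain S3 event assumed
    # ... process the newly uploaded file here (omitted) ...
    s3.put_object_tagging(
        Bucket=BUCKET, Key=key,
        Tagging={"TagSet": [{"Key": "status", "Value": "PROCESSED"}]},
    )

    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    processed = [o["Key"] for o in objects if is_processed(o["Key"])]
    if len(processed) == EXPECTED:
        for k in processed:
            s3.delete_object(Bucket=BUCKET, Key=k)
        # ... perform the post-processing action here (omitted) ...
```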
In any case, think through race conditions and other error states. This is where "best" and "simplest" often conflict.
N.B. Step Functions could be used to chain together the processing steps, but they don't offer a different way to keep track of file count state.

What is the best way to know the latest file in s3 bucket?

I have a process that uploads files to S3. The rate at which these files are pumped to S3 is not constant. Another process needs to look at the latest files uploaded to this bucket and update, say, a watermark. We need a best-effort strategy to make this "latest file" information available as soon as possible.
S3 has event notification integration with SNS/SQS. Since I don't need fan-out, I thought I could simply do an S3 -> SQS integration. But on digging deeper into SQS, I see that although there is no limit on the number of SQS queues you can have per account (I would need quite a lot of queues if I were to assign one per S3 partition), there is a limit on the maximum number of messages you can receive per call: 10.
Even if I set up one SQS queue per partition, i.e. Q1 for root/child1, Q2 for root/child2, etc., the number of files pumped into each child folder could itself be massive. In that case, instead of trying to drain everything in the queue just to get the latest file in the child directory, is there any other mechanism I could apply?
Note: I am not 100% done with my POC and I certainly don't have the metrics yet. But given long polling (the longer you wait, the longer the delay before the latest file information comes out, so short polling is probably what I should use - but then a short poll may not query all SQS servers, so I would need multiple calls to get the latest event out of SQS; I need to find a balance there), the 10-messages-per-call limit, etc., I doubt I am using the right tool for this problem. Am I missing something, or am I terribly wrong about SQS?
I have yet to experiment with SNS - does it do any rate limiting for events, something like "if there are 10,000 events per minute, I will only send you the latest one"?
Please let me know the best way to get the latest file uploaded to S3 when the upload rate is high.
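For concreteness, this is roughly what the "drain the queue just to find the latest file" mechanics described above look like (the queue URL is hypothetical, and the messages are assumed to be raw S3 event notifications):

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads"  # hypothetical

def latest_uploaded_key():
    newest = None  # (eventTime, key); ISO-8601 timestamps compare lexicographically
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,  # the per-call ceiling
                                   WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            return newest[1] if newest else None
        for msg in messages:
            for record in json.loads(msg["Body"]).get("Records", []):
                candidate = (record["eventTime"], record["s3"]["object"]["key"])
                newest = candidate if newest is None or candidate > newest else newest
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```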

Deleting events published by AWS S3 buckets which are still in queue to be processed by lambda

My architecture is:
1. Drop multiple files into an AWS S3 bucket
2. Lambda picks the files one by one and starts processing them
The problem is:
I am not able to stop the Lambda from processing the remaining files partway through. Even if I stop the Lambda and restart it, it picks up from where it left off.
Is there a way to achieve this?
You have no control over the events pushed by S3. You'll be better off just cancelling the Lambda subscription if you want to stop it for good, but I am afraid that events which have already been emitted will be processed as long as your Lambda is active.
What exactly are you trying to achieve?
If you want to limit the number of files your Lambda function processes at a time, you can just limit the concurrent executions of your function to 1, so it won't auto-scale based on demand.
Simply go to the function's Concurrency settings in the console, set the reserved concurrency to 1 and save it.
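The same setting can also be applied through the API, for example (the function name is hypothetical):

```python
import boto3

boto3.client("lambda").put_function_concurrency(
    FunctionName="my-file-processor",   # hypothetical
    ReservedConcurrentExecutions=1,     # process at most one file at a time
)
```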
Detach the Lambda's S3 trigger and add it again. This way only new events will be picked up, not the old ones.

Concatenate files in S3 using AWS Lambda

Is there a way to use Lambda for S3 file concatenation?
I have Firehose streaming data into S3 with the longest possible buffering interval (15 minutes or 128 MB), so I end up with 96 data files daily, but I want to aggregate all the data into a single daily data file for the fastest performance when reading the data later in Spark (EMR).
I created a solution where a Lambda function gets invoked when Firehose streams a new file into S3. The function reads (s3.GetObject) the new file from the source bucket and the concatenated daily data file (if it already exists with previous daily data; otherwise it creates a new one) from the destination bucket, decodes both response bodies to strings, appends one to the other, and writes the result to the destination bucket with s3.PutObject (which overwrites the previous aggregated file).
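For concreteness, the described flow looks roughly like this (the bucket names and daily key scheme are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
SRC_BUCKET = "firehose-output"     # hypothetical
DST_BUCKET = "daily-aggregates"    # hypothetical

def handler(event, context):
    new_key = event["Records"][0]["s3"]["object"]["key"]
    daily_key = "daily/" + new_key[:10] + ".log"   # hypothetical daily key scheme

    new_part = s3.get_object(Bucket=SRC_BUCKET, Key=new_key)["Body"].read().decode()
    try:
        existing = s3.get_object(Bucket=DST_BUCKET, Key=daily_key)["Body"].read().decode()
    except s3.exceptions.NoSuchKey:
        existing = ""   # first Firehose file of the day

    # Both the new file and the whole aggregate sit in memory at once,
    # which is what eventually exhausts the function's memory.
    s3.put_object(Bucket=DST_BUCKET, Key=daily_key, Body=existing + new_part)
```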
The problem is that when the aggregated file reaches 150+ MB, the Lambda function hits its ~1500 MB memory limit while reading the two files and then fails.
Currently I have a minimal amount of data, a few hundred MB per day, but this amount will grow exponentially in the future. It seems odd to me that Lambda has such low limits and that they are already reached with such small files.
Or, what are the alternatives for concatenating S3 data, ideally invoked by an S3 object-created event or some kind of scheduled job, for example scheduled daily?
I would reconsider whether you actually want to do this:
The S3 costs will go up.
The pipeline complexity will go up.
The latency from Firehose input to Spark input will go up.
If a single file injection into Spark fails (this will happen in a distributed system) you have to shuffle around a huge file, maybe slice it if injection is not atomic, upload it again, all of which could take very long for lots of data. At this point you may find that the time to recover is so long that you'll have to postpone the next injection…
Instead, unless it's impossible in the situation, if you make the Firehose files as small as possible and send them to Spark immediately:
You can archive S3 objects almost immediately, lowering costs.
Data is available in Spark as soon as possible.
If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
There's a tiny amount of latency increase from establishing TCP connections and authentication.
I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:
A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP.
An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this.
A live Spark/EMR/underlying data service instance ready to receive the data.
In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.
Of course, if it is not possible to keep Spark data ready (but not yet queryable) for a reasonable amount of money, this may not be an option. It may also be that injecting small chunks of data is extremely time consuming, but that seems unlikely for a production-ready system.
If you really need to chunk the data into daily dumps you can use multipart uploads. As a comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.
You can create a Lambda function that is invoked only once a day using Scheduled Events, and in that function use Upload Part - Copy, which does not require downloading the files to the Lambda function. There is already an example of this in this thread.
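A hedged sketch of that approach (the bucket, date prefix and destination key are hypothetical; remember that every part except the last must be at least 5 MB):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "firehose-output"          # hypothetical
DAILY_KEY = "daily/2024-01-01.log"  # hypothetical destination key
DAY_PREFIX = "2024/01/01/"          # hypothetical Firehose date prefix

def handler(event, context):
    keys = [o["Key"] for o in
            s3.list_objects_v2(Bucket=BUCKET, Prefix=DAY_PREFIX).get("Contents", [])]

    upload = s3.create_multipart_upload(Bucket=BUCKET, Key=DAILY_KEY)
    parts = []
    for number, key in enumerate(keys, start=1):
        # Server-side copy: the object bytes never pass through the Lambda.
        result = s3.upload_part_copy(
            Bucket=BUCKET, Key=DAILY_KEY,
            UploadId=upload["UploadId"],
            PartNumber=number,
            CopySource={"Bucket": BUCKET, "Key": key},
        )
        parts.append({"PartNumber": number, "ETag": result["CopyPartResult"]["ETag"]})

    s3.complete_multipart_upload(
        Bucket=BUCKET, Key=DAILY_KEY,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": parts},
    )
```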

Background jobs on amazon web services

I am new to AWS so I needed some advice on how to correctly create background jobs. I've got some data (about 30GB) that I need to:
a) download from some other server; it is a set of zip archives with links within an RSS feed
b) decompress into S3
c) process each file, or sometimes a group of decompressed files, perform transformations of the data, and store it into SimpleDB/S3
d) repeat forever depending on RSS updates
Can someone suggest a basic architecture for a proper solution on AWS?
Thanks.
Denis
I think you should run an EC2 instance to perform all the tasks you need and shut it down when done. This way you will pay only for the time EC2 runs. Depending on your architecture, however, you might need to keep it running all the time; small instances are very cheap anyway.
download from some other server; it is a set of zip archives with links within an RSS feed
You can use wget
decompress into S3
Try to use s3-tools (github.com/timkay/aws/raw/master/aws)
process each file or sometime group of decompressed files, perform transformations of data, and store it into SimpleDB/S3
Write your own bash script
repeat forever depending on RSS updates
One more bash script to check for updates + run the processing script via cron
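If you prefer Python to the shell tooling above, an equivalent sketch of steps a) through d) might look like this (the feed URL and bucket are hypothetical, and everything is kept in memory purely for brevity - stream to disk for 30 GB of archives):

```python
import io
import urllib.request
import xml.etree.ElementTree as ET
import zipfile

import boto3

FEED_URL = "https://example.com/archives.rss"   # hypothetical
BUCKET = "my-archive-bucket"                    # hypothetical

s3 = boto3.client("s3")

def run_once():
    # a) read the RSS feed and collect the archive links
    with urllib.request.urlopen(FEED_URL) as resp:
        feed = ET.fromstring(resp.read())
    links = [item.findtext("link") for item in feed.iter("item")]

    for link in filter(None, links):
        # a) download each zip archive
        with urllib.request.urlopen(link) as resp:
            archive = zipfile.ZipFile(io.BytesIO(resp.read()))
        # b) decompress its members straight into S3
        for name in archive.namelist():
            s3.put_object(Bucket=BUCKET, Key=name, Body=archive.read(name))
        # c) per-file transformations and SimpleDB/S3 writes would go here

if __name__ == "__main__":
    run_once()   # d) schedule this with cron to repeat on RSS updates
```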
First off, write some code that does a) through c). Test it, etc.
If you want to run the code periodically, it's a good candidate for using a background process workflow. Add the job to a queue; when it's deemed complete, remove it from the queue. Every hour or so add a new job to the queue meaning "go fetch the RSS updates and decompress them".
You can do it by hand using AWS Simple Queue Service or any other background job processing service / library. You'd set up a worker instance on EC2 or any other hosting solution that will poll the queue, execute the task, and poll again, forever.
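A minimal sketch of such a worker (the queue URL is hypothetical and do_job stands in for steps a) through c)): long-poll the queue, run the job, and delete the message only after it succeeds.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/rss-jobs"  # hypothetical

def do_job(body):
    """Stand-in for fetching the RSS updates and processing the archives."""
    print("working on:", body)

def worker_loop():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)   # long polling
        for msg in resp.get("Messages", []):
            do_job(msg["Body"])
            # Remove the job from the queue only once it has completed successfully.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    worker_loop()
```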
It may be easier to use Amazon Simple Workflow Service, which seems to be intended for what you're trying to do (automated workflows). Note: I've never actually used it.
I think deploying your code on an Elastic Beanstalk instance will do the job for you at scale, because you are processing a huge chunk of data here and a single plain EC2 instance might max out its resources, mostly memory. The AWS SQS idea of batching the processing will also help optimize the process and effectively manage timeouts on your server side.