What is the best way to know the latest file in an S3 bucket? - amazon-s3

I have a process that uploads files to S3. The rate at which these files are pumped to S3 is not constant. Another process needs to look at the latest files uploaded to this bucket and update, say, a watermark. We need a best-effort strategy to make this "latest file" information available as soon as possible.
S3 has event notification integration with SNS/SQS. Since I don't need a fan-out, I thought I could simply do an S3 -> SQS integration. But on digging deeper into SQS, I see that although there is no limit on the number of SQS queues you can have per account (I would need quite a lot of queues if I were to assign one per partition in S3), there is a limit on the maximum number of messages you can receive per call: 10.
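For reference, a minimal sketch of what that S3 -> SQS wiring looks like with the AWS SDK for Java (v1) called from Scala; the bucket name and queue ARN are placeholders, and in practice the queue's access policy must also allow S3 to send to it:

import java.util.EnumSet
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{BucketNotificationConfiguration, QueueConfiguration, S3Event}

object NotificationSetup {
  def main(args: Array[String]): Unit = {
    val s3 = AmazonS3ClientBuilder.defaultClient()
    // Placeholder ARN; the queue must grant s3.amazonaws.com permission to SendMessage.
    val queueArn = "arn:aws:sqs:us-east-1:123456789012:new-objects"
    val config = new BucketNotificationConfiguration()
      .addConfiguration("new-object-events",
        new QueueConfiguration(queueArn, EnumSet.of(S3Event.ObjectCreated)))
    // Deliver every ObjectCreated event in the bucket to the queue.
    s3.setBucketNotificationConfiguration("my-upload-bucket", config)
  }
}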
Though I can set up one SQS queue per partition, i.e. Q1 for root/child1, Q2 for root/child2, etc., the number of files getting pumped into each child folder could itself be massive. In that case, instead of trying to drain everything in the queue just to get the latest file in the child directory, is there any other mechanism I could apply?
Note that I am not 100% done with my POC and I certainly don't have the metrics yet. But given the long-polling trade-off (the longer you wait, the longer the delay before the latest file information comes out, so short polling is probably what I should use; but then a short poll may not query all SQS servers, so I would need multiple calls to get the latest event out of SQS, and I need to find a balance there), the 10-messages-per-call limit, etc., I doubt whether I am using the right tool for this problem. Am I missing something, or am I terribly wrong about SQS?
I have yet to experiment with SNS. Does it do rate limiting for events, as in "if there are 10,000 events per minute I will only send you the latest one"?
Please let me know the best way to get the latest file uploaded to S3 when the rate of uploads is high.
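For concreteness, here is a rough sketch (not a recommendation) of what the "drain the queue just to find the latest key" approach looks like with the AWS SDK for Java (v1) from Scala. The queue URL is a placeholder, and the crude regexes stand in for properly parsing the S3 notification JSON:

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import scala.annotation.tailrec
import scala.jdk.CollectionConverters._

object LatestKeyFromQueue {
  val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/child1-events" // placeholder
  val sqs = AmazonSQSClientBuilder.defaultClient()

  // Crude stand-ins for parsing the S3 notification JSON in the message body.
  private val keyPattern  = "\"key\"\\s*:\\s*\"([^\"]+)\"".r
  private val timePattern = "\"eventTime\"\\s*:\\s*\"([^\"]+)\"".r

  /** Drain the queue in batches of 10 and keep only the newest (eventTime, key) pair. */
  @tailrec
  def drain(latest: Option[(String, String)] = None): Option[(String, String)] = {
    val req = new ReceiveMessageRequest(queueUrl)
      .withMaxNumberOfMessages(10) // the hard per-call ceiling mentioned above
      .withWaitTimeSeconds(1)      // near-short poll; tune latency vs. empty receives
    val messages = sqs.receiveMessage(req).getMessages.asScala.toList
    if (messages.isEmpty) latest
    else {
      val candidates = latest.toList ++ messages.flatMap { m =>
        for {
          t <- timePattern.findFirstMatchIn(m.getBody).map(_.group(1))
          k <- keyPattern.findFirstMatchIn(m.getBody).map(_.group(1))
        } yield (t, k)
      }
      // We only care about the high-water mark, so every message can be deleted.
      messages.foreach(m => sqs.deleteMessage(queueUrl, m.getReceiptHandle))
      drain(candidates.sortBy(_._1).lastOption) // ISO-8601 timestamps sort lexicographically
    }
  }
}

Even with this, a high upload rate means many receive calls just to learn one key, which is exactly the concern raised in the question.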

Related

Seaweedfs: TTL options for S3 and/or volume

I'm really interested in the TTL feature for files in Seaweedfs. The only part missing in my understanding is the interaction with the S3 layer and at the volume level.
Currently my app uses S3 libraries to interact with the storage. It writes a lot of small files, which are never rewritten but are accessed from time to time. I also need to keep only x days of the written data, while older entries should be removed. Seaweedfs looks like a perfect solution to me, but as far as I understand the docs, I can only set a TTL using its own API.
So here are my two questions:
Can I somehow set the TTL for a file using S3 abstraction layer?
If not, can I have a default TTL value per bucket (or per volume, I guess), or per cluster?
Currently not. This can be a good feature to add.

Lambda function and triggered S3 events

I have a Lambda function that gets triggered every time a file is written to an S3 bucket. My understanding is that every time a single file comes in (this is a likely scenario, rather than a batch of files being sent), an invocation is fired and that means I am charged. My question is: can I batch multiple files so that an invocation only fires when, for example, I have a batch of 10 files? Is this a good practice? I should not end up with a processing time greater than 15 minutes, so using Lambda is still fine.
Thank you
You can use SQS to decouple this scenario: the Lambda trigger point will be SQS, and there you can set whatever batch size you want.
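As a rough illustration of that setup (assuming Scala on the JVM runtime and the aws-lambda-java-events library; the class name is made up), a single invocation then receives up to the configured batch size of messages:

import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.SQSEvent
import scala.jdk.CollectionConverters._

// With an SQS trigger configured with batch size 10, one invocation gets up to 10
// S3 notification messages, so you pay for one invocation per batch, not per file.
class BatchedS3Handler extends RequestHandler[SQSEvent, Unit] {
  override def handleRequest(event: SQSEvent, context: Context): Unit = {
    // Each record body is the S3 event JSON forwarded through the queue.
    val bodies = event.getRecords.asScala.map(_.getBody)
    context.getLogger.log(s"Processing a batch of ${bodies.size} S3 notifications")
    bodies.foreach(body => context.getLogger.log(body)) // placeholder for real processing
  }
}

The batch size itself (and the queue's visibility timeout relative to the function timeout) is configured on the event source mapping, not in the handler code.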
1 - One solution is to group your files into a rar archive and put that into S3. That way, for multiple files, your Lambda is triggered only once.
2 - The other solution, as kamprasad said, is to use SQS.
3 - One last solution I can think of is to use a cron job to trigger the Lambda on whatever schedule you need. Inside your Lambda, do the processing with threads to finish the task faster. Keep in mind that you have to choose the memory and timeout settings carefully in this scenario.
I've personally used the last solution quite frequently.

Apache Flink - s3 folder monitoring - many files missed

Good day everyone,
I have a Flink job that has an S3 folder as a source, and we keep putting thousands of small (around 1 KB each) gzip files into that folder, at a rate of about 5,000 files per minute.
Here is how I created that source in Scala:
import org.apache.flink.api.common.io.FilePathFilter
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.streaming.api.functions.source.FileProcessingMode

val my_input_format = new TextInputFormat(
  new org.apache.flink.core.fs.Path(my_path))
my_input_format.setFilesFilter(FilePathFilter.createDefaultFilter())
my_input_format.setNestedFileEnumeration(true)

// Re-scan the folder for new files every 1,000 ms.
val my_raw_stream = streamEnv
  .readFile(my_input_format,
    my_path,
    FileProcessingMode.PROCESS_CONTINUOUSLY,
    1000)
The problem is, with the monitoring interval of 1,000 ms as above, about 20% of the files were missed. In the Apache Flink Dashboard, at the subsequent operators I could only see ~80% of the total number of files recorded (the "Records sent" column).
If I increase the monitoring interval, the number of missed files decreases: at 5,000 ms it is about 10%, and at 30,000 ms only about 2% are missed.
No WARNINGs or ERRORs are logged, though.
I could not simulate this in HDFS, as I could not reach that high file writing speed in our cluster.
Could someone please help.
Thank you very much.
Amazon S3 provides eventual consistency for listing directories (see this question).
The monitoring source lists the files in a directory and tracks which files it processed by remembering their maximum modification timestamp. Since S3 listing is not guaranteed to be immediately consistent, the max modification timestamp might advance and files with a smaller timestamp might be missed.
I think the problem cannot be completely fixed by increasing the monitoring interval. Instead, we would need an additional parameter that adds an offset to the max timestamp. It would be great if you could reach out to the Flink community, either via the mailing list or by opening a Jira ticket.
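As a rough illustration of that idea (this is not the actual Flink source; the names are made up), the accepting condition would tolerate listings that show up late, at the cost of deduplicating paths that were already processed:

// Sketch only: accept files whose modification time is within an offset of the
// high-water mark, and rely on a seen-set to avoid processing a path twice.
val latenessOffsetMs = 60000L // hypothetical tolerance for eventually consistent listings

def shouldProcess(modTime: Long, path: String,
                  maxProcessedTs: Long,
                  seen: scala.collection.mutable.Set[String]): Boolean =
  modTime > (maxProcessedTs - latenessOffsetMs) && seen.add(path)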
============ UPDATE =============
I have implemented the change following Fabian's suggestion. Functionality-wise it is complete and working. I still need to spend some more time writing proper unit tests and documentation.
My implementation is here

Background jobs on amazon web services

I am new to AWS, so I need some advice on how to correctly create background jobs. I've got some data (about 30 GB) that I need to:
a) download from some other server; it is a set of zip archives with links within an RSS feed
b) decompress into S3
c) process each file, or sometimes a group of decompressed files, perform data transformations, and store the results in SimpleDB/S3
d) repeat forever depending on RSS updates
Can someone suggest a basic architecture for a proper solution on AWS?
Thanks.
Denis
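To make steps a) and b) concrete, here is a hedged sketch in Scala using the AWS SDK for Java (v1); the bucket name and object naming are made up, and it streams each archive and writes every decompressed entry to S3 without touching local disk:

import java.io.ByteArrayInputStream
import java.net.URL
import java.util.zip.ZipInputStream
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ObjectMetadata

object FetchAndDecompress {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  val bucket = "my-decompressed-data" // placeholder bucket name

  /** Stream one zip archive from a feed link and upload each entry to S3. */
  def fetchToS3(archiveUrl: String): Unit = {
    val zin = new ZipInputStream(new URL(archiveUrl).openStream())
    try {
      var entry = zin.getNextEntry
      while (entry != null) {
        if (!entry.isDirectory) {
          val bytes = zin.readAllBytes() // JDK 9+; buffers one entry in memory
          val meta = new ObjectMetadata()
          meta.setContentLength(bytes.length.toLong)
          s3.putObject(bucket, entry.getName, new ByteArrayInputStream(bytes), meta)
        }
        entry = zin.getNextEntry
      }
    } finally zin.close()
  }
}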
I think you should run an EC2 instance to perform all the tasks you need and shut it down when done. This way you will pay only for the time EC2 runs. Depending on your architecture, however, you might need to run it all the time; small instances are very cheap, though.
download from some other server; it is a set of zip archives with links within an RSS feed
You can use wget
decompress into S3
Try to use s3-tools (github.com/timkay/aws/raw/master/aws)
process each file or sometime group of decompressed files, perform transformations of data, and store it into SimpleDB/S3
Write your own bash script
repeat forever depending on RSS updates
One more bash script to check for updates + run the script via cron
First off, write some code that does a) through c). Test it, etc.
If you want to run the code periodically, it's a good candidate for using a background process workflow. Add the job to a queue; when it's deemed complete, remove it from the queue. Every hour or so add a new job to the queue meaning "go fetch the RSS updates and decompress them".
You can do it by hand using AWS Simple Queue Service or any other background job processing service / library. You'd set up a worker instance on EC2 or any other hosting solution that will poll the queue, execute the task, and poll again, forever.
It may be easier to use Amazon Simple Workflow Service, which seems to be intended for what you're trying to do (automated workflows). Note: I've never actually used it.
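A minimal sketch of such a worker, assuming SQS and the AWS SDK for Java (v1) from Scala; the queue URL and the processJob body are placeholders:

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import scala.jdk.CollectionConverters._

object RssWorker {
  val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/rss-jobs" // placeholder
  val sqs = AmazonSQSClientBuilder.defaultClient()

  def main(args: Array[String]): Unit = {
    while (true) {
      val req = new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20) // long poll
      for (msg <- sqs.receiveMessage(req).getMessages.asScala) {
        processJob(msg.getBody)                           // steps a) through c)
        sqs.deleteMessage(queueUrl, msg.getReceiptHandle) // remove only once deemed complete
      }
    }
  }

  def processJob(feedItemUrl: String): Unit = {
    // Placeholder: fetch the archive, decompress into S3, transform, store in SimpleDB/S3.
  }
}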
I think deploying your code on an Elastic Beanstalk instance will do the job for you at scale, because you are processing a huge chunk of data here and a normal EC2 instance might max out its resources, mostly memory. The AWS SQS idea of batching the processing will also help optimize the process and effectively manage timeouts on your server side.

Distributed datastore

We're trying to add some kind of persistence in our app.
The app generates about 250 entries per second. Each of these entries belongs to one of 2M files. For each file, we want to keep the last 10 entries so we can look them up later.
The way our client application works (see the sketch after this list):
it gets a stream of all the data
it fetches the right file (GET)
it adds the new content
it saves the file back (PUT)
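Here is a hedged sketch of that cycle against S3, using the AWS SDK for Java (v1) from Scala; the bucket name is a placeholder, and error handling and stream cleanup are omitted:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.io.Source

object LastTenEntries {
  val s3 = AmazonS3ClientBuilder.defaultClient()
  val bucket = "entries-bucket" // placeholder

  /** GET the file, append the new entry, keep only the last 10, PUT it back. */
  def append(fileKey: String, entry: String): Unit = {
    val existing =
      if (s3.doesObjectExist(bucket, fileKey))
        Source.fromInputStream(s3.getObject(bucket, fileKey).getObjectContent).getLines().toList
      else Nil
    val updated = (existing :+ entry).takeRight(10) // keep the last 10 entries per file
    s3.putObject(bucket, fileKey, updated.mkString("\n"))
  }
}

At 250 entries per second this is roughly 250 GETs and 250 PUTs per second, which is where the PUT cost mentioned below comes from.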
We're looking for an efficient way to store this data that can scale horizontally as the amount of data we're getting is doubling every few weeks.
We initially looked at S3. It works fine, but becomes very expensive very fast (>$1000 monthly just in PUT operations!).
We then gave Riak a shot, but it seems we can't get more than 60 writes/sec on each node, which is very, very slow.
Any other solution out there?
There are lots of knobs you can turn in Riak. Ask the mailing list if you haven't already and we'll figure out a sane configuration for you. 60 writes/sec is not the norm.
See: http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
What about Hadoop's HDFS spread over Amazon EC2 instances? I know each instance has a good amount of storage space, and you don't have to pay for put/get, only the inbound transfer.
I would suggest looking at CloudIQ Storage from Appistry. It's a fully distributed file store, accessible via a REST-based API, and it can run on commodity hardware. You can define the number of copies retained on a file-by-file basis. It supports an eventually consistent model, so you can balance file consistency with performance.