Event-driven Elastic Transcoder? - amazon-s3

Is there a way to set up a transcoding pipeline on AWS such that it automatically transcodes any new files uploaded to a particular S3 bucket and places the results in another bucket?
I know there is a REST API, and that in theory the uploader could also issue a REST request to the transcoder after it has uploaded the file, but for a variety of reasons, this isn't really an option.

This can now be accomplished using AWS Lambda.
Lambda basically allows you to trigger/run code in response to events. You could easily create a Lambda function that runs as soon as a new file is uploaded to a designated S3 bucket. The function would then start an Elastic Transcoder job for the newly uploaded video file.
This is literally one of the example use cases provided in the Lambda documentation.
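As a rough sketch of what that Lambda could look like in Python with boto3 (the pipeline ID and preset ID below are placeholders for an Elastic Transcoder pipeline you create beforehand; the pipeline itself defines the input and output buckets):

# Sketch of a Lambda handler that starts an Elastic Transcoder job
# for each newly uploaded S3 object. Pipeline and preset IDs are placeholders.
import urllib.parse
import boto3

transcoder = boto3.client('elastictranscoder')

PIPELINE_ID = '1111111111111-abcde1'  # placeholder: your pipeline's ID
PRESET_ID = '1351620000001-000010'    # placeholder: e.g. a generic 720p preset

def handler(event, context):
    for record in event['Records']:
        # Object key as delivered in the S3 event notification (URL-encoded)
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        transcoder.create_job(
            PipelineId=PIPELINE_ID,
            Input={'Key': key},
            Output={'Key': key + '.mp4', 'PresetId': PRESET_ID},
        )

Because the pipeline determines the output bucket, the handler only has to name the input key and the desired output key/preset.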

Related

Automatic sync S3 to local

I want to automatically sync my local folder with an S3 bucket. I mean, when a file changes in S3, that file should automatically be updated in my local folder.
I tried using a scheduled task and the AWS CLI, but I think there is a better way to do it.
Do you know of an app or a better solution?
Hope you can help me.
@mgg, you can mount the S3 bucket on the local server using s3fs; this way you can sync your local changes to the S3 bucket.
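For example, a typical s3fs mount looks something like this (assuming your keys are stored in ~/.passwd-s3fs; the bucket name and mount point are placeholders):
s3fs mybucket /mnt/mybucket -o passwd_file=${HOME}/.passwd-s3fs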
You could execute code (Lambda functions) that responds to events in a given bucket (such as a file being created, changed, or deleted). You could then have a simple HTTP service that receives a POST or GET request from that Lambda and updates your local data accordingly (see the sketch after the links below).
Read more:
Tutorial: Using AWS Lambda with Amazon S3
Working with Lambda Functions
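A rough Python sketch of such a Lambda, using only the standard library (the endpoint URL is a hypothetical placeholder for your own HTTP service):

# Sketch of a Lambda that forwards S3 event details to your own HTTP service.
# The endpoint URL is a placeholder; it must be reachable from AWS.
import json
import urllib.request

ENDPOINT = 'https://sync.example.com/s3-events'  # hypothetical endpoint

def handler(event, context):
    for record in event['Records']:
        payload = {
            'event': record['eventName'],              # e.g. ObjectCreated:Put, ObjectRemoved:Delete
            'bucket': record['s3']['bucket']['name'],
            'key': record['s3']['object']['key'],
        }
        req = urllib.request.Request(
            ENDPOINT,
            data=json.dumps(payload).encode('utf-8'),
            headers={'Content-Type': 'application/json'},
        )
        urllib.request.urlopen(req, timeout=10)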
The other approach (I don't recommend it) is to have some code "polling" for changes in the bucket and then reflecting those changes locally. At first glance it looks easier to implement, but it gets complicated when you try to handle more than just creation events.
And of course, on each cycle your polling component has to check every element in your local directory against every element in the bucket, which is a performance-killing approach!

Merging pdf files stored on Amazon S3

Currently I download all my PDF files to my server and then use pdfbox to merge them. It works perfectly fine, but it's very slow since I have to download them all.
Is there a way to perform all of this directly on S3? I've tried to find a way to do it, in Java or even Python, but haven't been able to.
I read the following:
Merging files on S3 Amazon
https://github.com/boazsegev/combine_pdf/issues/18
Is there a way to merge files stored in S3 without having to download them?
EDIT
The way I ended up doing it was using concurrent.futures, implemented with concurrent.futures.ThreadPoolExecutor. I set a maximum of 8 worker threads to download all the PDF files from S3.
Once all files were downloaded I merged them with pdfbox. Simple.
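As a rough illustration of that approach (Python with boto3; the bucket name and key list are placeholders, and the merge still happens afterwards with pdfbox):

# Sketch of the parallel-download step described above.
# Bucket name and keys are placeholders; requires boto3 and AWS credentials.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-pdf-bucket'  # placeholder

def download(key):
    local_path = '/tmp/' + key.split('/')[-1]
    s3.download_file(BUCKET, key, local_path)
    return local_path

def download_all(keys):
    # 8 worker threads, as in the approach described above
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(download, keys))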
S3 is just a data store, so at some level you need to transfer the PDF files from S3 to a server and then back. You'll probably gain the best speed by doing your conversions on an EC2 instance located in the same region as your S3 bucket.
If you don't want to spin up an EC2 instance yourself just to do this then another alternative may be to make use of AWS Lambda, which is a compute service where you can upload your code and have AWS manage the execution of it.

Adding new documents automatically to Amazon CloudSearch Domain

I have created a CloudSearch domain. Using the CLI, data is successfully uploaded to the domain. Data is copied from S3 to the CloudSearch domain using the command:
cs-import-documents -d searchdev3 --source s3://mybucket/html
I am wondering how data will be added to the search domain later, when a new file is added to the S3 bucket.
Can we do any of the following:
Create some kind of schedule that will upload the documents to the search domain, or
Automatically detect when a new file is added to S3 and upload it directly to the search domain?
The above options seem feasible, but performing the upload operation manually every time does not seem like a good idea at all.
I'd use the AWS Lambda event processing service. It is pretty simple to set up an event stream based on S3 (see http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html).
Your Lambda would then submit a search document to CloudSearch based on the S3 event. For an example of submitting a document from a Lambda, see https://gist.github.com/fzakaria/4f93a8dbf483695fb7d5
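As a rough Python/boto3 sketch of such a Lambda (the document endpoint and the field names are placeholders for your own domain's schema):

# Sketch of a Lambda that adds a CloudSearch document for each new S3 object.
# The document endpoint and field names are placeholders.
import json
import boto3

# Document endpoint of your CloudSearch domain (placeholder)
cloudsearch = boto3.client(
    'cloudsearchdomain',
    endpoint_url='https://doc-searchdev3-xxxxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com',
)

def handler(event, context):
    for record in event['Records']:
        key = record['s3']['object']['key']
        batch = [{
            'type': 'add',
            'id': key.replace('/', '_'),
            'fields': {'title': key},  # placeholder field
        }]
        cloudsearch.upload_documents(
            documents=json.dumps(batch),
            contentType='application/json',
        )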

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL is for Hadoop to read files from S3. If you want to read an S3 file in your map program, you either need to use a library that handles the s3:// URL format, such as jets3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access the S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional security. Then you can access it using the HTTP URL handling built into Java.
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
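If the job happens to use Hadoop streaming with Python, a minimal sketch of reading the static file inside the mapper could look like this (boto3; bucket and key are placeholders, and the cluster's instance role needs s3:GetObject):

# Sketch of a streaming mapper that loads a static side-data file from S3 once,
# then filters its stdin against it. Bucket and key are placeholders.
import sys
import boto3

s3 = boto3.client('s3')

# Fetch the side data once when the mapper process starts
obj = s3.get_object(Bucket='my-bucket', Key='path/to/static-file.txt')
lookup = set(obj['Body'].read().decode('utf-8').splitlines())

for line in sys.stdin:
    value = line.strip()
    if value in lookup:
        print(value + '\t1')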
What I ended up doing:
1) Wrote a small script that copies my file from S3 to the cluster:
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script from 1).
This approach doesn't require making the S3 data public.

Allowing users to download files as a batch from AWS s3 or Cloudfront

I have a website that allows users to search for music tracks and download the ones they select as MP3s.
I have the site on my server and all of the MP3s on S3, distributed via CloudFront. So far so good.
The client now wishes for users to be able to select a number of music tracks and then download them all in bulk, as a batch, instead of one at a time.
Usually I would place all the files in a zip and then present the user with a link to that new zip file to download. In this case, as the files are on S3, that would require me to first copy all the files from S3 to my web server, process them into a zip, and then serve the download from my server.
Is there any way I can create a zip on S3 or CloudFront, or is there some way to batch/group files into a zip?
Maybe I could set up an EC2 instance to handle this?
I would greatly appreciate some direction.
Best
Joe
I'm afraid you won't be able to create the batches without additional processing. Firing up an EC2 instance might be an option to create a batch per user.
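If you do go the EC2 (or Lambda) route, a rough Python/boto3 sketch of the zipping step could look like this (bucket names, track keys, and the output key are placeholders; everything is buffered in memory, so for very large batches you would want to stream to disk instead):

# Sketch of building a zip from a set of S3 objects and uploading it back to S3.
# Bucket names, track keys, and the output key are placeholders.
import io
import zipfile
import boto3

s3 = boto3.client('s3')
SOURCE_BUCKET = 'my-music-bucket'    # placeholder
OUTPUT_BUCKET = 'my-batches-bucket'  # placeholder

def build_batch(track_keys, output_key):
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
        for key in track_keys:
            body = s3.get_object(Bucket=SOURCE_BUCKET, Key=key)['Body'].read()
            archive.writestr(key.split('/')[-1], body)
    buffer.seek(0)
    s3.put_object(Bucket=OUTPUT_BUCKET, Key=output_key, Body=buffer.getvalue())
    # Give the user a temporary pre-signed link to the finished zip
    return s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': OUTPUT_BUCKET, 'Key': output_key},
        ExpiresIn=3600,
    )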
I am facing the exact same problem. So far the only thing I was able to find is Amazon's aws s3 sync tool:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
In my case, I am using Rails with its Paperclip add-on, which means that I have no way to easily download all of a user's images in one go, because the files are scattered across a lot of subdirectories.
However, if you can group your user's files in a better way, say like this:
/users/<ID>/images/...
/users/<ID>/songs/...
...etc., then you can solve your problem right away with:
aws s3 sync s3://<your_bucket_name>/users/<user_id>/songs /cache/<user_id>
Do keep in mind that you'll have to give your server the proper credentials so the S3 CLI tools can work without prompting for usernames/passwords.
And that should sort you out.
Additional discussion here:
Downloading an entire S3 bucket?
S3 access is based on individual HTTP requests, one per object.
So the answer is to use threads to achieve the same thing.
The Java API offers TransferManager for this:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html
You can get great performance with multiple threads.
There is no bulk download, sorry.