Uploading large files with Carrierwave to S3 - file-upload

So I will need to upload large files (zip files that are a few GB large) to S3, and I would like Carrierwave to manage the download/distribution of those files.
Meaning, when a user pays Carrierwave can automagically generate the dynamic URL and send it to them. I know how to do this already, but it just occurred to me that I have never uploaded files via Carrierwave that are bigger than a few dozen MB, much less a few GB to S3.
Given the flakiness of HTTP connections, I figure this is a suboptimal way to do it.
I don't have that many files to upload (maybe 10 - 20 max), and users won't be uploading them. It will be a storefront where the customers will be buying/downloading the files, not uploading them.
It would be nice if there was a way for me to upload the files into my S3 bucket separately (say FTP, git, or some other mechanism) and then just link it to my app through Carrierwave in some way.
What's the best way to approach this?

Also, don't forget that you will encounter the Heroku 30 second timeout when you are uploading the file in the first place.
Don't worry though, there are options:
Direct Upload - S3 supports direct upload where you present a form which uploads directly to s3 bypassing Heroku, you then receive a call back into your application with the uploaded files details for you to process (https://github.com/dwilkie/carrierwave_direct)
Upload to S3 and then expose bucket/folder in your application to connect to your models. We do this approach with a number of clients. They use Transmit (Mac Client) to upload large assets to S3 and then visit their app to link the asset to a Rails model.
Also, I'm pretty sure S3 is an HTTP based service so you're only going to be able to upload via HTTP.

Related

File Service Architecture & Cost Analysis

Context
I am developping a webapp that
Takes an URL from the user
Downloads and stores the associated file onto my server
The user can fetch the file from my server at any time before the file is eventually expired and removed
I am planning to deploy this application on the AWS. More specifically, using EC2 and S3.
Challenge
I am trying to come up with a design that is both cost-effective and performant to offer this service.
Analysis
The following assumptions are used:
the downloaded file will be available to only one user, the one who provided the URL and initiated the download
the user will only fetch the file once from the server
the file will only stay on the server for at maximum 24 hours before getting removed
the file sizes are in the 100MB - 5GB range
Consider the following application flow:
Internet → EC2: Download the file onto local storage
EC2 → S3: Upload the downloaded file onto S3, deletes the local copy on EC2
EC2 → User: Provide the user with a direct URL to fetch from S3
S3 → user: The user fetches the file from S3
S3: file is removed after 24 hours.
In terms of network performance, step 1 and 2 will be the bottlenecks as EC2 has limited downloading and uploading bandwidth. Step 4 should not be a problem since S3 is taking care of the bandwidth for transferring file to the end user.
In terms of costs, fixed costs are the EC2 instances, and the main variable cost is step 4, where AWS charges 0.09$/GB in data transfer. Since the files are removed after 24 hours, the storage fee is comparatively tiny.
Question
Have I correctly identified the performance bottlenecks in this application flow?
Is my cost analysis correct?
Is this the optimal flow in terms of costs? Is there any way to further reduce the cost?
Since step 1 and step 2 (downloading from Internet and uploading to S3) will be very bandwidth-consuming when simultaneously downloading multiple large files, will it significantly affect the responsiveness of my server to serve regular API requests? Should I use a dedicated EC2 instance just for handling API calls from the clients, and another dedicated EC2 instance just for downloading and uploading? This will slightly further complicate the design, as I will have to manage the communication between the 2 instances as well.
Can you use more AWS Services? Are you aware of AWS Lambda? https://aws.amazon.com/lambda/details/ It can perform actions in response to actions, e.g. it could delete a file from S3 shortly after it is downloaded. http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
This alleviates the need to track downloads and delete them, once you get past the learning curve of AWS Lambda. It can also handle other processing, so you only have to upload to S3 from EC2.
Regarding cost, S3 has different quality levels, and the "reduced redundancy" might be sufficient for your needs, saving a little money.
How about allowing the client to upload files directly to S3?
Your application would generate a pre-signed url, so that you can control which users can upload files, but after that the client interacts directly with S3. This would remove the costly "download then upload" process in steps 1 & 2.
See this document http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlUploadObject.html

AWS s3 configuration to avoid waiting time for multiple request

I have static content uploaded on S3 bucket.
When I hit URL for the First time, the contents take while to load. It has a single html page with multiple CSS and JS.
So is there any kind on configuration needed at S3 level to optimize.
I am trying to figure out settings such as number of connections like we have in Apache.
There are no configurations available for Amazon S3. It just works!
Some ideas for speeding your download:
Create a bucket that is located closer to you/your users (less latency)
Zip your files before uploading to Amazon S3 (faster download)
Check the Network console in your web browser to determine where the time is being taken

AWS S3 and AjaXplorer

I'm using AjaXplorer to give access to my clients to a shared directory stored in Amazon S3. I installed the SD, configured the plugin (http://ajaxplorer.info/plugins/access/s3/) and could upload and download files but the upload size is limited to my host PHP limit which is 64MB.
Is there a way I can upload directly to S3 without going over my host to improve speed and have S3 limit, no PHP's?
Thanks
I think that is not possible, because the server will first climb to the PHP file and then make transfer to bucket.
Maybe
The only way around this is to use some JQuery or JS that can bypass your server/PHP entirely and stream directly into S3. This involves enabling CORS and creating a signed policy on the fly to allow your uploads, but it can be done!
I ran into just this issue with some inordinately large media files for our website users that I no longer wanted to host on the web servers themselves.
The best place to start, IMHO is here:
https://github.com/blueimp/jQuery-File-Upload
A demo is here:
https://blueimp.github.io/jQuery-File-Upload/
This was written to upload+write files to a variety of locations, including S3. The only tricky bits are getting your MIME type correct for each particular upload, and getting your bucket policy the way you need it.

Allowing users to download files as a batch from AWS s3 or Cloudfront

I have a website that allows users to search for music tracks and download those they they select as mp3.
I have the site on my server and all of the mp3s on s3 and then distributed via cloudfront. So far so good.
The client now wishes for users to be able to select a number of music track and then download them all in bulk or as a batch instead of 1 at a time.
Usually I would place all the files in a zip and then present the user a link to that new zip file to download. In this case, as the files are on s3 that would require I first copy all the files from s3 to my webserver process them in to a zip and then download from my server.
Is there anyway i can create a zip on s3 or CF or is there someway to batch / group files in to a zip?
Maybe i could set up an EC2 instance to handle this?
I would greatly appreciate some direction.
Best
Joe
I am afraid you won't be able to create the batches w/o additional processing. firing up an EC2 instance might be an option to create a batch per user
I am facing the exact same problem. So far the only thing I was able to find is Amazon's s3sync tool:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
In my case, I am using Rails + its Paperclip addon which means that I have no way to easily download all of the user's images in one go, because the files are scattered in a lot of subdirectories.
However, if you can group your user's files in a better way, say like this:
/users/<ID>/images/...
/users/<ID>/songs/...
...etc., then you can solve your problem right away with:
aws s3 sync s3://<your_bucket_name>/users/<user_id>/songs /cache/<user_id>
Do have in mind you'll have to give your server the proper credentials so the S3 CLI tools can work without prompting for usernames/passwords.
And that should sort you.
Additional discussion here:
Downloading an entire S3 bucket?
s3 is single http request based.
So the answer is threads to achieve the same thing
Java api - uses TransferManager
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html
You can get great performance with multi threads.
There is no bulk download sorry.

How to receive an uploaded file using node.js formidable library and save it to Amazon S3 using knox?

I would like to upload a form from a web page and directly save the file to S3 without first saving it to disk. This node.js app will be deployed to Heroku, where there is no local disk to save the file to.
The node-formidable library provides a great way to upload files and save them to disk. I am not sure how to turn off formidable (or connect-form) from saving file first. The Knox library on the other hand provides a way to read a file from the disk and save it on Amazon S3.
1) Is there a way to hook into formidable's events (on Data) to send the stream to Knox's events, so that I can directly save the uploaded file in my Amazon S3 bucket?
2) Are there any libraries or code snippets that can allow me to directly take the uploaded file and save it Amazon S3 using node.js?
There is a similar question here but the answers there do not address NOT saving the file to disk.
It looks like there is no good way to do it. One reason might be that the node-formidable library saves the uploaded file to disk. I could not find any options to do otherwise. The knox library takes the saved file on the disk and using your Amazon S3 credentials uploads it to Amazon.
Since on Heroku I cannot save files locally, I ended up using transloadit service. Though their authentication docs have some learning curve, I found the service useful.
For those who want to use transloadit using node.js, the following code sample may help (transloadit page had only Ruby and PHP examples)
var crypto, signature;
crypto = require('crypto');
signature = crypto.createHmac("sha1", 'auth secret').
update('some string').
digest("hex")
console.log(signature);
this is Andy, creator of AwsSum:
https://github.com/appsattic/node-awssum/
I just released v0.2.0 of this library. It uploads the files that were created by Express' bodyParser() though as you say, this won't work on Heroku:
https://github.com/appsattic/connect-stream-s3
However, I shall be looking at adding the ability to stream from formidable directly to S3 in the next (v0.3.0) version. For the moment though, take a look and see if it can help. :)