Ruby on Rails timeout during large uploads - amazon-s3

I am trying to upload files to AWS S3 using Ruby on Rails. The code works well for small uploads, but for uploads larger than 3-4 MB I get a timeout error. I am uploading the files to S3 with this code:
AWS::S3::S3Object.store(filename, params[:file].read, @BUCKET_NAME, :access => :private)
How can I resolve this for larger uploads? Can I increase the timeout interval for Ruby scripts to allow larger uploads?
Please help...

I would suggest taking advantage of the recent CORS support to upload directly to S3. I tried to explain clearly how to use it here: http://pjambet.github.com/blog/direct-upload-to-s3/

Assuming you are using the aws-s3 gem:
When you are dealing with large files, you have to pass an I/O stream so that the file is read in segments rather than loaded into memory all at once with .read.
Instead you might use something like this:
S3Object.store('roots.mpeg', open('roots.mpeg'), @BUCKET_NAME, :access => :private)
More details can be found: http://amazon.rubyforge.org/
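To illustrate what "read in segments" means, here is a minimal Python sketch. It is not the aws-s3 gem's implementation; the function name `iter_chunks` and the chunk sizes are assumptions made up for this example. It reads a binary stream in fixed-size pieces so the whole file never has to sit in memory at once.

```python
import io

def iter_chunks(stream, chunk_size=1024 * 1024):
    """Yield successive chunks of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        yield chunk

# Usage: an in-memory stream stands in for an uploaded file (hypothetical data).
data = io.BytesIO(b"x" * 2500)
sizes = [len(c) for c in iter_chunks(data, chunk_size=1000)]
print(sizes)  # [1000, 1000, 500]
```

An uploader that accepts a stream can send each chunk as it is read, which is the behaviour the answer above relies on when passing open(...) instead of .read.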

I would suggest using HTTP streaming for long requests.

Related

RecordRTC: merge blobs server-side with PHP

I need to build a recording feature on top of a web conferencing app that makes use of WebRTC. To do this I am using the RecordRTC js library.
The recording is NOT uploaded at the end of the call, but for practical reasons every 3 seconds one portion of the stream is uploaded from client to server. This is to avoid waiting at the end for a large upload.
Here's the JavaScript:
RTC_recorder = RecordRTC(stream, {
    type: 'video',
    mimeType: 'video/webm;codecs=vp8',
    timeSlice: 3000,
    ondataavailable: function(blob) {
        upload_to_server(blob);
    }
});
I have been able to save separate blobs on the server:
-blob1.webm (readable video)
-blob2.webm (not readable)
-blob3.webm (not readable)
But unfortunately, I don't understand how to merge the blobs into 1 video (SERVER SIDE), and haven't found any working example in the documentation, nor any clear answer to this question.
Can anyone help?
Thanks.
Concatenating the files without any further modification should result in a valid file.
A simple search revealed this question which was about how concatenating files works in PHP.
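The question asks about PHP, but the idea is language-independent, so here is a minimal Python sketch of plain concatenation. It assumes the chunks are stored as blob1.webm, blob2.webm, ... as listed in the question; the function name `merge_chunks`, the regex, and the dummy payloads are made up for illustration. The only subtle point is ordering: the chunks have to be appended in numeric order, not alphabetical order, or blob10 would land before blob2.

```python
import os
import re
import shutil
import tempfile

def merge_chunks(chunk_dir, output_path, pattern=r"blob(\d+)\.webm$"):
    """Append chunk files to output_path in numeric order; return the ordered names."""
    numbered = []
    for name in os.listdir(chunk_dir):
        match = re.match(pattern, name)
        if match:
            numbered.append((int(match.group(1)), name))
    numbered.sort()  # numeric sort so blob10 comes after blob2
    with open(output_path, "wb") as out:
        for _, name in numbered:
            with open(os.path.join(chunk_dir, name), "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    return [name for _, name in numbered]

# Usage with dummy byte strings standing in for real WebM chunks.
chunk_dir = tempfile.mkdtemp()
for index, payload in [(1, b"header+"), (2, b"part2+"), (10, b"part10")]:
    with open(os.path.join(chunk_dir, "blob%d.webm" % index), "wb") as f:
        f.write(payload)

merged_path = os.path.join(tempfile.mkdtemp(), "recording.webm")
order = merge_chunks(chunk_dir, merged_path)
print(order)  # ['blob1.webm', 'blob2.webm', 'blob10.webm']
```

Whether the merged file plays correctly depends on the chunks being consecutive slices of a single recording, which is what the answer above assumes.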

Locally reading S3 files through Spark (or better: pyspark)

I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret
Access Key must be specified as the username or password
(respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId
or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or months, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
Yes, you have to use s3n instead of s3. s3 is some weird abuse of S3 the benefits of which are unclear to me.
You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:
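rdd = sc.hadoopFile('s3n://my_bucket/my_file',
    'org.apache.hadoop.mapred.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',  # key class: byte offset of each line
    'org.apache.hadoop.io.Text',          # value class: the line itself
    conf={
        'fs.s3n.awsAccessKeyId': '...',
        'fs.s3n.awsSecretAccessKey': '...',
    })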
The problem was actually a bug in Amazon's boto Python module, related to the fact that the MacPorts version is old. Installing boto through pip solved the problem: ~/.aws/credentials was then read correctly.
Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and can have some serious bugs that are very easy to run into. For the first problem, I would recommend first updating the AWS command line interface, boto and Spark every time something strange happens: this has "magically" solved a few issues for me already.
Here is a solution on how to read the credentials from ~/.aws/credentials. It makes use of the fact that the credentials file is an INI file which can be parsed with Python's configparser.
import os
import configparser
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile = 'default' # your AWS profile to use
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
See also my gist at https://gist.github.com/asmaier/5768c7cda3620901440a62248614bbd0 .
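As a possible way to tie this to the conf dict used in the other answers, the sketch below wraps the same configparser lookup in a function that returns the two fs.s3n.* properties named in the error message. The function name `s3n_conf_from_credentials` and the throwaway credentials file are assumptions for illustration only; the key names in the INI file are the ones used in the snippet above.

```python
import configparser
import os
import tempfile

def s3n_conf_from_credentials(path="~/.aws/credentials", profile="default"):
    """Build a Hadoop conf dict for s3n from an AWS credentials INI file."""
    config = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it managed to parse.
    if not config.read(os.path.expanduser(path)):
        raise FileNotFoundError(path)
    return {
        "fs.s3n.awsAccessKeyId": config.get(profile, "aws_access_key_id"),
        "fs.s3n.awsSecretAccessKey": config.get(profile, "aws_secret_access_key"),
    }

# Usage with a throwaway credentials file holding placeholder values.
with tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False) as f:
    f.write("[default]\naws_access_key_id = FOOBAR\naws_secret_access_key = BARFOO\n")
    sample_path = f.name

conf = s3n_conf_from_credentials(sample_path)
os.remove(sample_path)
print(sorted(conf))  # ['fs.s3n.awsAccessKeyId', 'fs.s3n.awsSecretAccessKey']
```

The resulting dict should be usable as the conf= argument of sc.hadoopFile, so the keys never have to appear in the URL.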
Setting environment variables could help.
The Spark FAQ, under the question "How can I access data in S3?", suggests setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
I cannot say much about the Java objects you have to give to the hadoopFile function, only that this function already seems to be deprecated in favor of "newAPIHadoopFile". The documentation on this is quite sketchy and I feel like you need to know Scala/Java to really get to the bottom of what everything means.
In the meantime, I figured out how to actually get some S3 data into pyspark and I thought I would share my findings.
The Spark API documentation says that it takes a dict that gets converted into a Java configuration (XML). I found the configuration for Java, which should probably reflect the values you should put into the dict: How to access S3/S3n from local hadoop installation
bucket = "mycompany-mydata-bucket"
prefix = "2015/04/04/mybiglogfile.log.gz"
filename = "s3n://{}/{}".format(bucket, prefix)
config_dict = {"fs.s3n.awsAccessKeyId": "FOOBAR",
               "fs.s3n.awsSecretAccessKey": "BARFOO"}
rdd = sc.hadoopFile(filename,
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',  # key class: byte offset of each line
                    'org.apache.hadoop.io.Text',          # value class: the line itself
                    conf=config_dict)
This code snippet loads the file from the bucket and prefix (file path in the bucket) specified on the first two lines.

Access EXIF data from file uploaded with Paperclip in tests

Can I safely use self.image.staged_path to access a file that is uploaded to Amazon S3 using Paperclip? I noticed that I can use self.image.url (which returns https...s3....file) to read EXIF from the file on S3 in the production and development environments. I can't use the same approach in tests, though.
I found the staged_path method, which allows me to read EXIF from the file in all environments (it returns something like: /var/folders/dv/zpc...-6331-fq3gju).
I couldn't find more information about this method, so the question is: does anyone have experience with it and could advise on the reliability of this approach? I'm reading the EXIF data in a before_post_process callback:
before_post_process :load_date_from_exif
def load_date_from_exif
  ...
  EXIFR::JPEG.new(self.image.staged_path).date_time
  ...
end

How can I _know_ which JSON renderer is active in my Rails 3 app?

This is a direct follow-on to this question: What is the fastest way to render json in rails?
My app does a database query and renders the result to JSON for a JS callback. It takes at least 8 seconds for a small (1 MB) dataset, and more like 20 for a large (3.5 MB) one. This is basically going to kill my app as an idea. My users aren't going to put up with this sort of wait.
I've read about multi_json and oj and yajl, and I think I've got them installed, but none of the ways I've tried to activate the various gems in my Gemfile show any improvement in serialization time. How can I prove that I'm using one over the other, so that I can compare results between them? I can't find any way of outputting (to the Rails debug log or the JS console in the browser) which library actually got used for the 'render :json => @data' call.
Instead of fiddling with your controller, a better way is to use the Rails console, like so:
$ rails console
Loading development environment (Rails 3.2.8)
1.8.7 :001 > MultiJson.engine
=> MultiJson::Adapters::JsonGem
You can interact directly with your Rails stack that way.
I finally figured out I could do 'render :text => MultiJson.engine' in my controller. This yielded "MultiJson::Engines::Oj".
It confirms that I'm already using the supposedly fastest JSON library, and I may be hosed. I guess I'll try to return pure text through the controller (which takes 2 seconds compared to 8) and see how fast a routine to convert that to a hash will take...

I need Multi-Part DOWNLOADS from Amazon S3 for huge files

I know Amazon S3 added the multi-part upload for huge files. That's great. What I also need is a similar functionality on the client side for customers who get part way through downloading a gigabyte plus file and have errors.
I realize browsers have some level of retry and resume built in, but when you're talking about huge files I'd like customers to be able to pick up where they left off regardless of the type of error.
Any ideas?
Thanks,
Brian
S3 supports the standard HTTP "Range" header if you want to build your own solution.
S3 Getting Objects
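As a rough sketch of the build-your-own approach, resuming comes down to asking only for the bytes you do not have yet. The Python snippet below derives the Range header from the size of a partially downloaded file. The helper name `resume_range_header` and the temp file are made up for illustration, and sending the request and appending the response body are left out.

```python
import os
import tempfile

def resume_range_header(partial_path):
    """Return a Range header dict that requests the bytes missing from partial_path."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # "bytes=N-" asks for everything from byte N to the end of the object.
    return {"Range": "bytes=%d-" % offset}

# Usage: pretend 1024 bytes were saved before the connection dropped.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 1024)
    partial = f.name

headers = resume_range_header(partial)
os.remove(partial)
print(headers)  # {'Range': 'bytes=1024-'}
```

A server that honors the header replies with 206 Partial Content, and the body can then be appended to the partial file.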
I use aria2c. For private content, you can use "GetPreSignedUrlRequest" to generate temporary private URLs that you can pass to aria2c.
S3 has a feature called byte range fetches. It’s kind of the download complement to multipart upload:
Using the Range HTTP header in a GET Object request, you can fetch a byte-range from an object, transferring only the specified portion. You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request. Fetching smaller ranges of a large object also allows your application to improve retry times when requests are interrupted. For more information, see Getting Objects.
Typical sizes for byte-range requests are 8 MB or 16 MB. If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance. GET requests can directly address individual parts; for example, GET ?partNumber=N.
Source: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
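To make the byte-range idea concrete, here is a small Python sketch that splits an object of known size into inclusive ranges and formats them as Range header values. The function names `byte_ranges` and `range_headers` are hypothetical, and the 8 MB default is simply the typical size quoted above. Each value could be sent on its own connection and retried independently.

```python
def byte_ranges(total_size, part_size=8 * 1024 * 1024):
    """Return inclusive (start, end) byte offsets covering total_size bytes."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1  # HTTP ranges are inclusive
        ranges.append((start, end))
        start = end + 1
    return ranges

def range_headers(total_size, part_size=8 * 1024 * 1024):
    """Format each (start, end) pair as a Range header value."""
    return ["bytes=%d-%d" % (start, end)
            for start, end in byte_ranges(total_size, part_size)]

# Usage: a 20 MB object split into 8 MB ranges.
MB = 1024 * 1024
print(range_headers(20 * MB))
# ['bytes=0-8388607', 'bytes=8388608-16777215', 'bytes=16777216-20971519']
```

The last range is shorter than the others, so a client should write each response at its start offset rather than assume equal part sizes.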
Just an update on the current situation: S3 natively supports multipart GET as well as PUT. https://youtu.be/uXHw0Xae2ww?t=1459
NOTE: For Ruby users only
Try the aws-sdk gem for Ruby and download the file like this:
object = Aws::S3::Object.new(...)
object.download_file('path/to/file.rb')
It downloads large files using multipart by default.
Files larger than 5 MB are downloaded using the multipart method.
http://docs.aws.amazon.com/sdkforruby/api/Aws/S3/Object.html#download_file-instance_method