I know Amazon S3 added the multi-part upload for huge files. That's great. What I also need is a similar functionality on the client side for customers who get part way through downloading a gigabyte plus file and have errors.
I realize browsers have some level of retry and resume built in, but for huge files I'd like customers to be able to pick up where they left off regardless of the type of error.
Any ideas?
Thanks,
Brian
S3 supports the standard HTTP "Range" header if you want to build your own solution.
S3 Getting Objects
I use aria2c. For private content, you can use "GetPreSignedUrlRequest" to generate temporary private URLs that you can pass to aria2c.
S3 has a feature called byte range fetches. It’s kind of the download complement to multipart upload:
Using the Range HTTP header in a GET Object request, you can fetch a byte-range from an object, transferring only the specified portion. You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request. Fetching smaller ranges of a large object also allows your application to improve retry times when requests are interrupted. For more information, see Getting Objects.
Typical sizes for byte-range requests are 8 MB or 16 MB. If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance. GET requests can directly address individual parts; for example, GET ?partNumber=N.
Source: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
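As a rough illustration of how byte-range fetches support resuming, here is a minimal Python sketch that only computes the Range header values for the part of an object still missing. The names byte_ranges and range_header are hypothetical helpers, not part of any AWS SDK, and the 8 MB chunk size is simply one of the typical sizes quoted above. Sending the requests (with an HTTP client or a presigned URL) is left out.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB, one of the typical sizes quoted above


def byte_ranges(total_size: int, start: int = 0,
                chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first, last) byte offsets covering start..total_size-1."""
    offset = start
    while offset < total_size:
        last = min(offset + chunk_size, total_size) - 1
        yield offset, last
        offset = last + 1


def range_header(first: int, last: int) -> str:
    """Format an HTTP Range header value; both offsets are inclusive."""
    return f"bytes={first}-{last}"


# Resuming: 20 MB of a 30 MB object are already on disk, so request only the rest.
already_have = 20 * 1024 * 1024
total = 30 * 1024 * 1024
headers = [range_header(a, b) for a, b in byte_ranges(total, start=already_have)]
```

Each header can be retried independently, so an interrupted request costs at most one chunk rather than the whole download.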
An update for the current situation: S3 natively supports multipart GET as well as PUT. See https://youtu.be/uXHw0Xae2ww?t=1459
NOTE: For Ruby users only
Try the aws-sdk gem for Ruby and use download_file:
object = Aws::S3::Object.new(...)
object.download_file('path/to/file.rb')
It downloads large files in multiple parts by default.
Files larger than 5MB are downloaded using the multipart method.
http://docs.aws.amazon.com/sdkforruby/api/Aws/S3/Object.html#download_file-instance_method
I'm trying to implement a file storage service with a basic S3-compatible API using akka-http.
I use the S3 Java SDK to test my service API and ran into a problem with the putObject(...) method: I can't consume the file properly on my akka-http backend. I wrote a simple route for test purposes:
def putFile(bucket: String, file: String) = put {
  extractRequestEntity { ent =>
    val finishedWriting = ent.dataBytes.runWith(FileIO.toPath(new File(s"/tmp/${file}").toPath))
    onComplete(finishedWriting) { ioResult =>
      complete("Finished writing data: " + ioResult)
    }
  }
}
It saves the file, but the file is always corrupted. Looking inside the file, I found lines like these:
"20000;chunk-signature=73c6b865ab5899b5b7596b8c11113a8df439489da42ddb5b8d0c861a0472f8a1".
When I PUT the file with any other REST client, it works as expected.
I know S3 uses the "Expect: 100-continue" header, and maybe that is what causes the problem.
I really can't figure out how to deal with that. Any help appreciated.
This isn't exactly corrupted. Your service is not accounting for one of the four¹ ways S3 supports uploads to be sent on the wire, using Content-Encoding: aws-chunked and x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
It's a non-standards-based mechanism for streaming an object, and includes chunks that look exactly like this:
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
...where IntHexBase() is pseudocode for a function that formats an integer as a hexadecimal number as a string.
This chunk-based algorithm is similar to, but not compatible with, Transfer-Encoding: chunked, because it embeds checksums in the stream.
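To make the framing concrete, here is a minimal Python sketch that strips it from a fully buffered request body and returns the raw payload. It follows only the chunk layout described above and assumes the body ends with a zero-length chunk and carries no trailing headers. It does not verify the chunk signatures, which a real S3-compatible service would need to do. The function name decode_aws_chunked is hypothetical.

```python
def decode_aws_chunked(body: bytes) -> bytes:
    """Remove aws-chunked framing from a buffered body (signatures NOT verified)."""
    payload = bytearray()
    pos = 0
    while True:
        # Chunk header: hex(chunk-size) + ";chunk-signature=" + signature + CRLF
        header_end = body.index(b"\r\n", pos)
        header = body[pos:header_end].decode("ascii")
        size_hex, _, _signature = header.partition(";chunk-signature=")
        size = int(size_hex, 16)

        # Chunk data, followed by CRLF
        data_start = header_end + 2
        data_end = data_start + size
        if body[data_end:data_end + 2] != b"\r\n":
            raise ValueError("chunk data is not terminated by CRLF")
        payload += body[data_start:data_end]
        pos = data_end + 2

        # Assumption: a zero-length chunk marks the end of the stream.
        if size == 0:
            return bytes(payload)


# Made-up example with a placeholder signature (not a valid one).
fake_sig = b"0" * 64
sample = (b"5;chunk-signature=" + fake_sig + b"\r\nhello\r\n"
          + b"0;chunk-signature=" + fake_sig + b"\r\n\r\n")
decoded = decode_aws_chunked(sample)
```

In a streaming server the same parsing would have to be done incrementally on the incoming byte stream rather than on a buffered body.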
Why did they make up a new HTTP transfer encoding? It's potentially useful on the client side because it eliminates the need to either "read your payload twice or buffer [the entire object payload] in memory [concurrently]" -- one or the other of which is otherwise necessary if you are going to calculate the x-amz-content-sha256 hash before the upload begins, as you otherwise must, since it's required for integrity checking.
I am not overly familiar with the internals of the Java SDK, but this type of upload might be triggered by using .withInputStream(), or it might be standard behavior for files too, or for files over a certain size.
Your minimum workaround would be to throw an HTTP error if you see x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD in the request headers since you appear not to have implemented this in your API, but this would most likely only serve to prevent storing objects uploaded by this method. The fact that this isn't already what happens automatically suggests that you haven't implemented x-amz-content-sha256 handling at all, so you are not doing the server-side payload integrity checks that you need to be doing.
For full compatibility, you'll need to implement the algorithm supported by S3 and assumed to be available by the SDKs, unless the SDKs specifically support a mechanism for disabling this algorithm -- which seems unlikely, since it serves a useful purpose, particularly (it appears) for streams whose length is known but that aren't seekable.
¹ one of four -- the other three are a standard PUT, a web-based html form POST, and the multipart API that is recommended for large files and mandatory for files larger than 5 GB.
I understand that in order to upload a file to Amazon S3 using Multipart, the instructions are here:
http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html
How do I go about replacing the bytes (say, between the range 4-1523) of an uploaded file? Do I need to make use of Multipart Upload to achieve this? or do I fire a REST call with the range specified in the HTTP header?
Appreciate any advice.
Objects in S3 are immutable.
If it's a small object, you'll need to upload the entire object again.
If it's an object over 5MB in size, then there is a workaround that allows you to "patch" a file, using a modified approach to the multipart upload API.
Background:
As you know, a multipart upload allows you to upload a file in "parts," with minimum part size 5MB and maximum part count 10,000.
However a multipart "upload" doesn't mean you have to "upload" all the data again, if some or all of it already exists in S3, and you can address it.
PUT part/copy allows you to "upload" the individual parts by specifying octet ranges in an existing object. Or more than one object.
Since uploads are atomic, the "existing object" can be the object you're in the process of overwriting, since it remains unharmed and in place until you actually complete the multipart upload.
But there appears to be nothing stopping you from using the copy capability to provide the data for the parts you want to leave unchanged, avoiding the actual upload, and then using a normal PUT part request to upload the parts whose content you want to change.
So, while not a byte-range patch with granularity to the level of 1 octet, this could be useful for emulating an in-place modification of a large file. Examples of valid "parts" would be replacing a minimum 5MB chunk, on a 5MB boundary, for files smaller than 50GB, or replacing a minimum 500MB chunk on a 500MB boundary for objects up to 5TB, with minimum part sizes varying between those two extremes, because of the requirement that a multipart upload have no more than 10,000 parts. The catch is that a part must start at an appropriate offset, and you need to replace the whole part.
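The part arithmetic can be sketched in Python as follows. This is only a planning helper under the limits stated above (5MB minimum part size, at most 10,000 parts); min_part_size and plan_patch are hypothetical names, and the sketch makes no S3 calls. It marks each part as either "copy" (served from the existing object) or "upload" (re-sent because it overlaps the modified byte range).

```python
MIB = 1024 * 1024
MIN_PART = 5 * MIB      # minimum part size (the last part may be smaller)
MAX_PARTS = 10_000      # maximum number of parts per multipart upload


def min_part_size(object_size: int) -> int:
    """Smallest part size that keeps the object within MAX_PARTS parts."""
    needed = -(-object_size // MAX_PARTS)  # ceiling division
    return max(MIN_PART, needed)


def plan_patch(object_size: int, part_size: int, first: int, last: int):
    """Return (part_number, action, first_byte, last_byte) for every part.

    first/last are the inclusive offsets of the bytes being modified.
    """
    plan = []
    part_number, offset = 1, 0
    while offset < object_size:
        end = min(offset + part_size, object_size) - 1
        overlaps = first <= end and last >= offset
        plan.append((part_number, "upload" if overlaps else "copy", offset, end))
        part_number += 1
        offset = end + 1
    return plan


# Changing bytes 4-1523 of a 12 MiB object forces a re-upload of part 1 only.
plan = plan_patch(12 * MIB, min_part_size(12 * MIB), 4, 1523)
```

Even a 1,520-byte change costs a full part upload here, which is the catch described above.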
Michael's answer is pretty explanatory on the background of the issue. Just adding the actual steps to be performed to achieve this, in case you're wondering.
1. List object parts using ListParts
2. Identify the part that has been modified
3. Start a multipart upload
4. Copy the unchanged parts using UploadPartCopy
5. Upload the modified part
6. Finish the upload to save the modification
Skip 2 if you already know which part has to be changed.
Tip: Each part has an ETag, which is the MD5 hash of that part. It can be used to verify whether a particular part has changed.
The multipart upload overview documentation has, in the Multipart Upload Listings section, the following warning:
Note
Only use the returned listing for verification. You should not use the result of this listing when sending a complete multipart upload request. Instead, maintain your own list of the part numbers you specified when uploading parts and the corresponding ETag values that Amazon S3 returns.
Why?
Why I ask: Let's say I want to support resuming an upload that is interrupted. Doing so means knowing what remains to be uploaded, and therefore what already was uploaded. Knowing this is simpler if I may disregard the above warning. S3 is persisting the list of already-uploaded parts. I can obtain it from List Parts.
Whereas if I heed that warning, instead I'd need to intercept break or kill signals and persist the uploaded parts list locally. Although that's feasible, it seems silly to do this if S3 already has the list.
Furthermore, the warning says to use List Parts "only for verification". OK. Let's say I persist my own list, and compare it to List Parts. If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again. Therefore if List Parts is the ultimate authority, why not simply use it in the first place, and use it alone?
If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again.
You're missing the point of the warning.
It's not so much about whether parts were received. It's about whether they were received intact.
When you complete a multipart upload, you have to send a list of the parts and their etags. The etags are the hex md5sum of each part.
The lazy and careless way to complete a multipart upload would be to blindly submit the etags of the parts by just reading them from the "list" operation.
That is what they are warning against.
The correct way is to use your locally-created list, based on what you think S3 should have received, what you think the etag of each part should have been, based on the local file.
If you are resuming an upload that was interrupted, you should go back and compare the parts already uploaded (by re-reading and re-checksumming the parts of the local file) against the checksums S3 has calculated against the parts already stored (as returned by the list operation)... then either resend any incorrect parts or missing parts, or abandon the upload because the local file may have changed if one or more parts doesn't match your local calculation.
Additionally, in the interest of data integrity, you should be sending the md5 of each part with the individual part uploads, base64-encoded, with a Content-MD5 header, since this will cause S3 to refuse to accept a part that has been corrupted in any way during the upload.
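A minimal Python sketch of the local side of that check follows. It assumes, as described above, that a part's ETag is the hex MD5 of the part's bytes (this may not hold for every encryption setting) and that the listing has already been fetched into a plain dictionary. part_checksums and parts_to_resend are hypothetical helper names, not SDK calls.

```python
import base64
import hashlib
from typing import Dict, List, Tuple


def part_checksums(part: bytes) -> Tuple[str, str]:
    """Return (hex MD5 to compare with the ETag, base64 MD5 for Content-MD5)."""
    digest = hashlib.md5(part).digest()
    return digest.hex(), base64.b64encode(digest).decode("ascii")


def parts_to_resend(local_parts: Dict[int, bytes],
                    listed_etags: Dict[int, str]) -> List[int]:
    """Compare locally computed MD5s with the ETags reported by the listing.

    local_parts maps part number -> bytes re-read from the local file;
    listed_etags maps part number -> ETag string (possibly wrapped in quotes).
    """
    resend = []
    for number, data in sorted(local_parts.items()):
        expected_etag, _ = part_checksums(data)
        listed = listed_etags.get(number)
        if listed is None or listed.strip('"') != expected_etag:
            resend.append(number)  # missing on S3, or stored bytes differ
    return resend


etag, content_md5 = part_checksums(b"")
missing = parts_to_resend({1: b"", 2: b"hello"},
                          {1: '"d41d8cd98f00b204e9800998ecf8427e"'})
```

The locally computed ETags, not the listed ones, are what would go into the complete-multipart-upload request.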
I'm currently implementing an update feature in an app that I'm building. It uses NSURLConnection and the NSURLConnectionDelegate to download the files and save them to the user's device.
At the moment, one update item downloads multiple files, but I want to display the download of all these files using a single UIProgressView. So my problem is: how do I get the expected content length of all the files I'm about to download? I know I can get the expectedContentLength of the NSURLResponse object that gets passed into the didReceiveResponse method, but that's just for the file currently being downloaded.
Any help is much appreciated. Thanks.
How about having some kind of information file on your server that gives you the total bytes? You could load that first and then load your files. Then you can subtract the loaded amount for each file from the total amount.
Another method would be to connect to all files at first, and cancel the connection after you received responses. Add the expected bytes of all files and then use that as a basis for showing the total progress while loading files.
Downside of #1: you have to manually keep track of the bytes.
Downside of #2: you'll make twice as many requests, even though the first ones get cancelled after the response.
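Either way, the bookkeeping is the same: sum the expected lengths, sum the bytes received, and feed the ratio to the progress view. Here is a language-neutral sketch in Python (not iOS code); TotalProgress and its methods are hypothetical names chosen for illustration.

```python
class TotalProgress:
    """Track combined download progress across several files."""

    def __init__(self) -> None:
        self.expected = {}   # file name -> expected content length in bytes
        self.received = {}   # file name -> bytes received so far

    def set_expected(self, name: str, length: int) -> None:
        # Called once per file, e.g. from the manifest or the first response.
        self.expected[name] = length
        self.received.setdefault(name, 0)

    def add_received(self, name: str, count: int) -> None:
        # Called whenever a chunk of data arrives for a file.
        self.received[name] = self.received.get(name, 0) + count

    def fraction(self) -> float:
        # Value between 0.0 and 1.0 suitable for a progress bar.
        total = sum(self.expected.values())
        if total <= 0:
            return 0.0
        return min(1.0, sum(self.received.values()) / total)


progress = TotalProgress()
progress.set_expected("a.bin", 100)
progress.set_expected("b.bin", 300)
progress.add_received("a.bin", 50)
progress.add_received("b.bin", 150)
current = progress.fraction()
```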
Use ASIHTTPRequest, an open-source framework widely used for this purpose.
You just need to set the progress view as the delegate, and it will keep updating your progress view.
Try this
http://allseeing-i.com/ASIHTTPRequest/
I've been creating Presigned HTTP PUT URLs and everything was working great until I wanted to start using "folders" in S3; I wanted the key to have the character '/'.
Now I get Signature doesn't match when I send the HTTP PUT requests due to the fact the '/' probably changes to %2F... If I escape the character before creating the presigned URL it works great, but then the Amazon console management doesn't understand it and shows it as one file instead of subfolders.
Any idea?
P.s.
The HTTP PUT requests are sent using C++ with POCO NET library.
EDIT
I'm using Poco HttpRequest from C++ to my Java web server to generate a signed url (returned on the response).
C++ then uses this url to put a file in s3 using Poco again.
The problem was that the URLs returned from the web server were parsed through Poco URI objects that automatically decoded the S3 object key, thus changing it. With that in mind, I was able to fix my problem.
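The effect is easy to reproduce outside Poco. The Python sketch below uses only the standard library and a made-up key. It shows that a path containing %2F does not survive a decode step, so the request path no longer matches the string that was signed. It illustrates the mismatch only and does not implement any signing algorithm.

```python
from urllib.parse import quote, unquote

key = "folder/file.txt"                      # hypothetical object key

escaped_key = quote(key, safe="")            # "/" becomes "%2F"
path_style_key = quote(key, safe="/")        # "/" is left alone


def survives_decoding(signed_path: str) -> bool:
    """True if percent-decoding leaves the signed path unchanged."""
    return unquote(signed_path) == signed_path


# A URL signed over the escaped key changes when a URI class auto-decodes it,
# while the path-style key passes through unchanged.
escaped_ok = survives_decoding(escaped_key)
path_style_ok = survives_decoding(path_style_key)
```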
Tricky - I'll try to approach this bottom up.
Disclaimer: I got carried away visually inspecting the Poco libraries instead of actually debugging a code sample, which should yield more reliable results much faster, see below ;)
Analysis
If I escape the character before creating the presigned URL it works
great, but then the Amazon console management doesn't understand it
and shows it as one file instead of subfolders.
The latter stems from S3 not having a concept of folders on the storage level actually, see e.g. section Index Documents and Folders within Index Document Support:
Objects stored in Amazon S3 are stored within a flat container, i.e.,
an Amazon S3 bucket, and it does not provide any hierarchical
organization, similar to a file system's. However, you can create a
logical hierarchy using object key names and use these names to infer
logical folders that contain these objects.
That's exactly what the AWS Management Console is doing here as well:
The AWS Management Console also supports the concept of folders, by
using the same key naming convention used in the preceding sample.
However, your test of the assumption that / is encoded as %2F proves that this is indeed how Poco::Net is encoding the URL when performing the HTTP PUT request.
(I'm actually a bit surprised that the AWS Java SDK seems to generate different URLs here for / vs. %2F, insofar as a recent analysis regarding Why is my S3 pre-signed request invalid when I set a response header override that contains a “+”? seems to indicate respective canonicalization by the AWS .NET SDK, see below for more on this.)
Potential Solution
In order for your scenario to work as desired, you'll need to figure out where the URL is encoded this way - I could think of two components in principle:
Poco::Net
Finding out why Poco::Net is encoding the URL differently from S3 (if at all, see below) is best done by debugging your code; here's where I'd start:
Class HTTPRequest uses class URI in turn, which automatically performs a few normalizations on all URIs and URI parts passed to it, in particular percent-encoded characters are decoded. The other way round is handled by method encode(), which is where things get interesting and call for a breakpoint, see URI.cpp:
lines 575 ff. - here encode() does its magic, which indeed seems to be in order, insofar as neither the code within the function nor the various chars passed in via the reserved parameter contain the offending / (see lines 47 ff. for the respective constants in use)
consequently you might want to set a breakpoint in this function and backtrace the callstack to find out which code is actually doing the encoding upfront, which might not yield an offender at all, see below.
Java => C++ transition
You haven't specified yet which channel is actually used to communicate the pre-signed URL generated by the AWS Java SDK to C++. Given that the code review (mind you, visual inspection only; I haven't debugged this myself yet) of the Poco::Net functionality identified no obvious offender in the library itself, it seems more likely that the URL already enters your C++ layer encoded (easily verified via debugging, of course). Are you by chance using any kind of web service between these components, for example?
Good luck!