md5 checksums when uploading to file picker - amazon-s3

Background
I'm working on integrating an existing app with File Picker. In our existing setup we are relying on md5 checksums to ensure data integrity. As far as I can see File Picker does not provide any md5 when they respond to an upload against the REST API (nor using JavaScript client).
S3 storage, md5 and data integrity
We are using S3 for storage, and as far as I know you may provide S3 with an md5 checksum when storing files so that Amazon may verify and reject storing request if data seems to be wrong.
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.
I have investigated the etag header which Amazon returns a bit, and found that it isn't clear what actually is returned as etag. The Java documentation states:
Gets the hex encoded 128-bit MD5 hash of this object's contents as computed by Amazon S3.
The Ruby documentation states:
Generally the ETAG is the MD5 of the object. If the object was uploaded using multipart upload then this is the MD5 all of the upload-part-md5s
Another place in their documentation I found this:
The entity tag is a hash of the object. The ETag only reflects changes to the contents of an object, not its metadata. The ETag is determined when an object is created. For objects created by the PUT Object operation and the POST Object operation, the ETag is a quoted, 32-digit hexadecimal string representing the MD5 digest of the object data. For other objects, the ETag may or may not be an MD5 digest of the object data. If the ETag is not an MD5 digest of the object data, it will contain one or more non-hexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.
This seems to describe how etag is actually calculated on S3, and this stack overflow post seems to imply the same thing: Etag cannot be trusted to always be equal to the file MD5.
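A minimal sketch of how a client could apply the rule quoted above: treat the ETag as an MD5 only when it has the shape of one (32 hexadecimal digits), and otherwise decline to draw a conclusion. The function names are hypothetical and not part of any S3 or File Picker API.

```python
import hashlib
import re
from typing import Optional

_PLAIN_MD5 = re.compile(r"[0-9a-f]{32}")


def etag_is_plain_md5(etag: str) -> bool:
    """True if the ETag has the shape of a single MD5 digest (32 hex digits)."""
    return _PLAIN_MD5.fullmatch(etag.strip('"').lower()) is not None


def etag_matches_content(etag: str, content: bytes) -> Optional[bool]:
    """Compare an ETag with the MD5 of local content.

    Returns None when the ETag is not MD5-shaped (for example a multipart
    ETag with a "-N" suffix), because no conclusion can be drawn from it.
    """
    if not etag_is_plain_md5(etag):
        return None
    return etag.strip('"').lower() == hashlib.md5(content).hexdigest()
```

A result of None is the case the quoted documentation warns about: the object was probably not created by a plain PUT or POST, so the ETag cannot be used as a content checksum.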
So - here are my questions
1. In general, how does File Picker store files to S3? Are multipart POST requests used?
2. I see that when I do a HEAD request against, for example, https://www.filepicker.io/api/file/<file handle>, I do get an etag header back. The etag I get back does indeed match the md5 of the file I have uploaded. Are the headers returned more or less taken from S3 directly? Or is this actually an md5 calculated by File Picker which I can trust?
3. Is it possible to have an explicit statement of the md5 returned to clients of File Picker's API? For instance, when we POST a file we get a JSON structure back including the URL to the file and its size. Could the md5 be included here?
4. Is it possible to provide File Picker with an md5 which in turn will be used when posting files to S3, so we can get an end-to-end check on files?

1. Yes, we use the python boto library to be specific.
2. The ETag is pulled from S3.
3 & 4. It's been considered and is in our backlog, but hasn't been implemented yet.

Implementing basic S3 compatible API with akka-http

I'm trying to implement a file storage service with a basic S3-compatible API using akka-http.
I use the S3 Java SDK to test my service API and ran into a problem with the putObject(...) method: I can't consume the file properly on my akka-http backend. I wrote a simple route for test purposes:
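import java.io.File
import akka.http.scaladsl.server.Directives._
import akka.stream.scaladsl.FileIO

// Streams the raw request entity to /tmp/<file>; runWith needs an implicit Materializer in scope.
def putFile(bucket: String, file: String) = put {
  extractRequestEntity { ent =>
    val finishedWriting = ent.dataBytes.runWith(FileIO.toPath(new File(s"/tmp/${file}").toPath))
    onComplete(finishedWriting) { ioResult =>
      complete("Finished writing data: " + ioResult)
    }
  }
}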
It saves the file, but the file is always corrupted. Looking inside the file I found lines like this:
"20000;chunk-signature=73c6b865ab5899b5b7596b8c11113a8df439489da42ddb5b8d0c861a0472f8a1".
When I PUT the file with any other REST client it works as expected.
I know S3 uses the "Expect: 100-continue" header, and maybe that is what causes the problem.
I really can't figure out how to deal with that. Any help appreciated.
This isn't exactly corrupted. Your service is not accounting for one of the four¹ ways S3 supports uploads to be sent on the wire, using Content-Encoding: aws-chunked and x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
It's a non-standards-based mechanism for streaming an object, and includes chunks that look exactly like this:
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
...where IntHexBase() is pseudocode for a function that formats an integer as a hexadecimal number as a string.
This chunk-based algorithm is similar to, but not compatible with, Transfer-Encoding: chunked, because it embeds checksums in the stream.
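To make the framing concrete, here is a rough sketch of stripping it from a fully buffered body. It only removes the chunk headers; it does not verify the chunk signatures, which a real S3-compatible service would have to do, and a real server would decode the stream incrementally instead of buffering it. Both function names are hypothetical.

```python
def encode_chunk(data: bytes, signature: str = "0" * 64) -> bytes:
    """Frame one chunk as hex(size);chunk-signature=<sig>\r\n<data>\r\n.

    The all-zero signature is a placeholder for illustration only.
    """
    header = f"{len(data):x};chunk-signature={signature}\r\n".encode("ascii")
    return header + data + b"\r\n"


def decode_aws_chunked(body: bytes) -> bytes:
    """Strip aws-chunked framing and return the raw payload.

    Chunk signatures are parsed but NOT verified in this sketch.
    """
    payload = bytearray()
    pos = 0
    while pos < len(body):
        header_end = body.index(b"\r\n", pos)
        header = body[pos:header_end].decode("ascii")
        size_hex, _, _signature = header.partition(";chunk-signature=")
        size = int(size_hex, 16)
        data_start = header_end + 2
        data_end = data_start + size
        if body[data_end:data_end + 2] != b"\r\n":
            raise ValueError("chunk data is not followed by CRLF")
        payload += body[data_start:data_end]
        pos = data_end + 2
        if size == 0:  # a zero-length chunk terminates the stream
            break
    return bytes(payload)
```

The "20000" seen in the saved file is such a chunk header: it is hexadecimal, so the chunk that follows it is 0x20000 = 131072 bytes long.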
Why did they make up a new HTTP transfer encoding? It's potentially useful on the client side because it eliminates the need to either "read your payload twice or buffer [the entire object payload] in memory [concurrently]" -- one or the other of which is otherwise necessary if you are going to calculate the x-amz-content-sha256 hash before the upload begins, as you otherwise must, since it's required for integrity checking.
I am not overly familiar with the internals of the Java SDK, but this type of upload might be triggered by using .withInputStream() or it might be standard behavior for files too, or for files over a certain size.
Your minimum workaround would be to throw an HTTP error if you see x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD in the request headers since you appear not to have implemented this in your API, but this would most likely only serve to prevent storing objects uploaded by this method. The fact that this isn't already what happens automatically suggests that you haven't implemented x-amz-content-sha256 handling at all, so you are not doing the server-side payload integrity checks that you need to be doing.
For full compatibility, you'll need to implement the algorithm supported by S3 and assumed to be available by the SDKs, unless the SDKs specifically support a mechanism for disabling this algorithm -- which seems unlikely, since it serves a useful purpose, particularly (it appears) for streams whose length is known but that aren't seekable.
¹ one of four -- the other three are a standard PUT, a web-based html form POST, and the multipart API that is recommended for large files and mandatory for files larger than 5 GB.

is there a size limit to individual fields in HTTP POST?

I have an API for a file upload that expects a multipart form submission. But I have a customer writing a client and his system can't properly generate a multipart/form-data request. He's asking that I modify my API to accept the file in a application/x-www-form-urlencoded request, with the filename in one key/value pair and the contents of the file, base64 encoded, in another key/value pair.
In principle I can easily do this (tho I need a shower afterwards), but I'm worried about size limits. The files we expect in Production will be fairly large: 5-10MB, sometimes up to 20MB. I can't find anything that tells me about length limitations on individual key/value pair data inside a form POST, either in specs (I've looked at, among others, the HTTP spec and the Forms spec) or in a specific implementation (my API runs on a Java application server, Jetty, with an Apache HTTP server in front of it).
What is the technical and practical limit for an individual value in a key/value pair in a form POST?
There are artificial limits, configurations, present on the HttpConfiguration class. Both for maximum number of keys, and maximum size of the request body content.
In practical terms, this is a really bad idea.
You'll have a String, which uses 2-bytes per character for the Base64 data.
And you have the typical 33% overhead just being Base64.
They'll also have to utf8 urlencode the Base64 string for various special characters (such as "+" which has meaning in Base64, but is space " " in urlencoded form. So they'll need to encode that "+" to "%2B").
So for a 20MB file you'll have ...
20,971,520 bytes of raw data, represented as 27,962,028 characters in raw Base64, using (on average, for random data where "+", "/" and "=" each become a three-character escape) about 29.7 million characters when urlencoded, which will use about 59.4 million bytes of memory in its String form.
The decoding process on Jetty will take the incoming raw urlencoded bytes and allocate 2x that size in a String before decoding the urlencoded form. So that's a java.lang.String of about 59.4 million characters (that uses about 118.8 million bytes of heap memory for the String, and don't forget the roughly 29.7MB of bytebuffer data being held too!) just to decode that Base64 binary file as a value in a x-www-form-urlencoded String form.
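The size arithmetic can be checked with a short calculation. This is a sketch under two assumptions that are mine: the file content behaves like random bytes, and the client percent-encodes "+", "/" and "=" as three characters each. The exact padded Base64 length is 4 * ceil(n / 3), which comes out slightly higher than a flat 1.33x estimate.

```python
import base64
from urllib.parse import quote


def base64_length(n_bytes: int) -> int:
    """Exact length of padded Base64 for n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


def urlencoded_length(data: bytes) -> int:
    """Actual length of the percent-encoded Base64 form of data."""
    text = base64.b64encode(data).decode("ascii")
    return len(quote(text, safe=""))


def expected_urlencoded_length(n_bytes: int) -> float:
    """Expected percent-encoded length for uniformly random input.

    Each Base64 symbol is '+' or '/' with probability 2/64 and then grows
    from 1 to 3 characters; each '=' padding character also grows to 3.
    """
    padding = (3 - n_bytes % 3) % 3
    symbols = base64_length(n_bytes) - padding
    return symbols * (62 + 2 * 3) / 64 + padding * 3


raw = 20 * 1024 * 1024
print(base64_length(raw))               # 27962028
print(expected_urlencoded_length(raw))  # roughly 29.7 million
```

Either way the conclusion stands: the form-encoded representation is far larger than the file, before any String copies are made on the server.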
I would push back and force them to use multipart/form-data properly. There are tons of good libraries to generate that form-data properly.
If they are using Java, tell them to use the httpmime library from the Apache HttpComponents project (they don't have to have/use/install Apache Http Client to use httpmime, it's a standalone library).
Alternative Approach
There's nothing saying you have to use application/x-www-form-urlencoded or multipart/form-data.
Offer a raw upload option via application/octet-stream
They use POST, and MUST include the following valid request headers ...
Connection: close
Content-Type: application/octet-stream
Content-Length: <whatever_size_the_content_is>
Connection: close to indicate when the http protocol is complete.
Content-Type: application/octet-stream means Jetty will not process that content as request parameters and will not apply charset translations to it.
Content-Length is required to ensure that the entire file is sent/received.
Then just stream the raw binary bytes to you.
This is just for the file contents, if you have other information that needs to be passed in (such as filename) consider using either the query parameters for that, or a custom request header (eg: X-Filename: secretsauce.doc)
On your servlet, you just use HttpServletRequest.getInputStream() to obtain those bytes, and you use the Content-Length variable to verify that you received the entire file.
Optionally, you can make them provide a SHA1 hash in the request headers, like X-Sha1Sum: bed0213d7b167aa9c1734a236f798659395e4e19 which you then use on your side to verify that the entire file was sent/received properly.
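The receive-and-verify logic is independent of the servlet API, so here is a language-neutral sketch of it in Python. The function name and the sink argument are hypothetical; in a servlet, the stream would be HttpServletRequest.getInputStream() and the expected values would come from the Content-Length and X-Sha1Sum headers.

```python
import hashlib
import io


def receive_upload(stream, content_length, expected_sha1_hex, sink,
                   chunk_size=64 * 1024):
    """Copy exactly content_length bytes from stream to sink and verify SHA-1.

    Raises ValueError if the body is short or the digest does not match.
    """
    digest = hashlib.sha1()
    remaining = content_length
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # the client stopped sending early
            break
        digest.update(chunk)
        sink.write(chunk)
        remaining -= len(chunk)
    if remaining != 0:
        raise ValueError("body is shorter than Content-Length")
    if digest.hexdigest() != expected_sha1_hex.lower():
        raise ValueError("SHA-1 mismatch")
    return content_length


# Usage with in-memory streams standing in for the request and the file.
body = b"secret sauce"
out = io.BytesIO()
receive_upload(io.BytesIO(body), len(body), hashlib.sha1(body).hexdigest(), out)
```

The file is hashed while it is streamed, so nothing larger than one chunk is ever held in memory.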

Replacing bytes of an uploaded file in Amazon S3

I understand that in order to upload a file to Amazon S3 using Multipart, the instructions are here:
http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html
How do I go about replacing the bytes (say, between the range 4-1523) of an uploaded file? Do I need to make use of Multipart Upload to achieve this? or do I fire a REST call with the range specified in the HTTP header?
Appreciate any advice.
Objects in S3 are immutable.
If it's a small object, you'll need to upload the entire object again.
If it's an object over 5MB in size, then there is a workaround that allows you to "patch" a file, using a modified approach to the multipart upload API.
Background:
As you know, a multipart upload allows you to upload a file in "parts," with minimum part size 5MB and maximum part count 10,000.
However a multipart "upload" doesn't mean you have to "upload" all the data again, if some or all of it already exists in S3, and you can address it.
PUT part/copy allows you to "upload" the individual parts by specifying octet ranges in an existing object. Or more than one object.
Since uploads are atomic, the "existing object" can be the object you're in the process of overwriting, since it remains unharmed and in place until you actually complete the multipart upload.
But there appears to be nothing stopping you from using the copy capability to provide the data for the parts you want to leave the same, avoiding the actual upload then using a normal PUT part request to upload the parts that you want to have different content.
So, while not a byte-range patch with granularity to the level of 1 octet, this could be useful for emulating an in-place modification of a large file. Examples of valid "parts" would be replacing a minimum 5 MB chunk, on a 5MB boundary, for files smaller than 50GB, or replacing a minimum 500MB chunk on a 500MB boundary for objects up to 5TB, with minimum part sizes varying between those two extremes, because of the requirement that a multipart upload have no more than 10,000 parts. The catch is that a part must start at an appropriate offset, and you need to replace the whole part.
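As a rough illustration of that arithmetic, the sketch below computes the smallest uniform part size for an object and the part numbers a byte range falls into. It assumes "5MB" means 5 MiB (5 * 1024 * 1024 bytes) and a uniform part layout; both are my assumptions, and the function names are hypothetical.

```python
MIN_PART = 5 * 1024 * 1024   # assumed minimum part size (5 MiB)
MAX_PARTS = 10_000           # maximum number of parts in one upload


def min_part_size(object_size: int) -> int:
    """Smallest uniform part size that keeps the part count <= 10,000."""
    needed = -(-object_size // MAX_PARTS)  # ceiling division
    return max(MIN_PART, needed)


def parts_touching(first_byte: int, last_byte: int, part_size: int) -> list:
    """1-based part numbers overlapping the inclusive byte range."""
    first_part = first_byte // part_size + 1
    last_part = last_byte // part_size + 1
    return list(range(first_part, last_part + 1))


# Replacing bytes 4-1523 of a 100 MiB object only touches part 1,
# but the whole 5 MiB part has to be uploaded again.
size = min_part_size(100 * 1024 * 1024)
print(parts_touching(4, 1523, size))  # [1]
```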
Michael's answer is pretty explanatory on the background of the issue. Just adding the actual steps to be performed to achieve this, in case you're wondering.
1. List object parts using ListParts
2. Identify the part that has been modified
3. Start a multipart upload
4. Copy the unchanged parts using UploadPartCopy
5. Upload the modified part
6. Finish the upload to save the modification
Skip step 2 if you already know which part has to be changed.
Tip: Each part has an ETag, which is the MD5 hash of that part. This can be used to verify whether a particular part has changed.
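The steps above can be sketched as a planning function that decides, for every part, whether it is copied from the existing object or uploaded fresh. It makes no S3 calls; the actual requests (starting the upload, UploadPartCopy, the part upload, and completing the upload) are left out, and plan_patch is a hypothetical name. The "bytes=first-last" strings are in the form UploadPartCopy expects for its source range.

```python
def plan_patch(object_size, part_size, first_byte, last_byte):
    """Return (part_number, action, byte_range) for every part.

    action is "upload" for parts overlapping the modified inclusive range
    first_byte..last_byte, and "copy" for parts that can be taken unchanged
    from the existing object.
    """
    plan = []
    part_number = 1
    for start in range(0, object_size, part_size):
        end = min(start + part_size, object_size) - 1
        overlaps = start <= last_byte and end >= first_byte
        action = "upload" if overlaps else "copy"
        plan.append((part_number, action, f"bytes={start}-{end}"))
        part_number += 1
    return plan


# A 12 MiB object with 5 MiB parts, patching bytes 4-1523:
for step in plan_patch(12 * 1024 * 1024, 5 * 1024 * 1024, 4, 1523):
    print(step)
# (1, 'upload', 'bytes=0-5242879')
# (2, 'copy', 'bytes=5242880-10485759')
# (3, 'copy', 'bytes=10485760-12582911')
```

Only the last part may be smaller than the minimum part size, which is why the layout has to start from offset 0 and the modified part has to be re-sent in full.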

Why isn't List Parts to be used with Complete Multipart Upload?

The multipart upload overview documentation has, in the Multipart Upload Listings section, the following warning:
Note
Only use the returned listing for verification. You should not use the result of this listing when sending a complete multipart upload request. Instead, maintain your own list of the part numbers you specified when uploading parts and the corresponding ETag values that Amazon S3 returns.
Why?
Why I ask: Let's say I want to support resuming an upload that is interrupted. Doing so means knowing what remains to be uploaded, and therefore what already was uploaded. Knowing this is simpler if I may disregard the above warning. S3 is persisting the list of already-uploaded parts. I can obtain it from List Parts.
Whereas if I heed that warning, instead I'd need to intercept break or kill signals and persist the uploaded parts list locally. Although that's feasible, it seems silly to do this if S3 already has the list.
Furthermore, the warning says to use List Parts "only for verification". OK. Let's say I persist my own list, and compare it to List Parts. If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again. Therefore if List Parts is the ultimate authority, why not simply use it in the first place, and use it alone?
If they do not match, what am I going to do? I'm going to believe List Parts -- if S3 doesn't think it has a part, of course I'm going to upload it again.
You're missing the point of the warning.
It's not so much about whether parts were received. It's about whether they were received intact.
When you complete a multipart upload, you have to send a list of the parts and their etags. The etags are the hex md5sum of each part.
The lazy and careless way to complete a multipart upload would be to blindly submit the etags of the parts by just reading them from the "list" operation.
That is what they are warning against.
The correct way is to use your locally-created list, based on what you think S3 should have received, what you think the etag of each part should have been, based on the local file.
If you are resuming an upload that was interrupted, you should go back and compare the parts already uploaded (by re-reading and re-checksumming the parts of the local file) against the checksums S3 has calculated against the parts already stored (as returned by the list operation)... then either resend any incorrect parts or missing parts, or abandon the upload because the local file may have changed if one or more parts doesn't match your local calculation.
Additionally, in the interest of data integrity, you should be sending the md5 of each part with the individual part uploads, base64-encoded, with a Content-MD5 header, since this will cause S3 to refuse to accept a part that has been corrupted in any way during the upload.
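A minimal sketch of the resume check described above, assuming (as stated in this answer) that each part's ETag is the hex MD5 of that part. The function names and the dictionary shapes are hypothetical; listed_etags stands for whatever you extracted from the List Parts response.

```python
import base64
import hashlib


def part_checksums(data: bytes):
    """Return (hex MD5 to compare with the ETag, Base64 MD5 for Content-MD5)."""
    digest = hashlib.md5(data).digest()
    return digest.hex(), base64.b64encode(digest).decode("ascii")


def parts_to_resend(local_parts: dict, listed_etags: dict) -> list:
    """Part numbers that are missing on S3 or whose ETag differs locally.

    local_parts maps part number -> bytes re-read from the local file.
    listed_etags maps part number -> ETag string from List Parts.
    """
    resend = []
    for number, data in sorted(local_parts.items()):
        expected_hex, _ = part_checksums(data)
        listed = listed_etags.get(number)
        if listed is None or listed.strip('"').lower() != expected_hex:
            resend.append(number)
    return resend
```

The list you eventually send when completing the upload is then built from your own expected_hex values, not copied from the listing. A mismatch on an already uploaded part may also mean the local file changed, in which case abandoning the upload is the safer choice.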

AWS S3 SSE GetObject requires secret key

The idea was to generate a random key for every file being uploaded, pass this key to S3 in order to encrypt it and store the key in the database. Once the user wants to access the file, the key is read from the database and passed to S3 once again.
The first part works. My objects are uploaded and encrypted successfully, but I have issues with retrieving them.
Retrieving files with request headers set:
Setting request headers such as x-amz-server-side-encryption-customer-algorithm when performing the GET request works, and I am able to access the resource. But since I want to use these resources as the src of an <img> tag, I cannot perform GET requests which require headers to be set.
Thus, I thought about:
Pre signing urls:
To create a pre signed url, I built the HMAC SHA1 of the required string and used it as a signature. The calculated signature is accepted by S3 but I get the following error when requesting the pre signed URL:
Requests specifying Server Side Encryption with Customer provided keys must provide an appropriate secret key.
The URL has the form:
https://s3-eu-west-1.amazonaws.com/bucket-id/resource-id?x-amz-server-side-encryption-customer-algorithm=AES256&AWSAccessKeyId=MyAccessKey&Expires=1429939889&Signature=GeneratedSignature
The reason why the error is shown seems to be pretty clear to me. At no point in the signing process was the encryption key used. Thus, the request cannot work. As a result, I added the encryption key as Base64, and Md5 representation as parameters to the URL. The URL now has the following format:
https://s3-eu-west-1.amazonaws.com/bucket-id/resource-id?x-amz-server-side-encryption-customer-algorithm=AES256&AWSAccessKeyId=MyAccessKey&Expires=1429939889&Signature=GeneratedSignature&x-amz-server-side-encryption-customer-key=Base64_Key&x-amz-server-side-encryption-customer-key-MD5=Md5_Key
Although the key is now present (imho), I do get the same error message.
Question
Does anyone know, how I can access my encrypted files with a GET request which does not provide any headers such as x-amz-server-side-encryption-customer-algorithm?
It seems intuitive enough to me that what you are trying should have worked.
Apparently, though, when they say "headers"...
you must provide all the encryption headers in your client application.
— http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html#sse-c-how-to-programmatically-intro
... they do indeed actually mean headers and S3 doesn't accept these particular values when delivered as part of the query string, as you would expect, since S3 sometimes is somewhat flexible in that regard.
I've tested this, and that's the conclusion I've come to: doing this isn't supported.
A GET request with x-amz-server-side-encryption-customer-algorithm=AES256 included in the query string (and signature), along with the X-Amz-Server-Side-Encryption-Customer-Key and X-Amz-Server-Side-Encryption-Customer-Key-MD5 headers does work as expected... as I believe you've discovered... but putting the key and key-md5 in the query string, with or without including it in the signature seems like a dead end.
It seemed somewhat strange, at first, that they wouldn't allow this in the query string, since so many other things are allowed there... but then again, if you're going to the trouble of encrypting something, there seems little point in revealing the encryption key in a link... not to mention that the key would then be captured in the S3 access logs, leaving the encryption seeming fairly well pointless all around -- and perhaps that was their motivation for requiring it to actually be sent in the headers and not the query string.
Based on what I've found in testing, though, I don't see a way to use encrypted objects with customer-provided keys in hyperlinks, directly.
Indirectly, of course, a reverse proxy in front of the S3 bucket could do the translation for you, taking the appropriate values from the query string and placing them into the headers, instead... but it's really not clear to me what's to be gained by using customer-provided encryption keys for downloadable objects, compared to letting S3 handle the at-rest encryption with AWS-managed keys. At-rest encryption is all you're getting either way.
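For the header-based request that does work, this is a small sketch of how the three header values are derived from a raw 256-bit key, which is also what a reverse proxy would have to set on the upstream request. The function name is hypothetical, and signing the request is not shown.

```python
import base64
import hashlib


def sse_c_headers(key: bytes) -> dict:
    """Build the customer-provided-key headers for a raw 256-bit key.

    The key is sent Base64-encoded, together with the Base64-encoded
    MD5 digest of the raw key bytes.
    """
    if len(key) != 32:
        raise ValueError("AES256 requires a 32-byte key")
    key_md5 = hashlib.md5(key).digest()
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        "x-amz-server-side-encryption-customer-key":
            base64.b64encode(key).decode("ascii"),
        "x-amz-server-side-encryption-customer-key-MD5":
            base64.b64encode(key_md5).decode("ascii"),
    }
```

Note that the MD5 value is the Base64 form of the binary digest, not the hex string, which is an easy mistake to make when assembling these values by hand.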