I have run into an issue with using API Gateway as a proxy to S3 (for custom authentication), in that it does not handle binary data well (which is a known issue).
I'm usually uploading either .gz or .Z (Unix compress utility) files. As far as I understand it, the data is not maintained due to encoding issues. I can't seem to figure out a way to decode the data back to binary.
Original leading bytes: \x1f\x8b\x08\x08\xb99\xbeW\x00\x03
After passing through API GW: ��9�W�
... followed by the filename and the rest of the data.
One way of getting around this is to set the Content-Encoding header of the PUT request to API GW to 'gzip'. This seems to force API GW to decompress the file before forwarding it to S3.
The same does not work for .Z files compressed with the Unix compress utility, for which you would specify the Content-Encoding as 'compress'.
Does anyone have any insight about what is happening to the data, to help shed some light on my issue? Also, does anyone know any possible workarounds to maintain the encoding of my data while passing through API GW (or to decode it once it's in S3)?
Obviously I could just access the S3 API directly (or have API GW return a pre-signed URL for accessing the S3 API), but there are a few reasons why I don't want to do that.
I should mention that I don't understand very much at all about encoding - sorry if there are some obvious answers to some of my questions.
It's not exactly an "encoding issue" -- it's the fact that API Gateway just doesn't support binary data ("yet")... so it's going to potentially corrupt binary data, depending on the specifics of the data in question.
Uploading as Content-Encoding: gzip probably triggers decoding in a front-end component that is capable of dealing with binary data (gzip, after all, is a standard encoding and is binary) before passing the request body to the core infrastructure... but you will almost certainly find that this is a workaround that does not consistently deliver correct results, depending on the specific payload. The fact that it works at all seems more like a bug than a feature.
For now, the only consistently viable option is base64-encoding your payload, which increases its size on-the-wire by 33% (base64 encoding produces 4 bytes of output for every 3 bytes of input) so it's not much of a solution. Base64 + gzip with the appropriate Content-Encoding: gzip should also work, which seems quite a silly suggestion (converting a compressed file into base64 then gzipping the result to try to reduce its size on the wire) but should be consistent with what API Gateway can currently deliver.
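As a rough illustration of that workaround, here is a minimal sketch (assuming Java 11+ and its built-in HttpClient; the endpoint URL and file name are placeholders) that base64-encodes a compressed file before PUTting it through the gateway. The backend, or a mapping template in the API Gateway integration, would then have to base64-decode the body before writing it to S3.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// Hypothetical sketch: send only ASCII text across API Gateway by
// base64-encoding the binary payload first. Not API Gateway's own API,
// just a plain HTTP PUT against a placeholder endpoint.
public class Base64Upload {
    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Paths.get("archive.gz"));
        String encoded = Base64.getEncoder().encodeToString(raw);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.execute-api.us-east-1.amazonaws.com/prod/upload/archive.gz"))
                .header("Content-Type", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString(encoded))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}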
Related
We are participating in a process where we provide a vendor with AWS S3 SignedURLs that allow them to upload files to an S3 bucket.
The process is quite simple:
They request a signedurl from us via an API / Lambda function we've published in AWS.
We return an S3 signedurl.
They use the signedurl to POST a binary file (in this case, a compressed .zip file).
While we don't have all of their code, they shared this snippet regarding the actual POST operation. (interesting to note, they're using the PUT method...not sure if that matters).
The problem occurs in Step 3. Their service uploads the binary file and, if we rename the uploaded file to .txt and open it, we see the following additional elements have been added to the .zip.
--6de81f1b-be80-44de-8a14-7e023c92fcf3
Content-Type: application/octet-stream
Content-Disposition: form-data; name="file"; filename="H22101_0031_087.zip"
This alters the structure of the .zip file considerably, causes checksum operations to fail and prevents us from decompressing the file...essentially it isn't a valid .zip unless we strip these additional elements out.
The code responsible for posting the binary is quite simple:
var dataFileUploadRequest = new RestRequest();
dataFileUploadRequest.Method = Method.Put;
dataFileUploadRequest.AddFile("file", sdpData, documentFileName);
dataFileUploadRequest.RequestFormat = DataFormat.Binary;
They implemented a workaround by using HttpClient instead of RestSharp; however, we would like to better understand why this is happening.
Does anyone know why RestSharp's library would append this additional Content-Type, Content-Disposition, and even that strange GUID (which appears to be a sort of transaction id) around the .zip's binary contents?
I'm trying to implement a file storage service with a basic S3-compatible API using akka-http.
I use the S3 Java SDK to test my service's API and ran into a problem with the putObject(...) method. I can't consume the file properly on my akka-http backend. I wrote a simple route for test purposes:
import java.io.File
import akka.http.scaladsl.server.Directives._
import akka.stream.scaladsl.FileIO

def putFile(bucket: String, file: String) = put {
  extractRequestEntity { ent =>
    val finishedWriting = ent.dataBytes.runWith(FileIO.toPath(new File(s"/tmp/${file}").toPath))
    onComplete(finishedWriting) { ioResult =>
      complete("Finished writing data: " + ioResult)
    }
  }
}
It saves the file, but the file is always corrupted. Looking inside the file, I found lines like these:
"20000;chunk-signature=73c6b865ab5899b5b7596b8c11113a8df439489da42ddb5b8d0c861a0472f8a1".
When I try to PUT a file with any other REST client, it works as expected.
I know S3 uses the "Expect: 100-continue" header, and maybe that is what causes the problem.
I really can't figure out how to deal with that. Any help appreciated.
This isn't exactly corrupted. Your service is not accounting for one of the four¹ ways S3 supports uploads to be sent on the wire, using Content-Encoding: aws-chunked and x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
It's a non-standards-based mechanism for streaming an object, and includes chunks that look exactly like this:
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
...where IntHexBase() is pseudocode for a function that formats an integer as a hexadecimal number as a string.
This chunk-based algorithm is similar to, but not compatible with, Transfer-Encoding: chunked, because it embeds checksums in the stream.
Why did they make up a new HTTP transfer encoding? It's potentially useful on the client side because it eliminates the need to either "read your payload twice or buffer [the entire object payload] in memory [concurrently]" -- one or the other of which is otherwise necessary if you are going to calculate the x-amz-content-sha256 hash before the upload begins, as you otherwise must, since it's required for integrity checking.
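To make the framing concrete, here is a hypothetical Java sketch (not the SDK's or S3's actual code) that strips the aws-chunked framing and recovers the raw payload. A real S3-compatible service would also have to verify each chunk-signature against the seed signature derived from the Authorization header, which is omitted here.

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: undo aws-chunked framing. Signature verification omitted.
public class AwsChunkedDecoder {
    public static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataInputStream data = new DataInputStream(in);
        while (true) {
            String header = readLine(data);                        // e.g. "20000;chunk-signature=73c6b8..."
            if (header.isEmpty()) break;                           // end of stream
            int size = Integer.parseInt(header.split(";")[0], 16); // chunk size is hex-encoded
            if (size == 0) break;                                  // final zero-length chunk ends the payload
            byte[] chunk = new byte[size];
            data.readFully(chunk);                                 // read exactly 'size' bytes of payload
            out.write(chunk);
            readLine(data);                                        // consume trailing CRLF after the chunk data
        }
        return out.toByteArray();
    }

    private static String readLine(DataInputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') line.write(b);
        }
        return new String(line.toByteArray(), StandardCharsets.US_ASCII);
    }
}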
I am not overly familiar with the internals of the Java SDK, but this type of upload might be triggered by using .withInputStream() or it might be standard behavior for files too, or for files over a certain size.
Your minimum workaround would be to throw an HTTP error if you see x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD in the request headers since you appear not to have implemented this in your API, but this would most likely only serve to prevent storing objects uploaded by this method. The fact that this isn't already what happens automatically suggests that you haven't implemented x-amz-content-sha256 handling at all, so you are not doing the server-side payload integrity checks that you need to be doing.
For full compatibility, you'll need to implement the algorithm supported by S3 and assumed to be available by the SDKs, unless the SDKs specifically support a mechanism for disabling this algorithm -- which seems unlikely, since it serves a useful purpose, particularly (it appears) for streams whose length is known but that aren't seekable.
¹ one of four -- the other three are a standard PUT, a web-based html form POST, and the multipart API that is recommended for large files and mandatory for files larger than 5 GB.
I have an API for a file upload that expects a multipart form submission. But I have a customer writing a client and his system can't properly generate a multipart/form-data request. He's asking that I modify my API to accept the file in a application/x-www-form-urlencoded request, with the filename in one key/value pair and the contents of the file, base64 encoded, in another key/value pair.
In principle I can easily do this (tho I need a shower afterwards), but I'm worried about size limits. The files we expect in Production will be fairly large: 5-10MB, sometimes up to 20MB. I can't find anything that tells me about length limitations on individual key/value pair data inside a form POST, either in specs (I've looked at, among others, the HTTP spec and the Forms spec) or in a specific implementation (my API runs on a Java application server, Jetty, with an Apache HTTP server in front of it).
What is the technical and practical limit for an individual value in a key/value pair in a form POST?
There are artificial limits, present as configuration on the HttpConfiguration class, both for the maximum number of keys and for the maximum size of the request body content.
In practical terms, this is a really bad idea.
You'll have a String, which uses 2 bytes per character, holding the Base64 data.
And you have the typical 33% size overhead simply from it being Base64.
They'll also have to UTF-8 urlencode the Base64 string for various special characters (such as "+", which has meaning in Base64 but is a space " " in urlencoded form, so they'll need to encode that "+" as "%2B").
So for a 20MB file you'll have ...
20,971,520 bytes of raw data, represented as 27,892,122 characters in raw Base64, using (on average) 29,286,728 characters when urlencoded, which will use 58,573,456 bytes of memory in its String form.
The decoding process on Jetty will take the incoming raw urlencoded bytes and allocate 2x that size in a String before decoding the urlencoded form. So that's a 58,573,456 length java.lang.String (that uses 117,146,912 bytes of heap memory for the String, and don't forget the 29MB of bytebuffer data being held too!) just to decode that Base64 binary file as a value in a x-www-form-urlencoded String form.
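A quick back-of-the-envelope check of those figures (the urlencoded growth factor is only an estimate and depends on the actual byte distribution):

// Hypothetical sketch: rough arithmetic for the numbers quoted above.
public class UploadSizeEstimate {
    public static void main(String[] args) {
        long raw = 20L * 1024 * 1024;                       // 20,971,520 bytes of file data
        long base64Chars = 4 * ((raw + 2) / 3);             // roughly 28 million Base64 characters
        long urlencodedChars = (long) (base64Chars * 1.05); // ~5% growth from %2B / %2F / %3D escapes
        long heapBytes = urlencodedChars * 2;               // a Java String costs 2 bytes per char
        System.out.printf("base64=%,d urlencoded~%,d heap~%,d bytes%n",
                base64Chars, urlencodedChars, heapBytes);
    }
}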
I would push back and force them to use multipart/form-data properly. There are tons of good libraries to generate that form-data properly.
If they are using Java, tell them to use the httpmime library from the Apache HttpComponents project (they don't have to have/use/install Apache Http Client to use httpmime, it's a standalone library).
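For illustration, a minimal hypothetical sketch using httpmime 4.x to build the multipart body, sent here with plain HttpURLConnection (the URL, field name, and file name are placeholders):

import java.io.File;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.http.HttpEntity;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;

// Hypothetical sketch: httpmime builds the multipart/form-data body, and any
// HTTP client can send it; here a plain HttpURLConnection streams it out.
public class MultipartUpload {
    public static void main(String[] args) throws Exception {
        HttpEntity entity = MultipartEntityBuilder.create()
                .addBinaryBody("file", new File("report.zip"),
                        ContentType.APPLICATION_OCTET_STREAM, "report.zip")
                .build();

        HttpURLConnection conn = (HttpURLConnection)
                new URL("https://example.com/api/upload").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        // The built entity knows its own boundary-bearing Content-Type and length.
        conn.setRequestProperty("Content-Type", entity.getContentType().getValue());
        conn.setFixedLengthStreamingMode(entity.getContentLength());

        try (OutputStream out = conn.getOutputStream()) {
            entity.writeTo(out);
        }
        System.out.println(conn.getResponseCode());
    }
}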
Alternative Approach
There's nothing saying you have to use application/x-www-form-urlencoded or multipart/form-data.
Offer a raw upload option via application/octet-stream
They use POST, and MUST include the following valid request headers ...
Connection: close
Content-Type: application/octet-stream
Content-Length: <whatever_size_the_content_is>
Connection: close to indicate when the http protocol is complete.
Content-Type: application/octet-stream means Jetty will not process that content as request parameters and will not apply charset translations to it.
Content-Length is required to ensure that the entire file is sent/received.
Then just stream the raw binary bytes to you.
This is just for the file contents; if you have other information that needs to be passed in (such as the filename), consider using either query parameters or a custom request header for it (eg: X-Filename: secretsauce.doc).
On your servlet, you just use HttpServletRequest.getInputStream() to obtain those bytes, and you use the Content-Length variable to verify that you received the entire file.
Optionally, you can make them provide a SHA1 hash in the request headers, like X-Sha1Sum: bed0213d7b167aa9c1734a236f798659395e4e19 which you then use on your side to verify that the entire file was sent/received properly.
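Putting those pieces together, here is a hypothetical servlet sketch of the raw-upload approach (assuming the javax.servlet API); the X-Filename and X-Sha1Sum header names are just the illustrative ones suggested above:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical sketch: accept an application/octet-stream POST, stream it to
// disk, and verify Content-Length and an optional SHA-1 sent in a header.
public class RawUploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String filename = req.getHeader("X-Filename");      // illustrative header name
        String expectedSha1 = req.getHeader("X-Sha1Sum");   // illustrative header name
        long expectedLength = req.getContentLengthLong();

        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (Exception e) {
            throw new ServletException(e);
        }

        long received;
        try (InputStream in = new DigestInputStream(req.getInputStream(), sha1)) {
            // Stream straight to disk; nothing is buffered as a giant String.
            received = Files.copy(in, Paths.get("/tmp", filename));
        }

        if (received != expectedLength) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "Truncated upload");
            return;
        }
        String actualSha1 = toHex(sha1.digest());
        if (expectedSha1 != null && !expectedSha1.equalsIgnoreCase(actualSha1)) {
            resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "SHA-1 mismatch");
            return;
        }
        resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}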
I am using Apache mod_deflate to return compressed html from a webpage. It has reduced the generated page size from 3k down to 700 bytes.
How do I use HttpConnection in Blackberry to get the compressed page (i.e. only 700bytes instead of 3k)?
P.S. Trying to use the GZIPInputStream(inputStream) keeps returning an incorrect header check error.
As I understand it, you have already tried to download the page and got non-compressed HTML.
If so, I think you should add an "Accept-Encoding" header to your request (question on forum). Try:
connection.setRequestProperty("Accept-Encoding", "gzip, deflate");
Don't forget that you will get compressed data back, so you need to decompress it before using it.
Also, as mentioned here, gzip/deflate is not so efficient when your traffic is going over BIS-B or BES, because the BlackBerry servers will encode/decode the data to analyze it and make it more efficient for transmission.
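Putting it together, something along these lines should work. This is a hypothetical sketch that assumes the RIM net.rim.device.api.compress.GZIPInputStream wrapper is available; adjust the connection URL suffix for your transport (BIS-B, BES, direct TCP, etc.):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import javax.microedition.io.Connector;
import javax.microedition.io.HttpConnection;
import net.rim.device.api.compress.GZIPInputStream;

// Hypothetical sketch: request a gzipped response and only wrap the stream
// when the server actually sent compressed content.
public class GzipFetch {
    public static byte[] fetch(String url) throws Exception {
        HttpConnection conn = (HttpConnection) Connector.open(url);
        conn.setRequestProperty("Accept-Encoding", "gzip, deflate");

        InputStream in = conn.openInputStream();
        String encoding = conn.getHeaderField("Content-Encoding");
        if (encoding != null && encoding.indexOf("gzip") != -1) {
            in = new GZIPInputStream(in);  // RIM's decompressing stream wrapper
        }

        // Read the (possibly decompressed) body into memory.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
        conn.close();
        return out.toByteArray();
    }
}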
I'm successfully using yajl-objc along with ASIHTTPRequest in an iPhone project that does network access and pulls down and parses JSON data. ASIHTTPRequest allows gzipped HTTP responses by default, which is great, but I'm using the streaming parser ability of YAJL and it rightfully chokes on gzipped data. I can wait until the HTTP request has finished then un-gzip and parse the response, but I'm going for speed here and would like to parse the gzipped data as it downloads.
Is it possible to un-gzip data on the fly, parse the JSON within, then forget about that chunk of gzipped data?
If this last part could be solved, this setup seems like it would make for a great system:
YAJL is one of the fastest JSON parsers around
ASIHTTPRequest is easy and asynchronous
Response bodies could be gzipped, saving on-the-wire traffic
JSON could be parsed without loading the whole tree into constrained device memory
Any guidance would be greatly appreciated!
YES: http://groups.google.com/group/asihttprequest/browse_thread/thread/ee2e44379b181439/7699dd200780cd32#7699dd200780cd32