is there a size limit to individual fields in HTTP POST? - apache

I have an API for a file upload that expects a multipart form submission. But I have a customer writing a client and his system can't properly generate a multipart/form-data request. He's asking that I modify my API to accept the file in a application/x-www-form-urlencoded request, with the filename in one key/value pair and the contents of the file, base64 encoded, in another key/value pair.
In principle I can easily do this (tho I need a shower afterwards), but I'm worried about size limits. The files we expect in Production will be fairly large: 5-10MB, sometimes up to 20MB. I can't find anything that tells me about length limitations on individual key/value pair data inside a form POST, either in specs (I've looked at, among others, the HTTP spec and the Forms spec) or in a specific implementation (my API runs on a Java application server, Jetty, with an Apache HTTP server in front of it).
What is the technical and practical limit for an individual value in a key/value pair in a form POST?

There are artificial limits, configurable on Jetty's HttpConfiguration class, for both the maximum number of form keys and the maximum size of the request body content.
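As a rough illustration (embedded Jetty 9.x is assumed here; depending on the version these limits can also be raised on the servlet context handler or via equivalent server attributes), bumping the form limits might look like:

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;

public class FormLimits {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8080);
        ServletContextHandler context = new ServletContextHandler(ServletContextHandler.SESSIONS);
        // Allow ~30MB of x-www-form-urlencoded content and up to 200 form keys.
        context.setMaxFormContentSize(30 * 1024 * 1024);
        context.setMaxFormKeys(200);
        server.setHandler(context);
        server.start();
        server.join();
    }
}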
In practical terms, this is a really bad idea.
You'll have a String, which uses 2 bytes per character, holding the Base64 data.
And you have the typical 33% size overhead just from being Base64.
They'll also have to UTF-8 urlencode the Base64 string to escape various special characters (such as "+", which has meaning in Base64 but is a space " " in urlencoded form, so they'll need to encode that "+" as "%2B").
So for a 20MB file you'll have ...
20,971,520 bytes of raw data, represented as 27,892,122 characters in raw Base64, using (on average) 29,286,728 characters when urlencoded, which will use 58,573,455 bytes of memory in its String form.
The decoding process on Jetty will take the incoming raw urlencoded bytes and allocate 2x that size in a String before decoding the urlencoded form. So that's a roughly 29,286,728-character java.lang.String (which uses about 58,573,456 bytes of heap memory for the String, and don't forget the ~29MB of ByteBuffer data being held too!) just to decode that Base64 binary file as a value in an x-www-form-urlencoded String form.
I would push back and force them to use multipart/form-data properly. There are tons of good libraries to generate that form-data properly.
If they are using Java, tell them to use the httpmime library from the Apache HttpComponents project (they don't have to have/use/install Apache Http Client to use the httpmime, its a standalone library).
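For instance, a minimal sketch of what that looks like with httpmime (Apache HttpComponents 4.x; the field names and file paths below are made up), writing the multipart body to an arbitrary OutputStream:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.http.HttpEntity;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.mime.MultipartEntityBuilder;

public class MultipartExample {
    public static void main(String[] args) throws Exception {
        // Build a proper multipart/form-data body with a text field and a binary file part.
        HttpEntity entity = MultipartEntityBuilder.create()
                .addTextBody("filename", "report.bin")
                .addBinaryBody("file", new File("/tmp/report.bin"),
                        ContentType.APPLICATION_OCTET_STREAM, "report.bin")
                .build();

        // httpmime is standalone: the entity can be written to any OutputStream,
        // normally the request body of whatever HTTP client they already use.
        try (OutputStream out = new FileOutputStream("/tmp/multipart-body.bin")) {
            entity.writeTo(out);
        }
        // The Content-Type header (including the boundary) comes from the entity itself.
        System.out.println(entity.getContentType().getValue());
    }
}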
Alternative Approach
There's nothing saying you have to use application/x-www-form-urlencoded or multipart/form-data.
Offer a raw upload option via application/octet-stream
They use POST, and MUST include the following valid request headers ...
Connection: close
Content-Type: application/octet-stream
Content-Length: <whatever_size_the_content_is>
Connection: close is there to indicate when the HTTP exchange is complete.
Content-Type: application/octet-stream means Jetty will not process that content as request parameters and will not apply charset translations to it.
Content-Length is required to ensure that the entire file is sent/received.
Then just stream the raw binary bytes to you.
This is just for the file contents; if you have other information that needs to be passed in (such as the filename), consider using either query parameters for that, or a custom request header (eg: X-Filename: secretsauce.doc)
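A client-side sketch of that raw upload (plain HttpURLConnection; the URL and file path are hypothetical):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawUploadClient {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("/tmp/report.bin");
        long length = Files.size(file);

        HttpURLConnection conn = (HttpURLConnection) new URL("https://api.example.com/upload").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Connection", "close");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        conn.setRequestProperty("X-Filename", file.getFileName().toString()); // custom header, as suggested above
        conn.setFixedLengthStreamingMode(length); // sends an exact Content-Length without buffering the whole body

        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out); // stream the raw binary bytes
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}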
On your servlet, you just use HttpServletRequest.getInputStream() to obtain those bytes, and you use the Content-Length header to verify that you received the entire file.
Optionally, you can make them provide a SHA1 hash in the request headers, like X-Sha1Sum: bed0213d7b167aa9c1734a236f798659395e4e19 which you then use on your side to verify that the entire file was sent/received properly.
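A rough server-side sketch of that (plain Servlet API; the /upload path, the /tmp target, and the header names mirror the suggestions above but are otherwise made up):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/upload")
public class RawUploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        try {
            // Filename from the custom header (sanitize it properly in real code).
            String filename = req.getHeader("X-Filename");
            long declaredLength = req.getContentLengthLong(); // from the Content-Length header

            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            Path target = Paths.get("/tmp", filename == null ? "upload.bin" : filename);

            long received;
            try (InputStream in = new DigestInputStream(req.getInputStream(), sha1)) {
                received = Files.copy(in, target); // stream the raw body straight to disk
            }

            if (declaredLength >= 0 && received != declaredLength) {
                resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "Incomplete upload");
                return;
            }

            // Optional integrity check against a client-supplied X-Sha1Sum header.
            String expected = req.getHeader("X-Sha1Sum");
            if (expected != null && !expected.equalsIgnoreCase(toHex(sha1.digest()))) {
                resp.sendError(HttpServletResponse.SC_BAD_REQUEST, "SHA-1 mismatch");
                return;
            }
            resp.setStatus(HttpServletResponse.SC_CREATED);
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}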

Related

Can I trust the .Length property on IFormFile in ASP.NET Core?

We have an API endpoint that allows users to upload images; one of its parameters is an IFormFileCollection.
We'd like to validate the file size to make sure that the endpoint isn't being abused so I'm checking the Length property of each IFormFile, but I don't know whether I can trust this property or not, i.e. does this come from the request? Is it considered 'input', much like Content-Length is?
If you have an IFormFileCollection parameter, and you send data using a "form-data" content type in the request, that parameter will be bound by a whole lot of plumbing that's hard to dig through online. But if you just debug the action method that accepts the IFormFileCollection (or any collection of IFormFile, really) and inspect the collection, you'll see that the uploaded files will already have been saved on your server's disk.
That's because the entire multipart form request body has to be read to determine how many files (if any) and form parameters it contains, and to validate the body's format while reading it.
So yes, by the time your code ends up there, you can trust IFormFile.Length, because it's pointing to a local file that exists and contains that many bytes.
You're too late there to reject the request though, as it's already been entirely read. You'd better enforce rate and size limits lower in the stack, like on the web server or firewall.
Content-Length is the (possibly compressed) number of bytes of data in the body. It is not reliable for validating a file's size, since it may include extra data, for example when you are sending a multipart request. Just use IFormFile.Length for things like calculation or validation.

Implementing basic S3 compatible API with akka-http

I'm trying to implement a file storage service with a basic S3-compatible API using akka-http.
I use s3 java sdk to test my service API and got the problem with the putObject(...) method. I can't consume file properly on my akka-http backend. I wrote simple route for the test purposes:
def putFile(bucket: String, file: String) = put {
  extractRequestEntity { ent =>
    val finishedWriting = ent.dataBytes.runWith(FileIO.toPath(new File(s"/tmp/${file}").toPath))
    onComplete(finishedWriting) { ioResult =>
      complete("Finished writing data: " + ioResult)
    }
  }
}
It saves the file, but the file is always corrupted. Looking inside the file I found lines like this:
"20000;chunk-signature=73c6b865ab5899b5b7596b8c11113a8df439489da42ddb5b8d0c861a0472f8a1".
When I try to PUT the file with any other REST client it works fine, as expected.
I know S3 uses the "Expect: 100-continue" header and maybe that's what causes the problems.
I really can't figure out how to deal with that. Any help appreciated.
This isn't exactly corrupted. Your service is not accounting for one of the four¹ ways S3 supports uploads to be sent on the wire, using Content-Encoding: aws-chunked and x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
It's a non-standards-based mechanism for streaming an object, and includes chunks that look exactly like this:
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
...where IntHexBase() is pseudocode for a function that formats an integer as a hexadecimal number as a string.
This chunk-based algorithm is similar to, but not compatible with, Transfer-Encoding: chunked, because it embeds checksums in the stream.
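Purely to illustrate that framing (not the real, signature-verified algorithm), a de-framer for such a stream could look roughly like this; it ignores the chunk signatures entirely, so it recovers the payload bytes but performs none of the integrity checks the scheme exists for (Java 11+ assumed for readNBytes):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class AwsChunkedDecoder {
    // Reads "hexSize;chunk-signature=..." CRLF, then that many data bytes, then CRLF,
    // repeating until a zero-size chunk terminates the stream.
    public static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream payload = new ByteArrayOutputStream(); // fine for a sketch, not for large uploads
        while (true) {
            String header = readLine(in); // e.g. "20000;chunk-signature=73c6b865..."
            int size = Integer.parseInt(header.split(";")[0].trim(), 16);
            if (size == 0) {
                break; // final, empty chunk
            }
            payload.write(in.readNBytes(size));
            readLine(in); // consume the CRLF that follows the chunk data
        }
        return payload.toByteArray();
    }

    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') sb.append((char) c);
        }
        return sb.toString();
    }
}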
Why did they make up a new HTTP transfer encoding? It's potentially useful on the client side because it eliminates the need to either "read your payload twice or buffer [the entire object payload] in memory [concurrently]" -- one or the other of which is otherwise necessary if you are going to calculate the x-amz-content-sha256 hash before the upload begins, as you otherwise must, since it's required for integrity checking.
I am not overly familiar with the internals of the Java SDK, but this type of upload might be triggered by using .withInputStream() or it might be standard behavior for files too, or for files over a certain size.
Your minimum workaround would be to throw an HTTP error if you see x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD in the request headers since you appear not to have implemented this in your API, but this would most likely only serve to prevent storing objects uploaded by this method. The fact that this isn't already what happens automatically suggests that you haven't implemented x-amz-content-sha256 handling at all, so you are not doing the server-side payload integrity checks that you need to be doing.
For full compatibility, you'll need to implement the algorithm supported by S3 and assumed to be available by the SDKs, unless the SDKs specifically support a mechanism for disabling this algorithm -- which seems unlikely, since it serves a useful purpose, particularly (it appears) for streams whose length is known but that aren't seekable.
¹ one of four -- the other three are a standard PUT, a web-based html form POST, and the multipart API that is recommended for large files and mandatory for files larger than 5 GB.

AWS API Gateway to S3 - PUT Content-Encoding .Z Files

I have run into an issue with using API Gateway as a proxy to S3 (for custom authentication), in that it does not handle binary data well (which is a known issue).
I'm usually uploading either .gz or .Z (Unix compress utility) files. As far as I understand it, the data is not maintained due to encoding issues. I can't seem to figure out a way to decode the data back to binary.
Original leading bytes: \x1f\x8b\x08\x08\xb99\xbeW\x00\x03
After passing through API GW: ��9�W�
... Followed by filename and the rest of the data.
One way of 'getting around this' is to specify Content-Encoding in the header of the PUT request to API GW as 'gzip'. This seems to force API GW to decompress the file before forwarding it to S3.
The same does not work for .Z files compressed with the Unix compress utility, where you would specify the Content-Encoding as 'compress'.
Does anyone have any insight about what is happening to the data, to help shed some light on my issue? Also, does anyone know of any possible workarounds to maintain the encoding of my data while passing through API GW (or to decode it once it's in S3)?
Obviously I could just access the S3 API directly (or have API GW return a pre-signed URL for accessing the S3 API), but there are a few reasons why I don't want to do that.
I should mention that I don't understand very much at all about encoding - sorry if there are some obvious answers to some of my questions.
It's not exactly an "encoding issue" -- it's the fact that API Gateway just doesn't support binary data ("yet")... so it's going to potentially corrupt binary data, depending on the specifics of the data in question.
Uploading as Content-Encoding: gzip probably triggers decoding in a front-end component that is capable of dealing with binary data (gzip, after all, is a standard encoding and is binary) before passing the request body to the core infrastructure... but you will almost certainly find that this is a workaround that does not consistently deliver correct results, depending on the specific payload. The fact that it works at all seems more like a bug than a feature.
For now, the only consistently viable option is base64-encoding your payload, which increases its size on-the-wire by 33% (base64 encoding produces 4 bytes of output for every 3 bytes of input) so it's not much of a solution. Base64 + gzip with the appropriate Content-Encoding: gzip should also work, which seems quite a silly suggestion (converting a compressed file into base64 then gzipping the result to try to reduce its size on the wire) but should be consistent with what API Gateway can currently deliver.
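A small sketch of that base64 workaround (the file path is made up, and the actual request/response plumbing through API Gateway is omitted):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Base64;

public class Base64Workaround {
    public static void main(String[] args) throws Exception {
        byte[] original = Files.readAllBytes(Paths.get("/tmp/archive.Z"));

        // ~33% larger on the wire, but text-safe for API Gateway's current handling.
        String wireSafe = Base64.getEncoder().encodeToString(original);

        // The receiving side (e.g. a Lambda behind API Gateway) restores the exact bytes.
        byte[] restored = Base64.getDecoder().decode(wireSafe);
        System.out.println("round-trip ok: " + Arrays.equals(original, restored));
    }
}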

md5 checksums when uploading to file picker

Background
I'm working on integrating an existing app with File Picker. In our existing setup we are relying on md5 checksums to ensure data integrity. As far as I can see, File Picker does not provide any md5 when it responds to an upload against the REST API (nor when using the JavaScript client).
S3 storage, md5 and data integrity
We are using S3 for storage, and as far as I know you may provide S3 with an md5 checksum when storing files so that Amazon may verify it and reject the store request if the data seems to be wrong.
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.
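For concreteness, a hedged sketch of that Content-MD5 mechanism against S3 directly (AWS SDK for Java v1; the bucket and key names are made up):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class ContentMd5Upload {
    public static void main(String[] args) throws Exception {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);

        // Content-MD5 is the base64 encoding of the raw 128-bit MD5 digest of the body.
        String contentMd5 = Base64.getEncoder()
                .encodeToString(MessageDigest.getInstance("MD5").digest(data));

        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(data.length);
        meta.setContentMD5(contentMd5);

        // S3 recomputes the MD5 server-side and rejects the PUT if it doesn't match.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.putObject(new PutObjectRequest("my-bucket", "my-key",
                new ByteArrayInputStream(data), meta));
    }
}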
I have looked a bit into the etag header which Amazon returns, and found that it isn't clear what is actually returned as the etag. The Java documentation states:
Gets the hex encoded 128-bit MD5 hash of this object's contents as computed by Amazon S3.
The Ruby documentation states:
Generally the ETAG is the MD5 of the object. If the object was uploaded using multipart upload then this is the MD5 of all of the upload-part-md5s
Another place in their documentation I found this:
The entity tag is a hash of the object. The ETag only reflects changes to the contents of an object, not its metadata. The ETag is determined when an object is created. For objects created by the PUT Object operation and the POST Object operation, the ETag is a quoted, 32-digit hexadecimal string representing the MD5 digest of the object data. For other objects, the ETag may or may not be an MD5 digest of the object data. If the ETag is not an MD5 digest of the object data, it will contain one or more non-hexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.
This seems to describe how etag is actually calculated on S3, and this stack overflow post seems to imply the same thing: Etag cannot be trusted to always be equal to the file MD5.
So - here are my questions
1. In general, how does file picker store files to s3? Are multipart post requests used?
2. I see that when I do a HEAD request against, for example, https://www.filepicker.io/api/file/<file handle> I do get an etag header back. The etag I get back does indeed match the md5 of the file I have uploaded. Are the returned headers more or less taken from S3 directly? Or is this actually an md5 calculated by filepicker which I can trust?
3. Is it possible to have an explicit statement of the md5 returned to clients of File Picker's API? For instance, when we POST a file we get a JSON structure back including the URL to the file and its size. Could md5 be included here?
4. Is it possible to provide File Picker with an md5 which in turn will be used when posting files to S3 so we can get an end-to-end check on files?
1. Yes, we use the python boto library to be specific.
2. The ETag is pulled from S3.
3 & 4. It's been considered and is in our backlog, but hasn't been implemented yet.

Android Volley: gzip response

What type of response listener must we use to handle gzip responses with Android Volley?
If a String listener is used, the response seems to lose its encoding.
How do you handle gzip responses using Volley?
MAJOR EDIT:
HttpUrlConnection automatically adds the gzip header to requests, and if the response is gzipped, it will seamlessly decode it and present to you the response. All the gzip stuff happens behind the scenes and you don't need to do what I posted in a gist as an answer to this question. See the documentation here http://developer.android.com/reference/java/net/HttpURLConnection.html
As a matter of fact, the answer I posted SHOULD NOT be used, because the gzip decoding is extremely slow, and should be left to be handled by HttpUrlConnection.
Here is the exact piece from the documentation:
By default, this implementation of HttpURLConnection requests that servers use gzip compression. Since getContentLength() returns the number of bytes transmitted, you cannot use that method to predict how many bytes can be read from getInputStream(). Instead, read that stream until it is exhausted: when read() returns -1. Gzip compression can be disabled by setting the acceptable encodings in the request header:
urlConnection.setRequestProperty("Accept-Encoding", "identity");
So I figured out how to do this.
Basically, I extended StringRequest so that it handles the network response a different way.
You can just parse the response byte array using GZIPInputStream and return the resultant string.
Here's the gist: https://gist.github.com/premnirmal/8526542
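For reference, the core of that approach (decompressing a gzipped byte array with GZIPInputStream) boils down to something like the following, though as noted above it's better to let HttpUrlConnection do this transparently:

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class GzipBody {
    // Takes the raw (gzipped) response bytes and returns the decompressed text.
    public static String decompress(byte[] compressed) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }
}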
You should use ion; it comes pre-set with:
Transparent usage of HTTP features and optimizations:
SPDY and HTTP/2
Caching
Gzip/Deflate Compression
Connection pooling/reuse via HTTP Connection: keep-alive
Uses the best/stablest connection from a server if it has multiple IP addresses
Cookies