How is file upload handled in HTTP?

I am curious to know how webservers handle file uploads.
Is the entire file sent as a single chunk? Or is it streamed into the webserver, which puts it together and saves it in a temp folder for PHP etc. to use?

It's just a matter of following the encoding rules so that one can easily decode (parse) it. Read the specification on multipart/form-data encoding (the one required for HTML-based file uploads using input type="file").
Generally the parsing is done by the server-side application itself. The webserver only takes care of streaming the bytes from one side to the other.
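PHP, for example, does this parsing transparently and exposes the result via $_FILES after spooling the upload to a temp directory. In a Java servlet container the equivalent looks roughly like this (a minimal sketch; the part name "file" and the /tmp spool location are illustrative):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.MultipartConfig;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.Part;

// The container streams the request body and spools large parts to disk;
// the application only asks for the already-parsed part.
@MultipartConfig(location = "/tmp")
public class UploadServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        Part part = req.getPart("file");   // container parses multipart/form-data
        part.write("upload.bin");          // saved relative to the configured location
        resp.getWriter().println("Stored " + part.getSubmittedFileName());
    }
}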

To answer the question directly: it's streamed. See RFC 1867 for more information.

RFC 1867 describes the mechanism.

Implementing basic S3 compatible API with akka-http

I'm trying to implement a file storage service with a basic S3-compatible API using akka-http.
I use the S3 Java SDK to test my service API and ran into a problem with the putObject(...) method. I can't consume the file properly on my akka-http backend. I wrote a simple route for test purposes:
def putFile(bucket: String, file: String) = put {
  extractRequestEntity { ent =>
    val finishedWriting = ent.dataBytes.runWith(FileIO.toPath(new File(s"/tmp/${file}").toPath))
    onComplete(finishedWriting) { ioResult =>
      complete("Finished writing data: " + ioResult)
    }
  }
}
It saves the file, but the file is always corrupted. Looking inside the file I found lines like this:
"20000;chunk-signature=73c6b865ab5899b5b7596b8c11113a8df439489da42ddb5b8d0c861a0472f8a1".
When I try to PUT the file with any other REST client, it works as expected.
I know S3 uses the "Expect: 100-continue" header, and maybe that is what causes the problem.
I really can't figure out how to deal with that. Any help appreciated.
This isn't exactly corrupted. Your service is not accounting for one of the four¹ ways S3 supports uploads to be sent on the wire, using Content-Encoding: aws-chunked and x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
It's a non-standards-based mechanism for streaming an object, and includes chunks that look exactly like this:
string(IntHexBase(chunk-size)) + ";chunk-signature=" + signature + \r\n + chunk-data + \r\n
...where IntHexBase() is pseudocode for a function that formats an integer as a hexadecimal string.
This chunk-based algorithm is similar to, but not compatible with, Transfer-Encoding: chunked, because it embeds checksums in the stream.
Why did they make up a new HTTP transfer encoding? It's potentially useful on the client side because it eliminates the need to either "read your payload twice or buffer [the entire object payload] in memory [concurrently]" -- one or the other of which is otherwise necessary if you are going to calculate the x-amz-content-sha256 hash before the upload begins, as you otherwise must, since it's required for integrity checking.
I am not overly familiar with the internals of the Java SDK, but this type of upload might be triggered by using .withInputStream(), or it might be standard behavior for files too, or for files over a certain size.
Your minimum workaround would be to throw an HTTP error if you see x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD in the request headers, since you appear not to have implemented this in your API; but that would most likely only prevent storing objects uploaded by this method. The fact that this doesn't already happen automatically suggests that you haven't implemented x-amz-content-sha256 handling at all, so you are not doing the server-side payload integrity checks that you need to be doing.
For full compatibility, you'll need to implement the algorithm supported by S3 and assumed to be available by the SDKs, unless the SDKs specifically support a mechanism for disabling this algorithm -- which seems unlikely, since it serves a useful purpose, particularly (it appears) for streams whose length is known but that aren't seekable.
¹ one of four -- the other three are a standard PUT, a web-based HTML form POST, and the multipart API that is recommended for large files and mandatory for files larger than 5 GB.
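To make the framing concrete, here is a rough Java sketch of stripping it to recover the payload. It assumes exactly the chunk format quoted above, and it omits the chunk-signature verification that a real implementation must perform:

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public final class AwsChunkedDecoder {
    public static void decode(InputStream in, OutputStream out) throws IOException {
        DataInputStream din = new DataInputStream(in);
        while (true) {
            // e.g. "20000;chunk-signature=73c6b865ab58..."
            String header = readLine(din);
            int size = Integer.parseInt(header.split(";", 2)[0], 16);
            if (size == 0) break;          // zero-length chunk terminates the stream
            byte[] chunk = new byte[size];
            din.readFully(chunk);
            out.write(chunk);
            readLine(din);                 // consume the CRLF that follows chunk-data
        }
        // trailing headers (if any) after the final chunk are ignored here
    }

    private static String readLine(DataInputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') buf.write(b);
        return new String(buf.toByteArray(), StandardCharsets.US_ASCII).replace("\r", "");
    }
}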

Uploading a file via the JAX-RS client interface, with a third-party server

I need to invoke a remote REST interface handler and submit a file in the request body. Please note that I don't control the server. I cannot change the request to be multipart; the client has to work in accordance with an external specification.
So far I managed to make it work like this (omitting headers etc. for brevity):
byte[] data = readFileCompletely();
client.target(url).request().post(Entity.entity(data, "file/mimetype"));
This works, but will fail with huge files that don't fit into memory. And since I have no restriction on file size, this is a concern.
Question: is it somehow possible to use streams or something similar to avoid reading the whole file into memory?
If possible, I'd prefer to avoid implementation-specific extensions. If not, a solution that works with RESTEasy (on Wildfly) is also acceptable.
RESTEasy as well as Jersey supports InputStream out of the box, so simply use Entity.entity(inputStream, "application/octet-stream"); or whatever Content-Type header you want to set.
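A minimal sketch of that approach (the URL and file path are placeholders; note that some JAX-RS client connectors still buffer the entity to compute Content-Length unless chunked transfer encoding is enabled, so check your provider's configuration):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.client.Entity;
import javax.ws.rs.core.Response;

public class StreamingUpload {
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get("/path/to/huge.bin"))) {
            Response response = ClientBuilder.newClient()
                    .target("http://example.com/upload")   // placeholder URL
                    .request()
                    .post(Entity.entity(in, "application/octet-stream"));
            System.out.println("HTTP " + response.getStatus());
        }
    }
}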
You can go low-level and construct the HTTP request using a library such as the plain java.net.URLConnection.
I have not tried it myself but there is example code which reads a local file and writes it to the request stream without loading it into a byte array.
Upload files from Java client to a HTTP server
Of course this solution requires more manual coding, but it should work (unless java.net.URLConnection loads the whole file into memory).
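Something along these lines (an untested sketch; URL and path are placeholders). The key detail is the fixed-length streaming mode, which stops HttpURLConnection from buffering the whole body in memory to compute Content-Length:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawPost {
    public static void main(String[] args) throws Exception {
        Path file = Paths.get("/path/to/huge.bin");   // placeholder
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://example.com/upload").openConnection();  // placeholder URL
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "file/mimetype");
        // stream the body instead of buffering it to determine its length
        conn.setFixedLengthStreamingMode(Files.size(file));
        try (InputStream in = Files.newInputStream(file);
             OutputStream out = conn.getOutputStream()) {
            in.transferTo(out);   // Java 9+; copy in a loop with a buffer otherwise
        }
        System.out.println(conn.getResponseCode());
    }
}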

What is http multipart request?

I have been writing iPhone applications for some time now, sending data to a server and receiving data (via the HTTP protocol) without thinking too much about it. I am mostly familiar with the process in theory, but the part I am not so familiar with is the HTTP multipart request. I know its basic structure, but the core of it eludes me.
It seems that whenever I am sending something other than plain text (like photos or music), I have to use a multipart request. Can someone briefly explain why it is used and what its advantages are?
If I use it, why is it a better way to send photos?
An HTTP multipart request is a request that HTTP clients construct to send files and data to an HTTP server. It is commonly used by browsers and HTTP clients to upload files to the server.
What it looks like
See Multipart Content-Type
See multipart/form-data
As the official specification says, "one or more different sets of data are combined in a single body". So when photos and music are handled as multipart messages, as mentioned in the question, there is probably some plain-text metadata associated with them as well, making the request contain different types of data (binary and text), which implies the use of multipart.
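For a concrete picture, a simplified multipart/form-data upload looks like this on the wire (the boundary, names, and filename are made up); each part carries its own headers, and the boundary string separates the parts:

POST /upload HTTP/1.1
Host: example.com
Content-Type: multipart/form-data; boundary=XXXX

--XXXX
Content-Disposition: form-data; name="caption"

Holiday photo
--XXXX
Content-Disposition: form-data; name="photo"; filename="beach.jpg"
Content-Type: image/jpeg

...binary JPEG bytes...
--XXXX--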
I have found an excellent and relatively short explanation here.
A multipart request is an HTTP request whose entity packs several parts, each with its own headers and content, into a single body.

Suggestions for best way of implementing HTTP upload resume

I'm working on a project which will allow large files (GB+) to be uploaded via HTTP PUT, and I need to implement a method for resuming the upload. Once a file is uploaded and finalized it is complete and cannot be modified. So far I have two options in mind, but neither fits perfectly:
Option 1
The client sends an initial HEAD request for the file, which returns either 404 if it does not exist, or the file details, including current size, along with an X-Header along the lines of X-Can-Resume (or something like that) to specify whether the upload can be resumed, and a Range header specifying which bytes it has. This seems OK, but I'm not keen on the X-Header as it departs from the HTTP standard.
Option 2
The client sends a PUT request with a Content-Length header of 0 bytes and no body; the server can then send back a 308 Resume Incomplete (as proposed here: http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal) or a 202 Accepted response to indicate whether to resume or start from the beginning. This also seems acceptable apart from the use of a non-standard status code.
Any other suggestions on the best way to implement this?
Thanks,
J
For either solution there are no existing client and server implementations, so I'm guessing you'll code both. I think you should just find the right balance between the simplest thing that works and what is described in the Gears proposal (by the way, you probably know Gears is dead), and be prepared to change when a standard emerges.
If I were to implement this feature, I'd make it possible for the client to upload in chunks, and I would add a message digest over the whole content as well as over each chunk.
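For example, a resume exchange built on the question's Option 1 might look like this (entirely illustrative: X-Can-Resume is a made-up header, and Content-Range on a PUT is itself non-standard):

HEAD /uploads/huge.bin HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
X-Can-Resume: true
Range: bytes=0-52428799              (server already holds the first 50 MB)

PUT /uploads/huge.bin HTTP/1.1
Host: example.com
Content-Range: bytes 52428800-1073741823/1073741824
Content-Length: 1021313024

...remaining bytes...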

I need Multi-Part DOWNLOADS from Amazon S3 for huge files

I know Amazon S3 added multipart upload for huge files. That's great. What I also need is similar functionality on the client side, for customers who get partway through downloading a gigabyte-plus file and hit errors.
I realize browsers have some level of retry and resume built in, but when you're talking about huge files I'd like to be able to pick up where they left off, regardless of the type of error.
Any ideas?
Thanks,
Brian
S3 supports the standard HTTP "Range" header if you want to build your own solution.
S3 Getting Objects
I use aria2c. For private content, you can use "GetPreSignedUrlRequest" to generate temporary private URLs that you can pass to aria2c.
S3 has a feature called byte-range fetches. It's kind of the download complement to multipart upload:
Using the Range HTTP header in a GET Object request, you can fetch a byte-range from an object, transferring only the specified portion. You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request. Fetching smaller ranges of a large object also allows your application to improve retry times when requests are interrupted. For more information, see Getting Objects.
Typical sizes for byte-range requests are 8 MB or 16 MB. If objects are PUT using a multipart upload, it’s a good practice to GET them in the same part sizes (or at least aligned to part boundaries) for best performance. GET requests can directly address individual parts; for example, GET ?partNumber=N.
Source: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html
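As a rough sketch, a single byte-range fetch with the AWS SDK for Java v1 looks like this (bucket, key, and sizes are placeholders); issuing several of these concurrently for adjacent ranges and stitching the results gives you the parallel download:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class RangedGet {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        long chunk = 8L * 1024 * 1024;   // 8 MB, per the guidance above
        GetObjectRequest req = new GetObjectRequest("my-bucket", "huge-file.bin")
                .withRange(0, chunk - 1);   // inclusive byte range
        try (S3Object part = s3.getObject(req)) {
            byte[] data = part.getObjectContent().readAllBytes();  // just these bytes
            System.out.println("fetched " + data.length + " bytes");
        }
    }
}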
Just updating for the current situation: S3 natively supports multipart GET as well as PUT. https://youtu.be/uXHw0Xae2ww?t=1459
NOTE: For Ruby users only
Try the aws-sdk gem for Ruby, and download with:
object = Aws::S3::Object.new(...)
object.download_file('path/to/file.rb')
because it downloads large files with the multipart method by default.
Files larger than 5 MB are downloaded using the multipart method.
http://docs.aws.amazon.com/sdkforruby/api/Aws/S3/Object.html#download_file-instance_method