Suggestions for best way of implementing HTTP upload resume - apache

I'm working on a project which will allow large files (GB+) to be uploaded via HTTP PUT, and I need to implement a method for resuming the upload. Once a file is uploaded and finalized it is complete and cannot be modified. So far I have two options in mind, but neither fits perfectly:
Option 1
The client sends an initial HEAD request for the file, which returns either 404 if it does not exist, or the file details including its current size, together with an HTTP X-header (something along the lines of X-Can-Resume) specifying whether the file can be resumed, and a Range header specifying which bytes it has. This seems OK, but I'm not keen on the X-header as it departs from the HTTP standard.
Option 2
The client sends a PUT request with a Content-Length of 0 bytes and no body; the server can then send back either a 308 Resume Incomplete (as proposed here: http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal) or a 202 Accepted response to indicate whether to resume or start from the beginning. This also seems acceptable, apart from the use of a non-standard status code.
Any other suggestions on the best way to implement this?
Thanks,
J

In either solution there are no existing client and server implementations, so I'm guessing you'll code both. I think you should just find the right balance between the simplest approach and what is described in the Gears proposal (by the way, you probably know Gears is dead), and be prepared to change when a standard emerges.
If I were to implement this feature, I'd make it possible for the client to upload in chunks, and I would add a message digest over the whole content as well as over each chunk.
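For illustration, here is a rough Python sketch of a resuming client along the lines of Option 1; it assumes (non-standard, so purely an assumption) that the server reports the bytes it already has via HEAD and accepts a Content-Range header on PUT to append the remainder:

import os
import requests

def resume_upload(url, path, chunk_size=1024 * 1024):
    total = os.path.getsize(path)
    head = requests.head(url)
    # assume 404 means "nothing uploaded yet"; otherwise the server reports its current size
    offset = 0 if head.status_code == 404 else int(head.headers.get("Content-Length", 0))
    with open(path, "rb") as f:
        f.seek(offset)
        while offset < total:
            chunk = f.read(chunk_size)
            headers = {"Content-Range": "bytes %d-%d/%d" % (offset, offset + len(chunk) - 1, total)}
            # a digest header per chunk (and one for the whole file at the end) could be added here
            requests.put(url, data=chunk, headers=headers).raise_for_status()
            offset += len(chunk)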

Related

REST API for sending files between services

I'm building a microservice, one of whose APIs expects a file plus some parameters, which the API will process and return a response for.
I've searched and found some references, mostly pointing towards multipart form-data, but they mostly cover client-to-service rather than service-to-service, as in my case.
I'd be happy to know what the best practice is for this case, both for the client (actually a service) and for me.
I would also suggest performing a POST request (multipart) to a service endpoint that can process/accept a byte stream wrapped into the HTTP body. A PUT request may also work in some cases.
Your main concern will be binding enough metadata to the request so that the remote service can handle it correctly. This includes in particular the following headers:
Content-Type: to provide the MIME type of the data being transferred and enable its proper processing.
Content-Disposition: to provide additional information about the body part such as the file name.
I personally believe that a single request is enough (in contrast to Evert's suggestion below), as it will result in less overhead overall and keep things simple (and RESTful) by avoiding any linking (or state) between successive requests.
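As a rough sketch of that multipart approach (assuming a hypothetical endpoint and the Python requests library, which fills in the Content-Type and per-part Content-Disposition headers mentioned above):

import requests

SERVICE_URL = "https://files.internal/process"  # hypothetical service endpoint

with open("report.pdf", "rb") as f:
    files = {"file": ("report.pdf", f, "application/pdf")}  # filename and MIME type per part
    data = {"priority": "high"}                              # extra parameters as ordinary form fields
    resp = requests.post(SERVICE_URL, files=files, data=data)
resp.raise_for_status()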
I would not wrap data in form-data, because it just adds to the total body size. You can just put the entire raw file in the body of a PUT or POST request.
If you also need to send meta-data, I would suggest 2 requests. If you absolutely can't do 2 requests, form-data might still be the best option and it does work server-to-server.
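A rough sketch of that raw-body variant (again using the Python requests library and a hypothetical endpoint); the file is streamed as the entire body and the descriptive headers carry the minimal metadata:

import requests

with open("report.pdf", "rb") as f:
    resp = requests.put(
        "https://files.internal/files/report.pdf",  # hypothetical endpoint
        data=f,                                      # the raw file is the whole request body
        headers={
            "Content-Type": "application/pdf",
            "Content-Disposition": 'attachment; filename="report.pdf"',
        },
    )
resp.raise_for_status()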

Best practice for initial return in a REST-like image upload api endpoint?

When sending a file, e.g. an image, over HTTP to an API, how should the server respond?
Examples:
respond as soon as file is written to disk
respond only when file is written, processed, checksummed, thumbnailed, watermarked etc.
respond as fast as possible with a link to the resource (even if it's a 404 for a few moments afterwards)
add a 'task' endpoint and respond instantly with a task ID to track the progress before data transfer & processing (eventually including path to resource)
Edit: Added one idea from an answer to a similar question: rest api design and workflow to upload images.
The client doesn't know about disks, processing, checksumming, thumbnailing, etc.
The options then are pretty simple. You either want to return the HTTP request as quickly as possible, or you want the client to wait until you know the operation was successful.
If you want the client to wait, return 201 Created. If you want to return as quickly as possible, return 202 Accepted.
Both are acceptable designs. You should let your own requirements dictate which is better for your case. I would say it's a good idea to default to waiting until the operation is guaranteed to be successful, and only use 202 Accepted if asynchronous handling is a specific requirement.
You could also let the client decide with a Prefer header:
Prefer: respond-async, wait=100
See https://www.rfc-editor.org/rfc/rfc7240#section-4.3
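A minimal sketch of that idea, assuming a Flask endpoint and two hypothetical helpers (process_image for the synchronous path, queue_image for the asynchronous one):

from flask import Flask, request

app = Flask(__name__)

@app.route("/images", methods=["POST"])
def upload_image():
    prefer = request.headers.get("Prefer", "")
    if "respond-async" in prefer:
        task_id = queue_image(request.data)   # hypothetical: enqueue for later processing
        return "", 202, {"Location": "/tasks/%s" % task_id,
                         "Preference-Applied": "respond-async"}
    url = process_image(request.data)          # hypothetical: write, checksum, thumbnail, etc.
    return "", 201, {"Location": url}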

HTTPS proxy with support for chunked-encoded requests

I'm developing a simple HTTPS proxy (written in Python) which receives POST/GET requests/responses, applies some transformation and finally forwards the result to the recipient.
I need to handle chunked-encoded requests/responses in a "streaming" fashion, meaning that as soon as a chunk is received the proxy transforms it and forwards it to the recipient.
Before deciding to support chunked-encoded requests, I've been using mitmproxy http://mitmproxy.org/ and it worked perfectly. Unfortunately, I noticed that it waits until the entire body is received before letting me handle the response/request.
How can I implement a proxy supporting chunked-encoded requests/responses? Has anyone of you ever done something like this?
Thanks
EDIT: MORE INFO ON MY USE CASE
I need to handle POST requests and GET responses.
In the POST request I receive a JSON object and I have to encrypt some of its values.
In the GET response I receive a JSON object and I have to decrypt some of its values.
Till now, the following code has worked perfectly:
def handle_request(self, r):
    if r.method == 'POST':
        # encrypt some of the values in r.get_form_urlencoded()
        ...

def handle_response(self, r):
    if r.request.method == 'GET':
        # decrypt some of the values in r.content
        ...
How can I do the same thing with single chunks?
EDIT: UPDATES
After evaluating different solutions, I decided to go for Squid (proxy) + ICAP (content adaptation).
I've successfully configured Squid and the performance is just great. Unfortunately, I can't find a suitable ICAP server (in Python, if possible) for doing content adaptation (modification). I thought this one https://github.com/netom/pyicap could do the job, but it looks like it doesn't read the body of my POST requests.
Do you guys know a Python ICAP server that I can use together with Squid?
Thanks
The answer below is outdated. You can now pass --stream to mitmproxy, whose behaviour is explained in the mitmproxy documentation.
mitmproxy developer here. This is definitely a feature we want for mitmproxy as well, but it's not that trivial and probably not coming very soon. If you really want to implement that yourself, I can recommend two things:
If you have a very specific use case, you can employ libmproxy.protocol.http.HTTPRequest.from_stream for parsing the header and do the body processing yourself.
If you do not want to modify the request/response body, you may find it sufficient to modify mitmproxy itself. In a nutshell, you would need to read the request/response without content (see 1.), modify it to your needs, pass it on to the server, and then delegate control to libmproxy.protocol.tcp (see https://github.com/mitmproxy/mitmproxy/blob/master/libmproxy/proxy/server.py#L169).
If you have further questions, don't hesitate to ask here or on mitmproxy's IRC channel.
Re Comment #1:
You can't take too much out of mitmproxy, but at least you get to delegate the header parsing & processing.
# ...accept request, socket.makefile() etc...
req = HTTPRequest.from_stream(client_conn.rfile, include_content=False)
# manually forward to the server (req._assemble_head())
# manually receive the request body chunk by chunk and forward it to the server, see
# https://github.com/mitmproxy/netlib/blob/master/netlib/http.py#L98
resp = HTTPResponse.from_stream(server_conn.rfile, include_content=False)
# manually forward headers
# manually process body and forward
That being said, this is a fairly complex topic. Eventually, you're better off hacking that directly into libmproxy.protocol.http.HTTPHandler.
Another option, depending on your use case again: use mitmproxy, set the conntype to tcp, forward traffic as-is, and use regex replacements on the content in libmproxy.protocol.tcp. Probably the easiest way, but the most hacky one.
If you can provide some context, I may guide you further in the right direction.
Re Comment #2:
Before we get to the main part: JSON is a really bad choice for streaming/chunking unless you want to encrypt the complete JSON object and treat it as a single string. You should definitely consider something like tnetstrings if you only want to encrypt parts.
Apart from that, hooking into read_chunk works, but first you need to get to the point where you can actually receive chunks over the line. Then, it's as simple as reading the single chunks, encrypting them and forwarding them.
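A rough sketch of that loop, assuming rfile/wfile are file-like wrappers around the upstream and downstream sockets and transform() is your (hypothetical) per-chunk encrypt/decrypt step:

def relay_chunked_body(rfile, wfile, transform):
    while True:
        size_line = rfile.readline()                     # e.g. b"1a2b\r\n" (hex size, optional extensions)
        chunk_size = int(size_line.split(b";")[0], 16)
        if chunk_size == 0:
            rfile.readline()                             # final CRLF (trailers ignored in this sketch)
            wfile.write(b"0\r\n\r\n")
            break
        chunk = rfile.read(chunk_size)
        rfile.read(2)                                    # CRLF terminating the chunk data
        out = transform(chunk)                           # process the chunk as soon as it arrives
        wfile.write(b"%x\r\n" % len(out))
        wfile.write(out + b"\r\n")
    wfile.flush()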

xbuf_frurl does not work properly without server header of content length?

I'm trying to get some info from other sites with xbuf_frurl.
Some sites work OK, but some do not.
So far I haven't been able to figure out what is going wrong.
But some sites are missing the Content-Length header.
Can anyone tell whether xbuf_frurl() relies on the (potentially missing) Content-Length header, especially when growing the buffer?
xbuf_frurl() does indeed read a body only IF an HTTP Content-Length header is present. It will not try to decode chunked responses.
To deal with those servers using chunked replies, use the G-WAN curl.c example provided with the distribution. With libcurl you even have the option to use SSL/TLS.
If that does not resolve your problem, the only way to troubleshoot this kind of issue is to give a non-working example, with both the full request that you sent and the full reply received from the server.
That's why the xbuf_xcat("%v") format has been added: to give hexdumps, even with binary replies.
Edit your question and add this information to let people help you with a well-defined problem.

Content-Range header - allowed units?

This is related to:
How should I implement a COUNT verb in my RESTful web service?, Paging in a Rest Collection
and Using the HTTP Range Header with a range specifier other than bytes?
Actually, I think the -1 rated answer here is correct: https://stackoverflow.com/a/1434701/1237617
Generally, answers say that you can use custom units, citing sec 3.12:
range-unit = bytes-unit | other-range-unit
bytes-unit = "bytes"
other-range-unit = token
However, when you read the HTTP spec, notice that the production rules are as follows:
Content-Range = "Content-Range" ":" content-range-spec
content-range-spec = byte-content-range-spec
byte-content-range-spec = bytes-unit SP
byte-range-resp-spec "/"
( instance-length | "*" )
The header spec only references bytes-unit from sec 3.12, not range-units, so I think that actually it's against the spec to use custom units here.
Am I missing something, or is the popular answer wrong?
EDIT: Since this probably isn't clear, the gist of my question is:
RFC 2616 sec 14.16 only references bytes-unit. It never mentions range-unit, so the range-unit production is not relevant for Content-Range, and thus only the bytes unit can be used.
I think this addresses my concerns best, although I needed some time to understand it (plus I wanted to make sure that there is something wrong with the wording):
This reflects the fact that, apparently, the first set of grammar rules has been specifically made for parsing and the second one for producing HTTP requests
thanks to elgaton
The spec, as currently being revised, allows custom range units. See HTTPbis Part 5, Section 2.
If you read the HTTP/1.1 RFC, section 3.12, you will see that:
The only range unit defined by HTTP/1.1 is "bytes". HTTP/1.1 implementations MAY ignore ranges specified using other units.
So, the other-range-unit token has been introduced only to make servers more "liberal" when accepting. This reflects the fact that, apparently, the first set of grammar rules has been specifically made for parsing and the second one for producing HTTP requests, so that servers could accept even invalid requests (they will be simply ignored) and clients would use only the universally-accepted bytes unit.
Therefore, I personally recommend to:
use only the bytes unit when acting as a client, and
accept other units (discarding the Content-Range header if they are invalid) when acting as a server.
This is a purely personal opinion, but I think it is fairly consistent with how other HTTP extensions (custom methods or headers) are used. Here is how I read it: Yes I can use custom range units and no, I shouldn't submit a bug report when it gets ignored when passing through firewalls, web proxies, and other intermediaries. I conform to the HTTP spec when I'm sending it and they conform to HTTP when they ignore it. WebDAV uses HTTP extensions correctly, IMO, but rarely works over the Internet for exactly this reason. As I said, a personal opinion only.
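To illustrate the liberal-server half of that advice, here is a small sketch that accepts only the bytes unit and silently ignores anything else:

import re

def parse_content_range(value):
    # returns (first, last, total_or_None), or None if the unit is not "bytes" or the header is malformed
    m = re.match(r"bytes (\d+)-(\d+)/(\d+|\*)$", value or "")
    if not m:
        return None
    first, last, length = m.groups()
    return int(first), int(last), (None if length == "*" else int(length))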
Apparently it's OK to use custom units, because:
This reflects the fact that, apparently, the first set of grammar rules has been specifically made for parsing and the second one for producing HTTP requests.