Does Apache log cancelled downloads?

If a user requests a large file from an Apache web server, but cancels the download before it completes, is this logged by Apache?
Can I tell from the log file which responses were not sent fully, and how many bytes were sent?

Yes, it logs those requests, but you need to use mod_logio to see the actual bytes sent; otherwise it will show the total size of the file. And to know which downloads failed, you'd have to either:
- use the %X format specifier in a custom log format, or
- compare the actual bytes sent against the file's size (though why would you, if you have the first option :-) )
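For illustration, a log configuration along these lines would capture both; the format name here is my own, and %O (actual bytes sent, headers included) requires mod_logio to be loaded:

# %O = actual bytes sent over the wire, including headers (requires mod_logio)
# %X = connection status: 'X' means the connection was aborted before the response completed
LogFormat "%h %l %u %t \"%r\" %>s %O %X" combined_io
CustomLog logs/access_log combined_io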

Yes. If I remember correctly, it will show the number of bytes transferred before the download was interrupted. You could then work out how many bytes should have been sent for that request and compare.
If you're using PHP (as the question was tagged a minute ago), you could probably do some sort of response buffering, where you send the file out in smaller chunks. Start off by working out how many chunks you need to send, write a log entry (to a database, or to syslog) to say you've started, and once you hit the final chunk, write another to say you've finished (or delete the first).

Related

Requests - How to upload large chunks of file in requests?

I have to upload large files (~5GB). I am dividing each file into small chunks (10MB); I can't send all the data (5GB+) at once, as the API I am requesting fails for data larger than 5GB in a single request. The API I am uploading to specifies that a minimum of 10MB of data must be sent per request. I did use read(10485760) and sent it via requests, which works fine.
However, I do not want to read the whole 10MB into memory, and if I leverage multithreading in my script, each thread reading 10MB would cost me too much memory.
Is there a way I can send a total of 10MB to the API per request, but read only 4096/8192 bytes at a time and transfer until I reach 10MB, so that I do not overuse memory?
Please note I cannot send the file object in the request: that would use less memory, but I would not be able to break the upload at 10MB, and the entire 5GB of data would go into one request, which I do not want.
Is there any way via requests? I see that httplib has it (https://github.com/python/cpython/blob/3.9/Lib/http/client.py) - I would call send(fh.read(4096)) in a loop until I complete 10MB, finishing one 10MB request without heavy memory usage.
This is what the documentation says:
In the event you are posting a very large file as a multipart/form-data request, you may want to stream the request. By default, requests does not support this, but there is a separate package which does - requests-toolbelt. You should read the toolbelt’s documentation for more details about how to use it.
So try streaming the upload; if that doesn't work for your needs, then go for requests-toolbelt.
To stream an upload, pass a file-like object or a generator as the data argument of the post or put call; requests will then stream it instead of reading it all into memory. (Note that stream=True controls streaming of the response body on download, not the upload.)
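A minimal sketch of the generator approach (the endpoint URL and file name here are placeholders, and a real API would presumably also need auth headers and a way to order the chunks):

import os
import requests

CHUNK = 10 * 1024 * 1024  # 10 MB per request (the API's stated minimum)
BLOCK = 8192              # how much file data is held in memory at once

def blocks(fh, limit, block_size=BLOCK):
    # Yield small blocks from fh until `limit` bytes have been produced.
    remaining = limit
    while remaining > 0:
        data = fh.read(min(block_size, remaining))
        if not data:
            return
        remaining -= len(data)
        yield data

path = "bigfile.bin"  # placeholder
size = os.path.getsize(path)
with open(path, "rb") as fh:
    sent = 0
    while sent < size:
        limit = min(CHUNK, size - sent)
        # Passing a generator as `data` makes requests stream the body
        # with chunked transfer encoding, so only BLOCK bytes of the
        # file are in memory at any moment.
        resp = requests.post("https://example.com/upload", data=blocks(fh, limit))
        resp.raise_for_status()
        sent += limit

Each iteration sends exactly one 10MB request while never holding more than 8KB of file data in memory.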

Is there a way to add header to apache response on how long it took to retrieve a resource?

Is there a module or a built-in function in Apache which I can use/activate to send information about how long it took to retrieve/process a resource?
For example, the resource http://dom.net/resource is accessed. The response headers would include the total time spent waiting for the resource to be ready before it was sent back to the client.
Apache doesn't really 'wait' until the resource is ready before sending the response back to you - it streams data back to the client as and when it receives it.
Depending on what you're interested in measuring, you could record the time taken for the client to receive the first or last byte back from Apache, or measure the time taken for Apache to receive the first byte from the (remote?) resource. The time taken for Apache to receive the entire response from the remote resource is not something you can send in the headers, as the headers will already have been sent to the client before the remote response is fully received. This information can trivially be written to the Apache logs, however.
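If the log route is enough, mod_log_config's %D token records the time taken to serve the request, in microseconds; a sketch (the format name is arbitrary):

# %D = microseconds from request received to last byte of response sent
LogFormat "%h %l %u %t \"%r\" %>s %b %D" with_timing
CustomLog logs/access_log with_timing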

mp3 snippets on s3

I need a solution to play a segment of an mp3. I have a few thousand audio files which are currently stored on Amazon S3, and I would like to allow users to play them; however, I would like to limit the play length to 30 seconds or so in the middle of the recording.
I'm not sure if I need to create an entirely new file (a snippet), as I would for a thumbnail if it were an image, or if it's possible to use some player/stream to safely limit it that way, so that users cannot access the whole song.
I'm coming from a Rails environment and using Paperclip to handle the files and JPlayer to play them, if it matters.
Any pointers or best practices?
This is possible by using the HTTP Range header. This request header says 'please just give me the bytes from here to here and ignore the rest'. If the web server is set up to handle it (Apache is, for instance), then you get a 206 response whose body contains just those bytes.
You must create a small proxy application that effectively acts as a gateway between the listener and Amazon.
To see if your host will respond, try this from the command line:
curl -v -I http://www.mfiles.co.uk/mp3-downloads/01-Tartaros%20of%20light.mp3
Where the url is one of yours. If you are lucky you will see:
Accept-Ranges: bytes
Content-Length: 5284483
This means that the server accepts Range requests, and that the full file is 5284483 bytes long.
Let's request the first third of the file:
curl -H'Range: bytes=0-1761494' http://www.mfiles.co.uk/mp3-downloads/01-Tartaros%20of%20light.mp3 > /tmp/test1.mp3
You should now be able to play /tmp/test1.mp3 and hear the first third of the track.
The next step is to create a proxy application. A good approach would be to use https://github.com/aniero/rack-streaming-proxy but you would probably need to fork the project to send the 'Range: bytes=0-1761494' header. Alternatively have a look at Sinatra.
A bonus here is that because you are proxying the remote server, you could obfuscate the actual URL of the file by having a simple database table with an ID for each file. I would suggest writing a small script that also stores the byte length of each file, so that you don't have to calculate the range for each request.
Thus a GET to "/preview/12345" would proxy "http://amazon.com/my_long_url" and give you just the first third of the file.
On top of that, you could put Varnish in front of your own server, which would cache these partial MP3 files and mean that you are not having to constantly go back to Amazon to get the files.
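For a rough idea of the shape of such a proxy, here is a sketch in Python/Flask rather than the Rack/Sinatra options suggested above; the track table, ID, URL and byte length are all made up:

from flask import Flask, Response, abort
import requests

app = Flask(__name__)

# Hypothetical lookup table: public ID -> (real S3 URL, total byte length).
# This also hides the actual file URLs, as suggested above.
TRACKS = {
    "12345": ("https://s3.amazonaws.com/my-bucket/my_long_url.mp3", 5284483),
}

@app.route("/preview/<track_id>")
def preview(track_id):
    if track_id not in TRACKS:
        abort(404)
    url, length = TRACKS[track_id]
    # Ask the upstream server for just the first third of the file.
    upstream = requests.get(
        url, headers={"Range": "bytes=0-%d" % (length // 3)}, stream=True)
    if upstream.status_code != 206:
        abort(502)  # upstream ignored the Range header
    return Response(upstream.iter_content(8192), mimetype="audio/mpeg")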
Unfortunately, you'll need to make new snippets - there isn't really a way to tell a user's browser "download this entire mp3 file, but only play and allow access to the middle 30 seconds".
I think it is simpler to solve the problem on the client side.
Are you using Flash to play the audio files?
If so, I have done something similar (but with videos) using JWPlayer (it also supports audio files).
You can develop a custom plugin to control the snippet you want to play, and then stop the audio file and show a message or something like that.
This solution, combined with signed URLs and/or RTMP streaming with CloudFront, can be very safe.
Due to a limitation of the mp3 format, you cannot seek to an arbitrary frame in the middle of the song and start transmission from that point.
So, there are basically three options:
1. Create new files offline. Very easy, but space-consuming.
2. Transcode files on the fly. CPU-consuming, and it degrades quality.
3. Limit playback to the first X seconds: just peek into the song's header, get its bitrate, and calculate the size of the byte chunk to serve (a worked example follows below).
And don't ever transmit more than you need: people will manage to intercept the stream and save it to disk (the business argument), and you'll save your users' traffic (good karma).
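For option 3, the arithmetic for a constant-bitrate file is simple (a variable-bitrate file would need an actual frame scan), e.g. in Python:

bitrate = 128000                      # bits per second, read from the mp3 header
seconds = 30                          # desired preview length
chunk_bytes = bitrate // 8 * seconds  # 480000 bytes to serve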

Apache2 and CGI - how to keep Apache from buffering the POST data?

I'm trying to provide live parsing of a file upload in CGI and show the data on screen as it's being uploaded.
However, Apache2 seems to want to wait for the full POST to complete before sending the CGI application anything at all.
How can I force Apache2 to stop buffering the POST to my CGI application?
EDIT
It appears that it's actually the output of the CGI that's being buffered. I started streaming the data to a temp file to watch its progress. That, and I have another problem.
1) The output is being buffered. I've tried SetEnvIf (and simply SetEnv) for "!nogzip", "nogzip", and "!gzip" without success (within the CGI Directory definition).
2) Apache2 appears to not be reading the CGI's output until the CGI process exits? I notice that my CGI app (flushing or not) is hanging up permanently on a "fwrite(..., stdout)" line at around 80K.
EDIT
Okay, Firefox is messing with me. If I send a 150K file, then there's no CGI lockup around 80K. If the file is 2G, then there's a lockup. So, Firefox is not reading the output from the server while it's trying to send the file... is there any header or alternate content type to change that behavior?
EDIT
Okay, I suppose the CGI output lockup on big files isn't important actually. I don't need to echo the file! I'm debugging a problem caused by debugging aids. :)
I guess this works well enough then. Thanks!
FINAL NOTE
Just as a note... the reason I thought Apache2 was buffering input was that I always got a "Content-Length" environment variable. I guess Firefox is smart enough to precalculate the content length of a multipart form upload, and Apache2 was passing that on. I thought Apache2 was buffering the input and reporting the length itself.
Are you sure it's the input being buffered that's the problem? Output buffering problems are much more common, and might not be distinguishable from input buffering if your method of debugging is something like just printing to the response.
(Output buffering is commonly caused either by unflushed stdout in the script or by filters. The usual culprit is the DEFLATE filter, which is often used to compress all text/ responses, whether they come from a static file or a script. In general it's a good idea to compress the output of scripts, but a side-effect is that it will cause the response to be fully buffered. If you need immediate response, you'll need to turn it off for that one script or for all scripts, by limiting the application of AddOutputFilterByType to particular <Directory> sections, or by using mod_setenvif to set the no-gzip note.)
Similarly, an input filter (including, again, DEFLATE) might cause CGI input to be buffered, if you're using any. But input filters are less widely used.
Edit: for now, just comment out any httpd conf you have enabling the deflate filter. You can put it back selectively once you're happy that your IO is unbuffered without it.
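If you'd rather not comment DEFLATE out globally, a sketch of the selective approach (the script name is a placeholder):

# Tell mod_deflate to skip this one CGI, so its output
# isn't held back by the compression filter.
<Files "upload.cgi">
    SetEnv no-gzip 1
</Files>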
I notice that my CGI app (flushing or not) is hanging up permanently on a "fwrite(..., stdout)" line at around 80K.
Yeah... if you haven't read all your input, you can deadlock when trying to write output, if you write too much. You block on an output call, waiting for the network buffers to unclog so you can send the new data you've got, but they never will, because the browser is trying to send all of its data before it starts to read the output.
What are you working on here? In general it doesn't make sense to write progress-info output in response to a direct form POST, because browsers typically won't display it. If you want to provide upload-progress feedback on a plain HTML form submission, this is usually done with hacks like having an AJAX connection check back to see how the upload is going (meaning progress information has to be shared, e.g. in a database), or using a Flash upload component.
From an (old version) of the Apache HTTP Server manuals:
Every time your script does a "flush" to output data, that data gets relayed on to the client. Some scripting languages, for example Perl, have their own buffering for output - this can be disabled by setting the $| special variable to 1. Of course this does increase the overall number of packets being transmitted, which can result in a sense of slowness for the end user.
Have you tried flushing STDOUT or checking if the language you are using has buffering you can disable?
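For example, a minimal Python CGI that flushes after every write (assuming no output filter downstream is re-buffering it):

#!/usr/bin/env python3
import sys
import time

# Send the headers first, and flush so they go out immediately.
sys.stdout.write("Content-Type: text/plain\r\n\r\n")
sys.stdout.flush()

for i in range(5):
    sys.stdout.write("chunk %d\n" % i)
    sys.stdout.flush()  # the moral equivalent of Perl's $| = 1
    time.sleep(1)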
Here's a useful guide for controlling buffering when using Perl on the server side: http://perl.plover.com/FAQs/Buffering.html
Many of the ideas and concepts apply to other languages too, such as using buffered and unbuffered output, and raw system calls to read and write data vs I/O libraries which do their own buffering.

Sub second request time logging in Apache

While I can get microsecond resolution on the time taken to process a request (%D), I would also like to relate a request to the times of the other requests generated by the same page, to help reconstruct the sequence of requests. However, as far as I can tell, the %t specifier only provides accuracy to the nearest second, which makes it impossible to reconstruct the original sequence of events.
Is there another way to get this information in my access_log files?
TIA
This is now possible with Apache 2.4. Use, for example, the following log format instead of %t:
[%{%d/%b/%Y:%H:%M:%S}t.%{msec_frac}t %{%z}t]
This will give times like [10/Apr/2012:10:47:22.027 +0000]
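Dropped into a complete LogFormat line, that might look like the following (the format name is arbitrary):

LogFormat "%h %l %u [%{%d/%b/%Y:%H:%M:%S}t.%{msec_frac}t %{%z}t] \"%r\" %>s %b" combined_msec
CustomLog logs/access_log combined_msec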
Unfortunately, no. This got covered a while back (How to timestamp request logs with millisecond accuracy in Apache 2.0), and it's still true for the most recent stable (2.2.x) Apache branch.
I know of at least one workaround, though, if you're interested: You can pipe Apache's logs to an external process (see the docs page at http://httpd.apache.org/docs/current/mod/mod_log_config.html, under the "CustomLog" directive) which would add timestamps and actually write to the log file.
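The piped-log mechanism looks like this, where add-timestamp is a hypothetical script that prefixes each incoming line with the current time before appending it to the file given as its argument:

CustomLog "|/usr/local/bin/add-timestamp /var/log/apache2/access.log" common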
Note that this method does NOT capture the true request receive time. Apache doesn't output an access log entry until AFTER it has completed sending its response. Plus, there's an additional variable delay while Apache writes into the pipe and your timestamper reads from it (possibly including some buffering there). If you have turned on Apache's "BufferedLogs" directive, there will be even more variable buffering delay. When the system is under load, and perhaps in other edge cases, the average delay could easily grow to a second or more.
If the delays aren't too bad (i.e., "BufferedLogs off", low system load), you can probably get a pretty tight estimate by subtracting the "%D" value from your external timestamp.
Some people (including me) pipe Apache's access logs to the local Syslog daemon (via the 'logger' command or whatever). The syslog daemon takes care of timestamping, among other things.