Serving dynamic zip files through Apache - apache

One of the responsibilities of my Rails application is to create and serve signed XML files. Once created, a signed XML never changes, so I store every XML in the public folder and redirect the client there to avoid unnecessary processing in the controller.
Now I want a new feature: every XML is associated with a date, and I'd like to serve a compressed file containing every XML whose date falls within a period specified by the client. For the feature to be useful, the period cannot be limited to less than one month, which means some of the zip files served will be as big as 50 MB.
My application is deployed under Apache via Passenger. Serving the file with send_data is therefore unacceptable, since the client would have to wait for the entire compressed file to be generated before the download even begins. Although I have an idea of how to implement the feature in Rails so the compressed file is produced while being served, I'm afraid the server will run short on resources once several long-running Ruby/Passenger processes are tied up serving big zip files.
I've read about a better solution to serve static files through Apache, but not dynamic ones.
So, what's the solution to the problem? Do I need something like a custom Apache handler? How do I inform Apache, from my application, how to handle the request, compressing the files and streaming the result simultaneously?

Check out my mod_zip module for Nginx:
http://wiki.nginx.org/NgxZip
You can have a backend script tell Nginx which URL locations to include in the archive, and Nginx will dynamically stream a ZIP file to the client containing those files. The module leverages Nginx's single-threaded proxy code and is extremely lightweight.
The module was first released in 2008 and is fairly mature at this point. From your description I think it will suit your needs.
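For reference, the backend response mod_zip expects looks roughly like this: an X-Archive-Files header plus one line per file giving the CRC-32 (or "-" if unknown), the size in bytes, the internal location, and the name inside the archive. The paths and sizes below are invented; Nginx fetches each listed location itself and streams the resulting zip to the client.

X-Archive-Files: zip

- 14520 /signed/2011-01/doc-0001.xml 2011-01/doc-0001.xml
- 18342 /signed/2011-01/doc-0002.xml 2011-01/doc-0002.xml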

You simply need to use whatever API you have available to create a zip file and write it to the response, flushing the output periodically. If this will serve large zip files or be requested frequently, consider running it in a separate process at low priority (a high nice/ionice value).
Worst case, you could run a command-line zip in a low priority process and pass the output along periodically.
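A rough Rails sketch of that worst case (the directory layout, parameter name, and chunk size are all assumptions, and as another answer below notes, Passenger may buffer the output anyway):

class ArchivesController < ApplicationController
  def show
    dir = Rails.root.join("public", "xml", params[:period]).to_s  # assumed layout

    response.headers["Content-Type"] = "application/zip"
    response.headers["Content-Disposition"] = %(attachment; filename="#{params[:period]}.zip")

    # "zip -r - DIR" writes the archive to stdout; nice keeps the priority low.
    # Each chunk is handed to the client as soon as the zip process produces it.
    self.response_body = Enumerator.new do |out|
      IO.popen(["nice", "-n", "19", "zip", "-r", "-", dir]) do |io|
        while (chunk = io.read(16 * 1024))
          out << chunk
        end
      end
    end
  end
end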

It's tricky to do, but I've made a gem called zipline ( http://github.com/fringd/zipline ) that gets things working for me. I want to update it so that it can support plain file handles or paths; right now it assumes you're using CarrierWave...
Also, you probably can't stream the response with Passenger... I had to use Unicorn to make streaming work properly, and certain Rack middleware can even break it (calling response.to_s breaks streaming).
If anybody still needs this, ping me on the GitHub page.
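For reference, usage at the time looked roughly like this (adapted from the gem's README; the model and its CarrierWave attachment are invented, and the API may have changed in later versions):

class AvatarsController < ApplicationController
  include Zipline

  def index
    users = User.all
    # pairs of [uploaded file, name to give it inside the zip]
    files = users.map { |user| [user.avatar, "#{user.username}.png"] }
    zipline(files, "avatars.zip")
  end
end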

Related

Which file is consuming most of the bandwidth?

My website is consuming much more bandwidth than it is supposed to. From Webalizer or AWStats in WHM/cPanel I can monitor the bandwidth usage and see which types of files (jpg, png, php, css, etc.) are consuming it, but I can't get any specific file names. My assumption is that the bandwidth is being eaten by referral spamming, but the "Visitors" page of cPanel only shows the last 1000 hits. Is there any way to see which image or CSS file is consuming the bandwidth?
If there is a particular file which you think is consuming the most bandwidth, you can use the apachetop tool.
yum install apachetop
then run
apachetop -f /var/log/apache2/domlogs/website_name-ssl.log
replacing website_name with the domain you are interested in.
It basically reads the entries from the domlogs (the per-domain logs that record every request served for each website).
This shows, in real time, which files are being requested most often, which should give you an idea of whether a particular image or PHP file is attracting the bulk of the requests.
Domlogs also let you see which files are being requested by which bots and clients, so they are a good starting point for your investigation.

mod_rewrite to serve static cached files if they exist and if the source file hasn't changed

I am working on a project that processes images, saves the processed images in a cache, and outputs the processed image to the client. Let's say that the project is located in /project/, the cache is located in /project/cache/, and the source images are located elsewhere on the server (like in /images/ or /otherproject/images/). I can set up the cache to mirror the path to the source image (e.g. if the source image is /images/image.jpg, the cache for that image could be /project/cache/images/image.jpg), and the requests to the project are roughly /project/path/to/image (e.g. /project/images/image.jpg).
I would like to serve the images from the cache, if they exist, as efficiently as possible. However, I also want to be able to check to see if the source image has changed since the cached image was created. Ideally, this would all be done with mod_rewrite so PHP wouldn't need to be used to do any of the work.
Is this possible? What would the mod_rewrite rules need to be for this to work?
Alternatively, it seems like it would be a fine compromise to have mod_rewrite serve the cached file most of the time but send 1 out of X requests to the PHP script for files that are cached. Is this possible?
You cannot access the file modification timestamp from a RewriteRule, so there is no way around using PHP or another programming language for that task.
On the other hand, this is really simple in PHP, so you should first check whether the PHP solution is good enough in your case. Only if it isn't should you look for alternatives.
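For the half that mod_rewrite can do on its own, serving the cached copy when it exists and falling back to the script otherwise, a rough .htaccess sketch (the directory layout and script name are assumptions) could look like the following; the source-vs-cache timestamp comparison still has to happen inside the PHP script:

RewriteEngine On
# Leave requests that already point into the cache alone
RewriteRule ^project/cache/ - [L]
# If a cached copy exists, serve it directly
RewriteCond %{DOCUMENT_ROOT}/project/cache/$1 -f
RewriteRule ^project/(.+)$ /project/cache/$1 [L]
# Otherwise hand the request to the processing script
RewriteRule ^project/(.+)$ /project/process.php?path=$1 [L,QSA]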
What if you used the client to do some of the work? Say you display an image in the web browser and always use src="/cache/images/foobar.jpg", adding an onerror="this.src='/images/foobar.jpg'". In mod_rewrite, send anything that goes to the /images/ dir to a script that generates the image, stores it in the cache, and returns it.

HTTP compression - How to send precompressed files that exist in an EAR file?

Is it possible to send pre-compressed files that are contained within an EAR file? More specifically, the JSP and JS files within the WAR file. I am using Apache HTTP as the web server, and although it is simple to turn on the deflate module and set it up to use a pre-compressed version of the files, I would like to apply this to files that are contained within an EAR file that is deployed to JBoss. The reason is that the content is quite static, and compressing it on the fly each time is quite costly in terms of CPU time.
Quite frankly, I am not entirely familiar with how JBoss deploys these EAR files and 'serves' them. The gist of what I want to do is pre-compress the files contained inside the WAR so that, when they are requested, they are sent back to the client with gzip as the Content-Encoding.
In theory, you could compress them before packaging them in the EAR, and then serve them up with a custom controller which adds the HTTP header to the response telling the client they're compressed, but that seems like a lot of effort to go to.
When you say that on-the-fly compression is quite costly, have you actually measured it? Have you tried requesting a large number of uncompressed pages, measured the CPU usage, and then tried it again with compressed pages? I think you may be over-estimating the impact: the deflate module uses fairly low-intensity stream compression, designed to use little CPU.
You need to be very sure that you have a real performance problem before going to such lengths to mitigate it.
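One quick way to measure it (the host and page below are placeholders): hit the same page with and without an Accept-Encoding header using ApacheBench and watch the server's CPU usage during each run:

ab -n 1000 -c 10 http://yourserver/app/somepage.jsp
ab -n 1000 -c 10 -H "Accept-Encoding: gzip, deflate" http://yourserver/app/somepage.jsp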
I don't frequent this site often and I seem to have left this thread hanging; sorry about that. I did succeed in getting compression for my JavaScript and CSS files. What I did was precompress them with gzip as part of the Ant build process and then spoof the name to get rid of the gzip extension: foo.js was compressed into foo.js.gzip, which I renamed back to foo.js, and that is the file that gets packaged into the WAR. That handles the precompression part. To get the file served up properly, we just have to tell the browser it is compressed, via the Content-Encoding header of the HTTP response. I did this with an output filter applied to files matching the *.js extension (configured on the Java/JBoss side in WEB-INF/web.xml, if that helps; I'm not too familiar with the details, sorry).
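The web.xml side of that is just an ordinary filter mapping; the filter class (its name here is invented) only has to set the Content-Encoding: gzip header before passing the request down the chain:

<filter>
    <filter-name>gzipHeaderFilter</filter-name>
    <filter-class>com.example.GzipHeaderFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>gzipHeaderFilter</filter-name>
    <url-pattern>*.js</url-pattern>
</filter-mapping>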

Are there alternatives to CGI (and do I really need one)?

I am designing an application that is going to consist of 3-4 services that run as separate processes and are linked by a suitable IPC. The system is going to have a web interface and I want to use whatever webserver is there.
The web interface should be accessed under some URL that allows other URLs on the same webserver to do totally different things. I'm planning to use the path below that URL to specify what the web interface should do. It has facilities for use by other applications over the net and for humans to interact with in a browser.
Off the cuff, I'd work as follows:
make the webserver fire up a CGI process for every request it receives (like SetHandler in Apache)
let the CGI connect to the IPC
let it get whatever it needs from the backend services
let the CGI return HTML / XML and whatever HTTP Status based on the services' answers
Now, what I really want is to avoid the first two steps, or if I can't, at least the second one, because I'm afraid I'm wasting performance on unnecessary overhead (the requests coming from other applications might be frequent).
PHP, for example, can open persistent connections to a MySQL database that survive the script's runtime and don't need to be recreated the next time, though I don't know how it actually does that. Also, as I understand it, Apache modules are loaded once when the server starts, so that might remove the first step but would tie me to Apache.
So, what are good ways to hook a handler for specific URLs into different webservers? I don't want to handle the HTTP myself, otherwise I might just use a proxy setup to a second server, but that seems like reinventing the wheel. If you think CGI is fine and have examples where it handles large numbers of requests with a similar structure, please let me know.
OK, I overlooked this previously. Explaining my question here brought me to it:
Instead of creating a new process for every request, FastCGI can use a single persistent process which handles many requests over its lifetime. -- Wikipedia: FastCGI
Even under moderate loads, CGI is a pretty unscalable beast. FastCGI is an option, but you'll probably also find a mod_XXXX package where XXXX is the name of your language. There's a mod for Ruby, Perl, and Python, for instance, and probably a fair few others.
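As a concrete example of the FastCGI route on newer Apache versions (2.4 with mod_proxy_fcgi; the socket path and URL prefix here are assumptions), a single long-lived backend process can be wired to a URL prefix like this:

# Everything under /app/ is handled by one persistent FastCGI backend
<LocationMatch "^/app/">
    SetHandler "proxy:unix:/var/run/myapp.sock|fcgi://localhost/"
</LocationMatch>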

How do I get a status report of all files currently being uploaded via a HTTP form on an Apache Server?

How do I get a status report of all files currently being uploaded via HTTP form based file upload on an Apache Server?
I don't believe you can do this with Apache itself; the upload looks like nothing more than a POST as far as Apache is concerned. There are modules and other servers that do special processing for uploads, so you may have some luck there. It would probably be easier to keep track of it in your application.
Check out SWFUpload; it uses Flash (in a nice way) to assist with managing multiple uploads.
There are events you can monitor to see how many files of a set have been uploaded.