I am uploading very large (gigabytes) files to an application server, with Apache as a proxy.
I want to stream process these to (1) use less memory and (2) prevent timeout from an AWS load balancer as I process the upload.
It seems that Apache does a substantial amount of buffering when I upload. If this is too much, I can't accomplish these objectives.
What determines the amount of upload buffering, and how can I configure it?
NOTE: To be clear, I am asking about Apache, not PHP.
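On the application side, streaming only helps if the body is read incrementally rather than in one call. A minimal sketch, assuming the backend is a Python WSGI application (the question does not say what it is); `iter_body`, `application` and the 64 KiB chunk size are illustrative names and values. Nothing here changes how much Apache itself buffers.

```python
import hashlib
import io

CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune for your workload


def iter_body(environ, chunk_size=CHUNK_SIZE):
    """Yield the request body in bounded chunks instead of reading it all at once."""
    stream = environ["wsgi.input"]
    remaining = int(environ.get("CONTENT_LENGTH") or 0)
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # client stopped sending before Content-Length was reached
            break
        remaining -= len(chunk)
        yield chunk


def application(environ, start_response):
    """Hypothetical WSGI app: hash the upload as it arrives, holding one chunk at a time."""
    digest = hashlib.sha256()
    total = 0
    for chunk in iter_body(environ):
        digest.update(chunk)
        total += len(chunk)
    body = f"{total} {digest.hexdigest()}\n".encode()
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Memory use stays at roughly one chunk regardless of upload size, but the handler only starts seeing data as fast as the proxy in front of it forwards it.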
Assuming the following setup:
Apache server 2.4
mpm_prefork with default settings (256 workers?)
Default Timeout (300s)
High KeepAliveTimeout (100s)
mod_reqtimeout enabled with the following config: RequestReadTimeout header=62,MinRate=500 body=62,MinRate=500
Outdated mod_wsgi 3.5 using Daemon mode with 15 threads and 1 process
AWS ElasticBeanstalk's load balancer acting as a reverse proxy to apache with 60s idle connection timeout
Python/Django being the wsgi application
A simple slowloris attack like the one described here, using a "slow" request body: https://www.blackmoreops.com/2015/06/07/attack-website-using-slowhttptest-in-kali-linux/
The above attack, with just 15 requests (same as mod_wsgi threads) can easily lock the server until a timeout happens, either due to:
The load balancer timeout (60s) fires because no data is sent; this kills the Apache connection, and mod_wsgi can once again serve requests
The Apache RequestReadTimeout fires because data is being sent, but not enough; again, mod_wsgi is able to serve requests after this
However, with just 15 concurrent "slow" requests, I was able to lock the server up to 60 seconds.
Repeating the same with a much larger number, like 4096 requests, pretty much locks the server permanently, since there will always be a new request waiting to be served by mod_wsgi once the previous one times out.
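The numbers above can be turned into a rough back-of-the-envelope model. This is only a sketch under a simplifying assumption: every slow request holds one mod_wsgi thread for the full timeout and queued slow requests are served in batches. The function names are made up for illustration.

```python
import math


def min_attack_rate(threads, hold_seconds):
    """Slow requests per second needed to keep every worker thread occupied."""
    return threads / hold_seconds


def worst_case_lockout(total_requests, threads, hold_seconds):
    """Seconds the pool stays busy if slow requests are served in full batches.

    Crude model: assumes each request holds a thread for exactly hold_seconds
    and that queued requests are not dropped while waiting.
    """
    batches = math.ceil(total_requests / threads)
    return batches * hold_seconds


if __name__ == "__main__":
    print(min_attack_rate(15, 60))           # threads and timeout from the setup above
    print(worst_case_lockout(4096, 15, 60))  # the 4096-request case
```

Under this model an attacker needs only a fraction of a request per second to keep 15 threads busy, which is why the thread count and the timeouts matter more than the raw request volume.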
I would expect the load balancer to handle/detect this before even sending requests to Apache, which it already does for similar attacks (partial headers or TCP SYN flood attacks never hit Apache, which is nice).
What options are available to help against this? I know there's no failproof option, since these kinds of attacks are difficult to detect and protect against, but it's quite silly that the server can be locked that easily.
Also, if the WSGI application never reads the request body, I would expect the issue not to happen, since the request should return immediately, but I'm not sure about this or about the internals of mod_wsgi. For example, this is true when using a local dev WSGI server (the attack fails since the request body is never read), but the attack succeeds when using mod_wsgi, which leads me to think it tries to read the body even before handing the request to the WSGI code.
Slowloris is a very simple Denial-of-Service attack. This is easy to detect and block.
Detecting and preventing DoS and DDoS attacks are complex topics with many solutions. In your case you are making the situation worse by using outdated software and picking a low worker thread count, so that the problem arises quickly.
A combination of services is available to manage DoS and DDoS attacks.
The front end of the total system would be protected by a firewall. Typically this firewall would include a Web Application Firewall that understands the nuances of the HTTP protocol. In the AWS world, AWS WAF and AWS Shield are commonly used.
Another service that helps is a CDN. Amazon CloudFront uses AWS Shield, so it has good DDoS support.
The next step is to combine load balancers with auto scaling mechanisms. When the health checks start to fail (caused by Slowloris), the auto scaler will begin launching new instances and terminating failed instances. However, a sustained Slowloris attack will just hit the new servers. This is why the Web Application Firewall needs to detect the attack and start blocking it.
For your studies, take a look at mod_reqtimeout. This is an effective and tuneable solution for Apache for most Slowloris attacks.
[Update]
In the Amazon DDoS White Paper June 2015, Slowloris is specifically mentioned.
On AWS, you can use Amazon CloudFront and AWS WAF to defend your
application against these attacks. Amazon CloudFront allows you to
cache static content and serve it from AWS Edge Locations that can
help reduce the load on your origin. Additionally, Amazon CloudFront
can automatically close connections from slow-reading or slow-writing
attackers (e.g., Slowloris).
Amazon DDoS White Paper June 2015
In mod_wsgi daemon mode there are a number of options that help combat such attacks, both by recovering from them and by discarding queued requests that have been waiting too long. Try your tests using mod_wsgi-express, as it defines defaults for many of these options, whereas there are no defaults when you configure mod_wsgi directly. Use mod_wsgi-express start-server --help to see what the defaults are. The options to look at for mod_wsgi daemon mode are request-timeout, connect-timeout, socket-timeout and queue-timeout. There are also other options related to buffer sizes and listener backlog that you can experiment with. Note that the listen backlog of the main Apache worker processes can ultimately still be an issue, because it usually defaults to around 500 (Apache's ListenBacklog default is 511). That means a lot of requests can queue up there before mod_wsgi can even tag them with a time, and tracking queue time is what allows the backlog to be discarded.
You can find the documentation at:
http://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html
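The idea behind discarding requests by queue time can be illustrated with a toy model. This is not mod_wsgi's implementation, only a sketch of the concept: each request is tagged with its arrival time, and anything that has waited longer than a limit is dropped instead of being handed to a worker. `TimedQueue` and its parameters are hypothetical names.

```python
import collections
import time


class TimedQueue:
    """Toy request queue that drops entries that waited longer than queue_timeout."""

    def __init__(self, queue_timeout, clock=time.monotonic):
        self.queue_timeout = queue_timeout
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._items = collections.deque()

    def put(self, request):
        # Tag each request with the time it entered the queue.
        self._items.append((self.clock(), request))

    def get(self):
        """Return the next request that is still fresh, or None if there is none."""
        while self._items:
            queued_at, request = self._items.popleft()
            if self.clock() - queued_at <= self.queue_timeout:
                return request
            # Stale entry: the client has probably given up, so skip the work.
        return None
```

Once an attack stops, a queue like this drains quickly because the stale backlog is skipped rather than processed one timeout at a time.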
On the question of whether mod_wsgi reads the request body before passing the request on: no, it doesn't. Because Apache reads in blocks, it may read part of the request body while reading the headers, but it shouldn't block on it. Once the full request headers have been passed to mod_wsgi and sent through to the daemon process, mod_wsgi starts transferring the request body.
Solution:
If you are getting hit, I recommend you go to a provider that protects against DDoS attacks. However, your best bet would be to programmatically block an IP once you have decided that it is being malicious. If you receive two large Content-Length POST requests from the same IP, then you should block that IP for a few minutes for suspicious activity. Many large providers are very cheap, and some of them, such as Cloudflare, are free for the basic package. I use them for my company and I am beyond happy to have them!
Edit: Their job is literally just to protect you. That is it.
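The "block the IP after two large POSTs" rule could be sketched roughly as follows. All thresholds (100 MiB, a 60-second window, a 5-minute block) and the class name are assumptions chosen for illustration, and behind a load balancer the client address would have to come from a forwarded header rather than the socket.

```python
import time


class LargePostBlocker:
    """Temporarily block IPs that send several large POST requests in a short window."""

    def __init__(self, size_threshold=100 * 1024 * 1024, max_large_posts=2,
                 window_seconds=60, block_seconds=300, clock=time.monotonic):
        self.size_threshold = size_threshold
        self.max_large_posts = max_large_posts
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self.clock = clock
        self._seen = {}     # ip -> timestamps of recent large POSTs
        self._blocked = {}  # ip -> time at which the block expires

    def is_blocked(self, ip):
        expires = self._blocked.get(ip)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._blocked[ip]  # block has run out
            return False
        return True

    def record_post(self, ip, content_length):
        """Record a POST; return True if the request should be rejected."""
        if self.is_blocked(ip):
            return True
        if content_length < self.size_threshold:
            return False
        now = self.clock()
        recent = [t for t in self._seen.get(ip, []) if now - t < self.window_seconds]
        recent.append(now)
        self._seen[ip] = recent
        if len(recent) >= self.max_large_posts:
            self._blocked[ip] = now + self.block_seconds
            del self._seen[ip]
            return True
        return False
```

A rule this blunt will also catch legitimate clients that upload large files, so the thresholds need to reflect what normal traffic looks like for the site.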
Pros & cons of Apache vs. nginx, and how they work internally in order to maximize resource utilization
Can I use Apache and Nginx together? If I use only Nginx, what problems might I face?
Apache has some disadvantages, especially when it is used with the PHP module.
Apache's process model (with the prefork MPM, which mod_php normally requires) is such that each connection uses a separate process. Each process carries all the overhead of PHP and any other modules you may have loaded with it. An Apache process might run a PHP script or serve static content for one request. If the PHP has a memory leak (which does happen sometimes), the process continues to grow in size. Also, when KeepAlive is enabled, which is usually recommended, that process stays alive for a few seconds after the connection, consuming a "slot" that another client might be able to use and helping the server to reach its MaxClients sooner.
Nginx is an alternative webserver that normally uses the Linux "epoll" API to process requests in a non-blocking mode. This means that one single process can handle many simultaneous connections. Epoll is an efficient way to tell the single process which connection(s) it needs to deal with and which can wait. Nginx has a goal of solving the "C10k" problem - how to have 10,000 concurrent connections.
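The event-driven model described above can be illustrated in a few lines. This is only a sketch of the general technique, not of nginx itself: Python's `selectors.DefaultSelector` picks epoll on Linux, and socket pairs stand in for client connections. `serve_once` and `demo` are illustrative names.

```python
import selectors
import socket


def serve_once(sel):
    """Handle every connection that is ready right now; idle ones cost nothing."""
    handled = 0
    for key, _events in sel.select(timeout=1.0):
        conn = key.fileobj
        data = conn.recv(4096)
        if data:
            conn.sendall(data.upper())  # stand-in for real request handling
        else:
            sel.unregister(conn)        # peer closed the connection
            conn.close()
        handled += 1
    return handled


def demo():
    """One process watches three connections; only the active one is serviced."""
    sel = selectors.DefaultSelector()  # epoll on Linux, other mechanisms elsewhere
    pairs = [socket.socketpair() for _ in range(3)]
    for server_side, _client in pairs:
        server_side.setblocking(False)
        sel.register(server_side, selectors.EVENT_READ)
    pairs[1][1].sendall(b"hello")      # only one "client" sends anything
    handled = serve_once(sel)
    reply = pairs[1][1].recv(4096)
    sel.close()
    for server_side, client in pairs:
        server_side.close()
        client.close()
    return handled, reply
```

The two idle connections take up a file descriptor each but no process or thread, which is the property that makes this model cheap for slow or keep-alive clients.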
This naturally goes hand in hand with php-fpm, the FastCGI Process Manager. Nginx itself does not have PHP built-in. When it receives a request for a PHP script, it makes a call out to php-fpm to run the script, which then returns the result to nginx, which returns it to the client.
This all uses a lot less memory than a similar Apache+mod_php configuration.
There are a couple more huge advantages of php-fpm over mod_php:
It uses different "pools", each of which can run as a separate Linux user. This provides a simple and effective way of isolating websites (for example, if they are run by different customers who should not read each other's code) without the overhead or nastiness of suexec or suphp.
It has a slow log feature where it can dump a PHP stack trace of any script that has been running for greater than X seconds. This can help diagnose slow code issues.
Php-fpm can be run with Apache, and in fact this allows you to take advantage of Apache's more efficient Worker MPM (or Event in Apache 2.4). However, my experience is that configuring it in Apache is significantly more complex than configuring it in nginx, and even with Worker, it is still not quite as efficient as nginx.
Disadvantages of moving to nginx - not many, but things to keep in mind:
It does not support .htaccess files. I think this is a good thing personally as .htaccess files must be parsed by Apache for every request, which can cause significant overhead.
Configuration files need to be re-written. If you have many complex site configurations, this could take some doing. For simple cases it is not usually a big deal.
Features of Nginx
Nginx is fast because it does not need to create a new process for
each new request.
HTTP proxy and Web server features
Ability to handle more than 10,000 simultaneous connections with a
low memory footprint (~2.5 MB per 10k inactive HTTP keep-alive
connections)
Handling of static files, index files, and auto-indexing
Reverse proxy with caching
Load balancing with in-band health checks
Fault tolerance
Nginx uses very little memory, especially for static Web pages.
FastCGI, SCGI, uWSGI support with caching
Name- and IP address-based virtual servers
IPv6-compatible
SPDY protocol support
FLV and MP4 streaming
Web page access authentication
gzip compression and decompression
URL rewriting having its own rewrite engine
Custom logging with on-the-fly gzip compression
Response rate and concurrent requests limiting
Bandwidth throttling
Server Side Includes
IP address-based geolocation
User tracking
WebDAV
XSLT data processing
Embedded Perl scripting
Nginx is highly scalable, and performance is not dependent on
hardware.
With only Nginx, you lose a whole bunch of Apache-specific features, such as all the mod_dav stuff. You lose a lot of modules, effectively.
Conclusion
The best use for nginx is in front of Apache if you need Apache modules. Use it as a load balancer, if you like, between multiple Apache instances, and you suddenly have a mixed set-up that is rather
I'm using Apache with ldirector, and I'm facing issues under load: when the Google and Bing crawlers hit my site, Apache chokes and my server's CPU usage goes to 100% utilization. After this I have to stop Apache and monitor the load manually. I want to automate this whole scenario. What I want is this: whenever load hits Apache, the server should be normalized according to given settings, and if CPU usage goes high, it should not exceed a given CPU usage limit.
I want to control all this via a shell script. Please give suggestions.
I heard that WebSockets (e.g. socket.io) are very fast, but that they require a direct connection for each client. Are they suitable for uploading files to a video hosting service with many clients and frequent uploads? Or will that fail, so that only Ajax can be used in that case?
I'd say it depends on the file sizes and how long connections to clients last.
If you chunk uploads using the HTML5 File API and then use WebSockets to upload the data, this can reduce the amount of data transferred, because the chunks don't need HTTP headers sent with every request; these can add up if, for example, you split a 1GB file into 5MB chunks.
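How much the per-request headers add up to depends on the chunk size and on how large the headers are. A small sketch of the arithmetic for the 1GB / 5MB example; the 800 bytes of headers per request is an assumed figure, not a measurement, and `header_overhead` is an illustrative name.

```python
import math


def header_overhead(file_bytes, chunk_bytes, header_bytes_per_request):
    """Return (number of chunk requests, total header bytes sent for them)."""
    chunks = math.ceil(file_bytes / chunk_bytes)
    return chunks, chunks * header_bytes_per_request


if __name__ == "__main__":
    one_gb = 1024 ** 3
    five_mb = 5 * 1024 ** 2
    assumed_headers = 800  # bytes per HTTP request; varies with cookies, auth, etc.
    print(header_overhead(one_gb, five_mb, assumed_headers))
```

With large chunks the header bytes are small next to the payload; the saving becomes more noticeable as chunks get smaller or headers (for example, cookies) get bigger.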
If clients are persistently connected, then WebSockets can reduce the need for long polling, which wastes resources on your server when there is no new information to push to the client.
Websockets will therefore reduce the resources required but they are not available on every browser.
At our peak hour we need to serve around 250 requests per second. What we're doing is accepting a URL for an image, pulling the image out of memcache, and returning it via Apache.
Our current system is a dual-core machine with 4GB of memory: 2GB for the images in memcache and 2GB for Apache; but we're seeing a very high load (20-30) during our peak time. The average response time, as reported by Apache, is 30-80ms per request, which seems kind of slow for a simple Apache request served from memory.
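The request rate and response times quoted above can be related through Little's law (requests in flight = arrival rate × time per request). A small sketch of that arithmetic; note that the load average counts runnable and blocked processes, so it is related to this figure but not the same measurement.

```python
def requests_in_flight(requests_per_second, response_time_ms):
    """Little's law: average number of requests being handled at any moment."""
    return requests_per_second * response_time_ms / 1000


if __name__ == "__main__":
    # 250 requests per second at the reported 30 ms and 80 ms response times.
    print(requests_in_flight(250, 30))
    print(requests_in_flight(250, 80))
```

At the slow end of the reported range, the number of requests in flight is far above the two cores available, so requests spend much of their time waiting for a CPU rather than being served.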
Are there better tools for this? Serving from disk is not an option, since the IO wait was holding it back, so we moved it to memory. How do CDNs do it?
EDIT: Well, the system works like this. A request comes in, and we check a "queue" to see if we've seen this request before; if we have, we serve the image (from disk... or memory). If not, we increment the counter for that request in a memcached queue, and worker machines generate the image and store it back on the main server. So currently, when a request comes in, we check the memcached db to see if it exists, and then connect to another db for the actual image database. When the images were on disk, we found that the file_exist function alone would take 30+ ms to complete, so we moved the images to memory. If we moved the images to a ramdisk, would this speed up file_exist, or would we still want a first check to see if we should even seek the image out?
Have you looked at nginx?
According to Netcraft, in May 2009 nginx served or proxied 3.25% of the busiest sites. It can serve from memcached too.
Depending on the size of your images, Apache should handle this with no problem at all. We have an Apache serving 2000 requests/second; the average response size is 12K. The machine has 32GB of memory, so all our content is cached.
Here are some tuning tips:
Use a threaded MPM such as worker, with lots of threads open (we have 256).
Use mod_cache so all the images will be in memory
Allocate as much memory as possible to the Apache process
When you say memcache, do you mean the memcached server? Running memcached will be slower, because the latency of a TCP connection (even over loopback) is much larger than that of direct memory access.
If you can fit all your images in memory, a RAM disk will also help a lot.