LEMP (nginx + php-fpm) timeouts under high load, then fine - optimization

I'm pretty new to all of this but I'm OCD about optimization.
I'm trying to optimize my web server running a LEMP setup for wordpress.
I'm using WP Hyper Cache instead of W3 Total Cache, as it seems to perform phenomenally well in comparison on my setup.
I'm using blitz.io to test, throwing 450 users at the domain for 60 seconds, starting with the full 450.
These are my results:
The spike at 5 seconds is errors and timeouts.
http://i.imgur.com/CdpBz.png
htop during the spike:
http://i.imgur.com/OhEyS.png
It's a VPS with 2 CPUs at 2.5GHz and 2.5GB of memory; as you can see, memory usage is low.
nginx: worker_processes 1; worker_connections 1024;
php-fpm: pm = dynamic, pm.max_children = 10, pm.start_servers = 2, pm.max_spare_servers = 2; pm.max_requests is commented out (;pm.max_requests = 500), so it uses the default of 0.
I've increased the nginx worker_processes to 2 with no change, and I've messed with my php-fpm settings with no change. Any ideas what I should be looking at?

This does not look too bad. ~40 timeouts out of 19k requests is normal. I got similar results.
As for tuning:
Look into fastcgi_cache (http://wiki.nginx.org/HttpFastcgiModule#fastcgi_cache) - with it nginx serves cached pages itself and PHP is never touched; there's a rough config sketch at the end of this answer. You can also look at Batcache (http://evansolomon.me/notes/faster-wordpress-multisite-nginx-batcache/).
Look into APC/memcached for object caching. This makes non-cached requests faster and keeps the backend more responsive; APC also reduces the memory footprint of PHP. For day-to-day use this makes more of a difference, and it also helps if a lot of your requests are not cacheable (e.g. lots of new comments).
Consider using PHP 5.4 - it's a lot faster and requires less memory.
Enable the MySQL query cache. http://mysqltuner.com is a nice little script to help tune your server.
Measuring peak traffic is not a good indicator of scalability most of the time; real users probably behave differently.
Edit: try blitz.io against a static nginx page. If there are still timeouts, the problem is probably at blitz.io or somewhere else. Also enable gzip compression for your pages.
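A rough idea of what the fastcgi_cache and gzip pieces could look like in nginx - a sketch only; the cache path, zone name, socket path, domain and cache times below are assumptions to adapt, not a drop-in config:

    http {
        # on-disk cache for FastCGI (PHP) responses
        fastcgi_cache_path /var/cache/nginx/fcgi levels=1:2 keys_zone=WORDPRESS:10m inactive=60m;
        fastcgi_cache_key "$scheme$request_method$host$request_uri";

        # compress text responses before sending them out
        gzip on;
        gzip_types text/css application/javascript application/json text/xml;

        server {
            listen 80;
            server_name example.com;       # placeholder
            root /var/www/wordpress;       # placeholder

            location ~ \.php$ {
                include fastcgi_params;
                fastcgi_pass unix:/var/run/php-fpm.sock;   # adjust to your pool's socket/port
                fastcgi_cache WORDPRESS;
                fastcgi_cache_valid 200 301 302 10m;
                # very crude bypass: any cookie (logged-in user, comment author)
                # skips the cache; a real setup would match specific cookies
                fastcgi_cache_bypass $http_cookie;
                fastcgi_no_cache $http_cookie;
            }
        }
    }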

Related

Server load is minimal but website responds poorly

I have a VPS at Hetzner. The server is located in Germany.
It has 256GB of RAM and 6 CPUs (12 threads).
I have a file which, since yesterday, is requested about 30 times per second. The file runs 2 SELECT, 2 UPDATE and 2 INSERT queries, so I assumed (not sure how this works) that this file alone generates about 180 queries per second. Right after these requests started, all the websites on the server started loading poorly. I made the file run just one SELECT query and then die; this didn't help. In WHM the load is about 0.02.
I've checked the error logs and there is no max_user_connections error or anything else there.
I have enabled the slow query log and checked the log file; there is nothing in it (I've verified logging works with SELECT SLEEP(10), which was logged).
These are the visit statistics; please pay attention to May 30th:
Bandwidth stats for last 24 hours:
There are many errors like this in the ssl_log (different IPs, of course):
188.121.206.150 - - [30/May/2018:19:50:03 +0200] "-" 408 - "-" "-"
I've been searching the web a lot and couldn't find a solution. Could anyone at least tell me what I should monitor, and where? I have full access to everything on the server. Any help is appreciated.
UPDATE 1
I have a subdomain, banners.analyticson.com (access allowed for now), which holds all the images and HTML5 files that are requested.
Take one image for example: https://banners.analyticson.com/img/suy8G1S6RU.jpg
It takes far too long to load. As far as I can tell, this subdomain has some issue.
The script I mentioned earlier (with the 6 queries) just tries to serve one of those banners to the user, so the result of that script is one banner returned from banners.analyticson.com.
UPDATE 2
I've checked my script and it is fine; it takes less than 1 second to complete.
I've also checked the top command output. I'm not sure if the %MEM value is fine.
You're going to have to narrow the problem down...
There are multiple potential issues.
The first thing to eliminate would be the performance of your new script on a development laptop - I assume you're using PHP, so use the profiling tools to work out what is going on. If it's a database query, you'll see which one by looking at the profiler.
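For example, one way to set up a profiler - this sketch assumes Xdebug 2.x is installed and the output path is a placeholder:

    ; php.ini on the development machine only - don't profile in production
    ; (the extension path may need to be absolute, depending on the build)
    zend_extension = xdebug.so
    ; only profile requests that explicitly ask for it, not every request
    xdebug.profiler_enable_trigger = 1
    xdebug.profiler_output_dir = /tmp/profiles

Then request the script with ?XDEBUG_PROFILE=1 appended and open the resulting cachegrind file in KCachegrind or Webgrind to see where the time goes.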
If your PHP script and database queries are fine, the next thing to look at: it sounds like you've hit some bottleneck resource on your infrastructure. In these cases, scripts that run fine as a single request start queueing for the bottleneck resource, and every new request adds to the queue until the whole server starts to crawl. This can be a bit of a puzzle - start with top and keep digging.
Next, I'd look at the configuration of Apache to make sure everything is squeaky clean - Apache used to default to doing a reverse DNS lookup for every request, which slows the server down rather impressively in production. You may also want to look at your SSL configuration - the error you report is linked to a load balancer issue.
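To illustrate the DNS point, something along these lines in httpd.conf (the directory path is a placeholder):

    # never resolve client IPs to hostnames on the hot path
    HostnameLookups Off

    # also avoid hostname-based access rules, which force a reverse
    # lookup per request even with HostnameLookups Off
    <Directory "/var/www/html">
        Require all granted        # Apache 2.4 syntax
        # "Allow from example.com" style rules would trigger DNS lookups
    </Directory>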
If it's not as simple as memory, CPU etc., you're into more esoteric issues. You may need to ramp up a load testing rig so you can experiment without affecting the live site - typically, I do this on a machine as similar to live as possible, using Apache JMeter to generate load, and find the "inflection point". Typically, you see response times increase linearly with the number of concurrent requests, until you hit the bottleneck resource, at which point the response time increases rapidly. As a simple example, if you have 10 database connections available, response time should increase linearly up to 10 concurrent connections, and then become much larger from 11 up.
Knowing where the inflection point is and being able to recreate it allows you to use PHP profiling tools under load. This is a lot of work.
UPDATE
You're using php-cgi; this is easily the most inefficient way of running PHP scripts, and your server is barely breaking a sweat - CPU and memory basically idle. Here's a comparison of ways to run PHP; consider changing to mod_php.
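For reference, the mod_php setup this points at looks roughly like the following in httpd.conf - module and file names vary by distro, and on a cPanel/WHM box you would normally switch the PHP handler through WHM rather than editing this by hand:

    # load PHP into the Apache worker processes instead of spawning php-cgi
    LoadModule php5_module modules/libphp5.so
    AddHandler application/x-httpd-php .php
    DirectoryIndex index.php index.html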

How to handle resource limits for apache in kubernetes

I'm trying to deploy a scalable web application on Google Cloud.
I have a Kubernetes Deployment which creates multiple replicas of Apache+PHP pods. These have CPU/memory requests and limits set.
Let's say the memory limit per replica is 2GB. How do I properly configure Apache to respect this limit?
I can modify the maximum process count and/or the maximum memory per process to prevent memory overflow (so the replicas won't be killed because of OOM). But this creates a new problem: these settings also limit the maximum number of requests a replica can handle. In case of a DDoS attack (or just more traffic), the bottleneck could be the maximum process limit rather than the memory/CPU limit. I think this could happen pretty often, as these limits are set for the worst-case scenario, not based on average traffic.
I want to configure the autoscaler so that it creates additional replicas in such an event, not only when CPU/memory usage is near the limit.
How do I properly solve this problem? Thanks for the help!
I would recommend doing the following instead of trying to configure Apache to limit itself internally:
Enforce resource limits on pods, i.e. let them OOM (but see the NOTE*).
Define an autoscaling metric for your deployment based on your load (a rough sketch follows this answer).
Set up a namespace-wide ResourceQuota. This enforces an overall limit on the resources that pods in that namespace can use.
This way you can let your Apache+PHP pods handle as many requests as possible until they OOM, at which point they respawn and join the pool again, which is fine* (because hopefully they're stateless), and at no point does your overall resource utilization exceed the resource limits (quotas) enforced on the namespace.
* NOTE: This is only true if you're not doing fancy stuff like websockets or stream-based HTTP, in which case an OOMing Apache instance takes down the other clients that are holding an open socket to that instance. You should always be able to enforce limits on Apache in terms of the number of threads/processes it runs anyway, but it's best not to unless you have a solid need for it. With this kind of setup, no matter what you do, you won't be able to evade DDoS attacks of large magnitude: you're either going to have broken sockets (in the case of OOM) or request timeouts (not enough threads to handle requests). You'd need far more sophisticated networking/filtering gear to prevent "good" traffic from taking a hit.
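A minimal sketch of the autoscaler and quota pieces - the namespace, object names and all the numbers here are made up and need to be sized for your cluster:

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: apache-php
      namespace: web
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: apache-php
      minReplicas: 2
      maxReplicas: 20
      # scale out on load well before individual pods hit their limits
      targetCPUUtilizationPercentage: 70
    ---
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: web-quota
      namespace: web
    spec:
      hard:
        # cap on the sum of all pod limits in the namespace
        limits.cpu: "16"
        limits.memory: 40Gi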

Why is the Varnish sess_timeout so low by default?

Varnish has the parameter sess_timeout (docs here); by default it is set to 5 seconds. This means that after 5 seconds the session will be closed, and the next page load will require an extra 100ms (on average) to connect to the server (I've described this issue here).
Why is this parameter so low by default?
If I increase it to 60 seconds, will it cause any problems on the server?
Does it matter what I use behind Varnish - nginx or Apache? Or does Varnish optimize the connections by itself?
What's the recommended value for an average website (e.g. a Magento store with 500 active users at a time)?
sess_timeout is tuned to avoid keeping state around when it is not needed. Worker threads are (in high traffic situations) a precious resource, and having one waiting around doing nothing isn't productive.
For all HTTP clients I know of, manual netcat/telnet excluded, it does not take 5s to push through a 100-150 byte HTTP request.
You can safely increase this to 60s if you feel you need to. If you are using this for long-running connections, you should probably use return(pipe) instead; different timers apply there.
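If you do go that route, a minimal sketch of both suggestions - the 60s value and the /stream URL pattern are only examples:

    # raise the idle session timeout when starting varnishd:
    #   varnishd ... -p sess_timeout=60

    # and pipe genuinely long-running requests instead of keeping a
    # worker thread parked on them (VCL):
    sub vcl_recv {
        if (req.url ~ "^/stream") {
            return (pipe);
        }
    }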

What is the recommended max value for Max Connections Per Child in Apache configuration?

I am trying to reduce the memory usage of Apache on the server.
My current Max Connections Per Child is 10k.
According to the following recommendation, Max Connections Per Child should be reduced to 1000:
http://www.lophost.com/tutorials/how-to-reduce-high-memory-usage-by-apache-httpd-on-a-cpanel-server/
What is the recommended max value for Max Connections Per Child in Apache configuration?
The only time this directive affects anything is when your Apache workers are leaking memory. One way this happens is that memory is allocated (via malloc() or whatever) and never freed; it's the result of design/implementation flaws in Apache or its modules.
This directive is somewhat of a hack, really -- but if there's some module that's loaded into Apache that leaks, say, 8 bytes every request, then after a lot of requests, you'll run out of memory. So the quick fix is to just kill the process every MaxConnectionsPerChild requests and start a new one.
This directive will only affect your memory usage if, with MaxConnectionsPerChild set to zero, you see memory gradually increase over the span of lots of requests.
The default is 0 (which means no limit on connections per child), so unless you have memory leakage I'm unaware of any need to change this setting - I agree with Hut8.
Sharing here FYI from the Apache 2.4 Performance Tuning page:
Related to process creation is process death induced by the MaxConnectionsPerChild setting. By default this is 0, which means that there is no limit to the number of connections handled per child. If your configuration currently has this set to some very low number, such as 30, you may want to bump this up significantly. If you are running SunOS or an old version of Solaris, limit this to 10000 or so because of memory leaks.
And from the Apache 2.4 docs on MaxConnectionsPerChild:
Setting MaxConnectionsPerChild to a non-zero value limits the amount of memory that process can consume by (accidental) memory leakage.
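For context, this is roughly where the directive lives in an Apache 2.4 prefork MPM config - the numbers are purely illustrative and need to be sized against the memory used per child (in 2.2 the directive was called MaxRequestsPerChild and MaxRequestWorkers was MaxClients):

    <IfModule mpm_prefork_module>
        StartServers              5
        MinSpareServers           5
        MaxSpareServers          10
        MaxRequestWorkers       150
        # 0 = children are never recycled; use something like 1000 only if they leak
        MaxConnectionsPerChild 1000
    </IfModule>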

Apache KeepAlive on API Server

I have a LAMP server (quad-core Debian with 4GB RAM, Apache 2.2 and PHP 5.3) at Rackspace which is used as an API server. I would like to know what the best KeepAlive option for Apache is, given our setup.
The API server hosts a single PHP file which responds with plain JSON. It's a fairly hefty script which performs some MySQL reads/writes and quite a few Memcache lookups.
We have about 90 clients that are logged into the system at any one time.
Roughly 1/3rd of clients would be idle.
Of the active clients (roughly 60) they send a request to the API every 3 seconds.
Clients switch from active to idle and vice versa every 15 or 20 minutes or so.
With KeepAlive On, the server goes nuts and memory peaks at close to 4GB (swap is engaged etc).
With KeepAlive Off, the memory sits at 3GB however I notice that Apache is constantly killing and creating new processes to handle each connection.
So, my three options are:
KeepAlive On and KeepAliveTimeout Default - In this case I guess I will just need to get more RAM.
KeepAlive On and KeepAliveTimeout low (perhaps 10 seconds?). If KeepAliveTimeout is set to 10 seconds, will a client maintain a constant connection to that one process by accessing the resource at regular 3-second intervals? When that client becomes idle for longer than 10 seconds, will the process then be killed? If so, I guess option 2 looks like the best one to go for?
KeepAlive Off This is clearly best for RAM, but will it have an impact on the response times due to the work involved in setting up a new process for each request?
Which option is best?
It looks like your PHP script is leaking memory. Before making the workers long-running processes you should get to grips with that.
If you don't have a good idea of the memory usage per request, and from request to request, adding memory is not a real solution. It might help for now and break again next week.
I would keep running separate processes until memory management is under control. If you currently have response-time problems, your best bet is to add another server to spread the load.
The very first thing you should be checking is whether the clients are actually using the keepalive functionality at all. I'm not sure what you mean by an 'API server', but if it's some sort of web service then (IME) it's rather difficult to implement well-behaved clients using keepalives. (See the %k directive for mod_log_config; there's a logging sketch at the end of this answer.)
Also, we really need to know what your objectives and constraints are - performance / capacity / low cost?
Is this running over HTTP or HTTPS - there's a big difference in latency.
I'd have said that a keepalive time of 10 seconds is ridiculously high - not low at all.
Even if you've got 90 clients holding connections open, 4GB seems a rather large amount of memory for them to be using - I've run systems with 150-200 concurrent connections to complex PHP scripts using approx 0.5GB over resting usage. Your figures of 250 + 90 x 20M only give you a footprint of about 2GB (I know it's not that simple - but it's not much more complicated).
For the figures you've given I wouldn't expect any benefit - just a significantly bigger memory footprint - from using anything over 5 seconds for the keepalive. You could probably use a keepalive time of 2 seconds without any significant loss of throughput, but there's no substitute for measuring the effectiveness of various configs and analysing the data to find the optimal one.
Certainly, if you find that your clients are able to take advantage of keepalives and get a measurable benefit from doing so, then you need to find the best way of accommodating that. Using a threaded server might help a little with memory usage, but you'll probably find a lot more benefit in running a reverse proxy in front of the webserver - particularly with SSL.
Besides that you may get significant benefits through normal tuning - code profiling, output compression etc.
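A rough way to check whether keepalives are actually being reused, as suggested above - the format name and log path here are arbitrary:

    # %k = number of keepalive requests already handled on this connection
    LogFormat "%h %t \"%r\" %>s %b keptalive=%k" keepalive_fmt
    CustomLog /var/log/apache2/keepalive.log keepalive_fmt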
Instead of managing the KeepAlive settings, which between the 3 options clearly have no real advantage in your particular situation, you should consider switching Apache to an event-based or thread-based MPM, where you could easily use KeepAlive On and set the timeout value high.
I would go as far as also considering a switch to Apache on Windows. The benefit there is that its MPM is completely thread-based and takes advantage of Windows' preference for threads over processes. You can easily do 512 threads with KeepAlive On and a timeout of 3-10 seconds on 1-2GB of RAM.
(e.g. WampDeveloper Pro, XAMPP, WampServer)
Otherwise, your only other options are to switch MPM from Prefork to Worker...
http://httpd.apache.org/docs/2.2/mod/worker.html
Or to Event (which also got better with Apache 2.4)...
http://httpd.apache.org/docs/2.2/mod/event.html
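A rough idea of what the threaded-MPM route looks like - the thread counts below are placeholders to size against your per-process memory, and note that on Apache 2.2 the event MPM is still labelled experimental:

    <IfModule mpm_worker_module>
        StartServers          2
        ServerLimit           4
        ThreadsPerChild      64
        MaxClients          256    # called MaxRequestWorkers in Apache 2.4
        MinSpareThreads      32
        MaxSpareThreads     128
    </IfModule>

    KeepAlive            On
    KeepAliveTimeout     3
    MaxKeepAliveRequests 500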