Possible reasons for "mysql server has gone away" error (php 5.4, mysqlnd) - apc

We recently upgraded one of our webservers from PHP 5.3 (Debian Squeeze package, using libmysqlclient and APC) to PHP 5.4 (Debian Wheezy, Dotdeb package, using mysqlnd, Opcache and APCu). After working fine for almost one day, we experienced "mysql server has gone away" errors for every request. All other servers with the same load which still run PHP 5.3 with libmysqlclient using the same MySQL server had no problem at all. On all servers we use:
max_execution_time = 60
default_socket_timeout = 60
On our PHP 5.3 servers we did not change any mysql/my.cnf timeouts. We know about problems with read_timeout (mysql), wait_timeout (mysql), default_socket_timeout (php) and max_execution_time (php), but only in context of batch scripts with long running queries. Our webservers usually respond in about 300ms, so those timeouts should not be an issue here.
It became really strange when we removed the server from loadbalancing, so there was no load anymore, but we still had 180 busy Apache processes. Even a apache2ctl graceful did not change anything, even hours later apache2ctl status said:
Apache Server Status for localhost
Server Version: Apache/2.2.22 (Debian)
Server Built: Jun 16 2014 03:51:14
__________________________________________________________________
Current Time: Tuesday, 22-Jul-2014 10:17:44 CEST
Restart Time: Monday, 21-Jul-2014 18:43:37 CEST
Parent Server Generation: 26
Server uptime: 15 hours 34 minutes 6 seconds
Total accesses: 596973 - Total Traffic: 1.6 GB
CPU Usage: u6288.72 s463.96 cu.01 cs0 - 12% CPU load
10.7 requests/sec - 30.8 kB/second - 2962 B/request
176 requests currently being processed, 99 idle workers
GGGGGG_GGGGGGGGG_GG_GGGGGGGGGGGGGGGGGGGG_GGGGGG_GGGGGGGG_GGGGGGG
GGGGGGGGGG_G_GGGGGGGG_G_GG__GGGGGG_GGGGG_GGG___GG_GGGGGGGG_G_GGG
GGGGGGGGGGGG_G_GG__GG_GGG_GGGGGGGGG__GGG_GGG_G_G_GG_G_GGGGGGGGGG
GGG_GGG_GG_GGG_GG_G_GGG_______________.___._W___________________
____.___________.______.........................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
....
Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process
Only apache2ctl restart solved the issue and everything worked fine again. The MySQL error is the only "useful" error message we found so far.
Could it be an issue with mysqlnd, opcache or apcu and PHP 5.4.30? Are there any known problems which could result in the behavior we have experienced?
Or do you have an idea how to debug the "mysql server has gone away" issue?

We probably found out why it comes to the "mysql server has gone away" error: On the MySQL server we configured a wait_timeout of 30 seconds, which is less than 60 seconds max_execution_time. So under certain conditions something seems to take more than 30 seconds while we are reading a result-set from MySQL, so the server closes the connection while we are still trying to get data from the server. That leads us to the next questions:
What function consumes so much time while we are in a loop reading a result-set from mysql?
Why does apache2ctl graceful not restart the Apache Processes, even when max_execution_time should abort the scripts after 60 seconds?
I think the answer to both questions is a bug in APCu. Because if I look at the hanging Apache childs, I get FUTEX_WAIT from strace:
[pid 28354] futex(0x7f3a8c3d2094, FUTEX_WAIT, 69, NULL <unfinished ...>
If I look at such a process using gdb it seems to hang at pthread_rwlock_wrlock(), what I get is:
0x00007f3adcd18abd in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
OK, pthread_rwlock is used for locking in APCu, and problems with locking mechanism is a really good explanation for what we see here, in that we definitely have code which reads/writes to APCu inside of loops through MySQL result-sets, and if there is a problem with locking (which already has been a problem for us in the past with APC as well) it could take >30 and <60 seconds so the MySQL error is what we see. And after that situation something with APCu goes really wrong, so the php script can't be aborted by max_execution_time and not be restarted by apache2ctl graceful anymore.
In APCu issue tracker I could find very similar issues:
https://github.com/krakjoe/apcu/issues/19
But we found another hint. The crash always happens, when there are about 70k keys in APCu, and it does not depend on apc.shm_size, but we found out that our APCu monitoring script produces "PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 78 bytes)" errors when calling apcu_cache_info() at line 47 at the same time when we see the crash. So we have to look into the script why it consumes so much memory, AFAIR id read's all the data for calculating memory fragmentation, perhaps we should remove that part...
But we had a lot of problems with APC in the past, we switched to APCu/Opcache only because we got seg faults with latest APC and PHP 5.4.30 and the issue mentioned above is open for one year now. We are happy to see recent activity on yac, perhaps lockless is a more stable option. If we can't fix by removing problems from our monitoring script, we will switch to local memcached instances, it will be slower but we know it's very stable.

Related

Apache/PHP7.3 running in Docker randomly drops connection with empty response

I have found several similar questions:
APACHE, PHP Server return randomly empty response
https://serverfault.com/questions/66662/apache-gives-empty-reply
and others
However these does not seem to help to find the cause. I can replicate the behaviour when reloading a specific page ~20 times.
Running current apache2 (= 2.4.38-3+deb10u4). I tried to disable opcache, remove MaxRequestsPerChild with no effect.
Apache log does not show any error. The request is not even logged.
The USE_ZEND_ALLOC=0 seem to have no effect and the problem persists.
I tried to install mod_forensic which shows that the request came in. No error or finished request is then logged.
The container is running in Kubernetes and I cannot replicate the issue locally running directly with Docker machine, that is why I think this might be caused by some memory setting. However I couldn't find what might be causing this as there is no single error message.
Can you think of any reason why this might be happening?
Edit1:
I tried to set log level to trace:
https://gist.github.com/knyttl/861e8a0fe5651408df37cd5c3874946b
The request is handled and then you can see:
[Tue Oct 20 08:37:55.825454 2020] [core:trace4] [pid 1] mpm_common.c(536): mpm child 388 (gen 2/slot 4) exited
With no error and no response.
Edit2:
I updated to php7.4 and the issue persists.
I finally found it:
The process is silently killed by OOM killer of the host machine:
[4019392.626796] Memory cgroup out of memory: Kill process 4178127 (apache2) score 1137 or sacrifice child
[4019392.636520] Killed process 4178127 (apache2) total-vm:143960kB, anon-rss:22856kB, file-rss:10472kB, shmem-rss:28228kB
This is never logged within the container so it was hard to find.
Why don't you use Jorge's answers ?
Finally solved by adding to /etc/apache2/envvars:
export USE_ZEND_ALLOC=0
https://serverfault.com/a/66759

Apache Server Many requests stuck in "R" Reading Request

below apache2ctl status with almost no users online.
For over 5 years we (cloud ERP supplier) deploy instances on Google Cloud with Apache with mod_perl.
This week our largest server became slow and unresponsive. No idle workers were available. It turned out increasing both MaxRequestWorkers and ServerLimit to 400 from 150 in mpm_prefork.conf got our server back fast.
Iā€™m wondering why many requests stay in "R" Reading Request, at least 10 times more requests then actually should be.
We did further checking, DoS does not seem to be the issue, as also other servers ā€“ in different clouds as ASW or Alibaba ā€“ we notice the same ratio of 10 between requests actually being processed (R/W/K) and requests that stay in Reading mode.
What could cause this?
sudo /usr/sbin/apache2ctl status
Apache Server Status for localhost (via 127.0.0.1)
Server Version: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.29 OpenSSL/1.0.1f
mod_perl/2.0.8 Perl/v5.18.2
Server MPM: prefork
Server Built: Apr 3 2019 18:04:25
Current Time: Saturday, 29-Feb-2020 10:15:35 CET
Restart Time: Thursday, 27-Feb-2020 09:45:48 CET
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 2 days 29 minutes 47 seconds
Server load: 0.75 0.77 0.75
Total accesses: 1581181 - Total Traffic: 8.6 GB
CPU Usage: u30.32 s9.64 cu0 cs0 - .0229% CPU load
9.06 requests/sec - 51.5 kB/second - 5.7 kB/request
96 requests currently being processed, 9 idle workers
RRKRRRK_RKRKKRRRRRK_RRRRKRCK_RRRC_CKK_KCRKCRK_RCR__CKKCCRCRRRRRR
RRRRR.RRRKRRRKRRR_RR..R.K.RCRKR.CKK.RRKKR.W.RRKR.....RR.........
................................................................
................................................................
................................................................
................................................................
................
Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process

apache2 processes stuck in sending reply - W

I am hosting multiple sites on a server with 7.5gb RAM. Using apache2 mpm_prefork.
Following command gives me a value of 200-300 in production
ps aux|grep -c 'apache2'
Using top i see only some hundred megabytes of RAM is free. Error log show nothing unusual. Is this much apache2 process normal?
MaxRequestWorkers is set to 512
Update:
Now i am using mod-status to check apache activity.
I have a row like this
Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
0-0 29342 2/2/70 W 0.07 5702 0 3.0 0.00 1.67 XXX XXX /someurl
If i check again after sometime PID not changes and i get SS with greater value that previous time. M of this request is in 'W` sending reply state. So that means apache2 process locked in for that request?
On my VPS and root servers, the situation is partially similar. AFAIK the os tries to distribute most of the processing power/RAM to running processes and frees the resources for other processes as the need arises.

Apache error - mod_fcgid: too many processes, skip the spawn request- Hangs Server

We are running a LAMP stack on a Dreamhost dedicated server. Since updating to PHP 5.6 FastCGI we have been getting more frequent server outages and slow response times even with nominal server load. Apache is throwing the following error:
[Tue Jul 14 14:58:09 2015] [error] mod_fcgid: domain=<domain>.com too many /dh/cgi-system/php56.cgi processes (current:20, max:20), skip the spawn request
I have read of some FastCGI configurations that might help, but Dreamhost did not suggest modifying these settings. Would it be safe to update these settings on a Dreamhost server and would this help?
We are planning an audit of all CURL requests to add timeouts. Will this help the issue? Is there something else we can do to stop the php spawn requests from being skipped? Is there a way to identify an offending script that might be hanging the server?

Apache requests waterfall

In normal mode my apache mod status shows this:
CPU Usage: u118.45 s9.79 cu0 cs0 - 14% CPU load
7.96 requests/sec - 18.1 kB/second - 2331 B/request
1 requests currently being processed, 29 idle workers
._.._._..._.___...._____...._.____.._.._.._....___._._____W.....
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
....................................................
But sometimes (period about 1-3 minutes) my site has "lags", and the apache status looks like this:
CPU Usage: u222.29 s18.89 cu0 cs0 - 9.58% CPU load
7.77 requests/sec - 23.3 kB/second - 3064 B/request
20 requests currently being processed, 10 idle workers
WW.WW_W..WWC.._W.WCWW._._.W....WW_W..W__.W__...._...W...........
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................
....................................................
During these moments, requests are usual, as at any time. But there are a lot of them.
I don't have any cron jobs that are this frequent.
Increasing prefork SpareServers mb will help, but i want to know why these waves are happening.
My current config is
Timeout 60
KeepAlive On
MaxKeepAliveRequests 200
KeepAliveTimeout 20
<IfModule prefork.c>
StartServers 100
MinSpareServers 10
MaxSpareServers 30
ServerLimit 500
MaxClients 500
MaxRequestsPerChild 10000
</IfModule>
The hardware is strong enough.
Sorry for my bad English.
Any advice would be helpful.