apache2 processes stuck in Sending Reply (W)

I am hosting multiple sites on a server with 7.5 GB RAM, using Apache 2 with mpm_prefork.
The following command gives me a value of 200-300 in production:
ps aux | grep -c 'apache2'
Using top I see that only a few hundred megabytes of RAM are free. The error log shows nothing unusual. Is this many apache2 processes normal?
MaxRequestWorkers is set to 512
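As a rough sanity check (a sketch; per-child memory varies a lot by site), each prefork child holds its own copy of memory, so MaxRequestWorkers times the average child RSS has to fit in RAM:

ps -C apache2 -o rss= | awk '{sum+=$1; n++} END {printf "%d procs, %.1f MB avg RSS\n", n, sum/n/1024}'

With 7.5 GB total, 512 workers would only fit if the average child stayed under roughly 14 MB, which is unusually small.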
Update:
Now I am using mod_status to check Apache activity.
I have a row like this
Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
0-0 29342 2/2/70 W 0.07 5702 0 3.0 0.00 1.67 XXX XXX /someurl
If I check again after some time, the PID does not change and SS shows a greater value than the previous time. The M column for this request stays in the 'W' (Sending Reply) state. Does that mean the apache2 process is stuck on that request?
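One way to confirm this (a minimal sketch; assumes mod_status is enabled and /server-status is reachable from localhost) is to poll the status page and watch whether the same PID keeps the same request while SS keeps growing:

watch -n 10 "curl -s http://localhost/server-status | grep 29342"
curl -s 'http://localhost/server-status?auto'    # machine-readable summary

If SS (seconds since the beginning of the most recent request) keeps increasing for the same PID and URL, that worker really is stuck in that request.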

On my VPS and root servers the situation is partially similar. AFAIK the OS tries to distribute most of the processing power/RAM to running processes and frees resources for other processes as the need arises.

Related

Apache Server Many requests stuck in "R" Reading Request

Below is the apache2ctl status output with almost no users online.
For over 5 years we (a cloud ERP supplier) have been deploying instances on Google Cloud with Apache and mod_perl.
This week our largest server became slow and unresponsive, with no idle workers available. It turned out that increasing both MaxRequestWorkers and ServerLimit from 150 to 400 in mpm_prefork.conf got our server back fast.
I'm wondering why so many requests stay in "R" Reading Request, at least 10 times more than there actually should be.
We did further checking, and DoS does not seem to be the issue: on other servers, in different clouds such as AWS or Alibaba, we notice the same ratio of 10 between requests actually being processed (R/W/K) and requests that stay in Reading mode.
What could cause this?
sudo /usr/sbin/apache2ctl status
Apache Server Status for localhost (via 127.0.0.1)
Server Version: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.29 OpenSSL/1.0.1f
mod_perl/2.0.8 Perl/v5.18.2
Server MPM: prefork
Server Built: Apr 3 2019 18:04:25
Current Time: Saturday, 29-Feb-2020 10:15:35 CET
Restart Time: Thursday, 27-Feb-2020 09:45:48 CET
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 2 days 29 minutes 47 seconds
Server load: 0.75 0.77 0.75
Total accesses: 1581181 - Total Traffic: 8.6 GB
CPU Usage: u30.32 s9.64 cu0 cs0 - .0229% CPU load
9.06 requests/sec - 51.5 kB/second - 5.7 kB/request
96 requests currently being processed, 9 idle workers
RRKRRRK_RKRKKRRRRRK_RRRRKRCK_RRRC_CKK_KCRKCRK_RCR__CKKCCRCRRRRRR
RRRRR.RRRKRRRKRRR_RR..R.K.RCRKR.CKK.RRKKR.W.RRKR.....RR.........
................................................................
................................................................
................................................................
................................................................
................
Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process

How to solve: UDP send of xxx bytes failed with error 11 in Ubuntu?

UDP send of XXXX bytes failed with error 11
I am running a WebRTC streaming app on Ubuntu 16.04.
It streams video and audio from a Logitech HD Webcam C930e within an Electron desktop app.
It all works fine and smoothly on my other machine, a MacBook Pro. But on my Ubuntu machine I receive errors 10-20 seconds after the peer connection is established:
[2743:0513/193817.691636:ERROR:stunport.cc(282)] Jingle:Port[0xa5faa3df800:audio:1:0:local:Net[wlx0013ef503b67:192.168.0.x/24:Wifi]]: UDP send of 1019 bytes failed with error 11
[2743:0513/193817.691775:ERROR:stunport.cc(282)] Jingle:Port[0xa5faa3df800:audio:1:0:local:Net[wlx0013ef503b67:192.168.0.x/24:Wifi]]: UDP send of 1020 bytes failed with error 11
[2743:0513/193817.696615:ERROR:stunport.cc(282)] Jingle:Port[0xa5faa3df800:audio:1:0:local:Net[wlx0013ef503b67:192.168.0.x/24:Wifi]]: UDP send of 1020 bytes failed with error 11
[2743:0513/193817.696777:ERROR:stunport.cc(282)] Jingle:Port[0xa5faa3df800:audio:1:0:local:Net[wlx0013ef503b67:192.168.0.x/24:Wifi]]: UDP send of 1020 bytes failed with error 11
[2743:0513/193817.712369:ERROR:stunport.cc(282)] Jingle:Port[0xa5faa3df800:audio:1:0:local:Net[wlx0013ef503b67:192.168.0.x/24:Wifi]]: UDP send of 1029 bytes failed with error 11
[2743:0513/193817.712952:ERROR:stunport.cc(282)] Jingle:Port[0xa5faa3df800:audio:1:0:local:Net[wlx0013ef503b67:192.168.0.x/24:Wifi]]: UDP send of 1030 bytes failed with error 11
[2743:0513/193817.713086:ERROR:stunport.cc(282)] Jingle:Port[0xa5faa3df800:audio:1:0:local:Net[wlx0013ef503b67:192.168.0.x/24:Wifi]]: UDP send of 1030 bytes failed with error 11
[2743:0513/193817.717713:ERROR:stunport.cc(282)] Jingle:Port[0xa5faa3df800:audio:1:0:local:Net[wlx0013ef503b67:192.168.0.x/24:Wifi]]: UDP send of 1030 bytes failed with error 11
By the way, if I do NOT stream audio but only video, I get the same errors, just with "video" instead of "audio" in the log lines.
Somewhere in between those lines I also get one line that says:
[3441:0513/195919.377887:ERROR:stunport.cc(506)] sendto: [0x0000000b] Resource temporarily unavailable
I also looked into sysctl.conf and increased the values there. My current sysctl.conf looks like this:
fs.file-max=1048576
fs.inotify.max_user_instances=1048576
fs.inotify.max_user_watches=1048576
fs.nr_open=1048576
net.core.netdev_max_backlog=1048576
net.core.rmem_max=16777216
net.core.somaxconn=65535
net.core.wmem_max=16777216
net.ipv4.tcp_congestion_control=htcp
net.ipv4.ip_local_port_range=1024 65535
net.ipv4.tcp_fin_timeout=5
net.ipv4.tcp_max_orphans=1048576
net.ipv4.tcp_max_syn_backlog=20480
net.ipv4.tcp_max_tw_buckets=400000
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_syn_retries=2
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_wmem=4096 65535 16777216
vm.max_map_count=1048576
vm.min_free_kbytes=65535
vm.overcommit_memory=1
vm.swappiness=0
vm.vfs_cache_pressure=50
As suggested here: https://gist.github.com/cdgraff/7920db287988463aafd7ea09eef6f9f0
It does not seem to help: I am still getting these errors and I experience lagging on the other side.
Additional info: on Ubuntu the Electron app connects to a Heroku server (Node.js), and the other side of the peer connection (a Chrome browser) also connects to it. The Heroku server acts as the handshaking (signaling) server to establish the WebRTC connection. Both have this configuration:
{'urls': 'stun:stun1.l.google.com:19302'},
{'urls': 'stun:stun2.l.google.com:19302'},
and also an additional TURN server from numb.viagenie.ca.
The connection is established, and within the first 10 seconds the quality is very high with no lagging at all. But after 10-20 seconds the lagging starts, and on the Ubuntu console I get these UDP errors.
The PC that Ubuntu is running on:
PROCESSOR / CHIPSET:
CPU Intel Core i3 (2nd Gen) 2310M / 2.1 GHz
Number of Cores: Dual-Core
Cache: 3 MB
64-bit Computing: Yes
Chipset Type: Mobile Intel HM65 Express
RAM:
Memory Speed: 1333 MHz
Memory Specification Compliance: PC3-10600
Technology: DDR3 SDRAM
Installed Size: 4 GB
Rated Memory Speed: 1333 MHz
Graphics
Graphics Processor Intel HD Graphics 3000
Could anyone please give me some hints or anything else that could solve this problem?
Thank you
==============EDIT=============
I found these two lines somewhere in my very large strace log:
7671 sendmsg(17, {msg_name(0)=NULL, msg_iov(1)=[{"CHILD_PING\0", 11}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 11
7661 <... recvmsg resumed> {msg_name(0)=NULL, msg_iov(1)=[{"CHILD_PING\0", 12}], msg_controllen=32, [{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, {pid=7671, uid=0, gid=0}}], msg_flags=0}, 0) = 11
On top of that, near where the error happens (at the end of the log file, just before I quit the application), I see the following in the log file:
https://gist.github.com/Mcdane/2342d26923e554483237faf02cc7cfad
First, to get an impression of what is happening in the first place, I'd look with strace. Start your application with
strace -e network -o log.strace -f YOUR_APPLICATION
If your application looks for another running process to hand the work over to, start it with parameters so it doesn't do that. For instance, for Chrome, pass a --user-data-dir value that is different from your default.
Look for = 11 in the output file log.strace afterwards, and look at what happened before and after. This will give you a rough picture of what is happening, and you can rule out silly mistakes like sendto calls to 0.0.0.0 (for this reason, this is also very important information to include in a Stack Overflow question, for instance by uploading the output to a gist).
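To narrow the log down quickly (a sketch; depending on the strace version, the failing call may be rendered as "= -1 EAGAIN" rather than a bare "= 11"):

grep -n -E '= 11|EAGAIN' log.strace    # locate the failing calls
grep -n -B3 -A3 EAGAIN log.strace      # show surrounding syscalls for context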
It may also be helpful to use Wireshark or another packet capture program to get a rough overview of what is being sent.
Assuming you can confirm with strace that a valid send call is taking place, you can then further analyze the error conditions.
Error 11 is EAGAIN. The documentation of send says when this error is supposed to happen:
EAGAIN (...) The socket is marked nonblocking and the requested operation would block. (...)
EAGAIN (Internet domain datagram sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7).
Both conditions could apply.
The first will be obvious from the strace log if you trace the creation of the socket involved.
To exclude the second, you can run netstat -una (or, if you want to know the programs involved, sudo netstat -unap) to see which ports are open (if you want Stack Overflow users to look into it, post the output on a gist or similar and link to it here). Your port range net.ipv4.ip_local_port_range=1024 65535 is not the standard 32768 60999; it looks like you already attempted to do something about a lack of port numbers. It would help to trace back the reason why you changed that parameter, and the conditions that convinced you to do so.
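A quick sketch of that check, using standard Linux tools:

cat /proc/sys/net/ipv4/ip_local_port_range   # the configured ephemeral range
ss -una | wc -l                              # rough count of open UDP sockets
sudo netstat -unap                           # UDP sockets with owning processes

If the socket count is nowhere near the size of the range, ephemeral-port exhaustion is unlikely and the nonblocking-send case is the better suspect.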

How can I configure monit to kill a high CPU process after a few seconds?

I want to use monit to kill a process that uses more than X% CPU for more than N seconds.
I'm using stress to generate load to try a simple example.
My .monitrc:
check process stress
matching "stress.*"
if cpu usage > 95% for 2 cycles then stop
I start monit (I checked the syntax with monit -t .monitrc):
monit -c .monitrc -d 5
And I launch stress:
stress --cpu 1 --timeout 60
stress shows up in top using 100% CPU.
I'd expect monit to kill stress in about 10 seconds, but stress completes successfully. What am I doing wrong?
I also tried monit procmatch "stress.*", which shows two matches for some reason. Maybe that's relevant?
List of processes matching pattern "stress.*":
stress --cpu 1 --timeout 60
stress --cpu 1 --timeout 60
Total matches: 2
WARNING: multiple processes matched the pattern. The check is FIRST-MATCH based, please refine the pattern
EDIT: Tried e.lopez's method
I had to remove the start statement from .monitrc because it was causing an error in monit ('stress' failed to start (exit status -1) -- Program /usr/bin/stress timed out, followed by a zombie process).
So I launched stress manually:
stress -c 1
stress: info: [8504] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
The .monitrc:
set daemon 5
check process stress
matching "stress.*"
stop program = "/usr/bin/pkill stress"
if cpu > 5% for 2 cycles then stop
Launched monit:
monit -Iv -c .monitrc
Starting Monit 5.11 daemon
'xps13' Monit started
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check skipped (initializing)
'stress'
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check succeeded [current cpu usage=0.0%]
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check succeeded [current cpu usage=0.0%]
'stress' process is not running
'stress' trying to restart
'stress' start skipped -- method not defined
Monit sees the right process (the PIDs match) but reports 0% usage, while stress is using one CPU at 100% according to top. I killed stress manually, which is when monit says the process is not running (at the end, above). So monit is monitoring the process fine, but isn't seeing the right CPU usage.
Any ideas?
Note that if your system has many cores, stressing just one of them (-c 1) will not stress the whole system. In my tests with an i7 processor, stressing one CPU to 95% only brings the total system to about 12.5%.
Depending on the number of cores, you may want to use:
stress -c X
where X is the number of cores you want to stress.
But this is not your main issue. Your problem is that you do not provide monit with a stop instruction for the stress program. Look at this:
check process stress
matching "stress.*"
start program = "/usr/bin/stress -c 1" with timeout 10 seconds
stop program = "/usr/bin/pkill stress"
if cpu > 5% for 2 cycles then stop
You are missing at least the "stop" line, which defines the command monit uses to actually stop the process. As stress is not a service, you may want to use pkill to kill the process.
I tested the above configuration successfully. Output of the monit.log:
[CET Nov 5 09:03:02] info : 'stress' start action done
[CET Nov 5 09:03:02] info : 'Overlord' start action done
[CET Nov 5 09:03:12] info : Awakened by User defined signal 1
[CET Nov 5 09:03:22] error : 'stress' cpu usage of 12.5% matches resource limit [cpu usage<5.0%]
[CET Nov 5 09:03:32] error : 'stress' cpu usage of 12.4% matches resource limit [cpu usage<5.0%]
[CET Nov 5 09:03:32] info : 'stress' stop: /usr/bin/pkill
So: assuming you just want to test, and the exact CPU threshold is not relevant, use the config I provided above. Once you are sure your config works, adjust the resource limits for the processes you would like to monitor in a production environment.
Always have at hand: https://mmonit.com/monit/documentation/
Hope it helps.
Regards
I think the reason you're seeing 0% CPU is that stress -c 1 creates two processes: one "worker" process which creates the load, and a second, mostly idle background process (open htop and filter for stress to see the second process).
If a regex matches more than one process, monit picks the process with the longest uptime (see the monit documentation); for me, the background process always had a longer uptime than the "worker" process.
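You can check which process monit would pick yourself (a sketch; assumes the standard procps tools):

pgrep -a stress                    # all matching PIDs with their command lines
ps -o pid,etime,args -C stress     # compare elapsed times; the oldest one wins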
You can mitigate this by using stress-ng, whose "worker" process has a distinct name, so there is no ambiguity when matching.
stress-ng -c 1
works with the following .monitrc file
set daemon 5
check process stress
matching "stress-ng-cpu"
stop program = "/usr/bin/pkill stress-ng"
if cpu > 5% for 2 cycles then stop

APC (PHP Cache) Uptime 0 minutes, not caching

My goal is to implement APC as an opcode cache for a Drupal 6 production site.
So far I have tested APC with several PHP files, with and without including other PHP files via include_once.
I have also tried to tweak the apc.ini values for shm_size, apc.include_once_override and apc.stat.
I restarted Apache every time.
The result is that apc.php does not show any changes in any values (except, of course, the changed apc.ini values, which are shown as they should be).
Every time I refresh the apc.php test page, the start time resets to the current time, showing an uptime of 0 minutes.
The apc.php test page shows:
General Cache Information
APC Version 3.1.9
PHP Version 5.2.10
APC Host xxxx.xx.xx
Server Software Apache/2.2.3 (CentOS)
Shared Memory 1 Segment(s) with 128.0 MBytes
(mmap memory, pthread mutex Locks locking)
Start Time 2011/07/26 11:53:56
Uptime 0 minutes
File Upload Support 1
Cached Files 0 ( 0.0 Bytes)
Hits 1
Misses 1
Request Rate (hits, misses) 2.00 cache requests/second
Hit Rate 1.00 cache requests/second
Miss Rate 1.00 cache requests/second
Insert Rate 0.00 cache requests/second
Cache full count 0
Cached Variables 0 ( 0.0 Bytes)
Hits 0
Misses 0
Request Rate (hits, misses) 0.00 cache requests/second
Hit Rate 0.00 cache requests/second
Miss Rate 0.00 cache requests/second
Insert Rate 0.00 cache requests/second
Cache full count 0
apc.cache_by_default 1
apc.canonicalize 1
apc.coredump_unmap 0
apc.enable_cli 0
apc.enabled 1
apc.file_md5 0
apc.file_update_protection 2
apc.filters
apc.gc_ttl 3600
apc.include_once_override 0
apc.lazy_classes 0
apc.lazy_functions 0
apc.max_file_size 16
apc.mmap_file_mask /tmp/apcphp5.095eRm
apc.num_files_hint 1024
apc.preload_path
apc.report_autofilter 0
apc.rfc1867 0
apc.rfc1867_freq 0
apc.rfc1867_name APC_UPLOAD_PROGRESS
apc.rfc1867_prefix upload_
apc.rfc1867_ttl 3600
apc.serializer default
apc.shm_segments 1
apc.shm_size 128M
apc.slam_defense 0
apc.stat 0
apc.stat_ctime 0
apc.ttl 7200
apc.use_request_time 1
apc.user_entries_hint 4096
apc.user_ttl 7200
apc.write_lock 1
Host Status Diagrams:
Free: 128.0 MBytes (100.0%) Hits: 1 (50.0%)
Used: 20.3 KBytes (0.0%) Misses: 1 (50.0%)
Detailed Memory Usage and Fragmentation:
Fragmentation: 0%
phpinfo shows:
Server API CGI/FastCGI
APC:
Version 3.1.9
APC Debugging Enabled
MMAP Support Enabled
MMAP File Mask /tmp/apcphp5.JkKDk7
Locking type pthread mutex Locks
Serialization Support php
Revision $Revision: 308812 $
Build Date Jul 21 2011 14:31:12
I followed these steps to find out whether suexec settings would prevent caching:
http://www.litespeedtech.com/support/forum/showthread.php?t=4189
[root@host /]# ps -ef|grep lsphp
root 20402 17833 0 11:21 pts/0 00:00:00 grep lsphp
[root@host /]# ps -waux
root 17833 0.0 0.1 5004 1484 pts/0 S 10:39 0:00 bash
...which indicates that there is no lsphp running on the host.
I also read the following article and its comments, concluding that in my case the problem is not suexec, as the user apache is the httpd process owner:
http://www.brandonturner.net/blog/2009/07/fastcgi_with_php_opcode_cache/
Also, the suexec command is not recognized when logged in and launched as root@host.
I'm also fairly confident that there is no cPanel running on the host, so there is no setting there that would reset the running cache process at some interval.
This leaves me with few clues about where to head next.
I tried to set (with chown and chgrp) apache as the owner of the apc.php file and some test PHP files, which resulted in a 500 server error.
Is there a way to check whether the file permissions prevent APC from staying running?
I'm tremendously grateful for any suggestions or help.
Can you give your php.ini settings for APC?
You must restart httpd for setting changes to take effect.
Try changing the max file size:
apc.max_file_size = 20M
You are allowing 128M of RAM, which is quite low for big PHP applications these days (a single WordPress install uses 32M):
apc.shm_size 128M
Increase it as well.
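Putting both suggestions together, the relevant apc.ini lines would look something like this (a sketch; the sizes are illustrative and should be tuned to your application, followed by an httpd restart):

apc.shm_size=256M
apc.max_file_size=20M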

Apache + Tomcat with mod_jk: maxThread setting upon load balancing

I have an Apache + Tomcat setup with mod_jk on 2 servers. Each server has its own Apache+Tomcat pair, and every request is served by Tomcat load-balancing workers on the 2 servers.
I have a question about how Apache's MaxClients and Tomcat's maxThreads should be set.
The default numbers are:
Apache: MaxClients=150, Tomcat: maxThreads=200
In this configuration, if we had only 1 server, it would work just fine, as the Tomcat worker never receives more than 150 incoming connections at once. However, if we are load balancing between 2 servers, could the Tomcat worker receive 150 + (some number from the other server) and overflow maxThreads, giving SEVERE: All threads (200) are currently busy?
If so, should I set Tomcat's maxThreads=300 in this case?
Thanks
Setting maxThreads to 300 should be fine; there are no fixed rules. It depends on whether you see any connections being refused.
Increasing it too much causes high memory consumption, but production Tomcats are known to run with 750 threads. See here as well: http://java-monitor.com/forum/showthread.php?t=235
Have you actually seen the SEVERE error? I've tested on our Tomcat 6.0.20 and it throws an INFO message when maxThreads is crossed.
INFO: Maximum number of threads (200) created for connector with address null and port 8080
It does not refuse connections until the acceptCount value is crossed. The default is 100.
From the Tomcat docs (http://tomcat.apache.org/tomcat-5.5-doc/config/http.html):
The maximum queue length for incoming connection requests when all possible request processing threads are in use. Any requests received when the queue is full will be refused. The default value is 100.
The way it works is:
1) As the number of simultaneous requests increases, threads are created up to the configured maximum (the value of the maxThreads attribute).
So in your case, the message "Maximum number of threads (200) created" will appear at this point. However, requests will still be queued for service.
2) If still more simultaneous requests are received, they are queued up to the configured maximum (the value of the acceptCount attribute).
Thus a total of 300 requests can be accepted without failure (assuming your acceptCount is at its default of 100).
3) Beyond that number, Connection Refused errors are thrown until resources become available to process the requests.
So you should be fine until you hit step 3
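For reference, both attributes live on the HTTP Connector in Tomcat's server.xml (a sketch; the port and protocol shown are the usual defaults, the thread values mirror the discussion above):

<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="300"
           acceptCount="100"
           connectionTimeout="20000" />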