Apache mod_proxy_ajp module prematurely sending traffic to spare backend server

We've got a pair of Apache 2.4 web servers (web02, web03) running mod_proxy_ajp talking to a pair of Tomcat 7.0.59 servers (app02, app03).
The Tomcat server on app03 is a standby server that should not get traffic unless app02 is completely offline.
Apache config on web02 and web03:
<Proxy balancer://ajp_cluster>
BalancerMember ajp://app02:8009 route=worker1 ping=3 retry=60
BalancerMember ajp://app03:8009 status=+R route=worker2 ping=3 retry=60
ProxySet stickysession=JSESSIONID|jsessionid lbmethod=byrequests
</Proxy>
Tomcat config for AJP on app02 and app03:
<Connector protocol="AJP/1.3" URIEncoding="UTF-8" port="8009" />
We are seeing issues where Apache starts sending traffic to app03, which is marked as the spare, even when app02 is still available but perhaps a bit busy.
Apache SSL error log:
[Thu Sep 12 14:23:28.028162 2019] [proxy_ajp:error] [pid 24234:tid 140543375898368] (70007)The timeout specified has expired: [client 207.xx.xxx.7:1077] AH00897: cping/cpong failed to 10.160.160.47:8009 (app02)
[Thu Sep 12 14:23:28.028196 2019] [proxy_ajp:error] [pid 24234:tid 140543375898368] [client 207.xx.xxx.7:1077] AH00896: failed to make connection to backend: app02
[Thu Sep 12 14:23:28.098869 2019] [proxy_ajp:error] [pid 24135:tid 140543501776640] [client 207.xx.xxx.7:57809] AH01012: ajp_handle_cping_cpong: ajp_ilink_receive failed, referer: https://site.example.com/cart
[Thu Sep 12 14:23:28.098885 2019] [proxy_ajp:error] [pid 24135:tid 140543501776640] (70007)The timeout specified has expired: [client 207.xx.xxx.7:57809] AH00897: cping/cpong failed to 10.160.160.47:8009 (app02), referer: https://site.example.com/cart
There are hundreds of these messages in our Apache logs.
Any suggestions on settings for making Apache stick to app02 unless it is completely offline?

You are experiencing thread exhaustion in the Tomcat connector causing httpd to think app02 is in a bad state - which, in a way, it is.
The short answer is to switch your Tomcat AJP connector to use protocol="org.apache.coyote.ajp.AjpNioProtocol".
The long answer is, well, rather longer.
mod_jk uses persistent connections between httpd and Tomcat. The historical argument for this is performance: it saves the time of establishing a new TCP connection for each request. Generally, testing shows that this argument doesn't hold, and that the time taken to establish a new TCP connection, or to perform a CPING/CPONG to confirm that an existing connection is valid (which you need to do if you use persistent connections), is near enough the same. Regardless, persistent connections are the default with mod_jk.
When using persistent connections mod_jk creates one connection per httpd worker thread and caches that connection in the worker thread.
The default AJP connection in Tomcat 7.x is the BIO connector. This connector uses blocking I/O and requires one thread per connection.
The issue occurs when httpd is configured with more workers than Tomcat has threads. Initially everything is OK. When an httpd worker encounters the first request that needs to be passed to Tomcat, mod_jk creates the persistent connection for that httpd worker and the request is served. Subsequent requests processed by that httpd worker that need to be passed to Tomcat will use that cached connection. Requests are allocated (effectively) randomly to httpd workers. As more httpd workers see their first request that needs to be passed to Tomcat, mod_jk creates the necessary persistent connection for each worker. It is likely that many of the connections to Tomcat will be mostly idle. How idle will depend on the load on httpd and the proportion of those requests that are passed to Tomcat.
All is well until more httpd workers need to create a connection to Tomcat than Tomcat has threads. Remember that the Tomcat AJP BIO connector requires a thread per connection, so maxThreads is essentially the maximum number of AJP connections that Tomcat will allow. At that point mod_jk is unable to create the connection, and therefore the failover process is initiated.
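To make the mismatch concrete, here is a purely illustrative pairing (the numbers are assumptions, not taken from the question; Tomcat's default maxThreads is 200):
# httpd mpm_event (hypothetical values)
MaxRequestWorkers 400
ThreadsPerChild   25
<!-- Tomcat AJP BIO connector (hypothetical maxThreads) -->
<Connector protocol="AJP/1.3" port="8009" maxThreads="200" />
With this pairing, once the 201st httpd worker tries to open its persistent AJP connection, the connect (or the CPING/CPONG check) fails and the balancer treats the backend as broken.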
There are two solutions. The first - the one I described above - is to remove the one thread per connection limitation. By switching to the NIO AJP connector, Tomcat uses a Poller thread to maintain 1000s of connections, only passing those with data to process to a thread for processing. The limitation for Tomcat processing is then that maxThreads is the maximum number of concurrent requests that Tomcat can process on that Connector.
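A minimal sketch of that change, reusing the connector attributes shown in the question:
<Connector protocol="org.apache.coyote.ajp.AjpNioProtocol" URIEncoding="UTF-8" port="8009" />
With NIO, maxThreads caps concurrent request processing rather than the number of open AJP connections.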
The second solution is to disable persistent connections. mod_jk then creates a connection, uses it for a single request and then closes it. This reduces the number of connections that mod_jk requires at any one point between httpd and Tomcat.
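The question actually uses mod_proxy_ajp rather than mod_jk; there the equivalent knob is disablereuse on the balancer members (mod_jk has a similar JkOptions +DisableReuse flag). A sketch against the question's balancer:
BalancerMember ajp://app02:8009 route=worker1 ping=3 retry=60 disablereuse=On
BalancerMember ajp://app03:8009 status=+R route=worker2 ping=3 retry=60 disablereuse=On
This trades a new TCP connection per request for far fewer idle connections held open against Tomcat.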
Sorry the above is rather a large wall of text. I've also covered this in various presentations including this one.

Related

"Failed to start The Apache HTTP Server" after misconfiguring Froxlor

This all started yesterday after I added a second IP address for port 443 to the "ips and ports" list in Froxlor. As soon as Froxlor's cron job ran, Apache failed to restart. Ever since then, nothing I try will get Apache to stay running with SSL enabled in Froxlor.
System Config:
Ubuntu 20.04.2 LTS (focal)
Apache 2.4.41
Froxlor 0.10.27
Output from sudo systemctl start apache2:
Job for apache2.service failed because the control process exited with error code.
See "systemctl status apache2.service" and "journalctl -xe" for details.
Output from systemctl status apache2.service:
● apache2.service - The Apache HTTP Server
Loaded: loaded (/lib/systemd/system/apache2.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2021-07-19 13:33:31 UTC; 41s ago
Docs: https://httpd.apache.org/docs/2.4/
Process: 17629 ExecStart=/usr/sbin/apachectl start (code=exited, status=1/FAILURE)
systemd[1]: Starting The Apache HTTP Server...
apachectl[17641]: AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName'>
apachectl[17629]: Action 'start' failed.
apachectl[17629]: The Apache error log may have more information.
systemd[1]: apache2.service: Control process exited, code=exited, status=1/FAILURE
systemd[1]: apache2.service: Failed with result 'exit-code'.
systemd[1]: Failed to start The Apache HTTP Server.
Output from sudo journalctl -u apache2.service --since today --no-pager:
systemd[1]: Starting The Apache HTTP Server...
apachectl[17169]: AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.1.1. Set the 'ServerName' directive globally to suppress this message
apachectl[17165]: Action 'start' failed.
apachectl[17165]: The Apache error log may have more information.
systemd[1]: apache2.service: Control process exited, code=exited, status=1/FAILURE
systemd[1]: apache2.service: Failed with result 'exit-code'.
systemd[1]: Failed to start The Apache HTTP Server.
systemd[1]: apache2.service: Unit cannot be reloaded because it is inactive.
"Address already in use" error
Initially I was also getting an error that said apachectl[16500]: (98)Address already in use: AH00072: make_sock: could not bind to address on port 443. Running netstat -anp | grep 443 did not reveal any other processes hogging that port, so I suspected that Apache was trying to use port 443 twice (which tracks with my configuration goof). I managed to get into the database and delete the ip/port record (which had not been assigned to any sites yet) and this particular error went away because Froxlor stopped creating an extra conf file containing Listen 443.
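Roughly the sort of checks involved, shown here as generic command forms rather than my exact output:
sudo ss -ltnp 'sport = :443'          # what, if anything, is bound to port 443
sudo apache2ctl -S                    # dump the parsed Listen/VirtualHost configuration
sudo grep -Rn "Listen 443" /etc/apache2/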
If I comment out both instances of Listen 443 within /etc/apache2/ports.conf, then this particular error goes away but Apache still fails to load.
# If you just change the port or add more ports here, you will likely also
# have to change the VirtualHost statement in
# /etc/apache2/sites-enabled/000-default.conf
Listen 80
#<IfModule ssl_module>
# Listen 443
#</IfModule>
#<IfModule mod_gnutls.c>
# Listen 443
#</IfModule>
# vim: syntax=apache ts=4 sw=4 sts=4 sr noet
(This post with a similar issue offered some insight on this bit)
Output from sudo grep "443" /etc/apache2/*
grep: /etc/apache2/conf-available: Is a directory
grep: /etc/apache2/conf-enabled: Is a directory
grep: /etc/apache2/htpasswd: Is a directory
grep: /etc/apache2/mods-available: Is a directory
grep: /etc/apache2/mods-enabled: Is a directory
/etc/apache2/ports.conf:# Listen 443
/etc/apache2/ports.conf:# Listen 443
grep: /etc/apache2/sites-available: Is a directory
grep: /etc/apache2/sites-enabled: Is a directory
Misc. remarks about Froxlor:
If I comment out \Froxlor\Cron\MasterCron::run(); inside of /var/www/froxlor/scripts/froxlor_master_cronjob.php, then the Froxlor cron job is effectively disabled. Can be useful for troubleshooting, but doesn't fix anything.
Running sudo /usr/bin/php /var/www/froxlor/scripts/froxlor_master_cronjob.php --force will trigger Froxlor to execute its cron job immediately
Current Status:
After many hours of troubleshooting, here is what I know:
When no IP is configured with port 443/SSL, Apache will start.
Deleting the /etc/apache2/sites-enabled/ directory allows Apache to start, until Froxlor's cron job regenerates it.
Likewise, just deleting the *.443.conf files and any ssl.conf files from /etc/apache2/sites-enabled/ also temporarily allows Apache to start (until the Froxlor cron job runs).
Removing Froxlor from the server allows Apache to start, but the problem comes back immediately after configuring port 443 within Froxlor.
TLDR: Something broke when I opened Froxlor and added a second IP with a port that was already in use (port 443). Now Apache won't start unless I delete any .conf file involving SSL. Removing Froxlor (including deleting the database) and deleting sites-enabled before reinstalling Froxlor did not resolve the issue.
EDIT: Regenerated my security certificates and now all is good.
Ugh... multitasking bit me again.
Apparently there was something off with the local security certificate. I regenerated it and Apache started working again.
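For completeness, regenerating a self-signed certificate looks roughly like this (the paths and subject are placeholders, not my actual values; adjust to whatever your vhost's SSLCertificateFile/SSLCertificateKeyFile point at):
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /etc/ssl/private/mysite.key \
    -out /etc/ssl/certs/mysite.crt \
    -subj "/CN=mysite.example"
sudo apache2ctl configtest && sudo systemctl restart apache2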
Related post here

Mod_jk workers - ipv4 / ipv6 /fqdn

I just set up our first machine with Ubuntu 16 LTS and Tomcat 8.5.11 + Apache/2.4.18, mod_jk/1.2.41.
I was quite familiar with Ubuntu 14 LTS, Tomcat 7.0.70 and Apache/2.4.7 with mod_jk/1.2.37.
Deploying my servlet seemed fine, with no errors in the Tomcat or app logs, but the app was still not available.
Using fqdn/app showed a 503 error.
Using ip:port/app worked fine.
I saw these entries in mod_jk.log:
[Fri Feb 24 11:17:49.149 2017] [9219:139689407260416] [info] ajp_connect_to_endpoint::jk_ajp_common.c (1068): (worker1) Failed opening socket to (::1:8009) (errno=111)
[Fri Feb 24 11:17:49.149 2017] [9219:139689407260416] [error] ajp_send_request::jk_ajp_common.c (1728): (worker1) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=111)
-> ::1:8009
Why is my worker.host=localhost resolved to ::1 (IPv6) instead of 127.0.0.1 (IPv4)?
I also saw the "newer" parameter "prefer_ipv6" and set worker.prefer_ipv6=0, but without luck.
Workaround:
When I set worker.host=127.0.0.1, everything works fine, as I am used to.
Downside:
I know a colleague of mine changed the 127.0.0.1 entry to "localhost" in the past for some reason (a different IP stack in processing?), so I am not 100% confident leaving it as the IPv4 address.
Any advice on how I could fix that?
It's a bug in the JK connector, where it always prefers IPv6 when resolving the hostname in the "worker.*.host=" setting. The only way to force the latest JK connector to connect via IPv4 is to use an IPv4 address (rather than a DNS name).
The other alternative is to configure the Tomcat AJP/1.3 connector to listen on IPv6.
Until the folks at Apache fix the bug in the JK connector, these are the only options right now.
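Sketches of both options (the worker name worker1 comes from the log above; the remaining attribute values are assumptions):
# workers.properties: pin the worker to the IPv4 loopback
worker.worker1.type=ajp13
worker.worker1.host=127.0.0.1
worker.worker1.port=8009
<!-- server.xml: or bind the Tomcat AJP connector to the IPv6 loopback -->
<Connector protocol="AJP/1.3" port="8009" address="::1" />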

getting lots of 408 status code in apache access log after migration from http to https

We are getting lots of 408 status codes in the Apache access log, and they started appearing after our migration from HTTP to HTTPS.
Our web server is behind a load balancer; KeepAlive is on and the KeepAliveTimeout value is 15 seconds.
Can someone please help resolve this?
Same problem here, after migrating from HTTP to HTTPS. Do not panic, it is not a bug but a client feature ;)
I suppose that you find these log entries only in the logs of the default (or alphabetically first) Apache SSL conf, and that you have a low Timeout (<20).
From my tests, these are clients establishing pre-connected/speculative sockets to your web server for faster loading of the next page or resource.
Since they only establish the initial socket connection or handshake (150 bytes or a few thousand), they connect to the IP without specifying a vhost name, so they get logged in the default/first Apache conf's log.
A few seconds after the initial connection they either drop the socket if it is not needed, or use it to make a further request faster.
If your Timeout is lower than those few seconds you get the 408; if it is higher, Apache doesn't bother logging it.
So either ignore them / add a separate default conf for Apache, or raise the Timeout, at the cost of more Apache processes sitting busy waiting for the client to drop or use the socket.
See https://bugs.chromium.org/p/chromium/issues/detail?id=85229 for some related discussion.
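A sketch of both mitigations (the vhost name, log path, and certificate paths are placeholders):
# Option 1: raise the connection timeout (keeps more workers busy waiting)
Timeout 30
# Option 2: a catch-all default SSL vhost with its own log, so the
# speculative connections don't pollute the real site's access log
<VirtualHost *:443>
    ServerName catchall.invalid
    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/default.crt
    SSLCertificateKeyFile /etc/ssl/private/default.key
    CustomLog /var/log/apache2/default_ssl_access.log combined
</VirtualHost>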

Idle socket connection to Apache server timeout period

I open a socket connection to an Apache server, but I don't send any requests, waiting for a specific time to do so. How long can I expect Apache to keep this idle socket connection alive?
The situation is that the Apache server has limited resources, and connections need to be allocated in advance before they are all gone.
After a request is sent, the server advertises its timeout policy:
KeepAlive: timeout=15,max=50
If a subsequent request is sent more than 15 seconds later, it gets a 'server closed connection' error. So the server does enforce the policy.
However, it seems that if no request is sent after the connection is opened, Apache will not close it, even for as long as 10 minutes.
Can someone shed some light on Apache's behavior in this situation?
According to Apache Core Features, TimeOut directive, the default timeout is 300 seconds, but it's configurable.
For keep-alive connections (after the first request) the default timeout is 5 sec (see Apache Core Features, KeepAliveTimeout Directive). In Apache 2.0 the default value was 15 seconds. It's also configurable.
Furthermore, there is a mod_reqtimeout Apache Module which provides some fine-tuning settings.
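Roughly what those knobs look like in the configuration (the Timeout/KeepAliveTimeout values repeat the defaults mentioned above; the RequestReadTimeout line is the example tuning from the mod_reqtimeout documentation):
Timeout 300
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5
# mod_reqtimeout: fail requests whose headers/body arrive too slowly
RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500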
I don't think any of the mentioned values are exposed to HTTP clients via HTTP headers or in any other form. (Except the keep-alive value, of course.)

"Broken Pipe" between Apache and GlassFish when using mod_jk

I am using Apache as the front-end to GlassFish 3.1, using mod_jk as the connector. The connection between the two is very unstable - it works about 50% of the time - even when I am the only person on the system. When the problem occurs, the browser gives me an HTTP timeout and the GlassFish server has two types of exceptions in its log:
java.io.IOException
at org.apache.jk.common.JkInputStream.receive(JkInputStream.java:249)
at org.apache.jk.common.JkInputStream.refillReadBuffer(JkInputStream.java:309)
at org.apache.jk.common.JkInputStream.doRead(JkInputStream.java:227)
at com.sun.grizzly.tcp.Request.doRead(Request.java:501)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:336)
at com.sun.grizzly.util.buf.ByteChunk.substract(ByteChunk.java:431)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:357)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:265)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:967)
at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:738)
at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1995)
at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2647)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at org.apache.jk.common.ChannelSocket.send(ChannelSocket.java:580)
at org.apache.jk.common.JkInputStream.doWrite(JkInputStream.java:206)
at com.sun.grizzly.tcp.Response.doWrite(Response.java:685)
at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:420)
On the Apache side, the mod_jk log is completely empty. Once I hit this condition, the only way to recover is to restart the Apache server. The funny thing is that after the restart, the requests that timed out are automatically executed - magically! I have no idea who stores them.
Anyway, I am not at all experienced with Apache and mod_jk and was wondering where to start looking for problems. Software versions I am using are as follows:
Apache: version 2.2.17-2, GlassFish: 3.1, mod_jk: 1.2.30-1
Any help would be much appreciated!
Thanks.
Check the mod_jk log for initialization messages from mod_jk during Apache startup. If no logs are written, then something is wrong with the installation/configuration of the mod_jk module.
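As a starting point, the httpd side normally looks something like this (the paths, worker name, and mount point are assumptions); if the JkLogFile stays empty at info level even across a restart, the module isn't being loaded or configured:
LoadModule jk_module modules/mod_jk.so
JkWorkersFile conf/workers.properties
JkLogFile     logs/mod_jk.log
JkLogLevel    info
JkMount /myapp/* worker1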
Have you created a GlassFish cluster?
If yes, set the -DjvmRoute and -Dcom.sun.web.enterprise.jkenabled JVM options for the cluster, and also check the HTTP network listener on the DAS host that needs to be created to listen for requests from mod_jk (it is initially jk_disabled, so enable it).
If not, check the HTTP network listener for mod_jk on each of the server domains on which you are deploying your application.