Server sends "internal error" response faster after Tomcat upgrade

I recently upgraded our Tomcat server from 7.0.85 to 9.0.70. I am using Apache 2.4.
My Java application runs in a cluster, and it is expected that if the master node fails during a command, the secondary node will take the master role and finish the action.
I have a test that starts an action, performs a failover, and ensures that the secondary node completes the action.
The client sends the request and loops up to 8 times trying to get an answer from the server.
Before the upgrade, the client got a read timeout for the first three or four tries; then the secondary finished the action, sent a 200 response, and the test passed. I can see in the Apache access log that the server tried to send a 500 (internal error) response for those first tries, but I guess it took too long, so the client hit its read timeout first.
After the upgrade, I get a read timeout on the first try, but on the second try the client receives the internal error response and stops retrying. I can see that on the second try Apache responds much faster than on the first try, and much faster than on tries 2, 3, and 4 before the upgrade.
In tcpdump I can see that on the first try (both before and after the upgrade) the connection between Apache and Tomcat hits the timeout. On the following tries Tomcat sends Apache a connection reset. The difference is that after the upgrade Tomcat sends the reset immediately after the request, whereas before the upgrade it took a few seconds.
My socket timeout is 20 seconds and the AJP timeout is 10 seconds (the same as before the upgrade). I am using the same configuration files as before the upgrade (apart from some refactoring I had to do because of Tomcat changes). I tried raising the AJP timeout to 20 seconds, but it didn't help.
Is this a configuration issue? Is there a way to “undo” this change?
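The question does not say whether mod_jk or mod_proxy_ajp is in use; assuming mod_proxy_ajp, the timeouts mentioned above would map to something like the following sketch (host, port, and path are placeholders, not taken from the question):

```apache
# Sketch only, assuming mod_proxy_ajp; host, port, and path are placeholders.
# timeout= is the backend (AJP) timeout discussed above.
ProxyPass "/app" "ajp://localhost:8009/app" timeout=10

# On the Tomcat side (server.xml), the AJP connector's connectionTimeout
# (in milliseconds) controls how long Tomcat keeps an idle AJP connection
# open before closing it:
# <Connector protocol="AJP/1.3" port="8009" connectionTimeout="10000" ... />
```

Comparing the effective values of these settings between the old and new installations would confirm whether the faster reset is a configuration difference or a behavior change in Tomcat itself.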

Related

Failure tolerance counter workaround/alternatives

I use monit to monitor my daemon via the HTTP API and restart it if needed. In addition to just checking that the process is not dead, I also added an HTTP check (if failed port 80 protocol http request "/api/status") with a failure tolerance counter (for N cycles). I use the counter to avoid restarting the daemon on singular failed requests (e.g. due to high load). The problem is that the failure counter does not seem to reset after the daemon is successfully restarted. That is, consider the following scenario:
1. Monit and the daemon are started.
2. The daemon locks up (e.g. due to a software bug) and stops responding to HTTP requests.
3. Monit waits for N consecutive HTTP request failures and restarts the daemon.
4. The first monit HTTP request after the daemon restart fails again (e.g. because the daemon needs some time to come online and start serving requests).
5. Monit restarts the daemon again. Go to item 4.
This seems to be a bug, and indeed there is an issue 64 (fixed) and an issue 787 (open). Since the second issue has been open for a year already, I do not have much hope of it being fixed soon, so I would like to know whether there is a good workaround for this case.
While not exactly what I needed, I ended up with the following alternative:
Use a large enough value for the timeout parameter of the start program token to give the server enough time to come online. Monit does not perform the connection tests during this time.
Use the retry parameter in the if failed port clause to tolerate singular failures. Unfortunately, the retries are performed immediately (after a request failure or timeout), not in the next poll cycle.
Use the for N cycles parameter to improve failure tolerance, at least partially.
Basically, I have the following monitrc structure:
set daemon 5
check process server ...
start program = "..." with timeout 60 seconds
stop program = "..."
if failed port 80 protocol http request "/status" with retry 10 and timeout 5 seconds for 5 cycles then restart

How does Apache detect a stopped Tomcat JVM?

We are running multiple Tomcat JVMs behind a single Apache cluster. If we shut down all the JVMs except one, we sometimes get 503s. If we increase the retry interval to 180 (from retry=10), the problem goes away. That brings me to this question: how does Apache detect a stopped Tomcat JVM? If I have a cluster containing multiple JVMs and some of them are down, how does Apache find that out? Somewhere I read that Apache uses a real request to determine the health of a backend JVM. In that case, will that request fail (with a 5xx) if the JVM is stopped? Why does a higher retry value make the difference? Do you think introducing ping might help?
If someone can explain a bit or point me to some docs, that would be awesome.
We are using Apache 2.4.10, mod_proxy, the byrequests LB algorithm, sticky sessions, keepalive on, and ttl=300 for all balancer members.
Thanks!
Well, let's examine a little what your configuration is actually doing, and then move on to what might help.
[docs]
retry - Whether you set it to 10 or 180, what you specify is for how long Apache will consider your backend server down and thus won't send it requests. So with a higher value you gain time for your backend to come up completely, but you put more load on the others, since you are one server short for longer.
stickysession - If you lose a backend server for whatever reason, all the sessions that are on it get an error.
All right, now that we have described the variables relevant to your situation, let's make clear that Apache mod_proxy does not have an embedded health-check mechanism: it updates the status of your backends based on the responses to real requests.
So your current configuration works as following:
Request arrives on apache
Apache sends it to a live backend
If the request gets an error HTTP code in response, or gets no response at all, Apache puts that backend in the ERROR state.
After the retry time has passed, Apache sends requests to that backend again.
So, reading the above, you understand that the first request to reach a backend server which is down will get an error page.
One of the things you can do is indeed use ping, which according to the docs checks the backend before sending it any request. Consider, of course, the overhead that this produces.
Beyond that, I would suggest you configure mod_proxy_ajp, which offers extra functionality (and configuration, of course) for failover detection of your Tomcat backends.
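As a sketch, a balancer definition combining the retry and ping parameters discussed above might look like this (hostnames, routes, and the mount path are assumptions, not taken from the question):

```apache
# Sketch only; hostnames, routes, and paths are placeholders.
<Proxy "balancer://tomcatcluster">
    # ping=3: with mod_proxy_ajp, send a CPing before each request and
    # wait up to 3 seconds for the CPong; an unresponsive JVM is detected
    # before a real request is wasted on it.
    # retry=30: keep a failed member in ERROR state for 30 seconds.
    BalancerMember "ajp://tomcat1:8009" route=jvm1 retry=30 ping=3 ttl=300
    BalancerMember "ajp://tomcat2:8009" route=jvm2 retry=30 ping=3 ttl=300
    ProxySet lbmethod=byrequests stickysession=JSESSIONID
</Proxy>
ProxyPass "/app" "balancer://tomcatcluster/app"
```

With this setup the CPing/CPong check absorbs the cost of discovering a dead member, instead of the first user request after a failure.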

How to stop Tomcat from processing old request / clear the cache?

In order to test the execution time of some methods, I sent around 250 requests to my backend, which uses Tomcat and Jersey.
Shortly afterwards I realized that there was a mistake in the code, and changed it. Now I have to wait until Tomcat has finished processing all the previous requests. I tried restarting the Tomcat server and redeploying the backend, but no luck.
When I start the server, it continues to process the "old" requests.
How do I flush the cache?

Http server - slow read

I am trying to simulate a slow http read attack against apache server running on my localhost.
But it seems like the server does not complain and simply waits forever for the client to read.
This is what I do:
Request a huge file (say ~1MB) from the http server
Read the response from the server in a loop, waiting 100 seconds between successive reads
Since the file is huge and the client's receive buffer is small, the server has to send the file in multiple chunks. But on the client side, I wait 100 seconds between successive reads. As a result, the server often polls the client and finds that the client's receive window size is zero, since the client has not yet drained its receive buffer.
But it looks like the server does not bother to break the connection; it silently keeps polling the client. The server sends data whenever the client's window size is > 0 and then goes back to waiting on the client.
I want to know whether there are any Apache config parameters I can set to break the connection from the server side after waiting some time for the client to read the data.
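As a sketch (exact behavior can vary by Apache version and MPM), the two usual knobs for this are the core TimeOut directive, which covers stalled writes to the client, and mod_reqtimeout, which covers slow *sending* of requests; the values below are illustrative, not recommendations:

```apache
# Core directive: among other things, how long Apache waits for the TCP
# send to make progress when writing the response to the client - a client
# that keeps its receive window at zero will eventually hit this.
TimeOut 60

# mod_reqtimeout guards the request side (slow headers/body, i.e.
# slowloris-style attacks): allow 20s for headers, extendable to 40s
# as long as data arrives at >= 500 bytes/s.
RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500
```

Note that mod_reqtimeout does not help against the slow-read scenario described above, since there the request itself completes quickly; only the response drains slowly.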
Perhaps this would be more useful to you (it is simpler and saves you time): http://ha.ckers.org/slowloris/ is a Perl script that sends partial HTTP requests. The Apache server leaves those connections open (now unavailable to new users), and if executed on a Linux environment (Linux does not limit threads beyond hardware capability), you can effectively tie up all open sockets and in turn prevent other users from accessing the server. It uses minimal bandwidth because it does not "flood" the server with requests; it simply, slowly takes the sockets hostage. You can download the script here: http://ha.ckers.org/slowloris/slowloris.pl
To prevent an attack like this (well, mitigate) see here: https://serverfault.com/questions/32361/how-to-best-defend-against-a-slowloris-dos-attack-against-an-apache-web-server
You could also use a load-balancer or round-robin setup.
Try slowhttptest to test the slow read attack you're describing. (It can also be used to test slow sending of headers.)

How to configure Glassfish to drop hanging requests?

Can I configure Glassfish to drop any request that takes longer than 10 seconds to process?
Example:
I'm using Glassfish to host my web service. The thread pool is configured to have max 5 connections.
My service has a method that does this:
System.out.println("New request");
Thread.sleep(1000*1000);
I'm creating 5 requests to the service and I see 5 "New request" messages in the log. Then the server stops responding for a looong time.
In the live environment all requests must be processed in under a second. If processing takes longer, there is a problem with the request, and I want Glassfish to drop such requests but stay alive and keep serving the other requests.
Currently I'm using a workaround in the code: at the beginning of my web method I launch a separate thread for the request processing, with a timeout, as was suggested here: How to timeout a thread
I do not like this solution and still believe there must be a configuration setting in Glassfish to apply this logic to all requests, not to just one method.
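For reference, the per-method workaround described above can be sketched with an ExecutorService and Future.get with a timeout; the class, method names, and the "request dropped" fallback string here are illustrative, not Glassfish API:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutWorkaround {

    /** Runs the task, returning its result, or a fallback if it exceeds timeoutMs. */
    public static String runWithTimeout(Callable<String> task, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(task);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true);        // interrupt the hanging worker thread
                return "request dropped";   // placeholder error response
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        } finally {
            pool.shutdownNow();             // release the worker thread
        }
    }

    public static void main(String[] args) {
        // Fast task completes normally.
        System.out.println(runWithTimeout(() -> "ok", 200));           // prints "ok"
        // A task that hangs (like the Thread.sleep above) is dropped after 200 ms.
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(10_000);
            return "never";
        }, 200));                                                      // prints "request dropped"
    }
}
```

This is exactly the pattern the question is trying to avoid repeating in every web method, which is why a container-level request timeout would be preferable.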