Failure tolerance counter workaround/alternatives - monit

I use monit to monitor my daemon, which exposes an HTTP API, and restart it if needed. In addition to checking that the process is not dead, I also added an HTTP check (if failed port 80 protocol http request "/api/status") with a failure tolerance counter (for N cycles). I use the counter to avoid restarting the daemon on isolated failed requests (e.g. due to high load). The problem is that the failure counter does not seem to reset after the daemon is successfully restarted. Consider the following scenario:
1. Monit and the daemon are started.
2. The daemon locks up (e.g. due to a software bug) and stops responding to HTTP requests.
3. Monit waits for N consecutive HTTP request failures and restarts the daemon.
4. The first monit HTTP request after the daemon restart fails again (e.g. because the daemon needs some time to come online and start serving requests).
5. Monit restarts the daemon again. Go to item 4.
This seems to be a bug; in fact, there are issues 64 (fixed) and 787 (open). Since the second issue has already been open for a year, I do not have much hope of it being fixed soon, so I would like to know whether there is a good workaround for this case.

While not exactly what I needed, I ended up with the following alternative:
Use a large enough value for the timeout parameter of the start program token to give the server enough time to come online; monit does not perform the connection tests during this period (see also the start-script sketch after the configuration below).
Use the retry parameter in the if failed port clause to tolerate isolated failures. Unfortunately, the retries are done immediately (after a request failure or timeout), not in the next poll cycle.
Use the for N cycles parameter to improve failure tolerance at least partially.
Basically, I have the following monitrc structure:
set daemon 5
check process server ...
  start program = "..." with timeout 60 seconds
  stop program = "..."
  if failed port 80 protocol http request "/status" with retry 10 and timeout 5 seconds for 5 cycles then restart
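One way to make the start timeout more robust is to have the start program itself block until the daemon actually answers on the status URL, so that by the time monit resumes its connection tests the service is already serving. A minimal sketch of such a wrapper in Python (the daemon command, URL and wait time are placeholders, not taken from my real setup):

#!/usr/bin/env python3
# start_server.py - hypothetical wrapper: launch the daemon, then wait
# until its status endpoint responds before returning to monit.
import subprocess
import time
import urllib.request

DAEMON_CMD = ["/usr/local/bin/server", "--daemonize"]  # placeholder command
STATUS_URL = "http://127.0.0.1:80/status"              # same URL monit checks
WAIT_SECONDS = 50                                      # stay below monit's 60s start timeout

subprocess.check_call(DAEMON_CMD)

deadline = time.time() + WAIT_SECONDS
while time.time() < deadline:
    try:
        with urllib.request.urlopen(STATUS_URL, timeout=2) as resp:
            if resp.status == 200:
                break                                  # daemon is serving; let monit take over
    except OSError:
        pass                                           # not up yet; keep polling
    time.sleep(1)

monit would then point start program at this wrapper (keeping its own with timeout 60 seconds as an upper bound), so the first poll cycle after a restart already hits a responsive daemon.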

Related

Server sends "internal error" response faster after Tomcat upgrade

I recently upgraded our Tomcat server from 7.0.85 to 9.0.70. I am using Apache 2.4.
My Java application runs in a cluster, and it is expected that if the master node fails during a command, the secondary node will take the master role and finish the action.
I have a test that starts an action, performs a failover, and ensures that the secondary node completes the action.
The client sends the request and loops up to 8 times trying to get an answer from the server.
Before the upgrade, the client gets a read timeout for the first 3 or 4 tries, then the secondary finishes the action, sends a 200 response, and the test passes. I can see in the Apache access log that the server is trying to send a 500 (internal error) response for the first tries, but I guess it takes too long and I get a read timeout before that.
After the upgrade, I get a read timeout for the first try, but after that the client receives the internal error response and stops trying. I can see that on the second try the Apache response is much faster than on the first try, and also faster than the corresponding tries (the 2nd, 3rd and 4th) before the upgrade.
I can see in the tcpdump that on the first try (both before and after the upgrade) the connection between Apache and Tomcat reaches the timeout. On the following tries, Tomcat sends Apache a connection reset. The difference is that after the upgrade Tomcat sends the reset immediately after the request, whereas before the upgrade it takes a few seconds to send it.
My socket timeout is 20 seconds, and the AJP timeout is 10 seconds (as it was before the upgrade). I am using the same configuration files as before the upgrade (except for some refactoring changes I had to make because of Tomcat changes). I tried changing the AJP timeout to 20 seconds, but it didn't help.
Is this a configuration issue? Is there a way to “undo” this change?
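The client code is not shown in the question, but the retry behaviour described above boils down to something like the following sketch (using the Python requests library; the URL and retry count are illustrative). A read timeout means "no answer yet, try again", while any HTTP response, including a 500, ends the loop, which is why the much faster 500 after the upgrade makes the test fail before the secondary node can finish:

import requests

URL = "http://apache-frontend/app/command"   # illustrative endpoint
MAX_TRIES = 8
READ_TIMEOUT = 20                            # matches the 20s socket timeout mentioned above

response = None
for attempt in range(MAX_TRIES):
    try:
        # (connect timeout, read timeout); a read timeout means the failover
        # has not completed yet, so the client tries again
        response = requests.post(URL, timeout=(5, READ_TIMEOUT))
        break                                # any HTTP response, even a 500, stops the loop
    except requests.exceptions.Timeout:
        continue

if response is not None and response.ok:
    print("secondary node finished the action")
else:
    print("test failed:", None if response is None else response.status_code)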

boost ssl client/server shutdown timing

I have both a boost ssl client and server. Part of my testing is to use the client to send a small test file (~40K) many times to the server, using the pattern socket_connect, async_send, socket_shutdown each time the file is sent. I have noticed on both the client and the server that the ssl_socket.shutdown() call can take up to 10 milliseconds to complete. Is this typical behavior?
The interesting behavior is that the 10 millisecond completion time does not appear until I have executed the connect/send/shutdown pattern about 20 times.

mod_wsgi daemon process mode has a problem with httpd graceful reload

I'm using httpd -k graceful to reload my server dynamically, and I use time.sleep in my Python code to simulate a slow request. I expected that active requests wouldn't be interrupted by the Apache reload, but they were.
So I tried a simple Python server using CGI, and it works well. Then I tried mod_wsgi running in the Apache processes (specifying only WSGIScriptAlias), and that works well, too.
So I found that the problem is WSGIDaemonProcess, which I had been using originally.
Then in the mod_wsgi docs I found this:
eviction-timeout=sss
When a daemon process is sent the graceful restart signal, usually SIGUSR1, to restart a process, this timeout controls how many seconds the process will wait, while still accepting new requests, before it reaches an idle state with no active requests and shutdown.
If this timeout is not specified, then the value of the graceful-timeout will instead be used. If the graceful-timeout is not specified, then the restart when sent the graceful restart signal will instead happen immediately, with the process being forcibly killed, if necessary, when the shutdown timeout has expired.
Just when I thought I had found the reason, it turned out that these arguments (and I tried graceful-timeout too) didn't work at all. The requests were still interrupted by the graceful reload. Why?
I'm using Apache 2.4.6 with the prefork MPM, and mod_wsgi 4.6.5, which I compiled myself, replacing my old-version mod_wsgi.so.
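For reference, the slow request described above can be reproduced with a trivial WSGI application that sleeps before responding (a sketch; the sleep time is illustrative):

# slow_app.py - minimal WSGI app used to simulate a long-running request
import time

def application(environ, start_response):
    time.sleep(30)  # long enough to overlap with an "httpd -k graceful"
    body = b"done\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

With this running under WSGIDaemonProcess, issuing httpd -k graceful while the request is in flight reproduces the interruption.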
Answer from GrahamDumpleton on GitHub (https://github.com/GrahamDumpleton/mod_wsgi/issues/383):
What you are seeing is exactly as expected. Apache does not pass graceful restart signals on to managed sub processes, it only passes them on to its own child worker processes. For managed processes it will send a SIGTERM and it will brutally kill them after 3 or 5 seconds (can't remember exactly how long) if they haven't shut down. There is no way around it. It is a limitation of Apache.
The eviction timeout thus only applies, as the docs say, when a daemon process is sent a graceful restart signal directly. That is, restarting Apache as a whole gracefully doesn't do anything, but sending the graceful restart signal to the pid of the daemon processes themselves will.
So the only solution, if this behaviour is important, is to ensure you use the display-name option of the WSGIDaemonProcess directive so the daemon processes are named uniquely compared to the Apache processes, and then send signals to them directly.
Usually this only becomes an issue because some Linux systems completely ignore the fact that Apache has a perfectly good log file rotation system and instead do external log file rotation by renaming log files once a day and then attempting a graceful restart. People will see issues with interrupted requests they don't expect. In this case you should use Apache's own log file rotation mechanism if it is important and not rely on external log file rotation systems.
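As a rough illustration of the suggested workaround (not part of the original answer): assuming the daemon process group is configured with something like display-name=%{GROUP}, so the daemon processes show up in ps as "(wsgi:myapp)", the graceful restart signal can be sent to those processes only, instead of gracefully restarting Apache as a whole. The group name is hypothetical, and you should verify with ps that pgrep -f actually sees the rewritten process title on your system:

#!/usr/bin/env python3
# graceful_wsgi_restart.py - send SIGUSR1 directly to the mod_wsgi daemon
# processes instead of gracefully restarting Apache as a whole.
# Assumes WSGIDaemonProcess ... display-name=%{GROUP}, so the processes
# are titled "(wsgi:myapp)"; run as root or as the Apache user.
import os
import signal
import subprocess

DISPLAY_NAME = "wsgi:myapp"   # hypothetical display-name value

# pgrep -f matches against the full command line / process title
pids = subprocess.run(["pgrep", "-f", DISPLAY_NAME],
                      capture_output=True, text=True).stdout.split()

for pid in pids:
    os.kill(int(pid), signal.SIGUSR1)   # graceful restart of just that daemon process
    print("sent SIGUSR1 to", pid)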

RabbitMQ: Server heartbeat must fail 3 times before connection drop?

We have an HA RabbitMQ cluster (v3.2.x) with two nodes sitting behind a load-balancer. Our clients are configured to use a 300s heartbeat. Everything works as expected most of the time.
However, if the client's connection drops (say the client's NIC is disconnected), we have noticed (via TCPDump/wireshark) that the RabbitMQ node will attempt 3 heartbeat messages (in our case nearly 15 mins) before it closes the connection. Why? Why not close it after one failure?
Is there some means to change this behavior on the RabbitMQ server? Or do we have to shorten our heartbeat to something much smaller, like 5s or 10s, in order to get the connection to close sooner? Thoughts?
Related issue...
Looking at the TCPDump (captured on the load-balancer), I wonder why the LB doesn't close the connection when it doesn't receive the TCP ACK from the dead client in response to the proxied RabbitMQ server heartbeat. In fact, the LB will attempt to send the heartbeat several times (never receiving a response, of course). Wouldn't it make sense for the LB to assume the connection has been dropped and close the entire session (including the connection to the RabbitMQ node)?
It appears as though RabbitMQ is configured to tolerate two missed heartbeats before it terminates the connection. However, it waits until the next heartbeat would need to be sent before it drops the connection, and that is what gives it the appearance of requiring 3 missed heartbeats:
Heartbeat 1 (no response), wait, Heartbeat 2 (no response), wait, Heartbeat 3, terminate.
There is a slight bug in RabbitMQ (it sends a third heartbeat but immediately terminates the connection), but it isn't really affecting anything.
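If faster detection is needed, the practical lever is the negotiated heartbeat interval itself: with a much shorter heartbeat, the "two missed heartbeats plus the wait for the third" window shrinks from roughly 15 minutes to well under a minute. A sketch with the Python pika client (host and queue name are placeholders; older pika versions call the parameter heartbeat_interval instead of heartbeat):

import pika

# Negotiate a 10-second heartbeat instead of 300 seconds, so a dead peer
# is detected after roughly 2-3 missed heartbeats (tens of seconds)
# instead of ~15 minutes.
params = pika.ConnectionParameters(
    host="rabbitmq.example.com",      # placeholder: the load-balancer VIP
    heartbeat=10,
    blocked_connection_timeout=30,
)

connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue="test")
channel.basic_publish(exchange="", routing_key="test", body=b"ping")
connection.close()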

Apache webserver - What happens to requests when all worker threads are busy

From my research so far: in the scenario where all worker threads are busy serving requests, what happens to the requests that come next?
Do they wait?
Is this related to some configurable parameters?
Can I get the count of such requests?
In addition, could you explain, or give a link that explains, the request processing strategy of the Apache webserver?
Thanks for looking!
When all Apache worker threads are busy, a new request is stalled (it waits) until one of those worker threads becomes available. If the client gives up waiting, or the maximum wait time in your configuration is exceeded, the connection is dropped.
This answer was given in 2015, so it refers to Apache httpd 2.4.
1. They wait, because the connection is queued on the TCP socket (the connection is not ACKed). Note that the default length of the backlog may be set way too high on Linux boxes, which can result in connections being closed due to kernel limits being in place.
2. ListenBacklog (with caveats; see 1).
3. This is described here, with lots of interesting stuff.
Read through Apache TCP Backlog by Ryan Frantz to get the gory details about the Apache backlog.
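On the "can I get the count" part: Apache itself does not report how many connections are sitting in the listen backlog, but with mod_status enabled you can at least watch how close the worker pool is to saturation; once BusyWorkers reaches your configured worker limit, new connections start queueing. A sketch that polls the machine-readable status page (assumes mod_status is enabled at /server-status; the URL is illustrative):

import urllib.request

STATUS_URL = "http://localhost/server-status?auto"  # machine-readable mod_status output

with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
    text = resp.read().decode("utf-8", errors="replace")

# The ?auto output is "Key: value" lines, e.g. "BusyWorkers: 42"
stats = {}
for line in text.splitlines():
    if ":" in line:
        key, _, value = line.partition(":")
        stats[key.strip()] = value.strip()

print("Busy workers:", stats.get("BusyWorkers"))
print("Idle workers:", stats.get("IdleWorkers"))

The number of connections actually queued on the listening socket has to be read from the kernel instead; on Linux, ss -ltn shows the current accept-queue length (Recv-Q) and the configured backlog (Send-Q) for the listening port.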