Difference between systemd software watchdog and "normal" process monitoring

Difference between systemd software watchdog and "normal" process monitoring - process

I've tried out 2 systemd unit configurations:
progA.service
[Service]
Type=simple
ExecStart=/opt/progA
WatchdogSec=10s
progB.service
[Service]
Type=simple
ExecStart=/opt/progB
Restart=always
RestartSec=10
The effect in 2 cases is similar: whenever the program killed/crashes/exits, it is restarted after 10s. To my understanding, using watchdog has advantage only if a specific thread/loop inside the program need to be monitored. Am I missing something?

Yes, the watchdog will detect "liveness" above and beyond the Restart directive, which only detects "deadness".
In order to avoid being killed by the watchdog, your service must actively call sd_notify. Imagine if something bad happens that doesn't quite kill your service, like a deadlock. The process would not be killed with a Restart directive, but it would fail to send the sd_notify and would be restarted by the watchdog (as long as the checkups are being sent on the same thread that is deadlocked).

Related

ClientAliveInterval is not closing the idle connection

I have the task to close the idle ssh connection if they are idle for more than 5 minutes. I have tried setting these value on sshd_config
TCPKeepAlive no
ClientAliveInterval 300
ClientAliveCountMax 0
But nothing seems to work the idle remains active and does not get lost even after 5 minutes of idle time.
Then I came across this https://bbs.archlinux.org/viewtopic.php?id=254707 they guys says
These are not for user-idle circumstances, they are - as that man page
excerpt notes - for unresponsive SSH clients. The client will be
unresponsive if the client program has frozen or the connection has
been broken. The client should not be unresponsive simply because the
human user has stepped away from the keyboard: the ssh client will
still receive packets sent from the server.
I can't even use TMOUT because there are ssh client scripts that do not run bash program.
How to achieve this?
Openssh version
OpenSSH_8.2p1 Ubuntu-4ubuntu0.4, OpenSSL 1.1.1f 31 Mar 2020

close the idle ssh connection if they are idle for more than 5 minutes
This task is surprisingly difficult. OpenSSH itself has no functionality to set a idle-timeout on shell commands, probably for a good reason: killing "idle" shells itself is non-trivial:
There's multiple ways to define "idleness", e.g., no stdin, no stdout, no I/O activity whatsoever, no CPU consumption etc
Even when a process is deemed "idle", it's difficult to kill the process and all its child processes that have possibly been created.
Given that, it's not surprising that there's only few solutions for killing idle shell sessions in general. Those that I could find with (little) research rely on background daemons that check the idle status of all processes running on a system (e.g., doinkd/idled, idleout).
One possible solution is to check if any of those solutions can be adapted to enforce an idle timeout on a specific shell session.
Another option is to adapt the OpenSSH source code to support your specific requirement. In principle, OpenSSH should be able to easily access console I/O activity and session duration, so assessing the "idle" property is probably relative easy. As for "killing" the shell and all involved children, running (and killing) the remote shell in a PID namespace is an effective option on Linux systems.
Both options a relatively complex -- so before pursuing them further, I'd further check if there's existing solutions to enforce an idle timeout on a shell session. Using them under OpenSSH will be straightforward.

mod_wsgi DaemonProcess mode gets problem with httpd graceful reload

I'm using httpd -k graceful to dynamically reload my server, and I use time.sleep in python code to make a slow request, and I expected the active requests would't be interrupted after apache reload. But it did.
So I tried a simple python server using CGI, it works well. Then I tried mod_wsgi using apache process (only specifying WSGIScriptAlias), and it works well, too.
So I found that the problem is the WSGIDaemonProcess, which I originally used.
Then in the mod_wsgi doc I found this:
eviction-timeout=sss
When a daemon process is sent the graceful restart signal, usually SIGUSR1, to restart a process, this timeout controls how many seconds the process will wait, while still accepting new requests, before it reaches an idle state with no active requests and shutdown.
If this timeout is not specified, then the value of the graceful-timeout will instead be used. If the graceful-timeout is not specified, then the restart when sent the graceful restart signal will instead happen immediately, with the process being forcibly killed, if necessary, when the shutdown timeout has expired.
when I thought I'm going to find the reason, I found that these arguments(and i tried graceful-timeout too) didn't work at all.The requests were still interrupted by graceful reload. So why?
I'm using apache 2.4.6, with mpm mode prefork. And modwsgi 4.6.5, I compiled it myself and replaced my old-version mod_wsgi.so with it.

answer from GrahamDumpleton#Github: (https://github.com/GrahamDumpleton/mod_wsgi/issues/383)
What you are seeing is exactly as expected. Apache does not pass graceful restart signals onto managed sub processes, it only passes them onto its own child worker processes. For managed processes it will send a SIGTERM and it will brutally kill them after 3 or 5 seconds (can't remember exactly how long) if they haven't shutdown. There is no way around it. It is a limitation of Apache.
The eviction timeout thus only applies as the docs say to when a 'daemon process' is sent a graceful restart signal directly. That is, restarting Apache as a whole gracefully doesn't do anything, but send the graceful restart signal to the pid of the daemon processes themselves will.
So the only solution if this behaviour is important is to ensure you use display-name option to WSGIDaemonProcess directive so daemon processes named uniquely compared to Apache processes, and then send signals to them direct only.
Usually this only becomes an issue because some Linux systems completely ignore the fact that Apache has a perfectly good log file rotation system and instead do external log file rotation by renaming log files once a day and then attempting a graceful restart. People will see issues with interrupted requests they don't expect. In this case you should use Apache's own log file rotation mechanism if it is important and not rely on external log file rotation systems.

Docker Redis container orderly shutdown

I am running redis-server in a Docker container on Ubuntu 14.10 x64. If I access the redis database via phpRedisAdmin, do a few edits and then get them to be saved to disk, shutdown the container and then restart it everything is fine - the edited redis keys are present and correct. However, if I edit keys and then shut down the container then restart it the edits do not stick.
Clearly, the dump.rdb file is not being saved automatically when the container is shutdown. I imagine that I could fix this by putting in an /etc/init.d script that is symlinked from /etc/rc6.d. However, I am wondering - why does shutting down a redis container not perform an orderly shutdown of the running process(es) in the container? After all, when I reboot my server (both the server & the container run Ubuntu 14.10) I do not have to explicitly commit the redis db changes to disk.

The main process in a Docker container will be sent a SIGTERM signal when you run docker stop -t N CONTAINER. The process should then begin to shut itself down cleanly. If after N seconds (10 by default) this still hasn't happened, Docker will use a SIGKILL signal, which will kill the process without giving it a chance to clean up. The reason you were having problems was probably because you simply weren't giving Redis long enough to shutdown cleanly.
It's important to note that only the main process in the container (PID 1) will be sent signals. This means that the main process must be responsible for shutting down any child processes in the container, or you can end up with zombie processes.
If you still have problems with redis not doing what you want on shutdown, you could wrap it in a script which acts as PID 1, catches the SIGTERM signal and does whatever tidying up you want (just make sure you do shutdown redis and any other processes you've started).

Is an XPC interruption handler called when launchd kills the process?

The documentation for the interruptionHandler block of NSXPCConnection states:
An interruption handler that is called if the remote process exits or crashes.
However, the Daemons and Services Programming Guide states:
XPC services are managed by launchd, which launches them on demand, restarts them if they crash, and terminates them (by sending SIGKILL) when they are idle. This is transparent to the application using the service, except for the case of a service that crashes while processing a message that requires a response. In that case, the application can see that its XPC connection has become invalid until the service is restarted by launchd
If an XPC process is killed for being idle, will I get a callback in my interruptionHandler? Or will I only get the callback when the app crashes while processing a message? I ask because this test case seems like it's impossible to simulate. XPC service lifecycle is unfortunately a very black box.

Yes the interruption handler will be called if launchd stops the service for being idle.
This can be simulated by leveraging the natural reaction launchd has to memory pressure: stopping all launchd launched services that are idle to help relieve the issue.
A simulated warn level of memory pressure should be enough, here is how you do it:
sudo memory_pressure -S -l warn
And for critical:
sudo memory_pressure -S -l critical
This condition is often missed when testing XPC Services. However it is recommended XPC services are designed to be stateless, so in most cases it should not matter if your service is stopped and can be restarted by launchd the next time you send a message. And ideally you invalidated the connection when you were last done with it.
Launchd will not stop an XPC service with the above conditions if there is an ongoing XPC transaction (read: a message is being handled and/or the reply block has not been invoked).

Monit - stop service and stay stopped?

I have a daemon which runs via the usual init.d/service scripts.
I have monit running which ensures these daemons are restarted if they crash.
I have a request that 'service foo stop' should stop the deamon, and because it was explicitly stopped, not a crash, monit should not restart it. How can I achieve this with monit?
I could have the service script's stop() routine call 'monit unmonitor' but this seems circular and wrong.
Thanks,
Dave

I think you should use monit stop foo instead of service foo stop. That way Monit is aware that the service didn't crash -- and won't restart it.

There is a MODE param for that:
Monit supports three monitoring modes per service: active, passive and manual.
Syntax:
MODE
In active mode (the default), Monit will pro-actively monitor a service and in case of problems raise alerts and/or restart the service.
In passive mode, Monit will passively monitor a service and will raise alerts, but will not try to fix a problem by executing start, stop or restart.
In manual mode, Monit will enter active mode only if a service was started via Monit
From here: https://mmonit.com/monit/documentation/monit.html#SERVICE-MONITORING-MODE
This way if you manage services via runit or upstart and just want to use monit for alerts and dashboards you simply set for all such services mode to passive.
For example:
check process heka with pidfile /etc/sv/myservice/supervise/pid
start program = "/usr/bin/sv start myservice"
stop program = "/usr/bin/sv stop myservice"
mode passive
If you need to enable/disable that online but not permanently -- please refer to other people's answers, they are fine.

The model is:
Monit runs as a service by init.d and therefore controlled (stop/start/restart) by init.d . (Others, please me if I am wrong).
Applications that require to be monitored are handled by monit.
Therefore, such applications should be only controlled i.e. stop/start/restart via monit.
monit

SET ONREBOOT LASTSTATE
As per: https://mmonit.com/monit/documentation/monit.html#SYSTEM-REBOOT-AND-SERVICE-STARTUP

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas