How can I configure monit to kill a high CPU process after a few seconds? - monit

I want to use monit to kill a process that uses more than X% CPU for more than N seconds.
I'm using stress to generate load to try a simple example.
My .monitrc:
check process stress
matching "stress.*"
if cpu usage > 95% for 2 cycles then stop
I start monit (I checked syntax with monit -t .monitrc):
monit -c .monitrc -d 5
And I launch stress:
stress --cpu 1 --timeout 60
Stress shows in top as using 100 %CPU.
I'd expect monit to kill stress in about 10 seconds, but stress completes successfully. What am I doing wrong?
I also tried monit procmatch "stress.*", which shows two matches for some reason. Maybe that's relevant?
List of processes matching pattern "stress.*":
stress --cpu 1 --timeout 60
stress --cpu 1 --timeout 60
Total matches: 2
WARNING: multiple processes matched the pattern. The check is FIRST-MATCH based, please refine the pattern
EDIT: Tried e.lopez's method
I had to remove the start statement from .monitrc because it was causing an error in monit ('stress' failed to start (exit status -1) -- Program /usr/bin/stress timed out, followed by a zombie process).
So launched stress manually:
stress -c 1
stress: info: [8504] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
The .monitrc:
set daemon 5
check process stress
matching "stress.*"
stop program = "/usr/bin/pkill stress"
if cpu > 5% for 2 cycles then stop
Launched monit:
monit -Iv -c .monitrc
Starting Monit 5.11 daemon
'xps13' Monit started
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check skipped (initializing)
'stress'
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check succeeded [current cpu usage=0.0%]
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check succeeded [current cpu usage=0.0%]
'stress' process is not running
'stress' trying to restart
'stress' start skipped -- method not defined
Monit sees the right process (pids match), but sees 0% usage (stress is using 1 cpu at 100% per top). I killed stress manually, which is when monit says the process is not running (at the end, above). So, monit is monitoring the process fine, but isn't seeing the right cpu usage.
Any ideas?

Note that if your system has many cores, stressing just one of them (--cpu 1 starts a single worker) will not load the whole system. In my tests on an i7 processor, loading one core to 95% only brought the total system usage to 12.5%.
Depending on the number of cores, you might want to run:
stress -c X
where X is the number of cores you want to stress.
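The 12.5% figure is what one saturated core looks like on a machine with 8 logical CPUs. A minimal sketch of that arithmetic follows; the function name and the 8-CPU count are assumptions for illustration (the answer only says "i7"), and it assumes monit reports usage as a share of total capacity, as the log output suggests.

```python
def system_cpu_percent(busy_workers: int, logical_cpus: int,
                       per_worker_percent: float = 100.0) -> float:
    """Share of total CPU capacity used when each worker loads one logical CPU.

    Hypothetical helper, not part of monit or stress.
    """
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return busy_workers * per_worker_percent / logical_cpus


if __name__ == "__main__":
    # One stress worker at 100% on an assumed 8 logical CPUs.
    print(system_cpu_percent(1, 8))  # 12.5
```

With a threshold of 95%, a single worker on such a machine would never trip the rule, which is why the threshold or the worker count has to be chosen with the core count in mind.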
But this is not your main issue. Your problem is that you do not give monit a stop instruction for the stress program. Look at this:
check process stress
matching "stress.*"
start program = "/usr/bin/stress -c 1" with timeout 10 seconds
stop program = "/usr/bin/pkill stress"
if cpu > 5% for 2 cycles then stop
You are missing at least the "stop" line, which defines the command monit uses to actually stop the process. Since stress is not a service, you can use pkill to kill the process.
I tested the above configuration successfully. Output of the monit.log:
[CET Nov 5 09:03:02] info : 'stress' start action done
[CET Nov 5 09:03:02] info : 'Overlord' start action done
[CET Nov 5 09:03:12] info : Awakened by User defined signal 1
[CET Nov 5 09:03:22] error : 'stress' cpu usage of 12.5% matches resource limit [cpu usage<5.0%]
[CET Nov 5 09:03:32] error : 'stress' cpu usage of 12.4% matches resource limit [cpu usage<5.0%]
[CET Nov 5 09:03:32] info : 'stress' stop: /usr/bin/pkill
So, assuming you are just testing and the actual CPU usage is not relevant, use the config I provided above. Once you are sure your config works, adjust the resource limits for the processes you would like to monitor in a production environment.
Always have at hand: https://mmonit.com/monit/documentation/
Hope it helps.
Regards

I think the reason you're seeing 0% CPU is that stress -c 1 creates two processes: one "worker" process that generates the load and a second, mostly idle background process (open htop and filter for stress to see the second process).
If a regex matches more than one process, monit will pick the process with the longest uptime (check the monit doc) - for me the background process always had a longer uptime than the "worker" process.
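The selection described above can be illustrated with a small sketch. This is not monit's code; it only models "among the matches, take the one that started first". The `Proc` type, the start times and the worker PID 8505 are made up for the example (8504 is the dispatcher PID from the question).

```python
import re
from typing import Iterable, NamedTuple, Optional


class Proc(NamedTuple):
    pid: int
    start_time: float  # seconds since some epoch; smaller means older
    cmdline: str


def pick_longest_uptime(procs: Iterable[Proc], pattern: str) -> Optional[Proc]:
    """Return the matching process with the longest uptime, or None."""
    regex = re.compile(pattern)
    matches = [p for p in procs if regex.search(p.cmdline)]
    # The oldest process has the smallest start time.
    return min(matches, key=lambda p: p.start_time, default=None)


# Hypothetical snapshot: the dispatcher starts first, then forks the worker.
procs = [
    Proc(8505, 100.1, "stress -c 1"),  # worker, burns CPU
    Proc(8504, 100.0, "stress -c 1"),  # dispatcher, mostly idle
]
```

Here `pick_longest_uptime(procs, "stress.*")` returns the dispatcher (PID 8504), so a CPU check bound to that match would read close to 0% even while the worker is at 100%.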
You can mitigate this by using stress-ng. Here the "worker" process has a distinct name so there is no ambiguity when matching.
stress-ng -c 1
works with the following .monitrc file
set daemon 5
check process stress
matching "stress-ng-cpu"
stop program = "/usr/bin/pkill stress-ng"
if cpu > 5% for 2 cycles then stop

Related

How do I make a free VLM from kemp boot on a physical LoadMaster

I know this isn't the right place to ask this but this site has the most users on it. I recently bought a Kemp LoadMaster LM-2600 Load Balancer for my webservers. However, this unit didn't include an SSD because the previous owner decided to erase it. So, I downloaded the VirtualBox version of free VLM from kemp's website. Then, I used VBoxManage clonehd LMOS.vmdk LMOS.img --format RAW to turn the disk into a raw img file. Then, I used dd if=LMOS.img of=/dev/sdb to flash a USB with the os. Then, I booted my loadmaster with the USB.
The boot process went like normal until it finished booting, and then the machine switched to runlevel 0 (shutdown).
These are the logs I got when I plugged the USB into my computer (the log file was so big that Stack Overflow won't allow me to paste it here):
https://pastebin.com/5PbKzRi6
I noticed that it said something about eth0 being down so I plugged in an ethernet cable and booted it again. The same thing happened but I got a different error (The log was shorter so I labeled it):
-- BOOT --
2022-08-07T19:50:06+00:00 lb100 syslog-ng: syslog-ng starting up; version='3.25.1'
-- ERROR --
2022-08-07T19:50:07+00:00 lb100 raid_events_handler: RAID controller not detected yet (check # 0)
-- LOGIN --
2022-08-07T19:50:11+00:00 lb100 login: pam_unix(login:session): session opened for user bal by LOGIN(uid=0)
-- ERROR --
2022-08-07T19:50:14+00:00 lb100 raid_events_handler: RAID controller not detected yet (check # 1)
-- SHUTDOWN --
2022-08-07T19:50:15+00:00 lb100 init: Switching to runlevel: 0
2022-08-07T19:50:15+00:00 lb100 kernel: S99final (938): drop_caches: 1
2022-08-07T19:50:17+00:00 lb100 syslog-ng: syslog-ng shutting down; version='3.25.1'
2022-08-07T19:50:17+00:00 lb100 kernel: Kernel logging (proc) stopped.
2022-08-07T19:50:17+00:00 lb100 kernel: Kernel log daemon terminating.
2022-08-07T19:50:17+00:00 lb100 sslproxy: (815) caught signal 15
2022-08-07T19:50:17+00:00 lb100 raid_events_handler: stop
I have no idea what to do right now. I already tried everything I knew. What should I do?
Any help would be great,
Thanks!

How to include special characters on Monit (monitrc) config file?

OS: CentOS 7.4
Monit: 5.14
WebServer: Nginx 1.12.2
How can I insert special characters (specifically '#') in the config file (/etc/monit.d/monitrc)?
I am trying to do the following:
#Monitor clamd#amavisd service
check process clamd#amavisd with pidfile /var/run/amavisd/amavisd.pid
start program = "/usr/bin/systemctl start clamd#amavisd"
stop program = "/usr/bin/systemctl stop clamd#amavisd"
if cpu usage > 99% for 5 cycles then alert
if mem usage > 99% for 5 cycles then alert
But monit -t issues an error for '#' character. I've tried:
'#' and 'clamd#amavisd'
"#" and "clamd#amavisd"
\#
For other services such as amavisd below, it runs like a charm:
# Monitor Amavis-new service
check process amavisd with pidfile /var/run/amavisd/amavisd.pid
start program = "/usr/bin/systemctl start amavisd"
stop program = "/usr/bin/systemctl stop amavisd"
if cpu usage > 99% for 5 cycles then alert
if mem usage > 99% for 5 cycles then alert
According to documentation:
You might need to use double quotes around the password if it contains special chars such as "p#ssw:r#".

Unable to diagnose MISCONF redis issue while launching celery worker server

I use a celery worker server with redis as the broker url (for receiving tasks) as well as the result backend.
BROKER_URL = 'redis://localhost:6379/2'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/2'
app = Celery('myceleryapp', broker=BROKER_URL,backend=CELERY_RESULT_BACKEND)
I launch the celery worker server using celery -A myceleryapp worker -l info -c 8
The worker processes start processing my tasks from the redis queue until at some point, I receive the infamous MISCONF redis error and the celery worker process terminates.
Unrecoverable error: ResponseError('MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.',)
I checked the redis log files in /var/log/redis and the tail end of the file has the following
24745:C 19 Aug 09:20:26.169 * RDB: 0 MB of memory used by copy-on-write
1590:M 19 Aug 09:20:26.247 * Background saving terminated with success
1590:M 19 Aug 09:25:27.080 * 10 changes in 300 seconds. Saving...
1590:M 19 Aug 09:25:27.081 * Background saving started by pid 25397
25397:C 19 Aug 09:25:27.082 # Write error saving DB on disk: No space left on device
1590:M 19 Aug 09:25:27.181 # Backgroun1590:M 19 Aug 09:51:03.042 * 1 changes in 900 seconds. Saving...
1590:M 19 Aug 09:51:03.042 * Background saving started by pid 26341
26341:C 19 Aug 09:51:03.405 * DB saved on disk
26341:C 19 Aug 09:51:03.405 * RDB: 22 MB of memory used by copy-on-write
1590:M 19 Aug 09:51:03.487 * Background saving terminated with success
The dump.rdb file is being written to /var/lib/redis/dump.rdb.
Since the logs reported a No space left on device, I checked the disk space where /var is mounted and there seems to be sufficient space left (1.2GB).
How do I get to the root cause of this error if there is enough disk space? Of course, to prevent this error from happening, I could set config set stop-writes-on-bgsave-error no in redis-cli. But I want to get to the root cause of this error. Any help or pointers?
This may be caused by a swap file. If a swap file lives on the same disk, it can take up the 1.2 GB that appears to be free, so Redis complains that there is no space to write.
Run "swapon -s" to check this.
I think 1.2 GB is not enough if this disk also holds swapped-out RAM pages. You should move the RDB dir to a larger filesystem.
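To see how much room the dump directory really has at a given moment, something like the following sketch may help. It assumes a Unix system (`os.statvfs`), and the function name and returned keys are made up for illustration. It also reports free inodes, because "No space left on device" can be raised when inodes run out even though bytes are free, and it compares free space with the current dump size on the assumption that Redis writes a new snapshot next to the old one before replacing it.

```python
import os
import shutil


def snapshot_headroom(dump_path: str) -> dict:
    """Report free space and inodes for the directory holding the RDB dump.

    Hypothetical helper; run it as the same user as the Redis server so the
    numbers reflect what that user is allowed to use.
    """
    directory = os.path.dirname(os.path.abspath(dump_path))
    usage = shutil.disk_usage(directory)   # total, used, free in bytes
    stats = os.statvfs(directory)          # Unix only
    dump_bytes = os.path.getsize(dump_path) if os.path.exists(dump_path) else 0
    return {
        "free_bytes": usage.free,
        "free_inodes": stats.f_favail,     # may be 0 on filesystems that do not report inodes
        "dump_bytes": dump_bytes,
        "room_for_new_dump": usage.free > dump_bytes,
    }


if __name__ == "__main__":
    # Path taken from the question.
    print(snapshot_headroom("/var/lib/redis/dump.rdb"))
```

Running it at the time the save fails (for example from cron every minute) is more telling than a single check afterwards, since the log shows later saves succeeding.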

apache2 processes stuck in sending reply - W

I am hosting multiple sites on a server with 7.5gb RAM. Using apache2 mpm_prefork.
The following command returns a value of 200-300 in production:
ps aux|grep -c 'apache2'
Using top, I see that only a few hundred megabytes of RAM are free. The error log shows nothing unusual. Is this many apache2 processes normal?
MaxRequestWorkers is set to 512
Update:
I am now using mod_status to check Apache activity.
I have a row like this:
Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
0-0 29342 2/2/70 W 0.07 5702 0 3.0 0.00 1.67 XXX XXX /someurl
If I check again after some time, the PID has not changed and SS has a greater value than the previous time. M for this request is 'W', the "sending reply" state. Does that mean the apache2 process is locked on that request?
On my VPS and root servers, the situation is partially similar. AFAIK the OS tries to distribute most of the processing power and RAM to running processes and frees the resources for other processes as the need arises.
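To spot workers that stay in 'W' for a long time, the mod_status rows can be filtered by the M and SS columns (SS is the number of seconds since the start of the most recent request). The sketch below assumes the exact column layout shown in the question; other Apache versions may add or reorder columns, and the function name and threshold are made up for illustration.

```python
from typing import List, Tuple

# Column layout as pasted in the question; an assumption for other setups.
COLUMNS = "Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request".split()


def stuck_writers(rows: List[str], min_seconds: int) -> List[Tuple[int, int, str]]:
    """Return (pid, seconds, request) for workers in state W for at least min_seconds."""
    stuck = []
    for row in rows:
        # Split on whitespace, keeping everything after VHost as the request.
        fields = row.split(None, len(COLUMNS) - 1)
        if len(fields) != len(COLUMNS):
            continue  # skip rows that do not fit the assumed layout
        rec = dict(zip(COLUMNS, fields))
        if rec["M"] == "W" and int(rec["SS"]) >= min_seconds:
            stuck.append((int(rec["PID"]), int(rec["SS"]), rec["Request"]))
    return stuck


row = "0-0 29342 2/2/70 W 0.07 5702 0 3.0 0.00 1.67 XXX XXX /someurl"
print(stuck_writers([row], 300))  # [(29342, 5702, '/someurl')]
```

A worker that keeps the same PID while SS grows past the configured request timeout is a reasonable candidate for closer inspection.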

CumulocityLongPollingTransport - canceling the long poll request because of inactivity

I am using the Cumulocity java agent (7.38.0) and it apparently lost communication with the server somehow and never recovered. The admin interface says:
LAST COMMUNICATION
November 22, 2016 2:25 AM
and the last Cumulocity record in the device syslog was:
Nov 22 01:25:47 localhost root: 01:25:47.166 [CumulocityLongPollingTransport-scheduler-2] WARN c.c.s.c.n.ConnectionHeartBeatWatcher - canceling the long poll request because of inactivity
(There was a 1-hour time difference due to a device configuration problem.)
The process appears to be running anyway:
ps -ef | grep -i c8y
root 1341 1257 0 Nov19 ? 00:00:00 /bin/sh ./c8y-agent.sh
root 1342 1341 0 Nov19 ? 00:00:00 /bin/sh ./c8y-agent.sh
root 1344 1342 0 Nov19 ? 00:25:39 java -cp cfg/*:lib/* -Dlogback.configurationFile=cfg/logback.xml c8y.lx.agent.Agent
Has anyone seen this problem before?
We had it once or twice when people were connecting to Cumulocity through a firewall or VPN. The result was exactly as you described: the polling gets stuck after some time, as if connections were blocked. In other words, I would suspect that it’s a proxy that’s blocking the reconnect.