We monitor several servers with Monit 5.25.1.
Some are dedicated Apache servers, and the monitoring itself works.
But the Monit log (/var/log/monit) looks like this:
[CET Mar 18 03:12:03] info : Starting Monit 5.25.1 daemon with http interface at [0.0.0.0]:3353
[CET Mar 18 03:12:03] info : Monit start delay set to 180s
[CET Mar 18 03:15:03] info : 'xxxxx.localhost' Monit 5.25.1 started
[CET Mar 18 03:15:03] error : 'apache' error -- unknown resource ID: [5]
[CET Mar 18 03:16:08] error : 'apache' error -- unknown resource ID: [5]
[CET Mar 18 03:17:08] error : 'apache' error -- unknown resource ID: [5]
[CET Mar 18 03:18:08] error : 'apache' error -- unknown resource ID: [5]
The configuration file /etc/monit.conf looks like this:
###############################################################################
## Monit control file
###############################################################################
###############################################################################
## Global section
###############################################################################
##
## Start Monit in the background (run as a daemon):
# check services at 2-minute intervals
# with start delay 240 # optional: delay the first check by 4-minutes (by
# # default Monit check immediately after Monit start)
set daemon 60
with start delay 180
### Set the location of the Monit id file which stores the unique id for the
### Monit instance. The id is generated and stored on first Monit start. By
### default the file is placed in $HOME/.monit.id.
#
set idfile /var/.monit.id
## Set the list of mail servers for alert delivery. Multiple servers may be
## specified using a comma separator. By default Monit uses port 25 - it is
## possible to override this with the PORT option.
#
# set mailserver mail.bar.baz, # primary mailserver
# backup.bar.baz port 10025, # backup mailserver on port 10025
# localhost # fallback relay
set mailserver localhost
## By default Monit will drop alert events if no mail servers are available.
## If you want to keep the alerts for later delivery retry, you can use the
## EVENTQUEUE statement. The base directory where undelivered alerts will be
## stored is specified by the BASEDIR option. You can limit the maximal queue
## size using the SLOTS option (if omitted, the queue is limited by space
## available in the back end filesystem).
#
set eventqueue
basedir /var/monit # set the base directory where events will be stored
slots 100 # optionally limit the queue size
## Send status and events to M/Monit (for more informations about M/Monit
## see http://mmonit.com/).
#
# set mmonit http://monit:monit@192.168.1.10:8080/collector
#
#
## Monit by default uses the following alert mail format:
##
##
## You can override this message format or parts of it, such as subject
## or sender using the MAIL-FORMAT statement. Macros such as $DATE, etc.
## are expanded at runtime. For example, to override the sender, use:
#
# set mail-format { from: monit@foo.bar }
#
#
## You can set alert recipients whom will receive alerts if/when a
## service defined in this file has errors. Alerts may be restricted on
## events by using a filter as in the second example below.
#
set alert fake@mail.com not on { instance }
# receive all alerts
# set alert manager@foo.bar only on { timeout } # receive just service-
# # timeout alert
#
mail-format {
from: xxxxxxx@monit.localhost
subject: $SERVICE => $EVENT
message:
DESCRIPTION : $DESCRIPTION
ACTION : $ACTION
DATE : $DATE
HOST : $HOST
Sorry for the spam.
Monit
}
## Monit has an embedded web server which can be used to view status of
## services monitored and manage services from a web interface. See the
## Monit Wiki if you want to enable SSL for the web server.
#
set httpd port 3353 and
use address 0.0.0.0
allow yyyyy:zzzz
###############################################################################
## Services
###############################################################################
##
## Check general system resources such as load average, cpu and memory
## usage. Each test specifies a resource, conditions and the action to be
## performed should a test fail.
#
check system xxxxxx.localhost
if loadavg (1min) > 8 for 5 cycles then alert
if loadavg (5min) > 4 for 5 cycles then alert
if memory usage > 75% for 5 cycles then alert
if cpu usage (user) > 70% for 5 cycles then alert
if cpu usage (system) > 50% for 5 cycles then alert
if cpu usage (wait) > 50% for 5 cycles then alert
check process apache with pidfile /var/run/httpd/httpd.pid
group www
start program = "/etc/init.d/httpd start" with timeout 60 seconds
stop program = "/etc/init.d/httpd stop"
if failed host localhost port 80 then restart
if cpu > 60% for 2 cycles then alert
if cpu > 80% for 5 cycles then restart
if loadavg(5min) greater than 10 for 8 cycles then restart
if 3 restarts within 5 cycles then timeout
###############################################################################
## Includes
###############################################################################
##
## It is possible to include additional configuration parts from other files or
## directories.
#
# include /etc/monit.d/*
#
#
# Include all files from /etc/monit.d/
include /etc/monit.d/*
In the Monit web UI everything looks fine, and the monitoring works as expected. We can stop and restart the service as we want.
So I don't understand the line 'error : 'apache' error -- unknown resource ID: [5]' that we find in the Monit log.
Does anyone have an idea about it?
Thanks for your help.
I had the same problem.
mmonit said that loadavg is for "check system" only. It used to work for apache, but not anymore:
"The loadavg statement can be used in "check system" context only (load average is system property, not process'). Please remove the following statement and reload monit"
So disable this line by putting a # at the start of it:
# if loadavg(5min) greater than 10 for 8 cycles then restart
Then restart Monit:
service monit restart
You will no longer receive the apache error.
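To find such statements before reloading, one option is to scan the control file for loadavg tests that sit outside a "check system" block. The following is a minimal Python sketch and not part of Monit: the function name find_misplaced_loadavg and the sample text are made up for illustration, and the parsing is deliberately naive (any line starting with "check" is treated as a service header).

```python
import re


def find_misplaced_loadavg(config_text):
    """Return (line_number, service_name) for loadavg tests outside 'check system'."""
    problems = []
    current_type, current_name = None, None
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are ignored
        header = re.match(r"check\s+(\w+)\s+(\S+)", stripped, re.IGNORECASE)
        if header:
            current_type = header.group(1).lower()
            current_name = header.group(2)
            continue
        uses_loadavg = re.search(r"\bloadavg\b", stripped, re.IGNORECASE)
        if uses_loadavg and current_type not in (None, "system"):
            problems.append((number, current_name))
    return problems


# Shortened, hypothetical excerpt modelled on the configuration in the question.
SAMPLE = """\
check system xxxxxx.localhost
  if loadavg (1min) > 8 for 5 cycles then alert
check process apache with pidfile /var/run/httpd/httpd.pid
  if cpu > 80% for 5 cycles then restart
  if loadavg(5min) greater than 10 for 8 cycles then restart
  # if loadavg(5min) greater than 10 for 8 cycles then restart
"""

print(find_misplaced_loadavg(SAMPLE))
```

On the sample it reports line 5 in the apache block and skips both the commented-out copy and the loadavg test under "check system".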
I am hosting multiple sites on a server with 7.5 GB of RAM, using apache2 with mpm_prefork.
The following command gives me a value of 200-300 in production:
ps aux|grep -c 'apache2'
Using top, I see that only a few hundred megabytes of RAM are free. The error log shows nothing unusual. Is this many apache2 processes normal?
MaxRequestWorkers is set to 512.
Update:
I am now using mod_status to check Apache activity.
I have a row like this:
Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
0-0 29342 2/2/70 W 0.07 5702 0 3.0 0.00 1.67 XXX XXX /someurl
If I check again after some time, the PID has not changed and SS has a greater value than the previous time. M for this request is 'W' (sending reply). Does that mean the apache2 process is locked in that request?
On my VPS and root servers, the situation is partially similar. AFAIK the OS tries to distribute most of the processing power/RAM to running processes and frees the resources for other processes as the need arises.
I am attempting to monitor a service. My Monit definition is below.
When I invoke monit -r I receive: /etc/monit/conf.d/authentication.monit:10: syntax error 'else'
check host self with address myhost
start program = "/usr/bin/service start authentication"
stop program = "/usr/bin/service stop authentication"
if failed port 443 protocol https
request /
with timeout 5 seconds
for 2 cycles
then restart
if 1 restarts within 4 cycles then exec "/etc/monit/pagerduty-trigger authentication" else if passed for 2 cycles then exec "/etc/monit/pagerduty-resolve authentication"
All the documentation seems to indicate my syntax is correct.
I am attempting to follow the two docs
pagerduty
primary docs
The syntax is:
IF test THEN action [ELSE IF SUCCEEDED THEN action]
Also, the "if x restarts within y cycles then ..." statement doesn't support the "else" part: https://mmonit.com/monit/documentation/monit.html#SERVICE-RESTART-LIMIT
An "else" for a restart limit makes little sense, as there is no opposite of a restart.
I'm trying to follow your logic for the if/then/else actions, but I don't understand the else part.
Here you want to "stop" then "start" the service "authentication" when https://myhost:443/ has failed two times (call this T0).
Then, on the next cycle, you want to run the script /etc/monit/pagerduty-trigger authentication (T0 + 1 cycle).
I'm not sure why you use "within 4 cycles" rather than fewer, such as 2, but OK.
I suppose that at T0 + 1 + 2 cycles, if the service is online again, you want to run "/etc/monit/pagerduty-resolve authentication".
One solution is to handle it at the level of your failed test, with custom scripts:
if failed port 443 protocol https
request /
with timeout 5 seconds
for 2 cycles
then exec "/var/lib/monit/scripts/notifyAndExecute.sh"
else if succeeded then exec "/etc/monit/pagerduty-resolve authentication"
Then create the file /var/lib/monit/scripts/notifyAndExecute.sh, which is in charge of restarting the service and calling /etc/monit/pagerduty-trigger authentication.
I'm not sure if you are still looking at this. I am also integrating Monit with PagerDuty, and I have a simpler example that works. What I note is that 'else' appears to support only "else if succeeded". I think it is simply a longhand version of 'else', without the ability to add a more sophisticated expression like the one you are attempting.
Here is my example, which triggers when the service (a process in Monit) does not exist and resolves when it does.
check process tomcat8 with pidfile /var/run/tomcat8.pid
if not exist
then exec "/etc/monit/pagerduty-trigger tomcat8"
else if succeeded then exec "/etc/monit/pagerduty-resolve tomcat8"
I want to use monit to kill a process that uses more than X% CPU for more than N seconds.
I'm using stress to generate load to try a simple example.
My .monitrc:
check process stress
matching "stress.*"
if cpu usage > 95% for 2 cycles then stop
I start monit (I checked syntax with monit -t .monitrc):
monit -c .monitrc -d 5
And I launch stress:
stress --cpu 1 --timeout 60
Stress shows in top as using 100 %CPU.
I'd expect monit to kill stress in about 10 seconds, but stress completes successfully. What am I doing wrong?
I also tried monit procmatch "stress.*", which shows two matches for some reason. Maybe that's relevant?
List of processes matching pattern "stress.*":
stress --cpu 1 --timeout 60
stress --cpu 1 --timeout 60
Total matches: 2
WARNING: multiple processes matched the pattern. The check is FIRST-MATCH based, please refine the pattern
EDIT: Tried e.lopez's method
I had to remove the start statement from .monitrc because it was causing an error in Monit ('stress' failed to start (exit status -1) -- Program /usr/bin/stress timed out, followed by a zombie process).
So launched stress manually:
stress -c 1
stress: info: [8504] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
The .monitrc:
set daemon 5
check process stress
matching "stress.*"
stop program = "/usr/bin/pkill stress"
if cpu > 5% for 2 cycles then stop
Launched monit:
monit -Iv -c .monitrc
Starting Monit 5.11 daemon
'xps13' Monit started
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check skipped (initializing)
'stress'
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check succeeded [current cpu usage=0.0%]
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check succeeded [current cpu usage=0.0%]
'stress' process is not running
'stress' trying to restart
'stress' start skipped -- method not defined
Monit sees the right process (pids match), but sees 0% usage (stress is using 1 cpu at 100% per top). I killed stress manually, which is when monit says the process is not running (at the end, above). So, monit is monitoring the process fine, but isn't seeing the right cpu usage.
Any ideas?
Note that if your system has many cores, stressing just one of them (--cpu 1) will not stress the whole system. In my tests with an i7 processor, stressing one CPU to 95% only brings the total system usage to 12.5%.
Depending on the number of cores, you might want to use:
stress -c X
where X is the number of cores you want to stress.
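The arithmetic behind that 12.5% figure can be sketched in a few lines of Python. This assumes 8 logical CPUs, which the answer does not state but which the figure suggests; the function name system_cpu_percent is made up for illustration.

```python
def system_cpu_percent(busy_cores, logical_cpus, load_per_core=100.0):
    """Share of total CPU capacity used when busy_cores each run at load_per_core percent."""
    return busy_cores * load_per_core / logical_cpus


# Assumption: 8 logical CPUs (e.g. 4 cores with hyper-threading).
one_worker = system_cpu_percent(1, 8)   # a single fully loaded worker
all_workers = system_cpu_percent(8, 8)  # one worker per logical CPU

print(one_worker, all_workers)
```

With one worker the whole-system figure stays at 12.5%, so a threshold such as "cpu usage > 95%" would not be reached on such a machine if the value is measured against total capacity.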
But this is not your main issue. Your problem is that you do not provide Monit with a stop instruction for the stress program. Look at this:
check process stress
matching "stress.*"
start program = "/usr/bin/stress -c 1" with timeout 10 seconds
stop program = "/usr/bin/pkill stress"
if cpu > 5% for 2 cycles then stop
You are missing at least the "stop" line, where you define the command which will be used by monit to actually stop the process. As stress is not a service, you might want to use the pkill instruction in order to kill the process.
I tested the above configuration successfully. Output of the monit.log:
[CET Nov 5 09:03:02] info : 'stress' start action done
[CET Nov 5 09:03:02] info : 'Overlord' start action done
[CET Nov 5 09:03:12] info : Awakened by User defined signal 1
[CET Nov 5 09:03:22] error : 'stress' cpu usage of 12.5% matches resource limit [cpu usage<5.0%]
[CET Nov 5 09:03:32] error : 'stress' cpu usage of 12.4% matches resource limit [cpu usage<5.0%]
[CET Nov 5 09:03:32] info : 'stress' stop: /usr/bin/pkill
So, assuming you just want to test and the actual CPU usage is not relevant, just use the config I provided above. Once you are sure your config works, adjust the resource limits for the processes you would like to monitor in a production environment.
Always have at hand: https://mmonit.com/monit/documentation/
Hope it helps.
Regards
I think the reason you're seeing 0% CPU is that stress -c 1 creates two processes: one "worker" process that creates the load, and a second, mostly idle background process (open htop and filter for stress to see the second process).
If a regex matches more than one process, monit will pick the process with the longest uptime (check the monit doc) - for me the background process always had a longer uptime than the "worker" process.
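The selection rule described here can be illustrated with a small Python sketch. It is not Monit's code; it only models "among all matches, take the one with the longest uptime". The Proc class, the uptime values, the worker pid 8505, the stress-ng pids and the exact command strings are hypothetical. Pid 8504 is taken from the question, where it is both the pid Monit reports and the pid stress prints in its "dispatching hogs" line, which is consistent with Monit watching the idle parent.

```python
import re
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    command: str
    uptime_s: int
    cpu_percent: float


def pick_match(processes, pattern):
    """Model of the rule above: of all matching processes, keep the oldest one."""
    matches = [p for p in processes if re.search(pattern, p.command)]
    return max(matches, key=lambda p: p.uptime_s) if matches else None


# Hypothetical process tables: the parent starts slightly before its worker.
stress = [
    Proc(8504, "stress -c 1", 31, 0.0),    # idle parent
    Proc(8505, "stress -c 1", 30, 100.0),  # busy worker, same command line
]
stress_ng = [
    Proc(9001, "stress-ng -c 1", 31, 0.0),
    Proc(9002, "stress-ng-cpu", 30, 100.0),  # worker with a distinct name
]

watched = pick_match(stress, "stress.*")
watched_ng = pick_match(stress_ng, "stress-ng-cpu")
print(watched.pid, watched.cpu_percent)
print(watched_ng.pid, watched_ng.cpu_percent)
```

In this model the ambiguous pattern lands on the idle parent and reports 0% CPU, while the narrower pattern can only match the worker.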
You can mitigate this by using stress-ng. Here the "worker" process has a distinct name so there is no ambiguity when matching.
stress-ng -c 1
works with the following .monitrc file
set daemon 5
check process stress
matching "stress-ng-cpu"
stop program = "/usr/bin/pkill stress-ng"
if cpu > 5% for 2 cycles then stop
We are using Nagios to monitor our network with great success. However, we have a syslog for critical application errors, and while I set up check_log, it doesn't seem to work as well as monitoring a device.
The issues are:
It only shows the last entry
There doesn't seem to be a way to acknowledge the critical error and return the monitor to a good state
Is Nagios the wrong tool, or are we just not setting up the service monitoring right?
Here are my entries:
# log file
define command{
command_name check_log
command_line $USER1$/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.log -q ?
}
# Define the log monitoring service
define service{
name logfile-check ;
use generic-service ;
check_period 24x7 ;
max_check_attempts 1 ;
normal_check_interval 5 ;
retry_check_interval 1 ;
contact_groups admins ;
notification_options w,u,c,r ;
notification_period 24x7 ;
register 0 ;
}
define service{
use logfile-check
host_name localhost
service_description CritLogFile
check_command check_log
}
For monitoring logs with Nagios, typically the log checker will return a warning only for newly discovered error messages each time it is invoked (so it must retain some state in order to know to ignore them on subsequent runs). Therefore I usually set:
max_check_attempts 1
is_volatile 1
This causes Nagios to send out the alert immediately, but only once, and then go back to normal.
My favorite log checker is logwarn, but I'm biased because I wrote it myself after not finding any existing ones that I liked. The logwarn package includes a Nagios plugin.
Nothing in your config jumps out at me as being misconfigured.
By design, check_log will only show either an OK message, or the last log entry that triggered an alert. If you need to see multiple entries, you'll need to modify the plugin.
However, I find the fact that you're not getting recoveries somewhat odd. The way check_log works (by comparing the current log to the previous version), you should get a recovery on the very next service check. Except of course, when there have been additional matching entries added to the log since the last check.
Does forcing another service check (or several) cause it to recover?
Also, I don't intend this in a mean way, but make sure it's really malfunctioning.
Is your log getting additional matching entries in between checks, causing it not to recover? Your check is matching "?" which will match anything new in the log. Is something else (a non-error) being added to the log and inadvertently causing a match?
If none of the above are the issue, I would suggest narrowing it down by taking Nagios out of the equation. Try running check_log manually (from the command line, but as the same user as nagios), and with a different oldlog. It should go something like this -
run check with a new "oldlog" - get initialization message
run check - check OK
make change to log
run check - check fails
run check - check OK
If this doesn't work, then you know to focus on the log, the oldlog, and how the check_log is doing the check.
If it works, then it points more towards a problem with your nagios configuration.
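The expected sequence can be reproduced with a small Python model of the compare-with-oldlog technique. This is a sketch under stated assumptions, not the real check_log plugin: the function, the messages, the "ERROR" query and the file names are made up, matching is a plain substring test, the log is assumed to be append-only (no rotation), and initialization is reported as OK.

```python
import os
import shutil
import tempfile

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_log(logfile, oldlog, query):
    """Report lines containing query that were added since the previous run."""
    with open(logfile) as fh:
        current = fh.read().splitlines()
    if not os.path.exists(oldlog):
        shutil.copyfile(logfile, oldlog)
        return OK, "initialized, nothing to compare yet"
    with open(oldlog) as fh:
        seen = len(fh.read().splitlines())
    shutil.copyfile(logfile, oldlog)  # the next run compares against this state
    hits = [line for line in current[seen:] if query in line]
    if hits:
        return CRITICAL, f"({len(hits)}) {hits[-1]}"  # only the last hit is shown
    return OK, "no new matches"


def append(path, text):
    with open(path, "a") as fh:
        fh.write(text)


with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "appcrit.log")
    old = os.path.join(tmp, "appcrit.old")
    append(log, "service started\n")
    results = [check_log(log, old, "ERROR")]      # new oldlog: initialization
    results.append(check_log(log, old, "ERROR"))  # nothing new: OK
    append(log, "ERROR disk full\n")              # make a change to the log
    results.append(check_log(log, old, "ERROR"))  # new match: check fails
    results.append(check_log(log, old, "ERROR"))  # nothing new again: OK

states = [state for state, _ in results]
print(states)
```

The model also shows why the check recovers by itself on the next run when nothing new matches, and why only the last matching entry appears in the output.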
There is a Nagios plugin that you can use to check the log files: it's called check_logfiles and it's used to scan the lines of a file for regular expressions.
The following link shows how to install and configure check_logfiles for Nagios and Opsview:
https://www.opsview.com/resources/nagios-alternative/blog/syslog-monitoring-nagios-opsview
As there are many ways to achieve a goal, there is also a nice plugin from Consol available:
https://labs.consol.de/lang/en/nagios/check_logfiles/
supports regex
supports log rotation
To use it, you need a cfg file. This is an example for Oracle databases:
@searches = ({
tag => 'oraalerts',
options => 'sticky=28800',
logfile => '/u01/app/oracle/diag/rdbms/davmdkp/DAVMDKP1/trace/alert_DAVMDKP1.log',
criticalpatterns => [
'ORA\-0*204[^\d]', # error in reading control file
'ORA\-0*206[^\d]', # error in writing control file
'ORA\-0*210[^\d]', # cannot open control file
'ORA\-0*257[^\d]', # archiver is stuck
'ORA\-0*333[^\d]', # redo log read error
'ORA\-0*345[^\d]', # redo log write error
'ORA\-0*4[4-7][0-9][^\d]',# ORA-0440 - ORA-0485 background process failure
'ORA\-0*48[0-5][^\d]',
'ORA\-0*6[0-3][0-9][^\d]',# ORA-0600 - ORA-0639 internal errors
'ORA\-0*1114[^\d]', # datafile I/O write error
'ORA\-0*1115[^\d]', # datafile I/O read error
'ORA\-0*1116[^\d]', # cannot open datafile
'ORA\-0*1118[^\d]', # cannot add a data file
'ORA\-0*1122[^\d]', # database file 16 failed verification check
'ORA\-0*1171[^\d]', # datafile 16 going offline due to error advancing checkpoint
'ORA\-0*1201[^\d]', # file 16 header failed to write correctly
'ORA\-0*1208[^\d]', # data file is an old version - not accessing current version
'ORA\-0*1578[^\d]', # data block corruption
'ORA\-0*1135[^\d]', # file accessed for query is offline
'ORA\-0*1547[^\d]', # tablespace is full
'ORA\-0*1555[^\d]', # snapshot too old
'ORA\-0*1562[^\d]', # failed to extend rollback segment
'ORA\-0*162[89][^\d]', # ORA-1628 - ORA-1632 maximum extents exceeded
'ORA\-0*163[0-2][^\d]',
'ORA\-0*165[0-6][^\d]', # ORA-1650 - ORA-1656 tablespace is full
'ORA\-16014[^\d]', # log cannot be archived, no available destinations
'ORA\-16038[^\d]', # log cannot be archived
'ORA\-19502[^\d]', # write error on datafile
'ORA\-27063[^\d]', # number of bytes read/written is incorrect
'ORA\-0*4031[^\d]', # out of shared memory.
'No space left on device',
'Archival Error',
],
warningpatterns => [
'ORA\-0*3113[^\d]', # end of file on communication channel
'ORA\-0*6501[^\d]', # PL/SQL internal error
'ORA\-0*1140[^\d]', # follows WARNING: datafile #20 was not in online backup mode
'Archival stopped, error occurred. Will continue retrying',
]
});
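To see what a pattern of this shape accepts, here is a short Python check of the first critical pattern. It relies on Python's re module treating these particular constructs (escaped hyphen, `0*`, `[^\d]`) the same way Perl does; the sample log lines are made up.

```python
import re

# Same pattern text as the first criticalpatterns entry above.
pattern = re.compile(r"ORA\-0*204[^\d]")

# Hypothetical log lines mapped to whether the pattern should match.
samples = {
    "ORA-00204: error in reading control file": True,  # leading zeros allowed
    "ORA-204: error in reading control file": True,    # no leading zeros needed
    "ORA-02049: unrelated error": False,               # a digit follows 204
    "ORA-00204": False,                                # nothing follows 204
}

results = {line: bool(pattern.search(line)) for line in samples}
for line, matched in results.items():
    print(matched, line)
```

The trailing `[^\d]` keeps ORA-204 from matching longer codes such as ORA-2049, but it also requires some character after the number, so a code at the very end of a line does not match.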
I believe there's now a real Nagios plugin that monitors logs effectively.
http://support.nagios.com/forum/viewtopic.php?f=6&t=8851&p=42088&hilit=unixautomation#p42088
The home page of the Nagios plugin on that page is Nagios Log Monitor
Your [ commands.cfg file ] will contain:
define command {
command_name NagiosLogMonitor
command_line $USER1$/NagiosLogMonitor $HOSTNAME$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
}
OR
define command {
command_name NagiosLogMonitor
command_line $USER1$/NagiosLogMonitor $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
}
Your [ services.cfg file ] will look similar to:
define service {
check_command NagiosLogMonitor!logrobot!autofig!/var/log/proteus.log!15!500.html!500 Internal Server Error!1!2!-foundn
max_check_attempts 1
service_description 500_ERRORS_LOGCHECK
host_name sky.blat-01.net,sky.blat-02.net,sky.blat-03.net
use fifteen-minute-interval
}
Nagios now has a solution that integrates tightly with Nagios Core, XI, etc.: Nagios Log Server, which can alert on any query on any log file on any system in your infrastructure.