How do I use Nagios to monitor a log file - logfiles

We are using Nagios to monitor our network with great success. However, we have a syslog for critical application errors, and while I set up check_log, it doesn't seem to work as well as monitoring a device.
The issues are:
It only shows the last entry.
There doesn't seem to be a way to acknowledge the critical error and return the monitor to a good state.
Is Nagios the wrong tool, or are we just not setting up the service monitoring right?
Here are my entries:
# log file
define command{
    command_name    check_log
    command_line    $USER1$/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.log -q ?
    }

# Define the log monitoring service
define service{
    name                    logfile-check ;
    use                     generic-service ;
    check_period            24x7 ;
    max_check_attempts      1 ;
    normal_check_interval   5 ;
    retry_check_interval    1 ;
    contact_groups          admins ;
    notification_options    w,u,c,r ;
    notification_period     24x7 ;
    register                0 ;
    }

define service{
    use                  logfile-check
    host_name            localhost
    service_description  CritLogFile
    check_command        check_log
    }

For monitoring logs with Nagios, typically the log checker will return a warning only for newly discovered error messages each time it is invoked (so it must retain some state in order to know to ignore them on subsequent runs). Therefore I usually set:
max_check_attempts 1
is_volatile 1
This causes Nagios to send out the alert immediately, but only once, and then return to a normal state.
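For illustration, a volatile log-check service using those settings might look roughly like this (the host and service names here are placeholders):

define service{
    use                    generic-service
    host_name              localhost
    service_description    AppLog
    check_command          check_log
    max_check_attempts     1
    is_volatile            1
    normal_check_interval  5
    }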
My favorite log checker is logwarn, but I'm biased because I wrote it myself after not finding any existing ones that I liked. The logwarn package includes a Nagios plugin.

Nothing in your config jumps out at me as being misconfigured.
By design, check_log will only show either an OK message, or the last log entry that triggered an alert. If you need to see multiple entries, you'll need to modify the plugin.
However, I find the fact that you're not getting recoveries somewhat odd. The way check_log works (by comparing the current log to the previous version), you should get a recovery on the very next service check. Except of course, when there have been additional matching entries added to the log since the last check.
Does forcing another service check (or several) cause it to recover?
Also, I don't intend this in a mean way, but make sure it's really malfunctioning.
Is your log getting additional matching entries in between checks, causing it not to recover? Your check is matching "?" which will match anything new in the log. Is something else (a non-error) being added to the log and inadvertently causing a match?
If none of the above is the issue, I would suggest narrowing it down by taking Nagios out of the equation. Try running check_log manually (from the command line, but as the same user as nagios) and with a different oldlog. It should go something like this:
run check with a new "oldlog" - get initialization message
run check - check OK
make change to log
run check - check fails
run check - check OK
If this doesn't work, then you know to focus on the log, the oldlog, and how check_log is doing the check.
If it works, then it points more towards a problem with your Nagios configuration.
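As a concrete sketch of that test sequence (assuming $USER1$ points at /usr/local/nagios/libexec and using a scratch oldlog so the real state file is left untouched; adjust the paths and user to your install):

# run as the nagios user with a fresh oldlog - prints an initialization message
sudo -u nagios /usr/local/nagios/libexec/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.test.log -q '?'
# run again - should report OK
sudo -u nagios /usr/local/nagios/libexec/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.test.log -q '?'
# append a test line to the log (or to a copy of it), then run again - should report the new entry
echo "test critical entry" >> /var/log/applications/appcrit.log
sudo -u nagios /usr/local/nagios/libexec/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.test.log -q '?'
# run once more - should be back to OK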

There is a Nagios plugin that you can use to check log files: it's called check_logfiles, and it scans the lines of a file for regular expressions.
The following link shows how to install and configure check_logfiles for Nagios and Opsview:
https://www.opsview.com/resources/nagios-alternative/blog/syslog-monitoring-nagios-opsview

As there are many ways to achieve a goal, there is also a nice plugin from Consol available:
https://labs.consol.de/lang/en/nagios/check_logfiles/
supports regex
supports log rotation
To use it, you need a cfg file; here is an example for Oracle databases:
@searches = ({
    tag => 'oraalerts',
    options => 'sticky=28800',
    logfile => '/u01/app/oracle/diag/rdbms/davmdkp/DAVMDKP1/trace/alert_DAVMDKP1.log',
    criticalpatterns => [
        'ORA\-0*204[^\d]',         # error in reading control file
        'ORA\-0*206[^\d]',         # error in writing control file
        'ORA\-0*210[^\d]',         # cannot open control file
        'ORA\-0*257[^\d]',         # archiver is stuck
        'ORA\-0*333[^\d]',         # redo log read error
        'ORA\-0*345[^\d]',         # redo log write error
        'ORA\-0*4[4-7][0-9][^\d]', # ORA-0440 - ORA-0485 background process failure
        'ORA\-0*48[0-5][^\d]',
        'ORA\-0*6[0-3][0-9][^\d]', # ORA-0600 - ORA-0639 internal errors
        'ORA\-0*1114[^\d]',        # datafile I/O write error
        'ORA\-0*1115[^\d]',        # datafile I/O read error
        'ORA\-0*1116[^\d]',        # cannot open datafile
        'ORA\-0*1118[^\d]',        # cannot add a data file
        'ORA\-0*1122[^\d]',        # database file 16 failed verification check
        'ORA\-0*1171[^\d]',        # datafile 16 going offline due to error advancing checkpoint
        'ORA\-0*1201[^\d]',        # file 16 header failed to write correctly
        'ORA\-0*1208[^\d]',        # data file is an old version - not accessing current version
        'ORA\-0*1578[^\d]',        # data block corruption
        'ORA\-0*1135[^\d]',        # file accessed for query is offline
        'ORA\-0*1547[^\d]',        # tablespace is full
        'ORA\-0*1555[^\d]',        # snapshot too old
        'ORA\-0*1562[^\d]',        # failed to extend rollback segment
        'ORA\-0*162[89][^\d]',     # ORA-1628 - ORA-1632 maximum extents exceeded
        'ORA\-0*163[0-2][^\d]',
        'ORA\-0*165[0-6][^\d]',    # ORA-1650 - ORA-1656 tablespace is full
        'ORA\-16014[^\d]',         # log cannot be archived, no available destinations
        'ORA\-16038[^\d]',         # log cannot be archived
        'ORA\-19502[^\d]',         # write error on datafile
        'ORA\-27063[^\d]',         # number of bytes read/written is incorrect
        'ORA\-0*4031[^\d]',        # out of shared memory
        'No space left on device',
        'Archival Error',
    ],
    warningpatterns => [
        'ORA\-0*3113[^\d]',        # end of file on communication channel
        'ORA\-0*6501[^\d]',        # PL/SQL internal error
        'ORA\-0*1140[^\d]',        # follows WARNING: datafile #20 was not in online backup mode
        'Archival stopped, error occurred. Will continue retrying',
    ],
});
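To hook this into Nagios, a command definition along these lines should do it, assuming the configuration above is saved as /etc/nagios/check_oracle_alerts.cfg (the path is just an example):

define command{
    command_name check_logfiles_oracle
    command_line $USER1$/check_logfiles --config /etc/nagios/check_oracle_alerts.cfg
    }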

I believe there's now a real Nagios plugin that monitors logs effectively.
http://support.nagios.com/forum/viewtopic.php?f=6&t=8851&p=42088&hilit=unixautomation#p42088
The plugin discussed on that page is Nagios Log Monitor.
Your commands.cfg file will contain:

define command {
    command_name NagiosLogMonitor
    command_line $USER1$/NagiosLogMonitor $HOSTNAME$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
    }

OR

define command {
    command_name NagiosLogMonitor
    command_line $USER1$/NagiosLogMonitor $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
    }

Your services.cfg file will look similar to:

define service {
    check_command       NagiosLogMonitor!logrobot!autofig!/var/log/proteus.log!15!500.html!500 Internal Server Error!1!2!-foundn
    max_check_attempts  1
    service_description 500_ERRORS_LOGCHECK
    host_name           sky.blat-01.net,sky.blat-02.net,sky.blat-03.net
    use                 fifteen-minute-interval
    }
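The fifteen-minute-interval template referenced above is not defined on that page; presumably it is an ordinary service template along these lines (a guess; adjust to your own setup):

define service {
    name                   fifteen-minute-interval
    use                    generic-service
    normal_check_interval  15
    register               0
    }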

Nagios now has a solution that integrates tightly with Nagios Core, XI, etc.: Nagios Log Server, which can alert on any query on any log file on any system in your infrastructure.

Related

Syslog-ng logs not processing certain logs possibly due to journal cursor issue

I'm using syslog-ng 3.37.1 on a VMware Photon 3.0 virtual appliance (a preconfigured VM). The appliance is configured to write logs into certain files under the /var/log folder as well as to remote syslog servers (optional).
Logs from the 'auth' and 'authpriv' facilities are configured to be written to /var/log/auth.log, as well as sent to a remote syslog server when enabled.
In addition, other messages from the kernel, systemd services, and other processes are configured to be processed via syslog-ng.
The issue is that logs from a few facilities (such as auth, authpriv, cron, etc.) are not processed (received?) by syslog-ng initially. So any SSH events or TTY login events are not logged to the file or to the remote server. However, many other events from the kernel, systemd, and other processes are logged fine.
Below is the configuration for auth.log, which does not log on the first boot.
filter f_auth { facility(auth) or facility(authpriv); };
destination authlog { file("/var/log/auth.log" perm(0600)); };
log { source(s_local); filter(f_auth); destination(authlog); };
I updated the filter as below, without any success:
filter f_auth {
    facility(auth) or facility(authpriv) or
    match('sshd' value('PROGRAM')) or match('systemd-logind' value('PROGRAM'));
};
In the journal I can observe the relevant logs, for example with the command below to view SSH logs:
journalctl -f -u sshd
An additional syslog-ng service restart or config reload during appliance startup does not fix this.
The log file /var/log/auth.log (and also the cron log, etc.) shows zero size during this time. The syslog-ng log looks fine too.
However, if I generate some auth facility event (say, an SSH/TTY login) and manually restart syslog-ng, all the log entries (including old events) are immediately written to the filesystem log (/var/log/auth.log) and also sent to the remote syslog server.
In syslog-ng.log I find the entry below when it starts working that way:
syslog-ng[481]: [date] Failed to seek journal to the saved cursor position; cursor='', error='Invalid argument (22)'
It makes me wonder if it is due to a bad cursor position. However, I can still see other systemd and kernel logs being logged fine, so I am not sure.
What could be causing such behaviour? How can I ensure that syslog-ng is able to receive and process these logs without manual intervention?
Below is the more detailed configuration for reference:
@version: 3.37
@include "scl.conf"

source s_local {
    system();
    internal();
    udp();
};

destination d_local {
    file("/var/log/messages");
    file("/var/log/messages-kv.log" template("$ISODATE $HOST $(format-welf --scope all-nv-pairs)\n") frac-digits(3));
};

log {
    source(s_local);
    # uncomment this line to open port 514 to receive messages
    #source(s_network);
    destination(d_local);
};

filter f_auth {
    facility(auth) or facility(authpriv); # Also tried facility(auth, authpriv)
};
destination authlog { file("/var/log/auth.log" perm(0600)); };
log { source(s_local); filter(f_auth); destination(authlog); };

destination d_kern { file("/dev/console" perm(0600)); };
filter f_kern { facility(kern); };
log { source(s_local); filter(f_kern); destination(d_kern); };

destination d_cron { file("/var/log/cron" perm(0600)); };
filter f_cron { facility(cron); };
log { source(s_local); filter(f_cron); destination(d_cron); };

destination d_syslogng { file("/var/log/syslog-ng.log" perm(0600)); };
filter f_syslogng { program(syslog-ng); };
log { source(s_local); filter(f_syslogng); destination(d_syslogng); };

# A few more of the above kind of configuration follows here.

# Add configuration files that have remote destination, filter and log configuration for remote servers
@include "remote/*.conf"
As can be seen, /var/log/auth.log should hold logs from the auth facility, but the log remains blank until a subsequent restart of syslog-ng after a syslog config change (via API) or a manual login into the system. However, triggering an automated restart of syslog-ng using cron (without an additional syslog config change) does not help.
Any thoughts, suggestions?
This is probably caused by your real time clock going backwards. The notification mechanism in libsystemd does not work in this case.
There's a proof-of-concept patch in this syslog-ng issue:
https://github.com/syslog-ng/syslog-ng/issues/2836
But I've increased the priority of tackling that problem and fixing this, as it is causing issues more often than I anticipated.
As a workaround, you should synchronize the time for your VM, preferably so that it waits for a sync during boot, and then keep the time synchronized by NTP.
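On a systemd-based appliance such as Photon, one way to do that (assuming systemd-timesyncd is available on the image) is roughly:

# enable NTP synchronization via systemd-timesyncd
timedatectl set-ntp true
# units ordered after time-sync.target will then wait for the first successful sync at boot
systemctl enable systemd-time-wait-sync.service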

Establish Redis Connection in Phoenix

I need to establish a Redis connection when my Phoenix app initially loads. When reading the docs I thought that code would go in /config/dev.exs or /config/config.exs, but the Redix dependency I am using as a Redis interface is not loaded in /config.
Below results in a reference error in /config:
Redix.start_link("redis://localhost:6379/3", name: :redix)
I only want to call this once on app load. Where should I put this call in my Phoenix app?
Adding {Redix, name: :redix} to the children list in application.ex adds the Redix process to the supervision tree, which means it will start along with your application:
children = [
  # Start the Ecto repository
  MyApp.Repo,
  # Start the Telemetry supervisor
  AppWeb.Telemetry,
  # Start the PubSub system
  ....
  # Single Redis connection
  {Redix, name: :redix}
]
See https://hexdocs.pm/redix/real-world-usage.html
You can check in iex -S mix:
iex(1)> Redix.command(:redix, ["PING"])
{:ok, "PONG"}
Now you can use all the regular Redix commands: https://hexdocs.pm/redix/readme.html#usage
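Once the supervised connection is registered under the :redix name, any code in the app can use it; for example (keys and values here are arbitrary):

{:ok, "OK"} = Redix.command(:redix, ["SET", "mykey", "hello"])
{:ok, "hello"} = Redix.command(:redix, ["GET", "mykey"])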

apache + mod_perl + couchbase = occasional connection problems

We use Couchbase as session storage for mod_perl scripts. To avoid delays on clients caused by waiting for a new connection, we preconnect to Couchbase in the child_init Apache stage. So during an Apache restart / new child creation it connects to Couchbase automatically and later uses that connection during the Apache child's lifetime.
Generally everything works fine, but sometimes we get the following error during that preconnection:
Couldn't connect: 0x13 (Operation not supported) at /perl/lib64/perl5/Couchbase/Bucket.pm line 38.
Usually it appears during an Apache restart and on several (or dozens) of children, and almost never on one child only. Usually restarting Apache again solves the problem.
What can cause such problems? Is it a problem with the code, the server configuration, or the Couchbase server itself?
Maybe it is somehow caused by a lot of reconnections at the same time? Some ulimit settings or SELinux restrictions?
UPD: versions
OS:
Centos 6, 2.6.32-358.2.1.el6.x86_64
libcouchbase:
libcouchbase-devel.x86_64 2.4.7-1.el6
libcouchbase2-core.x86_64 2.4.7-1.el6
libcouchbase2-libevent.x86_64 2.4.7-1.el6
couchbase server:
2.2.0 community edition (build-837)
SDK:
perl (Couchbase::Core v2.0.2)
connection code (isolated & simplified):
# in the mod_perl environment
use Couchbase;
use Couchbase::Bucket;
use Couchbase::Document;
use Apache2::ServerUtil ();

my $cb = undef;

# connection handler, initialized once, used during the apache child lifetime
sub connect_couchbase_on_child_init {
    my ($child_pool, $s) = @_;
    my $dsn = 'couchbase://192.168.0.1,192.168.0.2/my_bucket_name?detailed_errcodes=1';
    eval { $cb = Couchbase::Bucket->new($dsn); };
    # here we get the occasional warnings during apache restarts
    if ($@) { warn "COUCHBASE CONNECTION ERROR! $@"; $cb = undef; }
    return Apache2::Const::OK;
}
Apache2::ServerUtil->server->push_handlers(PerlChildInitHandler => \&connect_couchbase_on_child_init);

# in request handlers it is used with the following calls (only if connected):
# $doc = Couchbase::Document->new($key);
# $cb->get($doc);
# ...
# $cb->replace($doc);
# ...
# $cb->insert($doc);
# ...
# $cb->remove($doc);
Because you are using server 2.2.0, and because this seems to happen when you are connecting many clients at once, my theory is that you are receiving the last error from the server. The current client bootstrap process first attempts bootstrap over memcached (which is only supported from version >= 2.5.0 of the server); that fails, and it then attempts 'terse' bootstrapping (again, only supported on >= 2.5.0 of the server) and finally 'classic' HTTP (which is available on all versions).
Add the following options to your DSN/connection string to cut out some of the steps for your server. Note that should you ever upgrade to >= 2.5 these options should be removed:
bootstrap_on=http: does not try memcached bootstrap
http_urlmode=2: uses the pre-2.5 style of bootstrapping by default
These two options will not necessarily fix your issue, but they will at least cut out some of the initial connection time, and perhaps show a clearer reason for the error (you can also set LCB_LOGLEVEL=5 in the environment to get actual logging).
In your case, the connection string would be:
couchbase://192.168.0.1,192.168.0.2/my_bucket_name?detailed_errcodes=1&bootstrap_on=http&http_urlmode=2
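To get the libcouchbase logging mentioned above during the child_init preconnect, LCB_LOGLEVEL has to be in the Apache process environment. On CentOS 6 one way to do that is via /etc/sysconfig/httpd, which the init script sources (then restart Apache so the children inherit it):

# /etc/sysconfig/httpd
export LCB_LOGLEVEL=5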

puppet master didn't pass agent hostname/fqdn to enc script

Puppet version: 3.6.2
In order to simplify the management of SSL certificates, our Puppet agents use the same certname: certname=agent.puppet.com
When the puppet master gets a request from an agent (hostname: web00.xxx.com), it executes the ENC script with the certname as the parameter.
node_terminus = exec
external_nodes = /home/ocean/puppet/conf/bce_puppet_bns
puppet.log:
2015-05-06 09:55:34 +0800 Puppet (debug): Executing '/home/ocean/puppet/conf/bce_puppet_bns agent.puppet.com'
How do I configure the puppet master to pass the agent's real hostname/FQDN to the ENC script, like this:
/home/ocean/puppet/conf/bce_puppet_bns web00.xxx.com
Or how can I get the agent's hostname/FQDN in the ENC script?
Don't.
Don't use any info other than $clientcert passed from the agent.
Don't share certificates among different agents.
There are deeply rooted assumptions in Puppet that each agent node has an individual certificate. You will wreak havoc in your infrastructure by trying such stunts.
For example, PuppetDB data is usually grouped by the owning agents' certnames. This data will quickly become inconsistent with all agents calling themselves the same name while being quite different machines.
Ensure the puppetmaster config says this:
[master]
node_name = facter
Alter auth.conf so that all the sections allow the "agent.puppet.com" cert, like this:
# allow nodes to retrieve their own catalog
path ~ ^/catalog/([^/]+)$
method find
allow $1
allow agent.puppet.com
# allow nodes to retrieve their own node definition
path ~ ^/node/([^/]+)$
method find
allow $1
allow agent.puppet.com
# allow all nodes to access the certificates services
path /certificate_revocation_list/ca
method find
allow *
# allow all nodes to store their own reports
path ~ ^/report/([^/]+)$
method save
allow $1
allow agent.puppet.com
That's just the puppetmaster <=> client part; Felix is right that if you are using PuppetDB, that would have to be altered too.
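For reference, the agent side of this shared-certificate setup presumably just pins the common certname in puppet.conf, something like:

[agent]
certname = agent.puppet.com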

syslogs on AIX machine

I need to view the syslog on an AIX machine.
I have no clue about this.
I went through the syslog.conf file and got something like this:
# "mail messages, at debug or higher, go to Log file. File must exist."
# "all facilities, at debug and higher, go to console"
# "all facilities, at crit or higher, go to all users"
# mail.debug /usr/spool/mqueue/syslog
# *.debug /dev/console
# *.crit *
# *.debug /tmp/syslog.out rotate size 100k files 4
# *.crit /tmp/syslog.out rotate time 1d
Also, I do not know how to access /dev/console.
Can somebody help out?
See How to configure AIX syslogd and managing AIX logs.
From your configuration, I see that all syslogged information can be found in /tmp/syslog.out since this is where *.debug is being logged.
If you don't find anything there, you should check if the syslogd daemon is actually running.
If you make a change to syslog.conf file, you have to restart the daemon using
refresh -s syslogd
Update: I see that everything in syslog.conf is commented out. If you want to see some logs, you have to enable some logging facility. For example, it should look like this:
# "mail messages, at debug or higher, go to Log file. File must exist."
# "all facilities, at debug and higher, go to console"
# "all facilities, at crit or higher, go to all users"
# mail.debug /usr/spool/mqueue/syslog
# *.debug /dev/console
# *.crit *
*.debug /tmp/syslog.out rotate size 100k files 4
# *.crit /tmp/syslog.out rotate time 1d
if you want to see anything in /tmp/syslog.out.
And, don't forget to restart the daemon!
Update 2:
To enable logging of everything, put this in syslog.conf:
*.* /tmp/syslog.out rotate size 100k files 4
This way you'll see if logging really is working.
You forgot that the file must exist:
touch /tmp/syslog.out
refresh -s syslogd
Well, /dev/console is the console, almost certainly a physical terminal connected to the box itself. It's not a storage device that you can get the information back from.
As to which file you need to look at, it's usually controlled by the file you showed us, and individual messages can be sent to different places based on facility and priority. However, since all those lines you see are commented out, they'll go to the default, which is probably the console.