Monit IF/ELSE syntax - monit

I am attempting to monitor a service. My monit definition is below.
When I invoke monit -r I receive: /etc/monit/conf.d/authentication.monit:10: syntax error 'else'
check host self with address myhost
    start program = "/usr/bin/service start authentication"
    stop program = "/usr/bin/service stop authentication"
    if failed port 443 protocol https
        request /
        with timeout 5 seconds
        for 2 cycles
    then restart
    if 1 restarts within 4 cycles then exec "/etc/monit/pagerduty-trigger authentication" else if passed for 2 cycles then exec "/etc/monit/pagerduty-resolve authentication"
All the documentation seems to indicate my syntax is correct.
I am attempting to follow these two docs:
pagerduty
primary docs

The documented syntax is:
IF test THEN action [ELSE IF SUCCEEDED THEN action]
Also:
The "if x restarts within y cycles then ..." statement doesn't support the "else" part: https://mmonit.com/monit/documentation/monit.html#SERVICE-RESTART-LIMIT
An "else" for the restart-limit test would not make much sense anyway, as there is no opposite of a restart.
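In other words, the restart-limit line on its own would have to read:
if 1 restarts within 4 cycles then exec "/etc/monit/pagerduty-trigger authentication"
with the resolve side handled separately, for example with the approach further below.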
I'm trying to follow your logic for the if/then/else actions, but I don't understand the else part.
Here you want to stop and then start the service "authentication" when https://myhost:443/ has been failing for two cycles (call this T0).
Then on the next cycle you want to run the script /etc/monit/pagerduty-trigger authentication (T0 + 1 cycle).
Why "within 4 cycles" rather than something smaller like 2 is unclear, but OK.
I suppose that at T0 + 1 + 2 cycles, if the service is online again, you want to run "/etc/monit/pagerduty-resolve authentication".
One solution is to handle it at the level of your failed test, with custom scripts:
if failed port 443 protocol https
    request /
    with timeout 5 seconds
    for 2 cycles
then exec "/var/lib/monit/scripts/notifyAndExecute.sh"
else if succeeded then exec "/etc/monit/pagerduty-resolve authentication"
Then create the file /var/lib/monit/scripts/notifyAndExecute.sh, in charge of restarting the service and calling /etc/monit/pagerduty-trigger authentication.
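A minimal sketch of that script, assuming the service can be restarted with the same /usr/bin/service commands used in the check (untested; adjust to your setup and make the script executable):
#!/bin/bash
# hypothetical helper: restart the service, then raise the PagerDuty incident
/usr/bin/service stop authentication
/usr/bin/service start authentication
/etc/monit/pagerduty-trigger authentication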

I'm not sure if you are still looking at this. I am also integrating Monit with PagerDuty, and I have one simpler example which is working. What I note is that the 'else' appears to only support "else if succeeded". I think it is simply a long-hand version of 'else', without the ability to add the more sophisticated expression that you are attempting.
Here is my example which triggers when service ( process in monit ) does not exist and resolves when it does.
check process tomcat8 with pidfile /var/run/tomcat8.pid
    if not exist
        then exec "/etc/monit/pagerduty-trigger tomcat8"
        else if succeeded then exec "/etc/monit/pagerduty-resolve tomcat8"

Related

Error while running query on Impala with Superset

I'm trying to connect Impala to Superset. When I test the connection it prints "Seems OK!", and when I browse the Impala databases in the SQL Editor on the left side, it shows all databases without problems.
Preview of Databases/Tables
But when I write a query and click on "Run Query", it gives the error: "Could not start SASL: b'Error in sasl_client_start (-1) SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Ticket expired)'"
Error running query
I'm running Superset with SSL and in production mode (with Gunicorn), and Impala with SSL in a Kerberized Hadoop cluster. My Impala database config is:
Impala Config
And in the extras I put:
{
  "metadata_params": {},
  "engine_params": {
    "connect_args": {
      "port": 21050,
      "use_ssl": "True",
      "ca_cert": "path/to/my/ca_cert.pem",
      "auth_mechanism": "GSSAPI"
    }
  },
  "metadata_cache_timeout": {},
  "schemas_allowed_for_csv_upload": []
}
How can I solve this error? In my Superset log it only shows:
Triggering query_id: 65
INFO:superset.views.core:Triggering query_id: 65
Query 65: Running query on a Celery worker
INFO:superset.views.core:Query 65: Running query on a Celery worker
Versions: Superset 0.36.0, Impyla 0.16.2
I was able to fix this error with these steps:
1 - Created a service user for the celery worker, created a Kerberos ticket for it, and set up a crontab entry to renew the ticket (sketched below).
2 - Ran the celery worker as this service user instead of as root.
3 - Killed a celery worker that was running on another machine of my cluster.
4 - Restarted Impala and Superset.
I think this error occurred because for some queries, instead of using the celery worker on my Superset machine, it was using the celery worker on another machine that did not have a valid Kerberos ticket. I was able to track it down because the celery worker log showed a failed connection to the celery worker on the other machine while a query was running.
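For reference, a rough sketch of steps 1 and 2, assuming the service user authenticates with a keytab (the principal, keytab path, and worker command are placeholders, not taken from my actual setup):
# obtain a ticket for the service user from its keytab
sudo -u celery kinit -kt /etc/security/keytabs/celery.keytab celery@EXAMPLE.COM
# crontab entry for that user, renewing the ticket every 8 hours
0 */8 * * * kinit -kt /etc/security/keytabs/celery.keytab celery@EXAMPLE.COM
# run the Superset celery worker as the service user instead of root
# (replace with however you normally start your worker)
sudo -u celery celery worker --app=superset.tasks.celery_app:app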

How to fix etcd cluster "error "tls: first record does not look like a TLS handshake""

I created a three-node etcd cluster; configuration and startup are already OK, but when I check /var/log/messages, it shows:
etcd: rejected connection from "172.17.0.3:43192" (error "tls: first record does not look like a TLS handshake", ServerName "")
How can I fix it?
I have checked the health of etcd:
member 48b0dff99d5c867e is healthy: got healthy result from https://172.17.0.9:2379
member 646dab89331aabab is healthy: got healthy result from https://172.17.0.8:2379
member b45603216bfac234 is healthy: got healthy result from https://172.17.0.10:2379
That shows OK, but when I cat /var/log/messages, it always shows this error:
Jan 12 20:08:57 master etcd: rejected connection from "172.17.0.3:43160" (error "tls: first record does not look like a TLS handshake", ServerName "")
Jan 12 20:08:57 master etcd: rejected connection from "172.17.0.3:43162" (error "tls: oversized record received with length 21536", ServerName "")
I got this message for etcd peer communication when switching from http to https for peer communication. Apparently etcd keeps persistent peer information that overrides the command-line options, so it continued to use http for peer communication in spite of them.
In the end, since this was a test cluster, I nuked /var/lib/etcd and the new CLI configuration took hold.
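Roughly, on a throwaway test node that reset looked like the sketch below (unit name, data directory, and addresses are assumptions; this destroys the member's data, so only do it on a cluster you can rebuild):
systemctl stop etcd
rm -rf /var/lib/etcd/*     # wipe the persisted member/peer state
# before starting again, make sure the peer URLs are https, e.g.
#   --listen-peer-urls https://172.17.0.9:2380
#   --initial-advertise-peer-urls https://172.17.0.9:2380
systemctl start etcd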
I don't have a solution that fully resolves your issue, but I've found a couple of links that might help with further investigation. Read them carefully, try the solutions, and I hope you will resolve the problem.
Github question #9917: check the ETCDCTL_API variable, and especially make sure --endpoints is configured with https.
Runtime reconfiguration: try to reconfigure your etcd cluster by updating/removing/adding etcd members.
nginx ingress: check your nginx ingress annotations in case you are using nginx.
Google Groups TLS handshake topic: check this topic, especially the comments related to the VAULT_ADDR variable. I will copy-paste the last comment from the thread here:
We were able to get everything to work, after understanding the
permission issues.
You asked: "Please confirm if you are seeing server error messages
before initializing Vault" Upon further examination, I did determine
that the errors were not happening before initializing the Vault.
The problem ended up not being related to VAULT_ADDR, and we used the
value: "http://127.0.0.1:8200"
I have the setup operation scripted, and it appears that not
everything was being run at the proper permissions. At first I was
running the scripts using the "sudo" command, which resulted in the
failures. I discovered that the permissions for the certificate key
were restricted and the file could not be accessed by my user. There
may have been other permission issues as well. But once I switched
user to root, and ran the script, everything behaved correctly.
Thanks

How to include special characters on Monit (monitrc) config file?

OS: CentOS 7.4
Monit: 5.14
WebServer: Nginx 1.12.2
How can I insert special characters (specifically '#') in the config file (/etc/monit.d/monitrc)?
I am trying to do the following:
# Monitor clamd#amavisd service
check process clamd#amavisd with pidfile /var/run/amavisd/amavisd.pid
    start program = "/usr/bin/systemctl start clamd#amavisd"
    stop program = "/usr/bin/systemctl stop clamd#amavisd"
    if cpu usage > 99% for 5 cycles then alert
    if mem usage > 99% for 5 cycles then alert
But monit -t issues an error for the '#' character. I've tried:
'#' and 'clamd#amavisd'
"#" and "clamd#amavisd"
\#
For other services such as amavisd below, it runs like a charm:
# Monitor Amavis-new service
check process amavisd with pidfile /var/run/amavisd/amavisd.pid
    start program = "/usr/bin/systemctl start amavisd"
    stop program = "/usr/bin/systemctl stop amavisd"
    if cpu usage > 99% for 5 cycles then alert
    if mem usage > 99% for 5 cycles then alert
According to the documentation:
You might need to use double quotes around the password if it contains special chars such as "p#ssw:r#".

Does Node Manager through WLST in Weblogic 10.3.4 call startWeblogic.cmd

I am using WebLogic 10.3.4 and have a WLST script which does the following:
1. Creates a domain
2. Creates JDBC resources
3. Starts NodeManager
4. Connects to NodeManager
5. Deploys my app
Below are the relevant sections of my script
templatehome = domainhome + "/wlserver/common/templates/domains/wls.jar"
readTemplate(templatehome)
create('MyDomain', 'Domain')
cd('/Security/MyDomain/User/weblogic')
cmo.setName(domainuserid)
cmo.setUserPassword(domainpwd)
writeDomain(domainlocation + '/' + domainname)
# ... some other code related to JDBC ...
closeTemplate()
# Updating setDomainEnv.cmd
f = open(domainlocation + '/' + domainname + '/bin/setDomainEnv.cmd', "a+")
f.write("set CLASSPATH=%DOMAIN_HOME%\lib\javax.el-api-2.2.4.jar;%DOMAIN_HOME%\lib\com.sun.el_2.2.0.v201105051105-com.sun.el_2.2.0.v201105051105.jar;%CLASSPATH%")
f.close()
startNodeManager()
nmConnect(domainuserid, domainpwd, 'localhost', '5556', 'MyDomain', 'D:/MyLoc/Tools/Weblogic/user_projects/domains/MyDomain')
nmStart('AdminServer')
connect()
deploy('myapp','my-war-location')
Please note that I am updating setDomainEnv.cmd in the WLST code itself.
When I run this script, the domain gets created, the node manager gets started, and the application gets deployed.
But the server start does not happen through startWebLogic.cmd, and my updated setDomainEnv.cmd is not called.
So the question is: does NodeManager use startWebLogic.cmd to start a server?
If yes, then why is it not happening in my code?
Check the StartScriptEnabled and StartScriptName properties in nodemanager.properties (e.g. wlserver_10.3/common/nodemanager/nodemanager.properties).
StartScriptEnabled should be set to true and StartScriptName should be set to startWebLogic.cmd. By default these values are set; if they are not, then please set them and restart NodeManager for the changes to take effect.
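For reference, the two relevant lines in nodemanager.properties would look like this (only these two properties come from the above; leave the rest of the file as-is):
StartScriptEnabled=true
StartScriptName=startWebLogic.cmd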
If the above fix doesn't work, then please update the question with the nodemanager.log content.
Reference : http://docs.oracle.com/cd/E24329_01/web.1211/e21050/java_nodemgr.htm#i1068413

How do I use Nagios to monitor a log file

We are using Nagios to monitor our network with great success. However, we have a syslog for critical application errors, and while I set up check_log, it doesn't seem to work as well as monitoring a device.
The issues are:
It only shows the last entry.
There doesn't seem to be a way to acknowledge the critical error and return the monitor to a good state.
Is Nagios the wrong tool, or are we just not setting up the service monitoring right?
Here are my entries
# log file
define command{
    command_name    check_log
    command_line    $USER1$/check_log -F /var/log/applications/appcrit.log -O /tmp/appcrit.log -q ?
}
# Define the log monitoring service
define service{
    name                    logfile-check ;
    use                     generic-service ;
    check_period            24x7 ;
    max_check_attempts      1 ;
    normal_check_interval   5 ;
    retry_check_interval    1 ;
    contact_groups          admins ;
    notification_options    w,u,c,r ;
    notification_period     24x7 ;
    register                0 ;
}

define service{
    use                     logfile-check
    host_name               localhost
    service_description     CritLogFile
    check_command           check_log
}
For monitoring logs with Nagios, typically the log checker will return a warning only for newly discovered error messages each time it is invoked (so it must retain some state in order to know to ignore them on subsequent runs). Therefore I usually set:
max_check_attempts 1
is_volatile 1
This causes Nagios to send out the alert immediately, but only once, and then go back to normal.
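Applied to the service from the question, that could look roughly like this (only max_check_attempts and is_volatile are additions; the rest mirrors the question's definition):
define service{
    use                     logfile-check
    host_name               localhost
    service_description     CritLogFile
    check_command           check_log
    max_check_attempts      1
    is_volatile             1
}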
My favorite log checker is logwarn, but I'm biased because I wrote it myself after not finding any existing ones that I liked. The logwarn package includes a Nagios plugin.
Nothing in your config jumps out at me as being misconfigured.
By design, check_log will only show either an OK message, or the last log entry that triggered an alert. If you need to see multiple entries, you'll need to modify the plugin.
However, I find the fact that you're not getting recoveries somewhat odd. The way check_log works (by comparing the current log to the previous version), you should get a recovery on the very next service check. Except of course, when there have been additional matching entries added to the log since the last check.
Does forcing another service check (or several) cause it to recover?
Also, I don't intend this in a mean way, but make sure it's really malfunctioning.
Is your log getting additional matching entries in between checks, causing it not to recover? Your check is matching "?" which will match anything new in the log. Is something else (a non-error) being added to the log and inadvertently causing a match?
If none of the above are the issue, I would suggest narrowing it down by taking Nagios out of the equation. Try running check_log manually (from the command line, but as the same user as nagios), and with a different oldlog. It should go something like this -
run check with a new "oldlog" - get initialization message
run check - check OK
make change to log
run check - check fails
run check - check OK
If this doesn't work, then you know to focus on the log, the oldlog, and how the check_log is doing the check.
If it works, then it points more towards a problem with your nagios configuration.
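For reference, a rough sketch of such a manual run, reusing the -F/-O/-q arguments from the command definition above (the plugin path and the nagios user name are assumptions for your system):
# run as the nagios user against a fresh "oldlog" so you start from the
# initialization state, then repeat after appending to the monitored log
sudo -u nagios /usr/local/nagios/libexec/check_log \
    -F /var/log/applications/appcrit.log -O /tmp/appcrit.test.log -q '?'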
There is a Nagios plugin that you can use to check the log files: it's called check_logfiles and it's used to scan the lines of a file for regular expressions.
The following link shows how to install and configure check_logfiles for Nagios and Opsview:
https://www.opsview.com/resources/nagios-alternative/blog/syslog-monitoring-nagios-opsview
As there are many ways to achieve a goal, there is also a nice plugin from Consol available:
https://labs.consol.de/lang/en/nagios/check_logfiles/
It supports regexes and log rotation.
To use it, you need a cfg file; this is an example for Oracle databases:
@searches = ({
    tag => 'oraalerts',
    options => 'sticky=28800',
    logfile => '/u01/app/oracle/diag/rdbms/davmdkp/DAVMDKP1/trace/alert_DAVMDKP1.log',
    criticalpatterns => [
        'ORA\-0*204[^\d]',         # error in reading control file
        'ORA\-0*206[^\d]',         # error in writing control file
        'ORA\-0*210[^\d]',         # cannot open control file
        'ORA\-0*257[^\d]',         # archiver is stuck
        'ORA\-0*333[^\d]',         # redo log read error
        'ORA\-0*345[^\d]',         # redo log write error
        'ORA\-0*4[4-7][0-9][^\d]', # ORA-0440 - ORA-0485 background process failure
        'ORA\-0*48[0-5][^\d]',
        'ORA\-0*6[0-3][0-9][^\d]', # ORA-6000 - ORA-0639 internal errors
        'ORA\-0*1114[^\d]',        # datafile I/O write error
        'ORA\-0*1115[^\d]',        # datafile I/O read error
        'ORA\-0*1116[^\d]',        # cannot open datafile
        'ORA\-0*1118[^\d]',        # cannot add a data file
        'ORA\-0*1122[^\d]',        # database file 16 failed verification check
        'ORA\-0*1171[^\d]',        # datafile 16 going offline due to error advancing checkpoint
        'ORA\-0*1201[^\d]',        # file 16 header failed to write correctly
        'ORA\-0*1208[^\d]',        # data file is an old version - not accessing current version
        'ORA\-0*1578[^\d]',        # data block corruption
        'ORA\-0*1135[^\d]',        # file accessed for query is offline
        'ORA\-0*1547[^\d]',        # tablespace is full
        'ORA\-0*1555[^\d]',        # snapshot too old
        'ORA\-0*1562[^\d]',        # failed to extend rollback segment
        'ORA\-0*162[89][^\d]',     # ORA-1628 - ORA-1632 maximum extents exceeded
        'ORA\-0*163[0-2][^\d]',
        'ORA\-0*165[0-6][^\d]',    # ORA-1650 - ORA-1656 tablespace is full
        'ORA\-16014[^\d]',         # log cannot be archived, no available destinations
        'ORA\-16038[^\d]',         # log cannot be archived
        'ORA\-19502[^\d]',         # write error on datafile
        'ORA\-27063[^\d]',         # number of bytes read/written is incorrect
        'ORA\-0*4031[^\d]',        # out of shared memory
        'No space left on device',
        'Archival Error',
    ],
    warningpatterns => [
        'ORA\-0*3113[^\d]',        # end of file on communication channel
        'ORA\-0*6501[^\d]',        # PL/SQL internal error
        'ORA\-0*1140[^\d]',        # follows WARNING: datafile #20 was not in online backup mode
        'Archival stopped, error occurred. Will continue retrying',
    ]
});
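With the cfg file in place, the plugin can then be invoked against it, roughly like this (the cfg path is an assumption):
check_logfiles --config /etc/nagios/check_logfiles/oracle_alerts.cfg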
I believe there's now a real Nagios plugin that monitors logs effectively.
http://support.nagios.com/forum/viewtopic.php?f=6&t=8851&p=42088&hilit=unixautomation#p42088
The home page of the Nagios plugin on that page is Nagios Log Monitor
Your commands.cfg file will contain:
define command {
    command_name    NagiosLogMonitor
    command_line    $USER1$/NagiosLogMonitor $HOSTNAME$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
}
OR
define command {
    command_name    NagiosLogMonitor
    command_line    $USER1$/NagiosLogMonitor $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ '$ARG5$' '$ARG6$' $ARG7$ $ARG8$ $ARG9$ $ARG10$
}
Your services.cfg file will look similar to:
define service {
    check_command           NagiosLogMonitor!logrobot!autofig!/var/log/proteus.log!15!500.html!500 Internal Server Error!1!2!-foundn
    max_check_attempts      1
    service_description     500_ERRORS_LOGCHECK
    host_name               sky.blat-01.net,sky.blat-02.net,sky.blat-03.net
    use                     fifteen-minute-interval
}
Nagios now has a solution that integrates tightly with Nagios Core, XI, etc.: Nagios Log Server, which can alert on any query on any log file on any system in your infrastructure.