Monit "cannot open a connection" errors result in M/Monit alerts that the server is down

I'm using monit and M/Monit to monitor my application infrastructure. Every once in a while, though, M/Monit shows a "No report" error for a server and marks it down. A few seconds later, the issue clears when the server next checks in with M/Monit.
The monit logs on some of the servers have these events in them:
Oct 14 12:19:11 ip-10-203-51-199 monit[30307]: M/Monit: cannot open a connection to http://example.com:8080/collector -- Connection timed out
Oct 14 12:20:16 ip-10-203-51-199 monit[30307]: M/Monit: cannot open a connection to http://example.com:8080/collector -- Connection timed out
Oct 14 12:22:21 ip-10-203-51-199 monit[30307]: M/Monit: cannot open a connection to http://example.com:8080/collector -- Connection timed out
Which setting do I need to tune to raise the threshold before M/Monit considers the server actually down?
Here is the config from the server that has the most trouble:
set httpd port 2812 and
    allow xxx:xxx
set mailserver xxx.xxx.xxx port xxx username "xxx" password "xxx" using tlsv1 with timeout 15 seconds
set daemon 30
    with start delay 120
set logfile syslog facility log_daemon
set alert xxx
set mail-format {
    subject: $EVENT $SERVICE on $HOST
    from: monit@$HOST
    message: Monit $ACTION $SERVICE at $DATE on $HOST: $DESCRIPTION.
}
set mmonit http://xxx:xxx@example.com:8080/collector

There doesn't appear to be any problem with the config file.
The intermittent problem occurs because monit fails to open a socket to the M/Monit collector port and times out. See the source code for reference (handle_mmonit()):
http://fossies.org/linux/privat/monit-5.6.tar.gz:a/monit-5.6/src/collector.c
Search for the string "M/Monit: cannot open a connection to".
The timeout value appears to be fixed at 5 seconds in the code, and 5 seconds should be ample time to open a socket connection on that port.
How often does monit post events to mmonit?
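To tell whether the failures come from the network path (for example a firewall, as in the next answer) rather than from monit itself, you could run a repeated TCP connect test from the affected server. The sketch below is a hypothetical diagnostic, not part of monit: it assumes the collector is at example.com:8080 as in the log, and uses the same 5-second connect timeout the answer describes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class MmonitPortCheck {
    // Assumed to match the fixed 5-second timeout mentioned in the answer.
    static final int CONNECT_TIMEOUT_MS = 5_000;

    /** Returns true if a TCP connection to host:port opens within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers timeouts, refused connections, and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "example.com"; // hypothetical M/Monit host
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080;
        for (int i = 1; i <= 10; i++) {
            long start = System.nanoTime();
            boolean ok = canConnect(host, port, CONNECT_TIMEOUT_MS);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("attempt %d: %s in %d ms%n", i, ok ? "connected" : "FAILED", ms);
        }
    }
}
```

Occasional FAILED attempts that take about 5000 ms would correspond to the "Connection timed out" entries in the monit log and point at something between the two hosts rather than at the monit configuration.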

I had the same problem:
[MST Apr 5 11:24:11] error : 'apache' failed protocol test [APACHESTATUS] at [phoenix.example.com]:80 [TCP/IP] -- APACHE-STATUS: error -- no scoreboard found
[MST Apr 5 11:24:16] error : Cannot create socket to [10x.xx.xx.x4]:8080 -- Connection timed out
We had another firewall on top of iptables. Opening port 8080 on both the input and output sides fixed it!

Related

Lettuce client for Redis - Cluster Topology Refresh Options not working

I'm using Lettuce client version 6.2.0 to connect to a Redis cluster (v6.2) with 3 masters, each with 1 replica. I want the client to rediscover the cluster topology after a master goes down. Here is my client code:
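import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.Delay;

// Seed nodes: ports 7000-7005 (3 masters, 3 replicas)
List<RedisURI> redisURIs = new ArrayList<>();
redisURIs.add(RedisURI.create("redis://127.0.0.1:7000"));
redisURIs.add(RedisURI.create("redis://127.0.0.1:7001"));
redisURIs.add(RedisURI.create("redis://127.0.0.1:7002"));
redisURIs.add(RedisURI.create("redis://127.0.0.1:7003"));
redisURIs.add(RedisURI.create("redis://127.0.0.1:7004"));
redisURIs.add(RedisURI.create("redis://127.0.0.1:7005"));
// Refresh the topology on all adaptive triggers and periodically every 5 seconds
ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enableAllAdaptiveRefreshTriggers()
        .refreshTriggersReconnectAttempts(1)
        .enablePeriodicRefresh(Duration.ofSeconds(5))
        .build();
ClusterClientOptions clientOptions = ClusterClientOptions.builder()
        .autoReconnect(true)
        .topologyRefreshOptions(topologyRefreshOptions)
        .build();
ClientResources clientResources = ClientResources.builder().reconnectDelay(Delay.equalJitter()).build();
RedisClusterClient clusterClient = RedisClusterClient.create(clientResources, redisURIs);
clusterClient.setOptions(clientOptions);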
The problem is that despite the setting enablePeriodicRefresh(Duration.ofSeconds(5)), the refresh interval still appears to be 60 seconds instead of 5. For 1 minute after a master goes down, the client stops working, i.e. it cannot issue an incr operation through clusterClient, and this error keeps repeating:
Jul 18, 2022 5:56:21 PM io.lettuce.core.protocol.ConnectionWatchdog lambda$run$4
WARNING: Cannot reconnect to [127.0.0.1:7000]: Connection refused: /127.0.0.1:7000
After the 1-minute timeout, it shows this warning:
Jul 18, 2022 5:56:22 PM io.lettuce.core.cluster.topology.DefaultClusterTopologyRefresh lambda$openConnections$12
WARNING: Unable to connect to [127.0.0.1:7000]: Connection refused: /127.0.0.1:7000
Command timed out after 1 minute(s)
...and then it is able to proceed with commands. Even after that, it keeps showing this warning:
Jul 18, 2022 5:56:27 PM io.lettuce.core.cluster.topology.DefaultClusterTopologyRefresh lambda$openConnections$12
WARNING: Unable to connect to [127.0.0.1:7000]: Connection refused: /127.0.0.1:7000
What am I missing here?
The enablePeriodicRefresh() setting only takes effect after the timeout expires.
You don't set a timeout on the RedisURI, so it defaults to 60 seconds (which matches the "Command timed out after 1 minute(s)" message).
Lowering that timeout will give you the desired result, for example:
redisURIs.add(RedisURI.create("redis://127.0.0.1:7000/0?timeout=10s"));
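The example above sets the timeout on one URI; to keep all six seed nodes consistent you may want the same parameter on each. A small helper such as the hypothetical clusterUris below builds the strings with the answer's `timeout` query parameter, and each one can then be passed to RedisURI.create(...).

```java
import java.util.ArrayList;
import java.util.List;

class ClusterUris {
    /**
     * Builds redis:// URIs for consecutive ports on one host, each selecting
     * database 0 and carrying a timeout such as "10s".
     * Hypothetical helper for illustration only.
     */
    static List<String> clusterUris(String host, int firstPort, int nodes, String timeout) {
        List<String> uris = new ArrayList<>();
        for (int i = 0; i < nodes; i++) {
            uris.add("redis://" + host + ":" + (firstPort + i) + "/0?timeout=" + timeout);
        }
        return uris;
    }

    public static void main(String[] args) {
        // Ports 7000-7005, matching the cluster in the question.
        clusterUris("127.0.0.1", 7000, 6, "10s").forEach(System.out::println);
    }
}
```

If you would rather not encode the value in a string, Lettuce's RedisURI.Builder.redis(host, port).withTimeout(Duration.ofSeconds(10)).build() sets the same timeout programmatically.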

Monit: search for text at a URL with the HTTPS protocol

For some reason, the monit checks for the presence of text at a URL have been failing constantly for the last 48 hours. Here is the relevant config:
if failed (url https://www.Example.com/where-to-buy/ and content == 'Online Retail Partners' and timeout 40 seconds)
    then alert
if failed (url https://www.Example.com/products/high-absorption and content == 'You May Also Like' and timeout 20 seconds)
    then alert
if failed (url https://www.Example.com/health-interests/bone-health and content == 'Refine' and timeout 20 seconds)
    then alert
if failed (url https://www.Example.com/search?keywords=vitamin+d and content == 'Vegan D3' and timeout 20 seconds)
    then alert
This all worked great for months/years.
We are getting inundated with monit alerts as follows:
Date: 21 Feb 12:11:32 -0600
Host: Example.com.
Service: httpd
Action: Alert
Description: connection succeeded to [www.Example.com]:443/health-interests/bone-health [TCP/IP TLS]
Date: 21 Feb 12:11:33 -0600
Host: Example.com
Service: httpd
Action: Alert
Description: failed protocol test [HTTP] at [www.Example.com]:443/products/high-absorption [TCP/IP TLS] -- Cannot resolve [www.Example.com]:443
Your faithful employee,
M/Monit
Date: 21 Feb 12:14:00 -0600
Host: Example.com
Service: httpd
Action: Alert
Description: connection succeeded to [www.Example.com]:443/products/high-absorption [TCP/IP TLS]
I'm not sure why we are failing the protocol tests.
Is there a different way to set port 443, https protocol while searching for text in a URL?
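The key part of the error is: Cannot resolve [www.Example.com]
Monit is not able to resolve the IP address of the remote host, so the check fails before any HTTPS request or content match takes place. Please investigate name resolution at the host level (DNS, etc.).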
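A quick way to confirm the resolution failure outside monit is a one-off lookup on the monitoring host. The sketch below is a hypothetical helper that calls InetAddress.getAllByName, which normally goes through the system resolver configuration; the default host is the anonymized name from the question.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

class ResolveCheck {
    /** Returns all addresses for host, or an empty array if it cannot be resolved. */
    static InetAddress[] resolve(String host) {
        try {
            return InetAddress.getAllByName(host);
        } catch (UnknownHostException e) {
            return new InetAddress[0];
        }
    }

    public static void main(String[] args) {
        // Pass the real host name as the first argument.
        String host = args.length > 0 ? args[0] : "www.Example.com";
        InetAddress[] addrs = resolve(host);
        if (addrs.length == 0) {
            System.out.println("cannot resolve " + host);
        } else {
            System.out.println(host + " -> " + Arrays.toString(addrs));
        }
    }
}
```

The JVM caches lookups for the life of the process, so for an intermittent problem it is more informative to run this repeatedly as separate processes (or compare with system tools such as getent hosts or dig) than to loop inside one run.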

Apache HTTPD Websocket Tunnel Plugin Error

My WebSocket connection intermittently fails when connecting through the Apache ws tunnel plugin. The connection always works when hitting the app servers directly.
I see the following errors:
Error during WebSocket handshake: Invalid status line
WebSocket connection to 'ws://host' failed: One or more reserved bits are on: reserved1 = 1, reserved2 = 0, reserved3 = 0
and sometimes
WebSocket connection to 'ws://host' failed: Unrecognized frame opcode: 12
and at times
Error during WebSocket handshake: Status line does not end with CRLF ui-toolkit-vendor.js:21965
Infrastructure
Apache HTTPD 2.4.9 with mod_proxy_wstunnel and mod_proxy_balancer modules
The ws tunnel module shipped with version 2.4.9 has several bugs that were later fixed in the 2.4.12 build. Here is an excerpt from the SVN log:
Revision 1587075
Modified Sun Apr 13 18:41:05 2014 UTC by covener
File length: 20119 byte(s)
Diff to previous 1587057
several related mod_proxy_wstunnel changes that are tough to pull apart:
make async websockets tunnel opt-in
add config for how long we block a thread in asynch mode
add config for a cap on the synchronous path
avoid sending error responses down the upgraded tunnel
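The browser errors in the question come from validating WebSocket frame headers. Under RFC 6455 (section 5.2), the first byte of each frame holds the FIN bit, three reserved bits that must be 0 unless a negotiated extension defines them, and a 4-bit opcode whose defined values are 0x0-0x2 and 0x8-0xA. The sketch below applies those checks to a single byte to show why bytes from a corrupted or misaligned stream, which is consistent with (though not proof of) the proxy bugs above, produce messages like "reserved1 = 1" or "opcode: 12". Class and method names are illustrative.

```java
class FrameHeaderCheck {
    static boolean rsv1(int b) { return (b & 0x40) != 0; }
    static boolean rsv2(int b) { return (b & 0x20) != 0; }
    static boolean rsv3(int b) { return (b & 0x10) != 0; }
    static int opcode(int b)   { return b & 0x0F; }

    /** Defined opcodes: 0x0-0x2 (continuation, text, binary), 0x8-0xA (close, ping, pong). */
    static boolean knownOpcode(int op) {
        return op <= 0x2 || (op >= 0x8 && op <= 0xA);
    }

    /** Returns null if the first header byte is acceptable without extensions, else the problem. */
    static String problem(int b) {
        if (rsv1(b) || rsv2(b) || rsv3(b)) {
            return String.format("reserved bits on: reserved1 = %d, reserved2 = %d, reserved3 = %d",
                    rsv1(b) ? 1 : 0, rsv2(b) ? 1 : 0, rsv3(b) ? 1 : 0);
        }
        int op = opcode(b);
        if (!knownOpcode(op)) {
            return "unrecognized frame opcode: " + op;
        }
        return null;
    }

    public static void main(String[] args) {
        // 0x81: final text frame; 0xC1: RSV1 set; 0x8C: reserved opcode 12
        int[] samples = {0x81, 0xC1, 0x8C};
        for (int b : samples) {
            String p = problem(b);
            System.out.printf("0x%02X -> %s%n", b, p == null ? "ok" : p);
        }
    }
}
```

Note that a connection using permessage-deflate legitimately sets reserved1 on compressed frames, so a real client only applies the reserved-bit check for bits that no negotiated extension claims.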

cPanel mail forwarding not working

Forwarding is (I believe) set up correctly. Messages sent to domain addresses this morning that should have been forwarded have not been received by the target email account.
The relevant lines from /var/log/exim_mainlog are:
2015-04-02 02:31:22 1YdY8G-0004Ol-Ve == to@emailid (from@emailid) R=lookuphost T=remote_smtp defer (110): Connection timed out
2015-04-02 02:31:22 1YdY8G-0004Ol-Ve ** to@emailid : retry timeout exceeded
Try setting up your mail forward to a different mail account. I think your mail server's IP is blocked by the remote server, which is why you are seeing timeouts in the mail log.
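In exim's main log, `==` marks a deferred delivery and `**` a failed one, so the two lines above show the remote host timing out (error 110) until the retry limit was reached. To see how widespread this is, a small scan can list those events by message id. This is a sketch that assumes the default `date time message-id flag ...` layout shown in the question (extra log_selector fields would shift the columns); the class name is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class EximLogScan {
    /** One deferred or failed delivery attempt. */
    static final class Event {
        final String messageId;
        final String kind;   // "deferred" or "failed"
        final String detail; // recipient, router/transport, and error text

        Event(String messageId, String kind, String detail) {
            this.messageId = messageId;
            this.kind = kind;
            this.detail = detail;
        }
    }

    static List<Event> scan(List<String> lines) {
        List<Event> events = new ArrayList<>();
        for (String line : lines) {
            // Assumed layout: date, time, message id, flag, rest of line
            String[] parts = line.split(" ", 5);
            if (parts.length < 5) {
                continue;
            }
            String kind;
            if (parts[3].equals("==")) {
                kind = "deferred";
            } else if (parts[3].equals("**")) {
                kind = "failed";
            } else {
                continue;
            }
            events.add(new Event(parts[2], kind, parts[4]));
        }
        return events;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "/var/log/exim_mainlog";
        // ISO-8859-1 accepts any byte, so unusual characters in the log do not abort the read.
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.ISO_8859_1);
        for (Event e : scan(lines)) {
            System.out.println(e.messageId + " " + e.kind + ": " + e.detail);
        }
    }
}
```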

Monit / restart service when failed

I have a service, a server listening on port 7000.
I want to verify that the service is always up, and when it fails I want to start it again.
I wrote the following configuration in /etc/monit.d/myserver:
check process myserver with pidfile /var/run/myserver.pid
    start program = "/etc/init.d/myserver start" with timeout 5 seconds
    stop program = "/etc/init.d/myserver stop" with timeout 5 seconds
    if failed host 127.0.0.1 port 7000
        protocol HTTP request /testcheck then restart
    if 5 restarts within 5 cycles then timeout
But I notice that even when the process is running, monit restarts the service and writes the following to the log:
[EST Dec 18 03:05:13] error : HTTP: error receiving data -- Resource temporarily unavailable
[EST Dec 18 03:05:13] error : 'myserver ' failed protocol test [HTTP] at INET[127.0.0.1:7000] via TCP
[EST Dec 18 03:05:13] info : 'myserver ' trying to restart
[EST Dec 18 03:05:13] info : 'myserver ' stop: /etc/init.d/myserver
[EST Dec 18 03:05:14] info : 'myserver ' start: /etc/init.d/myserver
How can I configure the check so that monit restarts the service only when it is actually down?
I had the same problem, and in the end I found out that I was not running the monit daemon properly. Take a look at this post: Rerun a process in Monit if process stops
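Separately from how the daemon is started, the `protocol HTTP request /testcheck` test connects to port 7000, sends a request for /testcheck, and expects a complete HTTP response; "error receiving data -- Resource temporarily unavailable" suggests monit gave up waiting for one. If the server does not already answer that path promptly, a minimal handler like the sketch below could serve it. It assumes you can add such an endpoint, uses the JDK's built-in com.sun.net.httpserver, and the class name and body are illustrative.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

class TestCheckServer {
    /** Starts a server on 127.0.0.1:port that answers GET /testcheck with 200 "OK". */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/testcheck", exchange -> {
            byte[] body = "OK\n".getBytes(StandardCharsets.UTF_8);
            // A status line plus Content-Length lets the client read the full response without waiting.
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(7000); // port from the question
        System.out.println("listening on " + server.getAddress());
    }
}
```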