Intermittent "Cannot connect to shibd process, a site adminstrator should be notified" - apache

We have a Shibboleth Native SP 2.5.4 that has been running for a few years without any issues. Yesterday I had to update a certificate for one of the IdPs. Since that restart I've been getting intermittent errors:
Cannot connect to shibd process, a site adminstrator should be notified.
Errors appear to occur in bursts, as shown by these per-minute error counts:
nb  | time
58  | Sep 22 09:56
82  | Sep 22 10:53
82  | Sep 22 11:16
80  | Sep 22 11:17
89  | Sep 22 11:37
71  | Sep 22 11:38
130 | Sep 22 11:43
47  | Sep 22 11:44
Restarting httpd and shibd didn't resolve the issue. SELinux is disabled.
In /var/log/shibboleth-www/native_warn.log I have:
2020-09-22 11:54:13 ERROR Shibboleth.Listener [15798] shib_check_user: socket call (connect) resulted in error (2): no message
2020-09-22 11:54:13 WARN Shibboleth.Listener [15798] shib_check_user: cannot connect socket (21)...
2020-09-22 11:54:13 CRIT Shibboleth.Listener [15798] shib_check_user: socket server unavailable, failing
2020-09-22 11:54:13 ERROR Shibboleth.Apache [15798] shib_check_user: Cannot connect to shibd process, a site adminstrator should be notified.
Memory and CPU look good to me:
top - 12:08:08 up 25 days, 22:33, 2 users, load average: 1.01, 1.03, 1.01
Tasks: 294 total, 1 running, 293 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 0.1%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32880188k total, 4712256k used, 28167932k free, 426772k buffers
Swap: 5242876k total, 0k used, 5242876k free, 1993996k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16876 apache 20 0 370m 10m 4160 S 0.7 0.0 0:00.20 httpd
17418 shibd 20 0 4894m 58m 8084 S 0.7 0.2 0:01.46 shibd
2401 root 20 0 3116m 270m 19m S 0.3 0.8 128:41.88 cylancesvc
17519 apache 20 0 370m 10m 3948 S 0.3 0.0 0:00.12 httpd
17766 apache 20 0 370m 10m 3872 S 0.3 0.0 0:00.13 httpd
Any idea what could cause this?
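For what it's worth, here is how I've been checking whether shibd is up and its listener socket is in place while the errors occur; this is only a quick sketch, and the socket path is the default from my shibboleth2.xml, so it may differ on other installs:
# is shibd running, and for how long?
ps -u shibd -o pid,etime,args
# default listener socket location on this box (adjust to match shibboleth2.xml)
ls -l /var/run/shibboleth/shibd.sock
# is shibd anywhere near its open-file limit? (mod_shib opens a socket to shibd per request)
ls /proc/$(pgrep -x shibd)/fd | wc -l
grep 'open files' /proc/$(pgrep -x shibd)/limits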

Related

Error when using TLS server with pgBackRest: [113] No route to host

I'm trying to implement the TLS server feature available with pgBackRest to secure the connection between the DB server and the repo server, replacing the previous passwordless SSH setup (which was working fine).
After following the online documentation, I'm getting the following error when issuing the stanza-create command:
pgbackrest@pgb-repo$ pgbackrest --stanza=training --log-level-console=info stanza-create
2022-06-13 12:56:55.677 P00 INFO: stanza-create command begin 2.39: --buffer-size=16MB --exec-id=8994-62e5ecac --log-level-console=info --log-level-file=info --pg1-host=pg1-primary --pg1-host-ca-file=/etc/pgbackrest/cert/ca.crt --pg1-host-cert-file=/etc/pgbackrest/cert/pg1-primary.crt --pg1-host-key-file=/etc/pgbackrest/cert/pg1-primary.key --pg1-host-type=tls --pg1-host-user=postgres --pg1-path=/data/postgres/13/pg_data --repo1-path=/backup/pgbackrest --stanza=training
WARN: unable to check pg1: [HostConnectError] unable to connect to 'pg1-primary:8432': [113] No route to host
ERROR: [056]: unable to find primary cluster - cannot proceed
HINT: are all available clusters in recovery?
2022-06-13 12:58:55.835 P00 INFO: stanza-create command end: aborted with exception [056]
The PostgreSQL server is up and running on the DB host:
[postgres@pg1-primary ~]$ psql -c "SELECT pg_is_in_recovery();"
pg_is_in_recovery
-------------------
f
(1 row)
Question
Why am I getting this [113] No route to host error?
Configuration for each server:
pg1-primary
[postgres@pg1-primary ~]$ cat /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/backup/pgbackrest
repo1-host-ca-file=/etc/pgbackrest/cert/ca.crt
repo1-host-cert-file=/etc/pgbackrest/cert/pgb-repo.crt
repo1-host-key-file=/etc/pgbackrest/cert/pgb-repo.key
repo1-host-type=tls
tls-server-address=*
tls-server-auth=pgb-repo=training
tls-server-ca-file=/etc/pgbackrest/cert/ca.crt
tls-server-cert-file=/etc/pgbackrest/cert/pg1-primary.crt
tls-server-key-file=/etc/pgbackrest/cert/pg1-primary.key
[postgres@pg1-primary ~]$ cat /etc/pgbackrest/conf.d/training.conf
[training]
pg1-path=/data/postgres/13/pg_data
pg1-socket-path=/tmp
repo1-host=pgb-repo
repo1-host-user=pgbackrest
[postgres@pg1-primary ~]$ ll /etc/pgbackrest/cert/
total 20
-rw-------. 1 postgres postgres 1090 Jun 13 12:12 ca.crt
-rw-------. 1 postgres postgres 977 Jun 13 12:12 pg1-primary.crt
-rw-------. 1 postgres postgres 1708 Jun 13 12:12 pg1-primary.key
-rw-------. 1 postgres postgres 977 Jun 13 12:23 pgb-repo.crt
-rw-------. 1 postgres postgres 1704 Jun 13 12:23 pgb-repo.key
pgb-repo
pgbackrest@pgb-repo$ cat /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/backup/pgbackrest
tls-server-address=*
tls-server-auth=pg1-primary=training
tls-server-ca-file=/etc/pgbackrest/cert/ca.crt
tls-server-cert-file=/etc/pgbackrest/cert/pgb-repo.crt
tls-server-key-file=/etc/pgbackrest/cert/pgb-repo.key
pgbackrest@pgb-repo$ cat /etc/pgbackrest/conf.d/training.conf
[training]
pg1-host=pg1-primary
pg1-host-user=postgres
pg1-path=/data/postgres/13/pg_data
pg1-host-ca-file=/etc/pgbackrest/cert/ca.crt
pg1-host-cert-file=/etc/pgbackrest/cert/pg1-primary.crt
pg1-host-key-file=/etc/pgbackrest/cert/pg1-primary.key
pg1-host-type=tls
pgbackrest@pgb-repo$ ll /etc/pgbackrest/cert/
total 20
-rw-------. 1 pgbackrest pgbackrest 1090 Jun 13 12:27 ca.crt
-rw-------. 1 pgbackrest pgbackrest 977 Jun 13 12:27 pg1-primary.crt
-rw-------. 1 pgbackrest pgbackrest 1708 Jun 13 12:27 pg1-primary.key
-rw-------. 1 pgbackrest pgbackrest 977 Jun 13 12:27 pgb-repo.crt
-rw-------. 1 pgbackrest pgbackrest 1704 Jun 13 12:27 pgb-repo.key
The servers are reachable from one another:
[postgres@pg1-primary ~]$ ping pgb-repo
PING pgb-repo.xxxx.com (XXX.XX.XXX.117) 56(84) bytes of data.
64 bytes from pgb-repo.xxxx.com (XXX.XX.XXX.117): icmp_seq=1 ttl=64 time=0.365 ms
64 bytes from pgb-repo.xxxx.com (XXX.XX.XXX.117): icmp_seq=2 ttl=64 time=0.421 ms
pgbackrest@pgb-repo$ ping pg1-primary
PING pg1-primary.xxxx.com (XXX.XX.XXX.116) 56(84) bytes of data.
64 bytes from pg1-primary.xxxx.com (XXX.XX.XXX.116): icmp_seq=1 ttl=64 time=0.325 ms
64 bytes from pg1-primary.xxxx.com (XXX.XX.XXX.116): icmp_seq=2 ttl=64 time=0.298 ms
So the issue actually had to do with the firewall blocking access to the default TLS port (8432) used by pgBackRest.
[root@pgb-server ~]# firewall-cmd --zone=public --add-port=8432/tcp --permanent
[root@pgb-server ~]# firewall-cmd --reload
Once the port was open in the firewall I could connect with telnet (to test access) and, of course, run my pgBackRest commands too.
[pgbackrest@pgb-server]$ telnet pg1-server 8432
Trying 172.XX.XXX.XXX...
Connected to pg1-server.
Escape character is '^]'.
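For completeness, I now also verify that the pgBackRest TLS server is actually listening and that the stanza works end to end; a quick sketch (assuming the default port 8432 and the stanza name used above):
# on the target host: confirm the pgbackrest tls-server is listening on 8432
ss -lntp | grep 8432
# from the repo host: end-to-end check of the stanza over TLS
pgbackrest --stanza=training --log-level-console=info check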

Redis service crashes with "Failed opening the RDB file systemdd (in server root dir /etc/cron.d) for saving: Permission denied"

I am running Redis server version 6.0.6 on Ubuntu 20.04. The process is run by the "redis" user.
Sometimes the Redis process crashes and gets restarted on its own, and when this happens a lot of the data cached in Redis becomes unavailable. This happens every few days/weeks. I can see the following messages in the logs: saving was working fine until 2:32:43 and suddenly failed at 2:34:15:
133121:C 23 Jun 2021 02:27:54.383 * RDB: 22 MB of memory used by copy-on-write
105798:M 23 Jun 2021 02:27:54.511 * Background saving terminated with success
105798:M 23 Jun 2021 02:29:46.279 * 10000 changes in 60 seconds. Saving...
105798:M 23 Jun 2021 02:29:46.354 * Background saving started by pid 133125
133125:C 23 Jun 2021 02:30:16.363 * DB saved on disk
133125:C 23 Jun 2021 02:30:16.464 * RDB: 18 MB of memory used by copy-on-write
105798:M 23 Jun 2021 02:30:16.583 * Background saving terminated with success
105798:M 23 Jun 2021 02:32:14.138 * 10000 changes in 60 seconds. Saving...
105798:M 23 Jun 2021 02:32:14.222 * Background saving started by pid 133131
133131:C 23 Jun 2021 02:32:42.924 * DB saved on disk
133131:C 23 Jun 2021 02:32:42.988 * RDB: 22 MB of memory used by copy-on-write
105798:M 23 Jun 2021 02:32:43.123 * Background saving terminated with success
105798:M 23 Jun 2021 02:34:14.958 * DB saved on disk
105798:M 23 Jun 2021 02:34:15.705 # Failed opening the RDB file systemdd (in server root dir /etc/cron.d) for saving: Permission denied
=== REDIS BUG REPORT START: Cut & paste starting from here ===
105798:M 23 Jun 2021 02:34:15.705 # Redis 6.0.6 crashed by signal: 11
105798:M 23 Jun 2021 02:34:15.705 # Crashed running the instruction at: 0x55f2e7e35099
105798:M 23 Jun 2021 02:34:15.705 # Accessing address: 0x149968
105798:M 23 Jun 2021 02:34:15.705 # Failed assertion: <no assertion failed> (<no file>:0)
------ STACK TRACE ------
EIP:
/usr/bin/redis-server 172.16.106.88:6379(je_malloc_usable_size+0x89)[0x55f2e7e35099]
Backtrace:
/usr/bin/redis-server 172.16.106.88:6379(logStackTrace+0x4f)[0x55f2e7db2bcf]
/usr/bin/redis-server 172.16.106.88:6379(sigsegvHandler+0xb5)[0x55f2e7db33d5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fb934c173c0]
/usr/bin/redis-server 172.16.106.88:6379(je_malloc_usable_size+0x89)[0x55f2e7e35099]
/usr/bin/redis-server 172.16.106.88:6379(+0x50b79)[0x55f2e7d72b79]
/usr/bin/redis-server 172.16.106.88:6379(rdbSave+0x2ba)[0x55f2e7d9345a]
/usr/bin/redis-server 172.16.106.88:6379(saveCommand+0x67)[0x55f2e7d94ab7]
/usr/bin/redis-server 172.16.106.88:6379(call+0xb1)[0x55f2e7d6a8b1]
/usr/bin/redis-server 172.16.106.88:6379(processCommand+0x4a6)[0x55f2e7d6b446]
/usr/bin/redis-server 172.16.106.88:6379(processCommandAndResetClient+0x14)[0x55f2e7d799e4]
/usr/bin/redis-server 172.16.106.88:6379(processInputBuffer+0x18f)[0x55f2e7d7e39f]
/usr/bin/redis-server 172.16.106.88:6379(+0xe10ac)[0x55f2e7e030ac]
/usr/bin/redis-server 172.16.106.88:6379(aeProcessEvents+0x303)[0x55f2e7d63b83]
/usr/bin/redis-server 172.16.106.88:6379(aeMain+0x1d)[0x55f2e7d63ebd]
/usr/bin/redis-server 172.16.106.88:6379(main+0x4e5)[0x55f2e7d603d5]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fb934a370b3]
/usr/bin/redis-server 172.16.106.88:6379(_start+0x2e)[0x55f2e7d606ae]
The service restarts on its own, the Redis server works fine for a few days/weeks, and then it crashes again with the same error!
I have checked several posts on SO, but none of them resolves my issue, since:
a) The instance where the Redis server is running is in a private network (public access is disabled).
b) The DB file name and dir have not been corrupted: the "config get dbfilename" and "config get dir" commands (see the checks below) show the default values.
c) The permissions of the directories are correct (/var/lib/redis is owned by redis with 755 permissions and /var/lib/redis/dump.rdb is owned by redis with 660 permissions).
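For reference, the checks behind (b) and (c) are roughly these (assuming the default local redis-cli connection):
# (b) the RDB file name and directory still show the defaults
redis-cli config get dbfilename   # expect: dump.rdb
redis-cli config get dir          # expect: /var/lib/redis
# (c) ownership and permissions on the data directory and dump file
ls -ld /var/lib/redis             # drwxr-xr-x redis redis
ls -l /var/lib/redis/dump.rdb     # -rw-rw---- redis redis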
Can anyone help me identify the root cause of this issue please?

Apache 2.4.10 hangs AH00485: scoreboard is full, not at MaxRequestWorkers

The Apache server stays up for a random amount of time, usually days, but eventually enters a hung state. When hung, the CPU load gradually spikes on the machine and new web requests go unanswered.
Error logs typically contain lots of these:
[Wed Jan 28 16:06:58.667188 2015] [mpm_event:error] [pid 25336:tid 1] AH00485: scoreboard is full, not at MaxRequestWorkers
Environment:
LDOM (VM) SunOS myhostname 5.10 Generic_118833-36 sun4v sparc SUNW,Sun-Fire-T200
http Conf:
StartServers 8
MinSpareServers Not set
MaxSpareServers Not set
ServerLimit 256
MaxRequestWorkers 100
MaxConnectionsPerChild 1000
KeepAlive On
TimeOut 3000
MaxKeepAliveRequests 50
KeepAliveTimeout 2
Current non-hung Score Board:
Server Version: Apache/2.4.10 (Unix)
Server MPM: event
Server Built: Oct 30 2014 16:29:03
Current Time: Wednesday, 28-Jan-2015 10:59:39 PST
Restart Time: Wednesday, 28-Jan-2015 09:49:21 PST
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 1 hour 10 minutes 17 seconds
Server load: 0.60 0.46 0.41
Total accesses: 1134 - Total Traffic: 2.2 GB
CPU Usage: u9.07 s16.94 cu609.51 cs69.31 - 16.7% CPU load
.269 requests/sec - 0.5 MB/second - 2.0 MB/request
1 requests currently being processed, 99 idle workers
PID Connections Threads Async connections
total accepting busy idle writing keep-alive closing
25337 0 yes 1 24 0 0 0
25338 1 yes 0 25 1 0 0
25339 1 yes 0 25 0 0 1
25340 1 yes 0 25 0 0 1
Sum 3 1 99 1 0 2
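The next time it hangs I plan to capture the scoreboard periodically so I can watch it fill up; a rough sketch (it assumes mod_status stays reachable on localhost, which may not hold once the server is fully wedged):
# append a machine-readable status snapshot every 30 seconds
while true; do
    date
    curl -s "http://localhost/server-status?auto"
    sleep 30
done >> /var/tmp/httpd-scoreboard.log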
Any thoughts on httpd.conf tuning, OS patches, or Apache bug fixes are appreciated.
Yes, I have seen the open ASF Bugzilla entry for the same error message.
This is a production server, so as you can imagine, having it go down at random times (usually while I am asleep) is not fun!

CouchDB crashes after few minutes running

CouchDB is being very unpleasant to me. Neither the documentation nor any tips have helped me at all. The situation is this:
FreeBSD 9.2 amd64
couchdb-1.5.0,2 installed from ports
npm couchapp
npm semver.
I started replication in CouchDB for the Node/npm repo and, amazingly, it crashes every several minutes. I wrote a script that tests the process every 5 seconds (a sketch of it is shown after the output below):
13:40:53
13:48:11 7m42s [uptime growing]
13:56:09 7m58s
14:04:11 8m02s
14:12:23 8m12s
14:21:14 8m12s
14:30:08 8m54s
14:40:48 10m40s
14:57:13 16m35s [uptime stops growing]
15:08:29 11m16s
...
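The check script itself is trivial; a minimal sketch of the idea (it assumes the Erlang VM shows up as beam.smp, and the real script also prints how long the previous instance stayed up):
#!/bin/sh
# poll every 5 seconds and log the time at which the beam process disappears
while true; do
    if ! pgrep -q -x beam.smp; then
        date '+%H:%M:%S'                               # crash detected
        while ! pgrep -q -x beam.smp; do sleep 5; done # wait for heart to restart it
    fi
    sleep 5
done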
couch.log (not always; sometimes nothing at all):
[Tue, 06 May 2014 12:59:51 GMT] [error] [<0.134.0>] Error in replication `[REPLICATION_HASH]+continuous` (triggered by document `npmjs_repl`): timeout
Restarting replication in 40 seconds.
[info] [<0.372.0>] Replication `"[REPLICATION_HASH]+continuous"` is using:
4 worker processes
a worker batch size of 500
20 HTTP connections
a connection timeout of 30000 milliseconds
10 retries per request
socket options are: [{keepalive,true},{nodelay,false}]
source start sequence 203628
[Tue, 06 May 2014 13:00:32 GMT] [info] [<0.372.0>] Replication `"[REPLICATION_HASH]7+continuous"` is using:
4 worker processes
a worker batch size of 500
20 HTTP connections
a connection timeout of 30000 milliseconds
10 retries per request
socket options are: [{keepalive,true},{nodelay,false}]
source start sequence 203628
err.log (on every crash):
heart: Tue May 6 15:06:14 2014: heart-beat time-out, no activity for 13 seconds
heart: Tue May 6 15:06:16 2014: Executed "/usr/local/bin/couchdb -k" -> 0. Terminating.
heart_beat_kill_pid = 52979
heart_beat_timeout = 11
truss output:
...
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
mmap(0x0,4194304,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 17431527424 (0x40f000000)
SIGNAL 9 (SIGKILL)
process exit, rval = 0
Thanks for helping.

Apache performance tuning with 1 GB of RAM via httpd.conf

I have a 1 GB VPS and Apache slows to a crawl almost from startup. I ran ApacheBench against a static .html file and the results are no better. However, the site will run both MySQL and PHP with a high volume of AJAX requests, so I'd like to tune for that.
When I restart, error logs show this almost immediately:
[error] server reached MaxClients setting, consider raising the MaxClients setting
ab -n 1000 -c 1000
shows:
Document Path: /static.html
Document Length: 7 bytes
Concurrency Level: 1000
Time taken for tests: 57.784 seconds
Complete requests: 1000
Failed requests: 64
(Connect: 0, Receive: 0, Length: 64, Exceptions: 0)
Write errors: 0
Total transferred: 309816 bytes
HTML transferred: 6552 bytes
Requests per second: 17.31 [#/sec] (mean)
Time per request: 57784.327 [ms] (mean)
Time per request: 57.784 [ms] (mean, across all concurrent requests)
Transfer rate: 5.24 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 25 13.4 25 48
Processing: 1070 16183 15379.4 9601 57737
Waiting: 0 14205 15176.5 9591 42516
Total: 1070 16208 15385.0 9635 57783
Percentage of the requests served within a certain time (ms)
50% 9635
66% 20591
75% 20629
80% 36357
90% 42518
95% 42538
98% 42556
99% 42560
100% 57783 (longest request)
If I run ab against a PHP file, it sometimes finishes, but most of the time it doesn't, and it sometimes reports errors like
apr_socket_recv: Connection reset by peer (104)
and
socket: No buffer space available (105)
httpd.conf items:
Timeout 10
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 1
<IfModule prefork.c>
StartServers 3
MinSpareServers 5
MaxSpareServers 9
ServerLimit 40
MaxClients 40
MaxRequestsPerChild 5000
</IfModule>
top output (CPU and the 1-minute load are very erratic during testing):
top - 10:44:51 up 11:50, 3 users, load average: 0.17, 0.42, 0.90
Tasks: 84 total, 2 running, 82 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 3.1%sy, 0.0%ni, 94.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1793072k total, 743604k used, 1049468k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 0k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21831 mysql 18 0 506m 71m 6688 S 0.7 4.1 4:03.18 mysqld
1828 root 15 0 113m 52m 2052 S 0.0 3.0 0:02.85 spamd
1830 popuser 18 0 113m 51m 956 S 0.0 2.9 0:00.00 spamd
8012 apache 15 0 327m 35m 17m S 3.7 2.0 0:11.83 httpd
8041 apache 15 0 320m 28m 15m S 0.0 1.6 0:11.83 httpd
8022 apache 15 0 321m 27m 14m S 2.3 1.6 0:11.05 httpd
8033 apache 15 0 320m 27m 14m S 1.7 1.6 0:10.06 httpd
Is there something obviously wrong here? Or what would be my next step in troubleshooting?
Sounds like you don't have enough memory: 1 GB isn't much when you're running PHP under prefork and MySQL on the same server. Your MaxClients should probably be 10-20, not 40.
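A quick back-of-the-envelope check is to take the average resident size of an httpd child and divide it into the memory you can actually spare for Apache; something like this (the numbers are only illustrative):
# average resident memory per httpd child, in MB
ps -C httpd -o rss= | awk '{ sum += $1; n++ } END { printf "avg RSS: %.0f MB\n", sum / n / 1024 }'
# rough MaxClients estimate: RAM left for Apache / average child size
# e.g. ~600 MB spare / ~30 MB per child  =>  MaxClients around 20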
A few weeks ago I wrote a script for tuning Apache httpd that should help determine the maximum values for your server. You can find the weblog entry at http://surniaulula.com/2012/11/09/check-apache-httpd-mpm-config-limits/ and the script is on Google Code as well.
Enjoy!
js.