CouchDB crashes after a few minutes of running

CouchDB has been very frustrating for me. Neither the documentation nor any tips I could find have helped at all. The situation is as follows:
FreeBSD 9.2 amd64
couchdb-1.5.0,2 installed from ports
npm couchapp
npm semver
I started replication for the node repo in CouchDB, and crashes happen every few minutes. I wrote a script that checks the process every 5 seconds:
13:40:53
13:48:11 7m42s [time between crashes trending upward]
13:56:09 7m58s
14:04:11 8m02s
14:12:23 8m12s
14:21:14 8m12s
14:30:08 8m54s
14:40:48 10m40s
14:57:13 16m35s [upward trend stops]
15:08:29 11m16s
...
couch.log (not always; sometimes there is nothing at all):
[Tue, 06 May 2014 12:59:51 GMT] [error] [<0.134.0>] Error in replication `[REPLICATION_HASH]+continuous` (triggered by document `npmjs_repl`): timeout
Restarting replication in 40 seconds.
[info] [<0.372.0>] Replication `"[REPLICATION_HASH]+continuous"` is using:
4 worker processes
a worker batch size of 500
20 HTTP connections
a connection timeout of 30000 milliseconds
10 retries per request
socket options are: [{keepalive,true},{nodelay,false}]
source start sequence 203628
[Tue, 06 May 2014 13:00:32 GMT] [info] [<0.372.0>] Replication `"[REPLICATION_HASH]7+continuous"` is using:
4 worker processes
a worker batch size of 500
20 HTTP connections
a connection timeout of 30000 milliseconds
10 retries per request
socket options are: [{keepalive,true},{nodelay,false}]
source start sequence 203628
err.log (logged on every crash):
heart: Tue May 6 15:06:14 2014: heart-beat time-out, no activity for 13 seconds
heart: Tue May 6 15:06:16 2014: Executed "/usr/local/bin/couchdb -k" -> 0. Terminating.
heart_beat_kill_pid = 52979
heart_beat_timeout = 11
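The err.log lines come from Erlang's heart watchdog: it killed the VM after seeing no heartbeat for 13 seconds against a configured heart_beat_timeout of 11. As a hedged workaround sketch (it only widens the watchdog window; it does not explain why the VM stops responding): heart reads its timeout, in seconds, from the HEART_BEAT_TIMEOUT environment variable at VM start. The 11-second value suggests the couchdb wrapper script sets the variable itself, so editing /usr/local/bin/couchdb may be needed instead of a plain export (an assumption about the port's script).
# Sketch only: raise the heart watchdog timeout before starting CouchDB.
export HEART_BEAT_TIMEOUT=30          # seconds; err.log shows 11 today
/usr/local/etc/rc.d/couchdb restart   # rc script path assumed for the port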
truss output:
...
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
kevent(3,0x0,0,{},256,{0.000000000 }) = 0 (0x0)
mmap(0x0,4194304,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 17431527424 (0x40f000000)
SIGNAL 9 (SIGKILL)
process exit, rval = 0
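For what it is worth, the couch.log excerpt above lists the knobs this replication runs with (4 workers, batch size 500, 20 HTTP connections, a 30000 ms connection timeout). In CouchDB 1.x those can be overridden per replication document, so a hedged sketch of a gentler npmjs_repl document might look like this; the source and target values are placeholders rather than anything taken from the logs, and updating an existing _replicator document also requires its current _rev:
curl -X PUT http://127.0.0.1:5984/_replicator/npmjs_repl \
  -H 'Content-Type: application/json' \
  -d '{
        "source": "http://example.com/registry",
        "target": "registry",
        "continuous": true,
        "worker_processes": 2,
        "http_connections": 10,
        "connection_timeout": 60000
      }'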
Thanks in advance for any help.

Related

Redis timeout with almost no data in the database, using the .NET client

I received this error:
StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms),
next: GET RetryCount, inst: 3, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 7, in-pipe: 0, out-pipe: 0,
serverEndpoint: redis:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: 18745af38fec,
IOCP: (Busy=0,Free=1000,Min=1,Max=1000),
WORKER: (Busy=6,Free=32761,Min=1,Max=32767), v: 2.1.58.34321
(Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
We can see that there is only a single message in the queue (qs=1) and that there are only 7 bytes waiting to be read (in=7). Redis is used by 2 processes and holds settings for the system and stores logs.
It was a fresh re-install, so no logs had been written yet, and the database holds maybe 2-3 KB of data :)
This is the only output from Redis:
1:C 12 Sep 2020 15:20:49.293 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 12 Sep 2020 15:20:49.293 # Redis version=6.0.8, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 12 Sep 2020 15:20:49.293 # Configuration loaded
1:M 12 Sep 2020 15:20:49.296 * Running mode=standalone, port=6379.
1:M 12 Sep 2020 15:20:49.296 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 12 Sep 2020 15:20:49.296 # Server initialized
1:M 12 Sep 2020 15:20:49.296 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 12 Sep 2020 15:20:49.296 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
1:M 12 Sep 2020 15:20:49.305 * DB loaded from append only file: 0.000 seconds
1:M 12 Sep 2020 15:20:49.305 * Ready to accept connections
So it looks like nothing went wrong on that side.
The two processes accessing it run in Docker containers, and so does Redis, all on a single AWS instance with plenty of RAM and disk available.
This was also a one-time event; it has never happened before with the same config.
I'm not very experienced with Redis; is there anything in the error message that would look suspicious?
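One concrete place to start: the Redis log above prints three kernel warnings itself, and since everything runs in Docker they have to be applied on the AWS host, not inside the container. A sketch covering exactly those three warnings:
sysctl -w net.core.somaxconn=511                            # TCP backlog warning
sysctl -w vm.overcommit_memory=1                            # background-save warning
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled  # THP warning
Separately, WORKER: (Busy=6,...,Min=1) in the error means the thread-pool minimum is 1 while 6 threads are busy; the linked Timeouts article covers how that throttles reads on the client side, so it is worth checking too.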

Apache server: many requests stuck in "R" (Reading Request)

Below is the apache2ctl status output with almost no users online.
For over five years we (a cloud ERP supplier) have been deploying instances on Google Cloud running Apache with mod_perl.
This week our largest server became slow and unresponsive, with no idle workers available. It turned out that increasing both MaxRequestWorkers and ServerLimit from 150 to 400 in mpm_prefork.conf got our server back up fast, as sketched below.
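A sketch of that change, assuming Ubuntu's usual layout (/etc/apache2/mods-available/mpm_prefork.conf; adjust the path and values to your setup):
<IfModule mpm_prefork_module>
    ServerLimit           400
    MaxRequestWorkers     400
</IfModule>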
I'm wondering why so many requests stay in "R" (Reading Request), at least 10 times more than there actually should be.
We did some further checking; DoS does not seem to be the issue, as on other servers, in different clouds such as AWS or Alibaba, we notice the same ratio of about 10 between requests actually being processed (R/W/K) and requests that sit in Reading mode.
What could cause this?
sudo /usr/sbin/apache2ctl status
Apache Server Status for localhost (via 127.0.0.1)
Server Version: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.29 OpenSSL/1.0.1f
mod_perl/2.0.8 Perl/v5.18.2
Server MPM: prefork
Server Built: Apr 3 2019 18:04:25
Current Time: Saturday, 29-Feb-2020 10:15:35 CET
Restart Time: Thursday, 27-Feb-2020 09:45:48 CET
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 2 days 29 minutes 47 seconds
Server load: 0.75 0.77 0.75
Total accesses: 1581181 - Total Traffic: 8.6 GB
CPU Usage: u30.32 s9.64 cu0 cs0 - .0229% CPU load
9.06 requests/sec - 51.5 kB/second - 5.7 kB/request
96 requests currently being processed, 9 idle workers
RRKRRRK_RKRKKRRRRRK_RRRRKRCK_RRRC_CKK_KCRKCRK_RCR__CKKCCRCRRRRRR
RRRRR.RRRKRRRKRRR_RR..R.K.RCRKR.CKK.RRKKR.W.RRKR.....RR.........
................................................................
................................................................
................................................................
................................................................
................
Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process

Bacula: Director's connection to SD for this Job was lost

SD 7.4.4 (Ubuntu 16), Director 7.4.4 (Ubuntu 16), FD 5.2.10 (Windows)
I'm having trouble backing up Windows clients with Bacula. I can run a backup just fine when the backup size is around 1 or 2 MB, but when running a backup of around 500 MB, I get the same error every time:
"Director's connection to SD for this Job was lost."
Some things to mention. When I issue status client:
Terminated Jobs: JobId Level Files Bytes Status Finished
======================================================================
81 Full 5,796 514.8 M OK 06-Nov-17 12:50 BackupComputerA
When I issue status dir:
06-Nov 17:58 acme-director JobId 81: Error: Director's connection to SD for this Job was lost.
06-Nov 17:58 acme-director JobId 81: Error: Bacula acme-director 7.4.4 (20Sep16):
Build OS: arm-unknown-linux-gnueabihf debian 9.0
JobId: 81
Job: BackupComputerA.2017-11-06_17.41.01_03
Backup Level: Full (upgraded from Incremental)
Client: "Computer-A-fd" 5.2.10 (28Jun12) Microsoft (build 9200), 32-bit,Cross-compile,Win32
FileSet: "Full Set" 2017-11-03 22:12:58
Pool: "RemoteFile" (From Job resource)
Catalog: "MyCatalog" (From Client resource)
Storage: "File1" (From Job resource)
Scheduled time: 06-Nov-2017 17:40:59
Start time: 06-Nov-2017 17:41:04
End time: 06-Nov-2017 17:58:00
Elapsed time: 16 mins 56 secs
Priority: 10
FD Files Written: 5,796
SD Files Written: 0
FD Bytes Written: 514,883,164 (514.8 MB)
SD Bytes Written: 0 (0 B)
Rate: 506.8 KB/s
Software Compression: 100.0% 1.0:1
Snapshot/VSS: yes
Encryption: yes
Accurate: no
Volume name(s):
Volume Session Id: 1
Volume Session Time: 1509989906
Last Volume Bytes: 8,045,880,119 (8.045 GB)
Non-fatal FD errors: 1
SD Errors: 0
FD termination status: OK
SD termination status: Error
Termination: *** Backup Error ***
About 5 minutes into the backup, I get a message:
Running Jobs:
Console connected at 06-Nov-17 18:08
JobId Type Level Files Bytes Name Status
======================================================================
83 Back Full 0 0 BackupComputerE has terminated
====
The job completes and terminates, but the connection is lost afterwards and I never get an "OK" for the status update.
I have added "Heartbeat Interval = 1 Minute" to all the daemons, and still no luck. I am using MySQL as the database on the Director.
Thanks in advance for any help.
For anyone having the same issues: I was able to fix this problem between the SD and the Director by adding the heartbeat interval to the clients and adjusting the keepalive time with
sysctl -w net.ipv4.tcp_keepalive_time=60
on both the Storage daemon and the Director. Connecting remotely to the Director with bconsole also interrupted jobs, so I ran bconsole on the same machine as the Director and connected via SSH. A sketch of the combination is below.
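A hedged sketch of that combination (the client name is taken from the job listing above; the rest of each resource is elided):
# bacula-dir.conf -- Client resource; the same directive also goes in the
# FD and SD configurations.
Client {
  Name = "Computer-A-fd"
  ...
  Heartbeat Interval = 1 minute
}
# On both the Director and the Storage daemon hosts:
sysctl -w net.ipv4.tcp_keepalive_time=60
echo 'net.ipv4.tcp_keepalive_time = 60' >> /etc/sysctl.conf   # persist across reboots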

Unable to diagnose MISCONF redis issue while launching celery worker server

I use a Celery worker server with Redis as the broker (for receiving tasks) as well as the result backend.
BROKER_URL = 'redis://localhost:6379/2'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/2'
app = Celery('myceleryapp', broker=BROKER_URL,backend=CELERY_RESULT_BACKEND)
I launch the celery worker server using celery -A myceleryapp worker -l info -c 8
The worker processes start processing my tasks from the Redis queue until, at some point, I receive the infamous MISCONF Redis error and the Celery worker process terminates:
Unrecoverable error: ResponseError('MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.',)
I checked the Redis log files in /var/log/redis, and the tail end of the file has the following:
24745:C 19 Aug 09:20:26.169 * RDB: 0 MB of memory used by copy-on-write
1590:M 19 Aug 09:20:26.247 * Background saving terminated with success
1590:M 19 Aug 09:25:27.080 * 10 changes in 300 seconds. Saving...
1590:M 19 Aug 09:25:27.081 * Background saving started by pid 25397
25397:C 19 Aug 09:25:27.082 # Write error saving DB on disk: No space left on device
1590:M 19 Aug 09:25:27.181 # Background saving error
1590:M 19 Aug 09:51:03.042 * 1 changes in 900 seconds. Saving...
1590:M 19 Aug 09:51:03.042 * Background saving started by pid 26341
26341:C 19 Aug 09:51:03.405 * DB saved on disk
26341:C 19 Aug 09:51:03.405 * RDB: 22 MB of memory used by copy-on-write
1590:M 19 Aug 09:51:03.487 * Background saving terminated with success
The dump.rdb file is being written to /var/lib/redis/dump.rdb.
Since the logs reported a No space left on device, I checked the disk space where /var is mounted and there seems to be sufficient space left (1.2GB).
How do I get to the root cause of this error if there is enough disk space? Of course, to prevent this error from happening, I could set config set stop-writes-on-bgsave-error no in redis-cli. But I want to get to the root cause of this error. Any help or pointers?
Maybe this is caused by the swap file: if the swap file has consumed the 1.2 GB of free space on your disk, Redis would complain that there is no space left to write.
Try the swapon -s command to check this.
I think 1.2 GB is not enough headroom if this disk also accepts swapped RAM pages; you should move the RDB dir to a bigger disk. The sketch below shows some quick checks.
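A short diagnostic sketch for confirming or ruling that out (the path follows the dump.rdb location from the question):
df -h /var/lib/redis   # block usage on the filesystem holding dump.rdb
df -i /var/lib/redis   # an out-of-inodes filesystem reports the same error
swapon -s              # the swap-file theory from the answer above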

Apache 2.4.10 hangs AH00485: scoreboard is full, not at MaxRequestWorkers

The Apache server stays up for a random amount of time, usually days, but eventually enters a hung state. When hung, the CPU load on the machine gradually spikes and new web server requests go unanswered.
Error logs typically contain lots of these:
[Wed Jan 28 16:06:58.667188 2015] [mpm_event:error] [pid 25336:tid 1] AH00485: scoreboard is full, not at MaxRequestWorkers
Environment:
LDOM (VM) SunOS myhostname 5.10 Generic_118833-36 sun4v sparc SUNW,Sun-Fire-T200
httpd conf:
StartServers 8
MinSpareServers Not set
MaxSpareServers Not set
ServerLimit 256
MaxRequestWorkers 100
MaxConnectionsPerChild 1000
KeepAlive On
TimeOut 3000
MaxKeepAliveRequests 50
KeepAliveTimeout 2
Current (non-hung) scoreboard:
Server Version: Apache/2.4.10 (Unix)
Server MPM: event
Server Built: Oct 30 2014 16:29:03
Current Time: Wednesday, 28-Jan-2015 10:59:39 PST
Restart Time: Wednesday, 28-Jan-2015 09:49:21 PST
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 1 hour 10 minutes 17 seconds
Server load: 0.60 0.46 0.41
Total accesses: 1134 - Total Traffic: 2.2 GB
CPU Usage: u9.07 s16.94 cu609.51 cs69.31 - 16.7% CPU load
.269 requests/sec - 0.5 MB/second - 2.0 MB/request
1 requests currently being processed, 99 idle workers
PID Connections Threads Async connections
total accepting busy idle writing keep-alive closing
25337 0 yes 1 24 0 0 0
25338 1 yes 0 25 1 0 0
25339 1 yes 0 25 0 0 1
25340 1 yes 0 25 0 0 1
Sum 3 1 99 1 0 2
Any thoughts on httpd conf tuning, OS patches, or Apache bug fixes are appreciated (a hedged workaround sketch follows below).
Yes, I have seen the open ASF Bugzilla report for the same error message.
This is a production server, so as you can imagine, having it go down at random times (usually while I am asleep) is not fun!
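Two details in the conf above suggest a workaround sketch, to be verified against the ASF Bugzilla report rather than taken as a confirmed fix. Apache's TimeOut is measured in seconds, so 3000 lets one stuck connection hold a thread (and its scoreboard slot) for 50 minutes, and MaxConnectionsPerChild 1000 recycles children regularly, which gives gracefully-finishing processes more chances to pile up in the scoreboard:
TimeOut 60                  # was 3000, i.e. 50 minutes per stuck connection
MaxConnectionsPerChild 0    # stop periodic recycling while the bug is unpatched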