Scylla node going down due to storage I/O error

The Scylla nodes suddenly go down (Down/Normal state).
I found this while checking the logs:
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 11] storage_service - Disk error: std::system_error (error system:61, No data available)
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 11] sstable - failed reading index for /var/lib/scylla/data/idgraph1/graphindex-48ff28e0322211ea92ea00000000000a/mc-1019-big-Data.db: storage_io_error (Storage I/O error: 61: No data available)
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 0] storage_service - Stop transport: starts
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 0] storage_proxy - Exception when communicating with 10.38.0.5: storage_io_error (Storage I/O error: 61: No data available)
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 0] storage_service - Thrift server stopped
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 0] storage_service - CQL server stopped
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 0] storage_service - Stop transport: shutdown rpc and cql server done
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 0] gossip - My status = NORMAL
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 0] gossip - Announcing shutdown
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 0] storage_service - Node 10.38.0.5 state jump to normal
Feb 06 08:37:11 scylla-zeograph-prod-eu-3 scylla[13753]: [shard 11] sstable - failed reading index for /var/lib/scylla/data/idgraph1/graphindex-48ff28e0322211ea92ea00000000000a/mc-1019-big-Data.db: storage_io_error (Storage I/O error: 61: No data available)
What could the possible issue be?

First of all, you should know that when Scylla cannot read one of its database files (as happened in this case), it refuses to boot at all, as you noticed. While it would have been easy to just skip this error and continue reading the remaining files, this is dangerous: the node could then answer requests with only a subset of the data, or potentially even corrupted data. Since data in Scylla is normally replicated across several nodes (often 3), it is safer to have one node go down and the other two keep answering (until the operator eventually brings the third back up) than to have the node come up with incorrect data.
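To make the replication point concrete, here is a minimal sketch, assuming a replication factor of 3 and the Python cassandra/scylla driver (the keyspace idgraph1 and table graphindex are taken from the sstable path in your log; the contact points are hypothetical). A QUORUM read only needs 2 of the 3 replicas, so it keeps succeeding while the broken node is down:
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Hypothetical contact points; 10.38.0.5 appears in the log, the others are placeholders.
cluster = Cluster(["10.38.0.5", "10.38.0.6", "10.38.0.7"])
session = cluster.connect("idgraph1")

# With RF=3, QUORUM means 2 replicas must answer, so one node being down is tolerated.
stmt = SimpleStatement(
    "SELECT * FROM graphindex LIMIT 10",
    consistency_level=ConsistencyLevel.QUORUM,
)
for row in session.execute(stmt):
    print(row)

cluster.shutdown()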
With that introduction out of the way, I guess your next question is why you got this I/O error at all. The ENODATA you got isn't the run-of-the-mill I/O error. As Avi suggested in a comment, please check whether the system log also reports errors. What kind of filesystem is /var/lib/scylla/data/ on? If this problem persists, and you can reproduce it on a recent version of Scylla, you can also ask this question on the Scylla developer mailing list (scylladb-dev@googlegroups.com).
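To illustrate what I mean by checking the system log and the filesystem, here is a rough sketch in Python (plain Linux inspection, nothing Scylla-specific; the keyword list is just a guess at what a failing disk usually prints):
import subprocess

# 1. Look for kernel-level disk/filesystem errors around the time of the failure.
dmesg = subprocess.run(["dmesg", "--ctime"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    if any(kw in line.lower() for kw in ("i/o error", "blk_update_request", "ext4", "xfs", "nvme")):
        print(line)

# 2. Find out which filesystem /var/lib/scylla/data lives on.
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype = line.split()[:3]
        if "/var/lib/scylla/data".startswith(mountpoint):
            print(mountpoint, device, fstype)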

Related

Redis crashing without any log errors

I'm debugging some weird behavior in my Redis instance: it crashes roughly every 2 days, but shows no errors whatsoever, only this in the logs:
1:C 10 Sep 2020 15:44:14.517 # Configuration loaded
1:M 10 Sep 2020 15:44:14.522 * Running mode=standalone, port=6379.
1:M 10 Sep 2020 15:44:14.522 # Server initialized
1:M 10 Sep 2020 15:44:14.524 * Ready to accept connections
1:C 12 Sep 2020 13:20:23.751 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 12 Sep 2020 13:20:23.751 # Redis version=6.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 12 Sep 2020 13:20:23.751 # Configuration loaded
1:M 12 Sep 2020 13:20:23.757 * Running mode=standalone, port=6379.
1:M 12 Sep 2020 13:20:23.757 # Server initialized
1:M 12 Sep 2020 13:20:23.758 * Ready to accept connections
That's all Redis says to me.
I have lots of RAM available, but Redis runs as a single instance in a Docker container. Could a lack of processing power cause this? Should I use multiple nodes? I don't want to set up a cluster just to find out the problem was something else. How can I track down the actual cause of the problem?
So, in the end, it was exactly what I thought it was not: a memory leak!
I had 16 GB that was slowly being consumed until Redis crashed, with no warning from Redis, the operating system, or Docker. I fixed the app that caused the leak and the problem was gone.
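In case it helps anyone else, this is roughly how the growth could have been spotted earlier; a minimal sketch assuming redis-py and an instance reachable on localhost:6379 (adjust the host for your Docker setup):
import time
import redis

r = redis.Redis(host="localhost", port=6379)

while True:
    # INFO memory exposes current, peak and RSS usage; a slow leak shows up as steady growth.
    info = r.info("memory")
    print(
        f"used={info['used_memory_human']} "
        f"peak={info['used_memory_peak_human']} "
        f"rss={info['used_memory_rss_human']}"
    )
    time.sleep(60)  # one sample per minute is plenty for a multi-day trend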

Redis timeout with almost no data in the database, using the .NET client

I received this error:
StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms),
next: GET RetryCount, inst: 3, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 7, in-pipe: 0, out-pipe: 0,
serverEndpoint: redis:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: 18745af38fec,
IOCP: (Busy=0,Free=1000,Min=1,Max=1000),
WORKER: (Busy=6,Free=32761,Min=1,Max=32767), v: 2.1.58.34321
(Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
We can see that there is only a single message in the queue (qs=1) and that there are only 7 bytes waiting to be read (in=7). Redis is used by 2 processes and holds settings for the system and stores logs.
It was a re-install, so no logs had been written yet and the database holds maybe 2-3 KB of data :)
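As a sanity check independent of the .NET client, I can time a plain GET against the same endpoint; a minimal sketch using redis-py (a Python client, not the StackExchange.Redis library from the error) with the redis:6379 endpoint and RetryCount key shown in the error message:
import time
import redis

r = redis.Redis(host="redis", port=6379, socket_timeout=5)

start = time.perf_counter()
value = r.get("RetryCount")
elapsed_ms = (time.perf_counter() - start) * 1000

# With only 2-3 KB of data, this should come back in single-digit milliseconds,
# nowhere near the 5000 ms budget from the timeout.
print(f"GET RetryCount -> {value!r} in {elapsed_ms:.1f} ms")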
This is the only output from Redis:
1:C 12 Sep 2020 15:20:49.293 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 12 Sep 2020 15:20:49.293 # Redis version=6.0.8, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 12 Sep 2020 15:20:49.293 # Configuration loaded
1:M 12 Sep 2020 15:20:49.296 * Running mode=standalone, port=6379.
1:M 12 Sep 2020 15:20:49.296 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 12 Sep 2020 15:20:49.296 # Server initialized
1:M 12 Sep 2020 15:20:49.296 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
1:M 12 Sep 2020 15:20:49.296 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled (set to 'madvise' or 'never').
1:M 12 Sep 2020 15:20:49.305 * DB loaded from append only file: 0.000 seconds
1:M 12 Sep 2020 15:20:49.305 * Ready to accept connections
So it looks like nothing went wrong on that side.
The two processes accessing it run in Docker containers, as does Redis, all on a single AWS instance with plenty of RAM and disk available.
This is also a one-time event; it has never happened before with the same config.
I'm not very experienced with Redis; is there anything in the error message that looks suspicious?

Redis service automatically stops after a few minutes of running

On my Ubuntu machine, the Redis server was running fine when it suddenly stopped. After I start it again, it automatically stops after a few minutes, so I start it again, and so on. Why is this happening?
Here are the logs from when I start Redis:
21479:C 29 Apr 21:59:10.986 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
21479:C 29 Apr 21:59:10.987 # Redis version=4.0.9, bits=64, commit=00000000, modified=0, pid=21479, just started
21479:C 29 Apr 21:59:10.987 # Configuration loaded
21480:M 29 Apr 21:59:10.990 * Increased maximum number of open files to 10032 (it was originally set to 1024).
21480:M 29 Apr 21:59:10.991 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
21480:M 29 Apr 21:59:10.992 # Server initialized
21480:M 29 Apr 21:59:14.588 * DB loaded from disk: 3.596 seconds
21480:M 29 Apr 21:59:14.591 * Ready to accept connections

Redis: Error writing to the AOF file: Quota exceeded

When we run the performance test, we get this error in the Redis log:
11:M 06 Jun 03:26:02.640 # Bad message length or signature received from Cluster bus.
11:M 06 Jun 03:26:25.429 # Bad message length or signature received from Cluster bus.
11:M 06 Jun 03:26:25.434 # Bad message length or signature received from Cluster bus.
11:M 06 Jun 03:26:27.031 # Error writing to the AOF file: Quota exceeded
Could someone help me?

Aerospike DB always starts in COLD mode

It's stated here that Aerospike should try to start in warm mode, meaning it reuses the same memory region that holds the keys. Instead, every time the database is restarted, all keys are loaded back from the SSD drive, which can take tens of minutes if not hours. What I see in the log is the following:
Oct 12 2015 03:24:11 GMT: INFO (config): (cfg.c::3234) Node id bb9e10daab0c902
Oct 12 2015 03:24:11 GMT: INFO (namespace): (namespace_cold.c::101) ns organic **beginning COLD start**
Oct 12 2015 03:24:11 GMT: INFO (drv_ssd): (drv_ssd.c::3607) opened device /dev/xvdb: usable size 322122547200, io-min-size 512
Oct 12 2015 03:24:11 GMT: INFO (drv_ssd): (drv_ssd.c::3681) shadow device /dev/xvdc is compatible with main device
Oct 12 2015 03:24:11 GMT: INFO (drv_ssd): (drv_ssd.c::1107) /dev/xvdb has 307200 wblocks of size 1048576
Oct 12 2015 03:24:11 GMT: INFO (drv_ssd): (drv_ssd.c::3141) device /dev/xvdb: reading device to load index
Oct 12 2015 03:24:11 GMT: INFO (drv_ssd): (drv_ssd.c::3146) In TID 104520: Using arena #150 for loading data for namespace "organic"
Oct 12 2015 03:24:13 GMT: INFO (drv_ssd): (drv_ssd.c::3942) {organic} loaded 962647 records, 0 subrecords, /dev/xvdb 0%
What could be the reason that Aerospike fails to perform a fast restart?
Thanks!
You are using the Community Edition of the software. Warm start is not supported in it; it is available only in the Enterprise Edition.