Aerospike: percentage of available write blocks is low even though hard disk space is available

We are facing this problem. The configuration is as follows:
Aerospike version : 3.14
Underlying hard disk : non-SSD
Variable            Value
memory-size         5 GB
free-pct-memory     98 %
available_pct       4 %
max-void-time       0 millisec
stop-writes         0
stop-writes-pct     90 %
hwm-breached        true
default-ttl         604,800 sec
max-ttl             315,360,000 sec
enable-xdr          false
single-bin          false
data-in-memory      false
Can anybody please help us out with this? What could be a potential reason for this?

Aerospike only writes to free blocks. A block may contain any number of records that fit. If your write/update pattern is such that a block never falls below 50% active records (the default threshold for defragmenting: defrag-lwm-pct), then you have a bunch of "empty" space that can't be utilized. Read more about defrag in the managing storage page.
Recovering from this is much easier with a cluster that's not seeing any writes. You can increase defrag-lwm-pct so that more blocks are eligible and get defragmented.
Another cause could be just that the HDD isn't fast enough to keep up with defragmentation.
You can read more on possible resolutions in the Aerospike KB - Recovering from Available Percent Zero. Don't read past "Stop service on a node..."
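For example, defrag-lwm-pct can be raised dynamically with asinfo (a sketch only - the namespace name test and the value 60 are placeholders, and the KB article above is worth reading first, since more aggressive defrag also increases write load on the device):
$ asinfo -v 'set-config:context=namespace;id=test;defrag-lwm-pct=60'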

You are basically not defragging your persistence storage device (75GB per node). From the snapshot you have posted, you have about a million records on 3 nodes with 21 million expired. So it looks like you are writing records with a very short TTL and defrag is unable to keep up.
Can you post a few lines of output, while you are in this state, of:
$ grep defrag /var/log/aerospike/aerospike.log
and
$ grep thr_nsup /var/log/aerospike/aerospike.log
What is your write/update load? My suspicion is that you are only creating short-TTL records and reading them, not updating.
Depending on what you are doing, increasing defrag-lwm-pct may actually make things worse for you. I would also tweak nsup-delete-sleep from its default of 100 microseconds, but it will depend on what your log greps above show. So post those and let's see.
(Edit: Also, the fact that you are not seeing evictions even though you are above the 50% HWM on persistence storage means your nsup thread is taking a very long time to run. That again points to the nsup-delete-sleep value needing tuning for your setup.)
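If it does come to tuning nsup-delete-sleep, it can also be changed dynamically (a sketch only, assuming the 3.x service-context syntax; the value 10 is purely a placeholder, and whether to raise or lower it should follow from what the greps show):
$ asinfo -v 'set-config:context=service;nsup-delete-sleep=10'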


The disk IOPS measured by scylla_setup iotune on my disk is different from the fio test data

When I use scylla_setup, the iotune result for my disk is:
Measuring sequential write bandwidth: 473 MB/s
Measuring sequential read bandwidth: 499 MB/s
Measuring random write IOPS: 1902 IOPS
Measuring random read IOPS: 1999 IOPS
The IOPS is 1900-2000.
When I use fio with:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/dev/sdc1 --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
the result is
test: (groupid=0, jobs=1): err= 0: pid=11697: Wed Jun 26 08:58:13 2019
read: IOPS=47.6k, BW=186MiB/s (195MB/s)(3070MiB/16521msec)
bw ( KiB/s): min=187240, max=192136, per=100.00%, avg=190278.42, stdev=985.15, samples=33
iops : min=46810, max=48034, avg=47569.61, stdev=246.38, samples=33
write: IOPS=15.9k, BW=62.1MiB/s (65.1MB/s)(1026MiB/16521msec)
bw ( KiB/s): min=62656, max=65072, per=100.00%, avg=63591.52, stdev=590.96, samples=33
iops : min=15664, max=16268, avg=15897.88, stdev=147.74, samples=33
cpu : usr=4.82%, sys=12.81%, ctx=164053, majf=0, minf=23
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwt: total=785920,262656,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=186MiB/s (195MB/s), 186MiB/s-186MiB/s (195MB/s-195MB/s), io=3070MiB (3219MB), run=16521-16521msec
WRITE: bw=62.1MiB/s (65.1MB/s), 62.1MiB/s-62.1MiB/s (65.1MB/s-65.1MB/s), io=1026MiB (1076MB), run=16521-16521msec
Disk stats (read/write):
sdc: ios=780115/260679, merge=0/0, ticks=792798/230409, in_queue=1023170, util=99.47%
Read IOPS is 46000-48000, write IOPS is 15000-16000.
(NB: It looks like the questioner filed this as a Scylla Github issue too - https://github.com/scylladb/scylla/issues/4604 )
[Why is] the disk iops from scylla_setup iotune [...] different from fio test data
Different benchmarks, different results:
Scylla may have been using a much bigger block size (e.g. 64k) per I/O (this is likely the biggest factor). As you make the block size bigger (up to some maximum, due to diminishing returns), the bandwidth (i.e. the total amount of data you can send in, say, a second) achieved with that block size goes up, but the IOPS you get will typically go down (you are sending more data per I/O, after all). This is normal! (See the example fio run after this list.)
Scylla could be using buffered I/O (rather than direct I/O)
Scylla may have been benchmarking reads and writes separately
Scylla may have been using a bigger queue depth
Scylla may have been batching its submissions differently
Scylla may be writing a different type of data
And so on...
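To see the block-size effect for yourself, you could re-run fio on the same device with a bigger block size and compare (a sketch only - this is my invocation, not the one iotune actually uses; adjust the filename, size and runtime to your setup):
$ fio --name=bigblock --ioengine=libaio --direct=1 --filename=/dev/sdc1 --bs=64k --iodepth=64 --size=4G --runtime=30 --readwrite=randrw --rwmixread=75
You should see the MB/s go up and the IOPS go down relative to the 4k run above.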
In general, it's very difficult to take benchmarks done with different tools and compare them directly to each other - you would need to know what they are doing under the hood for any comparison to be meaningful. Trying to look at IOPS or bandwidth in isolation without more context is meaningless, as you typically trade one off against the other. It's better to use the same benchmark tool with identical options to compare two different machines, or to measure the impact of tuning changes on the same machine.
TL;DR: This is likely an apples-to-oranges comparison where the tools are measuring different contexts.
PS: gtod_reduce is a go-faster stripe that very few people actually need. If your hardware isn't capable of doing gigabytes per second and you're not seeing your CPU maxed out, it's unlikely that reducing gettimeofday calls is going to nudge the result very much.
(This question might be more appropriate for Server Fault (and thus get better replies there) because it's not directly about programming)

Kafka Connect S3 Connector OutOfMemory errors with TimeBasedPartitioner

I'm currently working with the Kafka Connect S3 Sink Connector 3.3.1 to copy Kafka messages over to S3 and I have OutOfMemory errors when processing late data.
I know it looks like a long question, but I tried my best to make it clear and simple to understand.
I highly appreciate your help.
High level info
The connector does a simple byte-to-byte copy of the Kafka messages and adds the length of the message at the beginning of the byte array (for decompression purposes).
This is the role of the CustomByteArrayFormat class (see configs below)
The data is partitioned and bucketed according to the Record timestamp
The CustomTimeBasedPartitioner extends the io.confluent.connect.storage.partitioner.TimeBasedPartitioner and its sole purpose is to override the generatePartitionedPath method to put the topic at the end of the path.
The total heap size of the Kafka Connect process is 24 GB (only one node)
The connector processes between 8,000 and 10,000 messages per second
Each message has a size close to 1 KB
The Kafka topic has 32 partitions
Context of OutOfMemory errors
Those errors only happen when the connector has been down for several hours and has to catch up on data
When the connector is turned back on, it begins to catch up but fails very quickly with OutOfMemory errors
Possible but incomplete explanation
The timestamp.extractor configuration of the connector is set to Record when those OOM errors happen
Switching this configuration to Wallclock (i.e. the time of the Kafka Connect process) does NOT throw OOM errors and all of the late data can be processed, but the late data is no longer correctly bucketed
All of the late data is bucketed under the YYYY/MM/dd/HH/mm/topic-name path of the time at which the connector was turned back on
So my guess is that, while the connector is trying to correctly bucket the data according to the Record timestamp, it does too many parallel reads, leading to OOM errors
The "partition.duration.ms": "600000" parameter make the connector bucket data in six 10 minutes paths per hour (2018/06/20/12/[00|10|20|30|40|50] for 2018-06-20 at 12pm)
Thus, with 24h of late data, the connector would have to output data in 24h * 6 = 144 different S3 paths.
Each 10-minute folder contains 10,000 messages/sec * 600 seconds = 6,000,000 messages, for a size of 6 GB
If it does indeed read in parallel, that would make 864GB of data going into memory
I think that I have to configure a given set of parameters correctly in order to avoid those OOM errors, but I don't feel like I see the big picture
The "flush.size": "100000" imply that if there is more dans 100,000 messages read, they should be committed to files (and thus free memory)
With messages of 1KB, this means committing every 100MB
But even if there is 144 parallel readings, that would still only give a total of 14.4 GB, which is less than the 24GB of heap size available
Is the "flush.size" the number of record to read per partition before committing? Or maybe per connector's task?
The way I understand the "rotate.schedule.interval.ms": "600000" config is that data is going to be committed every 10 minutes, even when the 100,000 messages of flush.size haven't been reached.
My main question would be: what is the math that allows me to plan for memory usage, given:
the number of records per second
the size of the records
the number of Kafka partitions of the topics I read from
the number of Connector tasks (if this is relevant)
the number of buckets written to per hour (here 6 because of the "partition.duration.ms": "600000" config)
the maximum number of hours of late data to process
Configurations
S3 Sink Connector configurations
{
  "name": "xxxxxxx",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "s3.region": "us-east-1",
    "partition.duration.ms": "600000",
    "topics.dir": "xxxxx",
    "flush.size": "100000",
    "schema.compatibility": "NONE",
    "topics": "xxxxxx,xxxxxx",
    "tasks.max": "16",
    "s3.part.size": "52428800",
    "timezone": "UTC",
    "locale": "en",
    "format.class": "xxx.xxxx.xxx.CustomByteArrayFormat",
    "partitioner.class": "xxx.xxxx.xxx.CustomTimeBasedPartitioner",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
    "name": "xxxxxxxxx",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "s3.bucket.name": "xxxxxxx",
    "rotate.schedule.interval.ms": "600000",
    "path.format": "YYYY/MM/dd/HH/mm",
    "timestamp.extractor": "Record"
  }
}
Worker configurations
bootstrap.servers=XXXXXX
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
consumer.auto.offset.reset=earliest
consumer.max.partition.fetch.bytes=2097152
consumer.partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
group.id=xxxxxxx
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
rest.advertised.host.name=XXXX
Edit:
I forgot to add an example of the errors I have:
[2018-06-21 14:54:48,644] ERROR Task XXXXXXXXXXXXX-15 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:482)
java.lang.OutOfMemoryError: Java heap space
[2018-06-21 14:54:48,645] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask:483)
[2018-06-21 14:54:48,645] ERROR Task XXXXXXXXXXXXXXXX-15 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:148)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:484)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I was finally able to understand how the Heap Size usage works in the Kafka Connect S3 Connector
The S3 Connector will write the data of each Kafka partition into partitioned paths
The way those paths are partitioned depends on the partitioner.class parameter;
By default, it is by timestamp, and the value of partition.duration.ms will then determine the duration of each partitioned path.
The S3 Connector will allocate a buffer of s3.part.size bytes per Kafka partition (for all the topics read) and per partitioned path
Example with 20 partitions read, a timestamp.extractor set to Record, partition.duration.ms set to 1h, s3.part.size set to 50 MB
The Heap Size needed each hour is then equal to 20 * 50 MB = 1 GB;
But, timestamp.extractor being set to Record, messages with a timestamp corresponding to an earlier hour than the one at which they are read will be buffered in that earlier hour's buffer. Therefore, in reality, the connector will need a minimum of 20 * 50 MB * 2h = 2 GB of memory, because there are always late events, and more if there are events with a lateness greater than 1 hour;
Note that this isn't true if timestamp.extractor is set to Wallclock because there will virtually never be late events as far as Kafka Connect is concerned.
Those buffers are flushed (i.e. leave the memory) under 3 conditions:
rotate.schedule.interval.ms time has passed
This flush condition is always triggered.
rotate.interval.ms time has passed in terms of timestamp.extractor time
This means that if timestamp.extractor is set to Record, 10 minutes of Record time can pass in more or less than 10 minutes of actual time
For instance, when processing late data, 10 minutes' worth of data will be processed in a few seconds, and if rotate.interval.ms is set to 10 minutes then this condition will trigger every second (as it should);
On the contrary, if there is a pause in the flow of events, this condition will not trigger until it sees an event with a timestamp showing that more than rotate.interval.ms has passed since the condition last triggered.
flush.size messages have been read in less than min(rotate.schedule.interval.ms, rotate.interval.ms)
As with rotate.interval.ms, this condition might never trigger if there are not enough messages.
Thus, you need to plan for at least Kafka partitions * s3.part.size of Heap Size
If you are using a Record timestamp for partitioning, you should multiply it by max lateness in milliseconds / partition.duration.ms
This is a worst-case scenario where you have constantly late events in all partitions and for the whole range of max lateness in milliseconds (see the worked example at the end of this answer).
The S3 connector will also buffer consumer.max.partition.fetch.bytes bytes per partition when it reads from Kafka
This is set to 2.1 MB by default.
Finally, you should not assume that all of the Heap Size is available to buffer Kafka messages, because there are also a lot of other objects in it
A safe consideration would be to make sure that the buffering of Kafka messages does not go over 50% of the total available Heap Size.
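As a rough worked example applied to the configuration in the question (my arithmetic, using only the numbers posted above): 32 Kafka partitions * 50 MB (s3.part.size) ≈ 1.6 GB of buffers per open 10-minute window; catching up on 24 hours of Record-timestamped late data means up to 24 * 6 = 144 windows can be open at once in the worst case, i.e. 32 * 50 MB * 144 ≈ 230 GB, far beyond the 24 GB heap. Reducing s3.part.size, or increasing partition.duration.ms so that fewer windows are open at once, is what brings that number back under control.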
@raphael has explained the workings perfectly.
Pasting a small variation of a similar problem (too few events to process, but across many hours/days) that I had faced.
In my case I had about 150 connectors and 8 of them were failing with OOM as they had to process about 7 days' worth of data (our Kafka in the test env was down for about 2 weeks).
Steps Followed:
Reduced s3.part.size from 25 MB to 5 MB for all connectors. (In our scenario, rotate.interval was set to 10 min with flush.size as 10000; most of our events should easily fit within this limit.)
After this setting, only one connector was still getting OOM, and this connector goes into OOM within 5 seconds of start (based on heap analysis); its heap utilization shoots up from 200 MB to 1.5 GB. Looking at the Kafka offset lag, there were only 8K events to process across all 7 days. So this wasn't because of too many events to handle, but rather too few events to handle/flush.
Since we were using hourly partitioning and for a given hour there were hardly 100 events, all the buffers for these 7 days were getting created without being flushed (without being released to the JVM): 7 days * 24 hours * 5 MB * 3 partitions = 2.5 GB, against an Xmx of 1.5 GB.
Fix:
Perform one of the steps below until your connector catches up, and then restore your old config. (Recommended approach: 1)
Update the connector config to use a flush.size of 100 or 1000 records (depending on how your data is structured); see the config sketch after these options. Drawback: too many small files get created in an hour if the actual events number more than 1000.
Change the partitioning to daily, so there will only be daily partitions. Drawback: you'll now have a mix of hourly and daily partitions in your S3 bucket.
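For approach 1, the temporary override is just a couple of keys in the connector config (a sketch assuming the connector posted in the question; 1000 and 5 MB are illustrative values, everything else stays as posted, and the original values go back once the lag is cleared):
"flush.size": "1000",
"s3.part.size": "5242880"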

Redis 1+ min query time on lists with 15 larger JSON objects (20MB total)

I use Redis to cache database inserts. For this I created a list CACHE into which I push serialized JSON lists. In pseudocode:
let entries = [{a}, {b}, {c}, ...];
redis.rpush("CACHE", JSON.stringify(entries));
The idea is to run this code for an hour, then later do an
let all = redis.lrange("CACHE", 0, LIMIT);
processAndInsert(all);
redis.ltrim("CACHE", 0, all.length);
Now the thing is that each entries list can be relatively large (but far below the 512 MB / whatever Redis limit I read about). Each of a, b, c is an object of probably 20 bytes, and entries itself can easily have 100k+ objects / 2 MB.
My problem now is that even for very short CACHE lists of only 15 entries, a simple lrange can take many minutes(!), even from the redis-cli (my node.js actually dies with a "FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory", but that's a side comment).
The debug output for the list looks like this:
127.0.0.1:6379> debug object "CACHE"
Value at:00007FF202F4E330 refcount:1 encoding:linkedlist serializedlength:18104464 lru:12984004 lru_seconds_idle:1078
What is happening? Why is this so massively slow, and what can I do about it? This does not seem like normal slowness; something seems to be fundamentally wrong.
I am using a local Redis 2.8.2101 (x64), ioredis 1.6.1, node.js 0.12 on a relatively hardcore Windows 10 gaming machine (i5, 16GB RAM, 840 EVO SSD, ...) by the way.
Redis is great at doing lots of small operations, but not so great at doing small numbers of "very big" operations.
I think you should re-evaluate your algorithm and try to break your data apart into smaller chunks. Not only will you save bandwidth, you also won't lock your Redis instance for long amounts of time.
Redis offers many data structures you should be able to use for finer-grained control over your data.
Well, still, in this case, since you are running Redis locally, and assuming you are not running anything else but this code, I doubt that bandwidth or Redis itself is the problem. I'm more thinking this line:
JSON.stringify()
is the main culprit for the slow execution you are seeing.
JSON serialization of a 20 MB string is not a simple operation: the process needs to allocate many small strings, and it also has to go through your whole array and inspect each item individually. All of this will take a long time for a big object like this one.
Again, if you break your data apart and do smaller operations with Redis, you won't need the JSON serializer at all.
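As a rough sketch of that chunking idea, in the same pseudocode style as the question (my code, not a drop-in fix; it reuses the redis client and processAndInsert from the question): push each entry as its own list element instead of one 20 MB JSON blob, then drain in bounded batches.
// many small values instead of one huge string
for (let entry of entries) {
    redis.rpush("CACHE", JSON.stringify(entry));
}
// later, repeat until lrange returns nothing:
let batch = redis.lrange("CACHE", 0, 999);        // at most 1000 elements per round trip
processAndInsert(batch.map(function (s) { return JSON.parse(s); }));
redis.ltrim("CACHE", batch.length, -1);           // keep only the unprocessed tail
RPUSH also accepts multiple values per call, so the writes can be batched too if the extra round trips become a concern.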

datastax : Spark job fails : Removing BlockManager with no recent heart beats

I'm using DataStax 4.6. I have created a Cassandra table and stored 20 million (2 crore) records. I'm trying to read the data using Scala. The code works fine for a few records, but when I try to retrieve all 20 million records it displays the following error.
WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 172.20.98.17, 34224, 0) with no recent heart beats: 140948ms exceeds 45000ms
15/05/15 19:34:06 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(C15759,34224) not found
Any help?
This problem is often tied to GC pressure
Tuning your Timeouts
Increase the spark.storage.blockManagerHeartBeatMs so that Spark waits for the GC pause to end.
SPARK-734 recommends setting -Dspark.worker.timeout=30000 -Dspark.akka.timeout=30000 -Dspark.storage.blockManagerHeartBeatMs=30000 -Dspark.akka.retry.wait=30000 -Dspark.akka.frameSize=10000
Tuning your jobs for your JVM
spark.cassandra.input.split.size - allows you to change the level of parallelism of your Cassandra reads. Bigger split sizes mean that more data will have to reside in memory at the same time.
spark.storage.memoryFraction and spark.shuffle.memoryFraction - the amount of the heap that will be occupied by RDDs (as opposed to shuffle memory and Spark overhead). If you aren't doing any shuffles, you could increase this value. The Databricks folks say to make this similar in size to your old gen.
spark.executor.memory - Obviously this depends on your hardware. Per Databricks you can go up to 55 GB. Make sure to leave enough RAM for C* and for your OS and OS page cache. Remember that long GC pauses happen on larger heaps.
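For reference, here is a hedged sketch of how those knobs could be set in spark-defaults.conf (the values are illustrative placeholders only, not recommendations for your cluster; the same properties can also be passed on the command line):
spark.storage.blockManagerHeartBeatMs   30000
spark.cassandra.input.split.size        10000
spark.storage.memoryFraction            0.6
spark.shuffle.memoryFraction            0.2
spark.executor.memory                   8g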
Out of curiosity, are you frequently going to be extracting your entire C* table with Spark? What's the use case?

Redis (1.2.6) : Slow queries

We are using Redis 1.2.6 in a production environment. There are 161,804 keys in Redis. The machine has 2 GB RAM.
Problem:
Select queries to Redis take 0.02 sec on average, but sometimes they take 1.5-2.0 secs; I think this happens whenever Redis saves modified keys to disk.
One strange thing I noticed before and after restarting Redis:
Before the restart, "changes_since_last_save" was changing very fast and was reaching 3000+ (in 5 minutes). But after the restart, "changes_since_last_save" stays below 20 or so.
Redis stats before restart:
{:bgrewriteaof_in_progress=>"0", :arch_bits=>"64", :used_memory=>"53288487", :total_connections_received=>"586171", :multiplexing_api=>"epoll", :used_memory_human=>"50.82M", :total_commands_processed=>"54714152", :uptime_in_seconds=>"1629606", :changes_since_last_save=>"3142", :role=>"master", :uptime_in_days=>"18", :bgsave_in_progress=>"0", :db0=>"keys=161863,expires=10614", :connected_clients=>"13", :last_save_time=>"1280912841", :redis_version=>"1.2.6", :connected_slaves=>"1"}
Redis stats after restart:
{:used_memory_human=>"49.92M", :total_commands_processed=>"6012", :uptime_in_seconds=>"1872", :changes_since_last_save=>"2", :role=>"master", :uptime_in_days=>"0", :bgsave_in_progress=>"0", :db0=>"keys=161823,expires=10464", :connected_clients=>"13", :last_save_time=>"1280917477", :redis_version=>"1.2.6", :connected_slaves=>"1", :bgrewriteaof_in_progress=>"0", :arch_bits=>"64", :used_memory=>"52341658", :total_connections_received=>"252", :multiplexing_api=>"epoll"}
Not sure what is going wrong here.
Thanks in advance.
Sunil
By default Redis is configured to dump all data to disk from time to time, depending on the number of keys that changed in a time span (see the default config).
Another option is to use the append-only file, which is more lightweight but needs some kind of maintenance - you need to run BGREWRITEAOF every once in a while so that your log doesn't get too big. There's more about this in the Redis config file.
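For reference, the relevant redis.conf settings look like this (shown with the classic default save points; a sketch only - tune the thresholds, or switch to the append-only file, based on your own write rate):
# snapshotting (the classic defaults):
save 900 1
save 300 10
save 60 10000
# or, the append-only file instead:
appendonly yes
appendfsync everysec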
As Tobias says, you should switch to 2.0 as soon as you can since it's faster and, in many cases, uses less memory than 1.2.6.