Find records with TTL 0 in Aerospike - aerospike

My Aerospike cluster is hitting 50% disk usage and records have started evicting.
I have my doubts, because I do not write that many records to the cluster daily. Per record we set a 90-day TTL, and the default TTL for the namespace is 30 days.
My concern is that I assume we have records with TTL 0 which never get evicted and are not even being used.
How do I audit and find the records (number of objects) with TTL 0, so they can be changed immediately to the default TTL?
Thanks.

Couple of things:
Aerospike logs will tell you how many records do not have any expiration; look for the number of '0-vt' records in the following line (in my example, it is 0):
{ns-name} Records: 37118670, 0 0-vt, 0(377102877) expired, 185677(145304222) evicted,
0(0) set deletes, 0(0) set evicted. Evict ttls: 34560,38880,0.118. Waits: 0,0,8743.
Total time: 45467 ms
This will quickly give you the proportion of non-expirable records for that namespace (since the total number of records is right next to it).
In order to fully identify which records those are and potentially set an expiration, you would have to scan the records and update them... There is sample code in Java to scan a set (or a namespace) in this repo. There may be other, much more elegant solutions, but they would still involve writing a bit of code.
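For illustration only, here is a minimal sketch (not the repo's sample code) using the Aerospike Java client to scan a set, count the records that never expire, and touch them so they pick up a new TTL. The host, namespace, set name and 30-day TTL are placeholders for your own values.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.ScanPolicy;
import com.aerospike.client.policy.WritePolicy;

import java.util.concurrent.atomic.AtomicLong;

public class FindNonExpiringRecords {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);
        try {
            WritePolicy touchPolicy = new WritePolicy();
            touchPolicy.expiration = 30 * 24 * 3600; // new TTL in seconds (placeholder: 30 days)

            AtomicLong nonExpiring = new AtomicLong();

            client.scanAll(new ScanPolicy(), "test", "demo", (Key key, Record record) -> {
                // getTimeToLive() returns -1 when the record has no void time (TTL 0).
                if (record.getTimeToLive() == -1) {
                    nonExpiring.incrementAndGet();
                    // touch() resets the record's TTL without rewriting its bins.
                    client.touch(touchPolicy, key);
                }
            });

            System.out.println("Records without expiration: " + nonExpiring.get());
        } finally {
            client.close();
        }
    }
}

Touching records from inside the scan callback keeps the sketch short; on a large namespace you would probably want to batch or throttle the updates.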

You look for the following log line:
Sep 02 2016 15:14:00 GMT: INFO (nsup): (thr_nsup.c:1114) {test} Records: 3725361, 0 0-vt, 0(0) expired, 0(0) evicted, 0(0) set deletes. Evict ttl: 0. Waits: 0,0,0. Total time: 765 ms
This is the namespace supervisor reporting on a namespace called 'test', and it is saying that there are 3,725,361 records, of which 0 have a void time of 0 (0-vt).
TTL is the delta between the time now and the void time of the record (when it will expire), so a record whose void time is 0 will never expire or be evicted.

Related

How to interpret "evicted_keys" from Redis Info

We are using ElastiCache for Redis, and are confused by its Evictions metric.
I'm curious what the unit is on the evicted_keys metric from Redis INFO. The ElastiCache docs say it is a count: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.Redis.html but for our application we have observed that the "Evictions" metric (which is derived from evicted_keys) fluctuates up and down, indicating it's not a count. I would expect a count never to decrease, since we cannot "un-evict" a key. I'm wondering if evicted_keys is actually a rate (e.g. evictions/sec), which would explain why it can fluctuate.
Thank you in advance for any responses!
From INFO command:
evicted_keys: Number of evicted keys due to maxmemory limit
To learn more about evictions see Using Redis as an LRU cache - Eviction policies
This counter is zero when the server starts, and it is only reset if you issue the CONFIG RESETSTAT command. However, on ElastiCache, this command is not available.
That said, ElastiCache derives the metric from this value, by calculating the difference between data-points.
Redis evicted_keys:    0   5   12   18   22  ...
CloudWatch Evictions:  0   5    7    6    4  ...
This is the usual pattern in CloudWatch metrics. It allows you to use SUM if you want the cumulative value, but also to detect rate changes or spikes easily.
Think, for example, that you want to alarm if evictions exceed 10,000 over a one-minute period. If ElastiCache stored the cumulative value from Redis directly as a metric, this would be hard to accomplish.
Also, by committing the metric only as the keys evicted during the period, you are protected from the data distortion of a server reset or a value overflow. While the Redis INFO value would go back to zero, on ElastiCache you still get the value for the period and you can still do a running sum over any period.
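To make the difference-between-data-points idea concrete (this is only an illustration of the principle, not ElastiCache's actual implementation), here is a small sketch that turns a cumulative counter series into per-period deltas and treats a counter reset as a fresh baseline:

import java.util.ArrayList;
import java.util.List;

public class CounterDeltas {

    // Convert cumulative samples (e.g. evicted_keys over time) into per-period deltas.
    static List<Long> toDeltas(List<Long> cumulative) {
        List<Long> deltas = new ArrayList<>();
        long previous = 0;
        for (long sample : cumulative) {
            // A sample lower than the previous one indicates a server reset;
            // count the whole new value for that period instead of a negative delta.
            deltas.add(sample >= previous ? sample - previous : sample);
            previous = sample;
        }
        return deltas;
    }

    public static void main(String[] args) {
        // The series from the answer: 0, 5, 12, 18, 22 -> 0, 5, 7, 6, 4
        System.out.println(toDeltas(List.of(0L, 5L, 12L, 18L, 22L)));
    }
}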

How to set TTL on Rocks DB properly?

I am trying to use RocksDB with TTL. The way I initialise RocksDB is as below:
options.setCreateIfMissing(true)
       .setWriteBufferSize(8 * SizeUnit.KB)
       .setMaxWriteBufferNumber(3)
       .setCompressionType(CompressionType.LZ4_COMPRESSION)
       .setKeepLogFileNum(1);
db = TtlDB.open(options, this.dbpath, 10, false);
I have set the TTL to 10 seconds, but the key-value pairs are not being deleted after 10 seconds. What's happening here?
That's by design:
This API should be used to open the db when key-values inserted are meant to be removed from the db in a non-strict 'ttl' amount of time therefore, this guarantees that key-values inserted will remain in the db for at least ttl amount of time and the db will make efforts to remove the key-values as soon as possible after ttl seconds of their insertion
-- from the RocksDB Wiki-page on TTL.
That means values are only removed during compaction, and staleness is not checked during reads.
One of the good things about RocksDB is that its source is quite readable. The files you would want to look at are the header and source for TtlDb. In the header you will find the compaction filter which removes stale values (the filter's contract is documented well in its header). In the TtlDb source you can verify for yourself that Get does not check whether the value is stale; it just strips the timestamp (which is simply appended to the value on insert).
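To make that behaviour concrete, here is a minimal, self-contained sketch (assuming a recent RocksJava version; the path, TTL and values are placeholders): a key written through TtlDB is still readable after its TTL has elapsed, and should only disappear once a compaction has run over it.

import org.rocksdb.CompressionType;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.TtlDB;

public class TtlDemo {
    public static void main(String[] args) throws RocksDBException, InterruptedException {
        RocksDB.loadLibrary();

        try (Options options = new Options()
                .setCreateIfMissing(true)
                .setCompressionType(CompressionType.LZ4_COMPRESSION);
             // 10-second TTL, read-write mode.
             TtlDB db = TtlDB.open(options, "/tmp/ttl-demo", 10, false)) {

            db.put("key".getBytes(), "value".getBytes());

            Thread.sleep(11_000); // wait until the TTL has passed

            // Still returned: Get does not check staleness.
            System.out.println("after TTL, before compaction: " + show(db.get("key".getBytes())));

            // The TTL compaction filter drops expired entries while compacting.
            db.compactRange();
            System.out.println("after compaction: " + show(db.get("key".getBytes())));
        }
    }

    private static String show(byte[] value) {
        return value == null ? "null (removed)" : new String(value);
    }
}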

Kafka Connect S3 Connector OutOfMemory errors with TimeBasedPartitioner

I'm currently working with the Kafka Connect S3 Sink Connector 3.3.1 to copy Kafka messages over to S3 and I have OutOfMemory errors when processing late data.
I know it looks like a long question, but I tried my best to make it clear and simple to understand.
I highly appreciate your help.
High level info
The connector does a simple byte-to-byte copy of the Kafka messages and adds the length of the message at the beginning of the byte array, for decompression purposes (a sketch of this length-prefixing idea follows this list).
This is the role of the CustomByteArrayFormat class (see configs below)
The data is partitioned and bucketed according to the Record timestamp
The CustomTimeBasedPartitioner extends the io.confluent.connect.storage.partitioner.TimeBasedPartitioner and its sole purpose is to override the generatePartitionedPath method to put the topic at the end of the path.
The total heap size of the Kafka Connect process is 24 GB (only one node)
The connector processes between 8,000 and 10,000 messages per second
Each message has a size close to 1 KB
The Kafka topic has 32 partitions
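As a side note, the length-prefix framing described above (this is an assumption about what CustomByteArrayFormat does, not its actual code) boils down to something like:

import java.nio.ByteBuffer;

public class LengthPrefixFraming {

    // Prepend a 4-byte big-endian length header to the record value, so the reader
    // can split concatenated records back apart before decompressing them.
    static byte[] frame(byte[] value) {
        return ByteBuffer.allocate(Integer.BYTES + value.length)
                .putInt(value.length)
                .put(value)
                .array();
    }

    public static void main(String[] args) {
        byte[] framed = frame("hello".getBytes());
        System.out.println("framed length = " + framed.length); // 4 + 5 = 9
    }
}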
Context of OutOfMemory errors
Those errors only happen when the connector has been down for several hours and has to catch up on data
When turning the connector back on, it begins to catch up but fails very quickly with OutOfMemory errors
Possible but incomplete explanation
The timestamp.extractor configuration of the connector is set to Record when those OOM errors happen
Switching this configuration to Wallclock (i.e. the time of the Kafka Connect process) DOES NOT throw OOM errors and all of the late data can be processed, but the late data is no longer correctly bucketed
All of the late data will be bucketed in the YYYY/MM/dd/HH/mm/topic-name path of the time at which the connector was turned back on
So my guess is that while the connector is trying to correctly bucket the data according to the Record timestamp, it does too many parallel reads, leading to OOM errors
The "partition.duration.ms": "600000" parameter makes the connector bucket data into six 10-minute paths per hour (2018/06/20/12/[00|10|20|30|40|50] for 2018-06-20 at 12pm)
Thus, with 24h of late data, the connector would have to output data in 24h * 6 = 144 different S3 paths.
Each 10 minutes folder contains 10,000 messages/sec * 600 seconds = 6,000,000 messages for a size of 6 GB
If it does indeed read in parallel, that would make 864GB of data going into memory
I think that I have to correctly configure a given set of parameters in order to avoid those OOM errors but I don't feel like I see the big picture
The "flush.size": "100000" imply that if there is more dans 100,000 messages read, they should be committed to files (and thus free memory)
With messages of 1KB, this means committing every 100MB
But even if there is 144 parallel readings, that would still only give a total of 14.4 GB, which is less than the 24GB of heap size available
Is the "flush.size" the number of record to read per partition before committing? Or maybe per connector's task?
The way I understand "rotate.schedule.interval.ms": "600000" config is that data is going to be committed every 10 minutes even when the 100,000 messages of flush.size haven't been reached.
My main question would be what are the maths allowing me to plan for memory usage given:
the number of records per second
the size of the records
the number of Kafka partitions of the topics I read from
the number of Connector tasks (if this is relevant)
the number of buckets written to per hour (here 6 because of the "partition.duration.ms": "600000" config)
the maximum number of hours of late data to process
Configurations
S3 Sink Connector configurations
{
  "name": "xxxxxxx",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "s3.region": "us-east-1",
    "partition.duration.ms": "600000",
    "topics.dir": "xxxxx",
    "flush.size": "100000",
    "schema.compatibility": "NONE",
    "topics": "xxxxxx,xxxxxx",
    "tasks.max": "16",
    "s3.part.size": "52428800",
    "timezone": "UTC",
    "locale": "en",
    "format.class": "xxx.xxxx.xxx.CustomByteArrayFormat",
    "partitioner.class": "xxx.xxxx.xxx.CustomTimeBasedPartitioner",
    "schema.generator.class": "io.confluent.connect.storage.hive.schema.DefaultSchemaGenerator",
    "name": "xxxxxxxxx",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "s3.bucket.name": "xxxxxxx",
    "rotate.schedule.interval.ms": "600000",
    "path.format": "YYYY/MM/dd/HH/mm",
    "timestamp.extractor": "Record"
  }
}
Worker configurations
bootstrap.servers=XXXXXX
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
consumer.auto.offset.reset=earliest
consumer.max.partition.fetch.bytes=2097152
consumer.partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor
group.id=xxxxxxx
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
rest.advertised.host.name=XXXX
Edit:
I forgot to add an example of the errors I have:
[2018-06-21 14:54:48,644] ERROR Task XXXXXXXXXXXXX-15 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerSinkTask:482)
java.lang.OutOfMemoryError: Java heap space
[2018-06-21 14:54:48,645] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerSinkTask:483)
[2018-06-21 14:54:48,645] ERROR Task XXXXXXXXXXXXXXXX-15 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:148)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:484)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:265)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:182)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:150)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I was finally able to understand how the Heap Size usage works in the Kafka Connect S3 Connector
The S3 Connector will write the data of each Kafka partition into partitioned paths
The way those paths are partitioned depends on the partitioner.class parameter;
By default, it is by timestamp, and the value of partition.duration.ms will then determine the duration of each partitioned paths.
The S3 Connector will allocate a buffer of s3.part.size Bytes per Kafka partition (for all topics read) and per partitioned paths
Example with 20 partitions read, a timestamp.extractor set to Record, partition.duration.ms set to 1h, s3.part.size set to 50 MB
The Heap Size needed each hour is then equal to 20 * 50 MB = 1 GB;
But, with timestamp.extractor set to Record, messages with a timestamp corresponding to an earlier hour than the one at which they are read will be buffered in that earlier hour's buffer. Therefore, in reality, the connector will need a minimum of 20 * 50 MB * 2h = 2 GB of memory, because there are always late events, and more if there are events with a lateness greater than 1 hour;
Note that this isn't true if timestamp.extractor is set to Wallclock because there will virtually never be late events as far as Kafka Connect is concerned.
Those buffers are flushed (i.e. leave the memory) under 3 conditions
rotate.schedule.interval.ms time has passed
This flush condition is always triggered.
rotate.interval.ms time has passed in terms of timestamp.extractor time
This means that if timestamp.extractor is set to Record, 10 minutes of Record time can pass in less or more than 10 minutes of actual time
For instance, when processing late data, 10 minutes worth of data will be processed in a few seconds, and if rotate.interval.ms is set to 10 minutes then this condition will trigger every second (as it should);
On the contrary, if there is a pause in the flow of events, this condition will not trigger until it sees an event with a timestamp showing that more than rotate.interval.ms has passed since the condition last triggered.
flush.size messages have been read in less than min(rotate.schedule.interval.ms, rotate.interval.ms)
As with rotate.interval.ms, this condition might never trigger if there are not enough messages.
Thus, you need to plan for Kafka partitions * s3.part.size Heap Size at the very least
If you are using a Record timestamp for partitioning, you should multiply it by max lateness in milliseconds / partition.duration.ms
This is a worst-case scenario where you have constantly late events in all partitions, over the whole range of max lateness in milliseconds.
The S3 connector will also buffer consumer.max.partition.fetch.bytes bytes per partition when it reads from Kafka
This is set to 2.1 MB by default.
Finally, you should not assume that all of the Heap Size is available to buffer Kafka messages, because there are also a lot of other objects in it
A safe rule of thumb is to make sure that the buffering of Kafka messages does not go over 50% of the total available Heap Size.
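Putting that reasoning into a small back-of-the-envelope calculator (a sketch of the math in this answer, not connector code; the inputs are placeholders taken from the question's configuration):

public class S3SinkHeapEstimate {
    public static void main(String[] args) {
        // Inputs taken from the question's setup (placeholders, adjust to your own).
        int kafkaPartitions = 32;
        long s3PartSizeBytes = 52_428_800L;          // s3.part.size = 50 MB
        long partitionDurationMs = 600_000L;         // partition.duration.ms = 10 minutes
        long maxLatenessMs = 24L * 60 * 60 * 1000;   // 24 hours of late data to catch up
        long fetchBytesPerPartition = 2_097_152L;    // consumer.max.partition.fetch.bytes

        // Worst case with timestamp.extractor=Record: one open buffer per Kafka partition
        // and per time bucket spanned by the late data.
        long timeBuckets = maxLatenessMs / partitionDurationMs;              // 144
        long bufferBytes = kafkaPartitions * s3PartSizeBytes * timeBuckets;  // ~242 GB
        long consumerBytes = kafkaPartitions * fetchBytesPerPartition;       // ~67 MB

        System.out.printf("time buckets:       %d%n", timeBuckets);
        System.out.printf("worst-case buffers: %.1f GB%n", bufferBytes / 1e9);
        System.out.printf("consumer fetch:     %.1f MB%n", consumerBytes / 1e6);
    }
}

With the question's numbers the worst case is far above the 24 GB heap, which is consistent with the OOM errors; keeping the message buffers under roughly half the heap means either more heap, a smaller s3.part.size, or catching up on lateness in smaller chunks.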
@raphael has explained how this works perfectly.
Pasting a small variation of a similar problem (too few events to process, but spread across many hours/days) that I had faced.
In my case, I had about 150 connectors and 8 of them were failing with OOM, as they had to process about 7 days' worth of data (our Kafka in the test env was down for about 2 weeks).
Steps Followed:
Reduced s3.part.size from 25 MB to 5 MB for all connectors. (In our scenario, rotate.interval was set to 10 min with flush.size of 10,000; most of our events should easily fit within this limit.)
After this setting, only one connector was still getting OOM, and that connector went into OOM within 5 seconds of starting (based on heap analysis): heap utilization shot up from 200 MB to 1.5 GB. Looking at the Kafka offset lag, there were only 8K events to process across all 7 days. So this wasn't because of too many events to handle, but rather too few events to handle/flush.
Since we were using hourly partitioning and there were hardly 100 events per hour, all the buffers for these 7 days were getting created without being flushed (without being released to the JVM): 7 * 24 * 5 MB * 3 partitions = 2.5 GB (Xmx was 1.5 GB).
Fix:
Perform one of the steps below until your connector catches up, then restore your old config. (Recommended approach: 1)
Update the connector config to flush every 100 or 1,000 records via flush.size (depending on how your data is structured). Drawback: too many small files get created in an hour if the actual number of events exceeds that value.
Change the partitioning to daily so there will only be daily partitions. Drawback: you'll now have a mix of hourly and daily partitions in your S3 bucket.

Aerospike percentage of available write blocks is low even though hard disk space is available

We are facing this problem. The config is as follows:
Aerospike version : 3.14
Underlying hard disk : non-SSD
Variable Name       Value
memory-size         5 GB
free-pct-memory     98 %
available_pct       4 %
max-void-time       0 millisec
stop-writes         0
stop-writes-pct     90 %
hwm-breached        true
default-ttl         604,800 sec
max-ttl             315,360,000 sec
enable-xdr          false
single-bin          false
data-in-memory      false
Can anybody please help us out with this ? What could be a potential reason for this ?
Aerospike only writes to free blocks. A block may contain any number of records that fit. If your write/update pattern is such that a block never falls below 50% active records (the default threshold for defragmenting: defrag-lwm-pct), then you have a bunch of "empty" space that can't be utilized. Read more about defrag in the managing storage page.
Recovering from this is much easier with a cluster that's not seeing any writes. You can increase defrag-lwm-pct so that more blocks become eligible and get defragmented.
Another cause could be just that the HDD isn't fast enough to keep up with defragmentation.
You can read more on possible resolutions in the Aerospike KB article Recovering from Available Percent Zero. You shouldn't need to go past the "Stop service on a node..." step.
You are basically not defragging your persistence storage device (75 GB per node). From the snapshot you have posted, you have about a million records on 3 nodes with 21 million expired. So it looks like you are writing records with a very short TTL and the defrag is unable to keep up.
Can you post a few lines of output, taken while you are in this state, of:
$ grep defrag /var/log/aerospike/aerospike.log
and
$ grep thr_nsup /var/log/aerospike/aerospike.log ?
What is your write/update load? My suspicion is that you are only creating short-TTL records and reading them, not updating.
Depending on what you are doing, increasing defrag-lwm-pct may actually make things worse for you. I would also tweak nsup-delete-sleep from its 100 microsecond default, but that will depend on what your log greps above show. So post those and let's see.
(Edit: Also, the fact that you are not seeing evictions even though you are above the 50% HWM on persistent storage means your nsup thread is taking a very long time to run. That again points to the nsup-delete-sleep value needing tuning for your setup.)

SQL MIN_ACTIVE_ROWVERSION() value does not change for a long while

We're troubleshooting a sort of sync framework between two SQL Server databases on separate servers (both SQL Server 2008 Enterprise 64-bit SP2 - 10.0.4000.0), through linked server connections, and we have reached a point where we're stuck.
The logic to identify the records "pending to be synced" is of course based on ROWVERSION values, including the use of MIN_ACTIVE_ROWVERSION() to avoid dirty reads.
All SELECT operations are encapsulated in SPs on each "source" side. This is a schematic sample of one SP:
PROCEDURE LoaderRetrieve(@LastStamp bigint, @Rows int)
BEGIN
    ...
    (vars handling)
    ...
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT

    SELECT TOP (@Rows) Field1, Field2, Field3
    FROM Table
    WHERE [RowVersion] > @LastStampAsRowVersionDataType
      AND [RowVersion] < @MinActiveVersion
    ORDER BY [RowVersion]
END
The approach works just fine; we usually sync records at the expected rate of 600k/hour (a job every 30 seconds, batch size = 5k), but at some point the sync process does not find a single record to be transferred, even though there are several thousand records with a ROWVERSION value greater than the @LastStamp parameter.
When checking the reason, we found that MIN_ACTIVE_ROWVERSION() has a value less than (or slightly greater than, just 5 or 10 increments above) the @LastStamp being searched. This of course shouldn't be a problem, since the MIN_ACTIVE_ROWVERSION() approach was introduced to avoid dirty reads and subsequent issues, BUT:
The problem we see on some occasions, when the above scenario occurs, is that the value of MIN_ACTIVE_ROWVERSION() does not change for a long (really long) period of time, like 30/40 minutes, sometimes more than an hour. And this value is by far less than the @@DBTS value.
We first thought this was related to a pending DB transaction not yet committed. As per MSDN definition about the MIN_ACTIVE_ROWVERSION() (link):
Returns the lowest active rowversion value in the current database. A rowversion value is active if it is used in a transaction that has not yet been committed.
But when checking sessions (sys.sysprocesses) with open_tran > 0 for the duration of this issue, we couldn't find any session with a waittime greater than a few seconds, only one or two occurrences of sessions with +/- 5 minutes of waittime.
So at this point we're struggling to understand the situation: The MIN_ACTIVE_ROWVERSION() does not change during a huge period of time, and no uncommitted transactions with long waits are found within this time frame.
I'm not a DBA, and it could be that we're missing something in the picture when analyzing this problem; researching forums and blogs, we couldn't find any other clue. So far open_tran > 0 was the valid reason, but under the circumstances I've described, it's clear that there's something else, and we don't know what.
Any feedback is appreciated.
Well, I finally found the solution after digging a bit more.
The problem is that we were looking for sessions with a long waittime, but the real deal was to find sessions that had had an active batch for a while.
If there's a session where open_tran = 1, then to find out exactly since when this transaction has been open (and of course still active, not yet committed), the last_batch field of sys.sysprocesses should be checked.
Using this query:
select
    batchDurationMin  = DATEDIFF(second, last_batch, getutcdate()) / 60.0,
    batchDurationSecs = DATEDIFF(second, last_batch, getutcdate()),
    hostname, open_tran, *
from sys.sysprocesses a
where spid > 50
  and a.open_tran > 0
order by last_batch asc
we could identify a session with an open transaction that had been active for 30+ minutes. With the hostname values, some more checks within the web services, and dbcc inputbuffer, we found the responsible process.
So the answer to the original question is actually "there is indeed an active session with an uncommitted transaction", which is why MIN_ACTIVE_ROWVERSION() does not change. We were just looking for processes with the wrong criteria.
Now that we know which process behaves like this, the next step will be to improve it.
Hope this proves useful to someone else.