How to interpret "evicted_keys" from Redis Info - redis

We are using ElastiCache for Redis, and are confused by its Evictions metric.
I'm curious what the unit is on the evicted_keys metric from Redis Info? The ElastiCache docs say it is a count: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.Redis.html but for our application we have observed the "Evictions" metric (which is derived from evicted_keys) fluctuates up and down, indicating it's not a count. I would expect a count to never decrease, since we cannot "un-evict" a key. I'm wondering if evicted_keys is actually a rate (eg, evictions/sec), which would explain why it can fluctuate.
Thanks you in advance for any responses!

From INFO command:
evicted_keys: Number of evicted keys due to maxmemory limit
To learn more about evictions see Using Redis as an LRU cache - Eviction policies
This counter is zero when the server starts, and it is only reset if you issue the CONFIG RESETSTAT command. However, on ElastiCache, this command is not available.
That said, ElastiCache derives the metric from this value, by calculating the difference between data-points.
Redis evicted_keys 0 5 12 18 22 ....
CloudWatch Evictions 0 5 7 6 4 ....
This is the usual pattern in CloudWatch metrics. This allows you to use SUM if you want the cumulative value, but also to detect rate changes or spikes easily.
Think for example you want to alarm if evictions are more than 10,000 over one minute period. If ElastiCache stores the cumulative value from Redis straight as a metric, this would be hard to accomplish.
Also, by committing the metric only as evicted keys for the period, you are protected of the data distortion of a server-reset or a value overflow. While the Redis INFO value would go back to zero, on ElastiCache you still get the value for the period and you can still do running sum over any period.

Related

AWS Cloudwatch ELB monitoring active connections

I would like to monitor the maximum number of active connections that my ApplicationELB is managing over a 5-minute period.
The ApplicationELB publishes a metric called ActiveConnectionCount. The documentation describes this in part as:
The total number of concurrent TCP connections active from clients to the load balancer and from the load balancer to targets.
And further states:
The most useful statistic is Sum.
I believe that Sum would total all the active connections reported within the time frame. E.g. Let's say the ELB is maintaining 10 connections and it reports this number every second, then the Sum would be 3000 over a 5-minute period. This is not what I want. Furthermore, when I use SUM over a 5-minute period I'm getting 20k or so -- far more than the number of real concurrent connections which are at most a few hundred.
If I aggregate using Maximum then the number reported by AWS is zero (!?).
If I aggregate using Average then the number appears to be reasonable (ranging from 80 - 200), but also wildly inaccurate. That is, it is almost inversely correlates with new connections and response time. That is, during time so of the day when response time is low and new connections is low, average active connections is higher.
So, I guess, here are my questions:
(1) How can I achieve seeing maximum number of concurrent connections between ELB and clients/app server? (Ideally, I could separate these two, but it doesn't look like the ELB does that).
Less important, but I'm curious:
(2) Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
(3) Why does the documentation say that SUM should be used?
Thanks for any help / insight!
How can I achieve seeing maximum number of concurrent connections
between ELB and clients/app server? (Ideally, I could separate these
two, but it doesn't look like the ELB does that).
Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
As you said, the ELB does not do that. From the metrics you can also see something called "SampleCount" which is the number of samples taken during a period of time, by default 1 minute. If we could somehow access the counts in these samples, we could get a min and max sample. For whatever reason, it's currently broken or not implemented and min/max show as 0. Therefore, the most useful metric, in my opinion at least, is the average which takes the sum (of counts) and divides it by the SampleCount.
Why does the documentation say that SUM should be used?
Good question because if you think about it it doesn't make much sense and doesn't give you much information since it's just a sum of the count in all samples.

Implementing bursts or spikes detection in a counter using Redis

I wanna use Redis to keep track of certain numbers. Basically, they're counters. Is there a way to use Redis to sort of track the rate at which these counters increase?
For example, let's say a counter is being incremented at a rate of 10 per minute for the most of the time but suddenly it's being incremented at a rate of 40 per minute. How can I detect that?
You cannot do that directly, but you can do that with a sorted set for example, with a bit of client side, or Lua based processing.
Let's say that you use a sorted set, for each time window you increment the value:
ZINCRBY mykey timestamp 1
Then you have a simple counter per timestamp.
When you want to analyze it, you can take a range by time with ZRANGE or ZREVRANGE, getting the scores by using WITHSCORES, and do some processing on the differences for detecting anomalies. There are many ways to do it, here's a link with a few pointers: https://stats.stackexchange.com/questions/152644/what-algorithm-should-i-use-to-detect-anomalies-on-time-series

Redis instantaneous_ops_per_sec higher than actual throughput

We are using Redis as a Queue which has on an average about ~3k rps. But when we check the instantaneous_ops_per_sec, this value consistently reports higher than expected, by about 20%, in this case, reports ~4k ops per sec.
To verify this, I have taken a dump of MONITOR for about 10 seconds and checked the number of incoming commands.
grep "1489722862." monitor_output | wc -l
Where 1489722862 is the timestamp. Even this count matches with what is being produced in the queue and what is being consumed from the queue.
This is a master-slave redis cluster setup.
Does instantaneous_ops_per_sec also account for the slave reads? If not, what is the other reason for which this count is significantly higher?
The instantaneous_ops_per_sec metric is calculated as the mean of the recent samples that the server took. The number of recent samples is hardcoded as 16 by STATS_METRIC_SAMPLES in server.h.

How to calculate redis memory used percentage on ElastiCache

I want to monitor my redis cache cluster on ElastiCache. From AWS/Elasticache i am able to get metrics like FreeableMemory and BytesUsedForCache. If i am not wrong BytesUsedForCache is the memory used by cluster(assuming there is only one node in cluster). I want to calculate percentage uses of memory. Can any one help me to get percentage of Memory uses in Redis.
We had the same issue since we wanted to monitor the percentage of ElastiCache Redis memory that is consumed by our data.
As you wrote correctly, you need to look at BytesUsedForCache - that is the amount of memory (in bytes) consumed by the data you've stored in Redis.
The other two important numbers are
The available RAM of the AWS instance type you use for your ElastiCache node, see https://aws.amazon.com/elasticache/pricing/
Your value for parameter reserved-memory-percent (check your ElastiCache parameter group). That's the percentage of RAM that is reserved for "nondata purposes", i.e. for the OS and whatever AWS needs to run there to manage your ElastiCache node. By default this is 25 %. See https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/redis-memory-management.html#redis-memory-management-parameters
So the total available memory for your data in ElastiCache is
(100 - reserved-memory-percent) * instance-RAM-size
(In our case, we use instance type cache.r5.2xlarge with 52,82 GB RAM, and we have the default setting of reserved-memory-percent = 25%.
Checking with the info command in Redis I see that maxmemory_human = 39.61 GB, which is equal to 75 % of 52,82 GB.)
So the ratio of used memory to available memory is
BytesUsedForCache / ((100 - reserved-memory-percent) * instance-RAM-size)
By comparing the freeableMemory and bytesUsedForCache metrics, you will have the available memory for the Elasticache non-cluster mode (not sure if it applies to cluster-mode too).
Here is the NRQL we're using to monitor the cache:
SELECT Max(`provider.bytesUsedForCache.Sum`) / (Max(`provider.bytesUsedForCache.Sum`) + Min(`provider.freeableMemory.Sum`)) * 100 FROM DatastoreSample WHERE provider = 'ElastiCacheRedisNode'
This is based on the following:
FreeableMemory: The amount of free memory available on the host. This is derived from the RAM, buffers and cache that the OS reports as freeable.AWS CacheMetrics HostLevel
BytesUsedForCache: The total number of bytes allocated by Redis for all purposes, including the dataset, buffers, etc. This is derived from used_memory statistic at Redis INFO.AWS CacheMetrics Redis
So BytesUsedForCache (amount of memory used by Redis) + FreeableMemory (amount of data that Redis can have access to) = total memory that Redis can use.
With the release of the 18 additional CloudWatch metrics, you can now use DatabaseMemoryUsagePercentage and see the percentage of memory utilization in redis.
View more about the metric in the memory section here
You would have to calculate this based on the size of the node you have selected. See these 2 posts for more information.
Pricing doc gives you the size of your setup.
https://aws.amazon.com/elasticache/pricing/
https://forums.aws.amazon.com/thread.jspa?threadID=141154

ElastiCache cloudwatch metrics for Redis: currItems for a single database

I have set up a metric for the aws interface on the ElastiCache redis cluster. I'm looking at a value of currItems superior to a certain number for a given period (say 1000 for 1 minute)
The issue I have is that I have two databases in Redis, name 0 and 1. I would like to only get the currItems for database 0, not database 1, since database 1 is holding values for a longer period of time and make the whole metric look much bigger than it should be (since I want the current items of database 0)
Is there a way to create a metric that would only get the currItems of the database 0?
You will have to create an application for this or use existing redis tools.
https://stackoverflow.com/questions/8614737/what-are-some-recommended-tools-to-monitor-a-redis-database
If you are using new relic