Replica node with connection issue to primary node in Redis

We are using ElastiCache for Redis with cluster mode enabled. Recently we have been seeing random replica-to-primary connection issues that cause client connection timeouts on MGET or XREAD. The problem usually lasts about 20 minutes and recovers on its own, but it still impacts our customers' experience. My questions are:
If Redis fails to read from one replica, will it re-route the same request to another replica in the same shard?
What is the recommended way to mitigate the impact?
Thanks in advance!
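One client-side mitigation worth considering (not from this thread; a minimal sketch using Python's redis-py, where the endpoint, key name, and timeout values are all placeholders) is to enable replica reads with short timeouts and retries, so a single broken replica connection gets retried quickly instead of surfacing to callers as a long timeout:

    from redis.cluster import RedisCluster

    # Placeholder configuration endpoint; replace with your cluster's address.
    rc = RedisCluster(
        host="my-cache.abc123.clustercfg.use1.cache.amazonaws.com",
        port=6379,
        read_from_replicas=True,         # allow read commands to hit replicas
        cluster_error_retry_attempts=3,  # retry on cluster/connection errors
        socket_timeout=0.5,              # fail fast instead of hanging
    )

    # On a connection error the client retries, and the retried read can
    # land on another node serving the same slot instead of the broken replica.
    print(rc.get("some-key"))

Whether a failed replica read is transparently re-routed depends on the client library, not on Redis itself, so check how your client balances and retries reads before relying on this.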

Related

How to make AWS Elasticache Redis split read requests across all read replicas?

I have a Redis ElastiCache non-clustered instance with one primary node and two read replicas. Using the StackExchange.Redis client, I provide the reader endpoint and make a GET request. Based on the documentation I would expect:
A reader endpoint will split incoming connections to the endpoint
between all read replicas in a Redis cluster.
However, 100% of the requests go to one of the read replicas. First question: why? Second question: how do I get Redis to distribute the load across all of the read replicas without having to manage the read instances at an application level?
You should use the "Reader Endpoint" connection string (the connection string ending in "-ro").
This splits connections between your replicas, but only when you open more than one connection to the cache; a single connection stays pinned to one replica. In practice you may also need significant CPU usage on the first replica before traffic shifts to the second.
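To illustrate the connection-level behaviour (a Python/redis-py sketch with a made-up "-ro" hostname; the asker's StackExchange.Redis multiplexes all requests over a single connection, which is exactly why everything lands on one replica):

    import redis

    # Hypothetical reader endpoint; replace with your cache's "-ro" address.
    READER_ENDPOINT = "my-cache-ro.abc123.use1.cache.amazonaws.com"

    # The reader endpoint balances *connections*, not requests: each new
    # connection may resolve to a different replica, so several independent
    # connections are needed before both replicas see traffic.
    readers = [redis.Redis(host=READER_ENDPOINT, port=6379) for _ in range(4)]

    for i, r in enumerate(readers):
        print(i, r.get("some-key"))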

ElastiCache not utilizing read only replica

I have a simple Redis ElastiCache cluster (cluster mode disabled) with a master node and a read only replica.
When throwing traffic at the server, e.g. with redis-benchmark, it seems all GET traffic goes only to the master node, while the RO replica gets zero GET traffic (the cache hit/miss and GetTypeCommands metrics are all 0).
Does anyone have insight into why this is happening? I expected the traffic to be distributed between the two nodes.
Older question, but I am answering since I am just learning this myself...
I thought the purpose was to balance the load between master and slave, but that is not the case. The slave exists so that it can be promoted to master if the master fails for any reason.
further reading: https://redis.io/topics/replication
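That said, a replica will serve reads if you point a client at it explicitly; nothing routes reads there for you. A minimal redis-py sketch, with hypothetical endpoint names for a cluster-mode-disabled setup:

    import redis

    # Hypothetical endpoints; replace with your primary and reader addresses.
    PRIMARY = "my-cache.abc123.use1.cache.amazonaws.com"
    REPLICA = "my-cache-ro.abc123.use1.cache.amazonaws.com"

    writer = redis.Redis(host=PRIMARY, port=6379)
    reader = redis.Redis(host=REPLICA, port=6379)

    writer.set("user:42", "alice")  # writes must go to the primary
    print(reader.get("user:42"))    # reads can be sent to the replica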

Does Redis delete all the keys when a master and its slave fail in a Redis cluster

I have a question. Suppose I am using a Redis cluster with 3 shards (each with a master and a slave). I came to know that if a master and its slave fail at the same time, the Redis cluster is not able to continue to operate. What happens after that?
Would the Redis cluster delete all the keys from the other 2 nodes as well (when it comes back)?
Do we need to manually restart the cluster, and can we somehow retain the key values on the other nodes?
How will it behave if I use Azure Redis Cache?
Thanks in advance.
1. Would Redis cluster delete all the other keys from other 2 nodes as well? (When it comes back)
First of all, only operations are blocked, not the cluster activity, and nothing is done with the data, as the documentation says:
Redis Cluster failure detection is used to recognize when a master or slave node is no longer reachable by the majority of nodes and then respond by promoting a slave to the role of master. When slave promotion is not possible the cluster is put in an error state to stop receiving queries from clients.
Next, regarding whether the data gets deleted or not (from the Replication documentation):
In setups where Redis replication is used, it is strongly advised to have persistence turned on in the master
This means you lose data only if persistence was turned off and the master/slave pair went down: when the pair comes back up, you will not be able to recover the data. So keep Redis persistence turned on.
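On a self-managed Redis you can turn persistence on at runtime; a minimal redis-py sketch (note that CONFIG is generally blocked on managed services such as ElastiCache and Azure Cache for Redis, where persistence is controlled through parameter groups or tier settings instead):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Enable AOF persistence at runtime (equivalent to "appendonly yes"
    # in redis.conf)...
    r.config_set("appendonly", "yes")

    # ...and write it back to the config file so it survives a restart
    # (this fails if the server was started without a config file).
    r.config_rewrite()

    print(r.config_get("appendonly"))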
2. Do we need to manually restart this cluster and can we somehow retain the other keys values (on other nodes)?
I think the answer above covers this.
3. How will it behave if I use Azure Redis Cache?
From Azure Redis Cache FAQ
High Availability/SLA: Azure Redis Cache guarantees that a Standard/Premium cache will be available at least 99.9% of the time. To learn more about our SLA, see Azure Redis Cache Pricing. The SLA only covers connectivity to the Cache endpoints. The SLA does not cover protection from data loss. We recommend using the Redis data persistence feature in the Premium tier to increase resiliency against data loss.
So it's kinda their headache
OR
Redis Cluster: If you want to create caches larger than 53 GB or want to shard data across multiple Redis nodes, you can use Redis clustering which is available in the Premium tier. Each node consists of a primary/replica cache pair for high availability. For more information, see How to configure clustering for a Premium Azure Redis Cache.

Best Redis setup for session caching

I see there are multiple modes of operation for Redis (cluster, sentinel, master-slave, etc.). I don't fully understand the implications of each, but my question is this:
If I have a web application that requires distributed session persistence, which configuration of Redis makes the most sense? The main reason I'm using Redis is to achieve some level of fault tolerance. If one of my frontend servers fails, I want the sessions to be available for other nodes to pick up the workload. If a Redis node goes down, I don't want this to affect the user experience, and I don't want to have to wake up a developer at midnight to correct the matter.
From everything I've read, Redis Sentinel is the way to go for fault tolerance.
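As a concrete illustration, here is a minimal session-store connection through Sentinel using Python's redis-py (the sentinel hostnames and the master name "mymaster" are placeholders):

    from redis.sentinel import Sentinel

    # Placeholder sentinel addresses; you want at least three sentinels
    # so they can form a quorum.
    sentinel = Sentinel(
        [("sentinel1", 26379), ("sentinel2", 26379), ("sentinel3", 26379)],
        socket_timeout=0.5,
    )

    # The client asks the sentinels for the current master, so after a
    # failover it transparently reconnects to the newly promoted master.
    master = sentinel.master_for("mymaster", socket_timeout=0.5)
    master.setex("session:abc123", 3600, "serialized-session-data")

Since sessions expire anyway (SETEX above), losing a node costs at most the few writes that had not yet replicated, which is usually acceptable for session data.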

How to do a redis FLUSHALL without initiating a sentinel failover?

We have a Redis configuration with two Redis servers. We also have 3 sentinels to monitor the two instances and initiate a failover when needed.
We currently have a process where we periodically have to do a FLUSHALL on the Redis server. This is a blocking operation that takes longer than the time we have allotted for the sentinels to time out. In other words, we have our sentinel configuration with:
sentinel down-after-milliseconds OurMasterName 5000
and doing a redis-cli FLUSHALL on the server takes > 5000 milliseconds, so the sentinels initiate a failover.
We acknowledge that doing a FLUSHALL isn't great, and we also know that we could increase down-after-milliseconds, but for the purposes of this question assume that neither of these is an option.
The question is: how can we do a FLUSHALL (or an equivalent operation) WITHOUT our sentinels initiating a failover because the FLUSHALL blocks for more than 5000 milliseconds? Has anyone encountered and solved this problem?
You could just create new instances: if you are using something like AWS or Azure, then you have an API for creating a new Redis cluster. Start it, load it with data, and once it's ready just modify the DNS, again with an API call, so all of this can be handled by some part of your application. On premises, though, things can get more complex because it will require some automation with Ansible/Chef/Puppet.
The next best option you currently have is to delete keys in batches to reduce the amount of work done at once. You can build a key list, assuming you don't have one, using SCAN, then delete in whatever batch size works for you.
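A minimal sketch of that batched approach with Python's redis-py (the batch size is an arbitrary choice, and UNLINK assumes Redis 4.0+; fall back to DELETE on older servers):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Walk the keyspace incrementally with SCAN and delete in small
    # batches, so no single command blocks the server long enough to
    # trip Sentinel's down-after-milliseconds check.
    BATCH = 500
    batch = []
    for key in r.scan_iter(count=BATCH):
        batch.append(key)
        if len(batch) >= BATCH:
            r.unlink(*batch)  # UNLINK reclaims memory in a background thread
            batch.clear()
    if batch:
        r.unlink(*batch)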
Edit: since you are not interested in keeping the data, disable persistence, delete the RDB file, then just restart the instance. This way you don't have to update Sentinel as you would if you provisioned new hosts.
Out of curiosity, if you're just going to be flushing all the time and don't care about the data since you'll be wiping it anyway, why bother with Sentinel?