We are using ElastiCache for redis with cluster-mode on. Recently we are seeing random replica-to-primary connection issue which leads our client connection timeout at mget or xread. Usually this lasts for about 20 minutes and will recover itself but it still brought us customer experience impact. My question here is:
if redis fail to read from one replica, would it re-route the same request to another replica in the same shard?
what's the recommended way to mitigate the impact?
thanks in advance!
I start a Client to connect remote Server.
It involves a lot of computation.
Then the Client accidental disconnected.
However the Client's computation is still running on remote Server.
has a way to close it?
It will eventually happen when socket timeout is reached, I guess.
I was recently doing some investigation into some issues I'm facing on my Redis clusters, and saw that I have many connections that are sticking around, despite being idle indefinitely after some period of time.
After some investigation, I found that I have these two settings on my cluster:
timeout 300
tcp-keepalive 0
The stale connections that aren't going away are PUB/SUB client connections (StackExchange.Redis clients, in fact, but that's beside the point), so they do not respect the timeout configuration. As such, the tcp-keepalive seems to be the only other configuration that can ensure these connections get cleaned up over time.
So I applied this setting to all the nodes:
redis-trib.rb call 127.0.0.1:6001 config set tcp-keepalive 300
At this point I went home, and I came back the next morning, assuming the stale connections would have been disposed of properly. Sadly, I was greeted by all the same connections.
My question is: Is there any way from the Redis server side to dispose of these connections gracefully after they've been established? Is it expected that applying the tcp-keepalive configuration after the connections are established and old that they will not be disposed of?
The only solution I've found besides restarting the Redis servers are to do a bit of scripting and use the CLIENT KILL command, which is doable, but I was hoping for something configuration based to handle this.
Thanks in advance for any insight here!
I am using ServiceStack 5.0.2 with Redis Sentinel (3 + 3) and having issues in case of a failover: commands being issued during or after a failover fail with timeout.
I have come up with an idea to implement retry pattern via custom IRedisClient. But probably there is a better strategy to employ in this case.
Answer given in the post How does ServiceStack PooledRedisClientManager failover work? does not seem to be the right way to go.
Thank you,
Redis Clients wrap a TCP connection with a Redis Server, a Redis Client that was connected with the instance that failed over will fail, but any new Redis Clients retrieved from the pool after failover will be connected to the new failed over instance.
I have RabbitMQ Server 3.6.0 installed on Windows (I know it's time to upgrade, I've already done that on the other server node).
Heartbeats are enabled on both server and client side (heartbeat interval 60s).
I have had a resource alarm (RAM limit), and after that I have observed the raise of amount of TCP connections to RMQ Server.
At the moment there're 18000 connections while normal amount is 6000.
Via management plugin I can see there is a lot of connections with 0 channels, while our "normal" connection have at least 1 channel.
And even RMQ Server restart won't help: all connections would re-establish.
1. Does that mean all of them are really alive?
Similar issue was described here https://github.com/rabbitmq/rabbitmq-server/issues/384, but as I can see it was fixed exactly in v3.6.0.
2. Do I understand right that before RMQ Server v3.6.0 the behavior after resource alarm was like that: several TCP connections could hang on server side per 1 real client autorecovery connection?
Maybe important: we have haProxy between the server and the clients.
3. Could haProxy be an explanation for this extra connections? Maybe it prevents client from receiving a signal the connection was closed due to resource alarm?
Are all of them alive?
Only you can answer this, but I would ask - how is it that you are ending up with many thousands of connections? Really, you should only create one connection per logical process. So if you really have 6,000 logical processes connecting to the server, that might be a reason for that many connections, but in my opinion, you're well beyond reasonable design limits even in that case.
To check, see how many connections decrease when you kill one of your logical processes.
Do I understand right that before RMQ Server v3.6.0 the behavior after resource alarm was like that: several TCP connections could hang on server side per 1 real client autorecovery connection?
As far as I can tell, yes. It looks like the developer in this case ran across a common problem in sockets, and that is the detection of dropped connections. If I had a dollar for every time someone misunderstood how TCP works, I'd have more money than Bezos. So, what they found is that someone made some bad assumptions, when actually read or write is required to detect a dead socket, and the developer wrote code to (attempt) to handle it properly. It is important to note that this does not look like a very comprehensive fix, so if the conceptual design problem had been introduced to another part of the code, then this bug might still be around in some form. Searching for bug reports might give you a more detailed answer, or asking someone on that support list.
Could haProxy be an explanation for this extra connections?
That depends. In theory, haProxy as is just a pass-through. For the connection to be recognized by the broker, it's got to go through a handshake, which is a deliberate process and cannot happen inadvertently. Closing a connection also requires a handshake, which is where haProxy might be the culprit. If haProxy thinks the connection is dead and drops it without that process, then it could be a contributing cause. But it is not in and of itself making these new connections.
The RabbitMQ team monitors this mailing list and only sometimes answers questions on StackOverflow.
I recommended that this user upgrade from Erlang 18, which has known TCP connection issues -
https://groups.google.com/d/msg/rabbitmq-users/R3700QdIVJs/taDYKI6bAgAJ
I've managed to reproduce the problem: in the end it was a bug in the way our client used RMQ connections.
It created 1 auto-recovery connection (that's all fine with that) and sometimes it created a separate simple connection for "temporary" purposes.
Step to reproduce my problem were:
Reach memory alarm in RabbitMQ (e.g. set up an easily reached RAM
limit and push a lot of big messages). Connections would be in state
"blocking".
Start sending message from our client with this new "temp" connection.
Ensure the connection is in state "blocked".
Without eliminating resource alarm, restart RabbitMQ node.
The "temp" connection itself was here! Despite the fact auto-recovery
was not enabled for it. And it continued sending heartbeats so the
server didn't close it.
We will fix the client to use one and the only connection always.
Plus we of course will upgrade Erlang.