How to dispose of idle PUBSUB Redis connections - redis

I was recently investigating some issues on my Redis clusters and saw that many connections stick around indefinitely, even though they have been idle for a long time.
After some investigation, I found that I have these two settings on my cluster:
timeout 300
tcp-keepalive 0
The stale connections that aren't going away are PUB/SUB client connections (StackExchange.Redis clients, in fact, but that's beside the point), so they do not respect the timeout configuration. As such, the tcp-keepalive seems to be the only other configuration that can ensure these connections get cleaned up over time.
So I applied this setting to all the nodes:
redis-trib.rb call 127.0.0.1:6001 config set tcp-keepalive 300
At this point I went home, and I came back the next morning, assuming the stale connections would have been disposed of properly. Sadly, I was greeted by all the same connections.
My question is: is there any way, from the Redis server side, to dispose of these connections gracefully after they've been established? Is it expected that a tcp-keepalive setting applied after the connections were established will not cause those older connections to be disposed of?
The only solution I've found, besides restarting the Redis servers, is to do a bit of scripting and use the CLIENT KILL command, which is doable, but I was hoping for something configuration-based to handle this.
Thanks in advance for any insight here!
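The scripting approach mentioned in the question could look roughly like the following Python sketch. It only parses CLIENT LIST output and builds CLIENT KILL ID commands; it does not talk to a server. The function names, the sample data and the one-hour threshold are assumptions for illustration, and the sketch assumes a Redis version whose CLIENT LIST output includes the id, idle, sub and psub fields.

```python
def parse_client_list(text):
    """Parse CLIENT LIST output into one dict of fields per client line."""
    clients = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is a space-separated list of key=value pairs.
        fields = dict(pair.split("=", 1) for pair in line.split(" ") if "=" in pair)
        clients.append(fields)
    return clients


def stale_subscriber_ids(text, min_idle_seconds):
    """Return ids of clients with subscriptions that have been idle too long."""
    ids = []
    for client in parse_client_list(text):
        subscribed = int(client.get("sub", 0)) > 0 or int(client.get("psub", 0)) > 0
        if subscribed and int(client.get("idle", 0)) >= min_idle_seconds:
            ids.append(client["id"])
    return ids


def kill_commands(ids):
    """Build one CLIENT KILL ID command per client id."""
    return [f"CLIENT KILL ID {client_id}" for client_id in ids]


# Hypothetical sample output, trimmed to the fields the sketch uses.
SAMPLE = (
    "id=7 addr=10.0.0.5:50112 fd=9 name= age=90000 idle=86000 flags=P db=0 sub=2 psub=0 cmd=subscribe\n"
    "id=8 addr=10.0.0.6:50113 fd=10 name= age=120 idle=3 flags=P db=0 sub=1 psub=0 cmd=ping\n"
    "id=9 addr=10.0.0.7:50114 fd=11 name=app age=90000 idle=86000 flags=N db=0 sub=0 psub=0 cmd=get\n"
)

if __name__ == "__main__":
    for command in kill_commands(stale_subscriber_ids(SAMPLE, 3600)):
        print(command)
```

Client ids are local to each node, so a script like this would have to run CLIENT LIST and CLIENT KILL against every node separately. A subscriber that still sends periodic pings will show a low idle value and would not be selected.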

Related

Extra TCP connections on the RabbitMQ server after resource alarm

I have RabbitMQ Server 3.6.0 installed on Windows (I know it's time to upgrade, I've already done that on the other server node).
Heartbeats are enabled on both server and client side (heartbeat interval 60s).
I have had a resource alarm (RAM limit), and after that I observed a rise in the number of TCP connections to the RMQ Server.
At the moment there are 18000 connections, while the normal number is 6000.
Via the management plugin I can see that there are a lot of connections with 0 channels, while our "normal" connections have at least 1 channel.
Even restarting the RMQ Server doesn't help: all the connections are re-established.
   1. Does that mean all of them are really alive?
A similar issue was described here: https://github.com/rabbitmq/rabbitmq-server/issues/384, but as far as I can see it was fixed in exactly v3.6.0.
   2. Do I understand right that before RMQ Server v3.6.0 the behavior after resource alarm was like that: several TCP connections could hang on server side per 1 real client autorecovery connection?
Maybe important: we have haProxy between the server and the clients. 
   3. Could haProxy be an explanation for these extra connections? Maybe it prevents the client from receiving a signal that the connection was closed due to the resource alarm?
Are all of them alive?
Only you can answer this, but I would ask - how is it that you are ending up with many thousands of connections? Really, you should only create one connection per logical process. So if you really have 6,000 logical processes connecting to the server, that might be a reason for that many connections, but in my opinion, you're well beyond reasonable design limits even in that case.
To check, see by how much the connection count drops when you kill one of your logical processes.
Do I understand right that before RMQ Server v3.6.0 the behavior after resource alarm was like that: several TCP connections could hang on server side per 1 real client autorecovery connection?
As far as I can tell, yes. It looks like the developer in this case ran across a common problem with sockets: the detection of dropped connections. If I had a dollar for every time someone misunderstood how TCP works, I'd have more money than Bezos. What they found is that someone had made some bad assumptions, when in fact a read or write is required to detect a dead socket, and the developer wrote code to attempt to handle it properly. It is important to note that this does not look like a very comprehensive fix, so if the same conceptual design problem had been introduced in another part of the code, this bug might still be around in some form. Searching the bug reports, or asking someone on that support list, might give you a more detailed answer.
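The point that a read or write is needed to notice a closed peer can be seen in a small Python sketch. This is not RabbitMQ code: the helper name peer_closed and the use of socket.socketpair() are assumptions chosen for illustration, and the sketch only covers a peer that closed cleanly. A peer that vanished without closing would still need a write, a heartbeat or TCP keepalive to be detected.

```python
import socket


def peer_closed(sock, timeout=0.5):
    """Return True if a read attempt shows that the peer closed the connection."""
    sock.settimeout(timeout)
    try:
        # MSG_PEEK looks at pending data without consuming it.
        data = sock.recv(1, socket.MSG_PEEK)
    except socket.timeout:
        return False  # nothing to read; the peer is not known to be closed
    except OSError:
        return True   # the read failed outright, e.g. connection reset
    # An empty read means the peer performed an orderly shutdown.
    return data == b""


if __name__ == "__main__":
    ours, theirs = socket.socketpair()
    print(peer_closed(ours))   # peer still open
    theirs.close()
    print(peer_closed(ours))   # only this read reveals the close
    ours.close()
```

Until the read is attempted, the local socket object carries no indication that the other side has gone away, which is why code that never reads or writes can hold on to dead connections.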
Could haProxy be an explanation for these extra connections?
That depends. In theory, haProxy is just a pass-through. For the connection to be recognized by the broker, it's got to go through a handshake, which is a deliberate process and cannot happen inadvertently. Closing a connection also requires a handshake, which is where haProxy might be the culprit. If haProxy thinks the connection is dead and drops it without that process, then it could be a contributing cause. But it is not in and of itself making these new connections.
The RabbitMQ team monitors this mailing list and only sometimes answers questions on StackOverflow.
I recommended that this user upgrade from Erlang 18, which has known TCP connection issues:
https://groups.google.com/d/msg/rabbitmq-users/R3700QdIVJs/taDYKI6bAgAJ
I've managed to reproduce the problem: in the end it was a bug in the way our client used RMQ connections.
It created 1 auto-recovery connection (which is fine), and sometimes it also created a separate, plain connection for "temporary" purposes.
Steps to reproduce my problem were:
   1. Reach the memory alarm in RabbitMQ (e.g. set up an easily reached RAM limit and push a lot of big messages). Connections would be in the "blocking" state.
   2. Start sending messages from our client with this new "temp" connection.
   3. Ensure the connection is in the "blocked" state.
   4. Without clearing the resource alarm, restart the RabbitMQ node.
The "temp" connection itself was still there, despite the fact that auto-recovery was not enabled for it. It also continued sending heartbeats, so the server didn't close it.
We will fix the client so that it always uses a single connection.
Plus, of course, we will upgrade Erlang.

twemproxy (nutcracker) port suddenly becomes unavailable

I have this twemproxy_sentinel setup that uses their default port 22122 as entry and forwards the requests to underlying redis servers running on port 6380, 6381.
Every now and then, port 22122 becomes unavailable, so clients using Redis cannot connect, and a telnet session to the port closes instantly. All I need to do is run /etc/init.d/nutcracker restart and things go back to normal. All along, the sentinel and redis services keep working; only twemproxy seems to get cut off. Before the restart, the nutcracker service is still running (ps shows it), and the logs give no indication of anything failing.
I'm not sure why this happens and tried to dig through the logs of both the redis servers, redis sentinel and twemproxy logs. I also tried looking into /var/log/messages and tried to ensure file-max won't be blocking the # of ports being opened.
I wonder where I can start looking into why things go down.
I realized I had overlooked that max-files doesn't necessarily allow nutcracker to use that many ports; it merely allows the system as a whole to use that many. Things went back to normal after actually allowing nutcracker itself to open more ports.
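The distinction between a system-wide limit and a per-process limit can be checked with a short Python sketch. It assumes a Linux host, where the system-wide value is exposed at /proc/sys/fs/file-max and the per-process cap is the RLIMIT_NOFILE resource limit; the function names are made up for illustration, and the sketch reports the limits of the process that runs it, not those of nutcracker.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def system_file_max(path="/proc/sys/fs/file-max"):
    """Return the system-wide file handle limit, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("per-process soft limit:", soft)
    print("per-process hard limit:", hard)
    print("system-wide limit:", system_file_max())
```

Every socket a proxy holds counts against its per-process soft limit, so a generous system-wide value does not help a daemon that was started with a low soft limit.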

When does Resque open a redis connection?

I've been running into Redis::TimeoutError: Connection timed out errors on Heroku, and I'm trying to pin down the problem. I'm only using Resque to connect to redis, so I'm wondering how Resque connects to redis:
When does Resque connect to redis? When a worker is started?
How long do redis connections last, typically?
It's unclear to me when connections are made and how long they last. Can anyone shed some light on this for me? Thanks!
Typically, connections to Redis from Rails apps are established lazily, the first time the connection is used. For troubleshooting, it is sometimes useful to force the connection by adding a Redis PING (http://redis.io/commands/ping) to the initializer code.
Once a connection is established, it is maintained indefinitely. If the connection is dropped, an attempt to reconnect happens the next time it is used.
Also, be aware that as of early 2015, Heroku had an ongoing issue establishing connections to Redis instances on AWS, as the connections would occasionally time out. Heroku support is aware of that, so you may be able to get some help reaching out to them.
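The connect-on-first-use and reconnect-on-next-use behaviour described in this answer can be sketched generically in Python. This is not how Resque or its Redis client is implemented; LazyConnection, FakeConn and the connect factory are hypothetical names used only to show the pattern.

```python
class FakeConn:
    """Stand-in for a real client; raises once it has been marked broken."""

    def __init__(self):
        self.broken = False

    def ping(self):
        if self.broken:
            raise ConnectionError("connection dropped")
        return "PONG"


class LazyConnection:
    """Open the connection on first use and reopen it after a failure."""

    def __init__(self, connect):
        self._connect = connect   # hypothetical factory returning a connection
        self._conn = None         # nothing is opened at construction time
        self.connect_count = 0

    def _get(self):
        if self._conn is None:
            self._conn = self._connect()
            self.connect_count += 1
        return self._conn

    def call(self, method, *args):
        try:
            return getattr(self._get(), method)(*args)
        except ConnectionError:
            # Forget the dead connection; the next call reconnects.
            self._conn = None
            raise


if __name__ == "__main__":
    conn = LazyConnection(FakeConn)
    print(conn.connect_count)   # no connection yet
    print(conn.call("ping"))    # first use opens it
    print(conn.connect_count)
```

Calling ping from an initializer, as the answer suggests, simply moves that first use to startup so that connection failures show up early.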

RabbitMQ / EasyNetQ drops connections when machine very active?

I'm new to RabbitMQ / EasyNetQ and am trying to better understand a behaviour I am observing. We've seen that when our server running RabbitMQ is busy all EasyNetQ connections are dropped.
This is the exception simultaneously generated on all clients:
System.Exception: Failed to connect to Broker: 'XXXXXX.domain.com',
Port: 5672 VHost: 'XXXX'. ExceptionMessage: 'None of the specified
endpoints were reachable'
EasyNetQ automatically reconnects when the server is no longer busy, but I wonder if it is typical for RabbitMQ/EasyNetQ to drop connections when the machine is busy? (Or if I should be investigating performance issues with my server.)
(PS: By busy, I simply mean updating a large project from source control, relaunching a large ASP.NET application after redeploying it, or running a CPU-intensive calculation on large amounts of data.)
There are limits to the number of connections a RabbitMQ broker will accept. Is it possible that you are rapidly opening a connection, doing some work, then closing it, much as you would with a database connection? If so, that's not how you should interact with the broker. See the EasyNetQ documentation on connections:
https://github.com/mikehadlow/EasyNetQ/wiki/Connecting-to-RabbitMQ

How best to manage Redis connections using ServiceStack?

I work on a few .NET web apps that use Redis heavily for caching along with ServiceStack's Redis client. In all cases I've got Redis running on the same machine. I've used both BasicRedisClientManager and PooledRedisClientManager (always implemented as singletons) and have had some issues with both approaches.
With BasicRedisClientManager, things would work fine for a while, but eventually Redis would start refusing connections. Using netstat we discovered that thousands of TCP connections to the default Redis port were hanging around in TIME_WAIT status.
We then switched to PooledRedisClientManager, which seemed to fix the problem immediately. However, not long after, we started noticing occasional CPU spikes that we narrowed down to thread waiting (System.Threading.Monitor.Wait calls) caused by PooledRedisClientManager.GetClient.
In code, we use a get-in-get-out approach (using ServiceStack's handy ExecAs shortcuts) so in general connections are acquired very frequently but held as briefly as possible.
We get a modest amount of traffic but we're no StackExchange, and I can't help but think the ServiceStack client is up to the job and we're just doing something wrong. Is PooledRedisClientManager the correct approach here? Would it be advisable to simply increase the pool size? Or is that likely just masking a problem with our code?
Just looking for general guidance here, I don't have specific code I need help with at this point. Thanks in advance.
Are you absolutely sure all Redis connections are being disposed?
With ServiceStack, the Redis property on Service and ViewPageBase (if you're using SS Razor) does dispose itself, but any time you request a connection from the pool yourself you must dispose it yourself.
However, despite this, we recently had issues with our pool being exhausted of all connections, too. One of my colleagues discovered that there wasn't proper cleanup for Razor pages and made a pull request here. This means that correct disposal on Razor pages has only been in place since ServiceStack v4.0.21. I have not checked whether that fix has been back-ported to the v3 branch.
My colleague also added TrackingRedisClientsManager that may help you track down the improper disposal. See here
You can also check the stats of a PooledRedisClientManager by using this helper method. We threw it on a little Razor page (to check the stats whenever we feel it appropriate), but you could write better code around it to monitor the pool health of specific nodes, too.
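Why undisposed clients lead to waiting threads can be shown with a language-neutral sketch, written here in Python. It is not the ServiceStack API: TinyPool and its methods are hypothetical, and the context manager plays the role that a using block plays in C#.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """Fixed-size pool; acquire waits, then fails, when every client is out."""

    def __init__(self, size):
        self._free = queue.Queue()
        for index in range(size):
            self._free.put(f"conn-{index}")   # placeholders for real clients

    def acquire(self, timeout=0.1):
        # Raises queue.Empty if no client is returned within the timeout.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)

    @contextmanager
    def client(self, timeout=0.1):
        conn = self.acquire(timeout)
        try:
            yield conn
        finally:
            self.release(conn)   # always returned, even if the body raises

    def available(self):
        return self._free.qsize()


if __name__ == "__main__":
    pool = TinyPool(2)
    with pool.client() as conn:
        print(conn, pool.available())
    print(pool.available())   # the client is back in the pool
```

Each acquire without a matching release shrinks the pool permanently, so callers wait longer and longer. Increasing the pool size would only delay the point at which it runs dry.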