StackExchange.Redis timeout issue: high input stream buffer

We're running a large web application across several servers. We also have a dedicated Linux server that hosts Redis v6.2.1. Our web application uses StackExchange.Redis to connect. Most of the time everything works fine, but occasionally we get a burst of timeouts. I've searched SO and other places (and also followed the link in the timeout exception) for similar issues, but most of the questions deal with not having enough min worker threads. What I've noticed, though, is that our timeout errors seem to be coming from the IN buffer. The question StackExchange.Redis timeouts GET shows a timeout error with an in buffer like ours, but that doesn't seem to be addressed in its answer.
I've posted a sample of the error we get below. From my own findings, it seems we have enough worker threads available; the issue appears to be the in buffer. I'm just not sure what could cause it.
in: 65536
Timeout performing GET (5000ms), next: LRANGE my_key, inst: 1, qu: 0, qs: 48, aw: False, rs: ReadAsync, ws: Idle, in: 65536, in-pipe: 0, out-pipe: 61, serverEndpoint: [server], mc: 1/1/0, mgr: 10 of 10 available, clientName: My_App, IOCP: (Busy=0,Free=400,Min=200,Max=400), WORKER: (Busy=44,Free=356,Min=200,Max=400), v: 2.2.4.27433 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
   at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags)
   at GetCachedObject[T](String key, T& obj)
I've checked the slowlog on our server, but there doesn't seem to be any correlation with the timeouts we're getting, and from my understanding the in buffer is on the client side of things, meaning the client isn't processing the incoming data like it should be.
Has anyone else come across this issue and have some insight into what I should look into? Am I even in the ballpark of what is actually happening? If you need any specific details about our configuration, I can add those as well. We host our ASP.NET web app with at most 4 different worker threads, and we are using the default 5-second timeout in StackExchange.Redis.

Related

Is there a way to register for the memory alarm of RabbitMQ?

My application just froze because RabbitMQ's memory usage exceeded its threshold.
I am using pika and pyrabbit as Python wrappers for handling channels and connections.
I wonder if there is a way for my process to register for something and get a notification when that event occurs (and hopefully even a bit before it does).
When using rabbitpy, you can check whether the blocked flag is set. This flag means that the connection is being blocked due to resource constraints (most likely low memory).
import time
import rabbitpy

# note the '@' separating the credentials from the host
with rabbitpy.Connection('amqp://guest:guest@localhost:5672/%2f') as conn:
    print(conn.blocked)
    # e.g. wait until the connection is unblocked before publishing again
    while conn.blocked:
        time.sleep(0.1)

How to prevent an I/O Completion Port from blocking when completion packets are available?

I have a server application that uses Microsoft's I/O Completion Port (IOCP) mechanism to manage asynchronous network socket communication. In general, this IOCP approach has performed very well in my environment. However, I have encountered an edge case scenario for which I am seeking guidance:
For the purposes of testing, my server application is streaming data (let's say ~400 KB/sec) over a gigabit LAN to a single client. All is well...until I disconnect the client's Ethernet cable from the LAN. Disconnecting the cable in this manner prevents the server from immediately detecting that the client has disappeared (i.e. the client's TCP network stack does not send notification of the connection's termination to the server).
Meanwhile, the server continues to make WSASend calls to the client...and since these calls are asynchronous, they appear to "succeed" (i.e. the data is buffered by the OS in the outbound queue for the socket).
While this is all happening, I have 16 threads blocked on GetQueuedCompletionStatus, waiting to retrieve completion packets from the port as they become available. Prior to disconnecting the client's cable, there was a constant stream of completion packets. Now, everything (as expected) seems to have come to a halt...for about 32 seconds. After 32 seconds, IOCP springs back into action returning FALSE with a non-null lpOverlapped value. GetLastError returns 121 (The semaphore timeout period has expired.) I can only assume that error 121 is an artifact of WSASend finally timing out after the TCP stack determined the client was gone?
I'm fine with the network stack taking 32 seconds to figure out my client is gone. The problem is that while the system is making this determination, my IOCP is paralyzed. For example, WSAAccept events that post to the same IOCP are not handled by any of the 16 threads blocked on GetQueuedCompletionStatus until the failed completion packet (indicating error 121) is received.
My initial plan to work around this involved using WSAWaitForMultipleEvents immediately after calling WSASend. If the socket event wasn't signaled within a short window (e.g. 3 seconds), then I would terminate the socket connection and move on (in hopes of preventing the extensive blocking effect on my IOCP). Unfortunately, WSAWaitForMultipleEvents never seems to hit the timeout (so maybe asynchronous sockets are signaled by virtue of being asynchronous? Or copying data to the TCP queue qualifies as a signal?).
I'm still trying to sort this all out, but was hoping someone had some insights as to how to prevent the IOCP hang.
Other details: My server application is running on Win7 with 8 cores; IOCP is configured to use at most 8 concurrent threads; my thread pool has 16 threads. Plenty of RAM, processor and bandwidth.
Thanks in advance for your suggestions and advice.
It's usual for the WSASend() completions to stall in this situation. You won't get them until the TCP stack times out its resend attempts and completes all of the outstanding sends in error. This doesn't block any other operations. I expect you are either testing incorrectly or have a bug in your code.
Note that your 'fix' is flawed. You could see this 'delayed send completion' situation at any point during a normal connection if the sender is sending faster than the consumer can consume. See this article on TCP flow control and async writes. A better plan is to keep a counter of the number of outstanding writes (per connection) that you want to allow, stop sending once that counter is reached, and resume when it drops below a 'low water mark' threshold value.
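A minimal sketch of that per-connection counter (names like Connection, QueueSend and OnSendCompleted are hypothetical, and the water marks are arbitrary; real code would also need to synchronize the paused flag with whatever produces the data):

#include <winsock2.h>
#include <windows.h>

struct Connection
{
    SOCKET socket = INVALID_SOCKET;
    volatile LONG outstandingSends = 0;   // updated via Interlocked* calls
    bool sendsPaused = false;
    static const LONG kHighWater = 32;    // stop queuing new WSASend calls here
    static const LONG kLowWater  = 8;     // resume once we drop back below this
};

// Returns false when the caller should buffer the data itself instead of sending.
bool QueueSend(Connection& c, WSABUF* buf, LPWSAOVERLAPPED ov)
{
    if (InterlockedIncrement(&c.outstandingSends) > Connection::kHighWater)
    {
        InterlockedDecrement(&c.outstandingSends);
        c.sendsPaused = true;
        return false;
    }
    if (WSASend(c.socket, buf, 1, nullptr, 0, ov, nullptr) == SOCKET_ERROR &&
        WSAGetLastError() != WSA_IO_PENDING)
    {
        InterlockedDecrement(&c.outstandingSends);
        return false;                     // real failure: close the connection
    }
    return true;
}

// Called from the IOCP worker thread when a WSASend completion is dequeued.
void OnSendCompleted(Connection& c)
{
    if (InterlockedDecrement(&c.outstandingSends) < Connection::kLowWater && c.sendsPaused)
    {
        c.sendsPaused = false;            // safe to start sending again
    }
}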
Note that if you've pulled the network cable out of the machine, how do you expect any other operations to complete? Reads will just sit there and only fail once a write has failed, and AcceptEx will simply sit there waiting for the condition to rectify itself.

OpenSSL SSL_ERROR_WANT_WRITE never recovers during SSL_write()

I have two applications talking to each other over SSL. The client is running on a Windows machine; the server is a Linux-based application. The client sends a large amount of data to the server on startup. The data is sent to the server in ~4000-byte chunks, each containing 30 entries, and I have to send about 50000 entries over.
During that transmission the server sends a message to the client; the message size is ~4000 bytes. After that happens, SSL_write() on the client side begins to return the error SSL_ERROR_WANT_WRITE. The client sleeps for 10 ms and retries the SSL_write with the exact same parameters; however, the SSL_write keeps failing indefinitely, and the client subsequently aborts. If it then tries to send a new message, I get an error indicating that I am not resending the same message that was aborted earlier:
error:1409F07F:SSL routines:SSL3_WRITE_PENDING: bad write retry
The server eventually kills the connection, since it has not heard from the client for 60 s, and re-establishes a new one. That is just an FYI; the real issue is how I can get SSL_write to resume.
If the server does not send a request during the receive, the problem goes away. If I shrink the size of the request from 16K to 100 bytes, the problem does not happen.
The SSL CTX MODE is set to SSL_MODE_AUTO_RETRY and SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER.
Does anyone have an idea why simultaneous transmission of large amounts of data from both sides can cause this failure? If this is a limitation, what can I do to prevent it, other than capping the size of what goes out from the server to the client? My concern is that if the client is not sending anything, the throttling I applied to avoid this issue is a waste.
On the client side I tried to perform an SSL_read to see if I need to read during a write, even though I never receive an SSL_ERROR_WANT_READ, but the buffer is not that big anyway (~1000 bytes in size).
Any insight on this would be appreciated.
SSL_ERROR_WANT_WRITE - This error is returned by OpenSSL (I am assuming you are using OpenSSL) only when the socket send gives it an EWOULDBLOCK or EAGAIN error. The socket send will give an EWOULDBLOCK error when the send-side buffer is full, which in turn means that your server is not reading the messages sent from the client.
So, essentially, the problem lies with your Server which is not reading the messages sent to it. You need to check your server and fix it, which will automatically fix your client problem.
Also, why have you set the option "SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER"? SSL always expects that the record which it is trying to send should be sent completely before the next record can be sent.
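For reference, a minimal retry loop on a non-blocking socket might look like the sketch below (POSIX-style select shown; on Windows the idea is the same). The essential point, per the SSL3_WRITE_PENDING error above, is that every retry must pass exactly the same buffer and length:

#include <openssl/ssl.h>
#include <sys/select.h>

// Sketch only: retry SSL_write with the SAME buf/len until the record is written.
int ssl_write_all(SSL* ssl, int fd, const char* buf, int len)
{
    for (;;)
    {
        int n = SSL_write(ssl, buf, len);
        if (n > 0)
            return n;                    // whole record written

        int err = SSL_get_error(ssl, n);
        fd_set rfds, wfds;
        FD_ZERO(&rfds);
        FD_ZERO(&wfds);
        if (err == SSL_ERROR_WANT_WRITE)
            FD_SET(fd, &wfds);           // wait for the kernel send buffer to drain
        else if (err == SSL_ERROR_WANT_READ)
            FD_SET(fd, &rfds);           // e.g. renegotiation wants to read first
        else
            return -1;                   // fatal error

        if (select(fd + 1, &rfds, &wfds, NULL, NULL) < 0)
            return -1;
        // loop and call SSL_write again with exactly the same buf and len
    }
}

Note that this only ever completes if the peer eventually reads; if both sides block writing, as the follow-up below describes, no retry strategy on its own will break the deadlock.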
As it turns out, in both the client- and server-side apps the reads and writes are processed in one thread. In the perfect storm I described above, the client is busy writing (non-blocking). The server then decides to write a large set of messages of its own in between processing its rx buffers. The server's tx is a blocking call. The server gets stuck writing, starves the read, the buffers fill up, and we have a deadlock scenario.
The default Windows buffer is 8 KB, so it doesn't take much to fill it up.
The architecture should be such that there is a separate thread for rx and tx processing on both sides. As a short-term fix, one can increase the rx buffers and rate-limit the tx side to prevent the deadlock.
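As a sketch of that short-term mitigation (the 256 KB value is arbitrary; sock is the underlying socket descriptor):

// Enlarge the receive buffer so a burst from the peer is less likely to fill it
// while this side is busy writing. A mitigation only, not a fix for the design.
int rcvbuf = 256 * 1024;
setsockopt(sock, SOL_SOCKET, SO_RCVBUF, (const char*)&rcvbuf, sizeof(rcvbuf));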

Boost ASIO iostream random delays on reading

I have a client talking to a server over TCP via localhost. The server uses a Boost ASIO iostream in blocking mode. It accepts incoming connections, reads the request, sends the response, and closes the socket. The problem is that sometimes the server has a random delay of 10-200 milliseconds on the first read via getline. I've set the TCP_NODELAY flag on both the server's and the client's sockets. What can be the reason for these delays? I know that I should use select before reading from the socket, but I didn't expect such a large delay via localhost.
Here is the relevant part of server's code:
#include <boost/asio.hpp>
#include <string>
using namespace boost;
using namespace boost::asio;
using namespace std;

// bindAddress is our configured listen address, defined elsewhere
asio::io_service io_service;
ip::tcp::endpoint endpoint(bindAddress, 80);
ip::tcp::acceptor acceptor(io_service, endpoint);
ip::tcp::endpoint peer; // filled in with the client's address on accept
for(;;)
{
    ip::tcp::iostream stream;
    acceptor.accept(*stream.rdbuf(), peer);
    ip::tcp::no_delay no_delay(true);
    stream.rdbuf()->set_option(no_delay);
    string str;
    getline(stream, str); // at this line I get random delays
    // the main part of code
}
I get around 200 requests/second; the delay happens several times per minute.
netstat -m shows that there are enough buffers.
UPDATE:
It looks like a problem with the client, not the server: Apache HttpClient random delays under high requests/second
Answering this question for the sake of closing it out.
Apache HttpClient random delays under high requests/second
Apache's ab(1) also has "saw-tooth"-like performance because it dispatches -c connections that it monitors via select(2); once all of those connections have returned, it dispatches another -c connections. The alternate (and better) approach would be to establish a new connection and re-add the file descriptor to ab(1)'s select(2) array to make sure -c connections are always actively processing.
I've seen ab(1) give some very misleading results because one connection out of a thousand hung (still not a good thing, but it skews results very negatively when using it through a load balancer).

NServiceBus Retry Delay

What is the optimal way to configure/code NServiceBus to delay retrying messages?
In its default configuration, retries happen almost immediately, up to the number of attempts defined in the configuration file. Ideally, I'd like to retry again after an hour, etc.
Also, how does HandleCurrentMessageLater() work? What does the Later aspect refer to?
The NSB retry mechanism is there to remedy temporary problems like deadlocks, etc. Longer retry delays are better handled by creating another process that monitors the error queue and puts messages back into the source queue at whatever interval you like. Take a look at the ReturnToSourceQueue.exe tool that comes with NSB for reference.
Edit: NServiceBus now supports this; we call it Second Level Retries. See http://docs.particular.net/ for more details.
Here is a blog post on why NServiceBus doesn't include a retry delay that I wrote after asking Udi this very same question in his distributed systems architecture course:
NServiceBus Retries: Why no back-off delay?
And here is a discussion thread covering some of the points involved in building an error queue monitor/retry endpoint:
http://tech.groups.yahoo.com/group/nservicebus/message/10964
As far as HandleCurrentMessageLater() goes, all that does is put the current message back at the end of the queue. If there are no other messages waiting, it will be processed again immediately.
As of NServiceBus 3.2.1, they provide an out-of-the-box solution to handle back-off delays in the event of consecutive message failures. The previously existing retry mechanism still retries failures without a delay, to handle cases like database deadlocks, quickly self-healing network issues, etc.
Once a message has been retried the configured number of times, it is moved to a "Second Level Retry" queue. This queue, as configured below, will retry after a 10-, 20-, and then 30-second delay, after which the message is moved to the configured error queue. You're free to change these values to something that better suits your environment.
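A representative app.config snippet (a sketch based on the NServiceBus 3.2.x defaults, which give the 10/20/30-second progression described above; check the documentation link below for the exact element names in your version):

<configSections>
  <section name="SecondLevelRetriesConfig"
           type="NServiceBus.Config.SecondLevelRetriesConfig, NServiceBus.Core"/>
</configSections>

<!-- retry 3 more times, waiting 10s, then 20s, then 30s, before moving to the error queue -->
<SecondLevelRetriesConfig Enabled="true"
                          TimeIncrease="00:00:10"
                          NumberOfRetries="3"/>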
You can also check out this link:
http://docs.particular.net/nservicebus/second-level-retries