StackExchange.Redis System.TimeoutException - redis

I suddenly got this timeout exception when trying to persist a range of data; it was working before and I didn't make any changes:
Timeout performing HMSET {key}, inst: 0, mgr: ExecuteSelect, err:
never, queue: 2, qu: 1, qs: 1, qc: 0, wr: 1, wq: 1, in: 0, ar: 0,
clientName: {machine-name}, serverEndpoint:
Unspecified/localhost:6379, keyHashSlot: 2689, IOCP:
(Busy=0,Free=1000,Min=4,Max=1000), WORKER:
(Busy=0,Free=2047,Min=4,Max=2047), Local-CPU: 100% (Please take a look
at this article for some common client-side issues that can cause
timeouts:
https://github.com/StackExchange/StackExchange.Redis/tree/master/Docs/Timeouts.md)
I'm using Redis on Windows.

In your timeout error message, I see Local-CPU: 100%. This is the CPU usage on the client machine that is calling into the Redis server. You might want to look into what is causing the high CPU load on your client.
This article describes why high CPU usage can lead to client-side timeouts: https://gist.github.com/JonCole/db0e90bedeb3fc4823c2#high-cpu-usage

So, I battled with this issue for a few days and almost gave up. Like @Amr Reda said, breaking large sets into smaller ones might work, but that's not optimal.
In my case, I was trying to move 27,000 records into Redis and I kept encountering the issue.
To resolve it, increase the SyncTimeout value in your Redis connection string. It defaults to 1000 ms, i.e. 1 second, and large datasets typically take longer to add.
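For example, a minimal sketch of raising it via the connection string or ConfigurationOptions (the 5000 ms value and the localhost endpoint are just illustrative assumptions):
using StackExchange.Redis;

// Option 1: set syncTimeout (milliseconds) directly in the connection string.
var muxer = ConnectionMultiplexer.Connect("localhost:6379,syncTimeout=5000");

// Option 2: the same setting via ConfigurationOptions.
var options = new ConfigurationOptions
{
    EndPoints = { "localhost:6379" },
    SyncTimeout = 5000 // larger than the 1000 ms default described above
};
var muxer2 = ConnectionMultiplexer.Connect(options);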

I found out what was causing the issue: I was trying to bulk insert into a hash. What I did was chunk the list being inserted into smaller batches.
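As an illustration, a minimal sketch of that chunking approach (the batch size of 1000 and the helper name BulkHashSetAsync are my own assumptions, not the original code):
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

// Write a large number of hash entries in smaller batches instead of one huge HMSET.
static async Task BulkHashSetAsync(IDatabase db, RedisKey key, IReadOnlyList<HashEntry> entries, int batchSize = 1000)
{
    for (int i = 0; i < entries.Count; i += batchSize)
    {
        // Send each slice as its own HashSet call so no single command grows too large.
        HashEntry[] chunk = entries.Skip(i).Take(batchSize).ToArray();
        await db.HashSetAsync(key, chunk);
    }
}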

Quick suggestions that worked in my case: a console .NET project with very high concurrency, using multithreading (around 30,000 threads).
In Program.cs, I added some ThreadPool settings:
int newWorkerThreadsPerCore = 50, newIOCPPerCore = 100;
ThreadPool.SetMinThreads(newWorkerThreadsPerCore, newIOCPPerCore);
Also, I had to change everything from:
var redisValue = dbCache.StringGet("SOMETHING");
To:
var redisValue = dbCache.StringGetAsync("SOMETHING").Result;
Even though they might look almost the same (considering you always end up waiting for a result), if you use the non-async version and a single thread hits a Redis timeout, it will make all the other 29,999 threads wait for Redis to time out too, while the async one only causes a timeout in that single thread.

Related

Redis stream XReadGroup not reading new messages even if `BLOCK` parameter is 0

I am using a Redis stream and XReadGroup for reading messages from the stream. I have set the block parameter to 0.
Currently my code looks like this:
data, err := w.rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
    Group:    w.opts.group,
    Consumer: w.opts.consumer,
    Streams:  []string{w.opts.streamName, ">"},
    Count:    1,
    Block:    0,
}).Result()
I am currently facing a problem where, if I keep the application (involving this code) idle for 10-12 hours, XReadGroup is not able to read new messages; if I restart the application, all the new messages are consumed at once. Is there any solution for this problem?
You can use a block time of, let's say, 10s; it does not change anything functionally (I guess the code you provided is inside a while(true) loop).
From my experience you can keep the app idle for days and it still works.
I don't really know why, but I guess it has to do with the constant back and forth "resetting" the connection.

StackExchange.Redis.RedisTimeoutException Timeout performing EXISTS

.NET Framework 4.7 and StackExchange.Redis version=2.5.43.
I am only seeing this error when the cache is on the server but not when running locally with the cache running in a container on my machine.
StackExchange.Redis.RedisTimeoutException:
'Timeout performing EXISTS (10000ms),
next: PSETEX MyKey-f79c9cad-c265-e611-80d8-005056b35bfa,
inst: 190,
qu: 39996,
qs: 176,
aw: True,
bw: Flushing,
rs: DequeueResult,
ws: Flushing,
in: 0,
in-pipe: 0,
out-pipe: 528424,
serverEndpoint: MyServer:6380,
mc: 1/1/0,
mgr: 9 of 10 available,
clientName: MyClient(SE.Redis-v2.5.43.42402),
IOCP: (Busy=0,Free=1000,Min=12,Max=1000),
WORKER: (Busy=1,Free=2046,Min=12,Max=2047),
v: 2.5.43.42402 (Please take a look at this article for some common client-side issues
that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)'
I have tried increasing the SyncTimeout configuration to 10000, but it made no difference.
The out-pipe value looks quite high but I am not sure what this is an indication of, or if it is a red herring.
Any ideas what could cause this timeout?
Thanks
As pointed out by @slorello, the high "qu" and "qs" values seem to be the issue, with some big payloads blocking the pipe.
Breaking the big payloads down into smaller ones seems to have stopped the timeouts. I will also investigate ConnectionMultiplexer pooling, as recommended here in point 10.
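For reference, a minimal sketch of what such a pool could look like (this is a hand-rolled round-robin wrapper, not a built-in StackExchange.Redis API; the pool size and configuration string are assumptions):
using System;
using System.Linq;
using System.Threading;
using StackExchange.Redis;

// A small round-robin pool of multiplexers, so one connection flushing a large
// payload does not stall every other request queued behind the same pipe.
sealed class RedisConnectionPool
{
    private readonly Lazy<ConnectionMultiplexer>[] _connections;
    private int _next = -1;

    public RedisConnectionPool(string configuration, int poolSize = 5)
    {
        _connections = Enumerable.Range(0, poolSize)
            .Select(_ => new Lazy<ConnectionMultiplexer>(
                () => ConnectionMultiplexer.Connect(configuration)))
            .ToArray();
    }

    public IDatabase GetDatabase()
    {
        // Rotate through the pool on each call.
        int index = (int)((uint)Interlocked.Increment(ref _next) % (uint)_connections.Length);
        return _connections[index].Value.GetDatabase();
    }
}

// Usage (endpoint is an assumption):
// var pool = new RedisConnectionPool("MyServer:6380", poolSize: 5);
// var db = pool.GetDatabase();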

StackExchange.Redis.RedisTimeoutException

We are experiencing timeouts in our application using Redis. We have already investigated but without success. See the timeout error below:
StackExchange.Redis.RedisTimeoutException: Timeout performing GET
USERORGANIZATIONS_D96510A4-A9A2-4DAA-84A9-BB77363DD3EA, inst: 9, mgr:
ProcessReadQueue, err: never, queue: 24, qu: 0, qs: 24, qc: 0, wr: 1, wq: 1,
in: 65536, ar: 1, clientName: RD00155D008B42, serverEndpoint:
Unspecified/xxxxxxx.redis.cache.windows.net:xxxx, keyHashSlot: 9735, IOCP:
(Busy=0,Free=1000,Min=4,Max=1000), WORKER:
(Busy=27,Free=32740,Min=200,Max=32767) (Please take a look at this article
for some common client-side issues that can cause timeouts:
http://stackexchange.github.io/StackExchange.Redis/Timeouts)
If you need more information, just ask and I'll try to provide it. Thanks in advance.
The “in: 65536” value in the timeout is very high. It indicates how much data is sitting in the client’s socket kernel buffer, meaning the data has arrived at the local machine but has not been read by the application layer yet. This typically happens when 1) thread pool settings need to be adjusted or 2) the client CPU is running high. Here are some articles I suggest you read:
 
Diagnosing Redis errors on the client side
Azure Redis Best Practices
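As a quick first check on the thread-pool side, a minimal .NET sketch (standard System.Threading calls; the logging format just mirrors the IOCP and WORKER fields in the timeout message):
using System;
using System.Threading;

static void LogThreadPoolStats()
{
    // Minimum number of threads the pool creates on demand without throttling.
    ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
    // Hard upper limits.
    ThreadPool.GetMaxThreads(out int maxWorker, out int maxIocp);
    // Currently free threads; Busy = Max - Available.
    ThreadPool.GetAvailableThreads(out int freeWorker, out int freeIocp);

    Console.WriteLine($"WORKER: Busy={maxWorker - freeWorker}, Free={freeWorker}, Min={minWorker}, Max={maxWorker}");
    Console.WriteLine($"IOCP:   Busy={maxIocp - freeIocp}, Free={freeIocp}, Min={minIocp}, Max={maxIocp}");
}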

Timeout Exception while retrieving from Redis Cache at the same place always

We are receiving the following timeout exception while retrieving data from the Redis cache.
'Timeout performing GET inst: 2, mgr: Inactive, err: never, queue: 3, qu: 0, qs: 3, qc: 0, wr: 0, wq: 0, in: 18955,
IOCP: (Busy=4,Free=996,Min=2,Max=1000), WORKER: (Busy=0,Free=1023,Min=2,Max=1023),
Please note: every timeout exception has different values for the fields above. queue is sometimes 2, 1, or 3, and qs also varies with the queue value.
Also, the in: value keeps changing, e.g. 18955, 65536, 36829, etc.
Even IOCP changes, e.g.
IOCP: (Busy=6,Free=994,Min=2,Max=1000), WORKER: (Busy=0,Free=1023,Min=2,Max=1023).
Please note:
There are many similar questions on Stack Overflow and we have tried all of them, but no luck.
We recently updated the NuGet package to the latest stable version (v1.2.1) of the StackExchange.Redis library.
This exception seems to be occurring at the same place every time, even though there are various places where we use the Redis cache. We found this with the help of the stack trace.
Also, we never faced this issue earlier; we have been using the same solution for the last 3 years and never encountered it. This exception has been occurring frequently for the last 3 months, at least 3-4 times daily.
It looks like you are experiencing thread-pool throttling (based on the Busy and Min numbers in your error message). You will need to increase the MIN values for the IOCP and worker pool threads.
https://gist.github.com/JonCole/e65411214030f0d823cb#file-threadpool-md has more information.
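For example, a minimal sketch of raising the minimums once at application startup (the value 200 is only a placeholder to tune for your workload):
using System;
using System.Threading;

// Run before the first Redis request; worker threads first, then IOCP (completion-port) threads.
ThreadPool.GetMinThreads(out int curWorker, out int curIocp);
ThreadPool.SetMinThreads(Math.Max(curWorker, 200), Math.Max(curIocp, 200));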

Dataflow's BigQuery inserter thread pool exhausted

I'm using Dataflow to write data into BigQuery.
When the volume gets big and after some time, I get this error from Dataflow:
{
metadata: {
severity: "ERROR"
projectId: "[...]"
serviceName: "dataflow.googleapis.com"
region: "us-east1-d"
labels: {…}
timestamp: "2016-08-19T06:39:54.492Z"
projectNumber: "[...]"
}
insertId: "[...]"
log: "dataflow.googleapis.com/worker"
structPayload: {
message: "Uncaught exception: "
work: "[...]"
thread: "46"
worker: "[...]-08180915-7f04-harness-jv7y"
exception: "java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask#1a1680f rejected from java.util.concurrent.ThreadPoolExecutor#b11a8a1[Shutting down, pool size = 100, active threads = 100, queued tasks = 2316, completed tasks = 1192]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:681)
at com.google.cloud.dataflow.sdk.util.BigQueryTableInserter.insertAll(BigQueryTableInserter.java:218)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.flushRows(BigQueryIO.java:2155)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$StreamingWriteFn.finishBundle(BigQueryIO.java:2113)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.finishBundle(DoFnRunnerBase.java:158)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.finishBundle(SimpleParDoFn.java:196)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.finishBundle(ForwardingParDoFn.java:47)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.finish(ParDoOperation.java:62)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:79)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:657)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:86)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:483)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)"
logger: "com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker"
stage: "F10"
job: "[...]"
}
}
It looks like I'm exhausting the thread pool defined in BigQueryTableInserter.java:84. This thread pool has a hardcoded size of 100 threads and cannot be configured.
My questions are:
How could I avoid this error?
Am I doing something wrong?
Shouldn't the pool size be configurable? How can 100 threads be the perfect fit for all needs and machine types?
Here's a bit of context of my usage:
I'm using Dataflow in streaming mode, reading from Kafka using KafkaIO.java
"After some time" is a few hours, (less than 12h)
I'm using 36 workers of type n1-standard-4
I'm reading around 180k messages/s from Kafka (about 130MB/s of network input to my workers)
Messages are grouped together, outputting around 7k messages/s into BigQuery
Dataflow workers are in the us-east1-d zone, BigQuery dataset location is US
You aren't doing anything wrong, though you may need more resources, depending on how long volume stays high.
The streaming BigQueryIO write does some basic batching of inserts by data size and row count. If I understand your numbers correctly, your rows are large enough that each is being submitted to BigQuery in its own request.
It seems that the thread pool for inserts should install ThreadPoolExecutor.CallerRunsPolicy, which causes the caller to block and run jobs synchronously when they exceed the capacity of the executor. I've posted PR #393. This will convert the work-queue overflow into pipeline backlog, as all the processing threads block.
At this point, the issue is standard:
If the backlog is temporary, you'll catch up once volume decreases.
If the backlog grows without bound, then of course it will not solve the issue and you will need to apply more resources. The signs should be the same as any other backlog.
Another point to be aware of: at around 250 rows/second per thread, this will exceed the BigQuery quota of 100k updates/second for a table (such failures will be retried, so you might get past them anyhow). If I understand your numbers correctly, you are far from this.