I know there are lots of queries on this and I am far from being a redis expert. That said I do see a pattern and was hoping to gain some insight. Our keys are small , maybe around 25 chars. We don't have a huge amount of data per key, 100-400 K. We do get spikes of reads and it appears they start to timeout. They start IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=3,Free=8188,Min=2,Max=8191) and over a few seconds end up like IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=250,Free=7941,Min=2,Max=8191). Does anyone have any insights as to where we can look to improving this situation. According to the azure portal our memory has peeked around 350mb and even with our spikes server load maxd at 4%. Our cache hits go from < 500 to spiked of 3K / per minute. Any guidance would be great.
Related
I use AWS ElastiCache Redis for our prod. I see CPU every 30 minutes of the round hour from average of 2-3% to 20%.
This is constant, which tells me it comes from schedule job.
From cloudwatch I have a suspicion it is related to KEY (and maybe SET) commands and it's latency is the only one which jumps in the same exact time as the CPU jumps.
I would like to understand what KEY (and maybe SET) commands run on a specific time, or some other way which can help me investigate this.
Thanks for any advice.
with redis-cli monitor I was able to get most of the commands running on server in a stream and get the high usage.
We recently attempted to switch from Azure Redis to Redis Enterprise, unfortunately after about an hour we were forced to roll back due to performance issues. We're looking for advice on how to get to the root cause and proceed. Here's what I've figured out so far, but I'm happy to add any more details as necessary.
First off, the client is a .NET Framework app using StackExchange.Redis version 2.1.30. The Azure Redis instance is using 4 shards, and the Redis Enterprise instance is also configured for 4 shards.
When we switched over to Redis Enterprise, we would immediately see several thousand of these exceptions per 5 minute interval:
Timeout performing GET (5000ms), next: GET [Challenges]::306331, inst:
1, qu: 0, qs: 3079, aw: False, rs: ReadAsync, ws: Idle, in: 0,
serverEndpoint: xxxxxxx:17142, mc: 1/1/0, mgr: 9 of 10 available,
clientName: API, IOCP: (Busy=2,Free=998,Min=400,Max=1000), WORKER:
(Busy=112,Free=32655,Min=2000,Max=32767), Local-CPU: 4.5%, v:
2.1.30.38891 (Please take a look at this article for some common client-side issues that can cause timeouts:
https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Looking at this error message, it appears there's tons of things in the WORKER thread pool (things waiting on a response from Redis Enterprise), but nearly nothing in the IOCP thread pool (responses from Redis waiting to be processed by our client code). So, there's some sort of bottleneck on the Redis side.
Using AppInsights, I created a graph of the busy worker threads (dark blue), busy IO threads (red), and CPU usage (light blue). We see something like this:
The CPU never really goes above 20% or so, the IO threads are barely a blip (I think the max is like 2 busy), but the worker threads kinda grow and grow until eventually everything times out and the process starts over again. A little after 7pm is when we decided to roll back to Azure Redis, so everything is great at that point. So, everything points to Redis being some sort of bottleneck. So, let's look at the Redis side of things.
During this time, Redis reported a max of around 5% CPU usage. Incoming traffic topped out around 1.4MB/s, and outgoing traffic topped out around 9.5MB/s. Ops/sec were around 4k. Latency around this time was 0.05ms, and the slowest thing in the SLOWLOG was like 15ms or so. In other words, the Redis Enterprise node was barely breaking a sweat and was easily able to keep up with the traffic being sent to it. In fact, we had 4 other nodes in the cluster that weren't even being used since Redis didn't even see the need to send anything to other nodes. Redis was basically just yawning.
From here, I was thinking maybe there were network bandwidth contraints. All of our VMs are configured for accelerated networking, and we should have 10gig connections to these machines. I decided to run an iperf between the client and the server:
I can transfer easily over 700Mbit/sec between the client and the Redis Enterprise server, yet the server is processing 9.5MB/sec easily. So, it doesn't appear the problem is network bandwidth.
So, here's where we stand:
The same code works great with Azure Redis, yet causes thousands of timeouts when we switch over to Redis Enterprise.
Redis Enterprise is handling 4,000 operations per second and sending out 9 megs a second, and can usually handle a single operation in a fraction of a ms, with the very longest being 15ms.
I can send 700+ Mb/sec between the client and server.
Yet, the WORKER thread pool builds up with pending requests to Redis and eventually times out.
I'm pretty stuck here. What's a good next step to diagnose this issue? Thanks!
We have redis cluster which holds more than 2 million and these keys has been updated with the time interval of 1 minute. Now we have a requirement to take the snapshot of the redis db in a particular interval For eg every 10 minute. This snapshot should not pause the redis command execution.
Is there any async way of taking snapshot from redis ?
It would be really helpful if we get any suggestion on open source tools or frameworks.
The Redis BGSAVE is async and takes a snapshot.
It calls the fork() function of the OS. According to the Redis manual,
Fork() can be time consuming if the dataset is big, and may result in Redis to stop serving clients for some millisecond or even for one second if the dataset is very big and the CPU performance not great
Two million updates in one minutes, that is 30K+ QPS.
So you really have to try it out, run the benchmark that similutes your business, then issue BGSAVE, monitor the I/O and CPU usage of your system, and see if there's a spike in your redis calling latency.
Then issue LASTSAVE, which will tell you when your last success snapshot happened. So you can adjust your backup schedule.
I searched online for awhile about what is "Excessive resource usage" on SQL Azure, still cannot get an idea.
Some articles suggest query takes too long, too much memory etc will cause "Excessive resource usage". But If I use simple query, simple data structure, what will happen?
For example: I get a 1G SQL Azure as session state. Since session is a very small string, and save/delete all the time, I don't think it will grow to 1G for millions of session simultaneously. You can calculate, for 1 million session, 20 char each, only take 20M space, consider 20 minutes expire etc. Cannot even close to 1G. But the queries, should be lots and lots. Each query will be very simple and fast by index.
I wanna know, if this use will be consider as "Excessive resource usage"? Is there any hard number to limit you on the usage?
Btw, as example above, if all happen in same datacenter, so all cost is 1G database which is $10 a month, right?
Unfortunately the answer is 'it depends'. I think that probably the best reference (with guidance) on the SQL Azure Query Throttle is here: TechNet Article on SQL Azure Perormance This will povide details about the metrics that are monitored and the mechanism of the throttle.
The reason that I say it depends is that the throttle is non-deterministic for any given user. This is because the throttle will be activated based on the total load on the node (physical SQL Server in Azure DC). While the subscribers who will get throttled are the subscribers delivering the greatest load the level at which the throttle kicks in will depend on the total load on the node. SO if you are on a quiet node (where other tenant DBs are relatively inactive) then you will be able to put through a bunch more throughput than if you are on a busy node.
It is very appealing to use 1GB SQL Azure DBs for session state storage; you've identified the cost benefits. You are taking a risk though. One way to mitigate this risk is to partition across at least two SQL Azure 1GB DBs and adjust the load yourself based on whether one of the DBs starts hitting the throttle.
Another option if you want determinism for throughput is to use the WIndows Azure Cache to back your sesion state store. The Cache has hard pre-defined limits for query throughput so you can plan for it more easily Azure Caching FAQ including Limits. The Cache approach is probably a bit more expensive but with a lower risk of problems.
Since a few days ago, the SQL server (Microsoft SQL Server 2005) backing our site has started occasionally timeouting. It is happening at seemingly random times approximately every hour or two. It usually takes about 10 minutes during which we see hundreds of timeouted requests. Under normal circumstances, most of our queries take less than 50ms. A query which takes a significant fraction of a second is an exception.
I have effectively killed a day trying to figure out at least something without any real progress. Normally, the server load is about 10-20%, and when the timeouts happen, we don’t see any increased CPU load. Also, there is nothing special happening during the timeouts; no overzealous web crawler, no heavy background tasks, no increased network traffic, no increased number of connections etc. Simply, everything looks as usual.
Not making any progress, we decided to restart it (and install the latest SP since we were in it) which seems to have fixed the problem. It has been already over six hours without any incident. Also, the CPU load has gone down under 10%.
It almost seems like if the SQL server "deteriorated" overtime. Perhaps, some internal structure (some cache or statistic) got out shape and caused the occasional problems. I don’t have any other explanation.
The only thing I noticed when I was monitoring the server (and got lucky once to be present when the timeouts were happening), I saw several long running queries waiting on CXPACKET. But I learned that this is most likely just a consequence of some other problem. I wrote a script monitoring SQL requests, and so hopefully, next time it happens, I will have more information.
Has anybody had similar experience? I’m not an SQL Server guru. Any suggestions are welcome.
since everything looked normal: CPU, nothing special happening, no overzealous web crawler, no heavy background tasks, no increased network traffic, no increased number of connections etc. I'd look into locking\blocking\race condition. Use this to see what (if anything) is locking when the time-out are happening:
How to find out what SQL queries are being blocked and what's blocking them?