Profiling an application with Redis and getting ObjectNative::WaitTimeout

I am profiling an application which gives timeouts when there are cache misses. In my code I am doing batching: I send multiple batches (size = 100) in parallel to fetch data from Redis. I know that a single batch of 200 does not time out, so ideally multiple parallel batches of 100 should not time out either.
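For reference, a rough sketch of that batching pattern (the actual code is C# with StackExchange.Redis; this Python/redis-py version, including its key names and counts, is purely an illustrative assumption):

```python
# Illustrative only: fetch keys in parallel batches of 100, one pipelined
# round trip per batch. Key names and counts are made up for the sketch.
from concurrent.futures import ThreadPoolExecutor

import redis

r = redis.Redis(host="localhost", port=6379)

def fetch_batch(keys):
    pipe = r.pipeline(transaction=False)
    for k in keys:
        pipe.get(k)
    return pipe.execute()  # one network round trip for the whole batch

keys = [f"item:{n}" for n in range(1000)]
batches = [keys[i:i + 100] for i in range(0, len(keys), 100)]
with ThreadPoolExecutor(max_workers=len(batches)) as pool:
    values = [v for batch in pool.map(fetch_batch, batches) for v in batch]
```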
But I do see timeouts, and when profiling the application I see that about 68 percent of the time is spent here:
ConnectionMultiplexer.ExecuteSyncImpl                670 ms
    OTHER  ObjectNative::WaitTimeout (BLOCKING)      648 ms
Can someone give some insight into what this means? Does it mean there is queuing happening, and how can I figure out where the issue might be? Any pointers would be helpful.
Thanks.

Related

DynamoDB large transaction writes are timing out

I have a service that receives events that vary in size from ~5 to 10k items. We split these events up into chunks, and these chunks need to be written in transactions because we do some post-processing that depends on a successful write of all the items in the chunk. Ordering of the events is important, so we can't dead-letter them to process at a later time. We're running into an issue where we receive very large (10k-item) events and they clog up the event processor, causing a timeout (currently set to 15 s). I'm trying to find a way to increase the processing speed of these large events to eliminate the timeouts.
I'm open to ideas, but curious: are there any pitfalls to running transaction writes concurrently? E.g. splitting the event into chunks of 100 and having X threads run through them to write to DynamoDB concurrently.
There is no problem with multi-threading writes to DynamoDB, so long as you have the capacity to handle the extra throughput.
I would also advise trying smaller batches: with 100 items in a transaction, if one happens to fail for any reason then they all fail. Typically I suggest aiming for batch sizes of approximately 10, but of course this depends on your use case.
Also ensure that no threads are targeting the same item at the same time, as conflicting writes cause large numbers of failed transactions.
In summary: keep batches as small as possible, ensure your table has adequate capacity, and ensure you don't hit the same items concurrently. A rough sketch of this pattern is below.
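As a hedged illustration of that advice, here is a minimal Python sketch using boto3; the table name "events", the key attribute "pk", the chunk size, and the worker count are all assumptions, not details from the question:

```python
# Sketch: write an event as many small DynamoDB transactions, run concurrently.
# Each chunk succeeds or fails as a unit; chunks must not share items, or the
# transactions will conflict and fail.
from concurrent.futures import ThreadPoolExecutor

import boto3

dynamodb = boto3.client("dynamodb")

def chunked(items, size=10):
    # Small chunks: one bad item only fails its own 10-item transaction.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def write_chunk(chunk):
    dynamodb.transact_write_items(
        TransactItems=[{"Put": {"TableName": "events", "Item": item}}
                       for item in chunk]
    )

def write_event(items, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(write_chunk, chunked(items)))  # surfaces any failure

# Items are in DynamoDB attribute-value format.
write_event([{"pk": {"S": f"item-{n}"}} for n in range(5000)])
```

Note that the threads may complete out of order; if the post-processing needs strict ordering across chunks, the dependent chunks would have to be sequenced rather than run fully in parallel.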

How to set up a JMeter test to have a certain throughput?

I am trying to perform a load test, and according to our stats (which I can't disclose) we expect peaks of 300 users per minute, uploading files of different sizes to our system.
Now, I created a JMeter test, which works fine, but what I don't know how to fine-tune is aiming for a certain throughput.
I created a test with 150 users and 100 loops, expecting it to simulate 150 users coming and going and to upload 15,000 files in total, but that never happened, because at a certain point the tests started failing.
Looking at our New Relic monitoring, it seems that I somehow reached 1,600 requests in a single minute. I am testing a microservice running 12 instances, so that might play a role in the higher number of requests, but even so I expected the tests to pass. My uploaded file was 600 KB. In the end, I had 98% failures.
I reduced the file size to 13 KB; at that point, I got 17% failures.
So there's obviously something related to the time needed to upload the bigger file, but I don't understand what causes 150 threads/users in X loops to become 1,600 at the same time. I'd expect JMeter never to start a new loop with the same thread unless the original user is finished. That being said, I'd expect at most 150 users in a given minute.
Any clarification on how to get an exact number of users/threads running at the same time would be much appreciated.
I tried playing with the KeepAlive checkbox, and I tried setting the request lifetime to 10 seconds (all of the uploads get a response earlier than that), but then JMeter finished the threads and I had only 150 runs, no loops.
Thanks!
By default JMeter executes Samplers as fast as it can, so there are two main factors which define the actual throughput (number of requests per unit of time):
JMeter configuration
Application under test response time
So if you're following JMeter best practices and JMeter has enough headroom to operate in terms of CPU, RAM, etc., you are only limited by your application's response time, as each JMeter thread waits for the previous request to finish before starting a new one.
If you need to "slow down" your test execution, consider adding e.g. a Constant Throughput Timer to your Test Plan, where you will be able to define the desired number of requests per minute (for example, a target of 300 samples per minute to match your expected peak).
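As a back-of-the-envelope illustration of why 150 looping threads can produce ~1,600 requests per minute (the average response time used here is an assumption, not a figure from the question):

```python
# Without a throughput timer, each thread loops as fast as responses arrive:
# throughput is roughly threads / response_time.
threads = 150
avg_response_s = 5.5                           # assumed average upload response time
actual_rpm = threads / avg_response_s * 60
print(f"~{actual_rpm:.0f} requests/minute")    # ~1636, close to the observed 1600

# A Constant Throughput Timer caps this at a target, e.g. 300 samples/minute.
target_rpm = 300
```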

Redis is taking too much time

We are using Redis server and we think it is not responding well. When we send a single request to the server from Node, it responds in 50 ms, but when we send the same request in bulk (1000), it takes 53 seconds. That is too much, so can you please explain what we can do to reduce the response time for 1000 requests?
It seems that you're not using Redis pipelining, and/or multiple clients connected to Redis, each issuing a portion of the commands you're trying to execute.
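For example, a minimal pipelining sketch with the Python redis-py client (the question uses Node, where clients such as ioredis offer the same feature; key names here are made up):

```python
# Send 1000 commands in one network round trip instead of 1000 round trips.
import redis

r = redis.Redis(host="localhost", port=6379)

pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC
for i in range(1000):
    pipe.get(f"key:{i}")
results = pipe.execute()  # one round trip for all 1000 GETs
```

At ~50 ms per round trip, 1000 serial requests cost ~50 seconds in latency alone, which matches the 53 seconds observed; a single pipelined batch removes almost all of that.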

Redis dequeue rate 10x slower over the network

I was testing the enqueue and dequeue rate of Redis over a network with 1 Gbps LAN speed; both machines have 1 Gbps Ethernet cards.
Redis version: 3.2.11
I lpush 100,000 (1 lakh) items of 1 byte each, using the Python client.
Dequeuing the items using rpop took around 55 seconds over the network, which is just ~1,800 dequeues/sec, whereas the same operation completes within 5 seconds when I dequeue locally, which is around 20,000 dequeues/sec.
Enqueue rates are close to the dequeue rates.
This was done on the office network when there was not much usage, and the same is observed in production environments too!
A drop of less than 3x over the network would be acceptable; around 10x makes it look like I am doing something wrong.
Please suggest whether I need to make any configuration changes on the server or client side.
Thanks in advance.
Retroactively replying in case anyone else discovers this question.
Round-trip latency and a lack of concurrency are likely your bottlenecks here. If all of the dequeue calls are serial, then you are stacking that network latency: with 100,000 calls at 2 ms round-trip latency, you'd have at least 200,000 ms of latency overhead, i.e. over 3 minutes. This is to say that your application is waiting for the server to receive the payload, do something, and reply to acknowledge that the operation was successful. Some Redis clients also perform multiple calls to enqueue/dequeue a single job (pop plus ack/del), potentially doubling that number.
The following link illustrates the different approaches different libraries take to using Redis keys (Ruby's resque vs. Clojure's carmine; note the multiple Redis commands executed on the server for a single message). This is likely the cause of the 10x drop rather than the 3x you were expecting.
https://kirshatrov.com/2018/07/20/redis-job-queue/
An oversimplified example of two calls per message dequeue (1 ms of network latency each way, and Redis server operations taking 1 ms):
time | client                              server
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1ms | pop msg >--(1ms)--> receive pop request
2ms | [process request (1ms)]
3ms | receive msg <--(1ms)--< send msg to client
4ms | send del >--(1ms)--> receive del
5ms | [delete msg from queue (1ms)]
6ms | receive ack <--(1ms)--< reply with delete ack
Improving dequeue times often involves using a client that supports multi-threaded or multi-process concurrency (e.g. 10 concurrent workers would significantly reduce the overall time to completion). This ensures your network is better utilized by sending a stream of dequeue requests, instead of waiting for one request to complete before grabbing the next one, as sketched below.
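As a hedged sketch of that idea (not from the original answer; the queue name "jobs", host, and worker count are assumptions):

```python
# Ten workers drain a Redis list concurrently, keeping several round trips
# in flight at once instead of paying for each one serially.
from concurrent.futures import ThreadPoolExecutor

import redis

def worker(_):
    r = redis.Redis(host="redis-host", port=6379)  # one connection per worker
    drained = 0
    while True:
        msg = r.rpop("jobs")
        if msg is None:       # queue is empty, stop
            return drained
        drained += 1          # process msg here

with ThreadPoolExecutor(max_workers=10) as pool:
    counts = list(pool.map(worker, range(10)))
print(f"dequeued {sum(counts)} messages")
```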
As for 1 byte vs. 500 bytes: the default Ethernet MTU is 1500 bytes. Subtracting IP and TCP headers, the payload is ~1460 bytes (less if tunneling with GRE/IPsec, more if using jumbo frames). Since both payload sizes fit in a single TCP packet, they will have similar performance characteristics.
A 1 Gbps Ethernet interface can deliver anywhere between 81,274 and 1,488,096 packets per second (depending on payload size).
So really, it's a question of how many processes and threads you can run concurrently on the client to keep the network and the Redis server busy.
Redis is generally I/O bound, not CPU bound, so it may be hitting network limits. Given the small size of your messages, most of the bandwidth may be eaten by TCP/IP overhead.
On a local machine you are bound by memory bandwidth, which is much faster than your 1 Gbps network. You can likely increase network throughput by increasing the amount of data you grab at a time, as in the sketch below.
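A hedged sketch of grabbing more per round trip, using LRANGE plus LTRIM inside a MULTI/EXEC transaction so the read and the trim happen atomically (the queue name "jobs" and batch size are assumptions):

```python
# One round trip moves up to `batch` messages instead of one.
import redis

r = redis.Redis(host="redis-host", port=6379)

def dequeue_batch(batch=500):
    pipe = r.pipeline(transaction=True)  # MULTI/EXEC: read+trim are atomic
    pipe.lrange("jobs", -batch, -1)      # read up to `batch` items from the tail
    pipe.ltrim("jobs", 0, -batch - 1)    # then drop those items from the list
    items, _ = pipe.execute()
    return items[::-1]  # tail item is oldest when producers LPUSH; reverse for FIFO

while True:
    msgs = dequeue_batch()
    if not msgs:
        break
    # process msgs ...
```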

SQL Server Log Messages / Memory Clog

I have a dedicated server that's been running for years, with no recent code or configuration changes, but suddenly, about a week ago, the MS SQL Server DB started becoming unresponsive, and shortly thereafter the entire site goes down due to memory issues on the server. It is sporadic, which leads me to believe it could be a malicious DDoS-like attack, but I am not sure how to confirm what's going on.
After a reboot, it can stay up for a few days, or only a few hours, before I start seeing rampant occurrences of these Info messages in the Windows logs, shortly before it seizes up and fails. Research has not yielded any actionable info as of yet. Please help, and thank you.
Process 52:0:2 (0xaa0) Worker 0x07E340E8 appears to be non-yielding on Scheduler 0. Thread creation time: 13053491255443. Approx Thread CPU Used: kernel 280 ms, user 35895 ms. Process Utilization 0%. System Idle 93%. Interval: 6505497 ms.
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 2940 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%. System Idle: 91%.
Here's a blog post about the issue (danieladeniji.wordpress) that should help you get started.
It seems unlikely that it would be a DDoS.