How to understand redis-cli's result vs redis-benchmark's result - redis

First, I am new to Redis.
So, I measure latency with redis-cli:
$ redis-cli --latency
min: 0, max: 31, avg: 0.55 (5216 samples)^C
OK, on average I get a response in 0.55 milliseconds. From this I assume that, using only one connection, in 1 second I can get: 1000 ms / 0.55 ms ≈ 1800 requests per second.
Then, on the same computer, I run redis-benchmark using only one connection and get more than 6000 requests per second:
$ redis-benchmark -q -n 100000 -c 1 -P 1
PING_INLINE: 5953.80 requests per second
PING_BULK: 6189.65 requests per second
So, having measured the latency, I expected to get around 2000 requests per second at best. However, I got 6000 requests per second, and I cannot find an explanation for it. Am I correct when I calculate 1000 ms / 0.55 ms ≈ 1800 requests per second?
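That back-of-the-envelope conversion can be written down in a couple of lines of Python (illustrative only; the function name is mine):

```python
def max_rps_single_connection(avg_latency_ms):
    """Upper bound on requests/sec for one synchronous connection:
    each request must complete (~avg_latency_ms) before the next starts."""
    return 1000.0 / avg_latency_ms

# 0.55 ms average latency -> roughly 1818 requests per second
print(round(max_rps_single_connection(0.55)))   # → 1818
```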

Yes, your math is correct.
IMO, the discrepancy comes from scheduling artifacts (i.e. the behavior of the operating system scheduler or the network loopback).
redis-cli --latency is implemented as a loop that sends a single PING command and then waits 10 ms before taking the next sample. Let's try an experiment and compare the results of redis-cli --latency with and without the 10 ms wait state.
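A rough Python sketch of such a sampling loop (a no-op callable stands in for the real PING round trip; the names and structure are mine, not redis-cli's actual C code):

```python
import time

def sample_latency(ping, samples=100, wait_ms=10):
    """Mimic redis-cli --latency: time each 'ping', then optionally sleep
    between samples the way redis-cli sleeps 10 ms after each PING."""
    total = 0.0
    mn, mx = float("inf"), 0.0
    for _ in range(samples):
        start = time.perf_counter()
        ping()                                   # stand-in for PING/+PONG
        elapsed_ms = (time.perf_counter() - start) * 1000
        mn, mx = min(mn, elapsed_ms), max(mx, elapsed_ms)
        total += elapsed_ms
        if wait_ms:
            time.sleep(wait_ms / 1000.0)         # the wait state discussed below
    return mn, mx, total / samples

# With wait_ms=0 the loop stays hot on the CPU and the measured average drops.
print(sample_latency(lambda: None, samples=10, wait_ms=0))
```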
To be accurate, we first make sure the client and server are always scheduled on deterministic CPU cores. Note: it is generally a good idea to do this for benchmarking purposes on NUMA boxes. Also, make sure the CPU frequency is locked to a given value (i.e. no power-mode management).
# Starting Redis
numactl -C 2 src/redis-server redis.conf
# Running benchmark
numactl -C 4 src/redis-benchmark -n 100000 -c 1 -q -P 1 -t PING
PING_INLINE: 26336.58 requests per second
PING_BULK: 27166.53 requests per second
Now let's look at the latency (with the 10 ms wait state):
numactl -C 4 src/redis-cli --latency
min: 0, max: 1, avg: 0.17761 (2376 samples)
It seems too high compared to the throughput result of redis-benchmark.
Then, we alter the source code of redis-cli.c to remove the wait state, and we recompile. The code has also been modified to display more accurate figures (but less frequently, because there is no wait state anymore).
Here is the diff against redis 3.0.5:
1123,1128c1123
<             avg = ((double) tot)/((double)count);
<         }
<         if ( count % 1024 == 0 ) {
<             printf("\x1b[0G\x1b[2Kmin: %lld, max: %lld, avg: %.5f (%lld samples)",
<                 min, max, avg, count);
<             fflush(stdout);
---
>             avg = (double) tot/count;
1129a1125,1127
>             printf("\x1b[0G\x1b[2Kmin: %lld, max: %lld, avg: %.2f (%lld samples)",
>                 min, max, avg, count);
>             fflush(stdout);
1135a1134
>         usleep(LATENCY_SAMPLE_RATE * 1000);
Note that this patch should not be used against a real system, since it makes the redis-cli --latency feature expensive and intrusive for the performance of the server. Its purpose is just to illustrate my point for the current discussion.
Here we go again:
numactl -C 4 src/redis-cli --latency
min: 0, max: 1, avg: 0.03605 (745280 samples)
Surprise! The average latency is now much lower. Furthermore, 1000/0.03605=27739.25, which is completely in line with the result of redis-benchmark.
Moral: the more frequently the client loop is scheduled by the OS, the lower the average latency. It is wise to trust redis-benchmark over redis-cli --latency if your Redis clients are active enough. And in any case, keep in mind that average latency does not mean much for the performance of a system (i.e. you should also look at the latency distribution, the high percentiles, etc.).
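To make the last point concrete, here is a small illustration (my own helper, not part of redis-cli) of how an average can hide exactly the stalls the high percentiles reveal:

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Mostly fast responses with a few scheduler-induced stalls (made-up data).
latencies_ms = [0.03] * 97 + [0.5, 2.0, 31.0]

print(sum(latencies_ms) / len(latencies_ms))  # the average looks harmless
print(percentile(latencies_ms, 50))           # the median looks even better
print(percentile(latencies_ms, 99))           # p99 exposes the stalls
```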

Related

Apache Intermittent Hang: is it Network Lag?

I have an intermittent lag on the web applications I am serving from Apache on a Debian box. Apache and MySQL check out, and I am far from fully utilizing the box's CPU/memory, yet there is still an intermittent lag. My theory is that there is a network rate limit that needs to be tweaked. Stats below.
Apache Server Status
Current Time: Tuesday, 02-Jun-2020 14:36:53 EDT
Restart Time: Monday, 01-Jun-2020 01:00:03 EDT
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 1 day 13 hours 36 minutes 50 seconds
Server load: 2.95 3.23 3.09
Total accesses: 1213060 - Total Traffic: 22.0 GB - Total Duration: 32311929295
CPU Usage: u396.94 s164.31 cu2065.15 cs789.27 - 2.52% CPU load
8.96 requests/sec - 170.5 kB/second - 19.0 kB/request - 26636.7 ms/request
296 requests currently being processed, 66 idle workers
WR.WWWW.KWW_W._W_KWWWWWWKWWWWW_WWWWK_WK_WWW_WW_RWWWWWKCWWWWWW._W
_WW_R_W_.__K_WWWW__WWWWWWKKWWWWWWKWWWW_W____WWWWWWWW_WWW_KWWWWWW
WWWWWWWW_.WWWWWK_WWW_WWKWWWWWWKWWKWK_WWWWWRKWWW.WW_KKWKWWWKW_WWW
WW.W_.K._WWWK_WW_K_K._WW..WWWWWWW_.W_WWWW_W_W.W_WWWW_.WWKWK_WKWW
_W_WWWW_W.WWWWWW.WWWW_K__..W.WW_WWWWWWWWKRW_WWW_C.W_KW_WWW_KW.._
..WWWWWWWCWWW.WWW_WKKWWWW_._WWW.....WWW.W_W.W._.KW...W...WWW.WWW
W..W..K..WW_.W._................W..._W.W.....K.W.K_...R..K...W.W
...W..W.............................................
top
top - 14:31:14 up 79 days, 21:39, 3 users, load average: 2.26, 2.57, 2.86
Tasks: 717 total, 1 running, 716 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3.3 us, 0.7 sy, 0.2 ni, 95.7 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
MiB Mem : 64365.1 total, 539.8 free, 8847.0 used, 54978.4 buff/cache
MiB Swap: 65477.0 total, 63810.0 free, 1667.0 used. 54580.5 avail Mem
ss -s
Total: 1934
TCP: 2362 (estab 1233, closed 1105, orphaned 2, timewait 1104)
Transport Total IP IPv6
RAW 0 0 0
UDP 0 0 0
TCP 1257 430 827
INET 1257 430 827
FRAG 0 0 0
ulimit -n
1024
ss -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n
1 Local
6 192.XXX.XXX.XXX
100 127.0.0.1
340 10.0.0.XX
866 [
ss -ntu | awk '{print $6}' | cut -d: -f1 | sort | uniq -c | sort -n
..........
This lists the number of connections per IP. Besides 127.0.0.1 and [ there are 2 IPs with over 50 connections.
74 104.xxx.xxx.xxx
91 12.xxx.xxx.xxx
MySQL
No processes running more than a second. Number of processes well within limits.
I do not know what stats would be relevant beyond these in diagnosing network rate limiting issues. Any pointers would be appreciated.
EDITED
CPU
lscpu https://pastebin.com/Jha6F7J8
Apache Config
apachectl -t -D DUMP_RUN_CFG https://pastebin.com/i1L2hnjH
Mysql
SHOW GLOBAL STATUS https://pastebin.com/aQX4D01k
SHOW GLOBAL VARIABLES https://pastebin.com/L8EfmHfn
SHOW FULL PROCESSLIST https://pastebin.com/GtqK2tET
mysqltuner https://pastebin.com/GLhhKA9q
Optional Very Helpful Information
top -bn1 https://pastebin.com/r94vpXe6
iostat -xm 5 3 https://pastebin.com/R8YLK3QU
ulimit -a https://pastebin.com/KUC3wqxU
Dorothy, your system is very busy with activity. Not knowing the frequency and duration of the intermittent hangs puts us at a disadvantage. One possible cause is com_drop_table, which had 3,318 uses in your 83 days of uptime. Another possible cause is the volume of data read and written: it appears innodb_data_written was 484TB in 83 days, yet MySQLTuner reports only 800K of data in 10 tables. Our General Log Analysis could likely identify the cause of this high activity. These suggestions are a starting effort; more analysis and changes should follow.
From your OS command prompt,
ulimit -n 96000 would enable many more open files (handles) above today's 1024 limit.
This is a dynamic operation in Linux and does not require an OS restart to take effect.
For this change to persist across OS stop/start, the following URL can be used as a guide.
Please use 96000, not 500000 as in their example documentation.
https://glassonionblog.wordpress.com/2013/01/27/increase-ulimit-and-file-descriptors-limit/
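For reference, the persistent version of that change usually goes in /etc/security/limits.conf (a sketch; exact paths and syntax can vary by distribution and PAM configuration):

```
# /etc/security/limits.conf — raise the open-file limit for all users
*    soft    nofile    96000
*    hard    nofile    96000
```

After editing the file, the new limits apply to new login sessions; running services may also need their own limits raised (e.g. via their init/systemd configuration).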
Rate Per Second = RPS
Suggestions to consider for your my.cnf [mysqld] section
innodb_io_capacity=1900 # from 200 if you have SSD, 900 if you have magnetic storage to improve IOPS
net_buffer_length=32K # from 16K to reduce malloc operations
innodb_lru_scan_depth=100 # from 1024 to conserve 90% of CPU cycles used for function
key_cache_segments=16 # from 0 to reduce mutex contention with MyISAM opens
key_cache_division_limit=50 # from 100 for Hot/Warm storage to reduce key_page_reads RPS of 18
aria_pagecache_division_limit=50 # from 100 for Hot/Warm storage to reduce aria_pagecache_reads RPS of 5K
read_rnd_buffer_size=64K # from 256K to reduce handler_read_rnd_next RPS of 27,707
These changes should reduce elapsed time to complete most queries.
Additional areas to consider include the use of Slow Query Log analysis to find where an index could avoid a table scan. MySQLTuner reported more than 4 million joins performed without indexes. Our FAQ page includes information on how you could find the tables needing indexes to avoid scans. Let us know how these suggestions work for you.
Skype Talk works very well if you have the flexibility to use that form of communication.

Why does the EVALSHA command come at such a performance cost when compared to native commands run on the redis-cli client?

Here are some tests and results I have run against the redis-benchmark tool.
C02YLCE2LVCF:Downloads xxxxxx$ redis-benchmark -p 7000 -q -r 1000000 -n 2000000 JSON.SET fooz . [9999]
JSON.SET fooz . [9999]: 93049.23 requests per second
C02YLCE2LVCF:Downloads xxxxxx$ redis-benchmark -p 7000 -q -r 1000000 -n 2000000 evalsha 8d2d42f1e3a5ce869b50a2b65a8bfaafe8eff57a 1 fooz [5555]
evalsha 8d2d42f1e3a5ce869b50a2b65a8bfaafe8eff57a 1 fooz [5555]: 61132.17 requests per second
C02YLCE2LVCF:Downloads xxxxxx$ redis-benchmark -p 7000 -q -r 1000000 -n 2000000 eval "return redis.call('JSON.SET', KEYS[1], '.', ARGV[1])" 1 fooz [5555]
eval return redis.call('JSON.SET', KEYS[1], '.', ARGV[1]) 1 fooz [5555]: 57423.41 requests per second
That is a significant drop in performance for something that is supposed to have a performance advantage, since the script runs server side versus the client sending each command from the client side.
From client to EVALSHA = 34% performance loss
From EVALSHA to EVAL = 6% performance loss
The results are similar for a non-JSON SET command:
C02YLCE2LVCF:Downloads xxxxxx$ redis-benchmark -p 7000 -q -r 1000000 -n 2000000 set fooz 3333
set fooz 3333: 116414.43 requests per second
C02YLCE2LVCF:Downloads xxxxxxx$ redis-benchmark -p 7000 -q -r 1000000 -n 2000000 evalsha e32aba8d03c97f4418a8593ed4166640651e18da 1 fooz [2222]
evalsha e32aba8d03c97f4418a8593ed4166640651e18da 1 fooz [2222]: 78520.67 requests per second
I first noticed this when I ran INFO commandstats and observed the poorer performance of the EVALSHA command:
# Commandstats
cmdstat_ping:calls=331,usec=189,usec_per_call=0.57
cmdstat_eval:calls=65,usec=4868,usec_per_call=74.89
cmdstat_del:calls=2,usec=21,usec_per_call=10.50
cmdstat_ttl:calls=78,usec=131,usec_per_call=1.68
cmdstat_psync:calls=51,usec=2515,usec_per_call=49.31
cmdstat_command:calls=5,usec=3976,usec_per_call=795.20
cmdstat_scan:calls=172,usec=1280,usec_per_call=7.44
cmdstat_replconf:calls=185947,usec=217446,usec_per_call=1.17
cmdstat_json.set:calls=1056,usec=26635,usec_per_call=25.22
cmdstat_evalsha:calls=1966,usec=68867,usec_per_call=35.03
cmdstat_expire:calls=1073,usec=1118,usec_per_call=1.04
cmdstat_flushall:calls=9,usec=694,usec_per_call=77.11
cmdstat_monitor:calls=1,usec=1,usec_per_call=1.00
cmdstat_get:calls=17,usec=21,usec_per_call=1.24
cmdstat_cluster:calls=102761,usec=23379827,usec_per_call=227.52
cmdstat_client:calls=100551,usec=122382,usec_per_call=1.22
cmdstat_json.del:calls=247,usec=2487,usec_per_call=10.07
cmdstat_script:calls=207,usec=10834,usec_per_call=52.34
cmdstat_info:calls=4532,usec=229808,usec_per_call=50.71
cmdstat_json.get:calls=1615,usec=11923,usec_per_call=7.38
cmdstat_type:calls=78,usec=115,usec_per_call=1.47
From JSON.SET to EVALSHA there is a ~30% performance reduction, which matches what I observed in the direct testing.
The question is, why? And, is this anything to be concerned with or is this observation within fair expectations?
For context, there are 2 reasons why I am using EVALSHA and not the direct JSON.SET command:
The IORedis client library doesn't have direct support for RedisJSON.
Because of that, I would have had to use send_command(), which sends the raw command to the server but doesn't work with pipelining when using TypeScript. So I would have had to send every other command separately and forgo pipelining.
I thought this was supposed to give better performance?
Update:
So in the end, based on the answer below, I refactored my code to use a single EVALSHA for the write, because the write involves 2 commands: a set and an expire. Again, I can't do this with RedisJSON alone, which is the reason why.
Here is the code for reference (shows EVALSHA with an EVAL fallback):
await this.client.evalsha(this.luaWriteCommand, '1', documentChange.id,
    JSON.stringify(documentChange), expirationSeconds)
  .catch((error) => {
    console.error(error);
    evalSHAFail = true;
  });
if (evalSHAFail) {
  console.error('EVALSHA for write not processed, using EVAL');
  await this.client.eval("return redis.pcall('JSON.SET', KEYS[1], '.', ARGV[1]), redis.pcall('expire', KEYS[1], ARGV[2]);", '1', documentChange.id, JSON.stringify(documentChange), expirationSeconds);
  console.log('SRANS FRUNDER');
  this.luaWriteCommand = undefined;
}
Why is the Lua script slower in your case?
Because EVALSHA needs to do more work than a single JSON.SET or SET command. When running EVALSHA, Redis needs to push the arguments onto the Lua stack, run the Lua script, and pop the return values off the Lua stack. That will always be slower than a direct C function call for JSON.SET or SET.
So when does a server-side script have a performance advantage?
First of all, you must run more than one command in the script; otherwise, there won't be any performance advantage, as mentioned above.
Secondly, a server-side script runs faster than sending several commands to Redis one by one, getting the results back from Redis, and doing the computation work on the client side, because the Lua script saves lots of round-trip time.
Thirdly, if you need to do really complex computation work in a Lua script, it might not be a good idea. Redis runs the script in a single thread, so if the script takes too much time, it will block other clients. On the client side, by contrast, you can take advantage of multiple cores to do the complex computation.
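The round-trip argument can be put into rough numbers with a toy cost model (the RTT, per-command, and script-overhead figures below are assumptions for illustration, not measurements):

```python
def naive_total_ms(n_commands, rtt_ms=0.05, cmd_ms=0.01):
    """Send commands one by one: every command pays a full round trip."""
    return n_commands * (rtt_ms + cmd_ms)

def scripted_total_ms(n_commands, rtt_ms=0.05, cmd_ms=0.01, script_overhead_ms=0.02):
    """One EVALSHA: a single round trip plus per-call Lua stack overhead."""
    return rtt_ms + script_overhead_ms + n_commands * cmd_ms

# For a single command the script is pure overhead...
print(naive_total_ms(1), scripted_total_ms(1))
# ...but for a batch of 10 commands, the saved round trips dominate.
print(naive_total_ms(10), scripted_total_ms(10))
```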

Redis benchmarking for HMSET, HGETALL with a data size

Can someone let me know how I can use redis-benchmark to benchmark HMSET and HGETALL with a fixed data size (the -d option in redis-benchmark)? I am using Redis 3.2.5.
I have gone through this answer and tried the command below:
root#cache-server1:~# redis-benchmark -h a.b.c.d -p XXXX hmset hgetall myhash rand_int rand_string -d 2048
====== hmset hgetall myhash rand_int rand_string -d 2048 ======
10000 requests completed in 0.11 seconds
50 parallel clients
3 bytes payload
keep alive: 1
99.64% <= 1 milliseconds
100.00% <= 1 milliseconds
89285.71 requests per second
But looking at the output, it seems it is using only a 3-byte payload.
If it is not possible via redis-benchmark, can someone suggest an alternative?
The payload is only 3 bytes (the default) because the -d is taken as part of the command: a custom command must be the last argument, and all switches must precede it.
Besides that, you can't use redis-benchmark to run two custom commands. Also, the -d option only applies to the predefined tests (the ones that run by default or with the -t option) and has no meaning when the user specifies the command used in the benchmark.
If you have a specific benchmarking flow that you want to test, the best thing you can do is mock it with any client that you're comfortable with.
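A minimal harness for that kind of custom benchmark could look like this (the `op` callable is a stand-in for whatever client call you want to time; with a real Redis client it would be, e.g., an HMSET/HGETALL pair):

```python
import time

def bench(op, n=1000):
    """Time n invocations of op and report throughput and average latency."""
    start = time.perf_counter()
    for _ in range(n):
        op()
    elapsed = time.perf_counter() - start
    return {"requests_per_sec": n / elapsed,
            "avg_latency_ms": elapsed / n * 1000}

# Stand-in no-op; replace with e.g. a real client.hmset(...)/client.hgetall(...)
# call pair to benchmark your actual flow with your chosen payload size.
stats = bench(lambda: None, n=10000)
print(stats)
```

Note this times client-side wall-clock per operation, so it naturally includes round-trip latency, just like redis-benchmark with a single connection.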

Benchmark Redis under Twemproxy with redis-benchmark

I am trying to test a very simple setup with Redis and Twemproxy, but I can't find a way to make it faster.
I have 2 redis servers that I run with bare minimum configuration:
./redis-server --port 6370
./redis-server --port 6371
Both are compiled from source and run on one machine with all the appropriate memory and CPUs.
If I run a redis-benchmark in one of the instances I get the following:
./redis-benchmark --csv -q -p 6371 -t set,get,incr,lpush,lpop,sadd,spop -r 100000000
"SET","161290.33"
"GET","176366.86"
"INCR","170940.17"
"LPUSH","178571.42"
"LPOP","168350.17"
"SADD","176991.16"
"SPOP","168918.92"
Now I would like to use Twemproxy in front of the two instances to distribute the requests and get a higher throughput (at least this is what I expected!).
I used the following configuration for Twemproxy:
my_cluster:
  listen: 127.0.0.1:6379
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: false
  redis: true
  servers:
   - 127.0.0.1:6371:1 server1
   - 127.0.0.1:6372:1 server2
And I run nutcracker as:
./nutcracker -c twemproxy_redis.yml -i 5
The results are very disappointing:
./redis-benchmark -r 1000000 --csv -q -p 6379 -t set,get,incr,lpush,lpop,sadd,spop
"SET","112485.94"
"GET","113895.21"
"INCR","110987.79"
"LPUSH","145560.41"
"LPOP","149700.61"
"SADD","122100.12"
I tried to understand what is going on by getting Twemproxy's statistics, like this:
telnet 127.0.0.1 22222
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
{
  "service": "nutcracker",
  "source": "localhost.localdomain",
  "version": "0.4.1",
  "uptime": 10,
  "timestamp": 1452545028,
  "total_connections": 303,
  "curr_connections": 3,
  "my_cluster": {
    "client_eof": 300,
    "client_err": 0,
    "client_connections": 0,
    "server_ejects": 0,
    "forward_error": 0,
    "fragments": 0,
    "server1": {
      "server_eof": 0,
      "server_err": 0,
      "server_timedout": 0,
      "server_connections": 1,
      "server_ejected_at": 0,
      "requests": 246791,
      "request_bytes": 11169484,
      "responses": 246791,
      "response_bytes": 1104215,
      "in_queue": 0,
      "in_queue_bytes": 0,
      "out_queue": 0,
      "out_queue_bytes": 0
    },
    "server2": {
      "server_eof": 0,
      "server_err": 0,
      "server_timedout": 0,
      "server_connections": 1,
      "server_ejected_at": 0,
      "requests": 353209,
      "request_bytes": 12430516,
      "responses": 353209,
      "response_bytes": 2422648,
      "in_queue": 0,
      "in_queue_bytes": 0,
      "out_queue": 0,
      "out_queue_bytes": 0
    }
  }
}
Connection closed by foreign host.
Is there any other benchmark around that works properly? Or should redis-benchmark have worked?
I forgot to mention that I am using Redis 3.0.6 and Twemproxy 0.4.1.
It might seem counter-intuitive, but putting two instances of redis with a proxy in front of them will certainly reduce performance!
In a single instance scenario, redis-benchmark connects directly to the redis server, and thus has minimal latency per request.
Once you put two instances behind a single twemproxy, think about what happens: you connect to twemproxy, which analyzes the request, chooses the right instance, and forwards the request to it.
So, first of all, each request now has two network hops to travel instead of one, and the added latency means lower throughput, of course.
Also, you are using just one twemproxy instance. Assuming twemproxy itself performs more or less like a single Redis instance, you can never beat a single instance with a single proxy in front of it.
Twemproxy facilitates scaling out, not scaling up. It allows you to grow your cluster to sizes that a single instance could never reach. But there's a latency price to pay, and as long as you're using a single proxy, there's a throughput price as well.
The proxy imposes a small tax on each request. Measure throughput using the proxy with one server: impose load until the throughput stops growing and response times slow to a crawl. Then add another server and note that response times return to normal while capacity doubles. Of course, you'll want to add servers well before response times start to crawl.
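That trade-off can be sketched with a toy latency model (the 0.3 ms base latency and 0.1 ms proxy-hop figures are assumptions for illustration):

```python
def rps_per_connection(base_latency_ms, proxy_hop_ms=0.0):
    """Closed-loop model: a single synchronous connection issues requests
    back-to-back, and each request pays the server latency plus, when a
    proxy is in the path, one extra hop of latency."""
    return 1000.0 / (base_latency_ms + proxy_hop_ms)

direct = rps_per_connection(0.3)                     # ~3333 req/s per connection
proxied = rps_per_connection(0.3, proxy_hop_ms=0.1)  # ~2500 req/s per connection
# Scaling out adds capacity, not per-connection speed: with N backends the
# cluster's aggregate ceiling grows roughly N-fold, while every individual
# request still pays the proxy hop.
print(direct, proxied)
```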

How to collect hardware event counts with perf stat on KVM host/guest?

On a 6.4 host (2.6.32-358) on Sandy Bridge, I am trying to collect hardware event counts for guest activity with perf stat. Although virt-top reports healthy activity in the guests, I get the following with the ":G" modifier:
# perf stat -e cycles:G sleep 10
Performance counter stats for 'sleep 10':
0 cycles:G # 0.000 GHz
I tried collecting inside the guest, but get the following:
# perf stat -e cycles -A -a sleep 1
Performance counter stats for 'sleep 1':
CPU0 < not supported> cycles
I see that there is perf kvm, but it only has top/record/report and seems intended for profiling an application via sampling, not for collecting hardware counts.
On the host, how do I get perf stat to count guest activity? And on the guest, what is needed to expose hardware event counting to perf stat?