How to troubleshoot a periodic CPU jump in Redis

I use AWS ElastiCache Redis for our production environment. Every 30 minutes, aligned to the round hour, I see CPU jump from an average of 2-3% to 20%.
The pattern is constant, which tells me it comes from a scheduled job.
From CloudWatch I suspect it is related to KEY (and maybe SET) commands: their latency is the only one that jumps at exactly the same time as the CPU.
I would like to find out which KEY (and maybe SET) commands run at a specific time, or find some other way to investigate this.
Thanks for any advice.

With redis-cli monitor I was able to capture most of the commands running on the server as a stream and find the ones causing the high usage.
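To line the captured stream up with the CPU spikes, the MONITOR output can be bucketed by minute. The following is only a sketch: it assumes the output was saved to a file (for example with `redis-cli monitor > monitor.log`, a hypothetical file name) and that each line has the usual MONITOR shape `<unix time> [<db> <client>] "<command>" ...`. The function name is illustrative.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Shape of a MONITOR line: 1339518083.107412 [0 127.0.0.1:60866] "keys" "*"
MONITOR_LINE = re.compile(r'^(\d+\.\d+) \[(\d+) (\S+)\] "([^"]+)"')


def count_commands_per_minute(lines):
    """Return {"HH:MM" (UTC): Counter of upper-cased command names}."""
    buckets = defaultdict(Counter)
    for line in lines:
        match = MONITOR_LINE.match(line.strip())
        if match is None:
            continue  # skips non-command lines such as the initial "OK"
        timestamp = float(match.group(1))
        minute = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%H:%M")
        buckets[minute][match.group(4).upper()] += 1
    return buckets


if __name__ == "__main__":
    # Hypothetical log file produced by redirecting `redis-cli monitor`.
    with open("monitor.log") as handle:
        for minute, counts in sorted(count_commands_per_minute(handle).items()):
            print(minute, counts.most_common(5))
```

Comparing the busiest minutes against the CloudWatch graph should show which commands arrive on the 30-minute boundary. Keep the capture window short, since MONITOR itself adds load to the server.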

Related

How to log redis pub/sub time?

I am trying to use Redis for pub/sub between two systems, one of which is not ours (another company maintains it). I would like to have a timestamp for when I publish something to a Redis channel. Can someone help with this idea?
I already use the Redis log at debug level (the most verbose one), but there is no time information for pub/sub messages in the log.
I tested Redis MONITOR with redis-cli monitor. It is exactly what I want, but it decreases the performance of the system by 50%.
Is the only way to implement the time log myself, maybe by using SET to store some time information before the publish command? That would store the local time in Redis, slightly before the publish.
You cannot achieve this with pub/sub alone.
You might want to try Redis Streams instead. For each stream entry, the first part of the automatically generated ID is the Unix timestamp, in milliseconds, at which the ID was generated, i.e. when Redis received the message.
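As an illustration, an automatically generated stream ID has the form `<milliseconds>-<sequence>`, so the receive time can be recovered from the ID alone. This is a minimal sketch; the function name is made up for the example and the ID below is just a sample value.

```python
from datetime import datetime, timedelta, timezone


def stream_id_to_datetime(entry_id):
    """Convert an auto-generated stream ID like '1518951480106-0' to a UTC datetime."""
    ms_part, _, _sequence = entry_id.partition("-")
    milliseconds = int(ms_part)
    # Integer arithmetic avoids float rounding on the millisecond part.
    seconds, remainder_ms = divmod(milliseconds, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=remainder_ms)


if __name__ == "__main__":
    print(stream_id_to_datetime("1518951480106-0").isoformat())
```

This only works for IDs that Redis generated itself; IDs supplied explicitly by the producer carry whatever value the producer chose.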

Efficient way to take hot snapshots from redis in production?

We have a Redis cluster which holds more than 2 million keys, and these keys are updated at an interval of 1 minute. Now we have a requirement to take a snapshot of the Redis DB at a fixed interval, e.g. every 10 minutes. This snapshot should not pause Redis command execution.
Is there any async way of taking a snapshot from Redis?
It would be really helpful to get suggestions on open source tools or frameworks.
The Redis BGSAVE command is asynchronous and takes a snapshot.
It calls the fork() function of the OS. According to the Redis manual:
Fork() can be time consuming if the dataset is big, and may result in Redis to stop serving clients for some millisecond or even for one second if the dataset is very big and the CPU performance not great
Two million updates in one minute is 30K+ QPS (2,000,000 / 60 ≈ 33,000).
So you really have to try it out: run a benchmark that simulates your workload, then issue BGSAVE, monitor the I/O and CPU usage of your system, and see if there is a spike in your Redis call latency.
Then issue LASTSAVE, which tells you when your last successful snapshot happened, so you can adjust your backup schedule.
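The BGSAVE then LASTSAVE check can be scripted roughly as follows. This is a sketch, not a tested backup tool: it assumes a client object exposing `bgsave()` and `lastsave()` methods, as redis-py's `redis.Redis` does, and the function name and default values are illustrative.

```python
import time


def snapshot_and_wait(client, timeout=60.0, poll_interval=0.5):
    """Trigger BGSAVE and wait until LASTSAVE reports a new value.

    Returns the elapsed seconds, or None if LASTSAVE did not change
    within `timeout`. LASTSAVE has one-second resolution, so two saves
    that finish within the same second cannot be told apart.
    """
    before = client.lastsave()
    started = time.monotonic()
    client.bgsave()
    while time.monotonic() - started < timeout:
        if client.lastsave() != before:
            return time.monotonic() - started
        time.sleep(poll_interval)
    return None
```

With redis-py this could be called as `snapshot_and_wait(redis.Redis(host="localhost"))` while the benchmark is running, and the returned duration compared against the latency graph. On a cluster, each node takes its own snapshot, so the call would have to be repeated per node.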

Redis running out of memory causes slow queries, but they do not appear in the slow log

Sometimes a query takes seconds to get a key from Redis.
Redis INFO shows that used_memory is 2 times larger than used_memory_rss, and the OS starts to use swap.
After cleaning out the useless data, used_memory is lower than used_memory_rss and everything works fine.
What confuses me is this: if any query took around 10 seconds and blocked other queries to Redis, it would cause serious problems for other parts of the app, yet the app seems fine.
I also cannot find any of these long-running queries in the slow log, so I checked the Redis SLOWLOG documentation, which says:
The execution time does not include I/O operations like talking with the client, sending the reply and so forth, but just the time needed to actually execute the command (this is the only stage of command execution where the thread is blocked and can not serve other requests in the meantime)
So does this mean the execution of the query itself is normal and is not blocking any other queries? What happens to a query when memory runs short and it takes this long? Which part of the query takes so long, given that the "actually execute the command" time is not long enough to get into the slow log?
Thanks!
When memory runs short, Redis will definitely slow down because the OS starts swapping. You can use INFO to report how much memory Redis is using. You can also cap memory usage with the maxmemory option in the config file. If this limit is reached, Redis will start to reply with an error to write commands (but will continue to accept read-only commands), unless it is configured to evict keys instead.
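A rough way to watch for the swapping condition described in the question is to compare `used_memory_rss` with `used_memory` from the INFO output. The sketch below parses the raw text returned by `redis-cli info memory`; the function names are illustrative, and any alert threshold would be your own choice rather than something Redis defines.

```python
def parse_info(text):
    """Parse the 'key:value' lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers like "# Memory"
        key, separator, value = line.partition(":")
        if separator:
            fields[key] = value
    return fields


def rss_to_used_ratio(fields):
    """Ratio of resident memory to memory allocated by Redis.

    A value well below 1 suggests part of the dataset has been swapped out.
    """
    return int(fields["used_memory_rss"]) / int(fields["used_memory"])
```

In the situation from the question, where used_memory is twice used_memory_rss, this ratio would be about 0.5.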

Hitting redis server with redis hash using JMeter (using redis-dataset plugin)

I have a Redis server running and I want to use JMeter to benchmark it and find out how long it takes to reach 20K transactions per second. I have a hash set up. How should I go about querying it? I have put one of the keys as the Redis key and one of the fields of the hash as the variable name.
If I use the Constant Throughput Timer, what should I enter in the name field?
Thanks in advance.
If you are planning to use the Constant Throughput Timer and your target is a load of 20k requests per second, you need to configure it as follows:
Target Throughput: 1200000 (20k per second * 60 seconds in minute)
Calculate Throughput based on: all active threads
See the article "How to use JMeter's Throughput Constant Timer" for more details.
A few more recommendations:
The Constant Throughput Timer can only pause threads, so make sure you have enough virtual users at the Thread Group level.
The Constant Throughput Timer is only accurate at the "minute" level, so make sure your test lasts long enough for the timer to be applied correctly. Also consider a reasonable ramp-up period.
Some people find the Throughput Shaping Timer easier to use.
20k+ concurrent threads is normally not something you can achieve on a single machine, so you will likely need to consider Distributed Testing, where multiple JMeter instances act as a cluster.
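The numbers above can be checked with a few lines of arithmetic. This sketch converts a per-second target into the per-minute value the timer expects, and uses Little's law (concurrency = throughput × response time) to estimate the minimum number of threads. The 5 ms response time is a hypothetical figure; measure your own.

```python
import math


def throughput_per_minute(requests_per_second):
    """Constant Throughput Timer targets are expressed in samples per minute."""
    return requests_per_second * 60


def min_threads(requests_per_second, avg_response_ms):
    """Lower bound on threads needed to sustain the rate (Little's law)."""
    return math.ceil(requests_per_second * avg_response_ms / 1000)


if __name__ == "__main__":
    print(throughput_per_minute(20_000))  # value for "Target Throughput"
    print(min_threads(20_000, 5))         # assumes a hypothetical 5 ms response time
```

If responses are fast, far fewer than 20k threads may be enough to reach 20k requests per second; the estimate is a floor, so leave headroom above it.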

Keep time in sync among servers without internet connection

I have 5 servers on a LAN without Internet connection. I need them to keep the clock in sync among them.
I could configure them as NTP peers, and set a high stratum for the local clock of one of them. In this way, the other four would sync with that clock.
What I actually want is for them to agree on a time using all 5 local clocks (i.e. doing some kind of average), for reasons of robustness and precision. Is this possible with NTP?
PS: I do not want to use an external clock source.
EDIT: and no scripting outside NTP features, that could only make precision worse :)
If you average 5 drifting clocks, the only thing you get is another drifting clock that's harder to correct. It won't be more precise. NTP uses multiple servers to increase precision because it takes network latency into account. Since all your systems are on a fast local network, you just need one server.
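A small numeric sketch of that point: if each clock drifts at a constant rate, the average of the clocks drifts at the average of those rates, so it is still a drifting clock. The drift values below are made up for illustration and the function names are hypothetical.

```python
def error_after(drift_ppm, seconds):
    """Clock error in seconds after `seconds` of running at a constant drift (ppm)."""
    return drift_ppm * 1e-6 * seconds


def averaged_error_after(drifts_ppm, seconds):
    """Error of a clock defined as the average of several drifting clocks."""
    errors = [error_after(drift, seconds) for drift in drifts_ppm]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    hypothetical_drifts = [10, -5, 20, 15, 0]  # parts per million, invented values
    one_day = 86_400
    print(averaged_error_after(hypothetical_drifts, one_day))
```

The averaged clock only does better than a single clock to the extent that the individual drifts happen to cancel, and nothing in an isolated LAN guarantees that they will.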
Set up two systems as NTP servers, one as the primary and, if you feel the need, one as a backup. Have all other systems synchronize to them. This will be significantly easier to set up than the clock-averaging solution, and you won't have to develop any crazy scripts.
You might be able to have one of them listen for the times from each computer, compute an average, set the average as its own time, and broadcast that time to all the other computers. It seems a little excessive, though.
You can set up one of them as an NTP server that broadcasts its time on the local network, and the others as slaves that listen on the local network.
Edit:
I missed the averaging part. In that case, you can probably write a script on the local server to collect the times from all the slaves, compute the average, and update its own time with that value.
You may even want to get rid of NTP in that case and just use the script to update the time on all the servers.
I wish I could give a definitive proposal, but I don't know enough about your environment. No matter what you'll likely be doing some sort of script kung fu.
If it's Unix/Linux, I would set everyone up with SSH authorized keys to poll each other's date +%s command (to get the epoch time), average those times with awk or something, and then set the machine's own local date.
Or perhaps it would be more secure (and reliable) to have one authoritative machine check everyone's time in the same manner, average it, and then set itself and every other host to that average.
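The averaging step of that idea can be sketched as below. It assumes the epoch readings have already been collected (for instance via `date +%s` over SSH, which is not shown), and it ignores the time spent polling each host, which by itself adds error. Host names and function names are hypothetical.

```python
def average_epoch(readings):
    """Average a {host: epoch_seconds} mapping of clock readings."""
    return sum(readings.values()) / len(readings)


def corrections(readings):
    """Seconds each host would need to add to its clock to match the average."""
    target = average_epoch(readings)
    return {host: target - epoch for host, epoch in readings.items()}


if __name__ == "__main__":
    # Invented readings from three hypothetical hosts.
    sample = {"host-a": 1000, "host-b": 1002, "host-c": 1004}
    print(average_epoch(sample))
    print(corrections(sample))
```

A positive correction means the host is behind the average; a negative one means its clock would have to be stepped backwards.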
On Windows you'll probably be looking into VBScript and WMI.
EDIT:
You may run into some weird problems if anyone's clock drifts forward from the average and my guess is about half of them will ;). Future timestamps can be rather strange. It will be up to you to determine how frequently this synchronization will need to occur.