I have a single Redis pod running in my k8s cluster, and I would like to get an idea of how many requests per second my Redis server is currently handling in my production environment. I have tried redis-cli monitor, which prints live requests to the console, but I cannot seem to find a way to get a numerical measure that simply tells me something like "the Redis server is handling x requests per second on average over the past 24 hours". Any pointers would be highly appreciated.
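To make the goal concrete, the closest numeric counters I have found so far are in the INFO stats section; a rough sketch of how I imagine deriving a rate from them (field names are the standard ones, the averaging window would be up to me):

redis-cli INFO stats | grep instantaneous_ops_per_sec    # point-in-time rate
redis-cli INFO stats | grep total_commands_processed     # sample twice, divide the delta by the elapsed seconds

but I am not sure whether that is the right way to get a 24-hour average.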
Background of the problem
Hello all, I have made a project in Golang with Gin, and I have integrated the Redis ClusterClient in it using "github.com/go-redis/redis/v7".
P.S. The Redis that I am using is a Redis cluster hosted on AWS.
The only Redis commands I am using are redis.Get and redis.Set.
Now I have made one API endpoint and used caching in it, and when I run it locally, response times are around 200 to 300 ms, which is awesome (thanks to Redis).
Main Problem
Now when I start load testing the same API with around 100 concurrent users, the response time increases significantly (to around 4 seconds). I used spans to monitor the time taken by different parts of the code, and I got this:
"Getting from primary" and "getting from secondary" are for the redis.Get command.
"Setting the primary" and "setting the secondary" are for redis.Set.
Both commands are taking around 1 second to execute, which is unacceptable. Can anyone please suggest a way to tackle this problem and reduce the time the Redis commands take to execute?
OK, so I have managed to solve this.
Firstly, I updated my Golang Redis client library from go-redis/v7 to go-redis/v8, and it made a significant improvement. I would advise everyone to do the same.
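For anyone making the same move, here is a minimal sketch of the v8 cluster client with the same Get/Set pattern; the endpoint, timeouts, pool size, and key below are placeholders, not my real values. The main API change from v7 is that every command now takes a context.Context:

package main

import (
    "context"
    "time"

    "github.com/go-redis/redis/v8"
)

func main() {
    ctx := context.Background()

    // Placeholder address; with an AWS cluster-mode Redis this would be
    // the cluster's configuration endpoint.
    rdb := redis.NewClusterClient(&redis.ClusterOptions{
        Addrs:        []string{"my-redis.example.com:6379"},
        ReadTimeout:  200 * time.Millisecond,
        WriteTimeout: 200 * time.Millisecond,
        PoolSize:     50, // per node; worth tuning for 100+ concurrent users
    })

    // v8 signatures: Set(ctx, key, value, ttl) and Get(ctx, key).
    if err := rdb.Set(ctx, "cache:some-key", "some-value", 10*time.Minute).Err(); err != nil {
        panic(err)
    }
    val, err := rdb.Get(ctx, "cache:some-key").Result()
    if err != nil {
        panic(err)
    }
    _ = val
}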
But I was still suffering from high response times, so the next step for me was to change the Redis infrastructure. Earlier I was using a Redis cluster which had only 1 shard, but now I have moved to another Redis cluster with 4 shards.
And it made a huge difference: my response time went from 1200 ms to 180 ms. Kindly note that these response times were measured while load testing with 100 concurrent users at an average of about 130 rps.
So in short: upgrade your Redis client and upgrade your Redis infrastructure.
I use AWS ElastiCache Redis for our prod. Every 30 minutes, on the hour and half hour, I see CPU jump from an average of 2-3% to 20%.
This is constant, which tells me it comes from a scheduled job.
From CloudWatch I have a suspicion it is related to KEY (and maybe SET) commands, since their latency is the only one that jumps at exactly the same time as the CPU does.
I would like to understand which KEY (and maybe SET) commands run at that specific time, or find some other way to help me investigate this.
Thanks for any advice.
With redis-cli monitor I was able to see most of the commands running on the server as a stream and track down the high usage.
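For anyone doing the same, a rough sketch of how I sampled it (the grep pattern and sample size are just examples; MONITOR adds real overhead on a busy instance, so keep the capture short):

redis-cli monitor | head -n 10000 > monitor-sample.txt
grep -ciE '"(keys|set)"' monitor-sample.txt

Each MONITOR line includes a timestamp and the client address, so the sample can also be lined up against the CloudWatch CPU spike.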
I'm working on an alerting solution that uses Logstash to stream AWS CloudFront logs from an S3 bucket into Graphite after doing some minor processing.
Since multiple events with the same timestamp can occur (multiple events within a second), I elected to use Carbon Aggregator to count these events per second.
The problem I'm facing is that the aggregated whisper database seems to be dropping data. The normal whisper file sees all of it, but of course it cannot account for more than 1 event per second.
I'm running this setup in docker on an EC2 instance, which isn't hitting any sort of limit (CPU, Mem, Network, Disk).
I've checked every log I could find in the Docker containers and checked docker logs; however, nothing jumps out.
I've set the Logstash output to display the lines on stdout (not missing any) and to send them to Graphite on port 2023, which is configured as the line receiver for Carbon Aggregator:
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
aggregation-rules.conf is set to a very simple count per second:
test.<user>.total1s (1) = count test.<user>.total
storage-schemas.conf:
[default]
pattern = .*
retentions = 1s:24h
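For context, the lines Logstash sends to port 2023 are plain Carbon text protocol, one metric per line (names and timestamp here are made up):

test.alice.total 1 1487936220
test.alice.total 1 1487936220

i.e. several events can arrive with the exact same path and timestamp, which is what I was hoping the 1-second count rule would add up.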
Happy to share more of my configuration as you request it.
I've hit a brick wall with this; I've been trying so many different things, but I'm not able to see all the data in the aggregated whisper DB.
Any help is very much appreciated.
Carbon aggregator isn't designed to do what you are trying to do. For that use-case you'd want to use statsd to count the events per second.
https://github.com/etsy/statsd/blob/master/docs/metric_types.md#counting
Carbon aggregator is meant to aggregate across different series; for each point that it sees on the input, it quantizes the point to a timestamp before any aggregation happens, so you are still only going to get a single value per second with the aggregator. statsd will take any number of counter increments and total them up each interval.
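For illustration, a statsd counter increment is just one line per event sent over UDP (the metric name mirrors the aggregation rule above; 8125 is the default statsd port, and the flush interval is configurable):

test.alice.total:1|c

statsd sums every increment it receives during a flush interval and writes a single total per interval to Graphite, which is exactly the per-second count the aggregator cannot give you.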
When running Django/Celery/RabbitMQ on a production server, some tasks are sent and consumed correctly. However, RabbitMQ starts using up all the CPU after processing is done. I believe this is related to the following report:
RabbitMQ on EC2 Consuming Tons of CPU
In that thread, it is suggested to set these config values:
CELERY_IGNORE_RESULT
CELERY_AMQP_TASK_RESULT_EXPIRES
I forked and customized the celery-haystack package to set both of those values when calling apply_async(); however, it seems to have had no effect.
I think Celery is automatically creating a large number of UUID-named queues (one per task) to store results, but I don't seem to be able to stop it.
Any ideas?
I just spent a day digging into this problem myself. I think the two options you mentioned can be explained like this:
CELERY_IGNORE_RESULT: if True, the results of tasks will be ignored, hence they won't return anything when you call them with delay or apply_async.
CELERY_AMQP_TASK_RESULT_EXPIRES: the expiration time for a result stored in the result backend. You can set this option to a reasonable value so RabbitMQ can delete expired results.
The many queues generated are for storing results only, so in case you don't want to store any results, you can remove the CELERY_RESULT_BACKEND option from your config file.
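A minimal sketch of what that could look like in the Django/Celery settings module (the expiry value is just an example, and these are the old-style setting names used in the question):

CELERY_IGNORE_RESULT = True
CELERY_AMQP_TASK_RESULT_EXPIRES = 3600  # seconds; only relevant if you do keep results

With CELERY_IGNORE_RESULT set to True, the per-task result queues should not be created at all.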
Have a nice day!
Our Pipeline:
VMware-Netflow -> Logstash -> Redis -> Logstash-indexer -> 3xElastic
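(For reference, the indexer side reads from Redis with the stock list-based input; the host below is illustrative, not my exact config:)

input {
  redis {
    host      => "redis.example.internal"
    data_type => "list"
    key       => "netflow"
  }
}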
Data I have gathered:
I noticed in Kibana that the flows coming in were 1 hour old, then 2, then 3, and so on.
Running 'redis-cli llen netflow' shows a very large number that is slowly increasing.
Running 'redis-cli INFO' shows pretty constant input at 80 kbps and output at 1 kbps. I would think these should be nearly equal.
The CPU load on all nodes is pretty negligible.
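For reference, the throughput figures above come from the INFO stats counters, something like:

redis-cli llen netflow
redis-cli info stats | grep -E 'instantaneous_(input|output)_kbps'

A steadily growing list length together with input far above output points at the consumer side rather than the shipper.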
What I've tried:
I ensured that the logstash-indexer was sending to all 3 elastic nodes.
I launched many additional Logstash instances on the indexers; Redis now shows 40 clients.
I am not sure what else to try.
TL;DR: rebooted all three Elasticsearch nodes, and life is good again.
I inadvertently disabled Elasticsearch as an output and sent my netflows into the ether. The queue size in Redis dropped to 0 within minutes. Although sad, this did prove that the problem was Elasticsearch, not Logstash or Redis.
I watched the Elastic instances, and it seemed like something was wrong with the communication between them. All three showed logs indicating that 2 of the 3 were dropping out of the cluster and taking forever to respond to cluster pings. What I think was happening is that writes were accepted by Elastic and just bounced around for a while before being written successfully.
Upon rebooting them all, they negotiated correctly, and writes are happening as they should.