EKS Cluster high CPU usage datadog monitor

I would like to create a Datadog monitor which alerts if multiple hosts of an EKS cluster have high CPU usage. I'm calculating the CPU usage with the following formula:
sum:kubernetes_state.container.cpu_requested{cluster_name:xxx} by {node} / sum:kubernetes_state.node.cpu_allocatable{cluster_name:xxx} by {node} * 100
(screenshot: my Datadog monitor)
I found an article which shows how to create cluster alerts, but it's not for EKS:
https://docs.datadoghq.com/monitors/guide/create-cluster-alert/
Can I combine these together? Does anyone have experience with these monitors?
Thanks in advance!
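For context, here is a rough sketch of how I imagine creating the per-node monitor through the Datadog Python client; the aggregation window, the 80% threshold and the monitor options are placeholders I chose, not something from the linked guide. What I'm missing is how to make it trigger only when several nodes breach at once, as in the cluster-alert guide.

# Hypothetical sketch: multi-alert monitor on the per-node CPU-request ratio.
# The 5-minute window and 80% threshold are placeholder assumptions.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

query = (
    "avg(last_5m):"
    "sum:kubernetes_state.container.cpu_requested{cluster_name:xxx} by {node} / "
    "sum:kubernetes_state.node.cpu_allocatable{cluster_name:xxx} by {node} * 100 > 80"
)

api.Monitor.create(
    type="query alert",
    query=query,
    name="EKS node CPU requests above 80% of allocatable",
    message="CPU requested on {{node.name}} is above 80% of allocatable.",
    options={"thresholds": {"critical": 80}, "notify_no_data": False},
)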

Related

CoTURN Usage Statistics

I am still a bit new to the WebRTC world and trying to find my way through. I have successfully set up CoTURN and have been able to route calls from behind a firewall using it. Now I am wondering whether it is possible to somehow inspect, and possibly visualize, CoTURN's usage statistics. I would love to know how many users are utilizing the server at any given time, what the bandwidth and CPU usage is, etc. I saw details on how to optimize bandwidth and CPU usage in the official docs, but I haven't found any info on actually monitoring the usage. Any help would be highly appreciated.
If you want to monitor standard usage statistics like CPU usage, load, bandwidth, etc., you can focus on what's available for your infrastructure. For example, on AWS you could use CloudWatch, or on a generic Linux deployment export the usage stats with Prometheus and present them with Grafana.
For the coturn/TURN-specific statistics, coturn can store some metrics in Redis; this is described in https://github.com/coturn/coturn/blob/master/turndb/schema.stats.redis
Total traffic information is also reported when the allocation is deleted. The keys are
"turn/user/<username>/allocation/<id>/total_traffic" or "turn/user/<username>/allocation/<id>/total_traffic/peer".
Applications interested in the total amount of traffic per allocation can subscribe to these events as:
psubscribe turn/realm/*/user/*/allocation/*/total_traffic
psubscribe turn/realm/*/user/*/allocation/*/total_traffic/peer
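If you want to consume those events from an application rather than redis-cli, a minimal sketch with the redis-py client could look like this (host, port and database number are placeholders for wherever your coturn statistics Redis lives):

import redis

# Placeholders: point this at the Redis instance coturn reports statistics to.
r = redis.Redis(host="localhost", port=6379, db=0)

p = r.pubsub()
# Subscribe to the total-traffic events described above.
p.psubscribe(
    "turn/realm/*/user/*/allocation/*/total_traffic",
    "turn/realm/*/user/*/allocation/*/total_traffic/peer",
)

for message in p.listen():
    if message["type"] == "pmessage":
        print(message["channel"], message["data"])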

RabbitMQ poor performance

We are seeing poor performance in our RabbitMQ clusters, even when they are idle.
After installing the rabbitmq-top plugin, we see many processes with very high reductions/sec: 100k and more!
Questions:
What does it mean?
How to control it?
What might be causing such slowness without any errors?
Notes:
Our clusters are running on Kubernetes 1.15.11
We allocated 3 nodes, each with 8 CPU and 8 GB limits. Set vm_watermark to 7G. Actual usage is ~1.5 CPU and 1 GB RAM
RabbitMQ 3.8.2. Erlang 22.1
We don't have many consumers or producers, and the slowness also occurs in a fairly idle environment.
rabbitmqctl status is very slow to return details (sometimes 2 minutes) but does not show any errors.
After some more investigation, we found that the actual cause was a combination of two issues.
The RabbitMQ (Erlang) runtime configuration by default (when using the Bitnami Helm chart) assigns only a single scheduler. This is fine for a simple app with a few concurrent connections, but production-grade deployments with thousands of connections have to use many more schedulers. Bumping up from 1 to 8 schedulers improved throughput dramatically (a configuration sketch follows below).
Our monitoring was hammering RabbitMQ with a lot of requests per second (about 100/sec). The monitoring hits the aliveness-test endpoint, which creates a connection, declares a queue (not mirrored), publishes a message and then consumes that message. Disabling the monitoring reduced load dramatically: an 80%-90% drop in CPU usage, and reductions/sec also dropped by about 90%.
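For reference, on a plain installation the scheduler count is controlled by the Erlang VM's +S flag, which RabbitMQ lets you pass through the RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS environment variable (see the runtime tuning link under References). How this maps onto the Bitnami chart's values is not shown here; the line below is only a sketch of the underlying setting.

# Run the Erlang VM with 8 schedulers (8 online) instead of a single one.
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+S 8:8"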
References
Performance:
https://www.rabbitmq.com/runtime.html#scheduling
https://www.rabbitmq.com/blog/2020/06/04/how-to-run-benchmarks/
https://www.rabbitmq.com/blog/2020/08/10/deploying-rabbitmq-to-kubernetes-whats-involved/
https://www.rabbitmq.com/runtime.html#cpu-reduce-idle-usage
Monitoring:
http://rabbitmq.1065348.n5.nabble.com/RabbitMQ-API-aliveness-test-td32723.html
https://groups.google.com/forum/#!topic/rabbitmq-users/9pOeHlhQoHA
https://www.rabbitmq.com/monitoring.html

No Worker Parallelism During Presto Query on AWS EMR

I have set up a Presto cluster on AWS EMR querying from an S3 bucket. I am exploring the cluster overview metrics as I run queries, and I notice that even though there are 2 available worker nodes, there is 0 worker parallelism. I was wondering why that is.
(screenshot: cluster overview during the query)
Worker parallelism is a metric showing how much CPU you are using across the worker nodes. If you do not query Presto, or your queries are not CPU-intensive (e.g. they are bottlenecked on IO), you will not see any worker parallelism.
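If you want to see that metric move, one option is to drive a deliberately CPU-heavy query and watch the cluster overview while it runs. Here is a rough sketch with the presto-python-client; the connection details, table and column names are placeholders, not taken from the question:

import prestodb  # pip install presto-python-client

# Connection details are placeholders; on EMR the Presto coordinator
# typically listens on port 8889 of the master node.
conn = prestodb.dbapi.connect(
    host="emr-master-node", port=8889, user="hadoop",
    catalog="hive", schema="default",
)
cur = conn.cursor()

# A regex-heavy aggregation tends to be CPU-bound rather than IO-bound,
# so worker parallelism should rise while it runs. Table/column are hypothetical.
cur.execute(
    "SELECT count(*) FROM my_table "
    "WHERE regexp_like(some_column, '[a-z]+[0-9]{3,}')"
)
print(cur.fetchall())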

How to monitor ElastiCache for Redis metrics, like resources used

I want to monitor metrics for Redis, such as memory usage.
Can anyone tell me how to find these metrics?
Assuming this is done through the AWS console, for memory usage you can use the BytesUsedForCache metric. For other metrics, refer to http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/CacheMetrics.Redis.html
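If you would rather pull the same metric programmatically instead of through the console, a minimal boto3 sketch could look like this (the region and cache cluster ID are placeholders):

import datetime
import boto3

# Region and cluster ID below are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.datetime.utcnow()
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElastiCache",
    MetricName="BytesUsedForCache",
    Dimensions=[{"Name": "CacheClusterId", "Value": "my-redis-cluster"}],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], "bytes")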

RabbitMQ clustering

I have created a RabbitMQ cluster on a single Windows machine with an HA policy applied to all queues, consisting of two DISC nodes, two RAM nodes and one STAT node. I then ran PerfTest (the RabbitMQ client test utility) and the results were disappointing: around 5000 msg/sec. But when I ran the same test against a single RabbitMQ node it gave a good result, i.e. 25000 msg/sec. I am unable to work out what is going wrong; I expected the results to be better when running within a cluster, but it is the opposite. Has anyone encountered the same, or do you know the reason behind it?
Thanks
A RabbitMQ Cluster with Mirrored Queues won't go faster than a single node. Why? Clustering is there to improve reliability and fault tolerance, not to improve throughput.
What's the reason for this? When you enable mirrored queues, RabbitMQ needs to coordinate state between nodes, that is, it needs to coordinate publishes, consumers and acks, to not deliver the same message more than once, or to more than one consumer. All this coordination affects performance, but that's the tradeoff with this kind of replication.
If you need decentralised replication, then you could use the Federation Plugin
The throughput rate depends on a couple of factors. In our perf tests for RabbitMQ in a cluster, we observed that the rate varied depending on whether the RabbitMQ nodes were DISC or RAM, but a big chunk of the performance variation came from running the RabbitMQ cluster with mirrored queues vs. without. With mirroring enabled we were seeing a rate of 3500 msg/sec, while without it the rate was 5000 msg/sec. Also, what is your message size when you run PerfTest?
As is typical with RabbitMQ, it really depends. Here are a few ways that I have found to improve performance with RabbitMQ clustering:
Push messages only to a set of appropriately sized memory (RAM) nodes, using a load balancer
Keep the message size very small
Do not use amqp transactions or Publisher Confirms
Only use HA mirrored queues for the small set of queues whose data you absolutely must keep
Set a TTL on all messages or queues using a policy
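To illustrate the last point, a TTL policy can be applied with rabbitmqctl set_policy or through the management HTTP API. A rough sketch of the HTTP variant follows; the host, credentials, vhost, queue pattern and the 60-second TTL are all placeholders:

import requests

# "%2F" is the URL-encoded default vhost "/"; host and credentials are placeholders.
url = "http://rabbitmq-host:15672/api/policies/%2F/message-ttl"
policy = {
    "pattern": ".*",                       # apply to every queue name
    "definition": {"message-ttl": 60000},  # expire messages after 60 seconds
    "apply-to": "queues",
    "priority": 0,
}
resp = requests.put(url, json=policy, auth=("guest", "guest"))
resp.raise_for_status()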
Just to add on to the above comments, putting this here as an FYI:
http://www.rabbitmq.com/blog/2012/05/11/some-queuing-theory-throughput-latency-and-bandwidth/
http://www.rabbitmq.com/blog/2012/04/25/rabbitmq-performance-measurements-part-2/
The problem is that you are running a cluster on the same machine with the same resources.
The purpose of a RabbitMQ cluster is to scale out, not to scale in.
In other words, it gives you more network connections, more disk throughput and, of course, more CPU power to handle more messages.
When adding nodes on a single machine you don't scale your resources, and you add the overhead of running a cluster (as stated above).