Prometheus query comparing different metrics with same set of labels - rabbitmq

I'm trying to monitor whether a queue in RabbitMQ:
has messages
does not have a consumer
is not named .*_retry
If a queue matches all three conditions, I want to create an alert.
The individual metrics are easy enough to find, but I can't work out how to AND the different metrics in one query, grouping them by a set of labels (i.e. instance, queue).
Is this even possible?
I'm using the latest version of Prometheus and scraping RabbitMQ via its built-in Prometheus metrics plugin.

For example, if you have two metrics from different exporters:
probe_success => Blackbox exporter
node_memory_MemTotal_bytes => Node exporter
Suppose they have two common labels: "instance" and "group".
If you use the following query:
sum by (instance, group) (node_memory_MemTotal_bytes)>20000000000 and sum by (instance, group) (probe_success)==1
You'll get the instance+group combinations with more than 20 GB of memory that are also up.
See the Prometheus documentation on logical/set binary operators for more details.
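Applied to the RabbitMQ case above, a minimal sketch could look like the following, assuming per-queue series named rabbitmq_queue_messages_ready and rabbitmq_queue_consumers that carry instance and queue labels (the exact metric names depend on the plugin/exporter version and how per-object metrics are exposed):
rabbitmq_queue_messages_ready{queue!~".*_retry"} > 0 and on (instance, queue) rabbitmq_queue_consumers == 0
Putting that expression into the expr field of an alerting rule gives you one alert per instance/queue pair that has ready messages, no consumers, and a name not ending in _retry.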

Related

Cloudwatch query dimension

Let's assume I have a custom namespace in CloudWatch called nameSP with a dimension PodID.
I collect the number of connections from each pod. Let's assume we have two pods, so we will get two Conn metrics. How can I get the number of pods from CloudWatch?
You can use metric math to count the metrics, like this:
TIME_SERIES(METRIC_COUNT(SEARCH('{nameSP,PodID} MetricName="THE_NAME_OF_YOUR_METRIC_WITH_NUM_OF_CONNECTIONS"', 'Average', 300)))
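To unpack the expression: SEARCH returns one time series per matching metric (here, one per PodID), METRIC_COUNT reduces that array to the number of series it contains, and TIME_SERIES turns the resulting scalar into a constant time series you can graph or alarm on. The metric name inside the SEARCH string is a placeholder for your actual connections metric.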

How to make CodeDeploy Blue/Green create CloudWatch alarms for custom metrics?

I am using the CloudWatch agent to create metrics for disk usage, memory usage, cpu, and a couple other things. I would like to aggregate metrics based on the autoscaling group, using "AutoScalingGroupName":"${aws:AutoScalingGroupName}".
However, I'm using Blue/Green deployments with CodeDeploy, which creates a copy of the Autoscaling Group. The alarms I originally made for aggregations on the autoscaling groups are gone, and I can't put a widget in my Dashboard that shows avg cpu, memory, etc.
My quick solution was to use a custom append_dimension that is set to a hardcoded value, and aggregate dimensions on that. Is there an automated way that AWS provides that I don't know about?
I don't have experience with the above scenario using the AWS console.
But since I mostly work with Terraform (infrastructure as code), you can do something like this:
dimensions = {
  AutoScalingGroupName = tolist(aws_codedeploy_deployment_group.autoScalingGroup.autoscaling_groups)[0]
}
The reason for converting it to a list: the output of
aws_codedeploy_deployment_group.autoScalingGroup.autoscaling_groups
is a set value, which you can see when you output that attribute of the CodeDeploy deployment group - Terraform wraps it in toset(). The dimensions of a CloudWatch metric alarm expect a string, and a set type is unordered and cannot be indexed, so the conversion to a list is needed to access the first element - the copy of the autoscaling group newly created by CodeDeploy.
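Put together, a rough sketch of the alarm resource might look like this - the resource names, the CWAgent namespace, and the mem_used_percent metric are assumptions that depend on your agent configuration:
resource "aws_cloudwatch_metric_alarm" "asg_memory_high" {
  alarm_name          = "asg-memory-high"
  namespace           = "CWAgent"            # default namespace of the CloudWatch agent
  metric_name         = "mem_used_percent"   # adjust to whichever agent metric you aggregate
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 80
  period              = 300
  evaluation_periods  = 2

  dimensions = {
    AutoScalingGroupName = tolist(aws_codedeploy_deployment_group.autoScalingGroup.autoscaling_groups)[0]
  }
}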

Cloudwatch Alarm across all dimensions based on metric name for custom metrics

We are publishing custom Cloudwatch metrics from our service and want to set up alarms if the value for a metric name breaches a threshold for any dimension. Here are the metrics we are publishing:
Namespace=SameName, MetricName=Fault, Dimensions=[Operation=A, Program=ServiceName]
Namespace=SameName, MetricName=Fault, Dimensions=[Operation=B, Program=ServiceName]
Namespace=SameName, MetricName=Fault, Dimensions=[Operation=C, Program=ServiceName]
We want to set up an alarm so that a Fault across any dimension puts it into the Alarm state.
As you can see, the value for dimension Operation is different. Currently, we only have these 3 operations, so I understand we can use metric math to set up this alarm. But I am sure we will get to a point where this will keep growing.
I am able to use a SEARCH expression plus an aggregation across the search expression to generate a graph for it, but it does not let me create an alarm, saying "The expression for an alarm must include at least one metric."
Is there any other way I can achieve this?
Alarming directly on SEARCH is not supported yet. You would have to create a metric math expression where you list all 3 metrics, then create an expression that takes the max of the 3, like MAX(METRICS()). Make sure only the expression is marked as visible so that there is only 1 line on the graph.
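As a sketch of that setup in the metric math editor (the IDs m1-m3 and e1 are arbitrary, and the dimensions mirror the ones in the question):
m1: Fault, Operation=A, Program=ServiceName   (hidden)
m2: Fault, Operation=B, Program=ServiceName   (hidden)
m3: Fault, Operation=C, Program=ServiceName   (hidden)
e1: MAX(METRICS())                            (visible - create the alarm on this expression)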
As stated by Dejan, alarming on SEARCH expressions isn't supported yet in CloudWatch.
Another limitation is that you can only add up to 10 metrics to a metric math expression, which you can overcome with the new composite alarms.
If you would consider using a third-party service, you can try DataDog.
With DataDog you can import your CloudWatch metrics and set up multi-alert monitors, which follow (and automatically discover) all tags under a specific metric.
There might be other services that offer this kind of feature, but I specifically have experience with this tool.

grafana 2, collectd - issues with graphs

So I have collectd running on some servers, and they are sending the data back to InfluxDB. InfluxDB is storing the data, and Grafana 2 is configured with InfluxDB as the data backend. Some graphs work fine - such as load average - but some don't graph properly, like interface statistics (see picture):
http://i.imgur.com/YgIxBE1.png
I'm guessing this is because load average is stored like so:
timestamp1: $current_load_average (ex. 1.2)
timestamp2: $current_load_average (ex. 1.1)
And interface statistics are stored like so:
timestamp1: $bytes_transferred_so_far (ex. 1002)
timestamp2: $bytes_transferred_so_far (ex. 1034)
So Grafana just graphs the total bytes that have been transferred over that interface but not the bytes/second that I need. With the same setup - when collectd was writing to RRD files and they were being graphed by several interfaces - it all worked as expected.
Can you advise what I should look into or change?
The Grafana/InfluxDB query needs to turn the raw counter into a rate.
For counters that only ever increase, you're interested in their derivative over a time window. Depending on your graph resolution (whether you're looking at the last day or the last hour), you should choose an appropriate window in which all relevant peaks remain visible.
You can use either:
Transformation > derivative()
Transformation > non_negative_derivative()
The latter is useful when you want to omit negative values from your chart, such as the dips caused by counter resets.
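For reference, a raw InfluxQL query behind such a graph might look roughly like this - the measurement interface_rx, the field value, and the tags are assumptions that depend on how the collectd data is written into InfluxDB:
SELECT non_negative_derivative(mean("value"), 1s)
FROM "interface_rx"
WHERE "type" = 'if_octets' AND $timeFilter
GROUP BY time($interval), "host" fill(null)
The 1s unit turns the counter into bytes per second, and $timeFilter / $interval are filled in by Grafana.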

Lambda Architecture Modelling Issue

I am considering implementing a Lambda Architecture in order to process events transmitted by multiple devices.
In most cases (averages etc.) it seems to fit my requirements. However, I am stuck trying to model a specific use case. In short...
Each device has a device_id. Every device emits 1 event per second. Each event has an event_id ranging from {0-->10}.
An event_id of 0 indicates START & an event_id of 10 indicates END
All the events between START & END should be grouped into one single group (event_group).
This will produce event_group tuples, e.g. (0,2,2,2,5,10), (0,4,2,7,...,5,10), (0,10).
An event_group might be short, e.g. 10 minutes, or very long, say 3 hours.
According to Lambda Architecture these events transmitted by every device are my "Master Data Set".
Currently, the events are sent to HDFS & Storm using Kafka (Camus, Kafka Spout).
In the Streaming process I group by device_id, and use Redis to maintain a set of incoming events in memory, based on a key which is generated each time an event_id=0 arrives.
The problem lies in HDFS. Say I save a file with all incoming events every hour. Is there a way to distinguish these event_groups?
Using Hive I can group tuples in the same manner. However, each file will also contain "broken" event_groups
(0,2,2,3) previous computation (file)
(4,3,) previous computation (file)
(5,6,7,8,10) current computation (file)
so that I need to merge them based on device_id into (0,2,2,3,4,3,5,6,7,8,10) (multiple files)
Is a Lambda Architecture a fit for this scenario? Or should the streaming process be the only source of truth, i.e. write to HBase and HDFS itself - and won't that affect the overall latency?
As far as I understand your process, I don't see any issue, as the principle of the Lambda Architecture is to regularly re-process all your data in batch mode.
(By the way, not all your data, but a time frame, usually larger than the speed-layer window.)
If you choose a large enough time window for your batch mode (say, your aggregation window + 3 hours, in order to include even the longest event groups), your MapReduce program will be able to compute all the event groups for the desired aggregation window, whichever files the individual events are stored in (Hadoop shuffle magic!).
The underlying files are not part of the problem; the time windows used to select the data to process are.
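As an illustration of that idea, a batch-layer Hive query might look roughly like the following - the table events(device_id, event_id, ts) and the hiveconf variables are assumptions, and splitting each device's stream into individual event_groups (0 ... 10) would still happen in a follow-up step or UDF:
-- Select on a padded time window rather than on file boundaries; the shuffle
-- brings all of a device's events together, whichever hourly file they landed in.
SELECT device_id,
       collect_list(event_id) AS events_in_window
FROM events
WHERE ts >= ${hiveconf:window_start} - 3 * 3600   -- pad by the longest expected event_group
  AND ts < ${hiveconf:window_end}
GROUP BY device_id;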