Which "statistic" is proper for the "StatusCheckFailed" metric when setting CloudWatch alarms? - amazon-cloudwatch

I'm setting up a number of alarms for our team's EC2 instance.
While setting an alarm for the StatusCheckFailed metric, I noticed it returns 1 or 0: 1 means the system or instance status check is failing, and 0 means everything is healthy.
My question is: which "statistic" is appropriate for binary values like this?

You can use either Maximum or Sum. Set the alarm to trigger when the statistic is greater than 0.
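As a rough sketch of what that alarm looks like in code (assuming boto3; the instance ID is a hypothetical placeholder), the answer's advice maps to parameters like these:

```python
# Sketch: parameters for a StatusCheckFailed alarm using the Maximum statistic.
# The instance ID is a placeholder; adjust the period and name to your setup.
def status_check_alarm_params(instance_id):
    """Build the keyword arguments for boto3's cloudwatch.put_metric_alarm()."""
    return {
        "AlarmName": f"status-check-failed-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",  # Maximum (or Sum) works for 0/1 metrics
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # alarm when statistic > 0
    }

params = status_check_alarm_params("i-0123456789abcdef0")
# boto3.client("cloudwatch").put_metric_alarm(**params)  # call when ready
```

Maximum works here because any 1 in the period pushes the statistic above the threshold, regardless of how many healthy 0 samples surround it.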

Related

How to make CodeDeploy Blue/Green create CloudWatch alarms for custom metrics?

I am using the CloudWatch agent to create metrics for disk usage, memory usage, CPU, and a couple of other things. I would like to aggregate metrics based on the Auto Scaling group, using "AutoScalingGroupName":"${aws:AutoScalingGroupName}".
However, I'm using Blue/Green deployments with CodeDeploy, which creates a copy of the Auto Scaling group. The alarms I originally made for aggregations on the Auto Scaling groups are gone, and I can't put a widget in my dashboard that shows average CPU, memory, etc.
My quick solution was to use a custom append_dimension that is set to a hardcoded value and aggregate dimensions on that. Is there an automated way that AWS provides that I don't know about?
I don't have experience with the above scenario in the AWS console.
But since I mostly work with Terraform (infrastructure as code), here is how you can do it there:
dimensions = {
  AutoScalingGroupName = tolist(aws_codedeploy_deployment_group.asg.autoscaling_groups)[0]
}
The reason for converting to a list: the output of
aws_codedeploy_deployment_group.asg.autoscaling_groups
is a set value (Terraform applies the toset function to it, as you can see if you output the deployment group's autoscaling_groups attribute), while the metric dimension of a CloudWatch metric alarm expects a string. Converting the unordered set to a list lets you index its first element, which is the copy of the Auto Scaling group newly created by CodeDeploy.

Cloudwatch Alarm across all dimensions based on metric name for custom metrics

We are publishing custom Cloudwatch metrics from our service and want to set up alarms if the value for a metric name breaches a threshold for any dimension. Here are the metrics we are publishing:
Namespace=SameName, MetricName=Fault, Dimensions=[Operation=A, Program=ServiceName]
Namespace=SameName, MetricName=Fault, Dimensions=[Operation=B, Program=ServiceName]
Namespace=SameName, MetricName=Fault, Dimensions=[Operation=C, Program=ServiceName]
We want to set up an alarm so that a Fault across any dimension puts it into the Alarm state.
As you can see, the value for dimension Operation is different. Currently, we only have these 3 operations, so I understand we can use metric math to set up this alarm. But I am sure we will get to a point where this will keep growing.
I am able to use a SEARCH expression plus aggregation across the search expression to generate a graph, but it does not let me create an alarm, saying "The expression for an alarm must include at least one metric."
Is there any other way I can achieve this?
Alarming directly on SEARCH expressions is not supported yet. You would have to create a metric math expression where you list all 3 metrics, then add an expression that takes the max of the 3, like MAX(METRICS()). Make sure only the expression is marked as visible so that there is only one line on the graph.
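As a sketch of that metric math setup (assuming boto3; the namespace, metric name, and operation values are taken from the question), the alarm's Metrics parameter would look roughly like this:

```python
# Sketch: metric math alarm input that fires if Fault > 0 for any of the
# listed operations. Only the expression (e1) returns data; the raw metrics
# are marked ReturnData=False so the alarm evaluates a single series.
def fault_alarm_metrics(operations=("A", "B", "C")):
    metrics = []
    for i, op in enumerate(operations):
        metrics.append({
            "Id": f"m{i}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "SameName",
                    "MetricName": "Fault",
                    "Dimensions": [
                        {"Name": "Operation", "Value": op},
                        {"Name": "Program", "Value": "ServiceName"},
                    ],
                },
                "Period": 60,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        })
    ids = ", ".join(m["Id"] for m in metrics)
    # MAX over the listed series; equivalent in spirit to MAX(METRICS())
    metrics.append({"Id": "e1", "Expression": f"MAX([{ids}])", "ReturnData": True})
    return metrics

metrics = fault_alarm_metrics()
# boto3.client("cloudwatch").put_metric_alarm(
#     AlarmName="fault-any-operation", Metrics=metrics, Threshold=0,
#     ComparisonOperator="GreaterThanThreshold", EvaluationPeriods=1)
```

The downside, as the question anticipates, is that this list must be updated whenever a new Operation value appears.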
As stated by Dejan, alarming on search expressions isn't supported yet in CloudWatch.
Another limitation is that you can only add up to 10 metrics to a metric math expression, which you can overcome with the new composite alarms.
If you would consider using a third-party service, you could try Datadog.
With Datadog you can import your CloudWatch metrics and set up multi-alerts, which follow (and automatically discover) all tags under a specific metric.
There might be other services that offer this kind of feature, but I specifically have experience with this tool.

Hive LLAP tuning: Memory per daemon and Heap Size Calculation

I am tuning my cluster, which runs Hive LLAP. According to this link, https://community.hortonworks.com/articles/215868/hive-llap-deep-dive.html, I need to calculate the value of the heap size, but I'm not sure what "*" means there.
I also have a question about how to calculate the value for hive.llap.daemon.yarn.container.mb, other than using the default value given by Ambari.
I tried calculating the value by treating * as multiplication and set the container value equal to yarn.scheduler.maximum-allocation-mb; however, HiveServer2 Interactive does not start after this tuning.
Here's an excellent wiki article on setting up Hive LLAP in the HDP suite:
https://community.hortonworks.com/articles/149486/llap-sizing-and-setup.html
Your understanding of * is correct; it's used for multiplication.
The rule of thumb is to set hive.llap.daemon.yarn.container.mb equal to yarn.scheduler.maximum-allocation-mb, but if your service does not come up with that value, I would recommend changing llap_heap_size to 80% of hive.llap.daemon.yarn.container.mb.
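The 80% rule of thumb is simple arithmetic; here is a minimal sketch with a hypothetical container size (substitute your cluster's actual yarn.scheduler.maximum-allocation-mb):

```python
# Sketch: the 80% heap-sizing rule of thumb from the answer above.
# The container value is a made-up example, not a recommendation.
def llap_heap_size_mb(container_mb, fraction=0.8):
    """Heap size for the LLAP daemon as a fraction of its YARN container,
    leaving the remainder as headroom within the container."""
    return int(container_mb * fraction)

container_mb = 16384  # example: hive.llap.daemon.yarn.container.mb = 16 GB
heap = llap_heap_size_mb(container_mb)  # 13107 MB
```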

Prometheus: how to rate a sum of the same counter from different machines?

I have a Prometheus counter for which I want the rate over a time range (the real goal is to sum the rates, and sometimes apply histogram_quantile to that for histogram metrics).
However, multiple machines run this kind of job, and each one sets its own instance label. As a result, inc operations on this counter on different machines create different time series of the counter, because each combination of label values is unique.
The problem is that rate() works separately on each such series.
The result is that series with unique label combinations are effectively ignored by rate().
For example, if I've got:
mycounter{aaa="1",instance="1.2.3.4:6666",job="job1"} value: 1
mycounter{aaa="2",instance="1.2.3.4:6666",job="job1"} value: 1
mycounter{aaa="2",instance="1.2.3.4:7777",job="job1"} value: 1
mycounter{aaa="1",instance="5.5.5.5:6666",job="job1"} value: 1
All the counter series are unique, so each has a value of 1.
Since the label combinations are always unique (they come from different machines), rate(mycounter[5m]) returns 0 for each series in this case,
and sum(rate(mycounter[5m])) returns 0, which is not what I need!
I want to ignore the instance label so that these mycounter inc operations are treated as if they were made on the same counter series.
In other words, I expect to have only 2 series (they can share a common instance value or have no instance label at all):
mycounter{aaa="1", job="job1"} value: 2
mycounter{aaa="2", job="job1"} value: 2
In that case, an inc operation on a new machine (with an existing aaa value) would increase an existing series instead of adding a new series with a value of 1, rate() would compute real rates for each, and we could sum() them.
How do I do that?
I made several attempts to solve this, but all failed:
Doing a rate() of the sum() fails because of a type mismatch.
Removing the automatic instance label with metric_relabel_configs and action: labeldrop works, but then Prometheus assigns the default address value instead.
Changing all instance values to a common one with metric_relabel_configs and replacement also doesn't help: one of the series seems to overwrite all the others.
Any suggestions?
Prometheus version: 2.3.2
Thanks in Advance!
It is best to expose your counters at 0 on application start, if the other labels (aaa, etc.) have a limited set of possible combinations. That way rate() works correctly at the per-series level and sum() gives you correct results.
If you have to do a rate() of the sum(), read this first:
Note that when combining rate() with an aggregation operator (e.g. sum()) or a function aggregating over time (any function ending in _over_time), always take a rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.
If you can tolerate this (or the instances reset counters at the same time), there's a way to work around. Define a recording rule as
- record: job:mycounter:sum
  expr: sum without(instance) (mycounter)
and then this expression works:
sum(rate(job:mycounter:sum[5m]))
The obvious query rate(sum(...)) won't work in most cases, since the resulting sum(...) may hide resets to zero in the individual time series passed to sum. So usually the correct approach is sum(rate(...)) instead. See this article for details.
Unfortunately, Prometheus may miss some increases on slow-changing counters when calculating rate(), as shown in the original question above. The same applies to increase() calculations. See this issue, this comment, and this article for details. Prometheus developers plan to fix these issues; see this design doc.
In the meantime, try VictoriaMetrics if you need exact values from rate() and increase() over slow-changing counters (and distributed counters).
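A toy illustration in plain Python (a simplified model, not PromQL internals) of why sum(rate(...)) and rate(sum(...)) can disagree when a target restarts:

```python
# Toy model of Prometheus counter-reset handling, to show why aggregating
# before rate()/increase() hides resets. Not real PromQL internals.
def increase(samples):
    """Increase over a window with simple reset detection:
    a drop in value is treated as a counter reset from zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total

a = [10, 20]  # healthy counter: +10
b = [15, 3]   # restarted target: counter reset to 0, then +3
summed = [x + y for x, y in zip(a, b)]  # [25, 23] -- the reset is hidden

rate_then_sum = increase(a) + increase(b)  # 10 + 3 = 13 (correct)
sum_then_rate = increase(summed)           # 23 (wrong: the dip looks like a full reset)
```

This is exactly the failure mode the quoted documentation warns about: once series are summed, the individual reset can no longer be distinguished from a reset of the aggregate.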

ElastiCache cloudwatch metrics for Redis: currItems for a single database

I have set up a metric through the AWS interface on our ElastiCache Redis cluster. I'm alerting on a value of CurrItems above a certain number for a given period (say, 1000 for 1 minute).
The issue is that I have two databases in Redis, named 0 and 1. I would like to get CurrItems for database 0 only, not database 1, since database 1 holds values for a longer period of time and makes the whole metric look much bigger than it should (I only care about the current items of database 0).
Is there a way to create a metric that only reports the CurrItems of database 0?
You will have to create a small application for this or use existing Redis monitoring tools.
https://stackoverflow.com/questions/8614737/what-are-some-recommended-tools-to-monitor-a-redis-database
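A minimal sketch of the "create an application" route: Redis reports per-database key counts in the Keyspace section of INFO, which you can parse and publish as a custom CloudWatch metric. The parser below works on raw INFO text; the redis-py and boto3 calls are commented out and assume those libraries plus a hypothetical cluster endpoint (note that redis-py's info() already returns a parsed dict, so with it you could read the count directly).

```python
# Sketch: extract per-database key counts from raw Redis INFO output and
# publish db0's count as a custom CloudWatch metric.
def parse_keyspace(info_text):
    """Parse lines like 'db0:keys=1000,expires=5,avg_ttl=0' into {'db0': 1000}."""
    counts = {}
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("db") and ":" in line:
            name, fields = line.split(":", 1)
            for field in fields.split(","):
                key, _, value = field.partition("=")
                if key == "keys":
                    counts[name] = int(value)
    return counts

sample = "# Keyspace\ndb0:keys=1000,expires=5,avg_ttl=0\ndb1:keys=50000,expires=0,avg_ttl=0"
counts = parse_keyspace(sample)  # {'db0': 1000, 'db1': 50000}

# r = redis.Redis(host="my-cluster.cache.amazonaws.com")  # hypothetical endpoint
# count = r.info("keyspace")["db0"]["keys"]  # redis-py parses INFO for you
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="Custom/Redis",
#     MetricData=[{"MetricName": "Db0CurrItems", "Value": count}])
```

Run on a schedule (cron, Lambda, etc.), this gives you a CurrItems-style metric scoped to database 0 only, which you can then alarm on like any other custom metric.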
If you are using new relic