CloudWatch alarm across all dimensions based on metric name for custom metrics

We are publishing custom CloudWatch metrics from our service and want to set up alarms if the value for a metric name breaches a threshold for any dimension. Here are the metrics we are publishing:
Namespace=SameName, MetricName=Fault, Dimensions=[Operation=A, Program=ServiceName]
Namespace=SameName, MetricName=Fault, Dimensions=[Operation=B, Program=ServiceName]
Namespace=SameName, MetricName=Fault, Dimensions=[Operation=C, Program=ServiceName]
We want to set up an alarm so that a Fault across any dimension puts it into the Alarm state.
As you can see, the value of the Operation dimension differs. Currently we only have these 3 operations, so I understand we can use metric math to set up this alarm, but I am sure we will get to a point where this list keeps growing.
I am able to use a SEARCH expression plus an aggregation across the search expression to generate a graph, but it does not let me create an alarm, saying "The expression for an alarm must include at least one metric."
Is there any other way I can achieve this?

Alarming directly on a SEARCH expression is not supported yet. You would have to create a metric math expression where you list all 3 metrics, then create an expression that takes the max of the 3, such as MAX(METRICS()). Make sure only the expression is marked as visible so that there is only one line on the graph.
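The metric-math workaround can be sketched with boto3's put_metric_alarm. The namespace, metric name, and dimensions come from the question; the alarm name, threshold, period, and statistic are hypothetical placeholders you would adapt:

```python
def fault_alarm_metrics(operations, namespace="SameName", metric_name="Fault"):
    """Build the Metrics parameter for put_metric_alarm: one hidden entry
    per Operation value, plus a visible MAX(METRICS()) expression that is
    the single line the alarm actually evaluates."""
    metrics = [
        {
            "Id": f"m{i}",
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "Operation", "Value": op},
                        {"Name": "Program", "Value": "ServiceName"},
                    ],
                },
                "Period": 60,     # placeholder period
                "Stat": "Sum",    # placeholder statistic
            },
            "ReturnData": False,  # hidden: not the line the alarm watches
        }
        for i, op in enumerate(operations)
    ]
    metrics.append(
        {"Id": "e1", "Expression": "MAX(METRICS())",
         "Label": "MaxFault", "ReturnData": True}  # the one visible line
    )
    return metrics

# Hypothetical alarm creation (requires boto3 and valid credentials):
# boto3.client("cloudwatch").put_metric_alarm(
#     AlarmName="fault-any-operation",
#     EvaluationPeriods=1,
#     Threshold=0,
#     ComparisonOperator="GreaterThanThreshold",
#     Metrics=fault_alarm_metrics(["A", "B", "C"]),
# )
```

Note that exactly one entry has ReturnData=True, which satisfies the "only the expression is visible" requirement.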

As stated by Dejan, alarming on SEARCH expressions isn't supported yet in CloudWatch.
Another limitation is that you can only add up to 10 metrics to a metric math expression, which you can overcome with the new composite alarms.
If you would consider using a third-party service, you can try DataDog.
With DataDog you can import your CloudWatch metrics and set up multi-alerts which follow (and automatically discover) all tags under a specific metric.
There might be other services that offer this kind of feature, but I specifically have experience with this tool.
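The composite-alarm route mentioned above chains per-operation alarms together with an AlarmRule expression. A minimal sketch (the child alarm names are hypothetical) of building the rule string for boto3's put_composite_alarm:

```python
def composite_alarm_rule(alarm_names):
    """OR together the per-operation child alarms into one AlarmRule string."""
    return " OR ".join(f'ALARM("{name}")' for name in alarm_names)

rule = composite_alarm_rule(["fault-op-A", "fault-op-B", "fault-op-C"])
# rule == 'ALARM("fault-op-A") OR ALARM("fault-op-B") OR ALARM("fault-op-C")'

# Hypothetical creation call (requires boto3 and the child alarms to exist):
# boto3.client("cloudwatch").put_composite_alarm(
#     AlarmName="fault-any-operation-composite", AlarmRule=rule)
```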

Prometheus query comparing different metrics with same set of labels

I'm trying to monitor whether a queue in RabbitMQ:
has messages
does not have a consumer
is not called .*_retry
If a queue matches all three, I want to create an alert.
The individual metrics are no problem to find, but I cannot work out how to AND the different metrics in one query, grouping them by a set of labels (i.e. instance, queue).
Is this even possible?
I'm using the latest version of Prometheus and scraping RabbitMQ via its built-in Prometheus metrics plugin.
For example, if you have two metrics from different exporters:
probe_success => Blackbox exporter
node_memory_MemTotal_bytes => Node exporter
Suppose they have two common labels: "instance" and "group".
If you use the following query:
sum by (instance, group) (node_memory_MemTotal_bytes) > 20000000000 and sum by (instance, group) (probe_success) == 1
You'll get the instance+group combinations that have more than 20 GB of memory and are up.
See the Prometheus documentation for more information about logical operators.
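Applied to the original RabbitMQ question, and assuming the per-object metric names exposed by the rabbitmq_prometheus plugin (rabbitmq_queue_messages and rabbitmq_queue_consumers; verify them against your own /metrics output), the three conditions could be combined along the same lines, with the queue-name filter expressed as a label matcher:

```promql
rabbitmq_queue_messages{queue!~".*_retry"} > 0
  and on (instance, queue)
rabbitmq_queue_consumers == 0
```

The result is one series per (instance, queue) pair that has messages, has no consumers, and whose name does not match .*_retry, which is exactly what you would alert on.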

How to make CodeDeploy Blue/Green create CloudWatch alarms for custom metrics?

I am using the CloudWatch agent to create metrics for disk usage, memory usage, cpu, and a couple other things. I would like to aggregate metrics based on the autoscaling group, using "AutoScalingGroupName":"${aws:AutoScalingGroupName}".
However, I'm using Blue/Green deployments with CodeDeploy, which creates a copy of the Autoscaling Group. The alarms I originally made for aggregations on the autoscaling groups are gone, and I can't put a widget in my Dashboard that shows avg cpu, memory, etc.
My quick solution was to use a custom append_dimension that is set to a hardcoded value, and aggregate dimensions on that. Is there an automated way that AWS provides that I don't know about?
I don't have experience of the above scenario using the AWS console.
But since I work mostly with Terraform (infrastructure as code), you can do something like this:
dimensions = {
  AutoScalingGroupName = tolist(aws_codedeploy_deployment_group.autoScalingGroup.autoscaling_groups)[0]
}
The reason for converting it into a list: the output of
aws_codedeploy_deployment_group.asg.autoscaling_groups
is a set value, which you can see when you output the value of the deployment group's autoscaling_groups attribute (it uses the toset function). The dimensions of a CloudWatch metric alarm expect a string, and a set type is unordered, so it has to be converted to a list type before you can access its first element: the newly created copy of the autoscaling group made by CodeDeploy.

Airflow: BigQueryOperator vs BigQuery Quotas and Limits

Is there any practical way to control quotas and limits in Airflow?
I'm especially interested in controlling BigQuery concurrency.
There are different levels of quotas in BigQuery. So, according to the Operator inputs, there should be a way to check whether the conditions are met, and otherwise wait until they are.
It seems to be a composition of Sensor-Operators, querying against a database such as Redis, for example:
QuotaSensor(Project, Dataset, Table, Query) >> QuotaAddOperator(Project, Dataset, Table, Query)
QuotaAddOperator(Project, Dataset, Table, Query) >> BigQueryOperator(Project, Dataset, Table, Query)
BigQueryOperator(Project, Dataset, Table, Query) >> QuotaSubOperator(Project, Dataset, Table, Query)
The Sensor must check conditions like:
- Global running queries <= 300
- Project running queries <= 100
- .. etc
Is there any lib that already does that for me? A plugin perhaps?
Or any other easier solution?
Otherwise, following the Sensor-Operator approach: how can I encapsulate all of it under a single operator to avoid code repetition? Say, a single QuotaBigQueryOperator.
Currently, it is only possible to get the Compute Engine quotas programmatically. However, there is an open feature request to get/set other project quotas via API. You can post there about the specific case you would like implemented, and follow it to track it and ask for updates.
Meanwhile, as a workaround, you can try the PythonOperator. With it you can define your own custom code, and you can implement retries for the queries you send that get a quotaExceeded error (or whatever specific error you are getting). This way you don't have to explicitly check the quota levels; you just run the queries and retry until they get executed. This is simplified code for the strategy I have in mind:
for query in QUERIES_TO_RUN:
    while True:
        try:
            run(query)
        except quotaExceededException:
            continue  # retry: jumps back to the top of the enclosing while loop
        break  # query succeeded, move on to the next one
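Building on the simplified sketch above, a slightly more defensive variant bounds the number of retries and backs off between attempts. Here run, quota_error, and the delay parameters are placeholders for your own query runner and your client library's exception type:

```python
import time

def run_with_retry(query, run, quota_error, max_retries=5, base_delay=1.0):
    """Retry a query on quota errors with exponential backoff.

    `run` is whatever callable executes the query, and `quota_error` is the
    exception type your client raises on quotaExceeded (both are assumptions
    here, not a specific library's API).
    """
    for attempt in range(max_retries):
        try:
            return run(query)
        except quota_error:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Inside an Airflow PythonOperator you would call run_with_retry from the python_callable, one call per query.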

Min cost flow with edge investment cost

I want to use a Python min-cost flow solver to construct new networks. This means I have an initial complete graph whose vertices are either suppliers or have a demand. Running the algorithm should tell me, based on their costs, which edges will be used to settle all demands. Unlike the existing problems, the cost of an edge when used is not only described by a unit cost but also includes an investment cost for this edge which is independent of the flow. I have been looking into the source code of networkx and or-tools but cannot figure out how to adapt them to implement the investment cost of the edges. Does someone have a better idea or can help me adapt the code?
Best Regards
Justus
You cannot solve this with a standard graph algorithm (e.g. min-cost flow).
Instead you need to formulate it as a Mixed Integer Program.
You can start with this example:
https://developers.google.com/optimization/assignment/assignment_mip
But you need to tweak it a little bit:
You need two classes of decision variables: invest_var (binary) and flow_var (continuous).
The objective will look like this:
min: sum(flow_cost[i,j]*flow_var[i,j]) + sum(invest_cost[i,j]*invest_var[i,j])
And you need to add an additional constraint for each link:
flow_var[i,j] <= BIG_INT * invest_var[i,j]
The purpose of this constraint is to force flow_var to 0 whenever invest_var is 0.
Demand and Supply constraints will be similar as in the example.
BIG_INT is a constant. You can set it as BIG_INT = max(flow_upper_bound[i,j]), where flow_upper_bound is an upper bound on your flow_var variables.
Notice that the problem now becomes a Mixed Integer Linear Program instead of just a Linear Program.
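To make the fixed-charge objective concrete, here is a tiny brute-force illustration, not a scalable solver (a real instance should go to a MIP solver such as OR-Tools). It uses a network of parallel edges between one supplier and one demand node, where opening an edge incurs its fixed investment cost; for parallel edges the optimal flow split just fills opened edges in order of unit cost:

```python
from itertools import combinations

def best_fixed_charge_plan(edges, demand):
    """Enumerate all invest decisions (which edges to open); for each
    feasible subset, route the demand greedily by cheapest unit cost and
    keep the cheapest total of fixed + flow-proportional cost."""
    best = (float("inf"), None)
    for r in range(1, len(edges) + 1):
        for subset in combinations(range(len(edges)), r):
            if sum(edges[i]["cap"] for i in subset) < demand:
                continue  # opened capacity cannot satisfy the demand
            cost = sum(edges[i]["fix"] for i in subset)  # investment part
            left = demand
            for i in sorted(subset, key=lambda i: edges[i]["unit"]):
                f = min(left, edges[i]["cap"])
                cost += f * edges[i]["unit"]  # flow-proportional part
                left -= f
            best = min(best, (cost, subset))
    return best

edges = [{"cap": 5, "unit": 1, "fix": 10},
         {"cap": 10, "unit": 3, "fix": 2},
         {"cap": 10, "unit": 2, "fix": 8}]
print(best_fixed_charge_plan(edges, demand=10))  # -> (28, (2,))
```

Note how the cheapest-per-unit edge 0 loses here: its fixed cost of 10 outweighs its low unit cost, which is exactly the trade-off the invest_var terms capture in the MIP objective.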

Elegantly handle samples with insufficient data in workflow?

I've set up a Snakemake pipeline for doing some simple QC and analysis on shallow shotgun metagenomics samples coming through our lab.
Some of the tools in the pipeline will fail or error when samples with low amounts of data are delivered as inputs -- but this is sometimes not knowable from the raw input data, as intermediate filtering steps (such as adapter trimming and host genome removal) can remove varying numbers of reads.
Ideally, I would like to be able to handle these cases with some sort of check on certain input rules, which could evaluate the number of reads in an input file and choose whether or not to continue with that portion of the workflow graph. Has anyone implemented something like this successfully?
Many thanks,
-jon
I'm not aware of any way to stop part of the workflow based on a computation happening inside the workflow. The rules to be executed are determined from the final required output, and the run fails if this final output cannot be generated.
One approach could be to catch the particular tool failure (a try ... except construct in a run section, or return-code handling in a shell section) and generate a dummy output file for the corresponding rule, then have the downstream rules "propagate" dummy-file generation by testing whether the rule's input is such a dummy file.
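A minimal sketch of that dummy-file approach (the rule names, tool commands, and DUMMY marker are all hypothetical placeholders):

```
rule qc_step:
    input: "filtered/{sample}.fastq"
    output: "qc/{sample}.txt"
    run:
        try:
            shell("qc_tool {input} > {output}")
        except Exception:
            shell("echo DUMMY > {output}")  # tool failed: emit a marker

rule downstream:
    input: "qc/{sample}.txt"
    output: "report/{sample}.txt"
    run:
        if open(input[0]).readline().strip() == "DUMMY":
            shell("echo DUMMY > {output}")  # propagate the marker
        else:
            shell("real_analysis {input} > {output}")
```

The DAG still completes for every sample; low-data samples just carry marker files through to the end instead of failing the run.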
Another approach could be to pre-process the data outside of your snakemake workflow to determine which input to skip, and then use some filtering on the wildcards combinations as described here: https://stackoverflow.com/a/41185568/1878788.
I've been trying to find a solution to this issue as well.
Thus far I think I've identified a few potential solutions but have yet to be able to correctly implement them.
I use seqkit stats to quickly generate a stats txt file and use its num_seqs column to filter with. You can write a quick pandas function that returns a list of files which pass your threshold; I use config.yaml to pass the minimum read threshold:
def get_passing_fastq_files(wildcards):
    qc = pd.read_table('fastq.stats.txt').fillna(0)
    passing = list(qc[qc['num_seqs'] > config['minReads']]['file'])
    return passing
Trying to implement that as an input function in Snakemake has been an esoteric nightmare, to be honest. Probably my own lack of nuanced understanding of the Wildcards object.
I think a checkpoint is also necessary, to force Snakemake to recompute the DAG after filtering samples out. I haven't been able to connect all the dots yet, however, and I'm trying to avoid janky solutions that use token files etc.
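For completeness, a rough sketch of how the checkpoint idea could connect to a filtering function like the one above (untested; SAMPLES, the file paths, and the rule names are assumptions):

```
checkpoint fastq_stats:
    input: expand("trimmed/{sample}.fastq", sample=SAMPLES)
    output: "fastq.stats.txt"
    shell: "seqkit stats -T {input} > {output}"

def passing_samples(wildcards):
    # .get() forces the stats file to exist before the DAG is re-evaluated
    stats = checkpoints.fastq_stats.get().output[0]
    qc = pd.read_table(stats).fillna(0)
    keep = qc[qc["num_seqs"] > config["minReads"]]["file"]
    return expand("analysis/{sample}.done",
                  sample=[f.split("/")[-1].split(".")[0] for f in keep])

rule all:
    input: passing_samples
```

The key piece is checkpoints.<name>.get(), which makes Snakemake re-evaluate the input function (and hence the DAG) only after the checkpoint's output exists, so samples below the threshold simply never get downstream targets requested.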