Subtract a value x from a Prometheus metric for a Grafana "single stat" with "Delta" activated? - testing

Maybe the problem should be solved in another way.
I am using a Grafana SingleStat panel with "Delta" activated, and the dashboard shows me "today so far".
Prometheus metric: sum(request_duration_count{...})
Problem:
I have a metric counting requests. Between 03:00 and 06:00 an automated test triggers my service and the metric is incremented by value x. (I set a Grafana annotation at the starting point.)
I want to get a single stat request count without the test requests.
Advanced Problem:
Nagios checks every y minutes and also increments the counter.
How can I remove these "test-counts"?
Any ideas?
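One possible approach, sketched below against the Prometheus HTTP API rather than inside Grafana: compute the counter delta for "today so far" yourself and subtract the increase during the 03:00-06:00 test window. The server URL, the job label, and the exact window are placeholders, not values from the question.
import datetime
import requests

PROMETHEUS = "http://localhost:9090"                         # placeholder Prometheus address
METRIC = 'sum(request_duration_count{job="my-service"})'     # placeholder selector

def instant_query(expr, at=None):
    # Evaluate a PromQL expression at a single point in time via the HTTP API.
    params = {"query": expr}
    if at is not None:
        params["time"] = at
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params=params)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

midnight = datetime.datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
test_end = midnight.replace(hour=6)  # the automated test runs 03:00-06:00

# "Today so far" = current counter value minus the value at midnight
# (ignores counter resets, which would need increase() instead).
today_so_far = instant_query(METRIC) - instant_query(METRIC, midnight.timestamp())

# Increase during the 3-hour test window, evaluated at 06:00.
test_counts = instant_query(
    'sum(increase(request_duration_count{job="my-service"}[3h]))', test_end.timestamp())

print(f"Requests today excluding the test window: {today_so_far - test_counts:.0f}")
For the recurring Nagios checks, a cleaner fix is usually to label that traffic (for example a dedicated source label or user agent) so it can be excluded with a label matcher instead of arithmetic.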

Related

Prometheus - separate alarm and recovery criteria?

I'm using prometheus to detect pauses in inbound traffic for an API endpoint. The query for this seems to be working nicely: alert if there's no change in HTTP requests for a given endpoint over a one-hour time period:
sum by(path, custom_identifier, kubernetes_namespace) (
  delta(http_request_duration_seconds_count{
    status_code="201",
    method="POST",
    path="/api/v1/path/to/endpoint"
  }[1h])
) == 0
What I want, though, is for this alert to resolve immediately upon resumption of traffic. As currently written, it won't resolve until traffic has been restored for the same time period - 1h.
Is this a thing that Prometheus can do? Is there a different method of accomplishing this goal?
- alert: AlertName
  expr: |
    sum by(path, custom_identifier, kubernetes_namespace) (
      delta(http_request_duration_seconds_count{
        status_code="201",
        method="POST",
        path="/api/v1/path/to/endpoint"
      }[1m])
    ) == 0
  for: 1h
  # recovers about 1 minute after traffic resumes, while "for: 1h" still
  # requires a full hour of no traffic before the alert fires

ScrapeFrequencyInSecs is not working in metrics-collector module

I have deployed the metrics-collector module with a ScrapeFrequencyInSecs value of 60, so it should scrape data every minute, but when I check the data in InsightsMetrics, I am still getting data points every 5 minutes or so.
I am using mcr.microsoft.com/azureiotedge-metrics-collector:1.0 (version 1).
The mcr.microsoft.com/azureiotedge-metrics-collector:1.0 module has a ScrapeFrequencyInSecs default value of 300.
Note: after updating configuration parameters (environment variables), the module needs to be restarted for the change to take effect.
You can refer to the MS Q&A thread: "can we set the interval for sending metrics using Metrics collector module to log analysis workspace".

Spark structured streaming groupBy not working in append mode (works in update)

I'm trying to get a streaming aggregation/groupBy working in append output mode, to be able to use the resulting stream in a stream-to-stream join. I'm working on (Py)Spark 2.3.2, and I'm consuming from Kafka topics.
My pseudo-code is something like the following, running in a Zeppelin notebook:
from pyspark.sql.functions import window, collect_list, struct, count, sum, min

# readStream is a property in PySpark, not a method
orderStream = spark.readStream.format("kafka").option("startingOffsets", "earliest").....

orderGroupDF = (orderStream
    .withWatermark("LAST_MOD", "20 seconds")
    .groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
    .agg(
        collect_list(struct("attra", "attrb2", ...)).alias("orders"),
        count("ID").alias("number_of_orders"),
        sum("PLACED").alias("number_of_placed_orders"),
        min("LAST_MOD").alias("first_order_tsd")
    )
)

debug = (orderGroupDF.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("debug")
    .start()
)
After that, I would have expected data to appear on the debug query so that I can select from it (after the late-arrival window of 20 seconds has expired). But no data ever appears on the debug query (I waited several minutes).
When I change the output mode to update, the query works immediately.
Any hint as to what I'm doing wrong?
EDIT: after some more experimentation, I can add the following (but I still don't understand it).
When starting the Spark application, there is quite a lot of old data (with event timestamps << current time) on the topic from which I consume. After starting, it seems to read all these messages (MicroBatchExecution in the log reports "numRowsTotal = 6224" for example), but nothing is produced on the output, and the eventTime watermark in the log from MicroBatchExecution stays at epoch (1970-01-01).
After producing a fresh message onto the input topic with eventTimestamp very close to current time, the query immediately outputs all the "queued" records at once, and bumps the eventTime watermark in the query.
What I can also see is that there seems to be an issue with the timezone. My Spark program runs in CET (UTC+2 currently). The timestamps in the incoming Kafka messages are in UTC, e.g. "LAST_MOD": "2019-05-14 12:39:39.955595000". I have set spark_sess.conf.set("spark.sql.session.timeZone", "UTC"). Still, the microbatch report after that "new" message has been produced onto the input topic says
"eventTime" : {
"avg" : "2019-05-14T10:39:39.955Z",
"max" : "2019-05-14T10:39:39.955Z",
"min" : "2019-05-14T10:39:39.955Z",
"watermark" : "2019-05-14T10:35:25.255Z"
},
So the eventTime somehow lines up with the time in the input message, but it is 2 hours off; the UTC difference has been subtracted twice. Additionally, I fail to see how the watermark calculation works. Given that I set it to 20 seconds, I would have expected it to be 20 seconds older than the max event time, but apparently it is 4 mins 14 secs older. I fail to see the logic behind this.
I'm very confused...
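For what it's worth, here is a sketch of the usual Structured Streaming watermark rule as I understand it (an assumption, not something stated in the question): the watermark applied to a micro-batch is the maximum event time seen in earlier batches minus the delay, which would explain why it lags well behind the newest event.
from datetime import datetime, timedelta

# Values taken from the progress report above; the rule itself
# (max event time of earlier batches minus the delay) is an assumption
# about how Structured Streaming derives the watermark.
delay = timedelta(seconds=20)                        # withWatermark("LAST_MOD", "20 seconds")
max_event_time = datetime(2019, 5, 14, 10, 39, 39, 955000)

# If this batch's max event time were already reflected, we would expect:
print(max_event_time - delay)   # 2019-05-14 10:39:19.955000

# The reported watermark (10:35:25.255) presumably corresponds to an event time
# observed in an earlier micro-batch: the watermark is only advanced at the end
# of a batch and applied to the next one.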
It seems that this was related to the Spark version 2.3.2 that I used, and maybe more concretely to SPARK-24156. I have upgraded to Spark 2.4.3, and there I get the results of the groupBy immediately (well, of course after the watermark lateThreshold has expired, but "in the expected timeframe").

How can I tell how much time I have left in my Google Colaboratory session?

A Google Colab session expires after 12 hours at the longest. For this reason, I don't know whether it's worth starting to train my model now or waiting until the session has expired so I can start a brand new session.
Is there a way to know how long my session has been active for, or, equivalently, how much time I have left on my session?
Thanks.
import time, psutil

# A Colab VM boots when the session starts, so boot time approximates the session start.
uptime = time.time() - psutil.boot_time()
remain = 12 * 60 * 60 - uptime  # seconds left of the 12-hour maximum
print(f"{remain / 3600:.1f} hours remaining")
Menu -> Runtime -> View runtime logs
Look at the start time (may be on the last page), then add 12 hours.

AWS Cloudwatch alarm set to NonBreaching (or notBreaching) is not triggering, based on a log filter

With the following Metric and Alarm combination
Metric
Comes from a CloudWatch Logs metric filter (incremented when a match is found in the log)
Metric value: "1"
Default value: None
Unit: Count
Alarm
Statistic: Sum
Period: 1 minute
Treat missing data as: notBreaching
Threshold: [Metric] > 0 for 1 datapoint within 1 minute
The alarm goes to:
State changed to OK at 2018/12/17.
Reason: Threshold Crossed: no datapoints were received for 1 period and 1 missing datapoint was treated as [NonBreaching].
And then it doesn't trigger, even though I force the metric above 0.
Why is the alarm stuck in OK? How can the alarm become triggered again?
Solution
Remove the "Unit" property from the stack template Alarm config.
The source of the problem was actually the "Unit" property. This being set to "Count" actually made the alarm become stuck :(
Ensure the stack is producing the same result as a manual alarm setup by checking with the describe-alarms API.
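To do that comparison, a quick check via boto3 could look like the sketch below (the alarm names are made up for illustration):
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm names: one created by the stack, one created by hand.
resp = cloudwatch.describe_alarms(
    AlarmNames=["stack-log-filter-alarm", "manual-log-filter-alarm"])

for alarm in resp["MetricAlarms"]:
    # If the stack-created alarm still carries Unit="Count", it will differ here.
    print(alarm["AlarmName"], alarm.get("Unit"), alarm["StateValue"], alarm["StateReason"])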