AWS CloudWatch alarm set to notBreaching is not triggering, based on a log filter

With the following metric and alarm combination:
Metric (from a CloudWatch Logs metric filter, emitted when a match is found in the log):
- Metric value: 1
- Default value: None
- Unit: Count
Alarm:
- Statistic: Sum
- Period: 1 minute
- Treat missing data as: notBreaching
- Threshold: [Metric] > 0 for 1 datapoint within 1 minute
The alarm goes to:
State changed to OK at 2018/12/17.
Reason: Threshold Crossed: no datapoints were received for 1 period and 1 missing datapoint was treated as [NonBreaching].
And then it doesn't trigger, even though I force the metric > 0.
Why is the alarm stuck in OK, and how can it be made to trigger again?

Solution
Remove the "Unit" property from the stack template Alarm config.
The source of the problem was the "Unit" property: setting it to "Count" made the alarm become stuck :(
You can verify that the stack produces the same result as a manually created alarm by comparing the two with the describe-alarms API.
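A minimal sketch of that comparison with boto3 (the alarm names below are placeholders, not from the original setup):

import boto3

cloudwatch = boto3.client("cloudwatch")

def alarm_config(name):
    # describe-alarms returns the effective alarm configuration;
    # "Unit" only appears if it was explicitly set on the alarm.
    alarm = cloudwatch.describe_alarms(AlarmNames=[name])["MetricAlarms"][0]
    keys = ("Statistic", "Period", "Threshold", "ComparisonOperator",
            "EvaluationPeriods", "TreatMissingData", "Unit")
    return {k: alarm.get(k) for k in keys}

# Placeholder names: one alarm created by the stack, one created by hand.
stack = alarm_config("stack-created-alarm")
manual = alarm_config("manually-created-alarm")

for key, value in stack.items():
    if value != manual[key]:
        print(f"{key}: stack={value!r} manual={manual[key]!r}")

If the stack-created alarm shows Unit set to "Count" while the manually created one has no Unit at all, that is the difference described above.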

Related

Micrometer gauge metric updates its value slowly?

I'm using a Micrometer gauge metric to monitor Http_max_response_time in a Vert.x service (the metric is configured with Prometheus).
When testing, I send a request with a 3-second timeout at 13:15:16 and the gauge returns the correct Http_max_response_time value (3 s). After that request, no further requests with a 3-second timeout are sent to the server, yet the gauge keeps returning Http_max_response_time = 3 seconds until 13:17:51, when it finally updates to a value below 3 s. I think it needs to update more frequently.
My questions here:
How long does the gauge take to update to a new value, or how long does it keep the current value?
What logic does the Http_max_response_time gauge execute? Does it just update a global value and return it when there is an observation?
If my question is not clear, please comment and I will add more detail.
Thanks in advance.
Updated:
vertx-micrometer-metrics uses a Timer metric for response time, and a TimeWindowMax to track the highest value.
Max for basic Timer implementations such as CumulativeTimer, StepTimer is a time window max (TimeWindowMax). It means that its value is the maximum value during a time window. If no new values are recorded for the time window length, the max will be reset to 0 as a new time window starts. Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly. The reason why a time window max is used is to capture max latency in a subsequent interval after heavy resource pressure triggers the latency and prevents metrics from being published.
So we can change the default expiry configuration in DistributionStatisticConfig to a smaller value as needed.
Here is my code to change the TimeWindowMax expiry for metrics whose name contains responseTime (5 seconds in this example):
registry.config().meterFilter(
    new MeterFilter() {
        // Requires io.micrometer.core.instrument.config.MeterFilter,
        // io.micrometer.core.instrument.distribution.DistributionStatisticConfig
        // and java.time.Duration.
        @Override
        public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
            if (id.getName().contains("responseTime")) {
                // Shorten the max time window (expiry) so the recorded max decays sooner.
                return DistributionStatisticConfig.builder()
                        .expiry(Duration.ofSeconds(5))
                        .build()
                        .merge(config);
            }
            return config;
        }
    });
And it worked.

Spark structured streaming groupBy not working in append mode (works in update)

I'm trying to get a streaming aggregation/groupBy working in append output mode, to be able to use the resulting stream in a stream-to-stream join. I'm working on (Py)Spark 2.3.2, and I'm consuming from Kafka topics.
My pseudo-code is something like the below, running in a Zeppelin notebook:
# window, collect_list, struct, count, sum and min come from pyspark.sql.functions
orderStream = spark.readStream.format("kafka").option("startingOffsets", "earliest").....

orderGroupDF = (orderStream
    .withWatermark("LAST_MOD", "20 seconds")
    .groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
    .agg(
        collect_list(struct("attra", "attrb2", ...)).alias("orders"),
        count("ID").alias("number_of_orders"),
        sum("PLACED").alias("number_of_placed_orders"),
        min("LAST_MOD").alias("first_order_tsd")
    )
)

debug = (orderGroupDF.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("debug")
    .start()
)
After that, I would expect data to appear on the debug query so that I can select from it (after the late-arrival window of 20 seconds has expired). But no data ever appears on the debug query (I waited several minutes).
When I change the output mode to update, the query works immediately.
Any hint what I'm doing wrong?
EDIT: after some more experimentation, I can add the following (but I still don't understand it).
When starting the Spark application, there is quite a lot of old data (with event timestamps << current time) on the topic from which I consume. After starting, it seems to read all these messages (MicroBatchExecution in the log reports "numRowsTotal = 6224" for example), but nothing is produced on the output, and the eventTime watermark in the log from MicroBatchExecution stays at epoch (1970-01-01).
After producing a fresh message onto the input topic with eventTimestamp very close to current time, the query immediately outputs all the "queued" records at once, and bumps the eventTime watermark in the query.
What I can also see is that there seems to be a timezone issue. My Spark program runs in CET (currently UTC+2). The timestamps in the incoming Kafka messages are in UTC, e.g. "LAST_MOD": "2019-05-14 12:39:39.955595000". I have set spark_sess.conf.set("spark.sql.session.timeZone", "UTC"). Still, the microbatch report after that "new" message has been produced onto the input topic says
"eventTime" : {
"avg" : "2019-05-14T10:39:39.955Z",
"max" : "2019-05-14T10:39:39.955Z",
"min" : "2019-05-14T10:39:39.955Z",
"watermark" : "2019-05-14T10:35:25.255Z"
},
So the eventTime somehow lines up with the time in the input message, but it is 2 hours off; the UTC offset appears to have been subtracted twice. Additionally, I fail to see how the watermark calculation works. Given that I set it to 20 seconds, I would have expected the watermark to be 20 seconds older than the max event time, but apparently it is 4 minutes 14 seconds older. I fail to see the logic behind this.
I'm very confused...
It seems that this was related to the Spark version 2.3.2 that I used, and maybe more concretely to SPARK-24156. I have upgraded to Spark 2.4.3 and now I get the results of the groupBy immediately (well, of course after the watermark lateThreshold has expired, but "in the expected timeframe").
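For reference, here is a minimal, self-contained sketch of a watermarked groupBy in append mode; it uses the built-in rate source instead of Kafka, so the column names differ from the question. On Spark 2.4+ a window's aggregate only appears in the sink after the watermark (max event time seen minus the 20-second delay) has passed the window end:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("append-mode-demo").getOrCreate()

# The rate source emits the columns "timestamp" and "value".
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

agg = (events
    .withWatermark("timestamp", "20 seconds")
    .groupBy(F.window("timestamp", "10 seconds"))
    .agg(F.count("value").alias("number_of_events"),
         F.min("timestamp").alias("first_event_ts")))

query = (agg.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("debug")
    .start())

# Each window shows up here only after the watermark has moved past its end:
# spark.sql("SELECT * FROM debug ORDER BY window").show(truncate=False)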

Subtract a value x from a Prometheus metric for a Grafana "single stat" with "Delta" activated?

Maybe the problem should be solved in another way.
I am using a Grafana SingleStat pane with "Delta" activated and the dashboard shows me "today so far".
Prometheus metric: sum(request_duration_count{...})
Problem:
I have a metric counting requests. Between 03:00 and 06:00 an automated test triggers my service and the metric is incremented by value x. (I set a Grafana annotation at the starting point.)
I want to get a single stat request count without the test requests.
Advanced Problem:
Nagios checks every y minutes and also increments the counter.
How can I remove these "test-counts"?
Any ideas?

When Elm Html.Program calls subscriptions

I've found a possible answer to this question in a Google Group, but I'd like to know whether it is correct and, if it is, to add a follow-up question.
The answer there is
Every time the global update function in your app runs for any reason,
the global subscriptions object is reevaluated as well, and effect
managers receive the new list of current subscriptions
If subscriptions is called any time the model changes, what is the effect on subscriptions such as Time.every second (taken from the Time effect section of the Elm Guide)? Does that mean the timer gets reset when the model changes? And what if it were Time.every minute: if the model changes 20 seconds after the timer starts, will it fire in 60 - 20 = 40 seconds, or in 1 minute?
You can check when update and subscriptions are called by adding a Debug.log statement to each. The subscriptions function is called first at the start (since the messages which will be sent to update may depend on it) and also after each call to update.
The timer interval seems to be unaffected by subsequent calls to subscriptions. For example, if you take the Elm clock example, change the subscription to
Time.every (10*Time.second) Tick
and add a button to the view which resets the model value to 0, you will see that the tick still takes place at regular 10s intervals, no matter when you click the button.
TL;DR: it will fire in 1 minute, unless you turn your subscription off and on during the first minute.
Every time your update runs, the subscriptions function will run too.
The subscriptions function essentially is a list of things you want your app to subscribe to.
In the example you have a subscription that generates a Tick message every 60 seconds.
The behavior you may expect is:
T= 0s: The first time subscriptions runs, you start your subscription to "receive Tick message every 60 seconds".
T= between 0 and 60s: it doesn't matter how often your update function runs; subscriptions will be run each time, but as long as your particular subscription to Tick remains ON, things are fine.
T= 60s: You receive a Tick message from your subscription, which in turn will fire update to be called.
T= 60s: subscriptions will run again (because of previous call to update)
What could be interesting is what happens if the subscription to Tick is canceled along the way and then reinstated:
T= 0: subscription to Tick
T= 20s: suppose something changes in the model, causing subscription to Tick to be canceled
T= 40s: some other change in model, causing subscription to Tick to be turned on again
T= 100s: Tick message is fired, and passed to update function
T= 100s: subscriptions will run again

eWAM - In Wynsure - Invalid time format error in aWFOperationAssignment object

When processes like GBP Subscription/Member Enrollment/Member Endorsement are performed and accepted, the system throws the error:
"Object of the class type aWFOperationAssignment cannot be stored in the database with the corresponding NSID, ID & Version"
and the transaction is rolled back, with the following error shown in the error report:
"The transaction is roll backed. Err Code= 22007. ErrMsg=SQLState=22007 . [Microsoft][SQL Server Native Client 10.0]Invalid time format"
This happens only in a few of the environments. I'm not sure whether this is a code or configuration issue.
This issue is caused by the "Bank Holidays Context" configuration in Wynsure.
In Bank Holidays (Business Administration -> General Settings -> Bank Holiday), the End Time is supposed to be configured in 24-hour format. If it is configured as, for example, 8 for the start time and 5 for the end time, instead of 8 and 17, the duration is calculated incorrectly: Wynsure subtracts the start time from the end time, so in this case it subtracts 8 from 5 and gets an incorrect duration.
This misconfiguration causes an issue while processing any transaction, because at the completion of a transaction a corresponding operation is created with two fields, "Expected Limit Date" and "Expected Limit Time", and these fields use the difference between the "End Time" and the "Start Time" to calculate the expected date and time limit.
Since the difference between the End Time and the Start Time returns an incorrect value, an invalid date and time is calculated, the system throws the invalid time format error, and the transaction is rolled back.
To fix this issue, configure the "End Time" in 24-hour format (for example, 17 instead of 5).
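To illustrate the arithmetic only (this is not Wynsure's code), here is a small sketch of why the 12-hour values break the duration:

def business_day_hours(start_hour, end_hour):
    # Wynsure subtracts the start time from the end time, as described above.
    return end_hour - start_hour

print(business_day_hours(8, 17))  # 24-hour format: 9 hours, as intended
print(business_day_hours(8, 5))   # 12-hour style entry: -3, an invalid duration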