Should custom 5xx metrics log a value of 0 if there's no error to have a correct SampleCount? - amazon-cloudwatch

Looking at the documentation for Amazon API Gateway dimensions and metrics, I noticed that the gateway's 5XXError metric can be used for both the absolute number of errors and the error rate:
The Sum statistic represents this metric, namely, the total count of the 5XXError errors in the given period. The Average statistic represents the 5XXError error rate, namely, the total count of the 5XXError errors divided by the total number of requests during the period. The denominator corresponds to the Count metric (below).
That seems convenient when creating alerts and dashboards, as we don't need to use a math expression to compute the error rate.
How can I create my own custom 5xx metric that behaves like this? Should I log a 0 value for the metric when I don't see an error and a 1 value when I do, so that the SampleCount statistic corresponds to the number of requests, the Sum to the number of errors, and thus the Average to the error rate?
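For illustration, here is a minimal sketch of the approach described in the question, using boto3 (the namespace, metric name, and dimension are placeholders, not anything prescribed by CloudWatch):

# Sketch: publish 1 for a request that failed with a 5xx, 0 for a request
# that succeeded. If each request logs one data point, then for this metric:
#   Sum         -> number of 5xx errors in the period
#   SampleCount -> number of requests in the period (one data point per request)
#   Average     -> Sum / SampleCount = 5xx error rate
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_request(is_5xx_error: bool) -> None:
    cloudwatch.put_metric_data(
        Namespace="MyApp",                      # placeholder namespace
        MetricData=[
            {
                "MetricName": "5XXError",       # placeholder metric name
                "Dimensions": [{"Name": "Service", "Value": "my-service"}],
                "Value": 1.0 if is_5xx_error else 0.0,
                "Unit": "Count",
            }
        ],
    )

Publishing one data point per request can get expensive at high volume; PutMetricData also accepts pre-aggregated data (the StatisticValues and Values/Counts fields), which preserves the same Sum/SampleCount/Average relationship.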

Related

Can someone explain this PromQL query to me?

I'm new to PromQL and I am using it to create a Grafana dashboard to visualize various API metrics like throughput, latency, etc.
For measuring latency I came across these queries being used together. Can someone explain how they work?
histogram_quantile(0.99, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
histogram_quantile(0.95, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
Also, I want to write a query which will show me the number of API calls with latency greater than 4 seconds. Can someone please help me with that as well?
The provided queries are designed to return the 99th and 95th percentiles for the http_request_duration_seconds{path="..."} metric of histogram type, over requests received during the last 2 minutes (see 2m in square brackets).
Unfortunately, the provided queries have some issues:
They use the irate() function for calculating the per-second increase rate of every bucket defined in the http_request_duration_seconds histogram. This function isn't recommended in the general case, because it tends to return jumpy results on repeated queries - see this article for details. So it is better to use rate() or increase() instead when calculating histogram_quantile().
They multiply the calculated irate() by 30. This has no effect on the query results, since histogram_quantile() normalizes the provided per-bucket values.
So it is recommended to use the following query instead:
histogram_quantile(0.99,
  sum(
    increase(http_request_duration_seconds_bucket{path="..."}[2m])
  ) by (le)
)
This query works in the following way:
1. Prometheus selects all the time series matching the http_request_duration_seconds_bucket{path="..."} selector over the time range selected on the graph. These time series represent the buckets of the http_request_duration_seconds histogram. Each such bucket contains a counter, which counts the number of requests with duration not exceeding the value specified in the le label.
2. Prometheus calculates the increase over the last 2 minutes for each selected time series, i.e. how many requests hit each bucket during the last 2 minutes.
3. Prometheus calculates per-le sums over the bucket values obtained at step 2 - see the sum() function docs for details.
4. Prometheus calculates the estimated 99th percentile from the per-le results of step 3 by executing the histogram_quantile() function. The estimation error depends on the number of buckets and on the le values: more buckets with a better le distribution usually give a lower error for the estimated percentile.
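To make step 4 concrete, here is a simplified Python sketch of bucket-based quantile estimation; it only mirrors the linear-interpolation idea and ignores details of Prometheus's actual implementation (such as the +Inf bucket):

# Estimate a quantile from cumulative histogram buckets.
# buckets: list of (le_upper_bound, cumulative_count), sorted by upper bound.
def estimate_quantile(q, buckets):
    total = buckets[-1][1]          # cumulative count of the last bucket
    rank = q * total                # "rank" of the requested quantile
    prev_bound, prev_count = 0.0, 0.0
    for upper_bound, cum_count in buckets:
        if cum_count >= rank:
            # Assume observations are spread linearly inside this bucket.
            if cum_count == prev_count:
                return upper_bound
            fraction = (rank - prev_count) / (cum_count - prev_count)
            return prev_bound + fraction * (upper_bound - prev_bound)
        prev_bound, prev_count = upper_bound, cum_count
    return buckets[-1][0]

# Example: buckets with upper bounds 0.1s, 0.5s, 1s, 4s holding 50/80/95/100
# requests cumulatively; the 0.99 quantile lands in the (1s, 4s] bucket -> 3.4s.
print(estimate_quantile(0.99, [(0.1, 50), (0.5, 80), (1.0, 95), (4.0, 100)]))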

AWS CloudWatch ELB monitoring active connections

I would like to monitor the maximum number of active connections that my ApplicationELB is managing over a 5-minute period.
The ApplicationELB publishes a metric called ActiveConnectionCount. The documentation describes this in part as:
The total number of concurrent TCP connections active from clients to the load balancer and from the load balancer to targets.
And further states:
The most useful statistic is Sum.
I believe that Sum would total all the active connections reported within the time frame. For example, if the ELB is maintaining 10 connections and reports this number every second, then the Sum over a 5-minute period would be 3,000. This is not what I want. Furthermore, when I use SUM over a 5-minute period I'm getting 20k or so -- far more than the number of real concurrent connections, which is at most a few hundred.
If I aggregate using Maximum, then the number reported by AWS is zero (!?).
If I aggregate using Average, then the number appears to be reasonable (ranging from 80 to 200), but also wildly inaccurate. That is, it almost inversely correlates with new connections and response time: during times of the day when response time is low and new connections are low, average active connections is higher.
So, I guess, here are my questions:
(1) How can I see the maximum number of concurrent connections between the ELB and clients/app servers? (Ideally, I could separate these two, but it doesn't look like the ELB does that.)
Less important, but I'm curious:
(2) Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
(3) Why does the documentation say that SUM should be used?
Thanks for any help / insight!
How can I see the maximum number of concurrent connections between the ELB and clients/app servers? (Ideally, I could separate these two, but it doesn't look like the ELB does that.)
Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
As you said, the ELB does not do that. Among the metrics you can also see something called "SampleCount", which is the number of samples taken during a period of time, by default 1 minute. If we could somehow access the counts in these samples, we could get a minimum and maximum sample. For whatever reason, that is currently broken or not implemented, and min/max show as 0. Therefore, the most useful statistic, in my opinion at least, is the Average, which takes the Sum (of counts) and divides it by the SampleCount.
Why does the documentation say that SUM should be used?
Good question, because if you think about it, it doesn't make much sense and doesn't give you much information, since it's just a sum of the counts across all samples.
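For illustration, a boto3 sketch that pulls those statistics for ActiveConnectionCount and recomputes the average as Sum / SampleCount (the load balancer dimension value is a placeholder):

# Fetch Sum, SampleCount, Average and Maximum for ActiveConnectionCount over
# 5-minute periods, and show that Average ~= Sum / SampleCount.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="ActiveConnectionCount",
    # Placeholder dimension value; use the LoadBalancer suffix of your ALB ARN.
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum", "SampleCount", "Average", "Maximum"],
)

for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(
        dp["Timestamp"],
        "sum:", dp["Sum"],
        "samples:", dp["SampleCount"],
        "avg:", dp["Sum"] / dp["SampleCount"],
        "max:", dp["Maximum"],
    )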

Routing solver: Unknown Fleet size/ Fleet size optimization/ Infinite fleet size

I'm trying to solve the CVRPTW with multiple vehicle types (~42). In my case the fleet size / number of vehicles of each type is unknown. I'm expecting the solver to find the optimal fleet size and the corresponding routes.
Initially I tried to model this problem by creating a large number of vehicles of each type, with their fixed costs. I expected the solver to minimize the fixed cost of the vehicles plus the distance costs. But in that case the solver is unable to find an initial solution within a reasonable duration. I think that is because a large number of vehicles increases the number of possible insertions significantly, and hence the solver fails to explore all possible insertions. I.e.: if vehicle type A has 100 vehicles, during the insertion phase the solver tries to insert a job into all 100 vehicles. But since all the empty routes/vehicles are identical, we should only need to check the insertion cost for all filled vehicles plus 1 empty vehicle.
The jsprit library provides an option to define an INFINITE fleet size. In this case the solver assumes each vehicle type has infinitely many copies. I think during the insertion phase the solver then tries to add a job to the already created routes or to 1 empty route of each vehicle type.
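Not jsprit's API, just a conceptual Python sketch of that pruning idea (all names and cost functions below are hypothetical placeholders): a job is evaluated against every route that already carries jobs, plus a single empty route per vehicle type rather than one per identical vehicle copy.

# Conceptual sketch of the pruning described above: because all empty vehicles
# of a type are interchangeable, a job only needs to be tried against every
# already-used route plus a single empty route per vehicle type.
def best_insertion(job, used_routes, vehicle_types, insertion_cost, fixed_cost):
    candidates = []

    # 1) Try inserting into every route that already has jobs on it.
    for route in used_routes:
        candidates.append((insertion_cost(route, job), route))

    # 2) Try exactly one fresh (empty) route per vehicle type, paying its
    #    fixed cost, instead of one per physical vehicle copy.
    for vtype in vehicle_types:
        empty_route = {"vehicle_type": vtype, "jobs": []}
        cost = fixed_cost(vtype) + insertion_cost(empty_route, job)
        candidates.append((cost, empty_route))

    return min(candidates, key=lambda c: c[0]) if candidates else None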
EDIT 1: I have kept all the jobs (~240) as optional. When I use 10 vehicles of each type (420 vehicles in total), the solver returns a solution with all jobs unassigned after 1 hour. When I reduce the number of vehicles of each type to 1 (42 vehicles in total), the solver returns a feasible solution with 152 unassigned jobs within 60 seconds.
EDIT 2: When I tried to solve it with 1 vehicle type and 200 copies, the solver returned a solution with all jobs unassigned after 1 hour. When I reduced the number of vehicles to 60, I got a solution with all jobs assigned within 2 minutes.

High precision queue statistics from RabbitMQ

I need to log, with the highest possible precision, the rate at which messages enter and leave a particular queue in RabbitMQ. I know the API already provides publishing and delivering rates, but I am interested in capturing raw incoming and outgoing counts over a known period of time, so that I can estimate rates with higher precision and over time periods of my choice.
Ideally, I would check on demand (i.e. on a schedule of my choice), e.g., the current cumulative count of messages that have entered the queue so far ("published" messages) and the current cumulative count of messages consumed ("delivered" messages).
With these types of cumulative counts, I could:
Compute my own deltas of messages entering or exiting the queue, e.g. doing Δ_count = cumulative_count(t) - cumulative_count(t-1)
Compute throughput rates doing throughput = Δ_count / Δ_time
Potentially infer how long messages stay on the queue throughout the day.
The last two would ideally rely on the precise timestamps at which those cumulative counts were read (see the polling sketch below).
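For illustration, a minimal Python polling sketch against the RabbitMQ management HTTP API; the queue, vhost, and credentials are placeholders, and the counters read here are the cumulative ones the management plugin exposes under message_stats:

# Poll cumulative publish/deliver counters for one queue and compute deltas.
import time
import requests

BASE = "http://localhost:15672/api"          # management API, default port
VHOST, QUEUE = "%2F", "my-queue"             # placeholders ("%2F" = "/" vhost)
AUTH = ("guest", "guest")                    # placeholder credentials

def read_counters():
    stats = requests.get(f"{BASE}/queues/{VHOST}/{QUEUE}", auth=AUTH).json()
    msg_stats = stats.get("message_stats", {})
    # Cumulative counters maintained by the management plugin.
    return time.time(), msg_stats.get("publish", 0), msg_stats.get("deliver_get", 0)

t0, pub0, del0 = read_counters()
time.sleep(10)                               # Δ_time of my choice
t1, pub1, del1 = read_counters()

print("published/s:", (pub1 - pub0) / (t1 - t0))
print("delivered/s:", (del1 - del0) / (t1 - t0))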
I am trying to solve this problem directly using RabbitMQ's API, but I'm running into a problem when doing so: when I calculate the cumulative message count in the queue, I get a number that I don't expect.
For example consider the screenshot below.
The Δ_message_count between entries 90 and 91 is 1810 - 1633 = 177. So, as stated above, I would expect the difference between published and delivered messages to be 177 as well (in particular, 177 more messages published than delivered).
However, when I calculate these differences, I see that the difference is not 177:
Δ of published (incoming) messages: 13417517652009 - 13417517651765 = 244
Δ of delivered (outgoing) messages: 1341751765667 - 1341751765450 = 217
so we get 244 - 217 = 27 messages. This suggests that there are 177 - 27 = 150 messages "unaccounted" for.
Why?
I tried taking into account the redelivered messages reported by the API, but they were constant while I ran my tests, suggesting that there were no redelivered messages, so I wouldn't expect that to play a role.

What is the meaning of totalSlotMs for a BigQuery job?

What is the meaning of the statistics.query.totalSlotMs value returned for a completed BigQuery job? Apart from giving an indication of the relative cost of one job vs. another, it's not clear how else one should interpret the number. For example, how does the slot-milliseconds number relate to the Stackdriver-reported total slot usage for a given project (which needs to stay below 2,000 for on-demand BigQuery usage)?
The docs are a bit terse ('[Output-only] Slot-milliseconds for the job.')
The idea is to have a 'slots' metric in the same units in which reservation slots are sold to customers.
For example, imagine that you have a 20-second query that is continuously consuming 4 slots. In that case, your query is using 80,000 totalSlotMs (4 * 20,000).
This way you can determine the average number of slots, even if the peak number of slots differs, since in practice the number of workers fluctuates over the runtime of a query.
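As a sketch of that calculation with the google-cloud-bigquery Python client, which exposes this statistic as QueryJob.slot_millis (the job ID and location are placeholders):

# Average slots ~= totalSlotMs / elapsed wall-clock milliseconds of the job.
from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("my-job-id", location="US")   # placeholder job ID/location

elapsed_ms = (job.ended - job.started).total_seconds() * 1000
avg_slots = job.slot_millis / elapsed_ms

print(f"totalSlotMs={job.slot_millis}, elapsed={elapsed_ms:.0f} ms, "
      f"average slots ~= {avg_slots:.1f}")

# E.g. the 20-second query above: 80,000 slot-ms / 20,000 ms = 4 slots on average.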