Can someone explain this PromQL query to me? - api

I'm new to promQL and I am using it to create grafana dashboard to visualize various API metrics like throughput, latency etc.
For measuring latency I came across these queries being used together. Can someone explain how are they working
histogram_quantile(0.99, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
histogram_quantile(0.95, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
Also I want to write a query which will show me number of API calls with latency greater than 4sec. Can someone please help me there as well?

The provided queries are designed to return 99th and 95th percentiles for the http_request_duration_seconds{path="..."} metric of histogram type over requests received during the last 2 minutes (see 2m in square brackets).
Unfortunately the provided queries have some issues:
They use irate() function for calculating the per-second increase rate of every bucket defined in http_request_duration_seconds histogram. This function isn't recommended to use in general case, because it tends to return jumpy results on repeated queries - see this article for details. So it is better to use rate or increase instead when calculating histogram_quantile.
They multiply the calculated irate() by 30. This has no any effect on query results, since histogram_quantile() normalizes the provided per-bucket values.
So it is recommended to use the following query instead:
histogram_quantile(0.99,
sum(
increase(http_request_duration_seconds_bucket{path="..."}[2m])
) by (le)
)
This query works in the following way:
Prometheus selects all the time series matching the http_request_duration_seconds_bucket{path="..."} time series selector on the selected time range on the graph. These time series represent histogram buckets for the http_request_duration_seconds histogram. Each such bucket contains a counter, which counts the number of requests with duration not exceeding the value specified in the le label.
Prometheus calculates the increase over the last 2 minutes per each selected time series, e.g. how many requests hit every bucket during the last 2 minutes.
Prometheus calculates per-le sums over bucket values calculated at step 2 - see sum() function docs for details.
Prometheus calculates the estimated 99th percentile for the bucket results returned at step 3 by executing histogram_quantile function. The error of the estimation depends on the number of buckets and the le values. More buckets with better le distribution usually give lower error for the estimated percentile.

Related

After importing a metric into Victoria Metrics, the metric is repeated for 5 minutes. What controls this behavior?

I am writing some software that will be pushing data to Victoria Metrics, as below:
curl -d 'foo{bar="baz"} 30' -X POST 'http://[Victoria]/insert/0/prometheus/api/v1/import/prometheus'
I noticed that if I push a single metric like this, it shows up as not a single data point but rather shows up repeatedly as if it was being scraped every 15 seconds, either until I push a new value for that metric or 5 minutes passes.
What setting/mechanism is causing this 5-minute repeat period?
Pushing data with a timestamp does not change this. Metric gets repeated for 5 minutes after that time or until a change regardless.
I don't necessarily need to alter this behavior, just trying to understand why it's happening.
How do you query the database?
I guess this behaviour is due to the ranged query concept and ephemeral datapoints, check this out:
https://docs.victoriametrics.com/keyConcepts.html#range-query
The interval between datapoints depends on the step parameter, which is 5 minutes when omitted.
If you want to receive only the real datapoints, go via export functions.
https://docs.victoriametrics.com/#how-to-export-time-series
TSDB VM has ephemeral dots which fill gaps in the closest sample on the left to the requested timestamp.
So if you make the instant request:
curl "http://<victoria-metrics-addr>/api/v1/query?query=foo_bar&time=2022-05-10T10:03:00.000Z"
The time range at which VictoriaMetrics will try to locate a missing data sample is equal to 5m by default and can be overridden via step parameter.
step - optional max lookback window for searching for raw samples when executing the query. If step is skipped, then it is set to 5m (5 minutes) by default.
GET | POST /api/v1/query?query=...&time=...&step=...
You can read more about key concepts in this part of the documentation
key-concepts
There you can find also information about query range and different concepts about TSDB

Crux dataset Bigquery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the Crux dataset in big query for last 10 days to extract data for data studio report. Though I consider myself good at SQL, as I have mostly worked with oracle and SQL server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi, explored the queries on his github repo but still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics but I am not sure if the Min/Avg/Max lcp (in milliseconds) for a time period (month for example) could be extracted from this table. What other queries could I possibly try which requires less data processing. (Some of the queries that I tried expired my free TB of data processing on big query).
Any suggestion, advise solution, queries are more than welcome since the documentation about the structure of the dataset and queries against it is not very clear.
For details about the fields used on the report you can check on the main documentation for the chrome ux report specially on the last part with data format which shows the dimensions and how its interpreted as show below:
Dimension
origin "https://example.com"
effective_connection_type.name 4G
form_factor.name "phone"
first_paint.histogram.start 1000
first_paint.histogram.end 1200
first_paint.histogram.density 0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a ”4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
For the metrics, in the initial link there is a section called methodology where you can get information about the metrics and dimensions of the report. I recommend going to the actual origin source table per country and per site and not the summary as the data you are looking for can be obtained there. In the Bigquery part of the documentation you will find samples of how to query those tables. I find this relatable:
SELECT
SUM(bin.density) AS density
FROM
`chrome-ux-report.chrome_ux_report.201710`,
UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
bin.start < 1000 AND
origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
About query estimation cost, you can see estimating query cost guide on google official bigquery documentation. But using this tables due to its nature consumes a lot of processing so filter it as much as possible.

What's the best way to account for missing records when performing aggregate queries?

I have a table in QuestDB with IoT sensor data. The usual operation pattern is that sensors write info to a table while they have an active internet connection. This means they are anywhere from a few minutes to a few hours per day or constantly sending me data. When I want to run an aggregate query on top of this, how can I account for missing values?
If I want an average by minute over a 24 hour period, but 4 hours of data is missing, will my results be skewed? For example:
select avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m
It becomes obvious that I'm skipping directly to the next reported value when graphing so instead of a cyclical pattern, I get a sudden cliff when the sensor comes online again:
If you want to fill missing values, there is also the option to use the FILL keyword in SAMPLE BY aggregations. There are a few ways you can use this, such as filling by previous value, linear interpolation, or specify a constant:
select ts, avg(tempFahren) from (iot_logger timestamp(ts)) sample by 1m fill(linear);
There are some more examples of how to use this on the official documentation
Aggregation functions like avg() ignore missing data (for example null values).
So no, your results will not be skewed if your sensors do not send data for some time.

Check number of slots used by a query in BigQuery

Is there a way to check how many slots were used by a query over the period of its execution in BigQuery? I checked the execution plan but I could just see the Slot Time in ms but could not see any parameter or any graph to show the number of slots used over the period of execution. I even tried looking at Stackdriver Monitoring but I could not find anything like this. Please let me know if it can be calculated in some way or if I can see it somewhere I might've missed seeing.
A BigQuery job will report the total number of slot-milliseconds from the extended query stats in the job metadata, which is analogous to computational cost. Each stage of the query plan also indicates input stats for the stage, which can be used to indicate the number of units of work each stage dispatched.
More details about the representation can be found in the REST reference for jobs. See query.statistics.totalSlotMs and statistics.query.queryPlan[].parallelInputs for more information.
BigQuery now provides a key in the Jobs API JSON called "timeline". This structure provides "statistics.query.timeline[].completedUnits" which you can obtain either during job execution or after. If you choose to pull this information after a job has executed, "completedUnits" will be the cumulative sum of all the units of work (slots) utilised during the query execution.
The question might have two parts though: (1) Total number of slots utilised (units of work completed) or (2) Maximum parallel number of units used at a point in time by the query.
For (1), the answer is as above, given by "completedUnits".
For (2), you might need to consider the maximum value of queryPlan.parallelInputs across all query stages, which would indicate the maximum "number of parallelizable units of work for the stage" (https://cloud.google.com/bigquery/query-plan-explanation)
If, after this, you additionally want to know if the 2000 parallel slots that you are allocated across your entire on-demand query project is sufficient, you'd need to find the point in time across all queries taking place in your project where the slots being utilised is at a maximum. This is not a trivial task, but Stackdriver monitoring provides the clearest view for you on this.

Graphing slow counters with prometheus and grafana

We graph fast counters with sum(rate(my_counter_total[1m])) or with sum(irate(my_counter_total[20s])). Where the second one is preferrable if you can always expect changes within the last couple of seconds.
But how do you graph slow counters where you only have some increments every couple of minutes or even hours? Having values like 0.0013232/s is not very human friendly.
Let's say I want to graph how many users sign up to our service (we expect a couple of signups per hour). What's a reasonable query?
We currently use the following to graph that in grafana:
Query: 3600 * sum(rate(signup_total[1h]))
Step: 3600s
Resolution: 1/1
Is this reasonable?
I'm still trying to understand how all those parameters play together to draw a graph. Can someone explain how the range selector ([10m]), the rate() and the irate() functions, the Step and Resolution settings in grafana influence each other?
That's a correct way to do it. You can also use increase() which is syntactic sugar for using rate() that way.
Can someone explain how the range selector
This is only used by Prometheus, and indicates what data to work over.
the Step and Resolution settings in grafana influence each other?
This is used on the Grafana side, it affects how many time slices it'll request from Prometheus.
These settings do not directly influence each other. However the resolution should work out to be smaller than the range, or you'll be undersampling and miss information.
The 3600 * sum(rate(signup_total[1h])) can be substituted with sum(increase(signup_total[1h])) . The increase(counter[d]) function returns counter increase on the given lookbehind window d. E.g. increase(signup_total[1h]) returns the number of signups during the last hour.
Note that the returned value from increase(signup_total[1h]) may be fractional even if signup_total contains only integer values. This is because of extrapolation - see this issue for technical details. There are the following solutions for this issue:
To use offset modifier: signup_total - (signup_total offset 1h) . This query returns correct results if signup_total wasn't reset to zero during the last hour. In this case the sum(signup_total - (signup_total offset 1h)) is roughly equivalent to sum(increase(signup_total[1h])), but returns more accurate integer results.
To use VictoriaMetrics. It returns the expected integer results from increase() out of the box. See this article and this comment for technical details.