Sum over unix timestamps in grafana prometheus - gpu

Lets say I have my_val for a value in prometheus, that is recorded when I use a gpu instance. I want to sum up how many hours or gpu usage I had in the past week. I can call timestamp(myval{instance="$instance"}) which will return a vector with timestamps, but I cannot call sum(idelta) over them because it is an instant vector for some reason.
Grafana also messes with the amount of data requested based on how far I am zooming.
How do I create a reliable call for every datapoint

Try the following query:
count_over_time(my_val[7d]) * <scrape_interval> / 3600
Where <scrape_interval> is scrape interval in seconds set in Prometheus config file for the target that exports my_val metric.
This query assumes that my_val doesn't contain data points when gpu instance wasn't used.

Related

After importing a metric into Victoria Metrics, the metric is repeated for 5 minutes. What controls this behavior?

I am writing some software that will be pushing data to Victoria Metrics, as below:
curl -d 'foo{bar="baz"} 30' -X POST 'http://[Victoria]/insert/0/prometheus/api/v1/import/prometheus'
I noticed that if I push a single metric like this, it shows up as not a single data point but rather shows up repeatedly as if it was being scraped every 15 seconds, either until I push a new value for that metric or 5 minutes passes.
What setting/mechanism is causing this 5-minute repeat period?
Pushing data with a timestamp does not change this. Metric gets repeated for 5 minutes after that time or until a change regardless.
I don't necessarily need to alter this behavior, just trying to understand why it's happening.
How do you query the database?
I guess this behaviour is due to the ranged query concept and ephemeral datapoints, check this out:
https://docs.victoriametrics.com/keyConcepts.html#range-query
The interval between datapoints depends on the step parameter, which is 5 minutes when omitted.
If you want to receive only the real datapoints, go via export functions.
https://docs.victoriametrics.com/#how-to-export-time-series
TSDB VM has ephemeral dots which fill gaps in the closest sample on the left to the requested timestamp.
So if you make the instant request:
curl "http://<victoria-metrics-addr>/api/v1/query?query=foo_bar&time=2022-05-10T10:03:00.000Z"
The time range at which VictoriaMetrics will try to locate a missing data sample is equal to 5m by default and can be overridden via step parameter.
step - optional max lookback window for searching for raw samples when executing the query. If step is skipped, then it is set to 5m (5 minutes) by default.
GET | POST /api/v1/query?query=...&time=...&step=...
You can read more about key concepts in this part of the documentation
key-concepts
There you can find also information about query range and different concepts about TSDB

Can someone explain this PromQL query to me?

I'm new to promQL and I am using it to create grafana dashboard to visualize various API metrics like throughput, latency etc.
For measuring latency I came across these queries being used together. Can someone explain how are they working
histogram_quantile(0.99, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
histogram_quantile(0.95, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))
Also I want to write a query which will show me number of API calls with latency greater than 4sec. Can someone please help me there as well?
The provided queries are designed to return 99th and 95th percentiles for the http_request_duration_seconds{path="..."} metric of histogram type over requests received during the last 2 minutes (see 2m in square brackets).
Unfortunately the provided queries have some issues:
They use irate() function for calculating the per-second increase rate of every bucket defined in http_request_duration_seconds histogram. This function isn't recommended to use in general case, because it tends to return jumpy results on repeated queries - see this article for details. So it is better to use rate or increase instead when calculating histogram_quantile.
They multiply the calculated irate() by 30. This has no any effect on query results, since histogram_quantile() normalizes the provided per-bucket values.
So it is recommended to use the following query instead:
histogram_quantile(0.99,
sum(
increase(http_request_duration_seconds_bucket{path="..."}[2m])
) by (le)
)
This query works in the following way:
Prometheus selects all the time series matching the http_request_duration_seconds_bucket{path="..."} time series selector on the selected time range on the graph. These time series represent histogram buckets for the http_request_duration_seconds histogram. Each such bucket contains a counter, which counts the number of requests with duration not exceeding the value specified in the le label.
Prometheus calculates the increase over the last 2 minutes per each selected time series, e.g. how many requests hit every bucket during the last 2 minutes.
Prometheus calculates per-le sums over bucket values calculated at step 2 - see sum() function docs for details.
Prometheus calculates the estimated 99th percentile for the bucket results returned at step 3 by executing histogram_quantile function. The error of the estimation depends on the number of buckets and the le values. More buckets with better le distribution usually give lower error for the estimated percentile.

How to aggregate the time time between pairs of logs in CloudWatch

Suppose you have logs with some transaction ID and timestamp
12:00: transactionID1 handled by funcX
12:01: transactionID2 handled by funcX
12:03: transactionID2 handled by funcY
12:04: transactionID1 handled by funcY
I want to get the time between 2 logs of the same event and aggregate (e.g. sum, avg) the time difference.
For example, for transactionID1, the time diff would be (12:04 - 12:01) 3min and for transactionID2, the time diff would be (12:03 - 12:02) 1min. Then I'd like to take the average of all these time differences, so (3+1)/2 or 2min.
Is there a way to that?
This doesn't seem possible with CloudWatch alone. I don't know where your logs come from, e.g. EC2, Lambda function. What you could do is to use the AWS SDK to create custom metrics.
Approach 1
If the logs are written by the same process, you can keep a map of transactionID and startTimein memory and create a custom metric with transactionID as dimension and calculate the metric value with the startTime. In case the logs are from different processes e.g. Lambda function invocations, then you can use DynamoDB to store the startTime.
Approach 2
If the transactions are independent you could also create custom metrics per transaction and use CloudWatch DIFF_TIME which will create a calculated metric with values for each transaction.
With CloudWatch AVG it should then be possible to calculate the average duration.
Personally, I have used the first approach to calculate a duration across Lambda functions and other services.

Graphing slow counters with prometheus and grafana

We graph fast counters with sum(rate(my_counter_total[1m])) or with sum(irate(my_counter_total[20s])). Where the second one is preferrable if you can always expect changes within the last couple of seconds.
But how do you graph slow counters where you only have some increments every couple of minutes or even hours? Having values like 0.0013232/s is not very human friendly.
Let's say I want to graph how many users sign up to our service (we expect a couple of signups per hour). What's a reasonable query?
We currently use the following to graph that in grafana:
Query: 3600 * sum(rate(signup_total[1h]))
Step: 3600s
Resolution: 1/1
Is this reasonable?
I'm still trying to understand how all those parameters play together to draw a graph. Can someone explain how the range selector ([10m]), the rate() and the irate() functions, the Step and Resolution settings in grafana influence each other?
That's a correct way to do it. You can also use increase() which is syntactic sugar for using rate() that way.
Can someone explain how the range selector
This is only used by Prometheus, and indicates what data to work over.
the Step and Resolution settings in grafana influence each other?
This is used on the Grafana side, it affects how many time slices it'll request from Prometheus.
These settings do not directly influence each other. However the resolution should work out to be smaller than the range, or you'll be undersampling and miss information.
The 3600 * sum(rate(signup_total[1h])) can be substituted with sum(increase(signup_total[1h])) . The increase(counter[d]) function returns counter increase on the given lookbehind window d. E.g. increase(signup_total[1h]) returns the number of signups during the last hour.
Note that the returned value from increase(signup_total[1h]) may be fractional even if signup_total contains only integer values. This is because of extrapolation - see this issue for technical details. There are the following solutions for this issue:
To use offset modifier: signup_total - (signup_total offset 1h) . This query returns correct results if signup_total wasn't reset to zero during the last hour. In this case the sum(signup_total - (signup_total offset 1h)) is roughly equivalent to sum(increase(signup_total[1h])), but returns more accurate integer results.
To use VictoriaMetrics. It returns the expected integer results from increase() out of the box. See this article and this comment for technical details.

grafana 2, collectd - issues with graphs

So I have collectd running on some servers, they are sending the data back to InfluxDB. InfluxDB is storing the data, and Grafana 2 is configured with the InfluxDB as data backed - some graphs work fine - such as load average, however some doesn't graph properly - like interface statistics (see picture):
http://i.imgur.com/YgIxBE1.png
I'm guessing this is because load average is stored like so:
timestamp1: $current_load_average (ex. 1.2)
timestamp2: $current_load_average (ex. 1.1)
And interface statistics are stored like so:
timestamp1: $bytes_transfered_so_far (ex. 1002)
timestamp2: $bytes_transfered_so_far (ex. 1034)
So Grafana just graphs the total bytes that have been transferred over that interface but not the bytes/second that I need. With the same setup - when collectd was writing to RRD files and they were being graphed by several interfaces - it all worked as expected.
Can you advise what should I look into or change?
Grafana query might look like this:
For counters that are constantly increased you're interested in their derivation with a time window. Depends on your graph resolution (if you're looking on last day or last hour) you should choose appropriate window, where all possible peaks would be visible.
You can either use:
Transformation > derivative()
Transformation > non_negative_derivative()
The latter is useful in cases where you want to omit negative values from your chart.