Prometheus query returns inconsistent results - API

I have some data in Prometheus. A job runs every 2 minutes on a server and pushes values to Prometheus' Pushgateway, and that's how they reach Prometheus. Now I'm trying to query this data with the HTTP API, and I'm noticing that it returns inconsistent results: it either returns the data I expect to see or it doesn't return anything at all.
My queries are range queries where, for example, start = now() - 1w and end = now(). The problem seems to show up when I use high values for the step/resolution. The only step that seems to work all of the time is 5m. When I try 10m it sometimes works, but usually it doesn't. I'm guessing this depends on the time I send the request (maybe using the current time breaks something).
Why is this happening?

Try setting the step to the size of the duration in the [], e.g. if your query looks like http_requests_total{job="prometheus"}[5m], use step = 5m.
If you want 10m, change both the duration and the step to 10m.
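For illustration, here is a minimal Python sketch of such a range query against the HTTP API, assuming a Prometheus server at localhost:9090 and a placeholder metric/selector; the point is only that the step matches the [10m] window:

import time
import requests

end = time.time()
start = end - 7 * 24 * 3600  # one week back, as in the question

resp = requests.get(
    "http://localhost:9090/api/v1/query_range",
    params={
        "query": 'rate(http_requests_total{job="prometheus"}[10m])',
        "start": start,
        "end": end,
        "step": "10m",  # keep the step equal to the duration in []
    },
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], len(series["values"]), "points")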

Related

Django - Iterating over Raw Query is slow

I have a query which uses a window function. I am using a raw query to filter on that new field, since Django doesn't allow filtering on a window function (at least in the version I am using).
So it would look something like this (simplified):
# Returns 440k lines
user_files = Files.objects.filter(file__deleted=False).filter(user__terminated__gte=today).annotate(
    row_number=Window(expression=RowNumber(), partition_by=[F("user")], order_by=[F("creation_date").desc()])
)
I am basically trying to get the latest non-deleted file for each user who is not terminated.
Afterwards I use the following raw query to get what I want:
# returns 9k lines
sql, params = user_files.query.sql_with_params()
latest_user_files = Files.objects.raw(f'select * from ({sql}) sq where row_number = 1', params)
If I run these queries directly in the database, they run quite quickly (about 300 ms). But once I try to iterate over them, or even just print them, it takes a very long time to execute: anywhere from 100 to 200 seconds, even though the query itself takes a little less than half a second. Is there anything I am missing? Is the extra field row_number in the raw query an issue?
Thank you for any hint/answers.
(Using Django 3.2 and Python 3.9)
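As a way to narrow this down, here is a small hypothetical sketch (using the names from the question) that separates the SQL time from the Python-side iteration and model instantiation; connection.queries only records timings when DEBUG=True:

import time
from django.db import connection, reset_queries

reset_queries()
t0 = time.perf_counter()
rows = list(latest_user_files)   # forces execution and model instantiation
t1 = time.perf_counter()

print(f"fetched {len(rows)} rows in {t1 - t0:.1f}s")
for q in connection.queries:     # per-statement time as reported by the backend
    print(q["time"], q["sql"][:120])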

Query fast with literal but slow with variable

I am using TypeORM for SQL Server in my application. When I run a native query like connection.query("select * from user where id = 1"), the performance is really good and it takes less than 0.5 seconds.
If we use the findOne or QueryBuilder method, it takes around 5 seconds to get a response.
On further debugging, we found that passing the value directly into the query like this,
return getConnection("vehicle").createQueryBuilder()
.select("vehicle")
.from(Vehicle, "vehicle")
.where("vehicle.id='" + id + "'").getOne();
is faster than
return getConnection("vehicle").createQueryBuilder()
.select("vehicle")
.from(Vehicle, "vehicle")
.where("vehicle.id =:id", {id:id}).getOne();
Is there any optimization we can do to fix the issue with parameterized query?
I don't know TypeORM, but the difference seems clear to me: in one case you query the database for the whole table and filter it locally, and in the other you send the filter to the database, which filters the data before sending it back to the client.
Depending on the size of the table this has a big impact. Consider picking one record out of 10 million: just the time to transfer the data to the local machine is 10 million times longer.
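As a generic illustration of that argument (this is not TypeORM; an in-memory sqlite3 table is used as a stand-in), compare filtering in the database with filtering on the client:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vehicle (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO vehicle VALUES (?, ?)",
                 [(i, f"vehicle-{i}") for i in range(100_000)])

# Filter in the database: only the matching row crosses the interface.
fast = conn.execute("SELECT * FROM vehicle WHERE id = ?", (42,)).fetchone()

# Filter on the client: every row is fetched, then all but one are discarded.
slow = next(r for r in conn.execute("SELECT * FROM vehicle") if r[0] == 42)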

How to use Vulkan Timestamp Queries?

This is the simplified pseudocode where I'm trying to measure a GPU workload:
for(N) vkCmdDrawIndexed();
vkCmdWriteTimestamp(VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT);
vkCmdWriteTimestamp(VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT);
submit();
vkDeviceWaitIdle();
vkGetQueryPoolResults();
Things to note:
N is 224 in my case
I have to wait for an idle device - without it, I keep getting a validation error telling me that the data is not ready, even though I have multiple query pools in flight
By putting the first timestamp there, I expect the query value to be written as soon as all previous commands have reached the preprocessing step. I was pretty sure that all 224 commands are preprocessed almost at the same time, but reality shows that this is not true.
By putting the second timestamp, I expect the query value to be written after all previous commands have finished, i.e. the time difference between the two query values should give me the time it takes the GPU to do all the work for a single frame.
I'm taking into account VkPhysicalDeviceLimits::timestampPeriod (1 on my machine) and VkQueueFamilyProperties::timestampValidBits (64 on my machine)
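For reference, the conversion being described is just the tick difference multiplied by timestampPeriod (nanoseconds per tick); a small sketch with made-up raw values chosen to reproduce one of the readings below:

# raw 64-bit tick values as returned by vkGetQueryPoolResults() (hypothetical)
t_top, t_bottom = 123_456_789, 123_458_837
timestamp_period_ns = 1.0   # VkPhysicalDeviceLimits::timestampPeriod on this machine

elapsed_ns = (t_bottom - t_top) * timestamp_period_ns
print(f"{elapsed_ns / 1e6:.6f} ms")   # 2048 ticks -> 0.002048 ms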
I created a big dataset that visually takes approx 2 seconds (~2000 ms) to render a single frame. But the calculated time has only two different values - either 0.001024 ms or 0.002048 ms - so the frame-by-frame output can look like this:
0.001024ms
0.001024ms
0.002048ms
0.001024ms
0.002048ms
0.002048ms
...
I don't know about you, but I find these values VERY suspicious, and I have no explanation for them. Maybe by the time the last draw command reaches the command processor all the work is already done, but why the hell 1024 and 2048??
I tried to modify the code and move the first timestamp above, i.e.:
vkCmdWriteTimestamp(VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT);
for(N) vkCmdDrawIndexed();
vkCmdWriteTimestamp(VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT);
Now, by the time the preprocessor hits the timestamp command, it writes the query value immediately, because there was no previous work and nothing to wait for (remember, the device was idle). This time I get different values, closer to the truth:
20.9336ms
20.9736ms
21.036ms
21.0196ms
20.9572ms
21.3586ms
...
which is better, but still nowhere near the expected ~2000 ms.
What's going on? What happens inside the device when I write timestamps, and how do I get correct values?
While commands in Vulkan can be executed out of order (within certain restrictions), you should not broadly expect commands to be executed out of order. This is especially true of timer queries which, if they were executed out of order, would be unreliable in terms of their meaning.
Given that, your code is saying, "do a bunch of work. Then query the time it takes for the start of the pipe to be ready to execute new commands, then query the time it takes for the end of the pipe to be reached." Well, the start of the pipe might only be ready to execute new commands once most of the work is done.
Basically, what you think is happening is this:
top     work work work work work work | timer
stage1            work work work work work work
stage2                      work work work work work work
bottom                                work work work work work work | timer
But there's nothing that requires GPUs to execute this way. What is almost certainly actually happening is:
time->
top     work work work work work work | timer
stage1    work work work work work work
stage2     work work work work work work
bottom      work work work work work work | timer
So your two timers are only getting a fraction of the actual work.
What you want is this:
top     timer | work work work work work work
stage1            work work work work work work
stage2             work work work work work work
bottom              work work work work work work | timer
This queries the time from start to finish for the entire set of work.
So put the first query before the work whose time you want to measure.

Is there a way to check if a query hit a cached result or not in BigQuery?

We are performance tuning some of our queries, both in terms of cost and speed, and the results we get are a little bit weird. First, we had one query that overwrote an existing table; we stopped that one after 4 hours. Running the same query into an entirely new table took only 5 minutes. I wonder if the 5-minute query maybe used a cached result from the first run - is that possible to check? Is it possible to force BigQuery not to use the cache?
If you run the query in the UI, expand Options and make sure Use Cached Results is set the way you want.
Also in the UI, you can check the Job Details to see whether a cached result was used.
If you run your query programmatically, you should use the respective attributes: configuration.query.useQueryCache and statistics.query.cacheHit.
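For example, with the Python client library these map onto QueryJobConfig.use_query_cache and QueryJob.cache_hit (a minimal sketch; the query and table name are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Disable the cache to force BigQuery to actually run the query.
job_config = bigquery.QueryJobConfig(use_query_cache=False)
job = client.query("SELECT COUNT(*) FROM `project.dataset.table`",
                   job_config=job_config)
job.result()                        # wait for completion

print("cache hit:", job.cache_hit)  # True if a cached result was served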

Graphing slow counters with prometheus and grafana

We graph fast counters with sum(rate(my_counter_total[1m])) or with sum(irate(my_counter_total[20s])), where the second one is preferable if you can always expect changes within the last couple of seconds.
But how do you graph slow counters where you only get some increments every couple of minutes or even hours? Having values like 0.0013232/s is not very human-friendly.
Let's say I want to graph how many users sign up to our service (we expect a couple of signups per hour). What's a reasonable query?
We currently use the following to graph that in grafana:
Query: 3600 * sum(rate(signup_total[1h]))
Step: 3600s
Resolution: 1/1
Is this reasonable?
I'm still trying to understand how all those parameters play together to draw a graph. Can someone explain how the range selector ([10m]), the rate() and irate() functions, and the Step and Resolution settings in Grafana influence each other?
That's a correct way to do it. You can also use increase() which is syntactic sugar for using rate() that way.
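A quick worked example of that equivalence, with hypothetical numbers: if 5 signups happened during the last hour, then

increments = 5                  # signups in the last hour
rate_per_s = increments / 3600  # what rate(signup_total[1h]) reports: ~0.00139/s
per_hour = 3600 * rate_per_s    # the query from the question: 5 signups/h
# increase(signup_total[1h]) returns the same 5 directly (modulo extrapolation,
# discussed further below)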
"Can someone explain how the range selector"
This is only used by Prometheus, and indicates what data to work over.
"the Step and Resolution settings in Grafana influence each other?"
This is used on the Grafana side; it affects how many time slices Grafana will request from Prometheus.
These settings do not directly influence each other. However, the resolution should work out to be smaller than the range, or you'll be undersampling and missing information.
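A concrete (hypothetical) illustration of that interplay: with a 1h range and a 3600 s step the evaluation windows tile the dashboard's time span exactly, while a step larger than the range leaves gaps.

RANGE_S = 3600            # the [1h] in rate(signup_total[1h])
STEP_S = 3600             # Grafana step
WINDOW_S = 7 * 24 * 3600  # one week shown on the dashboard

points = WINDOW_S // STEP_S                 # 168 evaluations requested by Grafana
coverage = points * min(RANGE_S, STEP_S) / WINDOW_S
print(points, "points,", coverage * 100, "% of the week covered")
# step >  range -> coverage < 100 % (undersampling: increments can be missed)
# step <= range -> full coverage (overlapping windows if step < range)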
The 3600 * sum(rate(signup_total[1h])) can be substituted with sum(increase(signup_total[1h])). The increase(counter[d]) function returns the counter increase over the given lookbehind window d. E.g. increase(signup_total[1h]) returns the number of signups during the last hour.
Note that the value returned from increase(signup_total[1h]) may be fractional even if signup_total contains only integer values. This is because of extrapolation - see this issue for technical details. There are the following solutions for this issue:
Use the offset modifier: signup_total - (signup_total offset 1h). This query returns correct results if signup_total wasn't reset to zero during the last hour. In this case sum(signup_total - (signup_total offset 1h)) is roughly equivalent to sum(increase(signup_total[1h])), but returns more accurate integer results.
Use VictoriaMetrics, which returns the expected integer results from increase() out of the box. See this article and this comment for technical details.