What is the logic behind the Prometheus sum after rate functionality?

In about two minutes I have 2000 requests - which should be 1000 requests/minute or 17 requests/second.
The total counter works fine and gives me a nice graph:
sum by(status_code) ( request_duration_count{service_id="myserviceId", path=~".*myservice/frag/.*"} )
The request rate results in a flatline at 0:
sum by(status_code) (rate(request_duration_count{service_id="myserviceId", path=~".*myservice/frag/.*"}[1m]))
I think the problem here is that I have one request per URL - which is not ideal, but it is what it is.
The URLs look like this:
https://myserver/myservice/frag/1
https://myserver/myservice/frag/2
https://myserver/myservice/frag/3
https://myserver/myservice/frag/4
https://myserver/myservice/frag/5
...
Each of these URLs is set as the "path" label, so I get 2000 series for this metric.
So if I calculate the rate over one minute I get 0.008 requests per second for each series.
If I sum this up (0.008... * 2000) I should get roughly 17.
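Written out, the arithmetic behind that expectation (assuming each of the 2000 series saw exactly one request over the roughly two minutes):

\[
2000 \times \frac{1\ \text{request}}{120\ \text{s}} \approx 2000 \times 0.0083\ \tfrac{\text{req}}{\text{s}} \approx 16.7\ \tfrac{\text{req}}{\text{s}}
\]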
So why do I have a flatline at zero?

Related

SQL Time Series Group with max amount of results

I have timeseries data in a table using Timescaledb.
Data is as follows:
time                      locationid  parameterid  unitid  value
2022-04-18T10:00:00.000Z  "1"         "1"          "2"     2.2
2022-04-18T10:00:00.000Z  "2"         "1"          "2"     3.0
2022-04-18T09:00:00.000Z  "1"         "1"          "2"     1.2
2022-04-18T09:00:00.000Z  "2"         "1"          "2"     4.0
2022-04-18T08:00:00.000Z  "1"         "1"          "2"     2.6
2022-04-18T08:00:00.000Z  "2"         "1"          "2"     3.1
2022-04-18T07:00:00.000Z  "1"         "1"          "2"     2.1
2022-04-18T07:00:00.000Z  "2"         "1"          "2"     2.7
I have 1000s of rows with time series IOT data that I am putting into graphs using HighCharts.
My question is, is there a way to limit the number of items returned in my results, but not a classic limit? I'd like to return 256 data groups at all times. So if I had 2,560 rows my query would group by / date_trunc / time_bucket every 10 rows, but if I had 512 rows my query would only group every 2 rows, so that I am always returning 256 groups no matter what.
My current query:
SELECT time_bucket('4 hours', time) as "t"
,locationid, avg(timestamp) as "x", avg(value) as "y"
FROM probe_data
WHERE locationid = '${q.locationid}' and parameterid = '${q.parameterid}'
and time > '${q.startDate}' and time < '${q.endDate}'
GROUP BY "t", locationid
ORDER BY "t" DESC;
It seems like I should be able to use my min date and max date to count the number of possible returns and then divide by 256? Is this the best way to do it?
There are a few different ways you can do something like this:
You can just change the time bucket you're using dynamically in your query with time_bucket. You can do arithmetic on intervals and get another interval back, i.e. SELECT (now() - '2022-04-21')/256; will return an interval, and this can be the first input into time_bucket. So something like
SELECT time_bucket((enddate - startdate) / 256, time) as "t"
...
GROUP BY time_bucket((enddate - startdate) / 256, time)
Should do what you're looking for to a large extent...
However, it does mean that you're going to be getting averages of arbitrarily larger groups of data as you zoom out, it doesn't really allow you to cache things or the like, and it probably isn't a great representation of the underlying process.
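To make that concrete, here is a hedged sketch against the probe_data table from the question (the literal dates and ids are placeholders standing in for the startDate/endDate and filter parameters above):

-- Bucket width = (requested time range) / 256, so roughly 256 groups come back
-- regardless of how wide the range is.
SELECT
    time_bucket(('2022-04-19 00:00'::timestamptz - '2022-04-18 00:00'::timestamptz) / 256, time) AS "t",
    locationid,
    avg(value) AS "y"
FROM probe_data
WHERE locationid = '1'
  AND parameterid = '1'
  AND time > '2022-04-18 00:00'
  AND time < '2022-04-19 00:00'
GROUP BY "t", locationid
ORDER BY "t" DESC;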
Another option would be:
You can do an average at a set time_bucket that is relevant to your data analysis and then downsample that using an algorithm like largest triangle three buckets (LTTB), which maintains the visual accuracy of a graph in a useful way while downsampling the data. It's one of the experimental hyperfunctions that we have in TimescaleDB.
This would allow you to also use something like continuous aggregates to downsample the data with a set time_bucket and then get the number of points you need for your graph more quickly using the LTTB algorithm.
So it sort of depends on what you're looking for... In some cases, using LTTB on its own without doing the average, or even something like ASAP smoothing (another experimental hyperfunction), might be a better way to do what you're looking for; both are built in for this type of work. I think the docs pages have more info on the algorithms and what they're useful for, but both LTTB and ASAP are designed specifically for graphing applications, so I thought I'd point them out!
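For illustration, a rough sketch of the LTTB route against the question's probe_data table, assuming the timescaledb_toolkit extension is installed (at the time of writing lttb is experimental, so depending on your version it may live in the toolkit_experimental schema, and the exact unnest pattern can differ; the dates and ids are placeholders):

-- lttb(time, value, resolution) picks ~256 visually representative points;
-- unnest() expands the returned timevector back into (time, value) rows.
SELECT time AS "t", value AS "y"
FROM unnest((
    SELECT lttb(time, value, 256)
    FROM probe_data
    WHERE locationid = '1'
      AND parameterid = '1'
      AND time > '2022-04-18 00:00'
      AND time < '2022-04-19 00:00'
));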
No - SQL doesn't support dynamic grouping. To do what you ask, you'd have to fetch the result set and check the number of records returned, then run the query again with your logic.
Effectively, you have a presentation level issue - not a SQL issue.
Probably something with NTILE, not sure if this would work but I'd imagine doing it something like this:
SELECT avg(sub.timestamp), avg(sub.value)
FROM (
    SELECT
        timestamp,
        value,
        NTILE(256) OVER (ORDER BY time) AS bucket_no
    FROM
        probe_data
) sub
GROUP BY sub.bucket_no;
The inner query breaks all of your data into 256 groups, each row then carries a bucket_no column, and the outer query groups by that bucket_no.
EDIT: just realized the mysql tag on this question is probably inaccurate, but I'll leave the answer as it might point you in the right direction for timescaledb.

CKAN - Why do I only get the first 10 results when the count is higher?

With the CKAN API query I get a count = 47 (that's correct) but only 10 results.
How do I get all (=47) results with the API query?
CKAN API Query:
https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&sort=score+asc
From the source (note: for me the page loads very slowly, be patient):
https://suche.transparenz.hamburg.de/dataset?q=hvv-fahrplandaten+gtfs&sort=score+desc%2Ctitle_sort+asc&esq_not_all_versions=true&limit=50&esq_not_all_versions=true
The count only shows the total number of results found. You can change the number of results returned by setting the rows parameter, e.g. https://suche.transparenz.hamburg.de/api/3/action/package_search?q=title:Fahrplandaten+(GTFS)&sort=score+asc&rows=100. The limit is 1000 rows per query. You can find more info here.

How to get the next page of Google CSE RESTful API results, and confusion about the daily request limit?

I am using the Google CSE RESTful API. My code to get results is:
Google.Apis.Customsearch.v1.CseResource.ListRequest listRequest = svc.Cse.List(query);
listRequest.Cx = cx;
Google.Apis.Customsearch.v1.Data.Search search = listRequest.Fetch();
foreach (Google.Apis.Customsearch.v1.Data.Result result in search.Items)
{
    // do something with items
}
It returns 10 results out of a total of 100. To see the next 10 records I have to:
listRequest.Start = 11;
search = listRequest.Fetch();
And now my search.Items has the results from 11-20.
Now I have 2 questions:
1. Is this the right way to get the results of the next page (the next 10 records)?
2. And would doing so mean that I have consumed 2 requests out of the 100 allowed requests per day?
If this is correct, then effectively a user can only get a total of 1000 results per day from the Google CSE API.
So it means that if I want to see all 100 results of my first query, I would have to make 10 requests.
Thanks,
Wasim
Yes it's the right way: setting the start parameter to the next index will request the next paginated results from your query.
You are also right on the second question: each request (paginated or not) counts against the maximum of 100 allowed per day, resulting in a maximum of 1000 results per day.

Grouping, totaling in Rails and Active Record

I'm trying to group a series of records in Active Record so I can do some calculations to normalize the quantity attribute of each record. For example:
A user enters a date and a quantity. Dates are not unique, so I may have 10 - 20 quantities for each date. I need to work with only the totals for each day, not every individual record, because after determining the highest and lowest value, I convert each one by basically dividing by n, which is usually 10.
This is what I'm doing right now:
def heat_map(project, word_count, n_div)
  return "freezing" if word_count == 0
  words = project.words
  counts = words.map(&:quantity)
  max = counts.max
  min = counts.min
  return "max" if word_count == max
  return "min" if word_count == min
  break_point = (max - min).to_f / n_div.to_f
  heat_index = (((word_count - min).to_f) / break_point).to_i
end
This works great if I display a table of all the word counts, but I'm trying to apply the heat map to a calendar that displays running totals for each day. This obviously doesn't total the days, so I end up with numbers that are out of the normal scale.
I can't figure out a way to group the word counts and total them by day before I do the normalization. I tried doing a group_by and then adding the map call, but I got an undefined method error. Any ideas? I'm also open to better / cleaner ways of normalizing the word counts, too.
Hard to answer without knowing a bit more about your models. So I'm going to assume that the date you're interested in is just the created_at date in the words table. I'm assuming that you have a field in your words table called word where you store the actual word.
I'm also assuming that you might have multiple entries for the same word (possibly with different quantities) in the one day.
So, this will give you an ordered hash of counts of words per day:
project.words.group('DATE(created_at)').group('word').sum('quantity')
If those guesses make no sense, then perhaps you can give a bit more detail about the structure of your models.

Biased random in SQL?

I have some entries in my database, in my case videos with a rating, popularity, and other factors. From all of these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated so that it turns out as an integer that represents the percentage of how often this entry should be hit in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as many on ID 2, and 7 times as many on ID 3.
So every hit should be random, but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work. Consider the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have either only the 2nd element or both elements to choose from randomly.
So 0.5 of the hits go to the second element,
and 0.5 of the hits go to the (second and first) elements, which are then chosen from randomly, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66.
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively so I could just do a range query from (0-sum()), but then I would have to re-index everything coming after an item whenever it changes, or develop some swapping algorithm or something... but that's really not elegant.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "Please choose a random ad with the given probability"... however, I need it for another purpose; this is just to give you a final picture of what it should do.
edit:
Thanks to Ken's answer I thought about the following approach:
calculate a random value from 0-sum(distinct boost)
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor from all distinct boost factors whose running sum surpasses the random value
then we have in our 1st example 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: the count of entries having a given boost is always different. For example, if there is only one 1-boosted entry I get it in 1 out of 10 calls, but if there are 1 million entries with boost 7, each of them is hardly ever returned...
So this doesn't work out :( trying to refine it.
I have to somehow include the count of entries with this boost factor... but I am somewhat stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)

SELECT
    RAND(CHECKSUM(NEWID())) * boost AS weighted,
    SUM(boost) OVER () AS boostcount,
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
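One possible reading of that LOG suggestion (my interpretation, not part of the original answer), reusing the @sample table variable declared above in the same batch:

SELECT TOP (1)
    id,
    RAND(CHECKSUM(NEWID())) * LOG(boost + 1) AS weighted  -- +1 so rows with boost = 1 don't always score 0
FROM @sample
ORDER BY weighted DESC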
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select the sum of boosts and generate some number between 0 and that sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored in a variable, let's call it {random_number}.
Then, select the table rows, calculating a cumulative sum of boosts, and find the first row whose cumulative boost is greater than or equal to {random_number}:
SET @cumulative_boost = 0;
SELECT
    id,
    @cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
    table
-- MySQL doesn't allow a column alias in WHERE, so filter on it with HAVING instead
HAVING
    cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets, then you would have a higher chance to win "the lottery".
Since I didn't trust any of the results I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did, which in your case would look a little bit like this:
-- `values` is backticked because VALUES is a reserved word in MySQL
SELECT `values`.id, `values`.boost
FROM (SELECT id, boost FROM foo) AS `values`
INNER JOIN (
    SELECT id % 100 + 1 AS counter
    FROM user
    GROUP BY counter
) AS numbers ON numbers.counter <= `values`.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
The maximum boost value is not larger than the maximum number returned by the numbers query
That the inner query returns ALL numbers between 1..100. It might not, depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that if a row has a boost of 2 it ends up duplicated in the final result. If a row has a boost of 100 it ends up in the final set 100 times. Or in other words: if the sum of boosts is 4212, which it was in my case, you would have 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you could even create a temporary table that simply holds all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost, as sketched below.
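A hedged sketch of that temporary numbers table idea (foo and boost are the names used in this answer; the 0..9 cross join is just one way to generate 1..100, extend it if your boosts can be larger):

-- One-time setup: a numbers table holding 1..100.
CREATE TEMPORARY TABLE numbers (id INT PRIMARY KEY);
INSERT INTO numbers (id)
SELECT a.i * 10 + b.i + 1
FROM (SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
CROSS JOIN (SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
            UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b;

-- Each row of foo is repeated boost times by the join, so a uniform random pick is weighted by boost.
SELECT f.id
FROM foo AS f
INNER JOIN numbers ON numbers.id <= f.boost
ORDER BY RAND()
LIMIT 1;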