Prometheus: how to rate a sum of the same counter from different machines?

I have a Prometheus counter for which I want to get the rate over a time range (the real goal is to sum the rates, and sometimes to use histogram_quantile on that for histogram metrics).
However, I've got multiple machines running that kind of job, and each one sets its own instance label. As a result, inc operations on this counter on different machines create different entities of the counter, because each combination of label values is unique.
The problem is that rate() works separately on each such counter entity.
The result is that counter entities with unique label combinations are not taken into account by rate().
For example, if I've got:
mycounter{aaa="1",instance="1.2.3.4:6666",job="job1"} value: 1
mycounter{aaa="2",instance="1.2.3.4:6666",job="job1"} value: 1
mycounter{aaa="2",instance="1.2.3.4:7777",job="job1"} value: 1
mycounter{aaa="1",instance="5.5.5.5:6666",job="job1"} value: 1
All counter entities are unique, so each of them has the value 1.
If the label combinations are always unique (because they come from different machines), rate(mycounter[5m]) returns 0 for each of them in this case,
and sum(rate(mycounter[5m])) returns 0 as well, which is not what I need!
I want to ignore the instance label so that these mycounter inc operations are treated as if they were made on the same counter entity.
In other words, I expect to have only 2 entities (they can share a common instance value or have no instance label at all):
mycounter{aaa="1", job="job1"} value: 2
mycounter{aaa="2", job="job1"} value: 2
In such a case, an inc operation on a new machine (with an existing aaa value) would increase an existing entity's counter instead of adding a new entity with the value 1, and rate() would return the real rate for each entity, so we could sum() them.
How do I do that?
I made several tries to solve it but all failed:
Doing a rate() of the sum() fails because of a type mismatch (sum() returns an instant vector, while rate() expects a range vector)...
Removing the automatic instance label using metric_relabel_configs with action: labeldrop in the configuration, but then it assigns the default address value.
Changing all instance values to a common one using metric_relabel_configs with a replacement, but it seems that one of the entities overwrites all the others, so it doesn't help...
Any suggestions?
Prometheus version: 2.3.2
Thanks in Advance!

You'd better expose your counters at 0 on application start if the other labels (aaa, etc.) have a limited set of possible combinations. This way the rate() function works correctly at the bottom level, and sum() will give you correct results.
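To see why initializing at 0 matters, here is a simplified model of how a counter's increase is computed from the samples in a window. It is a sketch, not Prometheus code: a drop is treated as a reset to 0, as Prometheus does, but extrapolation is left out, and the sample values are invented for illustration.

```python
from typing import Optional, Sequence


def simple_increase(samples: Sequence[float]) -> Optional[float]:
    """Increase of one counter series over the samples inside a window.

    Simplified model: a drop is treated as a reset to 0 (as Prometheus does),
    but extrapolation to the window edges is left out.
    """
    if len(samples) < 2:
        return None  # a series needs at least two samples in the window
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical scrapes of one mycounter{aaa="1", ...} series on a new machine.
first_seen_at_one = [1, 1, 1]    # series only appears after the first inc()
initialized_at_zero = [0, 1, 1]  # exposed at 0 on start, then inc() once

print(simple_increase(first_seen_at_one))    # 0.0: the first inc() is invisible
print(simple_increase(initialized_at_zero))  # 1.0
```

When a series is born with the value 1, no sample ever shows the step from 0 to 1, so its rate is 0. Exposing the known label combinations at 0 makes that step observable.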
If you have to do a rate() of the sum(), read this first:
Note that when combining rate() with an aggregation operator (e.g. sum()) or a function aggregating over time (any function ending in _over_time), always take a rate() first, then aggregate. Otherwise rate() cannot detect counter resets when your target restarts.
If you can tolerate this (or the instances reset their counters at the same time), there's a workaround. Define a recording rule as
record: job:mycounter:sum
expr: sum without(instance) (mycounter)
and then this expression works:
sum(rate(job:mycounter:sum[5m]))
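The caveat about resets can be illustrated with a simplified model that uses the same reset rule as Prometheus (any drop counts as a restart from 0) without extrapolation. The two instances and their values are hypothetical. Instance A restarts mid-window. Summed per instance, the increases are correct. Summed first, A's restart makes the whole sum look like it reset.

```python
from typing import Sequence


def simple_increase(samples: Sequence[float]) -> float:
    """Increase of a counter over consecutive samples.

    Like Prometheus, any decrease is treated as a counter reset to 0, so the
    post-reset value is counted in full. Extrapolation is deliberately left out.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical scrapes of mycounter from two instances of the same job.
instance_a = [10, 12, 2, 4]       # restarts between the 2nd and 3rd scrape
instance_b = [100, 101, 102, 103]

sum_of_increases = simple_increase(instance_a) + simple_increase(instance_b)
summed_series = [a + b for a, b in zip(instance_a, instance_b)]
increase_of_sum = simple_increase(summed_series)

print(sum_of_increases)  # 9.0   (rate first, then sum)
print(increase_of_sum)   # 110.0 (sum first: the drop 113 -> 104 looks like a reset)
```

This is why the recording-rule workaround is only safe when restarts are rare or happen on all instances at once.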

The obvious query rate(sum(...)) won't work in most cases, since the resulting sum(...) may hide possible resets to zero for individual time series, which are passed to sum. So usually the correct answer is to use sum(rate(...)) instead. See this article for details.
Unfortunately, Prometheus may miss some increases for slow-changing counters when calculating rate(), as shown in the original question above. The same applies to increase() calculations. See this issue, this comment and this article for details. The Prometheus developers are going to fix these issues; see this design doc.
In the meantime, try VictoriaMetrics when you need exact values from the rate() and increase() functions over slow-changing (and distributed) counters.

Related

OptaPlanner, force PlanningVariable to be filled in sequence of a range

I want to sequence a set of tasks according to some rules. Each task has an index (the PlanningVariable) indicating its position in the sequence, and its range is from 1 to n. One rule needs to minimize the sum of a shadow variable over all tasks. That sum is meaningful only when it is calculated consecutively from index 1 up to some larger number, say from 1 to 5. It's useless to sum, say, 2, 4, 8. Question: Is there any way to force OptaPlanner to assign 1 to a task, then 2, then 3 ... to get potential solutions, so that no effort is wasted?
Take a look at the task assigning example in optaplanner-examples (the sources are in the zip download behind the green button on optaplanner.org). See this video.
It basically uses a CHAINED variable to assign tasks to a (linked) list. Then use a @CustomShadowVariable to calculate the index in that chain for each Task.
In a future version of OptaPlanner, we'll support an Employee having a List<Task> and a Task having an @IndexShadowVariable, which will be a much simpler model. But meanwhile you'll have to work with the chained variable approach.
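The idea behind the chained model can be sketched in plain Python. This is not OptaPlanner's API: `Employee`, `Task`, `previous` and `chain_index` are hypothetical names. The sketch only shows what a custom shadow variable listener would compute. Each task points at what precedes it (the anchor employee or another task), and its index is its distance from the anchor. Indexes are therefore consecutive from 1 by construction, and a gapped assignment like 2, 4, 8 cannot occur.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(eq=False)
class Employee:
    """Anchor of a chain (hypothetical stand-in for the example's Employee)."""
    name: str


@dataclass(eq=False)
class Task:
    name: str
    # Planning variable in the chained model: what comes right before this task.
    previous: Union["Employee", "Task", None] = None


def chain_index(task: Task) -> Optional[int]:
    """1-based position of a task in its chain, or None if it is not anchored."""
    idx = 0
    node: Union[Employee, Task, None] = task
    while isinstance(node, Task):
        idx += 1
        node = node.previous
    return idx if isinstance(node, Employee) else None


ann = Employee("Ann")
write = Task("write", previous=ann)
review = Task("review", previous=write)
deploy = Task("deploy", previous=review)
unassigned = Task("idle")

print([chain_index(t) for t in (write, review, deploy)])  # [1, 2, 3]
print(chain_index(unassigned))                            # None
```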

How might one most efficiently calculate contingent values?

Suppose that I have 10 values n_1, n_2, ... n_10 and that, given any 1 of these values, the other 9 can be calculated. Let f_i(n_j) be the function that calculates the value n_i from the value n_j (where i != j). These functions are relatively simple (i.e. they contain no more than a few exponential functions or powers).
In terms of the functions used, what would be the most efficient way of creating a program to calculate the other 9 values in n_1, ..., n_10 given the 1 that is initially known?
Would the best option be to minimize the number of functions used (and thus minimize the number of lines of code), or to create a function defining every single mapping?
For example, would it be most efficient to use only the 18 functions
f_1(n_2), f_1(n_3), ..., f_1(n_10) [1]
f_2(n_1), f_3(n_1), ..., f_10(n_1) [2]
And then, for whatever input is provided by the user, the value of n_1 may be calculated using the relevant function in line [1], from which every other value of interest may be calculated using the functions in line [2]?
Or would it be better to define all 90 mappings, and so that only a single function (rather than 2 functions) must be called to calculate each of the 9 other values?
Edit: The specific result that I am trying to achieve is as follows...
I am currently using VBA, with a user form of the following format:
The conversion frequency is a required field (so let's just say, for example, that it is always equal to 2 and forget about it). I want to use Change events so that whenever the user changes any of the 6 fields below the conversion frequency field, the other 5 fields are auto-filled with the correct values. However, since the user need only update any one of the six fields, with the other 5 fields being calculated from it, we would require 6 × 5 = 30 different functions to do these calculations. We would thus end up with a lot of repetitive code.
My question regards the best practices to follow when working with a form where one of many inputs may be provided, and all other fields must be updated as a result of the input provided and its value.
Or, equivalently, is there a way to update all fields when the value of one field changes? Can this be done without the number of lines of code required increasing exponentially as the number of fields increases?
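The question's own option [1]/[2] (route every conversion through one hub value) can be sketched for the six interest fields. This assumes the standard compound-interest relations with conversion frequency m = 2. The field names are hypothetical, and Python stands in for VBA. Every field is converted to the effective interest rate i, and all six fields are then recomputed from i. That takes 2 × 6 = 12 small conversions (two of them the identity) instead of 30.

```python
import math

M = 2  # conversion frequency, fixed at 2 as in the question

# Spokes into the hub: each field -> effective interest rate i.
TO_I = {
    "effective_interest": lambda i: i,
    "nominal_interest": lambda i_m: (1 + i_m / M) ** M - 1,
    "force_of_interest": lambda delta: math.exp(delta) - 1,
    "discount_factor": lambda v: 1 / v - 1,
    "effective_discount": lambda d: d / (1 - d),
    "nominal_discount": lambda d_m: (1 - d_m / M) ** -M - 1,
}

# Spokes out of the hub: effective interest rate i -> each field.
FROM_I = {
    "effective_interest": lambda i: i,
    "nominal_interest": lambda i: M * ((1 + i) ** (1 / M) - 1),
    "force_of_interest": lambda i: math.log(1 + i),
    "discount_factor": lambda i: 1 / (1 + i),
    "effective_discount": lambda i: i / (1 + i),
    "nominal_discount": lambda i: M * (1 - (1 + i) ** (-1 / M)),
}


def update_all(changed_field: str, value: float) -> dict:
    """One handler for every field: convert to i, then recompute all six fields."""
    i = TO_I[changed_field](value)
    return {name: to_field(i) for name, to_field in FROM_I.items()}


print(update_all("force_of_interest", 0.05))
```

With this design, adding a seventh field costs two more conversions rather than twelve. In the VBA form, every Change handler could call the same routine and write the results back. Writing to the other text boxes may re-trigger their own Change handlers, so some guard against re-entry is probably needed.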
I think you are grossly overthinking this. Think of it in terms of the formulas you need, of which I think there are 6: 6 functions that take 5 inputs each:
calculateEIR(nominalInterestRate, ForceOfInterest, DiscountFactor, EffectiveDiscountRate, NominalDiscountRate)
calculateNIR(EffectiveInterestRate, ForceOfInterest, DiscountFactor, EffectiveDiscountRate, NominalDiscountRate)
' and so on...
The event handlers and the code that calculates the values are separate concerns. Your Change event handlers simply need to call the correct methods; that is 6 event handlers calling 5 methods each, so 11 functions if you want to keep count. It's a lot of copy-paste. For example:
Private Sub textEffectiveInterestRate_Change()
    Me.textNominalInterestRate.Value = calculateNIR(Me.textEffectiveInterestRate.Value, Me.textForceOfInterest.Value, etc...)
    Me.textForceOfInterest.Value = calculateForceOfInterest(Me.textEffectiveInterestRate.Value, Me.textNominalInterestRate.Value, etc...)
    ' And every other function aside from calculateEIR()
End Sub
I am unsure about the specifics of how you are changing all the values based on a change in the others (since I don't know the formulas), but in general, you should not in any way need 30 functions...

Graphing slow counters with prometheus and grafana

We graph fast counters with sum(rate(my_counter_total[1m])) or with sum(irate(my_counter_total[20s])). The second one is preferable if you can always expect changes within the last couple of seconds.
But how do you graph slow counters where you only have some increments every couple of minutes or even hours? Having values like 0.0013232/s is not very human friendly.
Let's say I want to graph how many users sign up to our service (we expect a couple of signups per hour). What's a reasonable query?
We currently use the following to graph that in grafana:
Query: 3600 * sum(rate(signup_total[1h]))
Step: 3600s
Resolution: 1/1
Is this reasonable?
I'm still trying to understand how all those parameters play together to draw a graph. Can someone explain how the range selector ([10m]), the rate() and the irate() functions, the Step and Resolution settings in grafana influence each other?
That's a correct way to do it. You can also use increase() which is syntactic sugar for using rate() that way.
Can someone explain how the range selector
This is only used by Prometheus, and indicates what data to work over.
the Step and Resolution settings in grafana influence each other?
This is used on the Grafana side, it affects how many time slices it'll request from Prometheus.
These settings do not directly influence each other. However, the resolution should work out to be smaller than the range, or you'll be undersampling and missing information.
The query 3600 * sum(rate(signup_total[1h])) can be replaced with sum(increase(signup_total[1h])). The increase(counter[d]) function returns the counter's increase over the given lookbehind window d. E.g. increase(signup_total[1h]) returns the number of signups during the last hour.
Note that the returned value from increase(signup_total[1h]) may be fractional even if signup_total contains only integer values. This is because of extrapolation - see this issue for technical details. There are the following solutions for this issue:
To use offset modifier: signup_total - (signup_total offset 1h) . This query returns correct results if signup_total wasn't reset to zero during the last hour. In this case the sum(signup_total - (signup_total offset 1h)) is roughly equivalent to sum(increase(signup_total[1h])), but returns more accurate integer results.
To use VictoriaMetrics. It returns the expected integer results from increase() out of the box. See this article and this comment for technical details.
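Where does the fractional result come from? The sketch below is a simplified model of increase() with extrapolation, compared with the offset approach. It assumes a hypothetical signup_total scraped every 60 s, three signups in the last hour, and sample timestamps that never fall exactly on the window edges. The extrapolation rule is my understanding of Prometheus's rate code: extend to an edge if the gap is under 1.1 × the average sample interval, otherwise by half an interval. Counter resets and the clamp at zero are omitted.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, int]  # (timestamp in seconds, counter value)


def extrapolated_increase(samples: List[Sample], start: int, end: int) -> Optional[float]:
    """Simplified increase() over (start, end]: no reset handling, no clamp at zero."""
    window = [(t, v) for t, v in samples if start < t <= end]
    if len(window) < 2:
        return None
    (t_first, v_first), (t_last, v_last) = window[0], window[-1]
    delta = v_last - v_first
    sampled = t_last - t_first
    avg_step = sampled / (len(window) - 1)
    threshold = 1.1 * avg_step
    to_start, to_end = t_first - start, end - t_last
    # Extend to each edge if the gap is small, otherwise by half a sample interval.
    extend_start = to_start if to_start < threshold else avg_step / 2
    extend_end = to_end if to_end < threshold else avg_step / 2
    return delta * (sampled + extend_start + extend_end) / sampled


def value_at(samples: List[Sample], t: int) -> int:
    """Latest value at or before t (Prometheus's staleness rules are ignored)."""
    return [v for ts, v in samples if ts <= t][-1]


# Hypothetical signup_total: scraped every 60 s, 3 signups in the last hour.
signups = [1000, 2000, 3000]
samples = [(t, 10 + sum(s <= t for s in signups)) for t in range(-3570, 3600, 60)]

increase_1h = extrapolated_increase(samples, 0, 3600)
offset_1h = value_at(samples, 3600) - value_at(samples, 0)
print(f"{increase_1h:.2f}")  # 3.05, fractional because of extrapolation
print(offset_1h)             # 3
```

The samples inside the window span only 3540 s. The observed increase of 3 is therefore scaled up to the full 3600 s window. The offset version just subtracts two point values, so it stays an integer, but it is wrong if the counter resets within the hour.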

How predictable is NEWSEQUENTIALID?

According to Microsoft's documentation on NEWSEQUENTIALID, the output of NEWSEQUENTIALID is predictable. But how predictable is predictable? Say I have a GUID that was generated by NEWSEQUENTIALID, how hard would it be to:
Calculate the next value?
Calculate the previous value?
Calculate the first value?
Calculate the first value, even without knowing any GUID's at all?
Calculate the amount of rows? E.g. when using integers, /order?id=842 tells me that there are 842 orders in the application.
Below is some background information about what I am doing and what the various tradeoffs are.
One of the security benefits of using GUID's over integers as primary keys is that GUID's are hard to guess. E.g. say a hacker sees a URL like /user?id=845 he might try to access /user?id=0, since it is probable that the first user in the database is an administrative user. Moreover, a hacker can iterate over /user?id=0..1..2 to quickly gather all users.
Similarly, a privacy downside of integers is that they leak information. /order?id=482 tells me that the web shop has had 482 orders since its implementation.
Unfortunately, using GUID's as primary keys has well-known performance downsides. To this end, SQL Server introduced the NEWSEQUENTIALID function. In this question, I would like to learn how predictable the output of NEWSEQUENTIALID is.
The underlying OS function is UuidCreateSequential. The value is derived from the MAC address of one of your network cards and a per-OS-boot incremental value. See RFC4122. SQL Server does some byte-shuffling to make the result sort properly. So the value is highly predictable, in a sense: if you know one value, you can immediately predict a range of similar values.
However, one cannot predict the equivalent of id=0, nor can one infer from 52DE358F-45F1-E311-93EA-00269E58F20D that the store sold at least 482 items.
The only 'approved' random generation is CRYPT_GEN_RANDOM (which wraps CryptGenRandom) but that is obviously a horrible key candidate.
In most cases, the next NEWSEQUENTIALID value can be predicted by taking the current value and adding one to the first hex pair.
In other words:
1E29E599-45F1-E311-80CA-00155D008B1C
is followed by
1F29E599-45F1-E311-80CA-00155D008B1C
is followed by
2029E599-45F1-E311-80CA-00155D008B1C
Occasionally, the sequence will restart from a new value.
So, it's very predictable
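The "+1 on the first hex pair" pattern from the examples above can be written as a small sketch. The assumption, which is consistent with the observed values and with the carry into the second pair reported in another answer but is not documented behaviour, is that the first four bytes of the displayed GUID form a little-endian counter. The sketch does not model the occasional restart from a new value. `predict_next_sequential_id` is a hypothetical helper name.

```python
import uuid


def predict_next_sequential_id(guid: str) -> str:
    """Guess the next NEWSEQUENTIALID value from the current one.

    Assumption: the first 4 bytes, in the order they are displayed, are a
    little-endian counter that is incremented by 1 (with carry).
    """
    raw = uuid.UUID(guid).bytes                 # 16 bytes in displayed order
    counter = int.from_bytes(raw[:4], "little")
    counter = (counter + 1) % 2**32             # wrap like a 32-bit counter
    nxt = counter.to_bytes(4, "little") + raw[4:]
    return str(uuid.UUID(bytes=nxt)).upper()


print(predict_next_sequential_id("1E29E599-45F1-E311-80CA-00155D008B1C"))
# 1F29E599-45F1-E311-80CA-00155D008B1C
```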
NewSequentialID is a wrapper around the Windows function UuidCreateSequential.
You can try this code:
DECLARE @tbl TABLE (
PK uniqueidentifier DEFAULT NEWSEQUENTIALID(),
Num int
)
INSERT INTO @tbl(Num) values(1),(2),(3),(4),(5)
select * from @tbl
On my machine, at that time, the result was:
PK Num
52DE358F-45F1-E311-93EA-00269E58F20D 1
53DE358F-45F1-E311-93EA-00269E58F20D 2
54DE358F-45F1-E311-93EA-00269E58F20D 3
55DE358F-45F1-E311-93EA-00269E58F20D 4
56DE358F-45F1-E311-93EA-00269E58F20D 5
You should try it several times at different times/dates to work out the behaviour.
I ran it several times, and the first part changed every time (as the results show: 52..., 53..., 54..., etc.). I waited a while to check, and after some time the second part was incremented too. I suppose the increment carries over into all the parts. Basically, it looks like a simple +=1 increment transformed into a GUID.
EDIT:
If you want sequential GUIDs and want to have control over the values, you can use sequences.
Sample code:
select cast(cast(next value for [dbo].[MySequence] as varbinary(max)) as uniqueidentifier)
Calculate the next value? Yes.
Microsoft says:
If privacy is a concern, do not use this function. It is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID.
So it is possible to get the next value. I couldn't find any information on whether it is possible to get the previous one.
from: http://msdn.microsoft.com/en-us/library/ms189786.aspx
edit: another few words about NEWSEQUENTIALID and security: http://vadivel.blogspot.com/2007/09/newid-vs-newsequentialid.html
Edit:
NewSequentialID contains the server's MAC address (or one of them), therefore knowing a sequential ID gives a potential attacker information that may be useful as part of a security or DoS attack.
from: Are there any downsides to using NewSequentialID?

how to store an approximate number? (number is too small to be measured)

I have a table representing standards of alloys. The standard is partly based on the chemical composition of the alloys. The composition is presented in percentages. The percentage is determined by a chemical composition test. Sample data.
But sometimes, the lab cannot measure below a certain percentage. So they indicate that the element is present, but the percentage is less than they can measure.
I was unsure how to accurately store such a number in an SQL database. I thought of storing the number with a negative sign. No element can have a negative composition, of course, but I can interpret the sign as "less than the specified value". Another option is to add another column for each element!! I really don't like the latter option.
Any other ideas? It's a small issue if you think about it, but I think a crowd is always wiser. Somebody might have a neater solution.
Question updated:
Thanks for all the replies.
The test results come from different labs, so there is no common lower bound.
When the percentage of titanium is <0.0004, for example, the number is still important; only the formula differs slightly in this case.
Hence the value cannot be stored as NULL, and I don't know the lower bound for all values.
Tricky one.
Another possibility I thought of is to store it as a string. Any other ideas?
What you're talking about is a sentinel value. It's a common technique. C strings, after all, use 0 as a sentinel end-of-string value. You can do that. You just need to find a number that makes sense and isn't used for anything else. Many string search functions return -1 to indicate that what you're looking for isn't there.
0 might work because if the element isn't there, there shouldn't even be a record. But you also face the problem that it might be mistaken for actually meaning 0. -1 is another option, and it obviously doesn't have that problem.
Another column to indicate whether the amount is measurable or not is also a viable option. The case for this one becomes stronger if you need to store different categories of trace elements (e.g. <1%, <0.1%, <0.01%, etc.). Storing the negative of those numbers seems a bit hacky to me.
You could just store it as NULL, meaning that the value exists but is undefined.
Any arithmetic operation with a NULL yields a NULL.
Division by NULL is safe.
NULL's are ignored by the aggregation functions, so queries like these:
SELECT SUM(metal_percent), COUNT(metal_percent)
FROM alloys
GROUP BY
metal
will give you the sum and the count of the actual, defined values, not taking the unfilled values into account.
I would use a threshold value which is at least one significant digit smaller than your smallest expected value. This way you can logically say that any value less than, say, 0.01 can be presented to your application as a "trace" amount. This remains easy to understand and gives you flexibility in determining where your threshold should lie.
Since the constraints on the values are well defined (there cannot be a negative composition), I would go for the "negative value to indicate less-than" approach. As long as this use of sentinel values is sufficiently documented, it should be reasonably easy to implement and maintain.
An alternative but similar method would be to add 100 to the values, assuming that you can't get more than 100%. So <0.001 becomes 100.001.
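Both sentinel encodings can be sketched as a pair of encode/decode helpers. The function names are hypothetical, and the sketch assumes the lab's detection limit is always greater than 0.

```python
from typing import Tuple


def encode_negative(value: float, below_limit: bool) -> float:
    """Store "<value" as -value; a real composition can never be negative."""
    return -value if below_limit else value


def decode_negative(stored: float) -> Tuple[float, bool]:
    """Return (value, below_limit) from a stored number."""
    return abs(stored), stored < 0


def encode_plus_100(value: float, below_limit: bool) -> float:
    """Variant: store "<value" as value + 100, since no real share exceeds 100 %."""
    return value + 100 if below_limit else value


def decode_plus_100(stored: float) -> Tuple[float, bool]:
    """Return (value, below_limit); anything above 100 is a below-limit marker."""
    return (stored - 100, True) if stored > 100 else (stored, False)


print(decode_negative(encode_negative(0.0004, True)))  # (0.0004, True)
print(decode_negative(0.5))                            # (0.5, False)
```

The catch with either encoding is that every consumer must decode first. A plain SUM over the column would silently mix limits into the measurements. The +100 round trip may also not be bit-exact in floating point.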
I would have a table modeling the certificate, in a one-to-many relationship with another table that stores the values for the elements. In that elements table, I would keep the value in one column and a flag (less than) in a separate column.
Draft:
create table CERTIFICATES
(
PK_ID integer,
NAME varchar(128)
)
create table ELEMENTS
(
ELEMENT_ID varchar(2),
CERTIFICATE_ID integer,
CONCENTRATION number,
MEASURABLE integer
)
Depending on the database engine you're using, the types of the columns may vary.
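Here is a runnable version of the draft, using SQLite from Python's standard library. SQLite accepts REAL where the draft has `number`. The convention here is an assumption: MEASURABLE = 1 means CONCENTRATION is the measured percentage, and MEASURABLE = 0 means CONCENTRATION holds the lab's detection limit, i.e. "less than" this value. The certificate name and element values are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE CERTIFICATES (PK_ID INTEGER PRIMARY KEY, NAME VARCHAR(128));
    CREATE TABLE ELEMENTS (
        ELEMENT_ID     VARCHAR(2),
        CERTIFICATE_ID INTEGER REFERENCES CERTIFICATES (PK_ID),
        CONCENTRATION  REAL,     -- measured percent, or the detection limit
        MEASURABLE     INTEGER   -- 1 = measured value, 0 = below the limit
    );
    """
)
conn.execute("INSERT INTO CERTIFICATES VALUES (1, 'Example certificate')")
conn.executemany(
    "INSERT INTO ELEMENTS VALUES (?, ?, ?, ?)",
    [("Fe", 1, 0.5, 1), ("Ti", 1, 0.0004, 0)],  # illustrative values only
)


def format_concentration(value: float, measurable: int) -> str:
    """Render a stored row the way a report might show it."""
    return f"{value}" if measurable else f"<{value}"


rows = conn.execute(
    "SELECT ELEMENT_ID, CONCENTRATION, MEASURABLE FROM ELEMENTS "
    "WHERE CERTIFICATE_ID = 1 ORDER BY ELEMENT_ID"
).fetchall()
for element, value, measurable in rows:
    print(element, format_concentration(value, measurable))  # Fe 0.5 / Ti <0.0004

# Aggregates can filter on the flag instead of decoding sentinel values.
(measured_total,) = conn.execute(
    "SELECT SUM(CONCENTRATION) FROM ELEMENTS WHERE MEASURABLE = 1"
).fetchone()
```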
Why not add another column to store whether or not it's a trace amount?
This will also allow you to save the amount that the trace is less than.
Since there is no common lowest threshold value and NULL is not acceptable, the cleanest solution now is to have a marker column which indicates whether there is a quantifiable amount or a trace amount present. A value of "Trace" would indicate to anybody reading the raw data that only a trace amount was present. A value of "Quantity" would indicate that you should check an amount column to find the actual quantity present.
I would have to warn against storing numerical values as strings. It will inevitably add additional pain, since you now lose the assertions a strong type definition gives you. When your application consumes the values in that column, it has to read the string to determine whether it's a sentinel value, a numeric value or simply some other string it can't interpret. Trying to handle data conversion errors at this point in your application is something I'm sure you don't want to be doing.
Another field seems like the way to go; call it 'MinMeasurablePercent'.