Find outliers in each band of records - SQL

The goal is to find extremely small or large records for each band based on a formula.
Input:
Distance Rate
10 5
25 200
50 300
1000 5
2000 2000
Bands are defined by my input. For example, I want two distance bands for this input (in reality there are more, around 10): 1-100 and 101-10000.
For each band, we want to find all records whose rates are outliers according to a formula f (two standard deviations away from the mean, if you are interested in the formula).
The formula f I want to use:
(Rate- avg(Rate) over ()) / (stddev(Rate) over ()) > 2
Output:
Distance Rate
10 5
1000 5 (these numbers are for illustrative purposes only)
The difficult part is that I do not know how to do this for each band, which makes applying the formula more difficult.

Without knowing how you intend to apply your formula (my guess would be a UDF), you can create your "bands" by grouping by a CASE expression:
GROUP BY CASE
    WHEN Distance BETWEEN 1 AND 100 THEN 'Band1'
    WHEN Distance BETWEEN 101 AND 10000 THEN 'Band2'
    -- etc.
END
Similarly, you can use the same CASE expression in a RANK() OVER () function, if that works better for the rest of your query.
EDIT: based on your clarification, you need to handle this with a correlated sub-query in your WHERE clause. I would consider encapsulating it in a UDF to make the main query look cleaner. Something like:
WHERE (Rate - {correlated query selecting the AVG(Rate) of all rows in this band, using the above CASE expression to determine "this band"}) / {correlated query selecting the STDDEV(Rate) of all rows in this band} > 2
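As a rough sketch of the per-band calculation outside SQL, the Python below groups rows into bands and applies a two-standard-deviation test within each band. The band boundaries come from the question; the function names, the use of the sample standard deviation, and the absolute value (so that both very small and very large rates are flagged) are assumptions made for illustration.

```python
from statistics import mean, stdev

# Band boundaries from the question; extend the list for more bands.
BANDS = [(1, 100), (101, 10000)]


def band_of(distance):
    """Return the index of the band containing distance, or None."""
    for index, (low, high) in enumerate(BANDS):
        if low <= distance <= high:
            return index
    return None


def outliers_per_band(rows, threshold=2.0):
    """Return (distance, rate) rows whose rate is an outlier within its band."""
    groups = {}
    for distance, rate in rows:
        band = band_of(distance)
        if band is not None:  # rows outside every band are ignored
            groups.setdefault(band, []).append((distance, rate))

    flagged = []
    for members in groups.values():
        rates = [rate for _, rate in members]
        if len(rates) < 2:
            continue  # a standard deviation needs at least two values
        mu, sigma = mean(rates), stdev(rates)
        if sigma == 0:
            continue  # all rates identical, so nothing can be an outlier
        flagged.extend(
            (d, r) for d, r in members if abs(r - mu) / sigma > threshold
        )
    return flagged
```

With only two or three rows per band, as in the sample input, no rate can be two sample standard deviations from its band mean, so the test only becomes meaningful on bands with more rows.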

Related

Query smallest number of rows to match a given value threshold

I would like to create a query that operates similar to a cash register. Imagine a cash register full of coins of different sizes. I would like to retrieve a total value of coins in the fewest number of coins possible.
Given this table:
id value
1 100
2 100
3 500
4 500
5 1000
How would I query for a list of rows that:
has a total value of AT LEAST a given threshold
with the minimum excess value (value above the threshold)
in the fewest possible rows
For example, if my threshold is 1050, this would be the expected result:
id value
1 100
5 1000
I'm working with Postgres and Elixir/Ecto. If it can be done in a single query, great; if it requires a sequence of multiple queries, no problem.
I had a go at this myself, using answers from previous questions:
Using ABS() to order by the closest value to the threshold
Select rows until a sum reduction of a single column reaches a threshold
Based on @TheImpaler's comment above, this prioritises the minimum number of rows over the minimum excess. It's not 100% what I was looking for, so I'm open to improvements if anyone has them, but if not I think this is going to be good enough:
-- outer query selects all rows underneath the threshold
-- inner subquery adds a running total column
-- window function orders by the difference between value and threshold
SELECT
    *
FROM (
    SELECT
        i.*,
        SUM(i.value) OVER (
            ORDER BY
                ABS(i.value - $THRESHOLD),
                i.id
        ) AS total
    FROM
        inputs i
) t
WHERE
    t.total - t.value < $THRESHOLD;
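For comparison, here is a brute-force sketch in Python that follows the priorities stated in the question (minimum excess first, then fewest rows) rather than those of the SQL above. It is not the query's method: it examines every subset, so it is only practical for small tables, and the function name `best_rows` is made up for illustration.

```python
from itertools import combinations


def best_rows(rows, threshold):
    """Pick rows meeting the threshold with least excess, then fewest rows.

    rows is a list of (id, value) pairs. Every subset is examined, so this
    is exponential in len(rows) and only suitable for small tables.
    """
    best_key, best_subset = None, None
    for size in range(1, len(rows) + 1):
        for subset in combinations(rows, size):
            total = sum(value for _, value in subset)
            if total < threshold:
                continue
            key = (total - threshold, size)  # minimise excess, then row count
            if best_key is None or key < best_key:
                best_key, best_subset = key, subset
    return list(best_subset) if best_subset is not None else []
```

On the table from the question with a threshold of 1050, this returns rows 1 and 5, matching the expected result.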

How to calculate dynamic % of grand total as a measure on Power BI?

I have the table below connected to Power BI, and I am looking for a way to create a formula that calculates the % of grand total of the Rating column and then subtracts the target for each rating. For example, the % of grand total for Rating 1 is 3 divided by 7 (42.86%). The most important part of the formula is the denominator, which has to remain at the total level and be dynamic for any filters applied to either the Grade or BU columns. For example, the denominator at the total level would be 7, and when filtered down to the Academy BU it should be 3.
Sample Data Table:
Rating Target Table:
I want the end result to look like this,
I have used the following formula to achieve this,
Measure created: % of total calc = DIVIDE(COUNT('Table'[Rating]),CALCULATE(SUM('Table'[Count]),'Table'[Rating]))
To make the above formula work I had to add an extra column and fill it with ones (see below).
Are there other ways of achieving this outcome?
ALLEXCEPT can produce this result: it removes the filters on the dimensions used in the visual while keeping mandatory filters such as date. The one condition is that rating, date, and any other dimension involved must be in the same table.
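The intended arithmetic (a numerator per rating, and a denominator that follows the active Grade/BU filters) can be sketched in Python. This is not DAX, and the rows below are hypothetical stand-ins for the sample table, chosen only so that Rating 1 is 3 of 7 rows and the Academy BU has 3 rows, as described in the question.

```python
from collections import Counter

# Hypothetical rows standing in for the sample table, which is not shown.
ROWS = [
    {"Rating": 1, "BU": "Academy"},
    {"Rating": 1, "BU": "Academy"},
    {"Rating": 2, "BU": "Academy"},
    {"Rating": 1, "BU": "Other"},
    {"Rating": 2, "BU": "Other"},
    {"Rating": 3, "BU": "Other"},
    {"Rating": 3, "BU": "Other"},
]


def pct_of_total(rows, **filters):
    """Return each rating's share of the rows that pass the filters."""
    selected = [r for r in rows if all(r[k] == v for k, v in filters.items())]
    total = len(selected)  # denominator follows whatever filters are applied
    counts = Counter(r["Rating"] for r in selected)
    return {rating: count / total for rating, count in sorted(counts.items())}
```

Calling `pct_of_total(ROWS)` divides by 7, while `pct_of_total(ROWS, BU="Academy")` divides by 3.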

No sum() with null values

I'm looking for a solution to create sums of roughly 10 scores and targets of a product over 6 different dimensions. There are some more that I won't bother you with. For every dimension I need a total. For example:
SalesPeriod. Product: Bikes. Dimensions: bmx, size, colours, with bars etc. Targets: 1,2,3,4,5. Scores:1,2,3,4,5.
So 10 totals for bmx bikes with size x, colour red and bars, and 10 totals for bmx bikes, size x, colour red etc etc.
However, each score should be totalled only when none of the underlying values is null. For example, if score 1 contains a null then no total is calculated for it, but if score 2 does not contain a null then it should be calculated.
At this point the calculation is done via a CASE statement, which checks the values within each column/score and only calculates the total when the count of scores is equal to the expected number of rows.
The calculation requires a lot of CPU, and with a larger dataset it is very inefficient and simply takes too long.
I'm looking for a solution that will be much more efficient. What would be my best option to try?
You can first filter (or group) the products down to those with only non-null values, using your same count method. I don't think there is any other method.
SELECT columnid, SUM(column1)
FROM table
GROUP BY columnid
HAVING COUNT(column1) = COUNT(*);
Then you can join it on columnid with another similar query on another columnN as well.
(I'm not sure I understood your problem completely, but you basically want an efficient query with sum(scores) and sum(targets) only when they are not null? Or only when they are both not null? Or only scores? Or only targets?)
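To see the HAVING COUNT(column1) = COUNT(*) filter in action, here is a small sketch that runs the same query against an in-memory SQLite database from Python. The table name `scores` and the helper `null_free_sums` are made up for the example; the real schema and database engine will differ.

```python
import sqlite3


def null_free_sums(rows):
    """Sum column1 per columnid, keeping only groups without any NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (columnid TEXT, column1 INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT columnid, SUM(column1)
        FROM scores
        GROUP BY columnid
        HAVING COUNT(column1) = COUNT(*)  -- COUNT(column1) skips NULLs
        ORDER BY columnid
        """
    ).fetchall()
    conn.close()
    return result
```

A group such as ("b", 3), ("b", NULL) is dropped entirely, because COUNT(column1) is 1 while COUNT(*) is 2.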

MDX calculations and grouping

I have a query that returns results for entries that match a calculation of two floating point numbers, i.e., when a weighted sum of the two numbers is in a certain range:
Select
    NON EMPTY Measures.AllMembers on 0
from (
    Select
        Filter(MyDim.[p1].Children * MyDim.[p2].Children,
            MyDim.[p1].CurrentMember.MemberValue * 0.5 +
            MyDim.[p2].CurrentMember.MemberValue * 0.5 >= 70
            and
            MyDim.[p1].CurrentMember.MemberValue * 0.5 +
            MyDim.[p2].CurrentMember.MemberValue * 0.5 <= 90)
    on 0
    from [MyCube])
This query is generated by C# code, and the 0.5's, the 70, and the 90 change.
First off, is there a better way to do this?
Next, how do I create a query that will return ranges of the results with the measures? Something like
-----------------------------------------------
< 70 | blah blah blah measures measures
70 - 90 | blah blah blah measures measures
> 90 | blah blah blah measures measures
-----------------------------------------------
If it does this all by itself (i.e., creates the buckets by magic), that's great, but I wouldn't mind having to find out the possible ranges first and then writing out the whole query by hand (or in code). For now, I can't even work out how to create WITH members or sets or whatever else, rather than having to run individual queries one after another.
Edit: for one parameter, this works:
WITH
    Member MyDim.p1.[<70] as
        Aggregate(Filter(MyDim.p1.members,
            MyDim.p1.CurrentMember.MemberValue < 70))
    Member MyDim.p1.[70 - 90] as
        Aggregate(Filter(MyDim.p1.members,
            MyDim.p1.CurrentMember.MemberValue >= 70
            and MyDim.p1.CurrentMember.MemberValue <= 90))
    Member MyDim.p1.[>90] as
        Aggregate(Filter(MyDim.p1.members,
            MyDim.p1.CurrentMember.MemberValue > 90))
Select {MyDim.p1.[<70], MyDim.p1.[70 - 90], MyDim.p1.[>90]} on 1,
    measures.Members on 0
from MyCube
This doesn't seem to work for the 2 parameter query.
EDIT: further info
What do you mean by "parameters"? Values from different hierarchies?
Yes, exactly. p1 and p2 in the query above.
How do you want to determine the buckets? You do not explain the rationale behind the "magic".
Ideally, it would be broken down into buckets with equal numbers of observations, just like "discretization" does when building a cube - that would be "magic". I'm planning to settle for just taking the min and max values, then breaking the range up into n (say, 10) buckets of size (max - min)/n.
What do you mean by "does not work"? What did you try and what error messages did you get?
I'll have to write it again and post the query and the results here - will do in a couple of hours. I think what I tried was the 2nd query, but with p1*p2 in the Filter bit, with the weighted sum in the filter condition. I was trying to put it all into p1 hierarchy. From memory, it ran, but returned all results without filtering anything. I appreciate this is vague, and will update it here. I just thought it was so wholesale wrong, that I didn't bother putting that particular experiment in the original question.
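The fallback described above (take the min and max, then split the range into n buckets of width (max - min)/n) can be sketched in Python on the weighted sums. This only illustrates the arithmetic and is not MDX; the function names are made up, and the 0.5 weights are the example values from the query.

```python
def weighted_sum(p1, p2, w1=0.5, w2=0.5):
    """Weighted sum of the two parameters, as in the Filter condition."""
    return w1 * p1 + w2 * p2


def bucket_index(value, low, high, n):
    """Return the 0-based index of the equal-width bucket containing value."""
    if high == low:
        return 0  # degenerate range: everything falls in one bucket
    width = (high - low) / n
    index = int((value - low) / width)
    return min(max(index, 0), n - 1)  # clamp so value == high is in the last bucket


def bucket_counts(pairs, n=10):
    """Count how many (p1, p2) pairs fall into each of n equal-width buckets."""
    scores = [weighted_sum(p1, p2) for p1, p2 in pairs]
    low, high = min(scores), max(scores)
    counts = [0] * n
    for score in scores:
        counts[bucket_index(score, low, high, n)] += 1
    return counts
```

The bucket boundaries computed this way could then be written into generated WITH members, one per range.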

Dynamic use of MDX AVG function

Anyone have advice on how to build an average measure that is dynamic -- it doesn't specify a particular slice but instead uses your current view? I'm working within a front-end OLAP viewer (Strategy Companion) and I need a "dynamic" implementation based on the dimensions that are currently filtered in the data view.
My fact table looks something like this:
Key AmountA IndicatorA AmountB Other Data
1 5 1 null 25
2 6 1 null 52
3 7 1 2 106
4 null 0 4 108
Now I can specify a simple average for "[Measures].[AmountA]" with "[Measures].[AmountA] / [Measures].[IndicatorA]" which works great - "[IndicatorA]" sums up to the number of non-null values of "[AmountA]". And this also works great no matter what dimensions are selected in the view - it always divides by the count of rows that have been filtered in.
But what about [AmountB]? I don't have a null indicator column. I want to get an average value of [AmountB] for whatever rows have been filtered in for my current view. If I try to use the count of rows as a simple formula (pseudo-code "[Measures].[AmountB] / Count([Measures].[Key])") I get the wrong result, because it counts all the null rows in the average.
So, I need a way to use the AVG function to specify the average of [AmountB] over the set of "whatever rows I'm currently filtering in, based on whatever dimensions I'm currently using". How do I specify this dynamic set?
I've tried several different uses of the AVG function and they have either returned null or summed up to huge numbers, clearly not the average I'm looking for.
Thanks-
Matt
Sorry, my first suggestion was wrong. If you don't have access to the OLAP cube, I don't think you can write an MDX query for this purpose, because at that access level you have no detailed data from your fact table; you can only use the aggregated data and dimensions from your cube.
Otherwise (if you have access to the OLAP database), you can create this metric (a count of non-NULL rows) in your measure group and then use it for the average calculation, either as a calculated member in your cube or in the WITH section of your MDX query.
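The arithmetic behind that suggestion, dividing the sum by a count of non-null rows rather than by the row count, can be illustrated with a short Python sketch using the AmountA and AmountB columns from the fact table in the question. The function name is made up; in the cube, the non-null count would be a measure rather than a Python list.

```python
def average_non_null(values):
    """Average the values that are present, ignoring None (NULL) entries."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Columns from the fact table in the question (None stands for null).
amount_a = [5, 6, 7, None]
amount_b = [None, None, 2, 4]
```

For AmountB the non-null average is (2 + 4) / 2 = 3, whereas dividing by all four rows would give 1.5, which is the wrong result described in the question.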