How to generate ranges of a column based on condition - sql

There is a column with numbers. I would like to develop a report that categorizes the values of this column into ranges (lower limit and upper limit). A split must happen whenever the difference between consecutive values is more than 10. Is this achievable with a query in either Power BI or SQL Server?

In SQL, I would use lag() and a window sum() to define the groups, and then aggregate:
select min(x) lower_limit, max(x) upper_limit
from (
    select x, sum(case when x <= lag_x + 10 then 0 else 1 end) over(order by x) grp
    from (select x, lag(x) over(order by x) lag_x from mytable) t
) t
group by grp
lag() gives you the previous value. The window sum then implements the following logic: every time the difference between the current and the previous value is more than 10, a new group starts. Finally, the outer query aggregates by group and computes the lower and upper bounds.
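To make the logic concrete, here is a minimal, self-contained sketch; the table name and values are made up for illustration:
with mytable(x) as (
    select * from (values (1), (5), (20), (25), (100)) v(x)
)
select min(x) lower_limit, max(x) upper_limit
from (
    select x, sum(case when x <= lag_x + 10 then 0 else 1 end) over(order by x) grp
    from (select x, lag(x) over(order by x) lag_x from mytable) t
) t
group by grp
order by grp;
-- grp 1: rows 1, 5   -> range (1, 5)
-- grp 2: rows 20, 25 -> range (20, 25)
-- grp 3: row 100     -> range (100, 100)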

GMB's solution is definitely the canonical approach to solving this, by treating it as a variant of gaps-and-islands. I was wondering if there is a way to do this without two levels of subqueries. And there is:
select coalesce(lag(next_x) over (order by x), first_x) as lower,
       x as upper
from (select t.*,
             first_value(x) over (order by x) as first_x,
             lead(x) over (order by x) as next_x
      from t
     ) t
where next_x is null or next_x > x + 10;
Here is a db<>fiddle.
It would be interesting to compare the performance on a large set of data -- 2 window functions + aggregation versus 3 window functions + filtering.
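For a quick sanity check, this variant run against the same made-up data as in the sketch above yields the same islands:
with t(x) as (
    select * from (values (1), (5), (20), (25), (100)) v(x)
)
select coalesce(lag(next_x) over (order by x), first_x) as lower,
       x as upper
from (select t.*,
             first_value(x) over (order by x) as first_x,
             lead(x) over (order by x) as next_x
      from t
     ) t
where next_x is null or next_x > x + 10;
-- returns (1, 5), (20, 25), (100, 100)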

Related

TSQL - how can I sum values except absolutes

I would like to sum the values in my table, except the ones that are marked absolute (column absolute, value = 1). In that case, the running sum should reset. Example:
Date      Value  Absolute
1-1-2020  4      0
1-2-2020  7      1
1-3-2020  3      0
A regular SUM() would return (4+7+3 =) 14. But in this example it should reset at the value 7, which makes a sum of (7+3 =) 10.
How can I make this work?
You seem to want a window sum that resets every time absolute is 1. If so, you can do:
select t.*, sum([value]) over(partition by grp order by [date]) sum_value
from (
    select t.*, sum([absolute]) over(order by [date]) grp
    from mytable t
) t
The subquery uses a window sum of [absolute] to define the groups; the outer query then sums [value] within each group.
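To see the intermediate grouping, here is a hedged, self-contained sketch using the sample rows from the question (column names are bracketed because ABSOLUTE is a reserved word in T-SQL):
with mytable([date], [value], [absolute]) as (
    select * from (values
        (cast('2020-01-01' as date), 4, 0),
        (cast('2020-02-01' as date), 7, 1),
        (cast('2020-03-01' as date), 3, 0)
    ) v(d, v, a)
)
select t.*, sum([value]) over(partition by grp order by [date]) sum_value
from (
    select t.*, sum([absolute]) over(order by [date]) grp
    from mytable t
) t;
-- grp is 0, 1, 1 -> sum_value is 4, then 7 and 7 + 3 = 10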

Compute average of a column, except for the first row

I'm trying to compute some queries with aggregate functions.
The problem is that I'm not able to compute the average of the column without the first value.
Example:
myColumn
10
15
20
Final average: (10 + 15 + 20) / 3 = 15
What I want is: (15 + 20) / 2 = 17.5
This is the code I've tried without success
select avg(age) from testing
except
select avg(age) from testing
limit 1
First, use the OFFSET clause to skip the first row. (You should really ensure the order with an ORDER BY clause.) Then compute the AVG on that result:
select avg(age)
from (
    select age from testing
    offset 1
) dt
Or, if the first row is expected to be the one with the lowest age:
select (sum(age) - min(age)) * 1.0 / (count(*) - 1)
from testing
(The * 1.0 avoids integer division when age is an integer column.)
There is no such thing as a "first" row in SQL, because tables represent unordered sets. A column is needed to specify the ordering.
Let me assume you mean the row with the smallest value. This is a little tricky, but you can use row_number():
select avg(age)
from (select t.*, row_number() over (order by age) as seqnum
      from testing t
     ) t
where seqnum > 1;
I'd propose using something like this (some field should be UNIQUE, for example ID if you have one):
SELECT AVG(age) FROM testing WHERE ID NOT IN
    (SELECT ID FROM testing ORDER BY ??SOMETHING HERE?? LIMIT 1)
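As a quick check of the target value, a minimal sketch on the question's sample data (the inline testing table is built here just for illustration):
with testing(age) as (
    select * from (values (10), (15), (20)) v(age)
)
select (sum(age) - min(age)) * 1.0 / (count(*) - 1) as avg_except_first
from testing;
-- returns 17.5, i.e. (15 + 20) / 2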

get graph data from large table using nth_value window function in postgres

I'm trying to plot a graph of the data from a large table. I can get all of it easily by doing basically
select id, value from data order by value desc;
but that yields me about a hundred thousand rows. About 50 are enough for my purpose, so I want to basically have the equivalent of a step function. Searching turned up "nth_value" as the appropriate window function that probably does what I need, but I couldn't find examples on how to actually use it for this purpose.
Or maybe there's a better way even?
(I'm using PostgreSQL 9.6 in case it matters)
I would be very wary of using row_number() without an order by clause.
One way to phrase this is:
select id, value
from (select d.*,
             row_number() over (order by id) as n,
             count(*) over () as cnt
      from data d
     ) d
where n % floor(cnt / 50) = 0;
This will typically return either 50 or 51 rows. If you want exactly 50 rows, you can add fetch first 50 rows only.
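For example, a hedged variant of the above pinning the result to at most 50 rows (standard fetch first syntax, supported by Postgres):
select id, value
from (select d.*,
             row_number() over (order by id) as n,
             count(*) over () as cnt
      from data d
     ) d
where n % floor(cnt / 50) = 0
order by n
fetch first 50 rows only;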
You want only the first 50 rows?
select id, value
from data
order by value desc
limit 50;
Every 50th row?
select id, value
from (select id, value, row_number() over () as n
      from data) d
where n % 50 = 0
You can choose whatever ordering you want in the OVER clause, e.g. ROW_NUMBER() OVER (ORDER BY value DESC)
Every nth row to get 50 results?
select id, value
from (select id, value, row_number() over () as n
      from data) d
where n % ((select count(*) from data) / 50) = 0
Working example on dbfiddle

Turning Percentile_cont/disc (Median) into scalar function

Editing this since it turns out we were trying to reinvent the wheel.
The below works perfectly in determining the median. Now, how would we go about converting it into a function, so that we can call median(column) instead of having to write the below each time? This does the trick:
select percentile_cont(0.5) within group (order by n) over (partition by [column1])
from t;
Ahh - I see. Is it possible to group it so that it calculates the median only across rows where column1 = a, b, c, so the output would be:
A median of values with A identifier
B median of values with B identifier
C median of values with C identifier
You should just use the percentile_cont() or percentile_disc() window functions:
select percentile_cont(0.5) within group (order by n) over (),
       percentile_disc(0.5) within group (order by n) over ()
from t;
There is no need to re-invent the wheel.
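For the per-group follow-up, adding PARTITION BY to the OVER clause should do it. A minimal sketch, assuming column1 holds the A/B/C identifiers (DISTINCT collapses the per-row window results to one row per group):
select distinct [column1],
       percentile_cont(0.5) within group (order by n) over (partition by [column1]) as median_n
from t;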

How to extract median value?

I need to get the median value in the column "median". Any ideas, please?
SELECT MIN(score) min, CAST(AVG(score) AS float) median, MAX(score) max
FROM result JOIN student ON student.id = result.student_id
I think the simplest method is PERCENTILE_CONT() or PERCENTILE_DISC():
SELECT MIN(score) as min_score,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY score) as median_score,
       MAX(score) as max_score
FROM result r JOIN
     student s
     ON s.id = r.student_id;
This assumes (reasonably) that score is numeric.
The difference between PERCENTILE_CONT() and PERCENTILE_DISC() is what happens when there is an even number of values. That is usually an unimportant consideration, unless you have a small amount of data.
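To make that concrete, a hypothetical two-value example using the ordered-set aggregate form (as in the query above; some databases, e.g. SQL Server, require an OVER () clause instead):
with scores(score) as (
    select * from (values (10), (20)) v(score)
)
select percentile_cont(0.5) within group (order by score) as median_cont,  -- 15, interpolated
       percentile_disc(0.5) within group (order by score) as median_disc  -- 10, an actual value
from scores;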
Average is not Median, you're right.
You can do it the exact way, with:
SELECT ( (SELECT MIN(score) FROM Results X
          WHERE (SELECT COUNT(*) FROM Results Y WHERE Y.score <= X.score)
                >= (SELECT COUNT(*) FROM Results) / 2.0)
       + (SELECT MAX(score) FROM Results X
          WHERE (SELECT COUNT(*) FROM Results Y WHERE Y.score >= X.score)
                >= (SELECT COUNT(*) FROM Results) / 2.0)
       ) / 2.0 AS median
This handles the case where the boundary between the upper and lower 50% falls between two values; it arbitrarily takes the halfway point between them as the median. There are arguments why that might be weighted slightly higher or lower, but any value in that interval correctly divides the population in two.
Or, if you are dealing with a hyperbolic distribution, there's a short-cut approximation:
SELECT SQRT(SUM(num) / SUM(1.0/num)) FROM List
Many other real-world distributions have a lot of little members and a few large members.
Having just hit SAVE and seen the prior answer: yes, SQL2003 now gives you something simpler :-)