How to extract median value? - sql

I need to get the median value in the column aliased "median"; the query below returns the average instead. Any ideas, please?
SELECT
MIN(score) min, CAST(AVG(score) AS float) median, MAX(score) max
FROM result JOIN student ON student.id = result.student_id

I think the simplest method is PERCENTILE_CONT() or PERCENTILE_DISC():
SELECT MIN(score) as min_score,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY score) as median_score,
MAX(score) max_score
FROM result r JOIN
student s
ON s.id = r.student_id;
This assumes (reasonably) that score is numeric.
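The difference between PERCENTILE_CONT() and PERCENTILE_DISC() shows up when there is an even number of values: PERCENTILE_CONT() interpolates between the two middle values, while PERCENTILE_DISC() returns one of the values actually present in the data. That is usually an unimportant consideration, unless you have a small amount of data.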
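To make the even-count case concrete, here is a small Python sketch (not SQL). The helper names percentile_cont and percentile_disc are my own; they only mimic the usual definitions of the two SQL functions (linear interpolation versus the first value whose cumulative share reaches the requested fraction), so treat them as an illustration rather than any engine's exact implementation.

```python
import math


def percentile_cont(values, p):
    """Interpolate linearly between the two nearest ranks (PERCENTILE_CONT-style)."""
    ordered = sorted(values)
    rank = p * (len(ordered) - 1)  # fractional 0-based position
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def percentile_disc(values, p):
    """Return the smallest actual value whose cumulative share is >= p."""
    ordered = sorted(values)
    for i, value in enumerate(ordered, start=1):
        if i / len(ordered) >= p:
            return value
    return ordered[-1]


even_scores = [10, 20, 30, 40]
odd_scores = [10, 20, 30]

print(percentile_cont(even_scores, 0.5), percentile_disc(even_scores, 0.5))
print(percentile_cont(odd_scores, 0.5), percentile_disc(odd_scores, 0.5))
```

With an odd number of values both helpers agree; with an even number the interpolating one lands between the two middle scores and the discrete one picks the lower of them.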

Average is not Median, you're right.
You can do it the exact way, with:
SELECT ( (SELECT MIN(score) FROM result X
WHERE (SELECT COUNT(*) FROM result Y WHERE Y.score <= X.score)
>= (SELECT COUNT(*) FROM result) / 2.0)
+ (SELECT MAX(score) FROM result X
WHERE (SELECT COUNT(*) FROM result Y WHERE Y.score >= X.score)
>= (SELECT COUNT(*) FROM result) / 2.0)
) / 2.0 AS median
This handles the case where the boundary between the upper and lower 50% falls between two values; it arbitrarily takes the halfway point between them as the median. There are arguments why that might be weighted slightly higher or lower, but any value in that interval correctly divides the population in two.
Or, if you are dealing with a hyperbolic distribution, there's a short-cut approximation:
SELECT SQRT(SUM(num) / SUM(1.0/num)) FROM List
Many other real-world distributions have a lot of little members and a few large members.
Having just hit SAVE and seen the prior answer: yes, SQL2003 now gives you something simpler :-)
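The exact method in this answer can be checked without any percentile support. The sketch below runs a variant of that query through Python's built-in sqlite3 module; the single-column result table, the use of score throughout, and the division by 2.0 (to avoid integer division, which truncates in several dialects) are my assumptions, not part of the original answer.

```python
import sqlite3

# Lower middle value plus upper middle value, divided by two.
MEDIAN_SQL = """
SELECT ( (SELECT MIN(score) FROM result X
          WHERE (SELECT COUNT(*) FROM result Y WHERE Y.score <= X.score)
                >= (SELECT COUNT(*) FROM result) / 2.0)
       + (SELECT MAX(score) FROM result X
          WHERE (SELECT COUNT(*) FROM result Y WHERE Y.score >= X.score)
                >= (SELECT COUNT(*) FROM result) / 2.0)
       ) / 2.0 AS median
"""


def sql_median(scores):
    """Load scores into an in-memory table and return the median from SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE result (score REAL)")
    conn.executemany("INSERT INTO result (score) VALUES (?)",
                     [(s,) for s in scores])
    (median,) = conn.execute(MEDIAN_SQL).fetchone()
    conn.close()
    return median


print(sql_median([1, 2, 10]))     # odd count: the middle value
print(sql_median([1, 2, 3, 4]))   # even count: halfway between the middle two
```

The correlated subqueries make this quadratic in the number of rows, so it is more of a portability fallback than something to run on a large table.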

Related

I get the same percentile value for all rows

For each borough, what is the 90th percentile number of people injured per intersection?
I have a borough column and several columns with the number of injured, which I combine into one column. I need to find the 90th percentile of the number of injured people. The query gives me just one value, but I need a different value for each row (or am I wrong?).
select distinct borough, count(num_of_injured) as count_all, PERCENTILE_CONT(num_of_injured, 0.9 RESPECT NULLS) OVER() AS percentile90
from `bigquery-public-data.new_york.nypd_mv_collisions` c cross join
unnest(array[number_of_persons_injured, number_of_pedestrians_injured, number_of_motorist_killed, number_of_cyclist_injured]
) num_of_injured
where borough!=''
group by borough,num_of_injured
order by count_all desc
limit 10;
Thank you for the help!
If you look at the counts by borough for each num injured, then over 90% are 0:
select borough, num_of_injured, count(*),
count(*) / sum(count(*)) over (partition by borough)
from `bigquery-public-data.new_york.nypd_mv_collisions` c cross join
unnest(array[number_of_persons_injured, number_of_pedestrians_injured, number_of_motorist_killed, number_of_cyclist_injured]
) num_of_injured
group by 1, 2
order by 1, 2;
Hence, the 90th percentile is 0.
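A quick way to see why the result is 0, and what a per-borough value would look like, is to compute the percentile group by group. This Python sketch uses made-up rows (not taken from the real collisions table) and a hand-written percentile_cont helper that follows the usual linear-interpolation definition; grouping by borough here plays the role that a PARTITION BY borough inside the OVER clause would play in the query.

```python
import math
from collections import defaultdict


def percentile_cont(values, p):
    """Linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = p * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


# Hypothetical (borough, num_of_injured) rows, for illustration only.
rows = (
    [("BRONX", 0)] * 95 + [("BRONX", 1)] * 4 + [("BRONX", 3)]
    + [("QUEENS", 0)] * 80 + [("QUEENS", 1)] * 15 + [("QUEENS", 2)] * 5
)

by_borough = defaultdict(list)
for borough, injured in rows:
    by_borough[borough].append(injured)

# One percentile per borough instead of one for the whole table.
p90 = {borough: percentile_cont(values, 0.9)
       for borough, values in by_borough.items()}
print(p90)
```

In the first hypothetical borough more than 90% of the rows are 0, so its 90th percentile is 0; the second has enough non-zero rows for the percentile to move.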

How to generate ranges of a column based on condition

There is a column with numbers. I would like to develop a report that categorizes the values of this column into ranges (lower limit and upper limit). A new range must start whenever the difference between consecutive values is more than 10. Is this achievable with a query in either Power BI or SQL Server?
In SQL, I would use lag() and a window sum() to define the groups, and then aggregate:
select min(x) lower_limit, max(x) upper_limit
from (
select x, sum(case when x <= lag_x + 10 then 0 else 1 end) over(order by x) grp
from (select x, lag(x) over(order by x) lag_x from mytable) t
) t
group by grp
lag() gives you the previous value. Then, the window sum implements the following logic: every time the difference between the current and the previous value is more than 10, a new group starts. Finally, the outer query aggregates by group and computes the lower and upper bounds.
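The same grouping rule can be traced by hand in a few lines of Python. This is only a sketch of the logic (the function name split_into_ranges and the sample numbers are mine), useful for checking what the query should return on a small input.

```python
def split_into_ranges(values, max_gap=10):
    """Group sorted values; start a new range when the gap exceeds max_gap."""
    ranges = []
    for x in sorted(values):
        if ranges and x <= ranges[-1][1] + max_gap:
            ranges[-1][1] = x          # extend the current range
        else:
            ranges.append([x, x])      # gap too large (or first value): new range
    return [tuple(r) for r in ranges]


print(split_into_ranges([1, 2, 5, 20, 25, 40]))
```

A gap of exactly 10 stays in the same range, matching the x <= lag_x + 10 test in the query.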
GMB's solution is definitely the canonical approach to solving this, by treating it as a variant of gaps-and-islands. I was wondering if there is a way to do this without two levels of subqueries. And there is:
select coalesce(lag(next_x) over (order by x), first_x) as lower,
x as upper
from (select t.*,
first_value(x) over (order by x) as first_x,
lead(x) over (order by x) as next_x
from t
) t
where next_x is null or next_x > x + 10;
Here is a db<>fiddle.
It would be interesting to compare the performance on a large set of data -- 2 window functions + aggregation versus 3 window functions + filtering.

Compute average of a column, except for the first row

I'm trying to write some queries with aggregate functions.
The problem is that I'm not able to compute the average of a column without the first value.
example
_myColumn_
10
15
20
Final average: (10 + 15 + 20) / 3 = 15
What I want is: (15 + 20) / 2 = 17.5
This is the code I've tried without success
select avg(age) from testing
except
select avg(age) from testing
limit 1
First use an OFFSET clause to skip the first row. (You should really ensure the order with an ORDER BY clause.) Then compute the AVG on that result:
select avg(age)
from
(
select age from testing
offset 1
) dt
Or, if the first row is expected to be the one with the lowest age:
select (sum(age) - min(age)) / (count(*) - 1)
from testing
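As a sanity check on the arithmetic, here is a short Python sketch of both ideas from this answer, using the three sample values from the question. It assumes the "first" row is the one with the smallest age and that there are at least two rows; the function names are mine.

```python
ages = [10, 15, 20]


def avg_skipping_first(values):
    """Order the values, skip the first one, then average the rest."""
    rest = sorted(values)[1:]
    return sum(rest) / len(rest)


def avg_without_smallest(values):
    """Same result in one pass: (sum - min) / (count - 1)."""
    return (sum(values) - min(values)) / (len(values) - 1)


print(avg_skipping_first(ages), avg_without_smallest(ages))
```

Both give 17.5 for the sample data. In SQL, keep in mind that dividing two integer expressions truncates in some databases, so the second form may need a cast.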
There is no such thing as a "first" row in SQL, because tables represent unordered sets. A column is needed to specify the ordering.
Let me assume you mean the row with the smallest value. This is a little tricky, but you can use row_number():
select avg(age)
from (select t.*, row_number() over (order by age) as seqnum
from t
) t
where seqnum > 1;
I'd propose using something like this (some field should be UNIQUE, for example ID if you have one):
SELECT AVG(age) FROM testing WHERE ID NOT IN
(SELECT ID FROM testing ORDER BY ??SOMETHING HERE?? limit 1)

Query in sql to get the top 10 percent in standard sql (without limit, top and the likes, without window functions)

I'm wondering how to retrieve the top 10% of athletes in terms of points, without using any clauses such as TOP, LIMIT etc., just a plain SQL query.
My idea so far:
Table Layout:
Score:
ID | Name | Points
Query:
select *
from Score s
where 0.10 * (select count(*) from Score x) >
(select count(*) from Score p where p.Points > s.Points)
Is there an easier way to do this? Any suggestions?
In most databases, you would use the ANSI standard window functions:
select s.*
from (select s.*,
count(*) over () as cnt,
row_number() over (order by points desc) as seqnum
from Score s
) s
where seqnum * 10 <= cnt;
Try:
select s1.id, s1.name, s1.points, count(s2.points)
from score s1 left join score s2
on s2.points > s1.points
group by s1.id, s1.name, s1.points
having count(s2.points) <= (select count(*) * 0.1 from score)
This counts the players with a higher score than the current one; if that count is less than or equal to 10% of the number of players, the current player is in the top 10%. The left join keeps the top scorer, who has no higher-scoring row to match.
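The counting idea can be tried out with nothing but correlated subqueries. The sketch below uses Python's sqlite3 module and a made-up Score table with 20 athletes; the strict comparison (fewer than 10% of the rows score higher) is my choice, and switching it to <= admits one more rank.

```python
import sqlite3

# An athlete is in the top 10% if fewer than 10% of all rows score higher.
TOP_10_PERCENT_SQL = """
SELECT s.Name
FROM Score s
WHERE (SELECT COUNT(*) FROM Score p WHERE p.Points > s.Points)
      < 0.10 * (SELECT COUNT(*) FROM Score)
ORDER BY s.Points DESC
"""


def top_10_percent(athletes):
    """athletes: list of (name, points) tuples; returns the top names."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Score (ID INTEGER PRIMARY KEY, Name TEXT, Points INTEGER)")
    conn.executemany("INSERT INTO Score (Name, Points) VALUES (?, ?)", athletes)
    names = [row[0] for row in conn.execute(TOP_10_PERCENT_SQL)]
    conn.close()
    return names


# Hypothetical data: athlete1 has 10 points, ..., athlete20 has 200.
athletes = [(f"athlete{i}", i * 10) for i in range(1, 21)]
print(top_10_percent(athletes))
```

With 20 rows the threshold is 2, so only the two highest scorers qualify. Ties at the boundary are all kept or all dropped together, which may return more or fewer than exactly 10% of the rows.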
The PERCENTILE_DISC function is standard SQL and can help you here. Not every SQL implementation supports it, but the following should work in SQL Server 2012, for example. If you need to be particular about ties, or what the top 10% means if there are fewer than 10 athletes, make sure this is computing what you want. PERCENTILE_CONT may be a better option for some questions.
WITH C(cutoff) AS (
SELECT DISTINCT
PERCENTILE_DISC(0.90)
WITHIN GROUP (ORDER BY points)
OVER ()
FROM Score
)
SELECT *
FROM Score JOIN C
ON points >= cutoff;
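To see what the cutoff approach actually selects, here is a Python sketch with made-up points. The percentile_disc helper is my own and follows the usual definition (the first value whose cumulative share reaches the requested fraction); it is not a copy of any particular engine's implementation.

```python
def percentile_disc(values, p):
    """Return the smallest actual value whose cumulative share is >= p."""
    ordered = sorted(values)
    for i, value in enumerate(ordered, start=1):
        if i / len(ordered) >= p:
            return value
    return ordered[-1]


# Hypothetical points for 20 athletes: 10, 20, ..., 200.
points = [i * 10 for i in range(1, 21)]

cutoff = percentile_disc(points, 0.90)
top = [p for p in points if p >= cutoff]
print(cutoff, top)
```

Because the cutoff value itself is included by the >= comparison, three of the 20 athletes come back here rather than two, which is exactly the kind of boundary behaviour the answer suggests checking.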

SQL: Show average and min/max within standard deviations

I have the following SQL table -
Date   StoreNo  Sales
23/4   34       4323.00
23/4   23       564.00
24/4   34       2345.00
etc
I am running a query that returns average sales, max sales and min sales for a certain period -
select avg(Sales), max(sales), min(sales)
from tbl_sales
where date between etc
But there are some values coming through in the min and max that are really extreme - perhaps because the data entry was bad, perhaps because some anomaly had occurred on that date and store.
What I'd like is a query that returns average, max and min, but somehow excludes the extreme values. I am open to how this is done, but perhaps it would use standard deviations in some way (for example, only using data within x std devs of the true average).
Many thanks
In order to calculate the standard deviation, you need to iterate through all of the elements, so it would be impossible to do this in one query. The lazy way would be to just do it in two passes:
DECLARE
@Avg float,
@StDev float
SELECT @Avg = AVG(Sales), @StDev = STDEV(Sales)
FROM tbl_sales
WHERE ...
SELECT AVG(Sales) AS AvgSales, MAX(Sales) AS MaxSales, MIN(Sales) AS MinSales
FROM tbl_sales
WHERE ...
AND Sales >= @Avg - @StDev * 3
AND Sales <= @Avg + @StDev * 3
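The two-pass idea is easy to prototype outside the database. This Python sketch uses made-up sales figures and the sample standard deviation from the statistics module (the same flavour as SQL Server's STDEV); the function name and the 3-sigma default are my assumptions.

```python
import statistics


def stats_within_sigmas(values, n_sigmas=3):
    """Pass 1: mean and stdev of everything. Pass 2: stats of the values in range."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    low, high = mean - n_sigmas * stdev, mean + n_sigmas * stdev
    kept = [v for v in values if low <= v <= high]
    return statistics.mean(kept), max(kept), min(kept)


# Hypothetical data: 20 ordinary days plus one bad data entry.
sales = [100 + i for i in range(20)] + [10000]
print(stats_within_sigmas(sales))
```

Note that the outlier itself inflates the standard deviation used for the cut. With only a handful of rows a single extreme value can never be more than 3 standard deviations from the mean, so this filter needs a reasonable amount of data to be useful.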
Another simple option that might work (fairly common in analysis of scientific data) would be to just drop the minimum and maximum x values, which works if you have a lot of data to process. You can use ROW_NUMBER to do this in one statement:
WITH OrderedValues AS
(
SELECT
Sales,
ROW_NUMBER() OVER (ORDER BY Sales) AS RowNumAsc,
ROW_NUMBER() OVER (ORDER BY Sales DESC) AS RowNumDesc
FROM tbl_sales
)
SELECT ...
FROM tbl_sales
WHERE ...
AND Sales >
(
SELECT MAX(Sales)
FROM OrderedValues
WHERE RowNumAsc <= @ElementsToDiscard
)
AND Sales <
(
SELECT MIN(Sales)
FROM OrderedValues
WHERE RowNumDesc <= @ElementsToDiscard
)
Replace ROW_NUMBER with RANK or DENSE_RANK if you want to discard a certain number of unique values.
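For comparison, dropping a fixed number of extremes looks like this in Python. It is a sketch with made-up figures; the function name and the discard parameter (standing in for the number of elements to discard at each end) are mine.

```python
def trimmed_stats(values, discard=1):
    """Drop the `discard` smallest and largest values, then summarise the rest."""
    ordered = sorted(values)
    if len(ordered) <= 2 * discard:
        raise ValueError("not enough values left after trimming")
    kept = ordered[discard:len(ordered) - discard]
    return sum(kept) / len(kept), max(kept), min(kept)


# Hypothetical sales with one suspiciously low and one suspiciously high entry.
sales = [5, 564, 2345, 4323, 999999]
print(trimmed_stats(sales, discard=1))
```

One difference from the query: the SQL compares by value, so any row tied with a discarded value is dropped as well, whereas this slice removes exactly the requested number of rows from each end.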
Beyond these simple tricks you start to get into some pretty heavy stats. I have to deal with similar kinds of validation and it's far too much material for a SO post. There are a hundred different algorithms that you can tweak in a dozen different ways. I would try to keep it simple if possible!
Expanding on DuffyMo's post you could do something like
With SalesStats As
(
Select Sales, NTILE( 100 ) OVER ( Order By Sales ) As NtileNum
From tbl_Sales
)
Select Avg( Sales ), Max( Sales ), Min( Sales )
From SalesStats
Where NtileNum Between 5 And 95
This keeps buckets 5 through 95, so it excludes the lowest 4% and the highest 5% of rows (use Between 6 And 95 to trim 5% from each end). If you have numbers that vary wildly, you may find that the average isn't a good summary statistic and should consider using the median. You can do that with something like:
With SalesStats As
(
Select Sales
, NTILE( 100 ) OVER ( Order By Sales ) As NtileNum
, ROW_NUMBER() OVER ( Order By Sales ) As RowNum
From tbl_Sales
)
, TotalSalesRows As
(
Select COUNT(*) As Total
From tbl_Sales
)
, Median As
(
-- One middle row for an odd count, the average of the two middle rows for an even count
Select Avg( Sales ) As Sales
From SalesStats
Cross Join TotalSalesRows
Where RowNum In ( (TotalSalesRows.Total + 1) / 2, (TotalSalesRows.Total + 2) / 2 )
)
Select Avg( SalesStats.Sales ), Max( SalesStats.Sales ), Min( SalesStats.Sales ), Max( Median.Sales ) As MedianSales
From SalesStats
Cross Join Median
Where NtileNum Between 5 And 95
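The row-number trick for the median relies on integer division: rows (n + 1) / 2 and (n + 2) / 2 are the same row when n is odd and the two middle rows when n is even. A short Python sketch of that step, with a hypothetical function name, sample values of my own, and a non-empty input assumed:

```python
def median_by_row_number(values):
    """Average the row(s) at 1-based positions (n + 1) // 2 and (n + 2) // 2."""
    ordered = sorted(values)
    total = len(ordered)
    middle_rows = {(total + 1) // 2, (total + 2) // 2}  # one row if odd, two if even
    middle = [ordered[row - 1] for row in middle_rows]
    return sum(middle) / len(middle)


print(median_by_row_number([1, 2, 100]))      # odd count
print(median_by_row_number([1, 3, 2, 100]))   # even count
```

The second example also shows why the median can be the better summary here: the mean of those four values is 26.5, while the median stays at 2.5.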
Maybe what you're looking for are percentiles.
Standard deviation tends to be sensitive to outliers, since it's calculated using the square of the difference between a value and the mean.
Maybe a more robust, less sensitive measure like absolute value of difference between a value and the mean would be more appropriate in your case.