How to find the median in SQL Server - sql

I have a single table that houses student scores by their classes.
For example, each class has 30 students, so there are 30 scores for each class.
I'd like to do a simple report that computes the average, the median, and the mode of the scores for each class.
So, each class will have an average, a median, and a mode.
I know that SQL Server does not have a built-in function for median and mode, and I found sample SQL for the median. However, the samples I found do not do any grouping. For example:
SELECT
(
(SELECT MAX(Value) FROM
(SELECT TOP 50 PERCENT Value FROM dbo.VOrders ORDER BY Value) AS H1)
+
(SELECT MIN(Value) FROM
(SELECT TOP 50 PERCENT Value FROM dbo.VOrders ORDER BY Value DESC) AS H2)
) / 2 AS Median
Is it possible to modify this to add a GROUP BY so I get a median value per class?
I don't think I was clear enough. I'd like the SQL to return one data set, looking something like this:
MEDIAN CLASS
====== =====
90 BIO
77 CHEM

This is the answer:
WITH CTE AS (
SELECT e_id,
scale,
ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn,
COUNT(scale) OVER(PARTITION BY e_id) AS cn
FROM waypoint.dbo.ScoreMaster
WHERE scale IS NOT NULL
)
SELECT e_id,
cast(AVG (cast(scale as decimal(5,2))) as decimal(5,3)) as [AVG],
cast (STDEV(cast(scale as decimal(5,1))) as decimal(5,3)) as [STDDEV],
AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS FinancialMedian,
MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS StatisticalMedian
from CTE
GROUP BY e_id
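
To see why the condition 2 * rn - cn BETWEEN 0 AND 2 picks out the middle row (odd count) or the two middle rows (even count), here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. The scores table, its columns, and the sample data are assumptions made for illustration, not the asker's schema, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical table and data, chosen only to illustrate the technique.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (class TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [
        ("BIO", 80), ("BIO", 90), ("BIO", 100),                   # odd count
        ("CHEM", 70), ("CHEM", 74), ("CHEM", 80), ("CHEM", 95),   # even count
    ],
)

query = """
WITH cte AS (
    SELECT class,
           score,
           ROW_NUMBER() OVER (PARTITION BY class ORDER BY score) AS rn,
           COUNT(*)     OVER (PARTITION BY class)                AS cn
    FROM scores
    WHERE score IS NOT NULL
)
SELECT class,
       -- odd cn: only the middle row satisfies the test;
       -- even cn: the two middle rows do, and AVG takes their mean
       AVG(CASE WHEN (2 * rn - cn) BETWEEN 0 AND 2 THEN score END) AS median
FROM cte
GROUP BY class
ORDER BY class
"""

medians = dict(conn.execute(query).fetchall())
print(medians)
```

SQLite's AVG always returns a floating-point value. In SQL Server, AVG over an integer column uses integer division, so casting the column to a decimal type first may be needed when the two middle values differ by an odd amount.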

Related

calculate 2 cumulative sums for 2 different groups

I have a table that looks like this:
id position value
5 senior 10000
6 senior 20000
8 senior 30000
9 junior 5000
4 junior 7000
3 junior 10000
It is sorted by position and value (asc) already. I want to calculate the number of seniors and juniors that can fit in a budget of 50,000 such that preference is given to seniors.
So for example, here 2 seniors (first and second) + 3 juniors can fit in the budget of 50,000.
id position value cum_sum
5 senior 10000 10000
6 senior 20000 30000
8 senior 30000 60000 ----not possible because it is more than 50000
----------------------------------- --- so out of 50k, 30k is used for 2 seniors.
9 junior 5000 5000
4 junior 7000 12000
1 junior 7000 19000 ---with the remaining 20k, these 3 juniors can also fit
3 junior 10000 29000
so the output should look like this:
juniors seniors
3 2
How can I achieve this in SQL?
Here's one possible solution: DB Fiddle
with seniorsCte as (
select id, position, value, total
from budget b
inner join (
select id, position, value, (sum(value) over (order by value, id)) total
from people
where position = 'senior'
) as s
on s.total <= b.amount
)
, juniorsCte as (
select j.id, j.position, j.value, j.total + r.seniorsTotal
from (
select coalesce(max(total), 0) seniorsTotal
, max(b.amount) - coalesce(max(total), 0) remainingAmount
from budget b
left join seniorsCte on 1 = 1 -- a left join keeps the budget row even when no senior fits
) as r
inner join (
select id, position, value, (sum(value) over (order by value, id)) total
from people
where position = 'junior'
) as j
on j.total <= r.remainingAmount
)
/* use this if you want the specific records
select *
from seniorsCte
union all
select *
from juniorsCte
*/
select (select count(1) from seniorsCte) seniors
, (select count(1) from juniorsCte) juniors
From your question I suspect you're familiar with window functions, but in case not: the query below pulls back all rows from the people table where the position is senior, and adds a column, total, holding the cumulative total of value over the rows returned, starting with the lowest value and ascending. The id is included in the ordering to ensure consistent behaviour when multiple rows have the same value; that's not strictly required if we're happy to take those rows in an arbitrary order.
select id, position, value, (sum(value) over (order by value, id)) total
from people
where position = 'senior'
I use the budget table just to hold a single row/value giving our cutoff; this avoids hardcoding the 50k value you mentioned, so we can easily amend it as required.
I've used common table expressions (CTEs) so that we can filter the juniors subquery based on the output of the seniors subquery (we only want those juniors that fit in the difference between the budget and the seniors' total), whilst still returning the results for juniors and seniors independently. If we wanted the actual rows rather than just the counts, this lets us perform a union all between the two sets, as demonstrated in the commented-out code.
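
The two-step logic described above (fit the seniors first, then fit the juniors into whatever is left) can be sketched in plain Python. This is an illustration under stated assumptions rather than a translation of the query: the helper name count_within_budget is made up here, and the values are taken from the asker's worked listing, which includes the extra junior with id 1.

```python
from itertools import accumulate


def count_within_budget(values, budget):
    """Return (how many of the cheapest values fit, total they cost).

    Mirrors a cumulative SUM() OVER (ORDER BY value) filtered by
    total <= budget: because values are sorted ascending, the rows
    that pass the filter always form a prefix.
    """
    running = list(accumulate(sorted(values)))
    fitting = [total for total in running if total <= budget]
    return len(fitting), (fitting[-1] if fitting else 0)


budget = 50000
seniors = [10000, 20000, 30000]
juniors = [5000, 7000, 7000, 10000]

# Seniors get first call on the budget; juniors get the remainder.
n_seniors, spent = count_within_budget(seniors, budget)
n_juniors, _ = count_within_budget(juniors, budget - spent)

print(n_seniors, n_juniors)
```

With these values, two seniors use 30,000 and three juniors fit in the remaining 20,000, which agrees with the expected output in the question.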
For it to work, the sum has to be not only cumulative, but also selective. As mentioned in the comment, you can achieve that with a recursive cte: online demo
with recursive
ordered as --this will be fed to the actual recursive cte
( select *,
row_number() over (order by position desc,value asc)
from test_table)
,recursive_cte as
( select id,
position,
value,
value*(value<50000)::int as cum_sum,
value<50000 as is_hired,
2 as next_i
from ordered
where row_number=1
union
select o.id,
o.position,
o.value,
case when o.value+r.cum_sum<50000 then o.value+r.cum_sum else r.cum_sum end,
(o.value+r.cum_sum)<50000 as is_hired,
r.next_i+1 as next_i
from recursive_cte r,
ordered o
where o.row_number=next_i
)
select count(*) filter (where position='junior') as juniors,
count(*) filter (where position='senior') as seniors
from recursive_cte
where is_hired;
row_number() over () is a window function.
count(*) filter (where ...) is an aggregate filter. It's a faster variant of the sum(case when expr then a else 0 end) or count(nullif(expr)) approach, for when you only want to aggregate a specific subset of rows. Here it's only used to put the counts in columns, as in your expected result; the same counts could be returned stacked, one row per position, with select position, count(*) from recursive_cte where is_hired group by position.
All it does is order your list according to your priorities in the first cte, then go through it row by row in the second one, collecting the cumulative sum, based on whether it's still below your limit/budget.
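
The row-by-row pass that the recursive CTE performs can be sketched as an ordinary loop. This is a hedged illustration: the function name hire_within_budget is hypothetical, the data comes from the asker's worked listing (including the junior with id 1), and the comparison is the same strict "less than" that the query uses.

```python
def hire_within_budget(rows, budget):
    """Walk rows in priority order, adding a row only if the total stays under budget."""
    total = 0
    hired = {"senior": 0, "junior": 0}
    for position, value in rows:
        if total + value < budget:  # strict '<', matching the recursive query
            total += value
            hired[position] += 1
        # otherwise skip this row but keep scanning the remaining rows
    return hired, total


people = [
    ("senior", 10000), ("senior", 20000), ("senior", 30000),
    ("junior", 5000), ("junior", 7000), ("junior", 7000), ("junior", 10000),
]

# Same ordering as the first CTE: seniors first, then ascending value.
ordered = sorted(people, key=lambda row: (row[0] != "senior", row[1]))

hired, total = hire_within_budget(ordered, 50000)
print(hired, total)
```

Unlike a plain cumulative sum, the loop does not stop at the first row that overshoots: the third senior is skipped and the juniors are still considered, which is what "selective" means here.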
PostgreSQL supports window functions such as SUM(col) OVER():
with cte as (
SELECT *, SUM(value) OVER(PARTITION BY position ORDER BY id) AS cumulative_sum
FROM mytable
)
select position, count(1)
from cte
where cumulative_sum < 50000
group by position
Another way to do it, to get the results in one row:
with cte as (
SELECT *, SUM(value) OVER(PARTITION BY position ORDER BY id) AS cumulative_sum
FROM mytable
),
cte2 as (
select position, count(1) as _count
from cte
where cumulative_sum < 50000
group by position
)
select
sum(case when position = 'junior' then _count else null end) juniors,
sum(case when position = 'senior' then _count else null end) seniors
from cte2
This is an example using a running total:
select
count(case when chek_sum_jun > 0 and position = 'junior' then position else null end) chek_jun,
count(case when chek_sum_sen > 0 and position = 'senior' then position else null end) chek_sen
from (
select position,
20000 - sum(case when position = 'junior' then value else 0 end) over (partition by position order by value asc rows between unbounded preceding and current row ) chek_sum_jun,
50000 - sum(case when position = 'senior' then value else 0 end) over (partition by position order by value asc rows between unbounded preceding and current row ) chek_sum_sen
from test_table) x
demo : https://dbfiddle.uk/ZgOoSzF0

RANK in SQL but start at 1 again when number is greater than

I need SQL code for the below. I want it to RANK, but if DSLR >= 60 then I want the rank to start again, like below.
Thanks
Assuming that you have a column that defines the ordering of the rows, say id, you can address this as a gaps-and-islands problem. Islands are groups of adjacent records that each start with a dslr of 60 or more. We can identify them with a window sum, then rank within each island:
select dslr, rank() over(partition by grp order by id) as rn
from (
select t.*,
sum(case when dslr >= 60 then 1 else 0 end) over(order by id) as grp
from mytable t
) t
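
A small run of the same query against an in-memory SQLite database shows the rank restarting at each row whose dslr is 60 or more. The sample rows are invented for illustration, and the sketch assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, dslr INTEGER)")
# Hypothetical data: ids 3 and 6 have dslr >= 60 and should each start a new island.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, 5), (2, 10), (3, 61), (4, 3), (5, 8), (6, 75), (7, 2)],
)

query = """
SELECT id, dslr, RANK() OVER (PARTITION BY grp ORDER BY id) AS rn
FROM (
    SELECT t.*,
           -- running count of "restart" rows seen so far = island number
           SUM(CASE WHEN dslr >= 60 THEN 1 ELSE 0 END) OVER (ORDER BY id) AS grp
    FROM mytable t
) t
ORDER BY id
"""

ranks = [row[2] for row in conn.execute(query)]
print(ranks)
```

Rows that come before the first dslr of 60 or more all get grp = 0, so they form their own island and are ranked from 1 as well.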

Choosing 10 largest sets of data based on sum, outputting cumulative sum for each

Say the dataset is:
Class Value Ordering
A 10 1
A 13 2
...
B 20 1
B 7 2
...
I want to be able to find the 10 classes with the highest total value and then output the cumulative sum of each class.
So far I have created a script to determine the 10 largest:
SELECT Class
FROM Table
GROUP BY Class
ORDER BY sum(Value) DESC
LIMIT 10;
And a script to find the cumulative sum of a specific class:
SELECT sum(Value) OVER (
ORDER BY Ordering
ROWS BETWEEN
UNBOUNDED PRECEDING
AND CURRENT ROW
) AS cumulativeSum
FROM Table
WHERE Class = 'A'
ORDER BY Ordering ASC;
But I cannot find a way to combine the process together
EDIT:
Assuming A and B were two of the highest classes, the output would be:
A B
10 20
23 27
If a class C was not one of the 10 largest, it would not be output
If I followed you correctly, you can do:
select class, value, ordering, cumulativeSum
from (
select
t.*,
dense_rank() over(order by totalsum desc) rn
from (
select
t.*,
sum(value) over(partition by class order by ordering) cumulativeSum,
sum(value) over(partition by class) totalsum
from table t
) t
) t
where rn <= 10
order by class, ordering
This filters the table on the top 10 classes by their total value, and adds a cumulative sum per class to each row.
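
As a quick check of the idea, here is a sketch against an in-memory SQLite database that keeps the top 2 of 3 classes. The table name class_values, the class C rows, and the limit of 2 are assumptions for illustration. The sketch ranks classes with DENSE_RANK(): every row of a class shares the same total, so a plain RANK() would jump by the number of rows in each class and the cutoff would no longer count classes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class_values (class TEXT, value INTEGER, ordering INTEGER)")
conn.executemany(
    "INSERT INTO class_values VALUES (?, ?, ?)",
    [
        ("A", 10, 1), ("A", 13, 2),   # total 23
        ("B", 20, 1), ("B", 7, 2),    # total 27
        ("C", 1, 1), ("C", 2, 2),     # total 3, should be filtered out
    ],
)

query = """
SELECT class, ordering, cumulative_sum
FROM (
    SELECT s.*,
           DENSE_RANK() OVER (ORDER BY total_sum DESC) AS class_rank
    FROM (
        SELECT class, value, ordering,
               SUM(value) OVER (PARTITION BY class ORDER BY ordering) AS cumulative_sum,
               SUM(value) OVER (PARTITION BY class)                   AS total_sum
        FROM class_values
    ) AS s
) AS ranked
WHERE class_rank <= ?
ORDER BY class, ordering
"""

top_n = 2  # hypothetical cutoff; the question uses 10
rows = conn.execute(query, (top_n,)).fetchall()
print(rows)
```

The result is in long form, one row per class and ordering; pivoting the classes into columns as in the question's expected output would be a separate step.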

Aggregate function like MAX for most common cell in column?

Grouping by the highest number in a column worked great with MAX(), but what if I would like to get the value that is most common?
As example:
ID
100
250
250
300
200
250
So I would like to group by ID and, instead of getting the lowest (MIN) or highest (MAX) number, get the most common one (that would be 250, because it appears 3 times).
Is there an easy way in SQL Server 2012 or am I forced to add a second SELECT where I COUNT(DISTINCT ID) and add that somehow to my first SELECT statement?
You can use dense_rank to return all the ids with the highest count. This also handles the case where several ids are tied for the highest count.
select id from
(select id, dense_rank() over(order by count(*) desc) as rnk from tablename group by id) t
where rnk = 1
A simple way to do what you want uses top and order by:
SELECT top 1 id
FROM t
GROUP BY id
ORDER BY COUNT(*) DESC;
This is a statistic called the mode. Getting the mode and max is a bit challenging in SQL Server. I would approach it as:
WITH cte AS (
SELECT t.id, COUNT(*) AS cnt,
row_number() OVER (ORDER BY COUNT(*) DESC) AS seqnum
FROM t
GROUP BY id
)
SELECT MAX(id) AS themax, MAX(CASE WHEN seqnum = 1 THEN id END) AS MODE
FROM cte;
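
The dense_rank approach to the mode, including its handling of ties, can be tried out with a short sketch against an in-memory SQLite database. The helper name modes is made up for this example, the counting is split into its own CTE for clarity, and window functions need SQLite 3.25 or later.

```python
import sqlite3


def modes(values):
    """Return every value tied for the highest frequency, in ascending order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    query = """
    WITH counts AS (
        SELECT id, COUNT(*) AS cnt
        FROM t
        GROUP BY id
    ),
    ranked AS (
        -- all ids sharing the top count receive rank 1
        SELECT id, DENSE_RANK() OVER (ORDER BY cnt DESC) AS rnk
        FROM counts
    )
    SELECT id FROM ranked WHERE rnk = 1 ORDER BY id
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


print(modes([100, 250, 250, 300, 200, 250]))  # the data from the question
print(modes([100, 250, 250, 300, 300]))       # two values tied for most common
```

By contrast, the TOP 1 and row_number() variants return a single value, so with ties the one you get depends on how the database happens to order the tied rows.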

How do I filter the top 1% and lower 1% of data in each group in SQL

I have a data set that includes PRICE, SUBTYPE, and others. I want to do some outlier removal before I use the dataset. I want to remove rows for things where the price is ridiculously high or low, in each SUBTYPE.
For each SUBTYPE look at the range of the PRICEs and remove or filter out rows.
Keep rows that fall between: PRICErange * .01 |KEEP| PRICErange * .99
This was provided to me by Martin Smith on Stack Overflow. I have edited this question, so let's start from here.
;WITH CTE
AS (SELECT *,
ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
COUNT(*) OVER(PARTITION BY SUBTYPE) AS Cnt
FROM all_resale)
SELECT *
FROM CTE
WHERE (CASE WHEN Cnt > 1 THEN 100.0 * (RN -1)/(Cnt -1) END) BETWEEN 1 AND 99
I'm not sure this is what I need to do. I don't know how many rows will be removed off the ends.
You don't specify exactly how you define the 1 percent and how ties should be handled.
One way is below
;WITH CTE
AS (SELECT *,
ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
COUNT(*) OVER(PARTITION BY SUBTYPE) AS Cnt
FROM all_resale)
SELECT *
FROM CTE
WHERE (CASE WHEN Cnt > 1 THEN 100.0 * (RN -1)/(Cnt -1) END) BETWEEN 1 AND 99
That assumes the highest-priced item is at 100%, the lowest-priced one is at 0%, and all others are scaled evenly in between, taking no account of ties. If you need to take account of ties, look into RANK rather than ROW_NUMBER.
NB: If all of the subtypes have a relatively large number of rows you could use NTILE(100) instead, but it does not distribute rows between buckets well if the number of rows is small relative to the number of buckets.
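
To get a feel for how many rows the WHERE clause removes from each end, the same arithmetic can be reproduced for a single subtype in plain Python. This is only a sketch: the function name trim_by_position and the sample price lists are invented for illustration.

```python
def trim_by_position(prices, low=1.0, high=99.0):
    """Keep prices whose scaled position 100 * (rn - 1) / (cnt - 1) lies in [low, high]."""
    ordered = sorted(prices)
    cnt = len(ordered)
    if cnt < 2:
        # The query's CASE yields NULL when Cnt = 1, so a lone row is dropped.
        return []
    return [
        price
        for index, price in enumerate(ordered)      # index plays the role of RN - 1
        if low <= 100.0 * index / (cnt - 1) <= high
    ]


kept_101 = trim_by_position(range(1, 102))   # 101 rows priced 1..101
kept_10 = trim_by_position(range(1, 11))     # 10 rows priced 1..10
kept_1 = trim_by_position([42])              # a subtype with a single row

print(len(kept_101), len(kept_10), len(kept_1))
```

With 101 rows exactly one row is removed from each end, and with 10 rows the filter still removes the minimum and maximum, which is 20% of the group. For small subtypes this trims far more than 1% from each tail, and a subtype with a single row disappears entirely.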