Calculate 2 cumulative sums for 2 different groups - SQL

i have a table that looks like this:
id position value
5 senior 10000
6 senior 20000
8 senior 30000
9 junior 5000
4 junior 7000
3 junior 10000
It is sorted by position and value (asc) already. I want to calculate the number of seniors and juniors that can fit in a budget of 50,000 such that preference is given to seniors.
So for example, here 2 seniors (first and second) + 3 juniors can fit in the budget of 50,000.
id position value cum_sum
5 senior 10000 10000
6 senior 20000 30000
8 senior 30000 60000 ----not possible because it is more than 50000
----------------------------------- --- so out of 50k, 30k is used for 2 seniors.
9 junior 5000 5000
4 junior 7000 12000
1 junior 7000 19000 ---with the remaining 20k, these 3 juniors can also fit
3 junior 10000 29000
so the output should look like this:
juniors seniors
3 2
How can I achieve this in SQL?

Here's one possible solution: DB Fiddle
with seniorsCte as (
    select id, position, value, total
    from budget b
    inner join (
        select id, position, value, (sum(value) over (order by value, id)) total
        from people
        where position = 'senior'
    ) as s
        on s.total <= b.amount
)
, juniorsCte as (
    select j.id, j.position, j.value, j.total + r.seniorsTotal
    from (
        select coalesce(max(total), 0) seniorsTotal
             , max(b.amount) - coalesce(max(total), 0) remainingAmount
        from budget b
        cross join seniorsCte
    ) as r
    inner join (
        select id, position, value, (sum(value) over (order by value, id)) total
        from people
        where position = 'junior'
    ) as j
        on j.total <= r.remainingAmount
)
/* use this if you want the specific records
select *
from seniorsCte
union all
select *
from juniorsCte
*/
select (select count(1) from seniorsCte) seniors
     , (select count(1) from juniorsCte) juniors
From your question I suspect you're already familiar with window functions, but in case not: the query below pulls back all rows from the people table where the position is senior, and creates a column, total, which is the cumulative total of the value of the rows returned, starting with the lowest value and ascending (then sorting by id to ensure consistent behaviour if there are multiple rows with the same value; that's not strictly required if we're happy to get those in an arbitrary order).
select id, position, value, (sum(value) over (order by value, id)) total
from people
where position = 'senior'
The budget table is just used to hold a single row/value saying what our cutoff is; this avoids hardcoding the 50k value you mentioned, so we can easily amend it as required.
The common table expressions (CTEs) are used so we can filter the juniors subquery based on the output of the seniors subquery (we only want those juniors that fit within the difference between the budget and the seniors' total), whilst still allowing us to return the results of juniors and seniors independently (i.e. if we wanted to return the actual rows rather than just counts, this lets us perform a union all between the two sets, as demonstrated in the commented-out code).
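For reference, here is a minimal setup the query above appears to assume (table and column names are taken from the query itself; the seed rows and the 50,000 cutoff come from the question):
create table budget (amount int);
insert into budget (amount) values (50000);

create table people (id int, position varchar(10), value int);
insert into people (id, position, value) values
  (5, 'senior', 10000),
  (6, 'senior', 20000),
  (8, 'senior', 30000),
  (9, 'junior', 5000),
  (4, 'junior', 7000),
  (3, 'junior', 10000);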

For it to work, the sum has to be not only cumulative, but also selective. As mentioned in the comments, you can achieve that with a recursive CTE: online demo
with recursive
ordered as --this will be fed to the actual recursive cte
  ( select *,
           row_number() over (order by position desc, value asc)
    from test_table )
, recursive_cte as
  ( select id,
           position,
           value,
           value * (value < 50000)::int as cum_sum,
           value < 50000 as is_hired,
           2 as next_i
    from ordered
    where row_number = 1
    union
    select o.id,
           o.position,
           o.value,
           case when o.value + r.cum_sum < 50000 then o.value + r.cum_sum else r.cum_sum end,
           (o.value + r.cum_sum) < 50000 as is_hired,
           r.next_i + 1 as next_i
    from recursive_cte r,
         ordered o
    where o.row_number = next_i
  )
select count(*) filter (where position = 'junior') as juniors,
       count(*) filter (where position = 'senior') as seniors
from recursive_cte
where is_hired;
row_number() over () is a window function.
count(*) filter (where ...) is an aggregate filter. It's a faster variant of the sum(case when expr then 1 else 0 end) or count(nullif(expr, false)) approach, for when you only wish to count a specific subset of rows. That's just to put the counts in columns as you did in your expected result; it could also be done with a select position, count(*) from recursive_cte where is_hired group by position, stacked, as sketched below.
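For completeness, that stacked variant would simply replace the final select of the recursive query above with:
select position, count(*)
from recursive_cte
where is_hired
group by position;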
All it does is order your list according to your priorities in the first CTE, then go through it row by row in the second one, collecting the cumulative sum based on whether it's still below your limit/budget.

PostgreSQL supports the window function SUM(col) OVER ():
with cte as (
    SELECT *, SUM(value) OVER(PARTITION BY position ORDER BY id) AS cumulative_sum
    FROM mytable
)
select position, count(1)
from cte
where cumulative_sum < 50000
group by position
Another way to do it, to get the results in one row:
with cte as (
    SELECT *, SUM(value) OVER(PARTITION BY position ORDER BY id) AS cumulative_sum
    FROM mytable
),
cte2 as (
    select position, count(1) as _count
    from cte
    where cumulative_sum < 50000
    group by position
)
select
    sum(case when position = 'junior' then _count else null end) juniors,
    sum(case when position = 'senior' then _count else null end) seniors
from cte2
Demo here

This is an example using a running total:
select
    count(case when chek_sum_jun > 0 and position = 'junior' then position else null end) chek_jun,
    count(case when chek_sum_sen > 0 and position = 'senior' then position else null end) chek_sen
from (
    select position,
           -- 20000 is the budget left for juniors after the seniors in the example (50000 - 30000)
           20000 - sum(case when position = 'junior' then value else 0 end) over (partition by position order by value asc rows between unbounded preceding and current row) chek_sum_jun,
           50000 - sum(case when position = 'senior' then value else 0 end) over (partition by position order by value asc rows between unbounded preceding and current row) chek_sum_sen
    from test_table) x
demo : https://dbfiddle.uk/ZgOoSzF0

Related

SUM a specific column in next rows until a condition is true

Here is a table of articles; I want to store the sum of the Mass column from the next rows in the sumNext column, based on a condition.
If the next row has the same floor (in the floorNo column) as the current row, then add the mass of the next rows until the floor changes.
E.g.: Row three has sumNext = 2. That is computed by adding the mass from row four and row five, because both rows have the same floor number as row three.
id       mass  symbol  floorNo  sumNext
2891176  1     D       1        0
2891177  1     L       8        0
2891178  1     L       1        2
2891179  1     L       1        1
2891180  1             1        0
2891181  1             5        2
2891182  1             5        1
2891183  1             5        0
Here is the query that generates this table; I just want to add the sumNext column with the right value.
WITH items AS (
    SELECT
        SP.id,
        SP.mass,
        SP.symbol,
        SP.floorNo
    FROM articles SP
    ORDER BY
        DECODE(SP.symbol,
               'P', 1,
               'D', 2,
               'L', 3,
               4) asc)
SELECT CLS.*
FROM items CLS;
You could use the solution below, which uses
the common table expression (CTE) technique to put all consecutive rows with the same FLOORNO value in the same group (a new grp column),
then uses the analytic version of the SUM function to sum all subsequent MASS values per grp as required.
with Items_RowsNumbered (id, mass, symbol, floorNo, rnb) as (
    select ID, MASS, SYMBOL, FLOORNO
         , row_number() over (
               order by DECODE(symbol, 'P', 1, 'D', 2, 'L', 3, 4) asc, ID)
    /*
    You need to add the ID column (or any other columns that can identify each row uniquely)
    in the "order by" clause to make the result deterministic
    */
    from (Your source query) Items
)
, cte (id, mass, symbol, floorNo, rnb, grp) as (
    select id, mass, symbol, floorNo, rnb, 1 grp
    from Items_RowsNumbered
    where rnb = 1
    union all
    select t.id, t.mass, t.symbol, t.floorNo, t.rnb
         , case when t.floorNo = c.floorNo then c.grp else c.grp + 1 end grp
    from Items_RowsNumbered t
    join cte c on (c.rnb + 1 = t.rnb)
)
select
    ID, MASS, SYMBOL, FLOORNO
    /*, RNB, GRP*/
  , nvl(
        sum(MASS) over (
            partition by grp
            order by rnb
            ROWS BETWEEN 1 FOLLOWING and UNBOUNDED FOLLOWING)
      , 0
    ) sumNext
from cte
;
demo on db<>fiddle
This is a typical gaps-and-islands problem. You can use LAG() (or, as below, the difference of two ROW_NUMBER()s) to determine the exact partitions, and then the SUM() analytic function, such as:
WITH ii AS
(
    SELECT i.*,
           ROW_NUMBER() OVER (ORDER BY id DESC) AS rn2,
           ROW_NUMBER() OVER (PARTITION BY floorNo ORDER BY id DESC) AS rn1
    FROM items i
)
SELECT id, mass, symbol, floorNo,
       SUM(mass) OVER (PARTITION BY rn2 - rn1 ORDER BY id DESC) - 1 AS sumNext
FROM ii
ORDER BY id
Demo
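For what it's worth, here is a sketch of the LAG()-based variant of the same idea mentioned above (it assumes the same items source as the query above, and that sumNext should be 0 for the last row of each island):
WITH marked AS (
    -- flag the first row of each run of identical floorNo values
    SELECT i.*,
           CASE WHEN floorNo = LAG(floorNo) OVER (ORDER BY id) THEN 0 ELSE 1 END AS grp_start
    FROM items i
),
grouped AS (
    -- a running sum of the flags numbers the islands
    SELECT m.*,
           SUM(grp_start) OVER (ORDER BY id) AS grp
    FROM marked m
)
SELECT id, mass, symbol, floorNo,
       COALESCE(SUM(mass) OVER (PARTITION BY grp ORDER BY id
                                ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS sumNext
FROM grouped
ORDER BY id;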

Filter Postgres percent_rank before calculation but show all results?

I am trying to figure out how to run percent_rank on a table, but filter which records the percent_rank is run on, while still including the filtered-out rows in the results and giving them a percent_rank of 0.
For example, I have a users table and everyone has a point value assigned to them. I only want to run percent_rank on people with >= 20 points, but not exclude the others from the results. Meaning if someone has 19 points I can still see their record, but their rank is 0.
For example:
SELECT name,points,PERCENT_RANK() OVER (ORDER BY points)
FROM users
WHERE points >= 20;
But keep the people with less than 20 points in the results.
You could use a union here:
SELECT name, points, 0 AS pct_rank FROM users WHERE points < 20
UNION ALL
SELECT name, points, PERCENT_RANK() OVER (ORDER BY points) FROM users WHERE points >= 20;
You can do this with a CASE expression:
SELECT u.name, u.points,
       (CASE WHEN u.points >= 20
             THEN PERCENT_RANK() OVER (PARTITION BY u.points >= 20 ORDER BY u.points)
             ELSE 0
        END) as rank
FROM users u;
If you don't want to repeat the condition, you can use a lateral join:
SELECT u.name, u.points,
       (CASE WHEN v.cond
             THEN PERCENT_RANK() OVER (PARTITION BY v.cond ORDER BY points)
             ELSE 0
        END) as rank
FROM users u CROSS JOIN LATERAL
     (VALUES (u.points >= 20)) v(cond);

Oracle LISTAGG, top 3 most frequent values, given in one column, grouped by ID

I have a problem regarding an SQL query. It could be done in "plain" SQL, but since I'm sure I need some kind of group concatenation (and can't use MySQL), the option is the Oracle dialect, as there will be an Oracle database. Let's say we have the following entities:
Table: Veterinarian visits
Visit_Id,
Animal_id,
Veterinarian_id,
Sickness_code
Let's say there are 100 visits (100 visit_ids) and each animal_id has around 20 visits.
I need to create a SELECT, grouped by Animal_id, with 3 columns:
animal_id
the second shows the aggregated number of flu visits for this particular animal (let's say flu is sickness_code = 5)
the 3rd column shows the top three sickness codes for each animal (the 3 most frequent codes for this particular animal_id)
How to do it? The first and second columns are easy, but the third? I know that I need to use Oracle's LISTAGG, OVER (PARTITION BY ...), COUNT and RANK; I tried to tie them together but it didn't work out as I expected. What should this query look like?
Here is some sample data:
create table VET as
select
rownum+1 Visit_Id,
mod(rownum+1,5) Animal_id,
cast(NULL as number) Veterinarian_id,
trunc(10*dbms_random.value)+1 Sickness_code
from dual
connect by level <=100;
Query
Basically the subqueries do the following:
aggregate the count and calculate the flu count (across all records of the animal)
calculate RANK (if you really need only 3 records, use ROW_NUMBER - see the discussion below)
filter the top 3 RANKs
LISTAGGregate the result
with agg as (
    select Animal_id, Sickness_code, count(*) cnt,
           sum(case when SICKNESS_CODE = 5 then 1 else 0 end) over (partition by animal_id) as cnt_flu
    from vet
    group by Animal_id, Sickness_code
), agg2 as (
    select ANIMAL_ID, SICKNESS_CODE, CNT, cnt_flu,
           rank() OVER (PARTITION BY ANIMAL_ID ORDER BY cnt DESC) rnk
    from agg
), agg3 as (
    select ANIMAL_ID, SICKNESS_CODE, CNT, CNT_FLU, RNK
    from agg2
    where rnk <= 3
)
select
    ANIMAL_ID, max(CNT_FLU) CNT_FLU,
    LISTAGG(SICKNESS_CODE||'('||CNT||')', ', ') WITHIN GROUP (ORDER BY rnk) as cnt_lts
from agg3
group by ANIMAL_ID
order by 1;
gives
ANIMAL_ID CNT_FLU CNT_LTS
---------- ---------- ---------------------------------------------
0 1 6(5), 1(4), 9(3)
1 1 1(5), 3(4), 2(3), 8(3)
2 0 1(5), 10(3), 4(3), 6(3), 7(3)
3 1 5(4), 2(3), 4(3), 7(3)
4 1 2(5), 10(4), 1(2), 3(2), 5(2), 7(2), 8(2)
I intentionally show Sickness_code(count of visits) to demonstrate that the top 3 can have ties that you should handle.
Check the RANK function. Using ROW_NUMBER is not deterministic in this case.
I think the most natural way uses two levels of aggregation, along with a dash of window functions here and there:
select vas.animal,
       sum(case when sickness_code = 5 then cnt else 0 end) as numflu,
       listagg(case when seqnum <= 3 then sickness_code end, ',') within group (order by seqnum) as top3sicknesses
from (select animal, sickness_code, count(*) as cnt,
             row_number() over (partition by animal order by count(*) desc) as seqnum
      from visits
      group by animal, sickness_code
     ) vas
group by vas.animal;
This uses the fact that listagg() ignores NULL values.

How do I filter the top 1% and lower 1% of data in each group in SQL

I have a data set that includes PRICE, SUBTYPE, and others. I want to do some outlier removal before I use the dataset. I want to remove rows for things where the price is ridiculously high or low, in each SUBTYPE.
For each SUBTYPE look at the range of the PRICEs and remove or filter out rows.
Keep rows that fall between: PRICErange * .01 |KEEP| PRICErange * .99
This was provided to me by Martin Smith on Stack Overflow; I edited this question, so let's start from here.
;WITH CTE
AS (SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
           COUNT(*) OVER(PARTITION BY SUBTYPE) AS Cnt
    FROM all_resale)
SELECT *
FROM CTE
WHERE (CASE WHEN Cnt > 1 THEN 100.0 * (RN - 1)/(Cnt - 1) END) BETWEEN 1 AND 99
I'm not sure this is what I need to do. I don't know how many rows will be removed off the ends.
You don't specify exactly how you define the 1 percent and how ties should be handled.
One way is below:
;WITH CTE
AS (SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
           COUNT(*) OVER(PARTITION BY SUBTYPE) AS Cnt
    FROM all_resale)
SELECT *
FROM CTE
WHERE (CASE WHEN Cnt > 1 THEN 100.0 * (RN - 1)/(Cnt - 1) END) BETWEEN 1 AND 99
That assumes the highest-priced item is 100%, the lowest-priced one is 0%, and all others are scaled evenly in between, taking no account of ties. If you need to take account of ties, look into RANK rather than ROW_NUMBER.
NB: If all of the subtypes have a relatively large number of rows you could use NTILE(100) instead, but it does not distribute rows between buckets well if the number of rows is small relative to the number of buckets.
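A minimal sketch of that NTILE(100) alternative, assuming the same all_resale table (bucket 1 holds roughly the lowest 1% of prices per SUBTYPE and bucket 100 the highest; the PriceBucket name is just illustrative):
;WITH CTE
AS (SELECT *,
           NTILE(100) OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS PriceBucket
    FROM all_resale)
SELECT *
FROM CTE
WHERE PriceBucket BETWEEN 2 AND 99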

How to find the median in SQL Server

I have a single table that houses student scores by their classes.
For example, each class has 30 students, so there are 30 scores for each class.
I'd like to do a simple report that averages, does a median, and a mode, for each data set per class.
So, each class will have an average, a median, and a mode.
I know that SQL Server does not have a built-in function for median and mode, and I found sample SQL for the median. However, the samples I found do not do any grouping. I found:
SELECT
(
  (SELECT MAX(Value) FROM
     (SELECT TOP 50 PERCENT Value FROM dbo.VOrders ORDER BY Value) AS H1)
  +
  (SELECT MIN(Value) FROM
     (SELECT TOP 50 PERCENT Value FROM dbo.VOrders ORDER BY Value DESC) AS H2)
) / 2 AS Median
Is it possible to modify to add a group by so I get a median value per class?
I don't think I was clear enough; I'd like the SQL to return one data set, looking something like this:
MEDIAN CLASS
====== =====
90 BIO
77 CHEM
This is the answer:
WITH CTE AS (
    SELECT e_id,
           scale,
           ROW_NUMBER() OVER(PARTITION BY e_id ORDER BY scale ASC) AS rn,
           COUNT(scale) OVER(PARTITION BY e_id) AS cn
    FROM waypoint.dbo.ScoreMaster
    WHERE scale IS NOT NULL
)
SELECT e_id,
       cast(AVG(cast(scale as decimal(5,2))) as decimal(5,3)) as [AVG],
       cast(STDEV(cast(scale as decimal(5,1))) as decimal(5,3)) as [STDDEV],
       AVG(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS FinancialMedian,
       MAX(CASE WHEN 2 * rn - cn BETWEEN 0 AND 2 THEN scale END) AS StatisticalMedian
FROM CTE
GROUP BY e_id
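As an aside, on SQL Server 2012 and later a grouped median can also be obtained with PERCENTILE_CONT. A minimal sketch, assuming a hypothetical scores(class, score) table rather than the ScoreMaster table above:
SELECT DISTINCT
       class,
       -- PERCENTILE_CONT is analytic in SQL Server, so DISTINCT collapses it to one row per class
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY score)
           OVER (PARTITION BY class) AS Median
FROM scores;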