SQL: splitting a dataset into 3 sections, 60/20/20, for testing

I have a dataset that I want to split into 3 groups, with a 60/20/20 split. I would also like the split to be random. I was wondering what the best method is to do this in SQL (Redshift). I tried using percent_rank but that doesn't work, so I'm open to ideas.
Thanks.
Example data:
ID          Column 2
123214123   Y
544354342   N
43241231    Y
231213123   Y
123123123   Y

The simplest method is probably just using random():
select t.*,
       (case when random() < 0.6 then 'group1'
             -- random() is evaluated again here, so of the remaining 40%
             -- of rows, half end up in group2: 0.4 * 0.5 = 0.2
             when random() < 0.5 then 'group2'
             else 'group3'
        end) as grp
from t;
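Because random() is re-evaluated in the second WHEN branch, the thresholds 0.6 and 0.5 really do give 60/20/20. A quick simulation of the same CASE logic (plain Python, names are illustrative) confirms the proportions:

```python
import random

random.seed(42)

def assign_group():
    # Mirrors the CASE expression: each WHEN re-evaluates random()
    if random.random() < 0.6:
        return "group1"
    if random.random() < 0.5:  # 0.4 * 0.5 = 0.2 of all rows
        return "group2"
    return "group3"

n = 100_000
counts = {"group1": 0, "group2": 0, "group3": 0}
for _ in range(n):
    counts[assign_group()] += 1

for g, c in counts.items():
    print(g, round(c / n, 3))  # roughly 0.6 / 0.2 / 0.2
```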
This is only approximate in the counts. You can get more precision using window functions:
select t.*,
       (case when tile <= 6 then 'group1'
             when tile <= 8 then 'group2'
             else 'group3'
        end) as grp
from (select t.*,
             ntile(10) over (order by random()) as tile
      from t
     ) t;
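The same ntile bucketing can be sketched against SQLite, which also supports ntile (this is an illustration, not Redshift-tested); with 100 rows the 60/20/20 counts come out exact:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key)")
conn.executemany("insert into t (id) values (?)", [(i,) for i in range(100)])

# ntile(10) deals the shuffled rows into 10 equal tiles of 10 rows each
rows = conn.execute("""
    select case when tile <= 6 then 'group1'
                when tile <= 8 then 'group2'
                else 'group3'
           end as grp,
           count(*) as n
    from (select id, ntile(10) over (order by random()) as tile from t)
    group by grp
""").fetchall()

print(dict(rows))  # {'group1': 60, 'group2': 20, 'group3': 20}
```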

Related

calculate 2 cumulative sums for 2 different groups

I have a table that looks like this:
id position value
5 senior 10000
6 senior 20000
8 senior 30000
9 junior 5000
4 junior 7000
3 junior 10000
It is sorted by position and value (asc) already. I want to calculate the number of seniors and juniors that can fit in a budget of 50,000 such that preference is given to seniors.
So for example, here 2 seniors (first and second) + 3 juniors can fit in the budget of 50,000.
id position value cum_sum
5 senior 10000 10000
6 senior 20000 30000
8 senior 30000 60000 ----not possible because it is more than 50000
----------------------------------- --- so out of 50k, 30k is used for 2 seniors.
9 junior 5000 5000
4 junior 7000 12000
1 junior 7000 19000 ---with the remaining 20k, these 3 juniors can also fit
3 junior 10000 29000
so the output should look like this:
juniors  seniors
3        2
How can I achieve this in SQL?
Here's one possible solution: DB Fiddle
with seniorsCte as (
    select id, position, value, total
    from budget b
    inner join (
        select id, position, value, (sum(value) over (order by value, id)) total
        from people
        where position = 'senior'
    ) as s
        on s.total <= b.amount
),
juniorsCte as (
    select j.id, j.position, j.value, j.total + r.seniorsTotal as total
    from (
        select coalesce(max(total), 0) seniorsTotal,
               max(b.amount) - coalesce(max(total), 0) remainingAmount
        from budget b
        cross join seniorsCte
    ) as r
    inner join (
        select id, position, value, (sum(value) over (order by value, id)) total
        from people
        where position = 'junior'
    ) as j
        on j.total <= r.remainingAmount
)
/* use this if you want the specific records
select *
from seniorsCte
union all
select *
from juniorsCte
*/
select (select count(1) from seniorsCte) seniors,
       (select count(1) from juniorsCte) juniors
From your question I suspect you're familiar with window functions; but in case not: the below query pulls back all rows from the people table where the position is senior, and creates a column, total, which is the cumulative total of value over the rows returned, starting with the lowest value, ascending (then sorting by id to ensure consistent behaviour if there are multiple rows with the same value; though that's not strictly required if we're happy to get those in an arbitrary order).
select id, position, value, (sum(value) over (order by value, id)) total
from people
where position = 'senior'
The budget table just holds a single row/value saying what our cutoff is; i.e. this avoids hardcoding the 50k value you mentioned, so we can easily amend it as required.
The common table expressions (CTEs) allow us to filter the juniors subquery based on the output of the seniors subquery (as we only want those juniors that fit in the difference between the budget and the seniors' total), whilst still letting us return the results of juniors and seniors independently (i.e. if we wanted to return the actual rows, rather than just totals, we could perform a union all between the two sets, as demonstrated in the commented-out code).
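To see the whole thing run end to end, here's a hedged sketch of the same seniors-then-juniors logic against SQLite, using the question's second data listing (the one that includes a fourth junior, id 1); CTE names are simplified:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table people (id int, position text, value int);
    insert into people values
        (5,'senior',10000),(6,'senior',20000),(8,'senior',30000),
        (9,'junior',5000),(4,'junior',7000),(1,'junior',7000),(3,'junior',10000);
    create table budget (amount int);
    insert into budget values (50000);
""")

row = conn.execute("""
    with seniors as (
        -- seniors whose cumulative total fits in the budget
        select s.total
        from (select (sum(value) over (order by value, id)) as total
              from people where position = 'senior') s
        join budget b on s.total <= b.amount
    ),
    remaining as (
        select (select amount from budget) - coalesce(max(total), 0) as amt
        from seniors
    ),
    juniors as (
        -- juniors whose cumulative total fits in what's left
        select j.total
        from (select (sum(value) over (order by value, id)) as total
              from people where position = 'junior') j
        join remaining r on j.total <= r.amt
    )
    select (select count(*) from seniors) as seniors,
           (select count(*) from juniors) as juniors
""").fetchone()

print(row)  # (2, 3): 2 seniors use 30k, 3 juniors fit in the remaining 20k
```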
For it to work, the sum has to be not only cumulative, but also selective. As mentioned in the comment, you can achieve that with a recursive cte: online demo
with recursive
ordered as -- this will be fed to the actual recursive cte
(
    select *,
           row_number() over (order by position desc, value asc)
    from test_table
),
recursive_cte as
(
    select id,
           position,
           value,
           value * (value < 50000)::int as cum_sum,
           value < 50000 as is_hired,
           2 as next_i
    from ordered
    where row_number = 1
    union
    select o.id,
           o.position,
           o.value,
           case when o.value + r.cum_sum < 50000
                then o.value + r.cum_sum
                else r.cum_sum end,
           (o.value + r.cum_sum) < 50000 as is_hired,
           r.next_i + 1 as next_i
    from recursive_cte r,
         ordered o
    where o.row_number = next_i
)
select count(*) filter (where position = 'junior') as juniors,
       count(*) filter (where position = 'senior') as seniors
from recursive_cte
where is_hired;
row_number() over () is a window function.
count(*) filter (where ...) is an aggregate filter: a faster variant of the sum(case when expr then 1 else 0 end) approach, for when you only wish to count a specific subset of rows. It's used here just to put the counts in columns as in your expected result; the same could be done stacked, with select position, count(*) from recursive_cte where is_hired group by position.
All it does is order your list according to your priorities in the first CTE, then go through it row by row in the second one, accumulating the cumulative sum as long as it stays below your limit/budget.
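The walk the recursive CTE performs can also be sketched imperatively; this hypothetical Python loop orders the rows the same way, then accumulates while the running total stays under budget (using the question's second data listing, with the fourth junior, id 1):

```python
# Data from the question (second listing); hypothetical, just to trace the algorithm
rows = [(5, 'senior', 10000), (6, 'senior', 20000), (8, 'senior', 30000),
        (9, 'junior', 5000), (4, 'junior', 7000), (1, 'junior', 7000),
        (3, 'junior', 10000)]
budget = 50000

# "order by position desc, value asc": seniors first, cheapest first
ordered = sorted(rows, key=lambda r: (r[1] == 'junior', r[2]))

cum = 0
hired = []
for _id, position, value in ordered:
    if cum + value < budget:   # the CTE uses a strict '<' against 50000
        cum += value           # is_hired: this row joins the running sum
        hired.append(position)

counts = {'senior': hired.count('senior'), 'junior': hired.count('junior')}
print(counts)  # {'senior': 2, 'junior': 3}
```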
PostgreSQL supports the cumulative window sum SUM(col) OVER (...):
with cte as (
SELECT *, SUM(value) OVER(PARTITION BY position ORDER BY id) AS cumulative_sum
FROM mytable
)
select position, count(1)
from cte
where cumulative_sum < 50000
group by position
Another way to do it, to get the results in one row:
with cte as (
SELECT *, SUM(value) OVER(PARTITION BY position ORDER BY id) AS cumulative_sum
FROM mytable
),
cte2 as (
select position, count(1) as _count
from cte
where cumulative_sum < 50000
group by position
)
select
sum(case when position = 'junior' then _count else null end) juniors,
sum(case when position = 'senior' then _count else null end) seniors
from cte2
Demo here
This example uses a running total:
select
count(case when chek_sum_jun > 0 and position = 'junior' then position else null end) chek_jun,
count(case when chek_sum_sen > 0 and position = 'senior' then position else null end) chek_sen
from (
select position,
20000 - sum(case when position = 'junior' then value else 0 end) over (partition by position order by value asc rows between unbounded preceding and current row ) chek_sum_jun,
50000 - sum(case when position = 'senior' then value else 0 end) over (partition by position order by value asc rows between unbounded preceding and current row ) chek_sum_sen
from test_table) x
demo : https://dbfiddle.uk/ZgOoSzF0

SQL: perform undersampling to select a subset of majority class

I have a table that looks like the following:
user_id   target
1278      1
9809      0
3345      0
9800      0
1298      1
1223      0
My goal is to perform undersampling: randomly select a subset of the users that have a target of 0, while keeping all users with a target of 1. I have tried the following code; however, since the user_ids are all unique, it doesn't randomly remove the rows with a target of 0. Any idea what I need to do?
select *
from (select user_id, target, row_number() over (partition by user_id, target order by rand()) as seq
from dataset.mytable
) a
where target = 1 or seq = 1
One method uses window functions:
select t.* except (seqnum, cnt1)
from (select t.*,
row_number() over (partition by target order by rand()) as seqnum,
countif(target = 1) over () as cnt1
from t
) t
where seqnum <= cnt1;
The above might have performance problems -- or even exceed resources because of the large volume of data being sorted. An approximate method might also work for your purposes, keeping each row with probability cnt1 / cnt:
select t.* except (cnt, cnt1)
from (select t.*,
             count(*) over (partition by target) as cnt,
             countif(target = 1) over () as cnt1
      from t
     ) t
where rand() < cnt1 * 1.0 / cnt;
This is not guaranteed to produce exactly the same number of 0s and 1s, but the numbers will be quite close.
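For intuition, the exact method above boils down to: count the target=1 rows, shuffle each class independently, and keep the first cnt1 rows of each class. A hypothetical Python sketch of that effect, using the question's sample rows:

```python
import random

random.seed(0)

# Sample rows from the question: (user_id, target)
rows = [(1278, 1), (9809, 0), (3345, 0), (9800, 0), (1298, 1), (1223, 0)]

cnt1 = sum(1 for _, t in rows if t == 1)   # countif(target = 1) over ()

sampled = []
for target in (0, 1):
    group = [r for r in rows if r[1] == target]
    random.shuffle(group)                   # order by rand()
    sampled.extend(group[:cnt1])            # keep seqnum <= cnt1

print(len(sampled))  # 4: both 1s plus two randomly chosen 0s
```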
Consider the below approach - it leaves all target=1 rows and ~50% of target=0 rows:
select *
from `dataset.mytable`
where if(target = 1, true, rand() < 0.5)
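The filter is easy to emulate outside SQL as well; a hypothetical sketch with Python's random module (made-up ids), keeping every 1 and roughly half the 0s:

```python
import random

random.seed(7)

# Made-up data: 2000 positives, 8000 negatives
rows = [(uid, 1 if uid % 5 == 0 else 0) for uid in range(10_000)]

# Mirrors: if(target = 1, true, rand() < 0.5)
kept = [(uid, t) for uid, t in rows if t == 1 or random.random() < 0.5]

ones = sum(t for _, t in kept)
zeros = len(kept) - ones
print(ones, zeros)  # all 2000 ones survive; roughly 4000 zeros remain
```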

RANK in SQL but start at 1 again when number is greater than

I need SQL code for the below. I want to RANK, however whenever DSLR >= 60 I want the rank to start again at 1, like below.
Thanks
Assuming that you have a column that defines the ordering of the rows, say id, you can address this as a gaps-and-islands problem. Islands are groups of adjacent records, each starting with a dslr at or above 60. We can identify them with a window sum, then rank within each island:
select dslr, rank() over (partition by grp order by id) as rn
from (
    select t.*,
           sum(case when dslr >= 60 then 1 else 0 end) over (order by id) as grp
    from mytable t
) t
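A sketch of the same query against SQLite, with made-up id and dslr values; each dslr >= 60 bumps grp, so the rank restarts there:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table mytable (id integer primary key, dslr int);
    insert into mytable values (1,10),(2,25),(3,61),(4,5),(5,70),(6,3);
""")

rows = conn.execute("""
    select dslr, rank() over (partition by grp order by id) as rn
    from (
        select t.*,
               sum(case when dslr >= 60 then 1 else 0 end)
                   over (order by id) as grp
        from mytable t
    ) t
    order by id
""").fetchall()

print(rows)  # [(10, 1), (25, 2), (61, 1), (5, 2), (70, 1), (3, 2)]
```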

sql table random numbers into multiple columns

New here, but I would appreciate any help.
I have a table with one column; this column has the numbers from 1 to 1000.
I would like to break this column up into ten columns, so I would have 10 columns and 100 rows in my new table.
I would also like the numbers to be in random order.
Any help would really be appreciated -- thanks in advance.
You can use conditional aggregation and row_number():
select max(case when seqnum % 10 = 0 then number end) as number_1,
max(case when seqnum % 10 = 1 then number end) as number_2,
. . .
max(case when seqnum % 10 = 9 then number end) as number_10
from (select t.*,
row_number() over (order by newid()) - 1 as seqnum
from t
) t
group by floor(seqnum / 10)
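Here's a sketch of the same pivot against SQLite (random() standing in for SQL Server's newid(); the ten max(case …) columns are generated in Python rather than written out, and integer division replaces floor()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (number int)")
conn.executemany("insert into t values (?)", [(i,) for i in range(1, 1001)])

# Build the ten conditional-aggregation columns
cols = ",\n".join(
    f"max(case when seqnum % 10 = {i} then number end) as number_{i + 1}"
    for i in range(10)
)
rows = conn.execute(f"""
    select {cols}
    from (select number,
                 row_number() over (order by random()) - 1 as seqnum
          from t) t
    group by seqnum / 10
""").fetchall()

print(len(rows), len(rows[0]))  # 100 rows, 10 columns
```

Every number from 1 to 1000 appears exactly once across the grid; only the placement is random.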

SQL Group up rows data into one row

I have data in a table like this:
I want to organise the table data so that I can get a maximum of 3 letters per row grouped by account number.
Below would be the result I want:
I can use dense rank to group up the account numbers but not sure how to get the data I want in the format above.
Logic:
There are 4 letters for account 123. Final result groups by account number with first 3 letters as you can only have a maximum of 3 letters per row. The fourth letter must go on the second row.
Here's one option using conditional aggregation, first creating a row_number, and then creating a row grouping using every 3 rows with % (modulus operator):
select account_number,
       max(case when rn % 3 = 1 then letter end) as letter1,
       max(case when rn % 3 = 2 then letter end) as letter2,
       max(case when rn % 3 = 0 then letter end) as letter3
from (
    select *, row_number() over (partition by account_number, rn % 3 order by rn) newrn
    from (
        select *, row_number() over (partition by account_number order by letter) rn
        from yourtable
    ) t
) y
group by account_number, newrn
order by account_number
Online Demo
I would do this with only one call to row_number():
select account_number,
max(case when seqnum % 3 = 1 then letter end) as letter_a,
max(case when seqnum % 3 = 2 then letter end) as letter_b,
max(case when seqnum % 3 = 0 then letter end) as letter_c
from (select t.*,
row_number() over (partition by account_number order by letter) as seqnum
from t
) t
group by account_number, floor( (seqnum - 1) / 3)
order by account_number, min(seqnum);
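To check the behaviour, here's the second query run against SQLite with hypothetical data: account 123 has four letters (so it spills onto a second row) and account 456 has one. floor() is dropped, since SQLite does integer division natively:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t (account_number int, letter text);
    insert into t values (123,'A'),(123,'B'),(123,'C'),(123,'D'),(456,'X');
""")

rows = conn.execute("""
    select account_number,
           max(case when seqnum % 3 = 1 then letter end) as letter_a,
           max(case when seqnum % 3 = 2 then letter end) as letter_b,
           max(case when seqnum % 3 = 0 then letter end) as letter_c
    from (select t.*,
                 row_number() over (partition by account_number
                                    order by letter) as seqnum
          from t) t
    group by account_number, (seqnum - 1) / 3
    order by account_number, min(seqnum)
""").fetchall()

print(rows)
# [(123, 'A', 'B', 'C'), (123, 'D', None, None), (456, 'X', None, None)]
```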