SQL: how to select n rows from each interval of one column - sql

For example, the table looks like
a b c
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
I want to randomly pick 2 rows from every interval based on column a, where a ~ [0, 2], a ~ [4, 6], a ~ [9, 20].
A more complicated case would be to select n rows from every interval based on multiple columns, for example where the intervals are a ~ [0, 2], a ~ [4, 6], b ~ [7, 9], ...
Is there a way to do so with just SQL?

Find out to which interval each row belongs, order by random partitioned by an interval id, get the top n rows for each interval:
create transient table mytable as
select seq8() id, random() data
from table(generator(rowcount => 100)) v;
create transient table intervals as
select 0 i_start, 6 i_end, 2 random_rows
union all select 7, 20, 1
union all select 21, 30, 3
union all select 31, 50, 1;
select *
from (
select *
, row_number() over(partition by i_start order by random()) rn
from mytable a
join intervals b
on a.id between b.i_start and b.i_end
)
where rn<=random_rows
Edit: Shorter and cleaner.
select a.*
from mytable a
join intervals b
on a.id between b.i_start and b.i_end
qualify row_number() over(partition by i_start order by random()) <= random_rows

To get two rows per group, you want to use row_number(). To define the groups, you can use a lateral join to define the groupings:
select t.*
from (select t.*,
row_number() over (partition by v.grp order by random()) as seqnum
from t cross join lateral
(values (case when a between 0 and 2 then 1
when a between 4 and 6 then 2
when a between 7 and 9 then 3
end)
) v(grp)
where grp is not null
) t
where seqnum <= 2;
You can adjust the case expression to define whatever groups you like.
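To make the logic of both answers concrete outside the database, here is a minimal Python sketch of the same idea: assign each row to an interval, then draw up to n random rows per interval. The function name `sample_per_interval`, its parameters, and the assumption that intervals do not overlap are illustrative choices, not part of either answer.

```python
import random
from collections import defaultdict

def sample_per_interval(rows, intervals, n, key="a", seed=None):
    """Pick up to n random rows from each inclusive (low, high) interval of rows[key]."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for row in rows:
        for low, high in intervals:
            if low <= row[key] <= high:
                buckets[(low, high)].append(row)
                break  # assumption: intervals do not overlap
    # Rows outside every interval are dropped, like "where grp is not null".
    return {iv: rng.sample(members, min(n, len(members)))
            for iv, members in buckets.items()}

# Sample table from the question: a = 1..11, b = c = 1.
rows = [{"a": a, "b": 1, "c": 1} for a in range(1, 12)]
picked = sample_per_interval(rows, [(0, 2), (4, 6), (9, 20)], n=2, seed=0)
```

Here `picked` maps each interval to its sampled rows. For the multi-column case, the membership test would check a different column per interval instead of a single `key`.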

Related

PSQL, adding a "step increasing" column

I have these values in a table column (select a from tab):
a
1
2
3
4
5
6
7
15
16
18
Using a variable = 3, how can I create column b, starting with min(a), with the following values:
a b
1 1
2 1
3 1
4 4
5 4
6 4
7 7
15 15
16 15
18 18
Something like: for each a (in order), keep the group's starting value while a is less than that start plus 3; otherwise start a new group at a.
Thanks,
AAWNSD
I think you want window functions and groups of three based on arithmetic on a:
select a,
min(a) over (partition by ceiling(a / 3.0)) as b
from tab;
Here is a db<>fiddle.
Hmmm . . . I realize that the above returns "16" for the last row rather than 18. My above interpretation may not be correct. You may be saying that you want groups -- once they start -- to never exceed the group starting value plus 2.
If so, one approach is a recursive CTE:
with recursive tt as (
select a, row_number() over (order by a) as seqnum
from tab
),
cte as (
select a, seqnum, a as grp
from tt
where seqnum = 1
union all
select tt.a, tt.seqnum,
(case when tt.a <= grp + 2 then grp else tt.a end)
from cte join
tt
on tt.seqnum = cte.seqnum + 1
)
select *
from cte;
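The recursive CTE is essentially a single ordered pass that carries the current group start forward. A minimal Python sketch of that walk, assuming the rule is "start a new group once a exceeds the group start plus width - 1" (the name `group_starts` and the `width` parameter are illustrative):

```python
def group_starts(values, width=3):
    """Pair each value with the start of its group; a group spans [start, start + width - 1]."""
    result = []
    start = None
    for a in sorted(values):
        # Open a new group on the first value or once a falls outside the current window.
        if start is None or a > start + width - 1:
            start = a
        result.append((a, start))
    return result

pairs = group_starts([1, 2, 3, 4, 5, 6, 7, 15, 16, 18])
```

With the values from the question this yields starts 1, 1, 1, 4, 4, 4, 7, 15, 15, 18, so the last row is 18 rather than 16.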

count zeros between 1s in same column

I've data like this.
ID IND
1 0
2 0
3 1
4 0
5 1
6 0
7 0
I want to count the zeros before the value 1. So that, the output will be like below.
ID IND OUT
1 0 0
2 0 0
3 1 2
4 0 0
5 1 1
6 0 0
7 0 2
Is it possible without pl/sql? I tried to find the differences between row numbers but couldn't achieve it.
The match_recognize clause, introduced in Oracle 12.1, can make quick work of such "row pattern recognition" problems. The solution is just a bit complex due to the special treatment of a "last row" with IND = 0, but it is straightforward otherwise.
As usual, the with clause is not part of the solution; I include it to test the query. Remove it and use your actual table and column names.
with
inputs (id, ind) as (
select 1, 0 from dual union all
select 2, 0 from dual union all
select 3, 1 from dual union all
select 4, 0 from dual union all
select 5, 1 from dual union all
select 6, 0 from dual union all
select 7, 0 from dual
)
select id, ind, out
from inputs
match_recognize(
order by id
measures case classifier() when 'Z' then 0
when 'O' then count(*) - 1
else count(*) end as out
all rows per match
pattern ( Z* ( O | X ) )
define Z as ind = 0, O as ind != 0
);
ID IND OUT
---------- ---------- ----------
1 0 0
2 0 0
3 1 2
4 0 0
5 1 1
6 0 0
7 0 2
You can treat this as a gaps-and-islands problem. You can define the "islands" by the number of "1"s on or after each row. Then use a window function:
select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
then sum(1 - ind) over (partition by grp)
else 0
end) as num_zeros
from (select t.*,
sum(ind) over (order by id desc) as grp
from t
) t;
If id is sequential with no gaps, you can do this without a subquery:
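select t.*,
(case when ind = 1 or row_number() over (order by id desc) = 1
-- distance back to the previous "1" (or to just before the first row);
-- subtract ind so a row with ind = 1 does not count itself
then id - coalesce(lag(case when ind = 1 then id end ignore nulls) over (order by id), min(id) over () - 1) - ind
else 0
end) as num_zeros
from t;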
I would suggest removing the case conditions and just using the then clause for the expression, so the value is on all rows.
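For comparison, the expected output can be reproduced with one ordered pass that keeps a running count of zeros and emits it on each 1 and on the final row. This is a Python sketch of the rule as stated in the question, not a translation of either query; `zeros_before_ones` is an illustrative name.

```python
def zeros_before_ones(inds):
    """For each row, return the number of zeros since the previous 1, reported on 1s and on the last row."""
    out = []
    run = 0                      # zeros seen since the last reported row
    last = len(inds) - 1
    for i, ind in enumerate(inds):
        if ind == 0:
            run += 1
        if ind == 1 or i == last:
            out.append(run)      # report and reset at each 1 and at the final row
            run = 0
        else:
            out.append(0)
    return out

result = zeros_before_ones([0, 0, 1, 0, 1, 0, 0])
```

For the sample data this gives 0, 0, 2, 0, 1, 0, 2, matching the OUT column in the question.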

Assign column value based on the percentage of rows

In DB2 is there a way to assign a column value based on the first x%, then y% and remaining z% of rows?
I've tried using row_number() function but no luck!
Example below
Assuming that the below example count(id) is already arranged in descending order
Input:
ID count(id)
5 10
3 8
1 5
4 3
2 1
Output:
The first 30% of the rows of the above input should be assigned code H, the last 30% of the rows code L, and the remaining rows code M. If 30% of the rows is not a whole number, round it to 0 decimal places.
ID code
5 H
3 H
1 M
4 L
2 L
You can use window functions:
select t.id,
(case ntile(3) over (order by count(id) desc)
when 1 then 'H'
when 2 then 'M'
when 3 then 'L'
end) as grp
from t
group by t.id;
This puts them into equal sized groups.
For 30-40-30% split with your conditions, you have to be more careful:
select t.id,
(case when (seqnum - 1.0) < 0.3 * num_ids then 'H'
when seqnum > 0.7 * num_ids then 'L'
else 'M'
end) as grp
from (select t.id,
count(*) as cnt,
-- total number of ids, used for the percentage cut-offs
count(*) over () as num_ids,
row_number() over (order by count(*) desc) as seqnum
from t
group by t.id
) t
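The cut-off arithmetic is easier to check by hand in a small script. Below is a Python sketch of a 30-40-30 split over ranked ids, assuming the top share is rounded up to a whole number of rows and the bottom share mirrors it; the function name `assign_codes` and its `top`/`bottom` parameters are illustrative.

```python
def assign_codes(counts, top=0.3, bottom=0.3):
    """Map each id to 'H', 'M' or 'L' by its rank when ordered by count descending."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    n = len(ranked)
    codes = {}
    for seqnum, id_ in enumerate(ranked, start=1):
        if seqnum - 1 < top * n:          # first ceil(top * n) rows
            codes[id_] = "H"
        elif seqnum > (1 - bottom) * n:   # rows past the (1 - bottom) mark
            codes[id_] = "L"
        else:
            codes[id_] = "M"
    return codes

# id -> count(id), taken from the question's input
codes = assign_codes({5: 10, 3: 8, 1: 5, 4: 3, 2: 1})
```

With five ids, 30% is 1.5 rows, so two ids land in H, two in L, and the middle one in M, which is the output the question asks for.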
Try this:
with t(ID, count_id) as (values
(5, 10)
, (3, 8)
, (1, 5)
, (4, 3)
, (2, 1)
)
select t.*
, case
when pst <=30 then 'H'
when pst <=70 then 'M'
else 'L'
end as code
from
(
select t.*
, rownumber() over (order by count_id desc) as rn
, 100*rownumber() over (order by count_id desc)/nullif(count(1) over(), 0) as pst
from t
) t;
The result is:
ID COUNT_ID RN PST CODE
-- -------- -- --- ----
5 10 1 20 H
3 8 2 40 M
1 5 3 60 M
4 3 4 80 L
2 1 5 100 L

Is there a way to find active users in SQL?

I'm trying to find the total count of active users in a database. "Active" users here are defined as those who have registered an event on the selected day or later than the selected day. So if a user registered an event on days 1, 2 and 5, they are counted as "active" throughout days 1, 2, 3, 4 and 5.
My original dataset looks like this (note that this is a sample - the real dataset will run to up to 365 days, and has around 1000 users).
Day ID
0 1
0 2
0 3
0 4
0 5
1 1
1 2
2 1
3 1
4 1
4 2
As you can see, all 5 IDs are active on Day 0, and 2 IDs (1 and 2) are active until Day 4, so I'd like the finished table to look like this:
Day Count
0 5
1 2
2 2
3 2
4 2
I've tried using the following query:
select Day as days, sum(case when Day <= days then 1 else 0 end)
from df
But it gives incorrect output (it only counts users who were active on each specific day).
I'm at a loss as to what I could try next. Does anyone have any ideas? Many thanks in advance!
I think I would just use generate_series():
select gs.d, count(*)
from (select id, min(day) as min_day, max(day) as max_day
from t
group by id
) t cross join lateral
generate_series(t.min_day, t.max_day, 1) gs(d)
group by gs.d
order by gs.d;
If you want to count everyone as active from day 1 -- but not all have a value on day 1 -- then use 1 instead of min_day.
Here is a db<>fiddle.
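The same expand-the-range idea can be sketched in Python to check the expected counts: find each user's first and last day, then count the user on every day in between. The name `active_counts` and the (day, id) tuple layout are assumptions made for this illustration.

```python
from collections import Counter

def active_counts(events):
    """Count users per day, treating each user as active from their first to their last event day."""
    first, last = {}, {}
    for day, user in events:
        first[user] = min(day, first.get(user, day))
        last[user] = max(day, last.get(user, day))
    counts = Counter()
    for user in first:
        # Equivalent of generate_series(min_day, max_day, 1) for this user.
        for day in range(first[user], last[user] + 1):
            counts[day] += 1
    return dict(sorted(counts.items()))

events = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
          (1, 1), (1, 2), (2, 1), (3, 1), (4, 1), (4, 2)]
daily = active_counts(events)
```

For the sample data, `daily` has 5 users on day 0 and 2 users on each of days 1 through 4.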
A bit verbose, but this should do:
with dt as (
select 0 d, 1 id
union all
select 0 d, 2 id
union all
select 0 d, 3 id
union all
select 0 d, 4 id
union all
select 0 d, 5 id
union all
select 1 d, 1 id
union all
select 1 d, 2 id
union all
select 2 d, 1 id
union all
select 3 d, 1 id
union all
select 4 d, 1 id
union all
select 4 d, 2 id
)
, active_periods as (
select id
, min(d) min_d
, max(d) max_d
from dt
group by id
)
, days as (
select distinct d
from dt
)
select d.d
, count(ap.id)
from days d
join active_periods ap on d.d between ap.min_d and ap.max_d
group by 1
order by 1 asc
You need to count by day.
select
Day,
count(*)
from df
GROUP BY
Day

SQL group uniquely by type and by position

Given this dataset:
ID type_id Position
1 2 7
2 1 2
3 3 5
4 1 1
5 3 3
6 2 4
7 2 6
8 3 8
(There are only 3 different possible type_ids) I'd like to return a dataset with one of each type_id in groups, ordered by position.
so it would be grouped like so:
Results (ID): [4, 6, 5], [2, 7, 3], [null, 1, 8]
So the first group would consist of each of the entries type_id's with the highest (Relative) position score, the second group would have the second highest score, the third would only consist of two entries (and a null) because there are not three more of each type_id
Does this make sense? And is it possible?
Something like this:
with CTE as (
select
row_number() over (partition by type_id order by Position) as row_num,
*
from test
)
select array_agg(ID order by type_id)
from CTE
group by row_num
SQL FIDDLE
or, if you absolutely need nulls in your arrays:
with CTE as (
select
row_number() over (partition by type_id order by Position) as row_num,
*
from test
)
select array_agg(c.ID order by t.type_id)
from (select distinct row_num from CTE) as a
cross join (select distinct type_id from test) as t
left outer join CTE as c on c.row_num = a.row_num and c.type_id = t.type_id
group by a.row_num
SQL FIDDLE
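The second query amounts to ranking rows within each type_id by position and then reading across the ranks, padding with nulls. A small Python sketch of that logic, assuming rows are (ID, type_id, Position) tuples and using the illustrative name `group_by_rank`:

```python
from collections import defaultdict

def group_by_rank(rows):
    """Group IDs so that group i holds the i-th row (by position) of each type, None where a type runs out."""
    by_type = defaultdict(list)
    # Appending in position order plays the role of row_number() per type_id.
    for id_, type_id, _position in sorted(rows, key=lambda r: r[2]):
        by_type[type_id].append(id_)
    types = sorted(by_type)
    depth = max((len(ids) for ids in by_type.values()), default=0)
    return [[by_type[t][i] if i < len(by_type[t]) else None for t in types]
            for i in range(depth)]

rows = [(1, 2, 7), (2, 1, 2), (3, 3, 5), (4, 1, 1),
        (5, 3, 3), (6, 2, 4), (7, 2, 6), (8, 3, 8)]
groups = group_by_rank(rows)
```

For the dataset in the question, `groups` is [[4, 6, 5], [2, 7, 3], [None, 1, 8]], with each inner list ordered by type_id.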