Dense rank, partitioned by column A, incremented by change in column B but ordered by column C - sql

I have a table like so
name|subtitle|date
ABC|excel|2018-07-07
ABC|excel|2018-08-08
ABC|ppt|2018-09-09
ABC|ppt|2018-10-10
ABC|excel|2018-11-11
ABC|ppt|2018-12-12
DEF|ppt|2018-12-31
I want to add a column that increments whenever there's a change in the subtitle, like so:
name|subtitle|date|Group_Number
ABC|excel|2018-07-07|1
ABC|excel|2018-08-08|1
ABC|ppt|2018-09-09|2
ABC|ppt|2018-10-10|2
ABC|excel|2018-11-11|3
ABC|ppt|2018-12-12|4
DEF|ppt|2018-12-31|1
The problem is that if I do Dense_rank() over(partition by name order by subtitle), then not only does this group all subtitles into one group, it also removes the date ordering. I've also tried using the lag function, but that doesn't seem to be very useful when you're trying to increment a column.
Is there a simple way to achieve this?
Bear in mind that the table I'm using has hundreds of different names.

Quick answer
declare @table table (name varchar(20), subtitle varchar(20), [date] date)
insert into @table (name, subtitle, [date])
values
('ABC','excel','2018-07-07'),
('ABC','excel','2018-08-08'),
('ABC','ppt','2018-09-09'),
('ABC','ppt','2018-10-10'),
('ABC','excel','2018-11-11'),
('ABC','ppt','2018-12-12'),
('DEF','ppt','2018-12-31');
with nums as (
select *,
case when subtitle != lag(subtitle,1) over (partition by name order by date)
then 1
else 0 end as num
from @table
)
select *,
1+sum(num) over (partition by name order by date) AS Group_Number
from nums
Explanation
What you ask isn't exactly ranking. You are trying to detect "islands" where the name and subtitle are the same, in a sequence ordered strictly by date.
To do that, you can compare the current row's value to the previous one. If they match, you are in the same "island". If not, there's a switch. You can use that to emit, e.g., a 1 each time a change is detected.
That's what this does:
CASE WHEN subtitle != LAG(subtitle,1) OVER (PARTITION BY name ORDER BY date)
THEN 1
Once you have that, you can calculate the number of changes with a running total:
sum(num) over (partition by name order by date) AS Group_Number
This will generate values starting from 0. To get numbers starting from 1, just add 1:
1+sum(num) over (partition by name order by date) AS Group_Number
UPDATE
As T. Clausen explains in the comments, reversing the comparison will get rid of the +1:
with nums as (
select *,
case when subtitle = lag(subtitle,1) over (partition by name order by date)
then 0
else 1 end as num
from @table
)
select *,
sum(num) over (partition by name order by date) AS Group_Number
from nums
It's also a better way to detect islands, even if the results in this case are the same. The first query would produce this result:
name  subtitle  date        num  Group_Number
ABC   excel     2018-07-07  0    1
ABC   excel     2018-08-08  0    1
ABC   ppt       2018-09-09  1    2
ABC   ppt       2018-10-10  0    2
ABC   excel     2018-11-11  1    3
ABC   ppt       2018-12-12  1    4
DEF   ppt       2018-12-31  0    1
The query emits a 1 whenever a subtitle break is detected, except at the partition boundaries: there LAG returns NULL, the comparison is not true, and the CASE falls through to 0.
The second query returns :
name  subtitle  date        num  Group_Number
ABC   excel     2018-07-07  1    1
ABC   excel     2018-08-08  0    1
ABC   ppt       2018-09-09  1    2
ABC   ppt       2018-10-10  0    2
ABC   excel     2018-11-11  1    3
ABC   ppt       2018-12-12  1    4
DEF   ppt       2018-12-31  1    1
In this case a 1 is emitted for each change, including at the boundaries: NULL = value is never true, so the first row of each partition falls through to the ELSE 1 branch.
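One caveat worth adding (an assumption about your real data, not something shown in the question): if two rows can share the same name and date, the default RANGE frame used by sum() with ORDER BY treats rows with equal dates as peers and gives them the same running total. Pinning the frame to ROWS keeps each row distinct:
select *,
sum(num) over (partition by name
               order by date
               rows between unbounded preceding and current row) AS Group_Number
from nums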

Related

Adding labels to row based on condition of prior/next row

I have sample data like the following in Snowflake. I'd like to assign groupings (without aggregation) based on grp_start -> grp_end: when grp_start = 1, I want to assign that row a label, and assign each sequential row the same label until grp_end equals 1. That constitutes a single grp; the next grp should then get a different label, following the same logic.
Note: if a single row has grp_start = 1 and grp_end = 1, it should get its own grp label as well, thus following the pattern.
The data needs to be partitioned by id and ordered by start_time as well. Please see the sample data below and the mockup of the desired result. Ideally, I need this to scale to large amounts of data.
Current data:
create or replace temporary table grp_test (id char(4), start_time date, grp_start int, grp_end int)
as select * from values
('0001','2021-01-10',1,0),
('0001','2021-01-11',0,0),
('0001','2021-01-14',0,1),
('0001','2021-07-01',1,1),
('0001','2021-09-25',1,0),
('0001','2021-09-29',0,1),
('0002','2022-11-04',1,0),
('0002','2022-11-25',0,1);
select * from grp_test;
Desired result mockup:
create or replace temporary table desired_result (id char(4), start_time date, grp_start int, grp_end int, label int)
as select * from values
('0001','2021-01-10',1,0,0),
('0001','2021-01-11',0,0,0),
('0001','2021-01-14',0,1,0),
('0001','2021-07-01',1,1,1),
('0001','2021-09-25',1,0,2),
('0001','2021-09-29',0,1,2),
('0002','2022-11-04',1,0,0),
('0002','2022-11-25',0,1,0);
select * from desired_result;
so changing the setup data to:
create or replace temporary table grp_test as
select * from values
('0001','2021-01-10'::date,1,0),
('0001','2021-01-11'::date,0,0),
('0001','2021-01-14'::date,0,1),
('0001','2021-01-15'::date,0,0),
('0001','2021-07-01'::date,1,1),
('0001','2021-09-25'::date,1,0),
('0001','2021-09-29'::date,0,1),
('0002','2022-11-04'::date,1,0),
('0002','2022-11-25'::date,0,1)
t(id, start_time, grp_start, grp_end);
We can use two CONDITIONAL_TRUE_EVENTs; this lets us know when we are after an end but before the next start, and thus set the label to null there.
select d.*
,CONDITIONAL_TRUE_EVENT(grp_start=1) over (partition by id order by start_time) as s_e
,CONDITIONAL_TRUE_EVENT(grp_end=1) over (partition by id order by start_time) as e_e
,iff(s_e != e_e OR grp_end = 1, s_e, null) as label
from grp_test as d
order by 1,2;
ID    START_TIME  GRP_START  GRP_END  S_E  E_E  LABEL
0001  2021-01-10  1          0        1    0    1
0001  2021-01-11  0          0        1    0    1
0001  2021-01-14  0          1        1    1    1
0001  2021-01-15  0          0        1    1    NULL
0001  2021-07-01  1          1        2    2    2
0001  2021-09-25  1          0        3    2    3
0001  2021-09-29  0          1        3    3    3
0002  2022-11-04  1          0        1    0    1
0002  2022-11-25  0          1        1    1    1
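Note the 2021-01-15 row: by then e_e has caught up with s_e (both are 1) and grp_end is 0, so the iff returns null; the row sits after an end but before the next start.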
If you don't actually care about labeling rows that fall after an end but before the next start, you can just use a single CONDITIONAL_TRUE_EVENT:
select d.*
,CONDITIONAL_TRUE_EVENT(grp_start=1) over (partition by id order by start_time) as label
from grp_test as d
order by 1,2;
ID    START_TIME  GRP_START  GRP_END  LABEL
0001  2021-01-10  1          0        1
0001  2021-01-11  0          0        1
0001  2021-01-14  0          1        1
0001  2021-01-15  0          0        1
0001  2021-07-01  1          1        2
0001  2021-09-25  1          0        3
0001  2021-09-29  0          1        3
0002  2022-11-04  1          0        1
0002  2022-11-25  0          1        1
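CONDITIONAL_TRUE_EVENT is Snowflake-specific. On engines without it, a running conditional count is equivalent; a minimal sketch in standard SQL (this is also the idea behind METHOD 2 in the next answer):
select d.*,
       sum(case when grp_start = 1 then 1 else 0 end)
           over (partition by id order by start_time
                 rows between unbounded preceding and current row) as label
from grp_test as d
order by 1, 2;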
Here's a solution that layers two window functions, max and dense_rank. Snowflake (like most other DBMSs) doesn't allow you to nest one window function inside another, so we'll compute the first one in a subquery and the second one in the outer query.
The key to this method is to assign a common date value to all members of a group, in this case the group's start date; dense_rank will then give a 1 to all the records tied for first place, a 2 to the next group, etc. So, for every row in grp_test, we want the max(Start_Time) of the records with grp_start = 1 at or before that row's time.
max(Case When grp_start=1 Then Start_Time End)
Over (Partition By ID Order By Start_Time
Rows Between Unbounded Preceding And Current Row) as grp_start_time
So, putting it all together (note that dense_rank takes no arguments; the grouping date goes in its ORDER BY):
Select ID, Start_Time, Grp_Start, Grp_End,
       dense_rank() Over (Partition By ID Order By grp_start_time) as label
From (
    Select ID, Start_Time, Grp_Start, Grp_End,
           max(Case When grp_start=1 Then Start_Time End)
               Over (Partition By ID Order By Start_Time
                     Rows Between Unbounded Preceding And Current Row) as grp_start_time
    From grp_test
)
Order by ID, Start_Time
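Note that dense_rank numbers groups from 1, while the labels in the desired mockup start at 0. If you need the mockup's numbering exactly, subtracting 1 lines them up (a small adjustment, assuming zero-based labels are required):
dense_rank() Over (Partition By ID Order By grp_start_time) - 1 as label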
METHOD 2
You can simplify this considerably if you are certain grp_start only ever contains zeros and ones. This one simply keeps a running sum of grp_start:
Select ID, Start_Time, Grp_Start, Grp_End,
       sum(Grp_Start) Over (Partition By ID Order By Start_Time
                            Rows Between Unbounded Preceding And Current Row) as label
From grp_test
Order by ID, Start_Time
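For the sample data this running sum yields labels 1, 1, 1, 1, 2, 3, 3 for id 0001 and 1, 1 for id 0002 (computed by hand), matching the single-CONDITIONAL_TRUE_EVENT query above; the same - 1 adjustment applies if you need the mockup's zero-based labels.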

Get certain rows, plus rows before and after

Let's say I have the following data set:
ID   Identifier  Admission_Date  Release_Date
234  2           5/1/22          5/5/22
234  1           4/25/22         4/30/22
234  2           4/20/22         4/24/22
234  2           4/15/22         4/18/22
789  1           7/15/22         7/19/22
789  2           7/8/22          7/14/22
789  2           7/1/22          7/5/22
321  2           6/1/21          6/3/21
321  2           5/27/21         5/31/21
321  1           5/20/21         5/26/21
321  2           5/15/21         5/19/21
321  2           5/6/21          5/10/21
I want all rows with Identifier = 1. I also want the rows directly above and below each Identifier = 1 row, sorted from most recent to least recent.
There is always a row below a row with Identifier = 1. There may or may not be a row above. If an ID has no row with Identifier = 1, it will already have been removed by a prior step.
The resulting data set should be as follows:
ID   Identifier  Admission_Date  Release_Date
234  2           5/1/22          5/5/22
234  1           4/25/22         4/30/22
234  2           4/20/22         4/24/22
789  1           7/15/22         7/19/22
789  2           7/8/22          7/14/22
321  2           5/27/21         5/31/21
321  1           5/20/21         5/26/21
321  2           5/15/21         5/19/21
I am using DBeaver, which runs PostgreSQL.
I admittedly don't know Postgres well, so the following could possibly be optimised; however, using a combination of lag and lead to obtain the previous and next dates (assuming Admission_Date is the column to order by), you could try:
with d as (
select *,
case when identifier = 1 then Lag(admission_date) over(partition by id order by Admission_Date desc) end pd,
case when identifier = 1 then Lead(admission_date) over(partition by id order by Admission_Date desc) end nd
from t
)
select id, Identifier, Admission_Date, Release_Date
from d
where identifier = 1
or exists (
select * from d d2
where d2.id = d.id
and (d.Admission_Date = pd or d.admission_date = nd)
)
order by Id, Admission_Date desc;
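A single-pass variant of the same idea (a sketch, not from the answer above; it assumes the same table t): instead of comparing dates, carry the neighboring identifier values along and keep any row whose own, previous, or next identifier is 1:
select id, identifier, admission_date, release_date
from (
    select t.*,
           lag(identifier)  over w as prev_ident,  -- identifier of the more recent neighbor
           lead(identifier) over w as next_ident   -- identifier of the older neighbor
    from t
    window w as (partition by id order by admission_date desc)
) x
where 1 in (identifier, prev_ident, next_ident)
order by id, admission_date desc;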
One way:
SELECT (x.my_row).* -- decompose fields from row type
FROM (
SELECT identifier
, lag(t) OVER w AS t0 -- take whole row
, t AS t1
, lead(t) OVER w AS t2
FROM tbl t
WINDOW w AS (PARTITION BY id ORDER BY admission_date)
) sub
CROSS JOIN LATERAL (
VALUES (t0), (t1), (t2) -- pivot
) x(my_row)
WHERE sub.identifier = 1
AND (x.my_row).id IS NOT NULL; -- exclude rows with NULL ( = missing row)
db<>fiddle here
The query is designed to only make a single pass over the table.
Uses some advanced SQL / Postgres features.
About LATERAL:
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
About the VALUES expression:
Postgres: convert single row to multiple rows (unpivot)
The manual about extracting fields from a composite type.
If there are many rows per id, other solutions will be (much) faster - with proper index support. You did not specify ...
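For illustration only (the table and index names here are assumptions, not from the answer): with an index on (id, admission_date), a lateral lookup can fetch each identifier = 1 row plus its immediate neighbors without scanning whole partitions:
-- hypothetical index to support the per-id ordered lookups
CREATE INDEX tbl_id_adm_idx ON tbl (id, admission_date);

SELECT DISTINCT n.*
FROM tbl t
CROSS JOIN LATERAL (
    (SELECT * FROM tbl p                    -- the row directly before, if any
     WHERE p.id = t.id AND p.admission_date < t.admission_date
     ORDER BY p.admission_date DESC LIMIT 1)
    UNION ALL
    SELECT t.*                              -- the identifier = 1 row itself
    UNION ALL
    (SELECT * FROM tbl nx                   -- the row directly after, if any
     WHERE nx.id = t.id AND nx.admission_date > t.admission_date
     ORDER BY nx.admission_date ASC LIMIT 1)
) n
WHERE t.identifier = 1
ORDER BY n.id, n.admission_date DESC;

DISTINCT covers the case where two identifier = 1 rows sit next to each other and would otherwise return a shared neighbor twice.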

SQL: compare the values of 2 columns and select the row with the max value, row by row

I have a table something like:
GROUP  NAME  Value_1  Value_2
1      ABC   0        0
1      DEF   4        4
50     XYZ   6        6
50     QWE   6        7
100    XYZ   26       2
100    QWE   26       2
What I would like to do is group by GROUP and select the name with the highest Value_1. If the Value_1 values are tied, compare and take the one with the max Value_2. If they're still tied, select the first one.
The output will be something like:
GROUP  NAME  Value_1  Value_2
1      DEF   4        4
50     QWE   6        7
100    XYZ   26       2
The challenge for me here is that I don't know how many categories are in NAME, so a simple CASE WHEN doesn't work. Thanks for any help.
You can use window functions to solve the bulk of your problem:
select t.*
from (select t.*,
             -- GROUP is a reserved word, so it is quoted here; adjust the quoting to your DBMS
             row_number() over (partition by "group"
                                order by value_1 desc, value_2 desc) as seqnum
      from t
     ) t
where seqnum = 1;
The one caveat is the condition:
If they're still the same, select the first one.
SQL tables represent unordered (multi-) sets. There is no "first" one unless a column specifies the ordering. The best you can do is choose an arbitrary value when all the other values are the same.
That said, you might have another column that has an ordering. If so, add that as a third key to the order by.
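For instance (a sketch; using NAME as the tie-breaker is an assumption, any deterministic column works):
select t.*
from (select t.*,
             row_number() over (partition by "group"
                                order by value_1 desc, value_2 desc, name) as seqnum
      from t
     ) t
where seqnum = 1;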

Calculate "position in run" in SQL

I have a table of consecutive ids (integers, 1 ... n), and values (integers), like this:
Input Table:
id value
-- -----
1 1
2 1
3 2
4 3
5 1
6 1
7 1
Going down the table, i.e. in order of increasing id, I want to count how many times in a row the same value has been seen consecutively, i.e. the position within a run:
Output Table:
id value position in run
-- ----- ---------------
1 1 1
2 1 2
3 2 1
4 3 1
5 1 1
6 1 2
7 1 3
Any ideas? I've searched for a combination of windowing functions, including lead and lag, but can't come up with it. Note that the same value can appear in the value column as part of different runs, so partitioning by value alone won't solve this. I'm on Hive 1.2.
One way is to use a difference of row numbers approach to classify consecutive same values into one group. Then a row number function to get the desired positions in each group.
Query to assign groups (Running this will help you understand how the groups are assigned.)
select t.*
,row_number() over(order by id) - row_number() over(partition by value order by id) as rnum_diff
from tbl t
Final query, using row_number to get the position within each group assigned by the above query:
select id,value,row_number() over(partition by value,rnum_diff order by id) as pos_in_grp
from (select t.*
,row_number() over(order by id) - row_number() over(partition by value order by id) as rnum_diff
from tbl t
) t
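For the sample data, the inner query assigns (values computed by hand):
id  value  rnum_diff
1   1      0
2   1      0
3   2      2
4   3      3
5   1      2
6   1      2
7   1      2
Each distinct (value, rnum_diff) pair marks one run: ids 1-2 are one run of 1s, id 3 a run of 2s, id 4 a run of 3s, and ids 5-7 a second, separate run of 1s, so the outer row_number produces the desired positions.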

Resetting row number according to column value T-SQL

I have got the following data, with a column indicating the first record within what we'll call an episode, though there is no episode ID. The ID column identifies an individual person.
ID  StartDate   EndDate     First_Record
1   2013-11-30  2013-12-08  0
1   2013-12-08  2013-12-14  NULL
1   2013-12-14  2013-12-16  NULL
1   2013-12-16  2013-12-24  NULL
2   2001-02-02  2001-02-02  0
2   2001-02-03  2001-02-05  NULL
2   2010-03-11  2010-03-15  0
2   2010-03-15  2010-03-23  NULL
2   2010-03-24  2010-03-26  NULL
And I am trying to get a column indicating the row number (starting with 0), grouped by ID and ordered by start date, but the row number needs to reset whenever the First_Record column is not null. Hence the desired output column Depth.
ID  StartDate   EndDate     First_Record  Depth
1   2013-11-30  2013-12-08  0             0
1   2013-12-08  2013-12-14  NULL          1
1   2013-12-14  2013-12-16  NULL          2
1   2013-12-16  2013-12-24  NULL          3
2   2001-02-02  2001-02-02  0             0
2   2001-02-03  2001-02-05  NULL          1
2   2010-03-11  2010-03-15  0             0
2   2010-03-15  2010-03-23  NULL          1
2   2010-03-24  2010-03-26  NULL          2
I can't seem to come up with a solution. I found a similar thread, but I need help translating it to what I'm trying to do. It has to use the First_Record column, as it has been set by specific conditions. Any help appreciated.
If a person can have only one episode, you can just use row_number():
select t.*, row_number() over (partition by id order by startDate) - 1 as depth
from t;
Otherwise, you can calculate the episode grouping using a cumulative sum and then use that:
select t.*,
row_number() over (partition by id, grp order by startDate) - 1 as depth
from (select t.*,
count(first_record) over (partition by id order by startdate) as grp
from t
) t;
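To see why this works: count(first_record) ignores NULLs, so the cumulative count increments exactly at each non-null First_Record (hand-computed for the sample, grp is 1, 1, 1, 1 for ID 1 and 1, 1, 2, 2, 2 for ID 2). Numbering within (id, grp) and subtracting 1 then gives the desired Depth values 0, 1, 2, 3 and 0, 1, 0, 1, 2.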
Now the depth will start from 0.
SELECT t.*,
       convert(INT, row_number() OVER (PARTITION BY id ORDER BY startDate)) - 1 AS Depth
FROM t;
Note that this variant, like the first query above, numbers all rows per id without resetting, so it only matches the desired Depth when each person has a single episode.