Adding labels to rows based on condition of prior/next row - SQL

I have sample data like the following in Snowflake. I'd like to assign group labels (without aggregation) based on grp_start -> grp_end: when grp_start = 1 I want to assign the row a label, and assign each sequential row the same label until grp_end = 1. That constitutes a single group; the next group should then get a different label and follow the same logic.
Note: if a single row has grp_start = 1 and grp_end = 1, I want it to have its own group label as well, following the same pattern.
The data needs to be partitioned by id and ordered by start_time. Please see the sample data below and a mockup of the desired result. Ideally, this needs to scale to large amounts of data.
Current data:
create or replace temporary table grp_test (id char(4), start_time date, grp_start int, grp_end int)
as select * from values
('0001','2021-01-10',1,0),
('0001','2021-01-11',0,0),
('0001','2021-01-14',0,1),
('0001','2021-07-01',1,1),
('0001','2021-09-25',1,0),
('0001','2021-09-29',0,1),
('0002','2022-11-04',1,0),
('0002','2022-11-25',0,1);
select * from grp_test;
Desired result mockup:
create or replace temporary table desired_result (id char(4), start_time date, grp_start int, grp_end int, label int)
as select * from values
('0001','2021-01-10',1,0,0),
('0001','2021-01-11',0,0,0),
('0001','2021-01-14',0,1,0),
('0001','2021-07-01',1,1,1),
('0001','2021-09-25',1,0,2),
('0001','2021-09-29',0,1,2),
('0002','2022-11-04',1,0,0),
('0002','2022-11-25',0,1,0);
select * from desired_result;

So, changing the setup data to include a gap row (2021-01-15):
create or replace temporary table grp_test as
select * from values
('0001','2021-01-10'::date,1,0),
('0001','2021-01-11'::date,0,0),
('0001','2021-01-14'::date,0,1),
('0001','2021-01-15'::date,0,0),
('0001','2021-07-01'::date,1,1),
('0001','2021-09-25'::date,1,0),
('0001','2021-09-29'::date,0,1),
('0002','2022-11-04'::date,1,0),
('0002','2022-11-25'::date,0,1)
t(id, start_time, grp_start, grp_end);
We can use two CONDITIONAL_TRUE_EVENTs; this lets us know when we are outside an end but before the next start, so we can set the label to null there.
select d.*
,CONDITIONAL_TRUE_EVENT(grp_start=1) over (partition by id order by start_time) as s_e
,CONDITIONAL_TRUE_EVENT(grp_end=1) over (partition by id order by start_time) as e_e
,iff(s_e != e_e OR grp_end = 1, s_e, null) as label
from grp_test as d
order by 1,2;
ID    START_TIME  GRP_START  GRP_END  S_E  E_E  LABEL
0001  2021-01-10  1          0        1    0    1
0001  2021-01-11  0          0        1    0    1
0001  2021-01-14  0          1        1    1    1
0001  2021-01-15  0          0        1    1    NULL
0001  2021-07-01  1          1        2    2    2
0001  2021-09-25  1          0        3    2    3
0001  2021-09-29  0          1        3    3    3
0002  2022-11-04  1          0        1    0    1
0002  2022-11-25  0          1        1    1    1
(Note the NULL label on the 2021-01-15 row: it falls after an end but before the next start, so s_e = e_e and grp_end = 0.)
If you don't actually care about labeling rows after an end but before the next start, you can just use a single CONDITIONAL_TRUE_EVENT:
select d.*
,CONDITIONAL_TRUE_EVENT(grp_start=1) over (partition by id order by start_time) as label
from grp_test as d
order by 1,2;
ID    START_TIME  GRP_START  GRP_END  LABEL
0001  2021-01-10  1          0        1
0001  2021-01-11  0          0        1
0001  2021-01-14  0          1        1
0001  2021-01-15  0          0        1
0001  2021-07-01  1          1        2
0001  2021-09-25  1          0        3
0001  2021-09-29  0          1        3
0002  2022-11-04  1          0        1
0002  2022-11-25  0          1        1
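These labels start at 1 within each id, while the desired-result mockup numbers groups from 0. If you need the zero-based labels, subtracting 1 from the event counter should be enough (a minimal sketch along the same lines):
select d.*
,CONDITIONAL_TRUE_EVENT(grp_start=1) over (partition by id order by start_time) - 1 as label
from grp_test as d
order by 1,2;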

Here's a solution that uses two nested window functions, max and dense_rank. Snowflake (like most other DBMSs) doesn't allow you to nest two window functions, so we'll compute the first one in a subquery and the second one in the outer query.
The key to this method is to assign a common date value to all members of a group, in this case the start date of the group; dense_rank will then give a 1 to all the records tied for first place, a 2 to the next group, and so on. So for every row in grp_test we want the max(Start_Time) of the records with grp_start = 1 at or before that row's time:
max(Case When grp_start=1 Then Start_Time End)
Over (Partition By ID Order By Start_Time
Rows Between Unbounded Preceding And Current Row) as grp_start_time
So, putting it all together:
Select ID, Start_Time, Grp_Start, Grp_End,
dense_rank() Over (Partition By ID Order By grp_start_time) as label
From (
Select ID, Start_Time, Grp_Start, Grp_End,
max(Case When grp_start=1 Then Start_Time End)
Over (Partition By ID Order By Start_Time
Rows Between Unbounded Preceding And Current Row) as grp_start_time
From grp_test
)
Order by ID,Start_Time
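Like the CONDITIONAL_TRUE_EVENT approach above, this numbers groups from 1 within each ID; subtract 1 from the label if you want the zero-based numbering shown in the desired-result mockup.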
METHOD 2
You can simplify this considerably if you are certain grp_start only ever contains zeros and ones. This one simply creates a running sum of grp_start:
Select ID, Start_Time, Grp_Start, Grp_End,
sum(Grp_Start) Over (Partition By ID Order By Start_Time
Rows Between Unbounded Preceding And Current Row) as label
From grp_test
Order by ID,Start_Time
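If you also want the gap rows (after an end, before the next start) to get a NULL label, as the two-event query above does, a second running sum over grp_end can drive the same condition. A sketch, assuming grp_end likewise only contains zeros and ones:
Select ID, Start_Time, Grp_Start, Grp_End,
    -- s counts starts so far, e counts ends so far; rows where they are
    -- equal and grp_end = 0 sit between groups, so they get NULL
    iff(s != e Or Grp_End = 1, s, null) as label
From (
    Select ID, Start_Time, Grp_Start, Grp_End,
        sum(Grp_Start) Over (Partition By ID Order By Start_Time) as s,
        sum(Grp_End) Over (Partition By ID Order By Start_Time) as e
    From grp_test
)
Order by ID, Start_Time;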


Calculate sum and cumul by age group and by date (but people changes age group as time passes)

DBMS: PostgreSQL
My problem:
In my database I have a person table with id and birth date, an events table that links a person, an event (id_event) and a date, and an age table used for grouping ages. In the real database the person table has about 40 million rows, and events is 3 times bigger.
I need to produce a report (sum and cumulative count of X events) by age group (age_group) and date (event_date). There is no problem counting the number of events by date. The problem lies with the cumulative count: contrary to other variables (sex, for example), a person grows older and changes age group as time passes, so for a given age group the cumulative count can increase and then decrease. I want the cumulative count of events, on every date in my report, to use the age each person has on that date.
Example of my inputs and desired output
The only way I found is to do a Cartesian product of the person table and the dates (v_dates), which makes it easy to follow an event and have it change age_group. The code below uses this method.
BUT I can't use a Cartesian product on my real data (it makes a table far too big), so I need another method.
Reproducible example
In this simplified example I want to produce a report from 2020-07-01 to 2022-07-01 in 6-month steps (view v_dates). In reality I need to produce the same report by day, but the logic remains the same.
My inputs
/* create table person*/
DROP TABLE IF EXISTS person;
CREATE TABLE person
(
person_id varchar(1),
person_birth_date date
);
INSERT INTO person
VALUES ('A', '2017-01-01'),
('B', '2016-07-01');
person_id  person_birth_date
A          2017-01-01
B          2016-07-01
/* create table events*/
DROP TABLE IF EXISTS events;
CREATE TABLE events
(
person_id varchar(1),
event_id integer,
event_date date
);
INSERT INTO events
VALUES ('A', 1, '2020-07-01'),
('A', 2, '2021-07-01'),
('B', 1, '2021-01-01'),
('B', 2, '2022-01-01');
person_id  event_id  event_date
A          1         2020-07-01
A          2         2021-07-01
B          1         2021-01-01
B          2         2022-01-01
/* create table age*/
DROP TABLE IF EXISTS age;
CREATE TABLE age
(
age integer,
age_group varchar(8)
);
INSERT INTO age
VALUES (0,'[0-4]'),
(1,'[0-4]'),
(2,'[0-4]'),
(3,'[0-4]'),
(4,'[0-4]'),
(5,'[5-9]'),
(6,'[5-9]'),
(7,'[5-9]'),
(8,'[5-9]'),
(9,'[5-9]');
/* create view v_dates : contains dates from 2020-07-01 to 2022-07-01 in 6-month steps */
CREATE or replace view v_dates AS
SELECT GENERATE_SERIES('2020-07-01'::date, '2022-07-01'::date, '6 month')::date as event_date;
age  age_group
0    [0-4]
1    [0-4]
5    [5-9]
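For reference, the generate_series above steps by 6 months, so v_dates should contain five dates:
SELECT * FROM v_dates;
-- event_date: 2020-07-01, 2021-01-01, 2021-07-01, 2022-01-01, 2022-07-01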
My current method using a Cartesian product:
CROSS JOIN person * v_dates,
with a LEFT JOIN to get info from the events table,
with a LEFT JOIN to get age_group from the age table.
CREATE or replace view v_person_event AS
SELECT
pdev.person_id,
pdev.event_date,
pdev.age,
ag.age_group,
pdev.event1,
pdev.event2
FROM
(
SELECT pd.person_id,
pd.event_date,
date_part('year', age(pd.event_date::TIMESTAMP, pd.person_birth_date::TIMESTAMP)) as age,
CASE WHEN ev.event_id = 1 THEN 1 else 0 END as event1,
CASE WHEN ev.event_id = 2 THEN 1 else 0 END as event2
FROM
(
SELECT *
FROM person
CROSS JOIN v_dates
) pd
LEFT JOIN events ev
on pd.person_id = ev.person_id
and pd.event_date = ev.event_date
) pdev
Left JOIN age as ag on pdev.age = ag.age
ORDER by pdev.person_id, pdev.event_date;
Then add the columns event1_cum and event2_cum:
CREATE or replace view v_person_event_cum AS
SELECT *,
SUM(event1) OVER (PARTITION BY person_id ORDER BY event_date) event1_cum,
SUM(event2) OVER (PARTITION BY person_id ORDER BY event_date) event2_cum
FROM v_person_event;
SELECT * FROM v_person_event_cum;
person_id  event_date  age  age_group  event1  event2  event1_cum  event2_cum
A          2020-07-01  3    [0-4]      1       0       1           0
A          2021-01-01  4    [0-4]      0       0       1           0
A          2021-07-01  4    [0-4]      0       1       1           1
A          2022-01-01  5    [5-9]      0       0       1           1
A          2022-07-01  5    [5-9]      0       0       1           1
B          2020-07-01  4    [0-4]      0       0       0           0
B          2021-01-01  4    [0-4]      1       0       1           0
B          2021-07-01  5    [5-9]      0       0       1           0
B          2022-01-01  5    [5-9]      0       1       1           1
B          2022-07-01  6    [5-9]      0       0       1           1
Desired output: create a report grouped by age_group and event_date
SELECT
age_group,
event_date,
SUM(event1) as event1,
SUM(event2) as event2,
SUM(event1_cum) as event1_cum,
SUM(event2_cum) as event2_cum
FROM v_person_event_cum
GROUP BY age_group, event_date
ORDER BY age_group, event_date;
age_group  event_date  event1  event2  event1_cum  event2_cum
[0-4]      2020-07-01  1       0       1           0
[0-4]      2021-01-01  1       0       2           0
[0-4]      2021-07-01  0       1       1           1
[5-9]      2021-07-01  0       0       1           0
[5-9]      2022-01-01  0       1       2           2
This is why this is not an ordinary cumulative count: for the age group [0-4], event1_cum goes from 2 at 2021-01-01 to 1 at 2021-07-01, because A was in [0-4] at the time of event 1, still in [0-4] at 2021-01-01, but in [5-9] at 2021-07-01.
When we read the report:
On 2021-01-01, there were 2 people between 0 and 4 (at that date) who had had event1 and 0 people who had had event2.
On 2021-07-01, there was 1 person between 0 and 4 who had had event1 and 1 person who had had event2.
I can't find a solution to this problem without using a Cartesian product...
Thanks in advance!

Getting count of last records of 2 columns SQL

I was looking for a solution for the scenario below.
My table structure is like this; table name: energy_readings
equipment_id  meter_id  readings  reading_date
1             1         100       01/01/2022
1             1         200       02/01/2022
1             1         null      03/01/2022
1             2         100       01/01/2022
1             2         null      04/01/2022
2             1         null      04/01/2022
2             1         399       05/01/2022
2             2         null      02/01/2022
So from this, I want to get the number of nulls on the last record of each equipment_id and meter_id pair. (Only nulls on the last record of a given equipment_id and meter_id should be counted.)
EX: Here, the last reading for equipment 1 and meter 1 is null, so it should be counted. The last reading (latest date) for equipment 1 and meter 2 is also null, so it should be counted. But even though equipment 2 and meter 1 has a null, it is not the last record (latest date), so it should not be counted.
Thus, this should be the result:
equipment_id  Count
1             2
2             1
Hope I was clear with the question.
Thank you!
You can use a CTE like below. The CTE LatestRecord gets the latest record for each equipment_id & meter_id. You can then join it with your table and use WHERE to keep only the records with null readings.
;WITH LatestRecord AS (
SELECT equipment_id, meter_id, MAX(reading_date) AS reading_date
FROM energy_readings
GROUP BY equipment_id, meter_id
)
SELECT er.equipment_id, COUNT(1) AS [Count]
FROM energy_readings er
JOIN LatestRecord lr
ON lr.equipment_id = er.equipment_id
AND lr.meter_id = er.meter_id
AND lr.reading_date = er.reading_date
WHERE er.readings IS NULL
GROUP BY er.equipment_id
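If two readings for the same equipment/meter pair could share the latest reading_date, the join on MAX(reading_date) would count that pair twice. A ROW_NUMBER variant avoids ties; this is a sketch, not part of the original answer, assuming the same SQL Server dialect:
;WITH Ranked AS (
SELECT equipment_id, meter_id, readings,
ROW_NUMBER() OVER (PARTITION BY equipment_id, meter_id
ORDER BY reading_date DESC) AS rn
FROM energy_readings
)
SELECT equipment_id, COUNT(1) AS [Count]
FROM Ranked
WHERE rn = 1 AND readings IS NULL
GROUP BY equipment_id;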
with records as (
select equipment_id, meter_id, reading_date, readings,
rank() over (partition by equipment_id, meter_id
order by reading_date) as counter
from energy_readings
),
last_records as (
select equipment_id, meter_id, max(counter) as max_counter
from records
group by equipment_id, meter_id
)
select r.equipment_id, count(r.counter) as "Count"
from records r
join last_records lr
on lr.equipment_id = r.equipment_id
and lr.meter_id = r.meter_id
and lr.max_counter = r.counter
where r.readings is null
group by r.equipment_id;
Explanation:
records ranks each equipment/meter pair's rows by reading_date, numbering them 1, 2, 3, ...
last_records keeps the max of that counter per pair, i.e. the latest record.
The outer query joins the two and counts, per equipment_id, the latest records whose reading is null.

Summing up only the values of previous rows with the same ID

While preparing my data for predicting no-shows at a hospital, I ran into the following problem. In the query below I tried to get the number of shows/no-shows relative to the number of appointments (APPTS). INDICATION_NO_SHOW indicates whether a patient showed up at an appointment: 0 means show, and 1 means no-show.
with t1 as
(
select
PAT_ID
,APPT_TIME
,APPT_ID
,ROW_NUMBER () over(PARTITION BY PAT_ID order by pat_id,APPT_TIME) as [TOTAL_APPTS]
,INDICATION_NO_SHOW
from appointments
)
,
t2 as
(
select
t1.PAT_ID
,t1.APPT_TIME
,INDICATION_NO_SHOW
,sum(INDICATION_NO_SHOW) over(order by PAT_ID, APPT_TIME) as TOTAL_NO_SHOWS
,TOTAL_APPTS
from t1
)
SELECT *
,(TOTAL_APPTS - TOTAL_NO_SHOWS) AS TOTAL_SHOWS
FROM T2
order by PAT_ID, APPT_TIME
This resulted into the following dataset:
PAT_ID APPT_TIME INDICATION_NO_SHOW TOTAL_SHOWS TOTAL_NO_SHOWS TOTAL_APPTS
1 1-1-2001 0 1 0 1
1 1-2-2001 0 2 0 2
1 1-3-2001 1 2 1 3
1 1-4-2001 0 3 1 4
2 1-1-2001 0 0 1 1
2 2-1-2001 0 1 1 2
2 2-2-2001 1 1 2 3
2 2-3-2001 0 2 2 4
As you can see, my query only worked for patient 1; it then carries patient 1's no-show count over into patient 2. So individually it worked for one patient, but not across the whole dataset.
The TOTAL_APPTS column worked out, because it counts the number of appointments the patient had as of that given appointment. My question is: how do I successfully get these shows and no-shows added up (as I did for patient 1)? I'm completely aware of why this query doesn't work; I'm just completely in the blue on how to fix it.
I think that you can just use window functions. You seem to be looking for window sums of shows and no shows per patient, so:
select
pat_id,
appt_time,
indication_no_show,
sum(1 - indication_no_show)
over(partition by pat_id order by appt_time) total_shows,
sum(indication_no_show)
over(partition by pat_id order by appt_time) total_no_shows
from appointments
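With the partition by pat_id in place, the sums reset per patient. Derived by hand from the sample data, patient 2's rows should now read:
PAT_ID APPT_TIME INDICATION_NO_SHOW TOTAL_SHOWS TOTAL_NO_SHOWS
2 1-1-2001 0 1 0
2 2-1-2001 0 2 0
2 2-2-2001 1 2 1
2 2-3-2001 0 3 1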

Dense rank, partitioned by column A, incremented by change in column B but ordered by column C

I have a table like so
name|subtitle|date
ABC|excel|2018-07-07
ABC|excel|2018-08-08
ABC|ppt|2018-09-09
ABC|ppt|2018-10-10
ABC|excel|2018-11-11
ABC|ppt|2018-12-12
DEF|ppt|2018-12-31
I want to add a column that increments whenever there's a change in the subtitle, like so:
name|subtitle|date|Group_Number
ABC|excel|2018-07-07|1
ABC|excel|2018-08-08|1
ABC|ppt|2018-09-09|2
ABC|ppt|2018-10-10|2
ABC|excel|2018-11-11|3
ABC|ppt|2018-12-12|4
DEF|ppt|2018-12-31|1
The problem is, if I do dense_rank() over(partition by name order by subtitle), then not only does this group all subtitles into one group, it also removes the date ordering. I've also tried using the lag function, but that doesn't seem to be very useful when you're trying to increment a column.
Is there a simple way to achieve this?
Bear in mind that the table I'm using has hundreds of different names.
Quick answer
declare #table table (name varchar(20),subtitle varchar(20),[date] date )
insert into #table (name,subtitle,date)
values
('ABC','excel','2018-07-07'),
('ABC','excel','2018-08-08'),
('ABC','ppt','2018-09-09'),
('ABC','ppt','2018-10-10'),
('ABC','excel','2018-11-11'),
('ABC','ppt','2018-12-12'),
('DEF','ppt','2018-12-31');
with nums as (
select *,
case when subtitle != lag(subtitle,1) over (partition by name order by date)
then 1
else 0 end as num
from #table
)
select *,
1+sum(num) over (partition by name order by date) AS Group_Number
from nums
Explanation
What you ask isn't exactly ranking. You are trying to detect "islands" where the name and subtitle are the same in a sequences ordered strictly by the date.
To do that, you can compare the current row's value to the previous one. If they match, you are in the same "island". If not, there's a switch. You can use that to emit e.g. a 1 each time a change is detected.
That's what this does:
CASE WHEN subtitle != LAG(subtitle,1) OVER (PARTITION BY name ORDER BY date)
THEN 1
Once you have that, you can calculate the number of changes with a running total:
sum(num) over (partition by name order by date) AS Group_Number
This will generate values starting from 0. To get numbers starting from 1, just add 1:
1+sum(num) over (partition by name order by date) AS Group_Number
UPDATE
As T. Clausen explains in the comments, reversing the comparison gets rid of the +1:
with nums as (
select *,
case when subtitle = lag(subtitle,1) over (partition by name order by date)
then 0
else 1 end as num
from #table
)
select *,
sum(num) over (partition by name order by date) AS Group_Number
from nums
It's also a better way to detect islands, even if the results in this case are the same. The first query would produce this result:
name subtitle date num Group_Number
ABC excel 2018-07-07 0 1
ABC excel 2018-08-08 0 1
ABC ppt 2018-09-09 1 2
ABC ppt 2018-10-10 0 2
ABC excel 2018-11-11 1 3
ABC ppt 2018-12-12 1 4
DEF ppt 2018-12-31 0 1
The query emits 1 when a subtitle break is detected except at the boundaries.
The second query returns :
name subtitle date num Group_Number
ABC excel 2018-07-07 1 1
ABC excel 2018-08-08 0 1
ABC ppt 2018-09-09 1 2
ABC ppt 2018-10-10 0 2
ABC excel 2018-11-11 1 3
ABC ppt 2018-12-12 1 4
DEF ppt 2018-12-31 1 1
In this case, 1 is emitted for each change, including at the boundaries.

Resetting row number according to column value T-SQL

I have got the following data with a column indicating the first record within what we'll call an episode, though there is no episode ID. The ID column indicates an individual person.
ID StartDate EndDate First_Record
1 2013-11-30 2013-12-08 0
1 2013-12-08 2013-12-14 NULL
1 2013-12-14 2013-12-16 NULL
1 2013-12-16 2013-12-24 NULL
2 2001-02-02 2001-02-02 0
2 2001-02-03 2001-02-05 NULL
2 2010-03-11 2010-03-15 0
2 2010-03-15 2010-03-23 NULL
2 2010-03-24 2010-03-26 NULL
And I am trying to get a column indicating the row number (starting with 0) grouped by ID and ordered by start date, but the row number needs to reset whenever the First_Record column is not null. Hence the desired output column Depth.
ID StartDate EndDate First_Record Depth
1 2013-11-30 2013-12-08 0 0
1 2013-12-08 2013-12-14 NULL 1
1 2013-12-14 2013-12-16 NULL 2
1 2013-12-16 2013-12-24 NULL 3
2 2001-02-02 2001-02-02 0 0
2 2001-02-03 2001-02-05 NULL 1
2 2010-03-11 2010-03-15 0 0
2 2010-03-15 2010-03-23 NULL 1
2 2010-03-24 2010-03-26 NULL 2
I can't seem to think of any solutions, although I found a similar thread, but I need help translating it into what I'm trying to do. It has to use the First_Record column, as it has been set from specific conditions. Any help appreciated.
If you can have only one episode per person, you can just use row_number():
select t.*, row_number() over (partition by id order by startDate) - 1 as depth
from t;
Otherwise, you can calculate the episode grouping using a cumulative sum and then use that:
select t.*,
row_number() over (partition by id, grp order by startDate) - 1 as depth
from (select t.*,
count(first_record) over (partition by id order by startdate) as grp
from t
) t;
Now the depth will start from 0.
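To see why the cumulative count works as a group key, trace ID 2 (values derived by hand): count(first_record) only counts non-null values, so grp increments exactly on each episode's first row, and row_number restarts with it.
StartDate   First_Record  grp  Depth
2001-02-02  0             1    0
2001-02-03  NULL          1    1
2010-03-11  0             2    0
2010-03-15  NULL          2    1
2010-03-24  NULL          2    2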
SELECT t.*,
CONVERT(INT, ROW_NUMBER() OVER (PARTITION BY id ORDER BY startDate)) - 1 AS Depth
FROM t;