Resetting row number according to column value in T-SQL

I have the following data with a column indicating the first record within what we'll call an episode, though there is no episode ID. The ID column identifies an individual person.
ID StartDate EndDate First_Record
1 2013-11-30 2013-12-08 0
1 2013-12-08 2013-12-14 NULL
1 2013-12-14 2013-12-16 NULL
1 2013-12-16 2013-12-24 NULL
2 2001-02-02 2001-02-02 0
2 2001-02-03 2001-02-05 NULL
2 2010-03-11 2010-03-15 0
2 2010-03-15 2010-03-23 NULL
2 2010-03-24 2010-03-26 NULL
I am trying to add a row-number column (starting at 0) grouped by ID and ordered by start date, but the row number needs to reset whenever the First_Record column is not null. Hence the desired output column Depth.
ID StartDate EndDate First_Record Depth
1 2013-11-30 2013-12-08 0 0
1 2013-12-08 2013-12-14 NULL 1
1 2013-12-14 2013-12-16 NULL 2
1 2013-12-16 2013-12-24 NULL 3
2 2001-02-02 2001-02-02 0 0
2 2001-02-03 2001-02-05 NULL 1
2 2010-03-11 2010-03-15 0 0
2 2010-03-15 2010-03-23 NULL 1
2 2010-03-24 2010-03-26 NULL 2
I can't come up with a solution. I found a similar thread, but I need help translating it to my case. The solution has to use the First_Record column, since it was set from specific conditions. Any help appreciated.

If each person can have only one episode, you can just use row_number():
select t.*, row_number() over (partition by id order by startDate) - 1 as depth
from t;
Otherwise, you can calculate the episode grouping using a cumulative count of the non-null First_Record values and then use that:
select t.*,
       row_number() over (partition by id, grp order by startDate) - 1 as depth
from (select t.*,
             count(first_record) over (partition by id order by startdate) as grp
      from t
     ) t;
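To see why this works, note that count(first_record) counts only non-null values, so the running count increments exactly at each episode start. A hand trace of the query for ID 2 (my own illustration, not from the original answer):

StartDate   First_Record  grp  Depth
2001-02-02  0             1    0
2001-02-03  NULL          1    1
2010-03-11  0             2    0
2010-03-15  NULL          2    1
2010-03-24  NULL          2    2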

Now the depth will start from 0.
SELECT t.*,
       CONVERT(INT, ROW_NUMBER() OVER (PARTITION BY id ORDER BY startDate)) - 1 AS Depth
FROM t;

Related

Adding labels to row based on condition of prior/next row

I have sample data like the following in Snowflake. I'd like to assign groupings (without aggregation) based on grp_start -> grp_end: when grp_start = 1, I want to assign that row a label, and assign each sequential row the same label until grp_end equals 1. That constitutes a single group. The next group should then get a different label and follow the same logic.
Note: if a single row has grp_start = 1 and grp_end = 1, it should get its own group label as well, following the same pattern.
The data needs to be partitioned by id and ordered by start_time as well. Please see the sample data below and a mockup of the desired result. Ideally, this needs to scale to large amounts of data.
Current data:
create or replace temporary table grp_test (id char(4), start_time date, grp_start int, grp_end int)
as select * from values
('0001','2021-01-10',1,0),
('0001','2021-01-11',0,0),
('0001','2021-01-14',0,1),
('0001','2021-07-01',1,1),
('0001','2021-09-25',1,0),
('0001','2021-09-29',0,1),
('0002','2022-11-04',1,0),
('0002','2022-11-25',0,1);
select * from grp_test;
Desired result mockup:
create or replace temporary table desired_result (id char(4), start_time date, grp_start int, grp_end int, label int)
as select * from values
('0001','2021-01-10',1,0,0),
('0001','2021-01-11',0,0,0),
('0001','2021-01-14',0,1,0),
('0001','2021-07-01',1,1,1),
('0001','2021-09-25',1,0,2),
('0001','2021-09-29',0,1,2),
('0002','2022-11-04',1,0,0),
('0002','2022-11-25',0,1,0);
select * from desired_result;
So, changing the setup data to (note the added 2021-01-15 row):
create or replace temporary table grp_test as
select * from values
('0001','2021-01-10'::date,1,0),
('0001','2021-01-11'::date,0,0),
('0001','2021-01-14'::date,0,1),
('0001','2021-01-15'::date,0,0),
('0001','2021-07-01'::date,1,1),
('0001','2021-09-25'::date,1,0),
('0001','2021-09-29'::date,0,1),
('0002','2022-11-04'::date,1,0),
('0002','2022-11-25'::date,0,1)
t(id, start_time, grp_start, grp_end);
We can use two CONDITIONAL_TRUE_EVENTs; this allows us to know when we are past an end but before the next start, and thus can set the label to null for those rows.
select d.*
,CONDITIONAL_TRUE_EVENT(grp_start=1) over (partition by id order by start_time) as s_e
,CONDITIONAL_TRUE_EVENT(grp_end=1) over (partition by id order by start_time) as e_e
,iff(s_e != e_e OR grp_end = 1, s_e, null) as label
from grp_test as d
order by 1,2;
ID    START_TIME  GRP_START  GRP_END  S_E  E_E  LABEL
0001  2021-01-10  1          0        1    0    1
0001  2021-01-11  0          0        1    0    1
0001  2021-01-14  0          1        1    1    1
0001  2021-01-15  0          0        1    1    NULL
0001  2021-07-01  1          1        2    2    2
0001  2021-09-25  1          0        3    2    3
0001  2021-09-29  0          1        3    3    3
0002  2022-11-04  1          0        1    0    1
0002  2022-11-25  0          1        1    1    1
If you don't actually care about labelling rows after an end but before the next start, you can just use a single CONDITIONAL_TRUE_EVENT:
select d.*
,CONDITIONAL_TRUE_EVENT(grp_start=1) over (partition by id order by start_time) as label
from grp_test as d
order by 1,2;
ID    START_TIME  GRP_START  GRP_END  LABEL
0001  2021-01-10  1          0        1
0001  2021-01-11  0          0        1
0001  2021-01-14  0          1        1
0001  2021-01-15  0          0        1
0001  2021-07-01  1          1        2
0001  2021-09-25  1          0        3
0001  2021-09-29  0          1        3
0002  2022-11-04  1          0        1
0002  2022-11-25  0          1        1
Here's a solution that uses two nested window functions, max and dense_rank. Snowflake (as well as most other DBMSs) doesn't allow you to nest two window functions, so we'll process the first one in a subquery and the second one in the query itself.
The key to this method is to assign a common date-value to all members of the group, in this case the start date of the group, then dense_rank will give a 1 to all the records tied for first place, a 2 to the next group, etc. So we want the max(Start_Time) of the records with grp_start=1 at or before this time for every row in grp_test.
max(Case When grp_start=1 Then Start_Time End)
    Over (Partition By ID Order By Start_Time
          Rows Between Unbounded Preceding And Current Row) as grp_start_time
So put it all together with
Select ID, Start_Time, Grp_Start, Grp_End,
       dense_rank() Over (Partition By ID Order By grp_start_time) as label
From (
    Select ID, Start_Time, Grp_Start, Grp_End,
           max(Case When grp_start=1 Then Start_Time End)
               Over (Partition By ID Order By Start_Time
                     Rows Between Unbounded Preceding And Current Row) as grp_start_time
    From grp_test
)
Order by ID, Start_Time
METHOD 2
You can simplify this considerably if you are certain grp_start only ever contains zeros and ones. This one simply creates a running sum of grp_start:
Select ID, Start_Time, Grp_Start, Grp_End,
       sum(Grp_Start) Over (Partition By ID Order By Start_Time
                            Rows Between Unbounded Preceding And Current Row) as label
From grp_test
Order by ID, Start_Time
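A hand trace of this running sum for id 0001 in the modified setup data (my own illustration, not from the original answer):

START_TIME  GRP_START  LABEL
2021-01-10  1          1
2021-01-11  0          1
2021-01-14  0          1
2021-01-15  0          1
2021-07-01  1          2
2021-09-25  1          3
2021-09-29  0          3

Note that the labels start at 1 rather than 0, and the 2021-01-15 row (after an end, before the next start) inherits the previous group's label, the same caveat as the single-CONDITIONAL_TRUE_EVENT approach.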

Average and sort by this based on other conditional columns in a table

I have a table in SQL Server 2017 like below:
Name Rank1 Rank2 Rank3 Rank4
Jack null 1 1 3
Mark null 3 2 2
John null 2 3 1
What I need to do is to add an average rank column then rank those names based on those scores. We ignore null ranks. Expected output:
Name Rank1 Rank2 Rank3 Rank4 AvgRank FinalRank
Jack null 1 1 3 1.66 1
Mark null 3 2 2 2.33 3
John null 2 3 1 2 2
My query now looks like this:
;with cte as (
    select *, AvgRank = (Rank1+Rank2+Rank3+Rank4)/#NumOfRankedBy
    from mytable
)
select *, FinalRank = row_number() over (order by AvgRank)
from cte
I am stuck at finding the value of #NumOfRankedBy, which should be 3 in our case because Rank1 is null for all.
What is the best way to approach such an issue?
Thanks.
Your conundrum stems from the fact that your table is not normalised: you are treating data (the rank number) as structure (columns).
You should have a table of Ranks where each rank is a row; then your query is easy. For example, something like the sketch below.
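A minimal sketch of that normalised design (the table and column names here are hypothetical, not from the original post); since AVG ignores NULLs, the #NumOfRankedBy divisor disappears entirely:

-- Hypothetical layout: one row per (person, rank source);
-- simply omit rows where no rank was given
create table PersonRanks (Name varchar(50), RankNo int, RankValue int);

-- Average each person's ranks, then rank people by that average
select Name,
       avg(RankValue * 1.0) as AvgRank,
       row_number() over (order by avg(RankValue * 1.0)) as FinalRank
from PersonRanks
group by Name;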
Alternatively, you can unpivot your columns into rows and then make use of avg, which ignores NULLs, so the divisor is automatically the number of non-null ranks:
select *, FinalRank = row_number() over (order by AvgRank)
from mytable
cross apply (
    select avg(r * 1.0) as AvgRank
    from (values (rank1), (rank2), (rank3), (rank4)) r(r)
) r;

Getting count of last records of 2 columns SQL

I was looking for a solution for the scenario mentioned below.
My table structure is like this; table name: energy_readings
equipment_id  meter_id  readings  reading_date
1             1         100       01/01/2022
1             1         200       02/01/2022
1             1         null      03/01/2022
1             2         100       01/01/2022
1             2         null      04/01/2022
2             1         null      04/01/2022
2             1         399       05/01/2022
2             2         null      02/01/2022
From this, I want to get the number of nulls in the last record of each equipment_id and meter_id combination. (Only nulls in the last record of a given equipment_id and meter_id should be considered.)
Ex: here, the last reading for equipment 1, meter 1 is null, so it counts. The last reading (latest date) for equipment 1, meter 2 is also null, so it counts. But even though equipment 2, meter 1 has a null, it is not in the last record (latest date), so it should not be counted.
Thus, this should be the result:
equipment_id  Count
1             2
2             1
Hope I was clear with the question.
Thank you!
You can use a CTE like below. The CTE LatestRecord gets the latest record per equipment_id and meter_id. You can then join it back to your table and use WHERE to keep only the records with null readings.
;WITH LatestRecord AS (
    SELECT equipment_id, meter_id, MAX(reading_date) AS reading_date
    FROM energy_readings
    GROUP BY equipment_id, meter_id
)
SELECT er.equipment_id, COUNT(1) AS [Count]
FROM energy_readings er
JOIN LatestRecord lr
    ON lr.equipment_id = er.equipment_id
    AND lr.meter_id = er.meter_id
    AND lr.reading_date = er.reading_date
WHERE er.readings IS NULL
GROUP BY er.equipment_id
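To sanity-check this (my own trace, not part of the original answer), LatestRecord evaluates to:

equipment_id  meter_id  reading_date
1             1         03/01/2022
1             2         04/01/2022
2             1         05/01/2022
2             2         02/01/2022

Of these latest rows, (1,1), (1,2) and (2,2) have null readings, so grouping by equipment_id gives 1 -> 2 and 2 -> 1, matching the desired result.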
with records as (
    select equipment_id, meter_id, reading_date, readings,
           -- latest reading_date per equipment/meter pair gets rank 1
           rank() over (partition by equipment_id, meter_id
                        order by reading_date desc) as rn
    from records
)
select equipment_id, count(*) as "Count"
from records
where rn = 1            -- only the last record of each pair
  and readings is null  -- that is null
group by equipment_id
Explanation:
records ranks each equipment/meter pair's rows by reading_date descending, so the latest record gets rank 1.
The outer query keeps only those latest records, filters to the ones where readings is null, and counts them per equipment_id.

Retrieving last record in each group from database with order by

There is a table ticket that contains data as shown below:
Id Impact group create_date
------------------------------------------
1 3 ABC 2020-07-28 00:42:00.0
1 2 ABC 2020-07-28 00:45:00.0
1 3 ABC 2020-07-28 00:48:00.0
1 3 ABC 2020-07-28 00:52:00.0
1 3 XYZ 2020-07-28 00:55:00.0
1 3 XYZ 2020-07-28 00:59:00.0
Expected result:
Id Impact group create_date
------------------------------------------
1 3 ABC 2020-07-28 00:42:00.0
1 2 ABC 2020-07-28 00:45:00.0
1 3 ABC 2020-07-28 00:52:00.0
1 3 XYZ 2020-07-28 00:59:00.0
At present, this is the query that I use:
WITH final AS (
    SELECT p.*,
           ROW_NUMBER() OVER (PARTITION BY p.id, p.group, p.impact
                              ORDER BY p.create_date DESC, p.impact) AS rk
    FROM ticket p
)
SELECT f.*
FROM final f
WHERE f.rk = 1
The result I am getting is:
Id Impact group create_date
-----------------------------------------
1 2 ABC 2020-07-28 00:45:00.0
1 3 ABC 2020-07-28 00:52:00.0
1 3 XYZ 2020-07-28 00:59:00.0
It seems that the partition by takes precedence over the order by values. Is there another way to achieve the expected result? I am running these queries on Amazon Redshift.
You could use LEAD() to check if the impact changes between rows, taking only the rows where the value will change (or where there is no following row).
WITH look_forward AS (
    SELECT *,
           LEAD(impact) OVER (PARTITION BY id, group ORDER BY create_date) AS lead_impact
    FROM ticket
)
SELECT *
FROM look_forward
WHERE lead_impact IS NULL
   OR lead_impact <> impact
You seem to want rows where id/impact/group change relative to the next row. A simple way is to compare the next create_date overall with the next create_date within the group: if they are the same, the current row is not the last of its run, so filter it out:
select t.*
from (select t.*,
lead(create_date) over (order by create_date) as next_create_date,
lead(create_date) over (partition by id, impact, group order by create_date) as next_create_date_img
from ticket t
) t
where next_create_date_img is null or next_create_date_img <> next_create_date;
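A quick hand trace for group ABC (my own illustration, not output from the original answer):

create_date  impact  next_create_date  next_create_date_img  kept?
00:42        3       00:45             00:48                 yes (values differ)
00:45        2       00:48             NULL                  yes (null)
00:48        3       00:52             00:52                 no (equal)
00:52        3       00:55             NULL                  yes (null)

The surviving ABC rows are 00:42, 00:45 and 00:52, matching the expected result.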

Count how many times a value appears continuously in Hive/SQL

I've got 3 columns in my table, and I'd like to count, for each userid ordered by time, how many times the value B appears consecutively, i.e. the length of the longest run of the same value. For example, the data below
time userid value
2016-01-01 1 A
2016-01-02 1 B
2016-01-03 1 B
2016-01-04 2 C
2016-01-05 2 B
2016-01-06 2 B
2016-01-07 2 B
2016-01-08 2 C
2016-01-09 2 B
would return
userid times
1 2
2 3
Is this even possible without a user-defined function in Hive? I've dug a bit into LAG and LEAD, but couldn't find a way. :(
select value,
       userid,
       max(times) as times
from (select value,
             userid,
             count(*) as times
      from (select value,
                   userid,
                   -- position within the user's full timeline
                   row_number() over (partition by userid
                                      order by time) as rn,
                   -- position within the user's rows for this value
                   row_number() over (partition by userid, value
                                      order by time) as rn_val
            from t
            -- where value = 'B'
           ) t
      -- rn - rn_val is constant across a run of consecutive rows with
      -- the same value, so grouping by it isolates each run
      group by value, userid, rn - rn_val
     ) t
group by value, userid
order by value, userid;
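To see how the rn - rn_val trick isolates runs, here is a hand trace for userid 2 (my own illustration, not part of the original answer):

time        value  rn  rn_val  rn - rn_val
2016-01-04  C      1   1       0
2016-01-05  B      2   1       1
2016-01-06  B      3   2       1
2016-01-07  B      4   3       1
2016-01-08  C      5   2       3
2016-01-09  B      6   4       2

The three consecutive B rows share rn - rn_val = 1, so they form one group with count(*) = 3, which is the longest B run for userid 2.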