SQL status changes with start and end dates

This is a table of user statuses over the period 9/1/2021 to 9/10/2021. A status of 1 means "active"; 0 means "canceled."
date       user  status
-----------------------
9/1/2021   1     1
9/1/2021   2     0
9/1/2021   3     1
9/2/2021   1     1
9/2/2021   2     1
9/2/2021   3     1
9/3/2021   1     0
9/3/2021   2     1
9/3/2021   3     1
9/4/2021   1     0
9/4/2021   2     1
9/4/2021   3     1
9/5/2021   1     0
9/5/2021   2     1
9/5/2021   3     0
9/6/2021   1     1
9/6/2021   2     1
9/6/2021   3     0
9/7/2021   1     1
9/7/2021   2     1
9/7/2021   3     0
9/8/2021   1     0
9/8/2021   2     1
9/8/2021   3     1
9/9/2021   1     0
9/9/2021   2     1
9/9/2021   3     1
9/10/2021  1     1
9/10/2021  2     0
9/10/2021  3     1
I want to get the start and end date for each user's active and canceled periods during this time. I know this involves a window function, but I can't quite figure out how to do it. This is my desired output:
user  status  start date  end date
----------------------------------
1     1       9/1/2021    9/2/2021
1     0       9/3/2021    9/5/2021
1     1       9/6/2021    9/7/2021
1     0       9/8/2021    9/9/2021
1     1       9/10/2021   9/10/2021
2     0       9/1/2021    9/1/2021
2     1       9/2/2021    9/9/2021
2     0       9/10/2021   9/10/2021
3     1       9/1/2021    9/4/2021
3     0       9/5/2021    9/7/2021
3     1       9/8/2021    9/10/2021

Updated
Here is an example fiddle. Updated query:
;with cte as (
    select *,
           rank() over (partition by usr, status order by dt) as rnk,
           -- lag over descending dates yields the next date for the user
           lag(dt, 1) over (partition by usr order by dt desc) as next_dt,
           row_number() over (partition by usr order by dt) as rnum,
           count(*) over (partition by usr, status) as cnt
    from table1
)
select usr, status, dt as start_date, next_dt as end_date
from cte

I was able to figure it out.
The key is to filter for rows where the current status does not equal the previous status; these are the dates on which a user's status changes.
Once you have only those rows, you can use the LEAD() window function and subtract one day to get the end date for each status period, falling back to the last date in the data for the final period.
with win as
(
    select
        usr
        , dt
        , lag(status) over (partition by usr order by dt) as prev_status
        , status
    from subs
)
select
    usr
    , status
    , dt as start_date
    -- the day before the next change, or the last date in the data for the final period
    , coalesce(lead(dt) over (partition by usr order by dt) - interval '1 day',
               (select max(dt) from win)) as end_date
from win
where
    status <> prev_status
    or prev_status is null
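
For comparison, here is a minimal alternative sketch using the difference-of-row-numbers (gaps-and-islands) trick, the same idea used in the Redshift answer further down. It assumes the same subs table with dt as a date column, and is untested:

with grp as
(
    select
        usr
        , dt
        , status
        -- rows in an unbroken run of the same status share this difference
        , row_number() over (partition by usr order by dt)
          - row_number() over (partition by usr, status order by dt) as g
    from subs
)
select
    usr
    , status
    , min(dt) as start_date
    , max(dt) as end_date
from grp
group by usr, status, g
order by usr, start_date

This version reports the last day each status was actually observed as end_date, which matches the desired output above without any date arithmetic.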

Related

Add a counting condition into dense_rank window Function SQL

I have a function that counts how many times you've visited and whether or not you have converted.
What I'd like is for the dense_rank to restart the count when there has been a conversion:
SELECT
uid,
channel,
time,
conversion,
dense_rank() OVER (PARTITION BY uid ORDER BY time asc) as visit_order
FROM table
In the current table output, this customer (uid) had a conversion at visit 18, and I would now want the visit_order count from dense_rank to restart at 0 for the same customer until it hits the next non-null conversion.
See this (I do not like "try this" 😉):
SELECT
  id,
  ts,
  conversion,
  ROW_NUMBER() OVER (PARTITION BY id, SC ORDER BY ts) AS R
FROM (
  SELECT
    id,
    ts,
    conversion,
    -- conversions are weighted 1000, ordinary visits 1; stripping the remainder
    -- mod 1000 leaves a value that jumps only on a conversion row, which serves
    -- as the group id (assumes fewer than 1000 visits between conversions)
    SUM(CASE WHEN conversion = 1 THEN 1000 ELSE 1 END) OVER (PARTITION BY id ORDER BY ts)
      - SUM(CASE WHEN conversion = 1 THEN 1000 ELSE 1 END) OVER (PARTITION BY id ORDER BY ts) % 1000 AS SC
  FROM sample
) x
ORDER BY ts;
DBFIDDLE
output:
id  ts                   conversion  R
--------------------------------------
1   2022-01-15 10:00:00  0           1
1   2022-01-16 10:00:00  0           2
1   2022-01-17 10:00:00  0           3
1   2022-01-18 10:00:00  1           1
1   2022-01-19 10:00:00  0           2
1   2022-01-20 10:00:00  0           3
1   2022-01-21 10:00:00  0           4
1   2022-01-22 10:00:00  0           5
1   2022-01-23 10:00:00  0           6
1   2022-01-24 10:00:00  0           7
1   2022-01-25 10:00:00  1           1
1   2022-01-26 10:00:00  0           2
1   2022-01-27 10:00:00  0           3
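
A more direct way to form the groups, shown as a sketch against the same sample table (assuming conversion is always 0 or 1, as in the data above): take a running sum of the conversion flag, so every conversion row opens a new group, then restart ROW_NUMBER() per group:

SELECT
  id,
  ts,
  conversion,
  ROW_NUMBER() OVER (PARTITION BY id, grp ORDER BY ts) AS R
FROM (
  SELECT
    id,
    ts,
    conversion,
    -- running count of conversions up to and including this row;
    -- each conversion row therefore starts a new group
    SUM(conversion) OVER (PARTITION BY id ORDER BY ts) AS grp
  FROM sample
) x
ORDER BY ts;

This produces the same R column as above without the weight-by-1000 trick, and it does not depend on how many visits fall between conversions.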

Time series group by day and kind

I create a table using the command below:
CREATE TABLE IF NOT EXISTS stats (
id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
session_kind INTEGER NOT NULL,
ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
)
I insert some time series data using the command below:
INSERT INTO stats (session_kind) values (?1)
After executing the insert command several times, I have some time series data as below:
id session_kind ts
-----------------------------------------
1 0 2020-04-18 12:59:51 // day 1
2 1 2020-04-19 12:59:52 // day 2
3 0 2020-04-19 12:59:53
4 1 2020-04-19 12:59:54
5 0 2020-04-19 12:59:55
6 2 2020-04-19 12:59:56
7 2 2020-04-19 12:59:57
8 2 2020-04-19 12:59:58
9 2 2020-04-19 12:59:59
10 0 2020-04-20 12:59:51 // day 3
11 1 2020-04-20 12:59:52
12 0 2020-04-20 12:59:53
13 1 2020-04-20 12:59:54
14 0 2020-04-20 12:59:55
15 2 2020-04-20 12:59:56
16 2 2020-04-20 12:59:57
17 2 2020-04-20 12:59:58
18 2 2020-04-21 12:59:59 // day 4
What I would like is a command that groups my data by date, from the most recent day to the least recent, with a count of each session_kind, like below (I don't want to pass any parameter to this command):
0 1 2 ts
-------------------------
0 0 1 2020-04-21 // day 4
3 2 3 2020-04-20 // day 3
2 2 4 2020-04-19 // day 2
1 0 0 2020-04-18 // day 1
How can I group my data as above?
You can do conditional aggregation:
select
    sum(session_kind = 0) session_kind_0,
    sum(session_kind = 1) session_kind_1,
    sum(session_kind = 2) session_kind_2,
    date(ts) ts_day
from mytable
group by date(ts)
order by ts_day desc
If you want something dynamic, then it might be simpler to put the results in rows rather than columns:
select date(ts) ts_day, session_kind, count(*) cnt
from mytable
group by date(ts), session_kind
order by ts_day desc, session_kind
If I understand correctly, you just want to sum the values:
select date(ts),
       sum(case when session_kind = 0 then 1 else 0 end) as cnt_0,
       sum(case when session_kind = 1 then 1 else 0 end) as cnt_1,
       sum(case when session_kind = 2 then 1 else 0 end) as cnt_2
from stats
group by date(ts);
You can also simplify this, because in SQLite a comparison evaluates to 1 or 0, so it can be summed directly:
select date(ts),
       sum(session_kind = 0) as cnt_0,
       sum(session_kind = 1) as cnt_1,
       sum(session_kind = 2) as cnt_2
from stats
group by date(ts);

How to count consecutive dates using Netezza

I need to count consecutive days in order to define my cohorts. I have a table that looks like:
pat_id admin_date
----------------------------
1 3/10/2019
1 3/11/2019
1 3/23/2019
1 3/24/2019
1 3/25/2019
2 12/26/2017
2 2/27/2019
2 3/16/2019
2 3/17/2019
I want output such as:
pat_id admin_date consecutive
--------------------------------------------
1 3/10/2019 1
1 3/11/2019 2
1 3/23/2019 1
1 3/24/2019 2
1 3/25/2019 3
2 12/26/2017 1
2 2/27/2019 1
2 3/16/2019 1
2 3/17/2019 2
so that I can use these consecutive-day values (per pat_id) to filter for my cohort. I've seen a few posts that suggest using DATEDIFF/DATEADD with row_number, such as:
datediff(day, -row_number() over (partition by mrn order by admin_date), admin_date)
but the DATEDIFF/DATEADD functions don't work on Netezza...
The closest I've gotten so far was:
select row_number() over (partition by mrn order by administration_date) as consecutive
which doesn't recognize gaps between dates and returns output like this:
pat_id admin_date consecutive
--------------------------------------------
1 3/10/2019 1
1 3/11/2019 2
1 3/23/2019 3
1 3/24/2019 4
1 3/25/2019 5
2 12/26/2017 1
2 2/27/2019 2
2 3/16/2019 3
2 3/17/2019 4
Does anyone know how to tackle this?
Use lag() to see where the groups start and a cumulative sum to define the group. The rest is just row_number():
select t.*,
       row_number() over (partition by pat_id, grp order by admin_date) as consecutive
from (select t.*,
             sum(case when prev_ad = admin_date - interval '1 day' then 0 else 1 end)
                 over (partition by pat_id order by admin_date) as grp
      from (select t.*,
                   lag(admin_date) over (partition by pat_id order by admin_date) as prev_ad
            from t
           ) t
     ) t;
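
Alternatively, the row_number() trick from those posts can be adapted to Netezza without DATEDIFF/DATEADD, since subtracting an integer from a date is plain date arithmetic there. A sketch, assuming admin_date is a DATE column in table t:

select pat_id,
       admin_date,
       row_number() over (partition by pat_id, grp order by admin_date) as consecutive
from (select t.*,
             -- consecutive dates yield the same date-minus-row-number value
             admin_date - row_number() over (partition by pat_id order by admin_date) as grp
      from t
     ) t;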

Sequence of Patterns within Date/time range

I have a problem I need help with.
In the example below, I want to find scenarios based on the status patterns within each id: 010 as scenario1, 000 as scenario2, and 111 as scenario3, ignoring records that don't follow any pattern.
Ex:
id date Status
1 2012-10-18 1
1 2012-10-19 1
1 2012-10-20 0
1 2012-10-21 0
1 2012-10-22 0
1 2012-10-23 0
1 2012-10-24 1
1 2012-10-25 0
1 2012-10-26 0
1 2012-10-27 0
1 2012-10-28 1
2 2012-10-19 0
2 2012-10-20 0
2 2012-10-21 0
2 2012-10-22 1
2 2012-10-23 1
scenario1:
1 2012-10-23 0
1 2012-10-24 1
1 2012-10-25 0
Scenario2:
1 2012-10-20 0
1 2012-10-21 0
1 2012-10-22 0
2 2012-10-19 0
2 2012-10-20 0
2 2012-10-21 0
Scenario3 - none (no records)
You can construct the patterns as strings and then use string comparison.
At least part of the trick is that you want all rows in the pattern, so you need to construct all potential patterns where each row might appear:
select t.*
from (select t.*,
             concat(lag(status, 2) over (partition by id order by date),
                    lag(status, 1) over (partition by id order by date),
                    status) as pat1,
             concat(lag(status, 1) over (partition by id order by date),
                    status,
                    lead(status, 1) over (partition by id order by date)) as pat2,
             concat(status,
                    lead(status, 1) over (partition by id order by date),
                    lead(status, 2) over (partition by id order by date)) as pat3
      from t
     ) t
where '010' in (pat1, pat2, pat3);
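The same query covers the other scenarios by swapping the pattern string, e.g. where '000' in (pat1, pat2, pat3) for scenario2 and '111' for scenario3. Note that at the edges of each id the lag()/lead() calls return NULL, so (in MySQL) concat() yields NULL there and those rows simply never match.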

window function in redshift

I have some data that looks like this:
CustID EventID TimeStamp
1 17 1/1/15 13:23
1 17 1/1/15 14:32
1 13 1/1/25 14:54
1 13 1/3/15 1:34
1 17 1/5/15 2:54
1 1 1/5/15 3:00
2 17 2/5/15 9:12
2 17 2/5/15 9:18
2 1 2/5/15 10:02
2 13 2/8/15 7:43
2 13 2/8/15 7:50
2 1 2/8/15 8:00
I'm trying to use the row_number function to get it to look like this:
CustID EventID TimeStamp SeqNum
1 17 1/1/15 13:23 1
1 17 1/1/15 14:32 1
1 13 1/1/25 14:54 2
1 13 1/3/15 1:34 2
1 17 1/5/15 2:54 3
1 1 1/5/15 3:00 4
2 17 2/5/15 9:12 1
2 17 2/5/15 9:18 1
2 1 2/5/15 10:02 2
2 13 2/8/15 7:43 3
2 13 2/8/15 7:50 3
2 1 2/8/15 8:00 4
I tried this:
row_number() over
    (partition by custID, EventID
     order by custID, TimeStamp asc) SeqNum
but got this back:
CustID EventID TimeStamp SeqNum
1 17 1/1/15 13:23 1
1 17 1/1/15 14:32 2
1 13 1/1/25 14:54 3
1 13 1/3/15 1:34 4
1 17 1/5/15 2:54 5
1 1 1/5/15 3:00 6
2 17 2/5/15 9:12 1
2 17 2/5/15 9:18 2
2 1 2/5/15 10:02 3
2 13 2/8/15 7:43 4
2 13 2/8/15 7:50 5
2 1 2/8/15 8:00 6
How can I get it to sequence based on changes in EventID?
This is tricky and needs a multi-step process: identify the groups (a difference of row_number() values works for this), assign an increasing constant to each group, and then use dense_rank():
select sd.*, dense_rank() over (partition by custid order by mints) as seqnum
from (select sd.*,
             min(timestamp) over (partition by custid, eventid, grp) as mints
      from (select sd.*,
                   (row_number() over (partition by custid order by timestamp) -
                    row_number() over (partition by custid, eventid order by timestamp)
                   ) as grp
            from somedata sd
           ) sd
     ) sd;
Another method is to use lag() and a cumulative sum:
select sd.*,
       sum(case when prev_eventid is null or prev_eventid <> eventid
                then 1 else 0
           end) over (partition by custid order by timestamp) as seqnum
from (select sd.*,
             lag(eventid) over (partition by custid order by timestamp) as prev_eventid
      from somedata sd
     ) sd;
EDIT:
The last time I used Amazon Redshift it didn't have row_number(). If that is still the case, count(*) with an explicit window frame can stand in for it:
select sd.*, dense_rank() over (partition by custid order by mints) as seqnum
from (select sd.*,
             min(timestamp) over (partition by custid, eventid, grp) as mints
      from (select sd.*,
                   (count(*) over (partition by custid order by timestamp
                                   rows between unbounded preceding and current row) -
                    count(*) over (partition by custid, eventid order by timestamp
                                   rows between unbounded preceding and current row)
                   ) as grp
            from somedata sd
           ) sd
     ) sd;
Try this code block:
WITH by_day AS (
    SELECT *,
           ts::date AS login_day
    FROM table_name
)
SELECT *,
       FIRST_VALUE(login_day) OVER (PARTITION BY userid ORDER BY login_day, userid
                                    ROWS UNBOUNDED PRECEDING) AS first_day
FROM by_day