Select start and end dates for changing values in SQL

I have a database with accounts and their historical status changes:
select Date, Account, OldStatus, NewStatus from HistoricalCodes
order by Account, Date
Date        Account  OldStatus  NewStatus
2020-01-01  12345    1          2
2020-10-01  12345    2          3
2020-11-01  12345    3          2
2020-12-01  12345    2          1
2020-01-01  54321    2          3
2020-09-01  54321    3          2
2020-12-01  54321    2          3
For every account I need to determine the start date and end date of each period when Status = 2. An additional challenge is that the status can change back and forth multiple times. Is there a way in SQL to create something like this, at least for the first two timeframes when the account was in status 2? Any ideas?
Account  StartDt_1   EndDt_1     StartDt_2   EndDt_2
12345    2020-01-01  2020-10-01  2020-11-01  2020-12-01
54321    2020-09-01  2020-12-01

I would suggest putting this information in separate rows:
select t.*
from (select account, newstatus, date as startdate,
             lead(date) over (partition by account order by date) as enddate
      from HistoricalCodes
     ) t
where newstatus = 2;
This produces a separate row for each period when an account has a status of 2. This is better than putting the dates in separate pairs of columns, because you do not need to know the maximum number of periods of status = 2 when you write the query.
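For the sample data above, that query returns one row per status-2 period (worked out by hand from the rows shown, assuming the table is named HistoricalCodes as in the question):
account  startdate   enddate
12345    2020-01-01  2020-10-01
12345    2020-11-01  2020-12-01
54321    2020-09-01  2020-12-01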

For a fixed maximum of status changes per account, you can use window functions and conditional aggregation:
select account,
max(case when rn = 1 then date end) as start_dt1,
max(case when rn = 1 then lead_date end) as end_dt1,
max(case when rn = 2 then date end) as start_dt2,
max(case when rn = 2 then lead_date end) as end_dt2
from (
select t.*,
row_number() over(partition by account, newstatus order by date) as rn,
lead(date) over(partition by account order by date) as lead_date
from mytable t
) t
where newstatus = 2
group by account
You can extend the select clause with more conditional expressions to handle more possible ranges per account.
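For example, a third status-2 period per account could be captured by adding two more conditional aggregates, following the same pattern (a sketch, not tested):
select account,
       max(case when rn = 1 then date end) as start_dt1,
       max(case when rn = 1 then lead_date end) as end_dt1,
       max(case when rn = 2 then date end) as start_dt2,
       max(case when rn = 2 then lead_date end) as end_dt2,
       max(case when rn = 3 then date end) as start_dt3,
       max(case when rn = 3 then lead_date end) as end_dt3
from (
    select t.*,
           row_number() over(partition by account, newstatus order by date) as rn,
           lead(date) over(partition by account order by date) as lead_date
    from mytable t
) t
where newstatus = 2
group by account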

Related

Find top common value

I am trying to get the top common date, grouping by code and item. Is there any way I can achieve this in Snowflake?
My current table looks something like this. I need to extract the date that is available for every item within each code. For example, for code = 1, I only want date = 2022-03-01 because it's the only date that is common between items a, b, and c.
Code  Date        item
1     2022-01-01  a
1     2022-03-01  a
1     2022-01-01  b
1     2022-03-01  b
1     2022-03-01  c
1     2022-05-01  c
2     2022-01-01  a
2     2022-05-01  a
2     2022-01-01  b
2     2022-03-01  b
2     2022-01-01  c
My end result:
Code  Date        item
1     2022-03-01  a
1     2022-03-01  b
1     2022-03-01  c
2     2022-01-01  a
2     2022-01-01  b
2     2022-01-01  c
You may use the count window function to count the occurrences of each date within a code, then use the dense_rank function to get the rows whose date has the maximum count.
with count_dates as
(
select *,
count(*) over (partition by Code, Date) cn
from table_name
)
select Code, Date, item
from
(
select *,
dense_rank() over (partition by Code order by cn desc) rnk
from count_dates
) T
where rnk=1
order by Code
Using dense_rank() over (partition by Code order by cn desc) rnk will return all the common dates that share the maximum count value. If you want only the latest common date, use dense_rank() over (partition by Code order by cn desc, Date desc) rnk.
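A sketch of that variant (only the order by inside dense_rank changes; the rest of the query stays the same):
with count_dates as
(
select *,
       count(*) over (partition by Code, Date) cn
from table_name
)
select Code, Date, item
from
(
select *,
       dense_rank() over (partition by Code order by cn desc, Date desc) rnk
from count_dates
) T
where rnk = 1
order by Code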

Create sql Key based on datetime that is persistent overnight

I have a time series with a table like this
CarId  EventDateTime     Event  SessionFlag  ExpectedKey
1      2022-01-01 7:00   Start  1            1-20220101-7
1      2022-01-01 7:05   Drive  1            1-20220101-7
1      2022-01-01 8:00   Park   1            1-20220101-7
1      2022-01-01 10:00  Drive  1            1-20220101-7
1      2022-01-01 18:05  End    0            1-20220101-7
1      2022-01-01 23:00  Start  1            1-20220101-23
1      2022-01-01 23:05  Drive  1            1-20220101-23
1      2022-01-02 2:00   Park   1            1-20220101-23
1      2022-01-02 3:00   Drive  1            1-20220101-23
1      2022-01-02 15:00  End    0            1-20220101-23
1      2022-01-02 16:00  Start  1            1-20220102-16
Other CarIds do exist.
What I am attempting to do is create the last column, ExpectedKey.
The problem I face though is midnight, as the same session can exist over two days.
The record above with ExpectedKey 1-20220101-23 is the prime example of what I'm trying to achieve.
I've played with using:
CASE
    WHEN SessionFlag <> 0
         AND SessionFlag = LAG(SessionFlag) OVER (PARTITION BY CarId ORDER BY EventDateTime)
    THEN FIRST_VALUE(CarId + '-' + CONVERT(CHAR(8), EventDateTime, 112) + '-' + CAST(DATEPART(HOUR, EventDateTime) AS VARCHAR))
             OVER (PARTITION BY CarId ORDER BY EventDateTime)
    ELSE CarId + '-' + CONVERT(CHAR(8), EventDateTime, 112) + '-' + CAST(DATEPART(HOUR, EventDateTime) AS VARCHAR)
END AS SessionId
But I can't seem to make it partition correctly overnight.
Can anyone offer advice?
This is a classic gaps-and-islands problem. There are a number of solutions.
The simplest (if not that efficient) is partitioning over a windowed conditional count
WITH Groups AS (
    SELECT *,
           GroupId = COUNT(CASE WHEN t.Event = 'Start' THEN 1 END)
                         OVER (PARTITION BY t.CarId ORDER BY t.EventDateTime)
    FROM YourTable t
)
SELECT *,
       -- both the date and the hour are taken from the first (Start) row of each group,
       -- so the key stays constant for rows that fall after midnight
       NewKey = CONCAT_WS('-',
           t.CarId,
           FIRST_VALUE(CONVERT(varchar(8), t.EventDateTime, 112))
               OVER (PARTITION BY t.CarId, t.GroupId ORDER BY t.EventDateTime
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
           FIRST_VALUE(DATEPART(hour, t.EventDateTime))
               OVER (PARTITION BY t.CarId, t.GroupId ORDER BY t.EventDateTime
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
       )
FROM Groups t;
db<>fiddle
Using APPLY to get the Start event datetime and form the key with concat_ws:
select *
from time_series t
cross apply
(
select top 1
ExpectedKey = concat_ws('-',
CarId,
convert(varchar(10), EventDateTime, 112),
datepart(hour, EventDateTime))
from time_series x
where x.Event = 'Start'
and x.EventDateTime <= t.EventDateTime
order by x.EventDateTime desc
) k

Category Entry and Exit Dates per ID AND Category

I have the following table, where ID is the unique identifier. An ID can move from category to category, both up and down. My table records each day an ID stays in a given category. I am trying to identify the start date and the end date of an ID in a given category. The problem is that an ID can move up a category, and move back down to its original category after a certain number of days. Here is my table as an example with only 1 ID:
ID Category Date
1 1 2021-01-01
1 1 2021-01-02
...
1 1 2021-01-24
1 2 2021-01-25
...
1 2 2021-02-15
1 1 2021-02-16
...
1 1 2021-04-20
1 2 2021-04-21
When I try to get the MIN(DATE) and MAX(DATE) and group by category and ID, it shows me that the account was in Category 1 from 2021-01-01 to 2021-04-20, and in Category 2 from 2021-01-25 to 2021-04-21. I am trying to track the movements of the account through each category step by step, meaning that in my ideal result the movements of the account will be tracked as:
ID Category StartDate EndDate
1 1 2021-01-01 2021-01-24
1 2 2021-01-25 2021-02-15
1 1 2021-02-16 2021-04-20
1 2 2021-04-21 NULL (or GETDATE())
How can I achieve this result? Any help would be appreciated. I tried using the RANK() function but because the table records every single day, it seems useless.
This is a type of gaps-and-islands problem that is most easily solved using the difference of row numbers:
select id, category, min(date), max(date)
from (select t.*,
row_number() over (partition by id order by date) as seqnum,
row_number() over (partition by id, category order by date) as seqnum_2
from t
) t
group by id, category, (seqnum - seqnum_2);
Actually, the difference of row numbers is only the simplest approach because you have not specified the database. In most databases, you can just subtract a sequence of numbers from the date to get a constant that defines each group. That looks like:
select id, category, min(date), max(date)
from (select t.*,
row_number() over (partition by id, category order by date) as seqnum
from t
) t
group by id, category, date - seqnum * interval '1 day';
However, the date arithmetic varies by database.
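For example, in SQL Server the same grouping can be written with dateadd() instead of interval arithmetic (a sketch, not part of the original answer):
select id, category, min(date), max(date)
from (select t.*,
             row_number() over (partition by id, category order by date) as seqnum
      from t
     ) t
group by id, category, dateadd(day, -seqnum, date);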

Google Big Query - Calculating monthly totals by status based on multiple date conditionals

I have table with the following data:
customer_id subscription_id plan status trial_start trial_end activated_at cancelled_at
1 jg1 basic cancelled 2020-06-26 2020-07-14 2020-07-14 2020-09-25
2 ab1 basic cancelled 2020-08-10 2020-08-24 2020-08-24 2021-02-15
3 cf8 basic cancelled 2020-08-25 2020-09-04 2020-09-04 2020-10-24
4 bc2 basic active 2020-10-12 2020-10-26 2020-10-26
5 hg4 basic active 2021-01-09 2021-02-08 2021-02-08
6 cd5 basic in_trial 2021-02-26
As you can see from the table, status = in_trial while a subscription is in trial. When a subscription converts from in_trial to active, an activated_at date is recorded. When an in_trial or active subscription is cancelled, the status switches to cancelled and a cancelled_at date is recorded. The status column always shows only the most recent status of a subscription; a new row does not appear for every status change. Instead, the status value is updated and the appropriate date columns are populated to reflect when the status changed.
My goal is to calculate, month-over-month, how many subscriptions are in status = in_trial, how many are in status = active, and how many are in status = cancelled. Because the status column reflects only the most recent status of a subscription, the query has to determine how many subscriptions were in status = in_trial, status = active, and status = cancelled in each month based on the available date columns.
If a particular subscription had multiple statuses in a given month (for example, subscription_id = ab1 was in trial in Aug-2020 and also converted to active in Aug-2020), I want only the most recent status to be considered for that subscription. So, as an example, I want subscription_id = ab1 to be counted as an active subscription for the month of Aug-2020.
The output I am looking for is:
date in_trial active cancelled
2020-06-01 1 0 0
2020-07-01 0 1 0
2020-08-01 1 2 0
2020-09-01 0 2 1
2020-10-01 0 2 1
2020-11-01 0 2 0
2020-12-01 0 2 0
2021-01-01 1 2 0
2021-02-01 1 2 1
2021-03-01 1 2 0
Or, results can be displayed in a different format, as long as numbers are correct. Another example of output can be:
date status count
2020-06-01 in_trial 1
2020-06-01 active 0
2020-06-01 cancelled 0
2020-07-01 in_trial 0
2020-07-01 active 1
2020-07-01 cancelled 0
... ... ...
2021-03-01 in_trial 1
2021-03-01 active 2
2021-03-01 cancelled 0
Below is the query you can use to reproduce the example table provided in this question:
SELECT 1 AS customer_id, 'jg1' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-06-26' AS trial_start, '2020-07-14' AS trial_end, '2020-07-14' AS activated_at, '2020-09-25' AS cancelled_at UNION ALL
SELECT 2 AS customer_id, 'ab1' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-08-10' AS trial_start, '2020-08-24' AS trial_end, '2020-08-24' AS activated_at, '2021-02-15' AS cancelled_at UNION ALL
SELECT 3 AS customer_id, 'cf8' AS subscription_id, 'basic' AS plan, 'cancelled' AS status, '2020-08-25' AS trial_start, '2020-09-04' AS trial_end, '2020-09-04' AS activated_at, '2020-10-24' AS cancelled_at UNION ALL
SELECT 4 AS customer_id, 'bc2' AS subscription_id, 'basic' AS plan, 'active' AS status, '2020-10-12' AS trial_start, '2020-10-26' AS trial_end, '2020-10-26' AS activated_at, '' AS cancelled_at UNION ALL
SELECT 5 AS customer_id, 'hg4' AS subscription_id, 'basic' AS plan, 'active' AS status, '2021-01-09' AS trial_start, '2021-02-08' AS trial_end, '2021-02-08' AS activated_at, '' AS cancelled_at UNION ALL
SELECT 6 AS customer_id, 'cd5' AS subscription_id, 'basic' AS plan, 'in_trial' AS status, '2021-02-26' AS trial_start, '' AS trial_end, '' AS activated_at, '' AS cancelled_at
I have been working on this problem since yesterday morning and am still trying to figure out a way to do this efficiently. Thank you in advance for helping me solve this problem.
Below should work for you
select month,
count(distinct if(status = 0, customer_id, null)) in_trial,
count(distinct if(status = 1, customer_id, null)) active,
count(distinct if(status = 2, customer_id, null)) canceled
from (
select month, customer_id,
array_agg(status order by status desc limit 1)[offset(0)] status
from (
select distinct customer_id, 0 status, date_trunc(date, month) month
from `project.dataset.table`,
unnest(generate_date_array(date(trial_start), ifnull(date(trial_end), current_date()))) date
union all
select distinct customer_id, 1 status, date_trunc(date, month) month
from `project.dataset.table`,
unnest(generate_date_array(date(activated_at), ifnull(date(cancelled_at), current_date()))) date
union all
select distinct customer_id, 2 status, date_trunc(date(cancelled_at), month) month
from `project.dataset.table`
)
where not month is null
group by month, customer_id
)
group by month
# order by month
If applied to the sample data in your question, it produces the expected output.

Flagging active customers - At least one transaction every month

Once a customer is registered, if the customer has made at least one transaction every month between date_registered and the current date, flag the customer as active, otherwise flag it as inactive.
Note: every customer has a different date_registered.
I tried this, but it doesn't work since a few of the customers were onboarded in the middle of the year.
Eg -
-------------------------------------
txn_id | txn_date | name | amount
-------------------------------------
101 2018-05-01 ABC 100
102 2018-05-02 ABC 200
-------------------------------------
select name,
       (case when count(distinct case when txn_date >= '2018-05-01' and txn_date < '2019-06-01'
                                       then last_day(txn_date) end) = 13
             then 'active' else 'inactive'
        end) as flag
from t
group by name;
Final output
----------------
name | flag
----------------
ABC active
BCF inactive
You can use filtering on an aggregation query:
select customer,
count(distinct last_day(txn_date)) as num_months
from (select t.*, min(date_registered) over (partition by customer) as min_dr
from t
) t
group by customer, min_dr
having count(distinct last_day(txn_date)) = months_between(last_day(current_date), last_day(min_dr)) + 1;
Note: This may give unexpected results toward the beginning of a month, if customers do not all have transactions on the first day of the month.
EDIT:
If you want a flag, just move the HAVING logic to the SELECT:
select customer,
(case when count(distinct last_day(txn_date)) = months_between(last_day(current_date), last_day(min_dr)) + 1
then 'Active' else 'Inactive'
end) as active_flag
from (select t.*, min(date_registered) over (partition by customer) as min_dr
from t
) t
group by customer, min_dr;