SQL query using time series

I have the following table in BigQuery:
Timestamp variant_id activity
2020-04-02 08:50 1 active
2020-04-03 07:39 1 not_active
2020-04-04 07:40 1 active
2020-04-05 10:22 2 active
2020-04-07 07:59 2 not_active
I want to query this subset of data to get the number of active variants per day.
If variant_id 1 is active on 2020-04-04, it stays active on the following dates (2020-04-05, 2020-04-06, and so on) until a row with activity = not_active appears. The goal is to count, for each day, the number of variant_ids whose activity is active, taking into account that each variant_id keeps the value of its last activity up to that date.
For example, the desired result for this subset of data is:
Date activity_count
2020-04-02 1
2020-04-03 0
2020-04-04 1
2020-04-05 2
2020-04-06 2
2020-04-07 1
2020-04-08 1
2020-04-09 1
2020-04-10 1
Any help, please?

Consider the approach below:
select date, count(distinct if(activity = 'active', variant_id, null)) activity_count
from (
select date(timestamp) date, variant_id, activity,
lead(date(timestamp)) over(partition by variant_id order by timestamp) next_date
from your_table
), unnest(generate_date_array(date, ifnull(next_date - 1, '2020-04-10'))) date
group by date
Applied to the sample data in your question, the output matches the desired result listed there.
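The same idea (expand each status row into one row per day until the next status change, then count the active variants per day) can be tried outside BigQuery. Below is a minimal sketch using Python's standard sqlite3 module, assuming the bundled SQLite is 3.25 or newer so that window functions are available. SQLite has no generate_date_array, so a recursive CTE stands in for it here. The column name ts and the CTE names spans and days are illustrative assumptions, and the end date '2020-04-10' is hard-coded as in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE your_table (ts TEXT, variant_id INTEGER, activity TEXT);
INSERT INTO your_table VALUES
  ('2020-04-02 08:50', 1, 'active'),
  ('2020-04-03 07:39', 1, 'not_active'),
  ('2020-04-04 07:40', 1, 'active'),
  ('2020-04-05 10:22', 2, 'active'),
  ('2020-04-07 07:59', 2, 'not_active');
""")

query = """
WITH RECURSIVE
-- one row per status change, with the date of the next change for that variant
spans AS (
  SELECT date(ts) AS start_date, variant_id, activity,
         LEAD(date(ts)) OVER (PARTITION BY variant_id ORDER BY ts) AS next_date
  FROM your_table
),
-- expand every span into one row per day (stand-in for generate_date_array)
days(day, last_day, variant_id, activity) AS (
  SELECT start_date,
         COALESCE(date(next_date, '-1 day'), '2020-04-10'),
         variant_id, activity
  FROM spans
  UNION ALL
  SELECT date(day, '+1 day'), last_day, variant_id, activity
  FROM days
  WHERE day < last_day
)
SELECT day,
       COUNT(DISTINCT CASE WHEN activity = 'active' THEN variant_id END) AS activity_count
FROM days
WHERE day <= last_day  -- drops empty spans (two changes on the same day)
GROUP BY day
ORDER BY day
"""

rows = conn.execute(query).fetchall()
for day, activity_count in rows:
    print(day, activity_count)
```

On the five sample rows this prints one line per day from 2020-04-02 to 2020-04-10.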

Related

How to find the number of events for the first 24 hours for each user id

I'm working on Snowflake to solve a problem. I want to find the number of events in the first 24 hours for each user id.
This is a snippet of the database table I'm working on. I modified the table and used a date format without the time for simplification purposes.
user_id client_event_time
1 2022-07-28
1 2022-07-29
1 2022-08-21
2 2022-07-29
2 2022-07-30
2 2022-08-03
I used the following approach to find the minimum event time per user_id.
SELECT user_id, client_event_time,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY client_event_time) row_number,
MIN(client_event_time) OVER (PARTITION BY user_id) MinEventTime
FROM Data
ORDER BY user_id, client_event_time;
user_id client_event_time row_number MinEventTime
1 2022-07-28 1 2022-07-28
1 2022-07-29 2 2022-07-28
1 2022-08-21 3 2022-07-28
2 2022-07-29 1 2022-07-29
2 2022-07-30 2 2022-07-29
2 2022-08-03 3 2022-07-29
Then I tried to find the difference between the minimum event time and client_event_time, and if the difference is less than or equal to 24, I counted the client_event_time.
with NewTable as (
(SELECT user_id,client_event_time, event_type,
row_number() over (partition by user_id order by CLIENT_EVENT_TIME) row_number,
MIN(client_event_time) OVER (PARTITION BY user_id) MinEventTime
FROM Data
ORDER BY user_id, client_event_time))
SELECT user_id,
COUNT(case when timestampdiff(hh, client_event_time, MinEventTime) <= 24 then 1 else 0 end) AS duration
FROM NEWTABLE
GROUP BY user_id
I got the following result:
user_id duration
1 3
2 3
I wanted to find the following result:
user_id duration
1 2
2 2
Could you please help me solve this problem? Thanks!
This looks like a problem for windowed functions! I like them a lot.
Here's your sample data
DECLARE @table TABLE (user_id INT, client_event_time DATETIME)
INSERT INTO @table (user_id, client_event_time) VALUES
(1, '2022-07-28 13:30:00'),
(1, '2022-07-29 08:30:00'),
(1, '2022-08-21 12:34:56'),
(2, '2022-07-29 08:30:00'),
(2, '2022-07-30 13:30:00'),
(2, '2022-08-03 12:34:56')
I added some hours to it, so we can look at 24 hour windows more easily. For user_id 1 we can see they had 2 events within 24 hours of their first one (counting the first event itself). For user_id 2 there was only the first one. We can capture that with a MIN OVER, along with the actual datetimes.
SELECT user_id, MIN(client_event_time) OVER (PARTITION BY user_id) AS FirstEventDateTime, client_event_time
FROM @table
user_id FirstEventDateTime client_event_time
-------------------------------------------------------
1 2022-07-28 13:30:00.000 2022-07-28 13:30:00.000
1 2022-07-28 13:30:00.000 2022-07-29 08:30:00.000
1 2022-07-28 13:30:00.000 2022-08-21 12:34:56.000
2 2022-07-29 08:30:00.000 2022-07-29 08:30:00.000
2 2022-07-29 08:30:00.000 2022-07-30 13:30:00.000
2 2022-07-29 08:30:00.000 2022-08-03 12:34:56.000
Now that we have the first datetime and each row's datetime in the result set together, we can make a comparison:
SELECT user_id, MIN(client_event_time) OVER (PARTITION BY user_id) AS FirstEventDateTime, client_event_time, CASE WHEN DATEDIFF(HOUR,MIN(client_event_time) OVER (PARTITION BY user_id), client_event_time) < 24 THEN 1 ELSE 0 END AS EventsInFirst24Hours
FROM @table
user_id FirstEventDateTime client_event_time EventsInFirst24Hours
----------------------------------------------------------------------------
1 2022-07-28 13:30:00.000 2022-07-28 13:30:00.000 1
1 2022-07-28 13:30:00.000 2022-07-29 08:30:00.000 1
1 2022-07-28 13:30:00.000 2022-08-21 12:34:56.000 0
2 2022-07-29 08:30:00.000 2022-07-29 08:30:00.000 1
2 2022-07-29 08:30:00.000 2022-07-30 13:30:00.000 0
2 2022-07-29 08:30:00.000 2022-08-03 12:34:56.000 0
Now that we have an indicator telling us which events occurred in the first 24 hours, all we really need is to sum it. SQL Server is mean about using a windowed function inside another aggregate, though, so we need to cheat and put it into a subquery.
SELECT user_id, SUM(EventsInFirst24Hours) AS CountOfEventsInFirst24Hours
FROM (
SELECT user_id, MIN(client_event_time) OVER (PARTITION BY user_id) AS FirstEventDateTime, client_event_time, CASE WHEN DATEDIFF(HOUR,MIN(client_event_time) OVER (PARTITION BY user_id), client_event_time) < 24 THEN 1 ELSE 0 END AS EventsInFirst24Hours
FROM @table
) a
GROUP BY user_id
And that gets us to the result:
user_id CountOfEventsInFirst24Hours
-----------------------------------
1 2
2 1
A little about what's going on with the windowed function:
MIN - the aggregation we want it to do. The common aggregate functions have windowed counterparts.
(client_event_time) - the value we want to do it to.
OVER (PARTITION BY user_id) - the window we want to set up. In this case we want to know the minimum datetime for each of the user_ids.
We can partition by as many columns as we'd like.
You can also use an ORDER BY with as many columns as you'd like, but that was not necessary here. Ex:
OVER (PARTITION BY column1, column2 ORDER BY column4, column5 DESC)
Partition (or group by) column1 and column2 and order by column4 and column5 descending.
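To see the windowed MIN and the subquery sum working together, here is a minimal sketch that runs the same logic through Python's standard sqlite3 module (assuming SQLite 3.25 or newer for window functions). The table name events and the column alias in_first_24h are illustrative assumptions, and julianday() replaces SQL Server's DATEDIFF, so the comparison is on exact elapsed time rather than on hour boundaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, client_event_time TEXT);
INSERT INTO events VALUES
  (1, '2022-07-28 13:30:00'),
  (1, '2022-07-29 08:30:00'),
  (1, '2022-08-21 12:34:56'),
  (2, '2022-07-29 08:30:00'),
  (2, '2022-07-30 13:30:00'),
  (2, '2022-08-03 12:34:56');
""")

query = """
SELECT user_id, SUM(in_first_24h) AS events_in_first_24h
FROM (
  SELECT user_id,
         -- julianday() returns days, so a difference below 1.0 means under 24 hours
         CASE WHEN julianday(client_event_time)
                   - julianday(MIN(client_event_time) OVER (PARTITION BY user_id)) < 1.0
              THEN 1 ELSE 0 END AS in_first_24h
  FROM events
)
GROUP BY user_id
ORDER BY user_id
"""

rows = conn.execute(query).fetchall()
print(rows)
```

The inner query tags each row using the per-user minimum, and the outer query only has to sum the tags, which mirrors the subquery workaround described above.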
This is easier with QUALIFY:
with cte as
(select *
from mytable
qualify event_time<=min(event_time) over (partition by user_id) + interval '24 hours')
select user_id, count(*) as counts
from cte
group by user_id
If you want the count of events within 24 hours of the minimum event time, you can use a GROUP BY CTE that gives you the minimum event time for every user.
The rest is to get all the rows that fall within the time limit:
WITH min_data as
(SELECT user_id,MIN(client_event_time) mindate FROM data GROUP BY user_id)
SELECT d.user_id, COUNT(*)
FROM data d JOIN min_data md ON d.user_id = md.user_id WHERE client_event_time <= mindate + INTERVAL '24 hour'
GROUP BY d.user_id
ORDER BY d.user_id
user_id count
1 2
2 2

Row number with condition

I want to increase the row number of a partition based on a condition. This question refers to the same problem, but in my case, the column I want to condition on is another window function.
I want to identify the session number of each user (id) depending on how long ago their last recorded action (ts) was.
My table looks as follows:
id ts
1 2022-08-01 09:00:00 -- user 1, first session
1 2022-08-01 09:10:00
1 2022-08-01 09:12:00
1 2022-08-03 12:00:00 -- user 1, second session
1 2022-08-03 12:03:00
2 2022-08-01 11:04:00 -- user 2, first session
2 2022-08-01 11:07:00
2 2022-08-25 10:30:00 -- user 2, second session
2 2022-08-25 10:35:00
2 2022-08-25 10:36:00
I want to assign each user a session identifier based on the following conditions:
If the user's last action was 30 or more minutes ago (or doesn't exist), then increase (or initialize) the row number.
If the user's last action was less than 30 minutes ago, don't increase the row number.
I want to get the following result:
id ts session_id
1 2022-08-01 09:00:00 1
1 2022-08-01 09:10:00 1
1 2022-08-01 09:12:00 1
1 2022-08-03 12:00:00 2
1 2022-08-03 12:03:00 2
2 2022-08-01 11:04:00 1
2 2022-08-01 11:07:00 1
2 2022-08-25 10:30:00 2
2 2022-08-25 10:35:00 2
2 2022-08-25 10:36:00 2
If I had a separate column with the seconds since their last session, I could simply add 1 to each user's partitioned sum. However, this column is a window function itself. Hence, the following query doesn't work:
select
id
,ts
,extract(
epoch from (
ts - lag(ts, 1) over(partition by id order by ts)
)
) as seconds_since -- Number of seconds since last action (works well)
,sum(
case
when coalesce(
extract(
epoch from (
ts - lag(ts, 1) over (partition by id order by ts)
)
), 1800
) >= 1800 then 1
else 0 end
) over (partition by id order by ts) as session_id -- Window inside window (crashes)
from
t
order by
id
,ts
ERROR: Aggregate window functions with an ORDER BY clause require a frame clause
Use the LAG() window function to get the previous ts of each row and create a flag column indicating whether the difference between the two timestamps is greater than 30 minutes.
Then use SUM() window function over that flag:
SELECT
id
,ts
,SUM(flag) OVER (
PARTITION BY id
ORDER BY ts
rows unbounded preceding -- necessary in aws-redshift
) as session_id
FROM (
SELECT
*
,COALESCE((LAG(ts) OVER (PARTITION BY id ORDER BY ts) < ts - INTERVAL '30 minute')::int, 1) flag
FROM
tablename
) t
;
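As a way to check the flag-then-running-sum idea on the sample rows, here is a minimal sketch using Python's standard sqlite3 module (assuming SQLite 3.25 or newer for window functions). The table name t follows the question; the aliases prev_ts and flag are illustrative. The gap is compared in whole seconds with >= 1800, which follows the question's "30 or more minutes" rule rather than the strict comparison in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, ts TEXT);
INSERT INTO t VALUES
  (1, '2022-08-01 09:00:00'), (1, '2022-08-01 09:10:00'),
  (1, '2022-08-01 09:12:00'), (1, '2022-08-03 12:00:00'),
  (1, '2022-08-03 12:03:00'), (2, '2022-08-01 11:04:00'),
  (2, '2022-08-01 11:07:00'), (2, '2022-08-25 10:30:00'),
  (2, '2022-08-25 10:35:00'), (2, '2022-08-25 10:36:00');
""")

query = """
SELECT id, ts,
       -- running total of session starts = session number
       SUM(flag) OVER (
         PARTITION BY id ORDER BY ts
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS session_id
FROM (
  SELECT id, ts,
         -- flag = 1 when there is no previous action or the gap is 30+ minutes
         CASE WHEN prev_ts IS NULL
                OR CAST(strftime('%s', ts) AS INTEGER)
                   - CAST(strftime('%s', prev_ts) AS INTEGER) >= 1800
              THEN 1 ELSE 0 END AS flag
  FROM (
    SELECT id, ts,
           LAG(ts) OVER (PARTITION BY id ORDER BY ts) AS prev_ts
    FROM t
  )
)
ORDER BY id, ts
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Computing LAG() in its own subquery keeps each window function at a separate level, so no window function is nested inside another one.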

Counting subscriber numbers given events on SQL

I have a dataset in MySQL in the following format, showing the history of events for some client IDs:
Base Data
Text of the dataset (subscriber_table):
user_id type created_at
A past_due 2021-03-27 10:15:56
A reactivate 2021-02-06 10:21:35
A past_due 2021-01-27 10:30:41
A new 2020-10-28 18:53:07
A cancel 2020-07-22 9:48:54
A reactivate 2020-07-22 9:48:53
A cancel 2020-07-15 2:53:05
A new 2020-06-20 20:24:18
B reactivate 2020-06-14 10:57:50
B past_due 2020-06-14 10:33:21
B new 2020-06-11 10:21:24
date_table:
full_date
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2021-01-01
2021-02-01
2021-03-01
I have been struggling to come up with a query that counts subscribers over a range of months, some of which are not necessarily included in the event table, either because the client is still subscribed or because they cancelled and later resubscribed. The output I am looking for is this:
Output
date subscriber_count
2020-05-01 0
2020-06-01 2
2020-07-01 2
2020-08-01 1
2020-09-01 1
2020-10-01 2
2020-11-01 2
2020-12-01 2
2021-01-01 2
2021-02-01 2
2021-03-01 2
Reactivation and Past Due events do not change the subscription status of the client; only the Cancel and New events do. If the client cancels in a month, they should still be counted as active for that month.
My initial approach was to get the latest entry per month for each subscriber ID and then join those to the premade date table, but when months are missing I am unsure how to fill them with the correct status. Maybe a lag function?
with last_record_per_month as (
select
date_trunc('month', created_at)::date order by created_at) as month_year ,
user_id ,
type,
created_at as created_at
from
subscriber_table
where
user_id in ('A', 'B')
order by
created_at desc
), final as (
select
month_year,
created_at,
type
from
last_record_per_month lrpm
right join (
select
date_trunc('month', full_date)::date as month_year
from
date_table
where
full_date between '2020-05-01' and '2021-03-31'
group by
1
order by
1
) dd
on lrpm.created_at = dd.month_year
and num = 1
order by
month_year
)
select
*
from
final
I do have a premade base table with every single date in many years to use as a joining table
Any help with this is GREATLY appreciated
Thanks!
The approach here is to have the subscriber rows with new connections as base and map them to the cancelled rows using a self join. Then have the date tables as base and aggregate them based on the number of users to get the result.
SELECT full_date, COUNT(DISTINCT user_id) FROM date_tbl
LEFT JOIN(
SELECT new.user_id,new.type,new.created_at created_at_new,
IFNULL(cancel.created_at,CURRENT_DATE) created_at_cancel
FROM subscriber new
LEFT JOIN subscriber cancel
ON new.user_id=cancel.user_id
AND new.type='new' AND cancel.type='cancel'
AND new.created_at<= cancel.created_at
WHERE new.type IN('new'))s
ON DATE_FORMAT(s.created_at_new, '%Y-%m')<=DATE_FORMAT(full_date, '%Y-%m')
AND DATE_FORMAT(s.created_at_cancel, '%Y-%m')>=DATE_FORMAT(full_date, '%Y-%m')
GROUP BY 1
Let me break down some sections.
First, we self join the subscriber table on user_id, with the left table restricted to 'new' rows and the right one to 'cancel' rows: new.type='new' AND cancel.type='cancel'
A new row should always precede its cancel rows, hence the condition new.created_at<= cancel.created_at
Since we only care about the rows with 'new' in the base table, we keep only those rows in the WHERE clause: new.type IN('new').
We can then left join the date table to this subquery such that the year and month of created_at_new is less than or equal to that of full_date, DATE_FORMAT(s.created_at_new, '%Y-%m')<=DATE_FORMAT(full_date, '%Y-%m'), and the year and month of created_at_cancel is greater than or equal to it.
Lastly, we aggregate by full_date and count the distinct users.
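The steps above can be tried on the question's sample rows with a minimal sketch in Python's standard sqlite3 module. Several things here are assumptions made for SQLite: substr(..., 1, 7) stands in for DATE_FORMAT(..., '%Y-%m'), the far-future date '9999-12-31' stands in for CURRENT_DATE when there is no cancel row, single-digit hours in the sample are zero-padded, and the table names follow the question (subscriber_table, date_table) rather than the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriber_table (user_id TEXT, type TEXT, created_at TEXT);
INSERT INTO subscriber_table VALUES
  ('A', 'past_due',   '2021-03-27 10:15:56'),
  ('A', 'reactivate', '2021-02-06 10:21:35'),
  ('A', 'past_due',   '2021-01-27 10:30:41'),
  ('A', 'new',        '2020-10-28 18:53:07'),
  ('A', 'cancel',     '2020-07-22 09:48:54'),
  ('A', 'reactivate', '2020-07-22 09:48:53'),
  ('A', 'cancel',     '2020-07-15 02:53:05'),
  ('A', 'new',        '2020-06-20 20:24:18'),
  ('B', 'reactivate', '2020-06-14 10:57:50'),
  ('B', 'past_due',   '2020-06-14 10:33:21'),
  ('B', 'new',        '2020-06-11 10:21:24');
CREATE TABLE date_table (full_date TEXT);
INSERT INTO date_table VALUES
  ('2020-05-01'), ('2020-06-01'), ('2020-07-01'), ('2020-08-01'),
  ('2020-09-01'), ('2020-10-01'), ('2020-11-01'), ('2020-12-01'),
  ('2021-01-01'), ('2021-02-01'), ('2021-03-01');
""")

query = """
SELECT d.full_date, COUNT(DISTINCT s.user_id) AS subscriber_count
FROM date_table d
LEFT JOIN (
  -- pair every 'new' row with the 'cancel' rows that follow it, if any
  SELECT n.user_id,
         n.created_at AS created_at_new,
         COALESCE(c.created_at, '9999-12-31') AS created_at_cancel
  FROM subscriber_table n
  LEFT JOIN subscriber_table c
    ON c.user_id = n.user_id
   AND c.type = 'cancel'
   AND c.created_at >= n.created_at
  WHERE n.type = 'new'
) s
  -- compare on the 'YYYY-MM' prefix so the cancel month still counts as active
  ON substr(s.created_at_new, 1, 7) <= substr(d.full_date, 1, 7)
 AND substr(s.created_at_cancel, 1, 7) >= substr(d.full_date, 1, 7)
GROUP BY d.full_date
ORDER BY d.full_date
"""

rows = conn.execute(query).fetchall()
for full_date, subscriber_count in rows:
    print(full_date, subscriber_count)
```

Because the dates are stored as ISO-formatted text, plain string comparison orders them correctly, which is what makes the prefix comparison work in this sketch.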

Create interval from discrete dates

I have a function which saves the current status of several objects and writes it in a table, which looks like something like this:
ObjectId StatusId Date
1 10 2020-04-04 00:00:00.000
2 10 2020-04-04 00:00:00.000
1 11 2020-04-05 00:00:00.000
2 10 2020-04-05 00:00:00.000
1 10 2020-04-06 00:00:00.000
2 10 2020-04-06 00:00:00.000
I would like to make it an interval grouped by ObjectId and StatusId.
So for the above the preferred output would look like this:
ObjectId StatusId StartDate EndDate
1 10 2020-04-04 00:00:00.000 2020-04-04 00:00:00.000
1 11 2020-04-05 00:00:00.000 2020-04-05 00:00:00.000
1 10 2020-04-06 00:00:00.000 2020-04-06 00:00:00.000
2 10 2020-04-04 00:00:00.000 2020-04-06 00:00:00.000
Note one object can have the same status on multiple occasions but if it had a different status it needs to be in a separate interval. So simple group by and max(Date) doesn't work in my case.
Thanks in advance.
This is a form of gaps-and-islands. For this purpose, the difference of row numbers is probably the simplest method:
select objectid, statusid, min(date) as startdate, max(date) as enddate
from (select t.*,
row_number() over (partition by objectid order by date) as seqnum,
row_number() over (partition by objectid, statusid order by date) as seqnum_2
from t
) t
group by objectid, statusid, (seqnum - seqnum_2);
Why this works can be a little cumbersome to explain. However, if you look at the results of the subquery, you will see how the difference is constant for the groups you want to identify.
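To look at the subquery results and the final intervals on the question's sample rows, here is a minimal sketch using Python's standard sqlite3 module (assuming SQLite 3.25 or newer for window functions). The column name status_date and the alias grp are illustrative assumptions; the table name t follows the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (objectid INTEGER, statusid INTEGER, status_date TEXT);
INSERT INTO t VALUES
  (1, 10, '2020-04-04'), (2, 10, '2020-04-04'),
  (1, 11, '2020-04-05'), (2, 10, '2020-04-05'),
  (1, 10, '2020-04-06'), (2, 10, '2020-04-06');
""")

# Inner step: the difference of the two row numbers stays constant
# within each run of consecutive rows that share the same status.
inner = """
SELECT objectid, statusid, status_date,
       ROW_NUMBER() OVER (PARTITION BY objectid ORDER BY status_date)
     - ROW_NUMBER() OVER (PARTITION BY objectid, statusid ORDER BY status_date) AS grp
FROM t
"""
diffs = conn.execute(inner + " ORDER BY objectid, status_date").fetchall()
for row in diffs:
    print(row)

# Outer step: group by that difference to collapse each run into one interval.
islands = conn.execute(f"""
SELECT objectid, statusid, MIN(status_date) AS startdate, MAX(status_date) AS enddate
FROM ({inner})
GROUP BY objectid, statusid, grp
ORDER BY objectid, startdate
""").fetchall()
for row in islands:
    print(row)
```

For object 1 the difference goes 0, 1, 1 across statuses 10, 11, 10, so the two separate runs of status 10 land in different groups, while object 2 keeps a difference of 0 for all three days and collapses into a single interval.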

How many Days each item was in each State, the full value of the period

This post is really similar to my question:
SQL Server : how many days each item was in each state
but I don't have the column Revision to see which is the previous state, and also I want to get the full time of a status, I b
....
I want to get how long one item has been in one status in general. My table looks like this:
ID DATE STATUS
3D56B7B1-FCB3-4897-BAEB-004796E0DC8D 2016-04-05 11:30:00.000 1
3D56B7B1-FCB3-4897-BAEB-004796E0DC8D 2016-04-08 11:30:00.000 13
274C5DA9-9C38-4A54-A697-009933BB7B7F 2016-04-29 08:00:00.000 5
274C5DA9-9C38-4A54-A697-009933BB7B7F 2016-05-04 08:00:00.000 4
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2016-04-14 07:50:00.000 1
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2016-04-21 14:00:00.000 2
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2016-04-23 12:15:00.000 3
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2016-04-23 16:15:00.000 1
BF122AE1-CB39-4967-8F37-012DC55E92A7 2016-04-05 10:30:00.000 1
BF122AE1-CB39-4967-8F37-012DC55E92A7 2016-04-20 17:00:00.000 5
I want to get this
Column 1: ID, Column 2: Status, Column 3: Time with the status
Time with the status = NextDate - PreviousDate + 1
If it is the last status, it counts as 1.
If there is more than one status on the same day, I take the last one (only the last status of the day matters).
Per ID, Status must be unique.
It should look like this:
ID STATUS TIME
3D56B7B1-FCB3-4897-BAEB-004796E0DC8D 1 3
3D56B7B1-FCB3-4897-BAEB-004796E0DC8D 13 1
274C5DA9-9C38-4A54-A697-009933BB7B7F 5 5
274C5DA9-9C38-4A54-A697-009933BB7B7F 4 1
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 1 8
A70A66DC-9D9E-49BE-93CF-00F9E3E06CE2 2 2
BF122AE1-CB39-4967-8F37-012DC55E92A7 1 15
BF122AE1-CB39-4967-8F37-012DC55E92A7 5 1
Thanks to @ConradFrix's comments, this is how it works:
WITH CTE
AS
(
SELECT
ID,
STATUS,
DATE,
LEAD(DATE, 1) over (partition by ID order by DATE) LEAD,
ISNULL(DATEDIFF(DAYOFYEAR, DATE,
LEAD(DATE, 1) over (partition by ID order by DATE)), 1) DIF_BY_LEAD
FROM TABLE_NAME
)
SELECT ID, STATUS, SUM(DIF_BY_LEAD) AS TIME_STATUS
FROM CTE GROUP BY ID, STATUS
ORDER BY ID, STATUS