BIGQUERY: How to query for a rolling monthly user active/churn - sql

So I have a website with news articles and I'm trying to calculate 4 user types for each month. The user types are:
1. New User: A user who registers (their first article view) in the current month and viewed an article in the current month.
2. Retained User: A New User from the previous month OR a user who viewed an article in the previous month and in the current month.
3. Churned User: A New User or Retained User from the previous month who has not viewed an article in the current month OR a Churned User from the previous month.
4. Resurrected User: A Churned User from the previous month who has viewed an article in the current month.
**User Table A - Unique User Article Views**
- Current month = 2019-04-01 00:00:00 UTC
| user_id | viewed_at |
------------------------------------------
| 4 | 2019-04-01 00:00:00 UTC |
| 3 | 2019-04-01 00:00:00 UTC |
| 2 | 2019-04-01 00:00:00 UTC |
| 1 | 2019-03-01 00:00:00 UTC |
| 3 | 2019-03-01 00:00:00 UTC |
| 2 | 2019-02-01 00:00:00 UTC |
| 1 | 2019-02-01 00:00:00 UTC |
| 1 | 2019-01-01 00:00:00 UTC |
The table above outlines the following user types:
2019-01-01
* User 1: New
2019-02-01
* User 1: Retained
* User 2: New
2019-03-01
* User 1: Retained
* User 2: Churned
* User 3: New
2019-04-01
* User 1: Churned
* User 2: Resurrected
* User 3: Retained
* User 4: New
My desired table COUNTS the distinct user_id for each user type in each month.
| month_viewed_at | ut_new | ut_retained | ut_churned | ut_resurrected
------------------------------------------------------------------------------------
| 2019-04-01 00:00:00 UTC | 1 | 1 | 1 | 1
| 2019-03-01 00:00:00 UTC | 1 | 1 | 1 | 0
| 2019-02-01 00:00:00 UTC | 1 | 1 | 0 | 0
| 2019-01-01 00:00:00 UTC | 1 | 0 | 0 | 0

I simply am not sure where to start
I hope you read all my comments and tried something yourself, but since I don't see any update, I assume you are still stuck, so here we go.
The query below is for BigQuery Standard SQL and should point you in the right direction:
#standardSQL
WITH temp1 AS (
  SELECT user_id,
    FORMAT_DATE('%Y-%m', DATE(viewed_at)) month_viewed_at,
    DATE_DIFF(DATE(viewed_at), '2000-01-01', MONTH) pos,
    DATE_DIFF(DATE(MIN(viewed_at) OVER(PARTITION BY user_id)), '2000-01-01', MONTH) first_pos
  FROM `project.dataset.table`
), temp2 AS (
  SELECT *, pos = first_pos AS new_user
  FROM temp1
), temp3 AS (
  SELECT *, LAST_VALUE(new_user) OVER(win) OR pos - 1 = LAST_VALUE(pos) OVER(win) AS retained_user
  FROM temp2
  WINDOW win AS (PARTITION BY user_id ORDER BY pos RANGE BETWEEN 1 PRECEDING AND 1 PRECEDING)
)
SELECT month_viewed_at,
  COUNTIF(new_user) AS new_users,
  COUNTIF(retained_user) AS retained_users
FROM temp3
GROUP BY month_viewed_at
-- ORDER BY month_viewed_at DESC
Applied to your sample data, the result is:
Row month_viewed_at new_users retained_users
1 2019-04 1 1
2 2019-03 1 1
3 2019-02 1 1
4 2019-01 1 0
In temp1 we prepare the data: viewed_at is formatted for presentation in the output, and it is also converted to a consecutive month number counted from an arbitrary date (2000-01-01), so that we can use analytic functions with RANGE as opposed to ROWS.
In temp2 we simply identify new users, and in temp3 the retained users.
I think this is a good start, so I am leaving the rest to you.
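For the two remaining columns (churned and resurrected), the definitions in the question can be read as a small per-user state walk over consecutive month numbers. Below is a minimal Python sketch of that reading, not BigQuery SQL; the names `month_pos` and `monthly_user_types` are hypothetical, and the rule "viewed this month but not last month, and not the first month, means resurrected" is my interpretation of the question's definitions.

```python
from collections import Counter, defaultdict
from datetime import date


def month_pos(d):
    """Consecutive month number, playing the role of `pos` in the query."""
    return d.year * 12 + d.month - 1


def monthly_user_types(views):
    """views: iterable of (user_id, date). Returns {'YYYY-MM': Counter}."""
    months_by_user = defaultdict(set)
    for user_id, viewed_at in views:
        months_by_user[user_id].add(month_pos(viewed_at))

    last_pos = max(max(months) for months in months_by_user.values())
    counts = defaultdict(Counter)
    for months in months_by_user.values():
        first_pos = min(months)
        # Walk every month from the user's first view to the latest month in the data.
        for pos in range(first_pos, last_pos + 1):
            if pos == first_pos:
                user_type = "new"
            elif pos in months:
                # Active now: retained if also active last month, else resurrected.
                user_type = "retained" if pos - 1 in months else "resurrected"
            else:
                user_type = "churned"
            label = f"{pos // 12}-{pos % 12 + 1:02d}"
            counts[label][user_type] += 1
    return counts


# Sample data from the question (user_id, viewed_at).
views = [
    (4, date(2019, 4, 1)), (3, date(2019, 4, 1)), (2, date(2019, 4, 1)),
    (1, date(2019, 3, 1)), (3, date(2019, 3, 1)),
    (2, date(2019, 2, 1)), (1, date(2019, 2, 1)),
    (1, date(2019, 1, 1)),
]
result = monthly_user_types(views)
for month in sorted(result, reverse=True):
    print(month, dict(result[month]))
```

In SQL the same idea would need one row per user per month (including months without views), which is why a calendar of months joined to the users is a common starting point for the churned count.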

Related

Calculating user retention on daily basis between the dates in SQL

I have a table that has the data about user_ids, all their last log_in dates to the app
Table:
|----------|--------------|
| User_Id | log_in_dates |
|----------|--------------|
| 1 | 2021-09-01 |
| 1 | 2021-09-03 |
| 2 | 2021-09-02 |
| 2 | 2021-09-04 |
| 3 | 2021-09-01 |
| 3 | 2021-09-02 |
| 3 | 2021-09-03 |
| 3 | 2021-09-04 |
| 4 | 2021-09-03 |
| 4 | 2021-09-04 |
| 5 | 2021-09-01 |
| 6 | 2021-09-01 |
| 6 | 2021-09-09 |
|----------|--------------|
From the above table, I'm trying to understand the user's log in behavior from the present day to the past 90 days.
Num_users_no_log_in defines the count for the number of users who haven't logged in to the app from present_day to the previous days (last_log_in_date)
I want the table like below:
|---------------|------------------|--------------------|-------------------------|
| present_date | days_difference | last_log_in_date | Num_users_no_log_in |
|---------------|------------------|--------------------|-------------------------|
| 2021-09-01 | 0 | 2021-09-01 | 0 |
| 2021-09-02 | 1 | 2021-09-01 | 3 |->(Id = 1,5,6)
| 2021-09-02 | 0 | 2021-09-02 | 3 |->(Id = 1,5,6)
| 2021-09-03 | 2 | 2021-09-01 | 2 |->(Id = 5,6)
| 2021-09-03 | 1 | 2021-09-02 | 1 |->(Id = 2)
| 2021-09-03 | 0 | 2021-09-03 | 3 |->(Id = 2,5,6)
| 2021-09-04 | 3 | 2021-09-01 | 2 |->(Id = 5,6)
| 2021-09-04 | 2 | 2021-09-02 | 0 |
| 2021-09-04 | 1 | 2021-09-03 | 1 |->(Id= 1)
| 2021-09-04 | 0 | 2021-09-04 | 3 |->(Id = 1,5,6)
| .... | .... | .... | ....
|---------------|------------------|--------------------|-------------------------|
I was able to get the first three columns Present_date | days_difference | last_log_in_date using the following query:
with dts as
(
select distinct log_in_dates from users_table
)
select x.log_in_dates as present_date,
DATEDIFF(DAY, y.log_in_dates, x.log_in_dates) as days_difference,
y.log_in_dates as last_log_in_date
from dts x, dts y
where x.log_in_dates >= y.log_in_dates
I don't understand how I can get the fourth column Num_users_no_log_in
I do not fully understand your need: are the values based on users or on dates? If they are based on dates, as it appears (otherwise you would probably have user_id as the first column), what does it mean to have the same date several times? I understand that you would like a recap of all dates from the beginning until the current date, but in my opinion that does not really make sense (imagine your dashboard in 1 year!).
That said, let's move on to the approach.
In such cases, I develop step by step using common table expressions. Your example requires 3 steps:
1. Prepare the time series.
2. Integrate the connection dates and perform the first calculation (time difference).
3. Calculate the number of connections per day.
The final query then displays the desired result.
Here is the query I propose, developed with PostgreSQL (you did not specify your DBMS, but converting it should not be a big deal):
with init_calendar as (
-- Prepare date series and count total users
select generate_series(min(log_in_dates), now(), interval '1 day') as present_date,
count(distinct user_id) as nb_users
from users
),
calendar as (
-- Add connections' dates for each period from the beginning to current date in calendar
-- and calculate nb days difference for each of them
-- Syntax may vary depending on the DBMS used
select distinct present_date, log_in_dates as last_date,
extract(day from present_date - log_in_dates) as days_difference,
nb_users
from init_calendar
join users on log_in_dates <= present_date
),
usr_con as (
-- Identify last user connection's dates according to running date
-- Tag the line to be counted as no connection
select c.present_date, c.last_date, c.days_difference, c.nb_users,
u.user_id, max(log_in_dates) as last_con,
case when max(log_in_dates) = present_date then 0 else 1 end as to_count
from calendar c
join users u on u.log_in_dates <= c.last_date
group by c.present_date, c.last_date, c.days_difference, c.nb_users, u.user_id
)
select present_date, last_date, days_difference,
nb_users - sum(to_count) as Num_users_no_log_in
from usr_con
group by present_date, last_date, days_difference, nb_users
order by present_date, last_date
Please note that the result differs from your expected output, because you left out user_id = 3 in your calculation.
If you want to play with the query, you can do so on dbfiddle.

Way to exclude all subsequent rows after an uncontinuous order

Story:
I am looking at continuous records based on a 1 month interval. As soon as this rule is broken, any subsequent rows should be excluded from the list. Even if the continuous rule reoccurs later in the future
Sample Data:
+----------------+---------+------------+
| date_purchased | product | date_rebill |
+----------------+---------+------------+
| 2019-01-01 | a | 2019-02-01 |
| 2019-01-01 | a | 2019-03-01 |
| 2019-01-01 | a | 2019-04-01 |
| 2019-01-01 | a | 2019-06-01 |
| 2019-01-01 | a | 2019-07-01 |
| 2019-01-01 | a | 2019-08-01 |
| 2019-02-01 | b | 2019-05-01 |
| 2019-02-01 | b | 2019-06-01 |
+----------------+---------+------------+
In this example May is missing for product A, therefore the June, July and August records should be excluded.
As for product B, there should be no records, or at least the rebill count should be 0. This is because the first rebill happens more than a month after the purchase date.
Query:
I started with something like that. Now I have '1' for consecutive months. The issue is that I can't filter the data set to diff = 1 due to consecutive rows happening after a break has happened.
select
date_purchased
,product
,datediff(month,previous_date,date_rebill) as diff
from (
select date_purchased
, product
, date_rebill
, lag(date_rebill,1,date_purchased)
over (partition by product order by date_rebill ASC) as previous_date
from table
) as base
My Objective:
My objective here is remove any future rows as soon as the "consecutiveness" rule is broken
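If I understand correctly, you can use row_number() and arithmetic: a row belongs to the unbroken run only while the number of months between date_purchased and date_rebill equals its row number within the product.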
select t.*
from (select t.*,
row_number() over (partition by product order by date_rebill) as seqnum
from t
) t
where datediff(month, date_purchased, date_rebill) = seqnum;
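To see why the row-number comparison drops everything after the first gap, here is a minimal Python sketch of the same arithmetic on the sample data. The helper names `months_between` and `unbroken_rebills` are hypothetical, and the sketch assumes at most one rebill per product per month, as in the sample.

```python
from datetime import date
from itertools import groupby


def months_between(start, end):
    """Number of month boundaries crossed, like datediff(month, start, end)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def unbroken_rebills(rows):
    """rows: (date_purchased, product, date_rebill). Keeps rows before the first gap."""
    kept = []
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: r[1]):
        # seqnum plays the role of row_number() over (partition by product order by date_rebill).
        for seqnum, row in enumerate(group, start=1):
            if months_between(row[0], row[2]) == seqnum:
                kept.append(row)
    return kept


rows = [
    (date(2019, 1, 1), "a", date(2019, 2, 1)),
    (date(2019, 1, 1), "a", date(2019, 3, 1)),
    (date(2019, 1, 1), "a", date(2019, 4, 1)),
    (date(2019, 1, 1), "a", date(2019, 6, 1)),
    (date(2019, 1, 1), "a", date(2019, 7, 1)),
    (date(2019, 1, 1), "a", date(2019, 8, 1)),
    (date(2019, 2, 1), "b", date(2019, 5, 1)),
    (date(2019, 2, 1), "b", date(2019, 6, 1)),
]
kept = unbroken_rebills(rows)
for row in kept:
    print(row)
```

Once a month is skipped, the month difference runs ahead of the row number and can never fall back, so later consecutive rows stay excluded.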

get data for first 3 months that have claims

I have a table of members and their claims value, I'm interested in getting the claims values for the first 3 months for each member. Here's what I've tried so far:
WITH START as
(SELECT [HEALTH_ID]
,MIN([CLM_MONTH]) as DOS
FROM [TEST]
GROUP BY
[HEALTH_ID])
SELECT HEALTH_ID
,DOS
,FORMAT(DATEADD(month, +1, DOS), 'MM/dd/yyyy')
,FORMAT(DATEADD(month, +2, DOS), 'MM/dd/yyyy')
FROM START
My plan is to get the dates of the first 3 months with claims, then join the claim amounts on ID and date. The problem is that not every member has claims in consecutive months, while DATEADD gives me consecutive months. For example, if a member has claims in Jan, Feb, April, May, etc., I'm interested in the claims for Jan, Feb and April, since there were no claims in March. Using DATEADD would give me Jan, Feb and March, excluding April.
In summary, I need help getting the first 3 months that have claims values(months may or may not be consecutive).
Use dense_rank() to rank the months, partitioned by Health_Id, then filter for the first three months of each Health_Id:
;with cte as (
select *
, dr = dense_rank() over (
partition by Health_ID
order by dateadd(month, datediff(month, 0, CLM_Month) , 0) /* truncate to month */
)
from test
)
select *
from cte
where dr < 4 -- dense rank of 1-3
test data:
create table test (health_id int, clm_month date)
insert into test values
(1,'20170101'),(1,'20170201'),(1,'20170301'),(1,'20170401')
,(2,'20170101'),(2,'20170201'),(2,'20170401'),(2,'20170501') -- no March
,(3,'20170101'),(3,'20170115'),(3,'20170201'),(3,'20170215') -- Multiple per month
,(3,'20170401'),(3,'20170415'),(3,'20170501'),(3,'20170515')
rextester demo: http://rextester.com/MTZ16877
returns:
+-----------+------------+----+
| health_id | clm_month | dr |
+-----------+------------+----+
| 1 | 2017-01-01 | 1 |
| 1 | 2017-02-01 | 2 |
| 1 | 2017-03-01 | 3 |
| 2 | 2017-01-01 | 1 |
| 2 | 2017-02-01 | 2 |
| 2 | 2017-04-01 | 3 |
| 3 | 2017-01-01 | 1 |
| 3 | 2017-01-15 | 1 |
| 3 | 2017-02-01 | 2 |
| 3 | 2017-02-15 | 2 |
| 3 | 2017-04-01 | 3 |
| 3 | 2017-04-15 | 3 |
+-----------+------------+----+
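The ranking step can also be checked outside the database. The following is a minimal Python sketch of the same dense-rank idea, using the test data above; `first_n_claim_months` is a hypothetical helper name, not part of the answer.

```python
from collections import defaultdict
from datetime import date


def first_n_claim_months(claims, n=3):
    """claims: list of (health_id, clm_date). Returns (health_id, clm_date, rank) rows."""
    months_by_member = defaultdict(set)
    for health_id, clm in claims:
        # Truncate each claim date to its (year, month).
        months_by_member[health_id].add((clm.year, clm.month))

    # Dense rank: distinct months numbered 1, 2, 3, ... per member, with no gaps.
    rank = {
        health_id: {m: i for i, m in enumerate(sorted(months), start=1)}
        for health_id, months in months_by_member.items()
    }
    rows = []
    for health_id, clm in sorted(claims):
        dr = rank[health_id][(clm.year, clm.month)]
        if dr <= n:
            rows.append((health_id, clm, dr))
    return rows


claims = [
    (1, date(2017, 1, 1)), (1, date(2017, 2, 1)), (1, date(2017, 3, 1)), (1, date(2017, 4, 1)),
    (2, date(2017, 1, 1)), (2, date(2017, 2, 1)), (2, date(2017, 4, 1)), (2, date(2017, 5, 1)),
    (3, date(2017, 1, 1)), (3, date(2017, 1, 15)), (3, date(2017, 2, 1)), (3, date(2017, 2, 15)),
    (3, date(2017, 4, 1)), (3, date(2017, 4, 15)), (3, date(2017, 5, 1)), (3, date(2017, 5, 15)),
]
result = first_n_claim_months(claims)
for row in result:
    print(row)
```

Because the rank is over distinct months rather than rows, several claims in the same month share one rank, and a missing month simply does not consume a rank.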

Can I put a condition on a window function in Redshift?

I have an events-based table in Redshift. I want to tie all events to the FIRST event in the series, provided that event was in the N-hours preceding this event.
If all I cared about was the very first row, I'd simply do:
SELECT
event_time
,first_value(event_time)
OVER (ORDER BY event_time rows unbounded preceding) as first_time
FROM
my_table
But because I only want to tie this to the first event in the past N-hours, I want something like:
SELECT
event_time
,first_value(event_time)
OVER (ORDER BY event_time rows between [N-hours ago] and current row) as first_time
FROM
my_table
A little background on my table: it holds user actions, so effectively a user jumps on, performs 1-100 actions, and then leaves. Most users visit 1-10 times per day. Sessions rarely last over an hour, so I could set N=1.
If I just set a PARTITION BY date_trunc('hour', event_time), I'll create two first events for any session that spans an hour boundary.
Assume my_table looks like
id | user_id | event_time
----------------------------------
1 | 123 | 2015-01-01 01:00:00
2 | 123 | 2015-01-01 01:15:00
3 | 123 | 2015-01-01 02:05:00
4 | 123 | 2015-01-01 13:10:00
5 | 123 | 2015-01-01 13:20:00
6 | 123 | 2015-01-01 13:30:00
My goal is to get a result that looks like
id | parent_id | user_id | event_time
----------------------------------
1 | 1 | 123 | 2015-01-01 01:00:00
2 | 1 | 123 | 2015-01-01 01:15:00
3 | 1 | 123 | 2015-01-01 02:05:00
4 | 4 | 123 | 2015-01-01 13:10:00
5 | 4 | 123 | 2015-01-01 13:20:00
6 | 4 | 123 | 2015-01-01 13:30:00
The answer appears to be "no" as of now.
There is a functionality in SQL Server of using RANGE instead of ROWS in the frame. This allows the query to compare values to the current row's value.
https://www.simple-talk.com/sql/learn-sql-server/window-functions-in-sql-server-part-2-the-frame/
When I attempt this syntax in Redshift I get the error that "Range is not yet supported"
Someone update this when that "yet" changes!
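For reference, the desired output above is consistent with a gap rule rather than a fixed look-back window: event 3 is 65 minutes after event 1 but only 50 minutes after event 2, and it still gets parent_id 1. Under that assumption (a new series starts whenever the gap to the user's previous event exceeds N hours), the logic can be sketched in Python as below. This is only an illustration outside the database, not Redshift syntax, and `assign_parents` is a hypothetical name.

```python
from datetime import datetime, timedelta


def assign_parents(events, max_gap=timedelta(hours=1)):
    """events: list of (id, user_id, event_time).

    Returns (id, parent_id, user_id, event_time) rows, where parent_id is the
    first event of the series the event belongs to.
    """
    last_time = {}  # user_id -> time of that user's previous event
    parent = {}     # user_id -> id of the first event in the current series
    out = []
    for ev_id, user_id, ts in sorted(events, key=lambda e: (e[1], e[2])):
        # Start a new series on the user's first event or after a long gap.
        if user_id not in last_time or ts - last_time[user_id] > max_gap:
            parent[user_id] = ev_id
        last_time[user_id] = ts
        out.append((ev_id, parent[user_id], user_id, ts))
    return out


events = [
    (1, 123, datetime(2015, 1, 1, 1, 0)),
    (2, 123, datetime(2015, 1, 1, 1, 15)),
    (3, 123, datetime(2015, 1, 1, 2, 5)),
    (4, 123, datetime(2015, 1, 1, 13, 10)),
    (5, 123, datetime(2015, 1, 1, 13, 20)),
    (6, 123, datetime(2015, 1, 1, 13, 30)),
]
result = assign_parents(events)
for row in result:
    print(row)
```

In SQL, the same gap rule is usually expressed with LAG to flag series starts followed by a running aggregate over those flags, neither of which needs a RANGE frame.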

Select entries where Date Difference not higher than 5 days

I am looking for a SQL statement that gives me all entries whose date is no more than 5 days apart from the date of another entry in this table.
Example:
ID | Date
1 | 16.10.14 00:00:00
2 | 14.10.14 00:00:00
3 | 09.09.14 00:00:00
4 | 13.10.14 00:00:00
5 | 06.07.14 00:00:00
6 | 09.01.14 00:00:00
7 | 10.01.14 00:00:00
8 | 14.01.14 00:00:00
Expected Output:
ID | Date
1 | 16.10.14 00:00:00
2 | 14.10.14 00:00:00
4 | 13.10.14 00:00:00
6 | 09.01.14 00:00:00
7 | 10.01.14 00:00:00
8 | 14.01.14 00:00:00
EDIT:
In fact, all I need is a way to compute a difference on the Date datatype. That's why I can't even show my attempts: I'm missing the keyword.
Never mind, I will still try.
It should be something like this:
select * from example m where m.Date not more apart than 5 days from another entry in the Table
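The - operator, when applied to two DATE values, returns their difference in days (this holds in Oracle and PostgreSQL, for example). So you can use the EXISTS operator to construct your query. Compare each row only with other rows (here via the ID column from your example); otherwise every row matches itself with a difference of 0:
SELECT *
FROM my_table o
WHERE EXISTS (SELECT *
FROM my_table i
WHERE i.id <> o.id
AND ABS (o.my_date - i.my_date) <= 5)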
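As a quick check of the logic outside the database, here is a minimal Python sketch of the same "exists another entry within 5 days" test. It assumes the dates are in dd.mm.yy format with year 2014, takes ID 8 as 14.01.14 (as in the expected output), and excludes each entry from matching itself; `near_another` is a hypothetical helper name.

```python
from datetime import date


def near_another(entries, max_days=5):
    """entries: list of (id, date). Keeps entries within max_days of a different entry."""
    return [
        (entry_id, d)
        for entry_id, d in entries
        if any(
            other_id != entry_id and abs((d - other).days) <= max_days
            for other_id, other in entries
        )
    ]


entries = [
    (1, date(2014, 10, 16)),
    (2, date(2014, 10, 14)),
    (3, date(2014, 9, 9)),
    (4, date(2014, 10, 13)),
    (5, date(2014, 7, 6)),
    (6, date(2014, 1, 9)),
    (7, date(2014, 1, 10)),
    (8, date(2014, 1, 14)),  # assumed, following the expected output
]
result = near_another(entries)
for row in result:
    print(row)
```

The pairwise comparison is quadratic in the number of rows, which is also what the EXISTS subquery does conceptually unless the database can use an index on the date column.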