How to calculate retention month over month using SQL

Trying to get a basic table that shows retention from one month to the next. So if someone buys something last month and they do so the next month it gets counted.
month, num_transactions, repeat_transactions, retention
2012-02, 5, 2, 40%
2012-03, 10, 3, 30%
2012-04, 15, 8, 53%
So if everyone that bought last month bought again the following month you have 100%.
So far I can only calculate stuff manually. This gives me the rows that have been seen in both months:
select count(*) as num_repeat_buyers from
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id and
to_char(transaction.timestamp, 'YYYY-MM') = '2012-03'
) as table1,
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id and
to_char(transaction.timestamp, 'YYYY-MM') = '2012-04'
) as table2
where table1.email = table2.email
This is not right, but I feel like I can use some of Postgres' window functions. Keep in mind that window functions don't let you specify WHERE clauses. You mostly have access to the preceding rows and the following rows:
select month, count(*) as num_transactions, count(*) over (PARTITION BY month ORDER BY month)
from
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id
order by
month
) as transactions_by_month
group by
month

Given the following test table (which you should have provided):
CREATE TEMP TABLE transaction (buyer_id int, tstamp timestamp);
INSERT INTO transaction VALUES
(1,'2012-01-03 20:00')
,(1,'2012-01-05 20:00')
,(1,'2012-01-07 20:00') -- multiple transactions this month
,(1,'2012-02-03 20:00') -- next month
,(1,'2012-03-05 20:00') -- next month
,(2,'2012-01-07 20:00')
,(2,'2012-03-07 20:00') -- not next month
,(3,'2012-01-07 20:00') -- just once
,(4,'2012-02-07 20:00'); -- just once
Table auth_user is not relevant to the problem.
Using tstamp as column name since I don't use base types as identifiers.
I am going to use the window function lag() to identify repeated buyers. To keep it short I combine aggregate and window functions in one query level. Bear in mind that window functions are applied after aggregate functions.
WITH t AS (
SELECT buyer_id
,date_trunc('month', tstamp) AS month
,count(*) AS item_transactions
,lag(date_trunc('month', tstamp)) OVER (PARTITION BY buyer_id
ORDER BY date_trunc('month', tstamp))
= date_trunc('month', tstamp) - interval '1 month'
OR NULL AS repeat_transaction
FROM transaction
WHERE tstamp >= '2012-01-01'::date
AND tstamp < '2012-05-01'::date -- time range of interest.
GROUP BY 1, 2
)
SELECT month
,sum(item_transactions) AS num_trans
,count(*) AS num_buyers
,count(repeat_transaction) AS repeat_buyers
,round(
CASE WHEN sum(item_transactions) > 0
THEN count(repeat_transaction) / sum(item_transactions) * 100
ELSE 0
END, 2) AS buyer_retention_pct
FROM t
GROUP BY 1
ORDER BY 1;
Result:
month | num_trans | num_buyers | repeat_buyers | buyer_retention_pct
---------+-----------+------------+---------------+--------------------
2012-01 | 5 | 3 | 0 | 0.00
2012-02 | 2 | 2 | 1 | 50.00
2012-03 | 2 | 2 | 1 | 50.00
I extended your question to provide for the difference between the number of transactions and the number of buyers.
The OR NULL for repeat_transaction serves to convert FALSE to NULL, so those values do not get counted by count() in the next step.
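To make the logic of the query easier to check, here is a minimal Python sketch of the same two steps, run against the test rows above. It is not a replacement for the SQL. The helper names (month_index, month_label, retention_by_month) are made up for illustration, and "the previous month for this buyer is exactly one month earlier" is expressed as a dictionary lookup instead of lag().

```python
from collections import defaultdict

# Test rows from the answer above: (buyer_id, date part of tstamp).
rows = [
    (1, "2012-01-03"), (1, "2012-01-05"), (1, "2012-01-07"),
    (1, "2012-02-03"), (1, "2012-03-05"),
    (2, "2012-01-07"), (2, "2012-03-07"),
    (3, "2012-01-07"),
    (4, "2012-02-07"),
]

def month_index(day):
    """Map 'YYYY-MM-DD' to a running month number; consecutive months differ by 1."""
    return int(day[:4]) * 12 + int(day[5:7]) - 1

def month_label(idx):
    """Inverse of month_index, formatted as 'YYYY-MM'."""
    return f"{idx // 12:04d}-{idx % 12 + 1:02d}"

def retention_by_month(rows):
    # Step 1 (the CTE): number of transactions per (buyer, month).
    per_buyer_month = defaultdict(int)
    for buyer_id, day in rows:
        per_buyer_month[(buyer_id, month_index(day))] += 1

    # Step 2 (the outer query): aggregate per month. A buyer counts as a
    # repeat buyer if the same buyer also has a row for the month before.
    result = {}
    for (buyer_id, m), n in per_buyer_month.items():
        stats = result.setdefault(
            m, {"num_trans": 0, "num_buyers": 0, "repeat_buyers": 0})
        stats["num_trans"] += n
        stats["num_buyers"] += 1
        if (buyer_id, m - 1) in per_buyer_month:
            stats["repeat_buyers"] += 1
    return {month_label(m): s for m, s in sorted(result.items())}

summary = retention_by_month(rows)
for month, stats in summary.items():
    print(month, stats)
```

The three counts per month agree with the num_trans, num_buyers and repeat_buyers columns of the result table above.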

This uses CASE and EXISTS to get repeated transactions:
SELECT
*,
CASE
WHEN num_transactions = 0
THEN 0
ELSE round(100.0 * repeat_transactions / num_transactions, 2)
END AS retention
FROM
(
SELECT
to_char(timestamp, 'YYYY-MM') AS month,
count(*) AS num_transactions,
sum(CASE
WHEN EXISTS (
SELECT 1
FROM transaction AS t
JOIN auth_user AS u
ON t.buyer_id = u.id
WHERE
date_trunc('month', transaction.timestamp)
+ interval '1 month'
= date_trunc('month', t.timestamp)
AND auth_user.email = u.email
)
THEN 1
ELSE 0
END) AS repeat_transactions
FROM
transaction
JOIN auth_user
ON transaction.buyer_id = auth_user.id
GROUP BY 1
) AS summary
ORDER BY 1;
EDIT: Changed from minus 1 month to plus 1 month after reading the question again. My understanding now is that if someone buys something in 2012-02 and then buys something again in 2012-03, his or her transactions in 2012-02 count toward retention for that month.

Related

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
Edit: Data collection begins on 11/23. No data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count (say, the current week and the three preceding weeks), you can use window functions:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt,
sum(count(distinct customer_id)) over(
order by date_trunc('week', event_date)
range between interval '3 week' preceding and current row
) as rolling_cnt
from final
group by 1
Rolling distinct counts are quite difficult in RedShift. One method is a self-join and aggregation:
select t.date,
count(distinct case when tprev.date >= t.date - interval '6 day' then tprev.customer_id end) as trailing_7,
count(distinct tprev.customer_id) as trailing_30
from t join
t tprev
on tprev.date >= t.date - interval '29 day' and
tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
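As a reference for what the self-join is meant to compute, here is a small Python sketch of a trailing distinct count evaluated only at week-end dates. The event data and the function name trailing_distinct are hypothetical; the window is assumed to include both the as-of date and the day (days - 1) before it, matching the '6 day' and '29 day' intervals above.

```python
from datetime import date, timedelta

def trailing_distinct(events, as_of, days):
    """Distinct customers with an event in the `days`-day window ending on as_of."""
    start = as_of - timedelta(days=days - 1)
    return len({customer for customer, day in events if start <= day <= as_of})

# Hypothetical click events: (customer_id, event_date).
events = [
    ("a", date(2020, 11, 25)),
    ("a", date(2020, 12, 18)),
    ("b", date(2020, 12, 1)),
    ("c", date(2020, 12, 30)),
]

# One row per week end: (trailing 7-day count, trailing 30-day count).
week_ends = [date(2020, 12, 20), date(2020, 12, 27), date(2021, 1, 3)]
weekly = {
    end: (trailing_distinct(events, end, 7), trailing_distinct(events, end, 30))
    for end in week_ends
}
for end, counts in weekly.items():
    print(end, counts)
```

A customer who clicks twice inside one window is counted once, and a customer whose clicks fall into two different windows is counted once in each, which is the uniqueness rule described in the question.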
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and leave the periods in which they are counted. This is a pain with two different time frames. Here is what it looks like for one.
The idea is to:
1. Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
2. Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
3. Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
4. Do a cumulative sum of this count.
The code looks something like this:
with cd as (
select customer_id, date,
lead(date) over (partition by customer_id order by date) as next_date,
sum(sum(inc)) over (partition by customer_id order by date) as cnt
from ((select t.customer_id, t.date, 1 as inc
from t
) union all
(select t.customer_id, t.date + interval '7 day', -1
from t
)
) tt
group by customer_id, date
),
cd2 as (
select customer_id, min(date) as enter_date, max(date) as exit_date
from (select cd.*,
sum(case when prev_cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
from (select cd.*,
lag(cnt) over (partition by customer_id order by date) as prev_cnt
from cd
) cd
) cd
group by customer_id, grp
having max(cnt) > 0
)
select dte, sum(sum(inc)) over (order by dte)
from ((select customer_id, enter_date as dte, 1 as inc
from cd2
) union all
(select customer_id, exit_date as dte, -1 as inc
from cd2
)
) cd2
group by dte;
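The enter/exit idea described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and sample events are hypothetical, the exit date is treated as exclusive (the first day the customer is no longer counted), and an event falling exactly on an exit date extends the existing period.

```python
from datetime import date, timedelta
from itertools import accumulate

def active_periods(dates, window_days):
    """Merge one customer's event dates into (enter, exit) periods."""
    periods = []
    for d in sorted(dates):
        exit_ = d + timedelta(days=window_days)
        if periods and d <= periods[-1][1]:
            # Still inside the current island: push the exit date out.
            periods[-1][1] = max(periods[-1][1], exit_)
        else:
            periods.append([d, exit_])
    return [tuple(p) for p in periods]

def rolling_distinct(events, window_days):
    """Map each date on which the count changes to the number of counted customers."""
    by_customer = {}
    for customer, d in events:
        by_customer.setdefault(customer, []).append(d)

    # Unpivot: +1 when a customer enters a period, -1 when the period ends.
    deltas = {}
    for dates in by_customer.values():
        for enter, exit_ in active_periods(dates, window_days):
            deltas[enter] = deltas.get(enter, 0) + 1
            deltas[exit_] = deltas.get(exit_, 0) - 1

    # A cumulative sum over the ordered dates gives the running count.
    days = sorted(deltas)
    return dict(zip(days, accumulate(deltas[d] for d in days)))

# Hypothetical events: (customer_id, event_date), with a 7-day window.
events = [
    ("a", date(2021, 1, 1)),
    ("a", date(2021, 1, 5)),
    ("b", date(2021, 1, 3)),
]
counts = rolling_distinct(events, 7)
for day, n in counts.items():
    print(day, n)
```

The result only has rows for dates where the count changes; the value holds until the next listed date.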

How to find the share of clients who "outflow" every month? (SQLite or Oracle)

The CLIENTS table contains a monthly snapshot of the bank's clients,
who have made any transactions in the given month. Attributes: report_month
and client_id. We say that a client has "outflowed" from the bank in month N if the client is active in month N
(present in the CLIENTS table) and inactive in months N + 1, N + 2, and N + 3.
How to find the share of clients who "outflow" every month?
Table looks like:
report_month client_id
2020-01-01 0023
2020-03-01 0125
...
You can do this with window functions and a window frame. In standard SQL, this would look like:
select report_month, sum(case when cnt = 0 then 1 else 0 end) as outflow
from (
select t.*,
count(*) over(
partition by client_id
order by report_month
range between interval '1' month following and interval '3' month following
) cnt
from mytable t
) t
group by report_month
This assumes that report_month is of a date-like datatype, and that each customer has 0 or 1 record per report_month. If a customer may appear more than once in a month, you would change the outer conditional sum() to:
count(distinct case when cnt = 0 then client_id end) as outflow
In SQLite, which has limited support for date arithmetic, it is a bit more complicated. If you can live with an approximation of month periods, you could do something like this:
select report_month, sum(case when cnt = 0 then 1 else 0 end) as outflow
from (
select t.*,
count(*) over(
partition by client_id
order by julianday(report_month)
range between 28 following and 92 following
) cnt
from mytable t
) t
group by report_month
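The queries above return the number of outflowing clients per month; the share asked for in the question is that number divided by the number of active clients in the same month. Here is a rough Python sketch of the definition, using hypothetical snapshot rows and helper names. It assumes report_month is a 'YYYY-MM-DD' string for the first day of the month.

```python
from collections import defaultdict

def month_index(report_month):
    """Map 'YYYY-MM-DD' to a running month number; consecutive months differ by 1."""
    return int(report_month[:4]) * 12 + int(report_month[5:7]) - 1

def outflow_share(rows):
    """Share of a month's active clients who are absent in each of the next 3 months."""
    active = defaultdict(set)
    labels = {}
    for report_month, client_id in rows:
        idx = month_index(report_month)
        active[idx].add(client_id)
        labels[idx] = report_month

    shares = {}
    for idx in sorted(labels):
        later = set()
        for step in (1, 2, 3):
            later |= active.get(idx + step, set())
        gone = active[idx] - later
        shares[labels[idx]] = len(gone) / len(active[idx])
    return shares

# Hypothetical snapshot rows: (report_month, client_id).
rows = [
    ("2020-01-01", "0023"), ("2020-01-01", "0125"),
    ("2020-02-01", "0023"),
    ("2020-03-01", "0125"),
    ("2020-06-01", "0023"),
]
shares = outflow_share(rows)
for month, share in shares.items():
    print(month, share)
```

Note that the last three months in the data always look like full outflow, because the following months have not been observed yet; the same caveat applies to the SQL versions.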

sql user retention calculation

I have a table records like this in Athena, one user one row in a month:
month, id
2020-05 1
2020-05 2
2020-05 5
2020-06 1
2020-06 5
2020-06 6
I need to calculate the percentage = (users present in both the prior month and the current month) / (total users in the prior month).
In the example above, users 1 and 5 appear in both May and June, and May has 3 users in total, so the result should be 2/3*100.
with monthly_mau AS
(SELECT month as mauMonth,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth,
count(distinct userid) AS monthly_mau
FROM records
GROUP BY month
ORDER BY month),
retention_mau AS
(SELECT
month,
count(distinct useridLeft) AS retention_mau
FROM (
(SELECT
userid as useridLeft,month as monthLeft,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth
FROM records ) AS prior
INNER JOIN
(SELECT
month ,
userid
FROM records ) AS current
ON
prior.useridLeft = current.userid
AND prior.nextMonth = current.month )
WHERE userid is not null
GROUP BY month
ORDER BY month )
SELECT *, cast(retention_mau AS double)/cast(monthly_mau AS double)*100 AS retention_mau_percentage
FROM monthly_mau as m
INNER JOIN monthly_retention_mau AS r
ON m.nextMonth = r.month
order by r.month
This gives me percentage as 100 which is not right. Any idea?
Hmmm . . . assuming you have one row per user per month, you can use window functions and conditional aggregation:
select month, count(*) as num_users,
sum(case when prev_month_date = date_add('month', -1, month_date) then 1 else 0 end) as both_months
from (select r.*,
cast(concat(month, '-01') AS date) as month_date,
lag(cast(concat(month, '-01') AS date)) over (partition by id order by month) as prev_month_date
from records r
) r
group by month;
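For comparison, the percentage defined in the question can be written down directly with sets. This Python sketch uses the sample rows from the question; the function name monthly_retention is made up, and it assumes every month in the range has at least one row, so that neighbouring entries in the sorted month list really are consecutive months.

```python
from collections import defaultdict

# Sample rows from the question: (month, id).
records = [
    ("2020-05", 1), ("2020-05", 2), ("2020-05", 5),
    ("2020-06", 1), ("2020-06", 5), ("2020-06", 6),
]

def monthly_retention(records):
    """Percent of the prior month's users who also appear in the current month."""
    users = defaultdict(set)
    for month, user_id in records:
        users[month].add(user_id)

    months = sorted(users)  # 'YYYY-MM' strings sort chronologically
    retention = {}
    for prev, cur in zip(months, months[1:]):
        both = users[prev] & users[cur]
        retention[cur] = 100 * len(both) / len(users[prev])
    return retention

retention = monthly_retention(records)
print(retention)  # June: users 1 and 5 out of May's 3 users
```

The denominator is the prior month's user count, whereas the query above reports the current month's count as num_users, so the division still has to use the previous row's count.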

Google Big Query SQL - Get most recent unique value by date

EDIT: Following the comments, I have rephrased my question.
I have a BigQuery table that I want to use to compute some KPIs for my application.
In this table, I save each create or update as a new row in order to keep a better history.
So the same entry appears several times with different states.
Example of the table :
uuid |status |date
––––––|–––––––––––|––––––––––
3 |'inactive' |2018-05-12
1 |'active' |2018-05-10
1 |'inactive' |2018-05-08
2 |'active' |2018-05-08
3 |'active' |2018-05-04
2 |'inactive' |2018-04-22
3 |'inactive' |2018-04-18
We can see that there are multiple rows for each uuid.
What I would like to get:
I would like the number of currently 'active' entries (so there must be no later 'inactive' entry with the same uuid). To complicate things, I need this total per day.
So for each day, the number of 'active' entries, including those from previous days.
So with this example I should have this result :
date | actives
____________|_________
2018-05-02 | 0
2018-05-03 | 0
2018-05-04 | 1
2018-05-05 | 1
2018-05-06 | 1
2018-05-07 | 1
2018-05-08 | 2
2018-05-09 | 2
2018-05-10 | 3
2018-05-11 | 3
2018-05-12 | 2
I've managed to get the correct number of actives for a single day, but my problem is getting the results for every day.
What I've tried:
I'm stuck with two solutions that each return a different error.
First solution :
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT COUNT(uuid)
FROM (
SELECT
uuid, status, date,
RANK() OVER(PARTITION BY uuid ORDER BY date DESC) rank
FROM users
WHERE
PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d",date)) <= i_date
)
WHERE
status = 'active'
and rank = 1
## rank is the condition which causes the error
) users
FROM
dates, UNNEST(arr_dates) i_date
ORDER BY i_date;
The SELECT with RANK() OVER correctly returns the users, with a rank column that tells me which entry is the latest for each uuid.
But when I try this, the rank = 1 condition triggers this error:
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Second solution :
WITH
dates AS(
SELECT GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), CURRENT_DATE(), INTERVAL 1 DAY)
arr_dates )
SELECT
i_date date,
(
SELECT
COUNT(t1.uuid)
FROM
users t1
WHERE
t1.date = (
SELECT MAX(t2.date)
FROM users t2
WHERE
t2.uuid = t1.uuid
## Here that's the i_date condition which causes problem
AND PARSE_DATE("%Y-%m-%d", FORMAT_DATETIME("%Y-%m-%d", t2.date)) <= i_date
)
AND status='active' ) users
FROM
dates,
UNNEST(arr_dates) i_date
ORDER BY i_date;
Here, the inner SELECT works too and correctly returns the number of active users for a given day.
The problem appears when I try to use i_date to retrieve data across multiple days.
Here I get this error: LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join.
Which solution is more likely to succeed? What should I change?
And if my way of storing the data isn't good, how should I proceed in order to keep a precise history?
Below is for BigQuery Standard SQL
#standardSQL
SELECT date, COUNT(DISTINCT uuid) total_active
FROM `project.dataset.table`
WHERE status = 'active'
GROUP BY date
-- ORDER BY date
Update to address your "rephrased" question :o)
Below example is using dummy data from your question
#standardSQL
WITH `project.dataset.users` AS (
SELECT 3 uuid, 'inactive' status, DATE '2018-05-12' date UNION ALL
SELECT 1, 'active', '2018-05-10' UNION ALL
SELECT 1, 'inactive', '2018-05-08' UNION ALL
SELECT 2, 'active', '2018-05-08' UNION ALL
SELECT 3, 'active', '2018-05-04' UNION ALL
SELECT 2, 'inactive', '2018-04-22' UNION ALL
SELECT 3, 'inactive', '2018-04-18'
), dates AS (
SELECT day FROM UNNEST((
SELECT GENERATE_DATE_ARRAY(MIN(date), MAX(date))
FROM `project.dataset.users`
)) day
), active_users AS (
SELECT uuid, status, date first, DATE_SUB(next_status.date, INTERVAL 1 DAY) last FROM (
SELECT uuid, date, status, LEAD(STRUCT(status, date)) OVER(PARTITION BY uuid ORDER BY date ) next_status
FROM `project.dataset.users` u
)
WHERE status = 'active'
)
SELECT day, COUNT(DISTINCT uuid) actives
FROM dates d JOIN active_users u
ON day BETWEEN first AND IFNULL(last, day)
GROUP BY day
-- ORDER BY day
with result
Row day actives
1 2018-05-04 1
2 2018-05-05 1
3 2018-05-06 1
4 2018-05-07 1
5 2018-05-08 2
6 2018-05-09 2
7 2018-05-10 3
8 2018-05-11 3
9 2018-05-12 2
I think this -- or something similar -- will do what you want:
SELECT day,
coalesce(running_actives, 0) - coalesce(running_inactives, 0)
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2015-05-11'), DATE('2018-06-29'), INTERVAL 1 DAY)
) AS day left join
(select date, sum(countif(status = 'active')) over (order by date) as running_actives,
sum(countif(status = 'inactive')) over (order by date) as running_inactives
from t
group by date
) a
on a.date = day
order by day;
The exact solution depends on whether the "inactive" is inclusive of the day (as above) or takes effect the next day. Either is handled the same way, by using cumulative sums of actives and inactives and then taking the difference.
In order to get data on all days, this generates the days using arrays and unnest(). If you have data on all days, that step may be unnecessary.
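To check any of these queries against the expected output in the question, a brute-force Python sketch can help. It does not use cumulative sums; it simply takes, for each day, the latest status of every uuid on or before that day and counts the active ones. The function name actives_per_day is hypothetical, and a status change is assumed to take effect on its own date.

```python
from datetime import date, timedelta

# Sample rows from the question: (uuid, status, date).
events = [
    (3, "inactive", date(2018, 5, 12)),
    (1, "active", date(2018, 5, 10)),
    (1, "inactive", date(2018, 5, 8)),
    (2, "active", date(2018, 5, 8)),
    (3, "active", date(2018, 5, 4)),
    (2, "inactive", date(2018, 4, 22)),
    (3, "inactive", date(2018, 4, 18)),
]

def actives_per_day(events, start, end):
    """For each day in [start, end], count uuids whose latest status is 'active'."""
    ordered = sorted(events, key=lambda e: e[2])
    result = {}
    day = start
    while day <= end:
        latest = {}
        for uuid, status, changed_on in ordered:
            if changed_on <= day:
                latest[uuid] = status  # later rows overwrite earlier ones
        result[day] = sum(1 for s in latest.values() if s == "active")
        day += timedelta(days=1)
    return result

actives = actives_per_day(events, date(2018, 5, 2), date(2018, 5, 12))
for day, n in actives.items():
    print(day, n)
```

This is quadratic in the worst case and only meant for validating results on small samples, not as a substitute for the SQL.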

How to generate a date series to fill in absent dates in Google BigQuery?

I am trying to get the daily sum of sales from a Google BigQuery table. I used the following code for that.
select Day(InvoiceDate) date, Sum(InvoiceAmount) sales from test_gmail_com.sales
where year(InvoiceDate) = Year(current_date()) and
Month(InvoiceDate) = Month(current_date())
group by date order by date
This query returns the daily sum of sales only for days that appear in the table. Some days may have no sales; for those days I need the date with a sum of 0. For example, every month should have 28 to 31 rows, one per day. In the example below, the 4th day of the month has no sales, so its sum should be 0.
date | sales
-----+------
1 | 259
-----+------
2 | 359
-----+------
3 | 45
-----+------
4 | 0
-----+------
5 | 156
Is it possible to do this in BigQuery? Basically, the date column should be a series from 1 to 28, 29, 30, or 31, depending on the month and year.
Generating a list of dates and then joining whatever table you need on top seems the easiest. I used generate_date_array + unnest and it looks quite clean.
To generate a list of days (one day per row):
SELECT
*
FROM
UNNEST(GENERATE_DATE_ARRAY('2018-10-01', '2020-09-30', INTERVAL 1 DAY)) AS example
You can use the query below (legacy SQL) to generate all dates in a given range on the fly. This example covers all dates from 2015-06-01 through CURRENT_DATE(); change those two values to control the range.
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS calendar_day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
Now you can LEFT JOIN it with your table so that all dates are accounted for. See a possible example below:
SELECT
calendar_day,
IFNULL(sales, 0) AS sales
FROM (
SELECT DATE(DATE_ADD(TIMESTAMP("2015-06-01"), pos - 1, "DAY")) AS calendar_day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP(CURRENT_DATE()), TIMESTAMP("2015-06-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
) AS all_dates
LEFT JOIN (
SELECT DAY(InvoiceDate) DATE, SUM(InvoiceAmount) sales
FROM test_gmail_com.sales
WHERE YEAR(InvoiceDate) = YEAR(CURRENT_DATE()) AND
MONTH(InvoiceDate) = MONTH(CURRENT_DATE())
GROUP BY DATE
)
ON DATE = calendar_day
I need to get the previous month's sales.
The query below gives all days of the previous month:
SELECT DATE(DATE_ADD(DATE_ADD(DATE_ADD(CURRENT_DATE(), -1, "MONTH"), 1 - DAY(CURRENT_DATE()), "DAY"), pos - 1, "DAY")) AS calendar_day
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(DATE_ADD(CURRENT_DATE(), - DAY(CURRENT_DATE()), "DAY"), DATE_ADD(DATE_ADD(CURRENT_DATE(), -1, "MONTH"), 1 - DAY(CURRENT_DATE()), "DAY")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
Using the Standard SQL dialect and the generate_array function to simplify the code:
WITH serialnum AS (
SELECT
sn
FROM
UNNEST(GENERATE_ARRAY(0,
DATE_DIFF(DATE_ADD(DATE_TRUNC(CURRENT_DATE()
, MONTH)
, INTERVAL 1 MONTH)
, DATE_TRUNC(CURRENT_DATE(), MONTH)
, DAY) - 1)
) AS sn
), date_seq AS (
SELECT
DATE_ADD(DATE_TRUNC(CURRENT_DATE(), MONTH),
INTERVAL(sn) DAY) AS this_day
FROM
serialnum
)
SELECT
-- Take the day from date_seq so that days without sales still appear.
EXTRACT(DAY FROM this_day) AS date
, SUM(IFNULL(InvoiceAmount, 0)) AS sales
FROM
date_seq
LEFT JOIN
test_gmail_com.sales AS s
ON
date_seq.this_day = CAST(s.InvoiceDate AS DATE)
-- No WHERE filter on InvoiceDate: date_seq already covers only the
-- current month, and such a filter would drop the days without sales.
GROUP BY
date
ORDER BY
date
;
UPDATE
Or, simpler still using the generate_date_array function:
WITH date_seq AS (
SELECT
this_day
FROM
-- UNNEST turns the generated array into one row per day.
UNNEST(GENERATE_DATE_ARRAY(DATE_TRUNC(CURRENT_DATE(), MONTH),
DATE_ADD(DATE_ADD(DATE_TRUNC(CURRENT_DATE(), MONTH)
, INTERVAL 1 MONTH)
, INTERVAL -1 DAY)
, INTERVAL 1 DAY)) AS this_day
)
SELECT
-- Take the day from date_seq so that days without sales still appear.
EXTRACT(DAY FROM this_day) AS date
, SUM(IFNULL(InvoiceAmount, 0)) AS sales
FROM
date_seq
LEFT JOIN
test_gmail_com.sales AS s
ON
date_seq.this_day = CAST(s.InvoiceDate AS DATE)
-- No WHERE filter on InvoiceDate: it would drop the days without sales.
GROUP BY
date
ORDER BY
date
;
For these purposes it is practical to have a 'calendar' table, a table that just lists all the days within a certain range. For your specific question, it would suffice to have a table with the numbers 1 to 31. A quick way to get this table is to make a spreadsheet with these numbers, save it as a csv file and import this file into BigQuery as a table.
You then left outer join your result set onto this table, with ifnull(sales,0) as sales.
If you want the number of days per month (28--31) to be right, you basically have two options. Either you create a proper calendar table that covers several years and that you join on using year, month and day. Or you use the simple table with numbers 1--31 and remove numbers based on the month and the year.
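The calendar-table idea is independent of BigQuery: build the full list of days first, then attach sales to it with a default of 0. A small Python sketch of that left-join-with-default, with a hypothetical function name and made-up October 2018 invoices that reuse the amounts from the question:

```python
import calendar
from datetime import date

def daily_sales(invoices, year, month):
    """Sum of sales for every day of the month, with 0 for days without sales."""
    days_in_month = calendar.monthrange(year, month)[1]  # 28 to 31
    # The "calendar table": one entry per day, pre-filled with 0.
    totals = {day: 0 for day in range(1, days_in_month + 1)}
    for invoice_date, amount in invoices:
        if invoice_date.year == year and invoice_date.month == month:
            totals[invoice_date.day] += amount
    return totals

# Hypothetical invoices: (InvoiceDate, InvoiceAmount). No sales on the 4th.
invoices = [
    (date(2018, 10, 1), 200), (date(2018, 10, 1), 59),
    (date(2018, 10, 2), 359),
    (date(2018, 10, 3), 45),
    (date(2018, 10, 5), 156),
    (date(2018, 9, 30), 999),  # other month, ignored
]
totals = daily_sales(invoices, 2018, 10)
for day in range(1, 6):
    print(day, totals[day])
```

Using calendar.monthrange takes care of the 28 to 31 day variation, which is the part that a fixed 1 to 31 number table cannot handle on its own.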
For Standard SQL
WITH
splitted AS (
SELECT
*
FROM
UNNEST( SPLIT(RPAD('',
1 + DATE_DIFF(CURRENT_DATE(), DATE("2015-06-01"), DAY),
'.'),''))),
with_row_numbers AS (
SELECT
ROW_NUMBER() OVER() AS pos,
*
FROM
splitted),
calendar_day AS (
SELECT
DATE_ADD(DATE("2015-06-01"), INTERVAL (pos - 1) DAY) AS day
FROM
with_row_numbers)
SELECT
*
FROM
calendar_day
ORDER BY
day DESC