How to Calculate Full/Repeat Retention in BigQuery SQL

I am trying to calculate a "rolling retention" or "repeat retention" (I'm not sure what the appropriate name is): I only want to count the proportion of users who place an order in every single month, consecutively.
So if 10 users place an order in Jan 2020, and 5 of them come back in Feb, that would equal a 50% retention.
Now for March, I only want to consider the 5 users who ordered in February, still taking note of the total January cohort size.
So if 2 users from February come back in March, retention for March will be 2/10 = 20%. If a user from January who didn't return in February places an order in March, they will not be included in the calculation for March, because they did not return in February.
Basically, this retention will progressively decrease to 0% and can never increase.
Here is what I have done so far:
WITH first_order AS (SELECT
customerEmail,
MIN(orderedat) as firstOrder,
FROM fact AS fact
GROUP BY 1 ),
cohort_data AS (SELECT
first_order.customerEmail,
orderedAt as order_month,
MIN(FORMAT_DATE("%y-%m (%b)", date(firstorder))) as cohort_month,
FROM first_order as first_order
LEFT JOIN fact as fact
ON first_order.customeremail = fact.customeremail
GROUP BY 1,2, FACT.orderedAt),
cohort_count AS (select cohort_month, count(distinct customeremail) AS total_cohort_count FROM cohort_data GROUP BY 1 )
SELECT
cd.cohort_month,
date_trunc(date(cd.order_month), month) as order_month,
total_cohort_count,
count(distinct cd.customeremail) as total_repeat
FROM cohort_data as cd
JOIN cohort_data as last_month
ON cd.customeremail= last_month.customeremail
and date(cd.order_month) = date_add(date(last_month.order_month), interval 1 month)
LEFT JOIN cohort_count AS cc
on cd.cohort_month = cc.cohort_month
GROUP BY 1,2,3
ORDER BY cohort_month, order_month ASC
Here is the result. I'm not sure where I went wrong, but the numbers are too small and the retention increases in some months, which shouldn't happen.
I did an INNER JOIN in the last query so I could compare the previous month to the current month, but it didn't work exactly how I wanted.
Sample Data:
I'd appreciate any help

I would start with one row per customer per month. Then I would enumerate the customer/months, keep only those with no gaps, and aggregate:
with customer_months as (
      -- one row per customer per calendar month in which they ordered
      select customer_email,
             date_trunc(ordered_at, month) as yyyymm,
             min(date_trunc(ordered_at, month)) over (partition by customer_email) as first_yyyymm
      from cohort_data
      group by 1, 2
     )
select first_yyyymm, yyyymm, count(*)
from (select cm.*,
             row_number() over (partition by customer_email order by yyyymm) as seqnum
      from customer_months cm
     ) cm
-- keep a month only if it is the seqnum-th consecutive month since the first order
where yyyymm = date_add(first_yyyymm, interval seqnum - 1 month)
group by 1, 2
order by 1, 2;
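To sanity-check the numbers outside of BigQuery, here is a minimal Python sketch of the same "no gaps" rule, not a replacement for the query. The function names and the sample orders are made up for illustration; the data mirrors the 10 / 5 / 2 example from the question, plus one user who skips February.

```python
from collections import defaultdict
from datetime import date


def month_index(d):
    """Map a date to a running month number so consecutive months differ by 1."""
    return d.year * 12 + (d.month - 1)


def consecutive_retention(orders):
    """orders: iterable of (customer, order_date).

    Returns {(cohort_month, month): customers} where a customer is counted for a
    month only if they ordered in every month from their first month up to it.
    """
    months_by_customer = defaultdict(set)
    for customer, d in orders:
        months_by_customer[customer].add(month_index(d))

    counts = defaultdict(int)
    for months in months_by_customer.values():
        first = min(months)
        m = first
        while m in months:  # stop at the first gap
            counts[(first, m)] += 1
            m += 1
    return dict(counts)


# Hypothetical data: 10 users order in Jan, 5 return in Feb, 2 of those return in Mar.
orders = [(f"u{i}", date(2020, 1, 5)) for i in range(10)]
orders += [(f"u{i}", date(2020, 2, 7)) for i in range(5)]
orders += [("u0", date(2020, 3, 1)), ("u1", date(2020, 3, 9))]
orders += [("u9", date(2020, 3, 2))]  # skipped Feb, so must not count for Mar

retained = consecutive_retention(orders)
jan = month_index(date(2020, 1, 1))
cohort_size = retained[(jan, jan)]
rates = [retained.get((jan, jan + k), 0) / cohort_size for k in range(3)]
print(rates)  # [1.0, 0.5, 0.2]
```

Because each customer stops being counted at their first gap, the rate for a cohort can only stay flat or fall, which is the property asked for.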

Related

How to join partitioned table with another one

Sorry for the newbie question, but I'm really having trouble with the following issue:
Say, I have this code in place:
WITH active_pass AS (SELECT DATE_TRUNC(fr.day, MONTH) AS month, id,
CASE
WHEN SUM(fr.imps) > 100 THEN 1
WHEN SUM(fr.imps) < 100 THEN 0
END AS active_or_passive
FROM table1 AS fr
WHERE day between (CURRENT_DATE() - 730) AND (CURRENT_DATE() - EXTRACT(DAY FROM CURRENT_DATE()))
GROUP BY month, id
ORDER BY month desc),
# summing the score for each customer (sum for the whole year)
active_pass_assigned AS (SELECT id, month,
SUM(SUM(active_or_passive)) OVER (PARTITION BY id ORDER BY month rows BETWEEN 3 PRECEDING AND 1 PRECEDING) AS trailing_act
FROM active_pass AS a
GROUP BY month, id
ORDER BY MONTH desc)
It creates a trailing total over the last 3 months to see in how many of those months the customer was active. However, I have no idea how to join it with the next table to get the sum of revenue that the client generated. What I tried is this:
SELECT c.id, DATE_TRUNC(day, MONTH) AS month, SUM(revenue) AS Rev, name
FROM table2 AS c
JOIN active_pass_assigned AS a
ON c.id = a.id
WHERE day between (CURRENT_DATE() - 365) AND (CURRENT_DATE() - EXTRACT(DAY FROM CURRENT_DATE()))
GROUP BY month, id, name
ORDER BY month DESC
However, it returns way higher values for revenue than the actual ones, and I have no idea why. Furthermore, could you please tell me how to join those two tables so that I only get the customer's revenue for the months in which their activity was equal to 3?
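One likely cause, assuming active_pass_assigned has one row per id per month: joining on id alone pairs every revenue row with every month row for that customer, so the revenue is summed once per month. The Python sketch below only illustrates that fan-out and the effect of also matching on month; the rows and values are hypothetical.

```python
from datetime import date

# Hypothetical (id, month_start, trailing_act) rows, one per customer per month.
activity = [
    (1, date(2023, 1, 1), 3),
    (1, date(2023, 2, 1), 2),
    (1, date(2023, 3, 1), 3),
]
# Hypothetical (id, day, revenue) rows.
revenue = [
    (1, date(2023, 1, 10), 100.0),
    (1, date(2023, 2, 15), 50.0),
    (1, date(2023, 3, 20), 70.0),
]


def month_start(d):
    return d.replace(day=1)


# Join on id only: each revenue row is repeated once per activity row of that id.
fanned_out = sum(
    rev for (rid, _, rev) in revenue for (aid, _, _) in activity if rid == aid
)

# Join on id AND month, keeping only months whose trailing activity equals 3.
matched = sum(
    rev
    for (rid, day, rev) in revenue
    for (aid, month, act) in activity
    if rid == aid and month_start(day) == month and act == 3
)

print(fanned_out)  # 660.0, three times the true total of 220.0
print(matched)     # 170.0, revenue from January and March only
```

In SQL terms this would correspond to adding a month equality to the ON clause and a filter on trailing_act, if the data really has that shape.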

Retrieve Customers with a Monthly Order Frequency greater than 4

I am trying to optimize the query below to fetch all customers who have had a monthly order frequency of 4 or more in each of the past three months.
Customer ID   Feb   Mar   Apr
0001          4     5     6
0002          3     2     4
0003          4     2     3
In the above table, only the customer with Customer ID 0001 should be picked, as he consistently has 4 or more orders in a month.
Below is the query I have written. It pulls all customers with an average purchase frequency of 4 in the last 90 days, but it does not check whether there were consistently 4 or more purchases in each of the last three months.
Query:
SELECT distinct lines.customer_id Customer_ID, (COUNT(lines.order_id)/90) PurchaseFrequency
from fct_customer_order_lines lines
LEFT JOIN product_table product
ON lines.entity_id= product.entity_id
AND lines.vendor_id= product.vendor_id
WHERE LOWER(product.country_code)= "IN"
AND lines.date >= DATE_SUB(CURRENT_DATE() , INTERVAL 90 DAY )
AND lines.date < CURRENT_DATE()
GROUP BY Customer_ID
HAVING PurchaseFrequency >=4;
I tried to use window functions, but I'm not sure they are needed in this case.
I would sum the orders per month instead of computing the average, and then keep the customers whose monthly count is 4 or more in each of the last three months.
Also, I think you should select your interval using "month(CURRENT_DATE()) - 3" instead of a 90-day window. If needed, you should handle the case where the current date falls in January, February, or March, in which case you have to go back to October, November, or December of the previous year.
I'm not familiar with Google BigQuery so I can't write your query, but I hope this helps.
So I've found the solution to this using a WITH clause, as below:
WITH filtered_orders AS (
    -- one row per customer per month, kept only when the month has 4 or more orders
    select
        distinct customer_id ID,
        extract(MONTH from date) Order_Month,
        count(order_id) CountofOrders
    from customer_order_lines lines
    where EXTRACT(YEAR FROM date) = 2022 AND EXTRACT(MONTH FROM date) IN (2,3,4)
    group by ID, Order_Month
    having CountofOrders>=4)
-- keep customers that qualified in all three months
select distinct ID
from filtered_orders
group by ID
having count(Order_Month) =3;
Hope this helps!
Another option is to first count the orders by month and then keep the users whose count is at or above your threshold in every month in which they purchased:
WITH ORDERS_BY_MONTH AS (
    SELECT
        DATE_TRUNC(lines.date, MONTH) PurchaseMonth,
        lines.customer_id Customer_ID,
        COUNT(lines.order_id) PurchaseFrequency
    FROM fct_customer_order_lines lines
    LEFT JOIN product_table product
        ON lines.entity_id= product.entity_id
        AND lines.vendor_id= product.vendor_id
    -- UPPER, not LOWER: a lower-cased value can never equal "IN"
    WHERE UPPER(product.country_code)= "IN"
        AND lines.date >= DATE_SUB(CURRENT_DATE() , INTERVAL 90 DAY )
        AND lines.date < CURRENT_DATE()
    GROUP BY PurchaseMonth, Customer_ID
)
SELECT
    Customer_ID,
    AVG(PurchaseFrequency) AvgPurchaseFrequency
FROM ORDERS_BY_MONTH
GROUP BY Customer_ID
-- every month present for the customer must have 4 or more orders
HAVING COUNT(1) = COUNTIF(PurchaseFrequency >= 4)
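As a quick check of the rule itself, here is a small Python sketch applied to the sample table from the question. The function name and the dictionary layout are assumptions made for illustration. Unlike the HAVING clause above, it treats a month with no orders as 0, so a customer who is missing a month is rejected.

```python
# Orders per customer per month, copied from the sample table in the question.
monthly_orders = {
    "0001": {"Feb": 4, "Mar": 5, "Apr": 6},
    "0002": {"Feb": 3, "Mar": 2, "Apr": 4},
    "0003": {"Feb": 4, "Mar": 2, "Apr": 3},
}
required_months = ("Feb", "Mar", "Apr")


def consistent_customers(counts, months, threshold=4):
    """Customers whose order count reaches the threshold in every required month."""
    return sorted(
        customer
        for customer, per_month in counts.items()
        if all(per_month.get(m, 0) >= threshold for m in months)
    )


picked = consistent_customers(monthly_orders, required_months)
print(picked)  # ['0001']
```

Whether a missing month should disqualify a customer is a business decision; in SQL the equivalent would be to also require the number of qualifying months to equal three, as the WITH-based solution above does.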

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
edit Additional data below. Data collection begins on 11/23. No data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now, if you want a rolling sum of that count (say, the current week and the three preceding weeks), you can use window functions. Note that this adds up the weekly distinct counts, so a customer who is active in two of those weeks is counted twice:
select date_trunc('week', event_date) as week_start,
       count(distinct customer_id) cnt,
       sum(count(distinct customer_id)) over(
           order by date_trunc('week', event_date)
           -- ROWS frame: assumes one output row per week with no missing weeks
           rows between 3 preceding and current row
       ) as rolling_cnt
from final
group by 1
Rolling distinct counts are quite difficult in Redshift. One method is a self-join and aggregation:
select t.date,
       count(distinct case when tprev.date >= t.date - interval '6 day' then tprev.customer_id end) as trailing_7,
       count(distinct tprev.customer_id) as trailing_30
from t join
     t tprev
     on tprev.date >= t.date - interval '29 day' and
        tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
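For checking the self-join's output on a small extract, a brute-force Python version of the same window (the end day and the 29 days before it) might look like the sketch below. The function name, the events, and the week-ending dates are all hypothetical.

```python
from datetime import date, timedelta


def trailing_distinct(events, end_day, window_days=30):
    """Count distinct customers with an event in the window ending on end_day."""
    start = end_day - timedelta(days=window_days - 1)
    return len({customer for customer, day in events if start <= day <= end_day})


# Hypothetical click events: (customer_id, event_date).
events = [
    ("a", date(2020, 11, 25)),
    ("a", date(2020, 12, 18)),
    ("b", date(2020, 12, 5)),
    ("c", date(2020, 12, 28)),
]
week_ends = [date(2020, 12, 20), date(2020, 12, 27), date(2021, 1, 3)]

weekly = {end: trailing_distinct(events, end) for end in week_ends}
for end, n in weekly.items():
    print(end, n)
```

Customer "a" clicks twice but is counted once per window, and each week's figure is computed independently, so nothing carries over from the prior row.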
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and end time periods of being counted. This is a pain with two different time frames. Here is what it looks like for one.
The idea is to:
1. Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
2. Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
3. Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
4. Do a cumulative sum of this count.
The code looks something like this:
with cd as (
      -- +1 on each event date, -1 seven days later; cnt is the running total per customer
      select customer_id, date,
             lead(date) over (partition by customer_id order by date) as next_date,
             sum(sum(inc)) over (partition by customer_id order by date rows unbounded preceding) as cnt
      from ((select t.customer_id, t.date, 1 as inc
             from t
            ) union all
            (select t.customer_id, t.date + interval '7 day', -1
             from t
            )
           ) tt
      group by customer_id, date
     ),
     cd2 as (
      select customer_id, min(date) as enter_date, max(date) as exit_date
      from (select cd.*,
                   -- a new period starts when the previous running count had dropped to 0
                   sum(case when prev_cnt = 0 then 1 else 0 end) over (partition by customer_id order by date rows unbounded preceding) as grp
            from (select cd.*,
                         lag(cnt) over (partition by customer_id order by date) as prev_cnt
                  from cd
                 ) cd
           ) cd
      group by customer_id, grp
      having max(cnt) > 0
     )
select dte, sum(sum(inc)) over (order by dte rows unbounded preceding)
from ((select customer_id, enter_date as dte, 1 as inc
       from cd2
      ) union all
      (select customer_id, exit_date as dte, -1 as inc
       from cd2
      )
     ) cd2
group by dte;
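The enter/exit idea can be easier to follow outside SQL. Below is a rough Python sketch of the same four steps, assuming a 7-day period as in the query. The function name and the events are hypothetical, and this is only a model of the approach, not a translation of the query.

```python
from collections import defaultdict
from datetime import date, timedelta


def active_counts(events, n_days=7):
    """Return {date: customers being counted} at each date where the count changes."""
    by_customer = defaultdict(list)
    for customer, day in events:
        by_customer[customer].append(day)

    deltas = defaultdict(int)
    for days in by_customer.values():
        days.sort()
        enter = days[0]
        exit_ = enter + timedelta(days=n_days)
        for day in days[1:]:
            if day <= exit_:
                # Event falls inside (or touches) the current period: extend it.
                exit_ = max(exit_, day + timedelta(days=n_days))
            else:
                # Gap found: close the current period and open a new one.
                deltas[enter] += 1
                deltas[exit_] -= 1
                enter = day
                exit_ = day + timedelta(days=n_days)
        deltas[enter] += 1
        deltas[exit_] -= 1

    running = 0
    result = {}
    for day in sorted(deltas):  # cumulative sum of the +1 / -1 records
        running += deltas[day]
        result[day] = running
    return result


# Hypothetical events: (customer_id, event_date).
events = [
    ("a", date(2020, 12, 1)),
    ("a", date(2020, 12, 5)),
    ("b", date(2020, 12, 3)),
    ("a", date(2020, 12, 20)),
]
counts = active_counts(events)
for day, n in counts.items():
    print(day, n)
```

Customer "a" has two periods here (Dec 1 to Dec 12, then Dec 20 to Dec 27) because the second click extends the first period and the third click comes after a gap.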

how to produce a customer retention table /cohort analysis with SQL

I'm trying to write an SQL query (Presto SQL syntax) to produce a customer retention table (see sample below).
A customer who makes at least one transaction in a month is considered as retained for that month.
this is the table
user_id      transaction_date
bdcff651-    2018-01-01
bdcff641     2018-03-15
this is the result I would like to get
The first row should be understood as follows:
Out of all customers who made their first transaction in the month of Jan 2018 (defined as “Jan Activation Cohort”), 35% subsequently made a transaction during the one month period following their first transaction date, 23% in the next month, 15% in the next month and so on.
Date         1st Month   2nd Month   3rd Month
2018-01-01   35%         23%         15%
2018-02-0    33%         26%         13%
2018-03-0    36%         27%         12%
As an example, if person XYZ makes his first transaction on 10th February 2018, his 1st month will be from 11th February 2018 to 10th March 2018, 2nd month will be from 11th March 2018 to 10th April 2018 and so on. This person’s details need to appear in the Feb 2018 cohort in the Customer Retention Table.
would appreciate any help! thanks.
You can use conditional aggregation. However, I am not sure what your real calculations are.
If I just use the built-in definitions of date_diff(), then the logic looks like:
select date_trunc('month', first_td) as yyyymm,
       count(distinct user_id) as cnt,
       -- multiply by 1.0 so the ratio is not truncated by integer division
       (count(distinct case when date_diff('month', first_td, transaction_date) = 1
                            then user_id
                       end) * 1.0 /
        count(distinct user_id)
       ) as month_1_ratio,
       (count(distinct case when date_diff('month', first_td, transaction_date) = 2
                            then user_id
                       end) * 1.0 /
        count(distinct user_id)
       ) as month_2_ratio
from (select t.*,
             min(transaction_date) over (partition by user_id) as first_td
      from t
     ) t
group by date_trunc('month', first_td)
order by yyyymm;
I am not familiar with Presto and do not have a way to test Presto code. However, from searching around a bit, it looks like it wouldn't be too hard to convert something like SQL Server syntax to Presto syntax. Here is what I would do in SQL Server; you should be able to carry the concept over to Presto:
with transactions_info_per_user as (
    select user_id, min(transaction_date) as first_transaction,
           cast(datepart(year, min(transaction_date)) as varchar(4)) + cast(datepart(month, min(transaction_date)) as varchar(2)) as activation_cohort
    from my_table
    group by user_id
),
users_per_activation_cohort as (
    select activation_cohort, count(*) as number_of_users
    from transactions_info_per_user
    group by activation_cohort
),
months_after_activation_per_purchase as (
    -- start date first, so later purchases give positive month offsets
    select distinct mt.user_id, ti.activation_cohort, datediff(month, ti.first_transaction, mt.transaction_date) AS months_after_activation
    from my_table mt
    left join transactions_info_per_user as ti
        on mt.user_id = ti.user_id
),
final as (
    select activation_cohort, months_after_activation, count(*) as user_count_per_cohort_with_purchase_per_month_after_activation
    from months_after_activation_per_purchase
    group by activation_cohort, months_after_activation
)
select f.activation_cohort, f.months_after_activation,
       cast(f.user_count_per_cohort_with_purchase_per_month_after_activation as decimal(9,2)) / cast(u.number_of_users as decimal(9,2)) * 100 as retention_pct
from final f
join users_per_activation_cohort u
    on u.activation_cohort = f.activation_cohort
--Then pivot months_after_activation into columns
I was very explicit with the naming of things so you could follow the thought process. Here is an example of how to pivot in Presto. Hopefully this helps you!
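To make the cohort logic concrete, here is a small Python sketch of the same steps: find each user's first transaction, assign the cohort, and count distinct users per month offset. It counts calendar-month boundaries crossed, the way SQL Server's datediff(month, ...) does, which is not the anniversary-based window described in the question. The names and the transactions are hypothetical.

```python
from collections import defaultdict
from datetime import date


def months_between(start, end):
    """Number of calendar-month boundaries crossed from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def retention_table(transactions, max_offset=2):
    """transactions: iterable of (user_id, transaction_date).

    Returns {cohort_month_start: [ratio for offset 1, ratio for offset 2, ...]}.
    """
    first = {}
    for user, day in transactions:
        if user not in first or day < first[user]:
            first[user] = day

    cohort_users = defaultdict(set)
    active = defaultdict(set)  # (cohort, month offset) -> users with a transaction
    for user, day in transactions:
        cohort = first[user].replace(day=1)
        cohort_users[cohort].add(user)
        active[(cohort, months_between(first[user], day))].add(user)

    return {
        cohort: [len(active[(cohort, k)]) / len(users) for k in range(1, max_offset + 1)]
        for cohort, users in cohort_users.items()
    }


# Hypothetical transactions.
transactions = [
    ("u1", date(2018, 1, 1)), ("u1", date(2018, 2, 10)),
    ("u2", date(2018, 1, 15)), ("u2", date(2018, 3, 3)),
    ("u4", date(2018, 1, 20)),
    ("u3", date(2018, 2, 2)), ("u3", date(2018, 3, 9)),
]
table = retention_table(transactions)
for cohort in sorted(table):
    print(cohort, [f"{r:.0%}" for r in table[cohort]])
```

Each row of the result corresponds to one row of the desired retention table, with the cohort size as the denominator for every column.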

Accumulating values until up to date

I'm working on an order system where orders come in. For the analytics department I want to build a view that accumulates all sales for a given day.
That is not an issue, I got the working query for that. More complicated is a second number where I want to show the accumulated sales to that day.
Meaning if I have $100 of sales on Feb 1 the column should show $100. If I have $200 of sales on Feb 2 that column should show $300 and so on.
This is what I came up with so far:
select
date_trunc('day', o.created_at) :: date,
sum(o.value) sales_for_day,
count(o.accepted_at) as num_of_orders_for_day,
-- sales_for_month_to_date
-- num_of_orders_for_month_to_date
from
orders o
where
status = 'accepted'
group by
date_trunc('day', o.accepted_at);
Just use window functions:
-- grouped and displayed by accepted_at so that the select list matches the group by
select date_trunc('day', o.accepted_at) :: date as day,
       sum(o.value) as sales_for_day,
       count(o.accepted_at) as num_of_orders_for_day,
       -- running totals restart at the beginning of each month
       sum(sum(o.value)) over (partition by date_trunc('month', min(o.accepted_at))
                               order by date_trunc('day', o.accepted_at)) as sales_for_month_to_date,
       sum(count(*)) over (partition by date_trunc('month', min(o.accepted_at))
                           order by date_trunc('day', o.accepted_at)) as num_of_orders_for_month_to_date
from orders o
where status = 'accepted'
group by date_trunc('day', o.accepted_at);
Based on the comments in your code, I surmise that you want month-to-date numbers, so this also partitions by month.
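If it helps to see what the window is doing, here is a minimal Python sketch of the same month-to-date accumulation: aggregate per day first, then keep a running total that restarts each month. The function name and the orders are hypothetical.

```python
from collections import defaultdict
from datetime import date


def month_to_date(orders):
    """orders: iterable of (day, value) for accepted orders.

    Returns rows of (day, sales_for_day, orders_for_day, sales_mtd, orders_mtd).
    """
    daily = defaultdict(lambda: [0.0, 0])
    for day, value in orders:
        daily[day][0] += value
        daily[day][1] += 1

    running = {}  # (year, month) -> (sales so far, orders so far)
    rows = []
    for day in sorted(daily):
        key = (day.year, day.month)
        sales, n = daily[day]
        prev_sales, prev_n = running.get(key, (0.0, 0))
        running[key] = (prev_sales + sales, prev_n + n)
        rows.append((day, sales, n, *running[key]))
    return rows


# Hypothetical accepted orders: (day, value).
orders = [
    (date(2021, 2, 1), 100.0),
    (date(2021, 2, 2), 150.0),
    (date(2021, 2, 2), 50.0),
    (date(2021, 3, 1), 30.0),
]
rows = month_to_date(orders)
for row in rows:
    print(row)
```

With these values, Feb 1 shows 100 month to date, Feb 2 shows 300, and the total starts again at 30 on Mar 1, matching the behaviour described in the question.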