How to produce a customer retention table / cohort analysis with SQL

I'm trying to write an SQL query (Presto SQL syntax) to produce a customer retention table (see sample below).
A customer who makes at least one transaction in a month is considered retained for that month.
This is the table:
user_id     transaction_date
bdcff651-   2018-01-01
bdcff641    2018-03-15
This is the result I would like to get:
The first row should be understood as follows:
Out of all customers who made their first transaction in the month of Jan 2018 (defined as “Jan Activation Cohort”), 35% subsequently made a transaction during the one month period following their first transaction date, 23% in the next month, 15% in the next month and so on.
Date        1st Month   2nd Month   3rd Month
2018-01-01  35%         23%         15%
2018-02-0   33%         26%         13%
2018-03-0   36%         27%         12%
As an example, if person XYZ makes his first transaction on 10th February 2018, his 1st month will be from 11th February 2018 to 10th March 2018, 2nd month will be from 11th March 2018 to 10th April 2018 and so on. This person’s details need to appear in the Feb 2018 cohort in the Customer Retention Table.
Would appreciate any help! Thanks.

You can use conditional aggregation. However, I am not sure what your real calculations are.
If I just use the built-in definitions of date_diff(), then the logic looks like:
select date_trunc('month', first_td) as yyyymm,
       count(distinct user_id) as cnt,
       -- cast to double so the ratio is not truncated by integer division
       (cast(count(distinct case when date_diff('month', first_td, transaction_date) = 1
                                 then user_id
                            end) as double) /
        count(distinct user_id)
       ) as month_1_ratio,
       (cast(count(distinct case when date_diff('month', first_td, transaction_date) = 2
                                 then user_id
                            end) as double) /
        count(distinct user_id)
       ) as month_2_ratio
from (select t.*,
             -- first transaction date of each user
             min(transaction_date) over (partition by user_id) as first_td
      from t
     ) t
group by date_trunc('month', first_td)
order by yyyymm;
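The question defines each retention month relative to the user's own first transaction date (first transaction on 10th February means the 1st month runs from 11th February to 10th March), which may not match what date_diff() returns. Below is a minimal Python sketch of that bucketing rule, just to make the intended logic explicit. The function names and the handling of month-end dates are my own assumptions, not something stated in the question.

```python
from collections import defaultdict
from datetime import date


def rolling_month(first: date, tx: date) -> int:
    """Return 0 for the activation day, 1 for the 1st rolling month, and so on."""
    months = (tx.year - first.year) * 12 + (tx.month - first.month)
    # Days after the "anniversary" day-of-month belong to the next rolling month.
    # Assumption: for first dates like the 31st, shorter months end on their last day.
    return months if tx.day <= first.day else months + 1


def retention_table(rows, max_month=3):
    """rows: iterable of (user_id, transaction_date). Returns {cohort: [ratio, ...]}."""
    by_user = defaultdict(list)
    for user_id, tx_date in rows:
        by_user[user_id].append(tx_date)

    cohort_users = defaultdict(set)  # cohort month -> users activated in that month
    retained = defaultdict(set)      # (cohort month, n) -> users active in rolling month n
    for user_id, dates in by_user.items():
        first = min(dates)
        cohort = first.replace(day=1)
        cohort_users[cohort].add(user_id)
        for d in dates:
            n = rolling_month(first, d)
            if 1 <= n <= max_month:
                retained[(cohort, n)].add(user_id)

    return {
        cohort: [len(retained[(cohort, n)]) / len(users) for n in range(1, max_month + 1)]
        for cohort, users in sorted(cohort_users.items())
    }
```

Each list holds the ratios for the 1st, 2nd, 3rd month of that cohort, with users counted once per month no matter how many transactions they made.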

I am not familiar with Presto and have no way to test Presto code. However, from searching around a bit, it looks like it wouldn't be too hard to convert SQL Server syntax to Presto syntax. Here is what I would do in SQL Server; you should be able to carry the concept over to Presto:
with transactions_info_per_user as (
    -- first transaction and activation cohort (year + month) per user
    select user_id, min(transaction_date) as first_transaction,
           cast(datepart(year, min(transaction_date)) as varchar(4)) + cast(datepart(month, min(transaction_date)) as varchar(2)) as activation_cohort
    from my_table
    group by user_id
),
users_per_activation_cohort as (
    -- cohort size: the denominator of the retention percentage
    select activation_cohort, count(*) as number_of_users
    from transactions_info_per_user
    group by activation_cohort
),
months_after_activation_per_purchase as (
    -- start date first, so months_after_activation is positive
    select distinct mt.user_id, ti.activation_cohort, datediff(month, ti.first_transaction, mt.transaction_date) AS months_after_activation
    from my_table mt
    left join transactions_info_per_user as ti
        on mt.user_id = ti.user_id
),
final as (
    select activation_cohort, months_after_activation, count(*) as user_count_per_cohort_with_purchase_per_month_after_activation
    from months_after_activation_per_purchase
    group by activation_cohort, months_after_activation
)
select f.activation_cohort, f.months_after_activation,
       cast(f.user_count_per_cohort_with_purchase_per_month_after_activation as decimal(9,2)) / cast(u.number_of_users as decimal(9,2)) * 100 as retention_pct
from final f
join users_per_activation_cohort u
    on f.activation_cohort = u.activation_cohort
--Then pivot months_after_activation into columns
I was very explicit with the naming of things so you could follow the thought process. Here is an example of how to pivot in Presto. Hopefully this helps you!
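For the pivot step, the idea is just to turn one row per (cohort, month) into one row per cohort with a column per month. Here is a rough Python sketch of that reshaping, in case it helps to see it outside SQL. The function name and the choice to fill missing combinations with 0.0 are assumptions of mine.

```python
from collections import defaultdict


def pivot_retention(rows):
    """rows: iterable of (activation_cohort, months_after_activation, pct).

    Returns (header, body): header is the sorted list of month offsets,
    body maps each cohort to its percentages in header order.
    """
    table = defaultdict(dict)
    months = set()
    for cohort, month, pct in rows:
        table[cohort][month] = pct
        months.add(month)

    header = sorted(months)
    # Assumption: a cohort with no purchases in a given month shows 0.0.
    body = {
        cohort: [cells.get(m, 0.0) for m in header]
        for cohort, cells in sorted(table.items())
    }
    return header, body
```

In SQL the same reshaping is usually done with one conditional aggregate per output column, as in the first answer.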

Related

How to get number of billable customers per month SQL

This is what my table looks like:
NOTE: Don't worry about the BMI field being empty in some rows. We assume that each row is a reading. I have omitted some columns for privacy reasons.
I want to get a count of the number of active customers per month. A customer is active if they have at least 18 readings in total (1 reading per day for 18 days in a given month). How do I write this SQL query? Assume the table name is 'cust'. I'm using SQL Server. Any help is appreciated.
Presumably a patient is a customer in your world. If so, you can use two levels of aggregation:
select yyyy, mm, count(*)
from (select year(createdat) as yyyy, month(createdat) as mm,
patient_id,
count(distinct convert(date, createdat)) as num_days
from cust
group by year(createdat), month(createdat), patient_id
) ymp
where num_days >= 18
group by yyyy, mm;
You need to group by patient and the month, then group again by just the month:
SELECT
mth,
COUNT(*) NumPatients
FROM (
SELECT
EOMONTH(c.createdat) mth
FROM cust c
GROUP BY EOMONTH(c.createdat), c.patient_id
HAVING COUNT(*) >= 18
-- for distinct days you could change it to:
-- HAVING COUNT(DISTINCT CAST(c.createdat AS date)) >= 18
) c
GROUP BY mth;
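Both answers use the same two-level idea: first count reading days per patient per month, then count the patients that pass the threshold. A small Python sketch of that logic, assuming readings arrive as (patient_id, created_at) pairs and that "18 readings" means 18 distinct days (the names here are illustrative only):

```python
from collections import defaultdict
from datetime import datetime


def active_patients_per_month(readings, min_days=18):
    """readings: iterable of (patient_id, created_at). Returns {(year, month): count}."""
    # Level 1: distinct reading days per (year, month, patient).
    days = defaultdict(set)
    for patient_id, created_at in readings:
        days[(created_at.year, created_at.month, patient_id)].add(created_at.date())

    # Level 2: count patients per month that reach the threshold.
    counts = defaultdict(int)
    for (year, month, _patient), distinct_days in days.items():
        if len(distinct_days) >= min_days:
            counts[(year, month)] += 1
    return dict(counts)
```

Months in which no patient reaches the threshold are simply absent from the result, just as they are absent from the SQL output.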

How to Calculate Full/Repeat Retention in BigQuery SQL

I am trying to calculate a "rolling retention" or "repeat retention" (not sure what the appropriate name is): a scenario where I only want to count the proportion of users who place an order every single month consecutively.
So if 10 users place an order in Jan 2020, and 5 of them come back in Feb, that would equal a 50% retention.
Now for March, I only want to consider the 5 users who ordered in February, still taking note of the total January cohort size.
So if 2 users from February come back in March, retention for March will be 2/10 = 20%. If a user from Jan who didn't return in Feb places an order in March, they will not be included in the calculation for March, because they did not return in February.
Basically, this retention will progressively decrease to 0% and can never increase.
Here is what I have done so far:
WITH first_order AS (SELECT
customerEmail,
MIN(orderedat) as firstOrder,
FROM fact AS fact
GROUP BY 1 ),
cohort_data AS (SELECT
first_order.customerEmail,
orderedAt as order_month,
MIN(FORMAT_DATE("%y-%m (%b)", date(firstorder))) as cohort_month,
FROM first_order as first_order
LEFT JOIN fact as fact
ON first_order.customeremail = fact.customeremail
GROUP BY 1,2, FACT.orderedAt),
cohort_count AS (select cohort_month, count(distinct customeremail) AS total_cohort_count FROM cohort_data GROUP BY 1 )
SELECT
cd.cohort_month,
date_trunc(date(cd.order_month), month) as order_month,
total_cohort_count,
count(distinct cd.customeremail) as total_repeat
FROM cohort_data as cd
JOIN cohort_data as last_month
ON cd.customeremail= last_month.customeremail
and date(cd.order_month) = date_add(date(last_month.order_month), interval 1 month)
LEFT JOIN cohort_count AS cc
on cd.cohort_month = cc.cohort_month
GROUP BY 1,2,3
ORDER BY cohort_month, order_month ASC
Here is the result. I'm not sure where I got it wrong, but the numbers are too small and the retention increases in some months, which shouldn't happen.
I did an INNER JOIN in the last query so I could compare the previous month to the current month, but it didn't work exactly how I wanted.
Sample Data:
I'd appreciate any help
I would start with one row per customer per month. Then, I would enumerate the customer/months and keep only those with no gaps . . . and aggregate:
with customer_months as (
select customer_email,
date_trunc(ordered_at, month) as yyyymm,
min(date_trunc(ordered_at, month)) over (partition by customer_email) as first_yyyymm
from cohort_data
group by 1, 2
)
select first_yyyymm, yyyymm, count(*)
from (select cm.*,
row_number() over (partition by customer_email order by yyyymm) as seqnum
from customer_months cm
) cm
where yyyymm = date_add(first_yyyymm, interval seqnum - 1 month)
group by 1, 2
order by 1, 2;
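The row_number() comparison keeps a customer's month only while the sequence number matches the number of months since their first order, so everything after the first gap drops out. Here is a minimal Python sketch of the same "no gaps" rule, assuming orders come as (customer_email, order_date) pairs. The helper names are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def month_index(d: date) -> int:
    """Running month number, so consecutive calendar months differ by exactly 1."""
    return d.year * 12 + (d.month - 1)


def year_month(idx: int):
    """Inverse of month_index, returned as (year, month)."""
    year, month0 = divmod(idx, 12)
    return (year, month0 + 1)


def consecutive_retention(orders):
    """Returns {(cohort (year, month), order (year, month)): number of customers}."""
    active_months = defaultdict(set)
    for email, ordered_at in orders:
        active_months[email].add(month_index(ordered_at))

    counts = defaultdict(int)
    for months in active_months.values():
        ordered = sorted(months)
        first = ordered[0]
        for seq, m in enumerate(ordered):
            if m != first + seq:  # first gap: stop counting this customer
                break
            counts[(year_month(first), year_month(m))] += 1
    return dict(counts)
```

Dividing each count by the count for the cohort's own first month gives a retention percentage that can only stay flat or decrease.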

Count Records Prior to Date for Whole Year

I have a historical database with about 9000 records with unique UserID and date they created an account CreatedDate that looks like this:
UserID CreatedDate
1 5/12/2019
2 1/1/2018
3 4/2/2015
4 8/9/2016
. ..
I would like to know how many accounts were created UP TO a certain date, but for multiple months.
For example, how many accounts were there in Jan 2020, Feb 2020, Mar 2020, so on and so forth.
The manual way would be to do this for each month but it would be tedious:
select count(*)
from SCHEMA
--KEEP REPLACING THE MONTH TO GET COUNTS
where CreatedDate <= '2020-01-31'
Just wondering if there is a more efficient way? A group by wouldn't work because it just totals for each month, but I'm trying to get a historical count. Thanks!
You seem to need a running total for each month. If so, group by month to compute the count per month, then add those counts up with the analytic (window) sum function.
This is how you would do it in Postgres (db fiddle). Other vendors may differ in how the month is extracted, but the principle is the same.
with schema(UserID, CreatedDate) as (values
(1, date '2019-12-05'),
(2, date '2018-01-01'),
(3, date '2015-01-04'),
(4, date '2016-09-08')
)
select month, sum(cnt) over (order by month) from (
select date_trunc('month', CreatedDate)::date as month, count(*) as cnt
from schema
group by date_trunc('month', CreatedDate)::date
) x
Note that if the data has gaps in the month sequence and you want a continuous sequence (for example, all months between 2015-01 and 2019-12), you have to pregenerate a calendar (a relation with all months) and left join the table schema to it. (It is not in my example yet because of YAGNI.)
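To illustrate the gap-filling point, here is a small Python sketch that counts accounts per month, walks every month between the first and the last one, and keeps a running total. It is only a sketch of the idea, assuming the created dates are already parsed. The function name is hypothetical.

```python
from collections import Counter
from datetime import date


def running_totals(created_dates):
    """Return [((year, month), cumulative_count), ...] for every month in range."""
    # Accounts created per month, keyed by a running month number.
    per_month = Counter(d.year * 12 + (d.month - 1) for d in created_dates)
    if not per_month:
        return []

    total, out = 0, []
    # Iterating over the full range plays the role of the pregenerated calendar.
    for idx in range(min(per_month), max(per_month) + 1):
        total += per_month.get(idx, 0)  # months with no new accounts add 0
        year, month0 = divmod(idx, 12)
        out.append(((year, month0 + 1), total))
    return out
```

A month without new accounts still appears in the output and repeats the previous total.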

Presenting cumulative average in time series

I am trying to present a time series of a score to view the trend.
The score is an average of all of the scores from the first date in the table until the end of the Year-Month.
i.e. Jan 2018 = where date < Jan 2018
Feb 2018 = where date < Feb 2018
I would like to present this as a monthly score for each Year-Month (Dec 2017, Jan 2018).
If the score were not an average, I could use the Cumulative option in the time series; however, this does not work when introducing Avg(Metric).
I am really scratching my head on this one. Any advice on how to structure the data and present this in Google Datastudio would be greatly appreciated.
I have access to the database, and we are using BigQuery to create the views.
avg() should work. Something like this:
select t.*,
avg(val) over (partition by format_date('%Y%m', date))
from t;
Oops, this is the average for the current month. If you want the running average:
select format_date('%Y%m', date) as yyyymm,
(sum(sum(val)) over (order by min(date)) /
sum(count(*)) over (order by min(date))
) as running_avg
from t
group by yyyymm
order by yyyymm;
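The running average above is a cumulative sum divided by a cumulative count, not an average of the monthly averages. A short Python sketch of the same arithmetic, assuming rows of (date, val), may make the difference easier to see. The function name is illustrative only.

```python
from collections import defaultdict
from datetime import date


def running_monthly_average(rows):
    """rows: iterable of (date, val). Returns [("YYYYMM", running_avg), ...]."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for d, val in rows:
        key = d.strftime("%Y%m")
        sums[key] += val
        counts[key] += 1

    total, n, out = 0.0, 0, []
    for key in sorted(sums):
        total += sums[key]   # cumulative sum of values up to this month
        n += counts[key]     # cumulative number of rows up to this month
        out.append((key, total / n))
    return out
```

With two scores (2 and 4) in one month and a single score (9) in the next, the second month's running average is 15 / 3 = 5, whereas averaging the two monthly averages (3 and 9) would give 6.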

How to run sql n times increasing variable and after joining results

I have a historical Transact table with a Created_Date column; it is related to the Transact_Employee table (inner join on ID_Transact).
Here is the problem: I need to query these tables and get the state by month, because during the year the CreatedDate can change. For example, an employee update in July will create a new line, but this shouldn't affect the March total.
The solution looks like a foreach, but how can I join all the lines at the end? The result should be something like:
January - $123
February - $234
March - $123
...
I get the last state of each employee with this:
select AllTransact.id_employee, AllTransact.id_department from (
select id_employee, id_department, rank() over (partition by id_employee order by created_date desc) desc_rank
from Transact_Employee TransEmployee
inner join Transact on TransEmployee.ID_Transact = Transact.ID_Transact
and Transact.Status = 8
and Transact.Created_Date < #currentMonth) AllTransact
where desc_rank = 1
*I don't want to copy and paste all the code 12 times. :)
You can partition over several columns. rank() OVER (partition BY [id_employee], datepart(month, [Created_Date]) ORDER BY [Created_Date] DESC) will give you what you have now, but for each month. It doesn't care which year the month is in, so you either need to partition by year too or add a limit on created_date.
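If what you need is the state as of each month (the latest row created before a cutoff, even if it was created months earlier), the loop from the question can be sketched like this in Python. This is only an illustration under assumptions: records are (id_employee, id_department, created_date) tuples, and the snapshot for a month uses the first day of the following month as the cutoff.

```python
from datetime import date


def state_as_of(records, cutoff):
    """Latest department per employee among records created before `cutoff`."""
    latest = {}
    for emp, dept, created in records:
        if created < cutoff and (emp not in latest or created > latest[emp][0]):
            latest[emp] = (created, dept)
    return {emp: dept for emp, (_created, dept) in latest.items()}


def monthly_states(records, year):
    """One snapshot per month of `year`, keyed by month number (1 to 12)."""
    records = list(records)
    snapshots = {}
    for month in range(1, 13):
        # Assumed cutoff: first day of the next month, mirroring Created_Date < cutoff.
        cutoff = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        snapshots[month] = state_as_of(records, cutoff)
    return snapshots
```

In SQL, the equivalent is usually a join between a list of month cutoffs and the transactions created before each cutoff, with the rank partitioned by employee and cutoff.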