sql user retention calculation - sql

I have a table records like this in Athena, one user one row in a month:
month, id
2020-05 1
2020-05 2
2020-05 5
2020-06 1
2020-06 5
2020-06 6
Need to calculate the percentage=( users come both prior month and current month )/(prior month total users).
Like in the above example, users come both in May and June 1,5 , May total user 3, this should calculate a percentage of 2/3*100
with monthly_mau AS
(SELECT month as mauMonth,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth,
count(distinct userid) AS monthly_mau
FROM records
GROUP BY month
ORDER BY month),
retention_mau AS
(SELECT
month,
count(distinct useridLeft) AS retention_mau
FROM (
(SELECT
userid as useridLeft,month as monthLeft,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth
FROM records ) AS prior
INNER JOIN
(SELECT
month ,
userid
FROM records ) AS current
ON
prior.useridLeft = current.userid
AND prior.nextMonth = current.month )
WHERE userid is not null
GROUP BY month
ORDER BY month )
SELECT *, cast(retention_mau AS double)/cast(monthly_mau AS double)*100 AS retention_mau_percentage
FROM monthly_mau as m
INNER JOIN monthly_retention_mau AS r
ON m.nextMonth = r.month
order by r.month
This gives me percentage as 100 which is not right. Any idea?

Hmmm . . . assuming you have one row per user per month, you can use window functions and conditional aggregation:
select month, count(*) as num_users,
sum(case when prev_month = dateadd('month', -1, month) then 1 else 0 end) as both_months
from (select r.*,
cast(concat(month, '-01') AS date) as month_date,
lag(cast(concat(month, '-01') AS date)) over (partition by id order by month) as prev_month_date
from records r
) r
group by month;

Related

Unify two tables with different number of rows and no common id

I want to merge these two tables into one. The idea is to create a table by month and show: month, number of unique customers who bought, number of invoices, number of products, total income, total income from product A
I'm having trouble adding the total income from product A per month since the table has two rows while the other results have four.
Example of table:
CustomerID
InvoiceID
ProductId
Date
Income
1
101
A
1/11/2016
600
2
103
B
12/10/2015
300
My query so far:
SELECT
MONTH(date) AS month,
COUNT (DISTINCT customerId) AS numOfCustomers,
SUM(income) AS sumOfIncome,
COUNT(invoiceId) AS numOfInvoice,
COUNT(productId) AS numOfProduct
FROM
x
WHERE
YEAR(date) = 2016
GROUP BY
MONTH(date)
SELECT
MONTH(date) AS month,
SUM(income) AS sumOfIncomeA
FROM
x
WHERE
(productId) = 'A'
AND YEAR(date) = 2016
GROUP BY
MONTH(date)
Here's a solution that first creates a big list of months. You can modify the "months" CTE to go back as far as you need. By default, this query will go back 83 years from today. After you have a good list of months, then you can join your data to it so that you are guaranteed to have all the months, and only sales data if present.
--First CTE "x" is used to create a sequence of 10 numbers.
WITH x as (
SELECT * FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) as x(a)
)
--Second CTE "y" creates a sequence of 1000 numbers.
, y as (
SELECT ROW_NUMBER() OVER(ORDER BY hundreds.a, tens.a, ones.a) as row_num
FROM x as ones, x as tens, x as hundreds
)
--Third CTE "months" creates a sequence of months going back in time from today.
--To go farther back than 1000 months, modify the "y" CTE to have a "thousands" (or more) table(s).
, months as (
SELECT
YEAR(DATEADD(month, -1 * y.row_num, GETDATE())) as [year]
, MONTH(DATEADD(month, -1 * y.row_num, GETDATE())) as [month]
, CAST(YEAR(DATEADD(month, -1 * y.row_num, GETDATE())) as nvarchar(6))
+ RIGHT('00' + CAST(MONTH(DATEADD(month, -1 * y.row_num, GETDATE())) as nvarchar(6)),2) as YEAR_MONTH
FROM y
)
--Main select.
--First FROM is a list of months so that we know for a fact we have all the months in the year.
--Then do a LEFT OUER JOIN to your main data. All months will be returned.
--If there is no match in the data table, then the value will be null.
--You can use an ISNULL(SUM(x.income),0) to convert nulls to 0.
SELECT
m.[month] AS month,
COUNT (DISTINCT x.customerId) AS numOfCustomers,
SUM(x.income) AS sumOfIncome,
COUNT(x.invoiceId) AS numOfInvoice,
COUNT(x.productId) AS numOfProduct
FROM months as m
LEFT OUTER JOIN x
ON YEAR(x.[date]) = m.[year]
AND MONTH(x.[date]) = m.[month]
WHERE
x.YEAR([date]) = 2016
GROUP BY
m.MONTH([date])
Like I wrote in the comment, as both tables could not have the same month present you need a FULL OUTER JOIN
so
The COALESCE for month is needed, as it could be that one of the month coulb be NULL
SELECT
COALESCE(t1.month,t2-month),
t1.numOfCustomers,t1.sumOfIncome,t1.numOfInvoice,t1.numOfProduct
,t2.sumOfIncomeA
FROM
(SELECT
MONTH(date) AS month,
COUNT (DISTINCT customerId) AS numOfCustomers,
SUM(income) AS sumOfIncome,
COUNT(invoiceId) AS numOfInvoice,
COUNT(productId) AS numOfProduct
FROM
x
WHERE
YEAR(date) = 2016
GROUP BY
MONTH(date)) t1
FULL OUTER JOIN
(SELECT
MONTH(date) AS month,
SUM(income) AS sumOfIncomeA
FROM
x
WHERE
(productId) = 'A'
AND YEAR(date) = 2016
GROUP BY
MONTH(date)) t2 ON t1.month = t2.month

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
edit Additional data below. Data collection begins on 11/23. No data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count(say, the current week and the three preceding weeks), you can use window functions:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt,
sum(count(distinct customer_id)) over(
order by date_trunc('week', event_date)
range between 3 week preceding and current row
) as rolling_cnt
from final
group by 1
Rolling distinct counts are quite difficult in RedShift. One method is a self-join and aggregation:
select t.date,
count(distinct case when tprev.date >= t.date - interval '6 day' then customer_id end) as trailing_7,
count(distinct customer_id) as trailing_30
from t join
t tprev
on tprev.date >= t.date - interval '29 day' and
tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and end time periods of being counted. This is a pain with two different time frames. Here is what it looks like for one.
The idea is to
Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
Do a cumulative sum of this count.
The code looks something like this:
with cd as (
select customer_id, date,
lead(date) over (partition by customer_id order by date) as next_date,
sum(sum(inc)) over (partition by customer_id order by date) as cnt
from ((select t.customer_id, t.date, 1 as inc
from t
) union all
(select t.customer_id, t.date + interval '7 day', -1
from t
)
) tt
),
cd2 as (
select customer_id, min(date) as enter_date, max(date) as exit_date
from (select cd.*,
sum(case when cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
from (select cd.*,
lag(cnt) over (partition by customer_id order by date) as prev_cnt
from cd
) cd
) cd
group by customer_id, grp
having max(cnt) > 0
)
select dte, sum(sum(inc)) over (order by dte)
from ((select customer_id, enter_date as dte, 1 as inc
from cd2
) union all
(select customer_id, exit_date as dte, -1 as inc
from cd2
)
) cd2
group by dte;

Get last data recorded of the date and group it by month

tbl_totalMonth has id,time, date and kwh column.
I want to get the last recorded data of the months and group it per month so the result would be the name of the month and kwh.
the result should be something like this:
month | kwh
------------
January | 150
February | 400
the query I tried: (but it returns the max kwh not the last kwh recorded)
SELECT DATENAME(MONTH, a.date) as monthly, max(a.kwh) as kwh
from tbl_totalMonth a
WHERE date > = DATEADD(yy,DATEDIFF(yy,0, GETDATE() -1 ),0)
group by DATENAME(MONTH, a.date)
I suspect you need something quite different:
select *
from (
select *
, row_number() over(partition by month(a.date), year(a.date) order by a.date DESC) as rn
from tbl_totalMonth a
WHERE date > = DATEADD(yy,DATEDIFF(yy,0, GETDATE() -1 ),0)
) d
where rn = 1
To get "the last kwh recorded (per month)" you need to use row_number() which - per month - will order the rows (descending) and give each one a row number. When that number is 1 you have "the most recent" row for that month, and you won't need group by at all.
You could use group by and month
select datename(month, date), sum(kwh)
from tbl_totalMonth
where date = (select max(date) from tbl_totalMonth )
group by datename(month, date)
if you need only the last row for each month then youn should use
select datename(month, date), khw
from tbl_totalMonth a
inner join (
select max(date) as max_date
from tbl_totalMonth
group by month(date)) t on t.max_date = a.date

How can I count users in a month that were not present in the month before?

I am trying to count unique users on a monthly basis that were not present in the previous month. So if a user has a record for January and then another one for February, then I would only count January for that user.
user_id time
a1 1/2/17
a1 2/10/17
a2 2/18/17
a4 2/5/17
a5 3/25/17
My results should look like this
Month User Count
January 1
February 2
March 1
I'm not really familiar with BigQuery, but here's how I would solve the problem using TSQL. I imagine that you'd be able to use similar logic in BigQuery.
1). Order the data by user_id first, and then time. In TSQL, you can accomplish this with the following and store it in a common table expression, which you will query in the step after this.
;WITH cte AS
(
select ROW_NUMBER() OVER (PARTITION BY [user_id] ORDER BY [time]) AS rn,*
from dbo.employees
)
2). Next query for only the rows with rn = 1 (the first occurrence for a particular user) and group by the month.
select DATENAME(month, [time]) AS [Month], count(*) AS user_count
from cte
where rn = 1
group by DATENAME(month, [time])
This is assuming that 2017 is the only year you're dealing with. If you're dealing with more than one year, you probably want step #2 to look something like this:
select year([time]) as [year], DATENAME(month, [time]) AS [month],
count(*) AS user_count
from cte
where rn = 1
group by year([time]), DATENAME(month, [time])
First aggregate by the user id and the month. Then use lag() to see if the user was present in the previous month:
with du as (
select date_trunc(time, month) as yyyymm, user_id
from t
group by date_trunc(time, month)
)
select yyyymm, count(*)
from (select du.*,
lag(yyyymm) over (partition by user_id order by yyyymm) as prev_yyyymm
from du
) du
where prev_yyyymm is not null or
prev_yyyymm < date_add(yyyymm, interval 1 month)
group by yyyymm;
Note: This uses the date functions, but similar functions exist for timestamp.
The way I understood question is - to exclude user to be counted in given month only if same user presented in previous month. But if same user present in few months before given, but not in previous - user should be counted.
If this is correct - Try below for BigQuery Standard SQL
#standardSQL
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
you can test / play with above using below example with dummy data from your question
#standardSQL
WITH yourTable AS (
SELECT 'a1' AS user_id, '1/2/17' AS time UNION ALL
SELECT 'a1', '2/10/17' UNION ALL
SELECT 'a2', '2/18/17' UNION ALL
SELECT 'a4', '2/5/17' UNION ALL
SELECT 'a5', '3/25/17'
)
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
The output is
Year Month User_Count
2017 January 1
2017 February 2
2017 March 1
Try this query:
SELECT
t1.d,
count(DISTINCT t1.user_id)
FROM
(
SELECT
EXTRACT(MONTH FROM time) AS d,
--EXTRACT(MONTH FROM time)-1 AS d2,
user_id
FROM nbitra.tmp
) t1
LEFT JOIN
(
SELECT
EXTRACT(MONTH FROM time) AS d,
user_id
FROM nbitra.tmp
) t2
ON t1.d = t2.d+1
WHERE
(
t1.user_id <> t2.user_id --User is in previous month
OR t2.user_id IS NULL --To handle january, since there is no previous month to compare to
)
GROUP BY t1.d;

How to calculate retention month over month using SQL

Trying to get a basic table that shows retention from one month to the next. So if someone buys something last month and they do so the next month it gets counted.
month, num_transactions, repeat_transactions, retention
2012-02, 5, 2, 40%
2012-03, 10, 3, 30%
2012-04, 15, 8, 53%
So if everyone that bought last month bought again the following month you have 100%.
So far I can only calculate stuff manually. This gives me the rows that have been seen in both months:
select count(*) as num_repeat_buyers from
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id and
to_char(transaction.timestamp, 'YYYY-MM') = '2012-03'
) as table1,
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id and
to_char(transaction.timestamp, 'YYYY-MM') = '2012-04'
) as table2
where table1.email = table2.email
This is not right but I feel like I can use some of Postgres' windowing functions. Keep in mind the windowing functions don't let you specify WHERE clauses. You mostly have access to the previous rows and the preceding rows:
select month, count(*) as num_transactions, count(*) over (PARTITION BY month ORDER BY month)
from
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id
order by
month
) as transactions_by_month
group by
month
Given the following test table (which you should have provided):
CREATE TEMP TABLE transaction (buyer_id int, tstamp timestamp);
INSERT INTO transaction VALUES
(1,'2012-01-03 20:00')
,(1,'2012-01-05 20:00')
,(1,'2012-01-07 20:00') -- multiple transactions this month
,(1,'2012-02-03 20:00') -- next month
,(1,'2012-03-05 20:00') -- next month
,(2,'2012-01-07 20:00')
,(2,'2012-03-07 20:00') -- not next month
,(3,'2012-01-07 20:00') -- just once
,(4,'2012-02-07 20:00'); -- just once
Table auth_user is not relevant to the problem.
Using tstamp as column name since I don't use base types as identifiers.
I am going to use the window function lag() to identify repeated buyers. To keep it short I combine aggregate and window functions in one query level. Bear in mind that window functions are applied after aggregate functions.
WITH t AS (
SELECT buyer_id
,date_trunc('month', tstamp) AS month
,count(*) AS item_transactions
,lag(date_trunc('month', tstamp)) OVER (PARTITION BY buyer_id
ORDER BY date_trunc('month', tstamp))
= date_trunc('month', tstamp) - interval '1 month'
OR NULL AS repeat_transaction
FROM transaction
WHERE tstamp >= '2012-01-01'::date
AND tstamp < '2012-05-01'::date -- time range of interest.
GROUP BY 1, 2
)
SELECT month
,sum(item_transactions) AS num_trans
,count(*) AS num_buyers
,count(repeat_transaction) AS repeat_buyers
,round(
CASE WHEN sum(item_transactions) > 0
THEN count(repeat_transaction) / sum(item_transactions) * 100
ELSE 0
END, 2) AS buyer_retention
FROM t
GROUP BY 1
ORDER BY 1;
Result:
month | num_trans | num_buyers | repeat_buyers | buyer_retention_pct
---------+-----------+------------+---------------+--------------------
2012-01 | 5 | 3 | 0 | 0.00
2012-02 | 2 | 2 | 1 | 50.00
2012-03 | 2 | 2 | 1 | 50.00
I extended your question to provide for the difference between the number of transactions and the number of buyers.
The OR NULL for repeat_transaction serves to convert FALSE to NULL, so those values do not get counted by count() in the next step.
-> SQLfiddle.
This uses CASE and EXISTS to get repeated transactions:
SELECT
*,
CASE
WHEN num_transactions = 0
THEN 0
ELSE round(100.0 * repeat_transactions / num_transactions, 2)
END AS retention
FROM
(
SELECT
to_char(timestamp, 'YYYY-MM') AS month,
count(*) AS num_transactions,
sum(CASE
WHEN EXISTS (
SELECT 1
FROM transaction AS t
JOIN auth_user AS u
ON t.buyer_id = u.id
WHERE
date_trunc('month', transaction.timestamp)
+ interval '1 month'
= date_trunc('month', t.timestamp)
AND auth_user.email = u.email
)
THEN 1
ELSE 0
END) AS repeat_transactions
FROM
transaction
JOIN auth_user
ON transaction.buyer_id = auth_user.id
GROUP BY 1
) AS summary
ORDER BY 1;
EDIT: Changed from minus 1 month to plus 1 month after reading the question again. My understanding now is that if someone buy something in 2012-02, and then buy something again in 2012-03, then his or her transactions in 2012-02 are counted as retention for the month.