Rolling unique count based on condition over last 31 days per day - sql

I have a table orders with information regarding the time an order was placed and who made that order.
order_timestamp user_id
-------------------- ---------
1-JAN-20 02.56.12 123
3-JAN-20 12.01.01 533
23-JAN-20 08.42.18 123
12-JAN-20 02.53.59 238
19-JAN-20 02.33.72 34
Using this information, I would like to calculate on a per day basis, a count of distinct users who placed only one order in the previous 31 days, resulting in a table as
date distinct_user_count
----------- ---------------------
1-JAN-20 8
2-JAN-20 10
3-JAN-20 11
(i.e., in the 31 days up to and including 1 Jan 2020, 8 unique users ordered only once, and so on.)
Simply put: for every day, take the 31-day window ending on that day, count the orders (rows in the table) for each user in that window, and if a user's count is exactly 1, count that user toward that day.
I can write the query to count those who ordered only once as:
with temp as (
select
user_id,
count(*) as order_count
from
orders
where
trunc(order_timestamp) >= trunc(systimestamp - interval '31' day)
group by
user_id
)
select
user_id,
order_count
from
temp
where
order_count = 1
but am unsure on how to implement the counting per date. Please can you assist in helping me to complete/write the query? Thanks for supporting in advance.

You can use two levels of grouping and a self join as follows:
Select dt, count(1) as cnt
from
(Select distinct trunc(t1.order_timestamp) as dt,
t1.user_id
From your_table t1
join your_table t2
On t1.user_id = t2.user_id
And trunc(t2.order_timestamp) between trunc(t1.order_timestamp - interval '31' day)
and trunc(t1.order_timestamp)
Group by t1.user_id, trunc(t1.order_timestamp)
Having count(1) = 1)
Group by dt;
Or you can use NOT EXISTS as follows:
Select trunc(t1.order_timestamp),
Count(1) as cnt
From your_table t1
Where not exists
(Select 1
From your_table t2
Where t1.rowid <> t2.rowid
And t1.user_id = t2.user_id
And trunc(t2.order_timestamp) between trunc(t1.order_timestamp - interval '31' day)
and trunc(t1.order_timestamp))
Group by trunc(t1.order_timestamp)
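To make the rule itself easier to check, here is a minimal Python sketch of it, independent of any SQL dialect. It assumes the window is the 31 calendar days ending on (and including) the reporting day; the function name and the reporting days are made up for illustration, and the rows are the sample rows from the question with the time of day dropped.

```python
from collections import Counter
from datetime import date, timedelta

def single_order_users_per_day(orders, days, window=31):
    """For each day, count users with exactly one order in the window ending that day.

    orders: list of (order_date, user_id) pairs; days: dates to report on.
    """
    result = {}
    for day in days:
        start = day - timedelta(days=window - 1)  # assumption: window includes `day`
        per_user = Counter(uid for d, uid in orders if start <= d <= day)
        result[day] = sum(1 for n in per_user.values() if n == 1)
    return result

# Sample rows from the question (time of day dropped).
orders = [
    (date(2020, 1, 1), 123),
    (date(2020, 1, 3), 533),
    (date(2020, 1, 23), 123),
    (date(2020, 1, 12), 238),
    (date(2020, 1, 19), 34),
]
days = [date(2020, 1, 1), date(2020, 1, 23), date(2020, 2, 15)]
counts = single_order_users_per_day(orders, days)
for day in days:
    print(day, counts[day])
```

Note that this sketch reports on any day you pass in, including days without orders, whereas the queries above only produce rows for dates on which an order exists.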

Related

SQL Retention Rates

I am trying to construct a rolling retention measure but am having troubles figuring how to do it in redshift.
I have defined retention as the intersection between two sets of users. The first is a cohort of distinct user ids who, as of 90 days before today's date, had been active at least once in the preceding 30 days (between 90 and 120 days before today). The second is the subset of those users who were active in the last 30 days from today.
Retention = today's 30-day active users who were in the original cohort 90 days ago / 30-day active users 90 days ago
My sessions table looks like this:
id   created_date
---- ------------
1    2021-03-04
1    2021-01-01
1    2020-12-15
2    2021-02-17
The only way I can seem to do this is as follows:
Create a temp table and insert into it for today's date.
with t1 as (
select distinct customer_id id
from sessions
where created_date >= dateadd('day', -29, current_date)
)
, t2 as (
select distinct customer_id id
from sessions
where created_date <= dateadd('day', -89, current_date)
and created_date >= dateadd('day', -119, current_date)
)
select current_date,
count(t1.id) as original,
count(t2.id) as current,
round(cast(count(t2.id) as float) / cast(count(t1.id) as float), 2) as ratio
into temp table temp1
from t1
left join t2
on t1.id = t2.id
Run an insert statement into the temp table multiple times subtracting one day from current date in each query
insert into temp1
with t1 as (
select distinct customer_id id
from sessions
where created_date >= dateadd('day', -29, current_date-1)
)
, t2 as (
select distinct customer_id id
from sessions
where created_date <= dateadd('day', -89, current_date-1)
and created_date >= dateadd('day', -119, current_date-1)
)
select current_date-1,
count(t1.id) as original,
count(t2.id) as current,
round(cast(count(t2.id) as float) / cast(count(t1.id) as float), 2) as ratio
from t1
left join t2
on t1.id = t2.id
Obtain this table with a daily retention rate for all days so far in 2021
The column original is the user cohort of 30 day active users 90 days ago from the reference date.
The current column is the number of users from the cohort in the original column that are 30 day active users at the reference date.
Step 1 returns only the first row 2021-03-05 and step 2 gives me the other row.
date        original  current  ratio
----------  --------  -------  -----
2021-03-05  100       70       0.7
2021-03-04  100       60       0.6
This process is obviously very inefficient, and I am trying to figure out whether there is a faster, easier way to do it. The issue is that I need to compare a distinct user cohort from 3 months ago and then see how many of the users from that cohort are still active today.
All help will be greatly appreciated!
If you want to get the number of users for 30 days today and 90 days ago for each date, the query is:
with t1 as (
select
s2.created_date,
count(distinct s1.customer_id) as cnt30
from sessions s1 inner join
(select distinct created_date from sessions) s2
on dateadd('day', -29, s2.created_date)<=s1.created_date
and s1.created_date<=s2.created_date
group by s2.created_date
)
select a1.created_date,
a1.cnt30 as original,
a2.cnt30 as current,
round(cast(a2.cnt30 as float) / cast(a1.cnt30 as float), 2) as ratio
from t1 as a1 inner join t1 as a2
on dateadd('day', -89, a1.created_date)=a2.created_date
order by 1
Using the subquery in the select-list, the query is:
with t1 as (
select
s2.created_date,
(select count(distinct s1.customer_id) from sessions s1
where dateadd('day', -29, s2.created_date)<=s1.created_date
and s1.created_date<=s2.created_date) as cnt30
from
(select distinct created_date from sessions) s2
)
select a1.created_date,
a1.cnt30 as original,
a2.cnt30 as current,
round(cast(a2.cnt30 as float) / cast(a1.cnt30 as float), 2) as ratio
from t1 as a1 inner join t1 as a2
on dateadd('day', -89, a1.created_date)=a2.created_date
order by 1
First, use joins and subqueries to calculate the number of unique IDs for the last 30 days on each date.
Next, join the same tables and output the number of unique IDs on the current day and 90 days ago.
Note that I've never used Redshift, so I wrote this based on your query and common SQL syntax. I hope my answer helps you.
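For comparison, here is a small Python sketch of the retention definition from the question, where the numerator is the intersection of the cohort with the recently active users rather than a plain count. The window boundaries (89 to 119 days back for the cohort, 0 to 29 days back for recent activity) are taken from the question's own CTEs; the function names and the sample rows are hypothetical.

```python
from datetime import date, timedelta

def active_users(sessions, start, end):
    """Distinct user ids with at least one session in [start, end]."""
    return {uid for uid, d in sessions if start <= d <= end}

def retention(sessions, ref_date):
    """Return (original, current, ratio) for one reference date."""
    cohort = active_users(
        sessions, ref_date - timedelta(days=119), ref_date - timedelta(days=89)
    )
    recent = active_users(sessions, ref_date - timedelta(days=29), ref_date)
    retained = cohort & recent  # cohort members who are still active
    ratio = round(len(retained) / len(cohort), 2) if cohort else None
    return len(cohort), len(retained), ratio

# Hypothetical sessions: (user id, session date).
sessions = [
    (1, date(2020, 11, 20)), (1, date(2021, 3, 4)),   # in cohort, still active
    (2, date(2020, 12, 1)), (2, date(2021, 1, 10)),   # in cohort, not active lately
    (3, date(2021, 2, 17)),                           # active lately, not in cohort
    (4, date(2020, 12, 6)), (4, date(2021, 2, 4)),    # both window edges
]
print(retention(sessions, date(2021, 3, 5)))  # (3, 2, 0.67)
```

Calling `retention` once per date in a range gives the daily table without re-running an insert for every day.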

Filter customers with at least 3 transactions a year for the past 2 years Presto/SQL

I have a table of customer transactions called cust_trans where each transaction made by a customer is stored as one row. I have another column called visit_date that contains the transaction date. I would like to filter the customers who transact at least 3 times a year for the past 2 years.
The data looks like below
Id visit_date
---- ------
1 01/01/2019
1 01/02/2019
1 01/01/2019
1 02/01/2020
1 02/01/2020
1 03/01/2020
1 03/01/2020
2 01/02/2019
3 02/04/2019
I would like to know the customers who visited at least 3 times every year for the past two years,
i.e., I want the output below.
id
---
1
From the customer table, only one person visited at least 3 times a year for 2 years.
I tried the query below, but it only checks whether the total number of visits is greater than or equal to 3:
select id
from
cust_scan
GROUP by
id
having count(visit_date) >= 3
and year(date(max(visit_date)))-year(date(min(visit_date))) >=2
I would appreciate any help, guidance or suggestions
One option would be to generate a list of distinct ids, cross join it with the last two years, and then bring in the original table with a left join. You can then aggregate to count how many visits each id had each year. The final step is to aggregate again and filter with a having clause:
select t.id
from (
select i.id, y.yr, count(c.id) cnt
from (select distinct id from cust_scan) i
cross join (values
(date_trunc('year', current_date)),
(date_trunc('year', current_date) - interval '1' year)
) as y(yr)
left join cust_scan c
on i.id = c.id
and c.visit_date >= y.yr
and c.visit_date < y.yr + interval '1' year
group by i.id, y.yr
) t
group by t.id
having min(cnt) >= 3
Another option would be to use two correlated subqueries:
select distinct id
from cust_scan c
where
(
select count(*)
from cust_scan c1
where
c1.id = c.id
and c1.visit_date >= date_trunc('year', current_date)
and c1.visit_date < date_trunc('year', current_date) + interval '1' year
) >= 3
and (
select count(*)
from cust_scan c1
where
c1.id = c.id
and c1.visit_date >= date_trunc('year', current_date) - interval '1' year
and c1.visit_date < date_trunc('year', current_date)
) >= 3
I assume you mean calendar years. I think I would use two levels of aggregation:
select ct.id
from (select ct.id, year(visit_date) as yyyy, count(*) as cnt
from cust_trans ct
where ct.visit_date >= '2019-01-01' -- or whatever
group by ct.id, year(visit_date)
) ct
group by ct.id
having count(*) = 2 and -- both years
min(cnt) >= 3; -- at least three transactions
If you want the last two complete years, just change the where clause in the subquery.
You can use a similar idea -- of two aggregations -- if you want the last two years relative to the current date. That would be two full years, rather than 1 and some fraction of the current year.
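The two-level aggregation idea can be sketched in Python as follows: first count rows per (id, year), then keep the ids whose count reaches the threshold in every year of interest. This is only an illustration of the logic, assuming calendar years; the function name is made up, and the day/month order of the sample dates is a guess that does not affect the result.

```python
from collections import Counter
from datetime import date

def loyal_customers(visits, years, min_visits=3):
    """Ids with at least `min_visits` rows in every year listed in `years`."""
    # First level: visits per (id, year).
    per_year = Counter((cid, d.year) for cid, d in visits if d.year in years)
    # Second level: keep ids that meet the threshold in all years.
    ids = {cid for cid, _ in visits}
    return sorted(
        cid for cid in ids
        if all(per_year[(cid, y)] >= min_visits for y in years)
    )

# Sample rows from the question.
visits = [
    (1, date(2019, 1, 1)), (1, date(2019, 2, 1)), (1, date(2019, 1, 1)),
    (1, date(2020, 1, 2)), (1, date(2020, 1, 2)),
    (1, date(2020, 1, 3)), (1, date(2020, 1, 3)),
    (2, date(2019, 2, 1)),
    (3, date(2019, 4, 2)),
]
print(loyal_customers(visits, years=(2019, 2020)))  # [1]
```

A missing (id, year) pair counts as zero here, which plays the same role as the left join in the first query or the `count(*) = 2` check in the last one.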

SQL count occurrences in window

I have user logins by date. My requirement is to track the number of users that have logged in during the past 90-day window.
I am new to both SQL in general and Teradata specifically and I can't get the window functionality to work as I need.
I need the following result, where ACTIVE_IN_WINDOW is a count of the unique USER_IDs that appear in the 90-day window preceding the DATE.
DATES ACTIVE_IN_WINDOW
12/06/2018 20
13/06/2018 45
14/06/2018 65
15/06/2018 73
17/06/2018 24
18/06/2018 87
19/06/2018 34
20/06/2018 51
Currently my script is as follows.
It is this line here that I can't get right:
COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND 0 FOLLOWING)
I suspect I need a different set of functions to make this work.
SELECT b.DATES , a.ACTIVE_IN_WINDOW
FROM
(
SELECT
CAST(CALENDAR_DATE AS DATE) AS DATES FROM SYS_CALENDAR.CALENDAR
WHERE DATES BETWEEN ADD_MONTHS(CURRENT_DATE, - 10) AND CURRENT_DATE
) b
LEFT JOIN
(
SELECT USER_ID , EVT_DT
, COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND 0 FOLLOWING) AS ACTIVE_IN_WINDOW
FROM ENV0.R_ONBOARDING
) a
ON a.EVT_DT = b.DATES
ORDER BY b.DATES
Thank you for any assistance.
The logic is similar to Gordon's, but a non-equi-Join instead of a Correlated Scalar Subquery is usually more efficient on Teradata:
SELECT b.DATES , Count(DISTINCT USER_ID)
FROM
(
SELECT CALENDAR_DATE AS DATES
FROM SYS_CALENDAR.CALENDAR
WHERE DATES BETWEEN Add_Months(Current_Date, - 10) AND Current_Date
) b
LEFT JOIN
( -- apply DISTINCT before aggregation to reduce intermediate spool
SELECT DISTINCT USER_ID, EVT_DT
FROM ENV0.R_ONBOARDING
) AS a
ON a.EVT_DT BETWEEN Add_Months(b.DATES,-3) AND b.DATES
GROUP BY 1
ORDER BY 1
Of course this will require a large spool and much CPU.
Edit:
Switching to weeks reduces the overhead. I'm using dates instead of week numbers (it's easier to modify for other ranges):
SELECT b.Week , Count(DISTINCT USER_ID)
FROM
( -- Return only Mondays instead of DISTINCT over all days
SELECT calendar_date AS Week
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN Add_Months(Current_Date, -9) AND Current_Date
AND day_of_week = 2 -- 2 = Monday
) b
LEFT JOIN
(
SELECT DISTINCT USER_ID,
-- td_monday returns the previous Monday, but we need the following monday
-- covers the previous Tuesday up to the current Monday
Td_Monday(EVT_DT+6) AS PERIOD_WEEK
FROM ENV0.R_ONBOARDING
-- You should add another condition to limit the actually covered date range, e.g.
-- where EVT_DT BETWEEN Add_Months(b.DATES,-13) AND b.DATES
) AS a
ON a.PERIOD_WEEK BETWEEN b.Week-(12*7) AND b.Week
GROUP BY 1
ORDER BY 1
Explain should show that the calendar is duplicated in preparation for the product join; if not, you might need to materialize the dates in a Volatile Table. It is better not to use sys_calendar: it has no statistics, so the optimizer doesn't know how many days there are per week/month/year, etc. Check your system; there should be a calendar table designed for your company's needs (with stats on all columns).
If your data is not too big, a subquery might be the simplest method:
SELECT c.dte,
(SELECT COUNT(DISTINCT o.USER_ID)
FROM ENV0.R_ONBOARDING o
WHERE o.EVT_DT > ADD_MONTHS(dte, -3) AND
o.EVT_DT <= dte
) as three_month_count
FROM (SELECT CAST(CALENDAR_DATE AS DATE) AS dte
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN ADD_MONTHS(CURRENT_DATE, - 10) AND CURRENT_DATE
) c;
You might want to start on a shorter timeframe than 3 months to see how the query performs.
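If it helps to see the intended result outside SQL, here is a minimal Python sketch of the same idea: for every calendar day, collect the distinct users whose login falls in the trailing window. It assumes a 90-day window that includes the reporting day (the queries above use 3 months instead); the function name and the login rows are hypothetical.

```python
from datetime import date, timedelta

def active_in_window(logins, first_day, last_day, window_days=90):
    """Return [(day, distinct_user_count)] for every day in [first_day, last_day]."""
    out = []
    day = first_day
    while day <= last_day:
        start = day - timedelta(days=window_days - 1)  # window includes `day`
        users = {uid for uid, d in logins if start <= d <= day}
        out.append((day, len(users)))
        day += timedelta(days=1)
    return out

# Hypothetical logins: (user id, event date).
logins = [
    ("a", date(2018, 3, 1)),
    ("a", date(2018, 6, 12)),
    ("b", date(2018, 3, 15)),
    ("c", date(2018, 6, 14)),
]
for day, n in active_in_window(logins, date(2018, 6, 12), date(2018, 6, 14)):
    print(day, n)
```

Like the calendar join, this scans the logins once per day, so it is only meant to pin down the expected numbers on a small sample, not to replace the set-based query.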

sql count statement with multiple date ranges

I have two tables with different appointment dates.
Table 1
id start date
1 5/1/14
2 3/2/14
3 4/5/14
4 9/6/14
5 10/7/14
Table 2
id start date
1 4/7/14
1 4/10/14
1 7/11/13
2 2/6/14
2 2/7/14
3 1/1/14
3 1/2/14
3 1/3/14
If I had set date ranges I could count each appointment date just fine, but I need the date range to change per record.
For each id in table 1 I need to count the distinct appointment dates from table 2, BUT only those within the 6 months prior to the start date from table 1.
Example: count all distinct appointment dates for id 1 (in table 2) with appointment dates between 12/1/13 and 5/1/14 (6 months prior). So the result is 2: 4/7/14 and 4/10/14 are within the range and 7/11/13 is outside of the 6 months.
So my issue is that the range changes for each record and I cannot figure out how to code this. For id 2 the date range will be 9/1/14-3/2/14, and so on.
Thanks everyone in advance!
Try this out:
SELECT id,
(
SELECT COUNT(*)
FROM table2
WHERE id = table1.id
AND table2.start_date >= DATEADD(MM,-6,table1.start_date)
) AS table2records
FROM table1
The DATEADD subtracts 6 months from the date in table1 and the subquery returns the count of related records.
I think what you want is a type of join.
select t1.id, count(t2.id) as numt2dates
from table1 t1 left outer join
table2 t2
on t1.id = t2.id and
t2.startdate between dateadd(month, -6, t1.startdate) and t1.startdate
group by t1.id;
The exact syntax for the date arithmetic depends on the database.
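Since the date arithmetic is the dialect-specific part, here is a rough Python sketch of the whole calculation with the month subtraction written out by hand. It assumes the dates in the question are in m/d/yy form and that "6 months prior" means the same day of the month six months earlier, clamped to the month's last day; `months_before` and `count_prior_appointments` are illustrative names, not library functions.

```python
import calendar
from datetime import date

def months_before(d, months):
    """Same day-of-month `months` earlier, clamped to the end of shorter months."""
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))

def count_prior_appointments(starts, appointments, months=6):
    """Per id, count distinct appointment dates in [start - months, start]."""
    counts = {}
    for cid, start in starts.items():
        earliest = months_before(start, months)
        in_range = {
            d for aid, d in appointments
            if aid == cid and earliest <= d <= start
        }
        counts[cid] = len(in_range)
    return counts

# Tables from the question, read as m/d/yy (an assumption).
table1 = {
    1: date(2014, 5, 1), 2: date(2014, 3, 2), 3: date(2014, 4, 5),
    4: date(2014, 9, 6), 5: date(2014, 10, 7),
}
table2 = [
    (1, date(2014, 4, 7)), (1, date(2014, 4, 10)), (1, date(2013, 7, 11)),
    (2, date(2014, 2, 6)), (2, date(2014, 2, 7)),
    (3, date(2014, 1, 1)), (3, date(2014, 1, 2)), (3, date(2014, 1, 3)),
]
counts = count_prior_appointments(table1, table2)
print(counts)
```

Ids with no matching rows in table 2 still appear with a count of 0, which corresponds to the left outer join in the query above.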
Thank you, this solved my issue. The query below may not help you, since you are not attempting to group by date, but the answer gave me the insight to resolve the issue I was facing.
I was attempting to count the total users matching a date criterion that had to be evaluated against multiple fields.
WITH data AS (
SELECT generate_series(
(date '2020-01-01')::timestamp,
NOW(),
INTERVAL '1 week'
) AS date
)
SELECT d.date, (SELECT COUNT(DISTINCT h.id) AS user_count
FROM history h WHERE h.startDate < d.date AND h.endDate > d.date
ORDER BY 1 DESC) AS total_records
FROM data d ORDER BY d.date DESC
2022-05-16, 15
2022-05-09, 13
2022-05-02, 13
...

How to calculate retention month over month using SQL

Trying to get a basic table that shows retention from one month to the next. So if someone buys something last month and they do so the next month it gets counted.
month, num_transactions, repeat_transactions, retention
2012-02, 5, 2, 40%
2012-03, 10, 3, 30%
2012-04, 15, 8, 53%
So if everyone that bought last month bought again the following month you have 100%.
So far I can only calculate stuff manually. This gives me the rows that have been seen in both months:
select count(*) as num_repeat_buyers from
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id and
to_char(transaction.timestamp, 'YYYY-MM') = '2012-03'
) as table1,
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id and
to_char(transaction.timestamp, 'YYYY-MM') = '2012-04'
) as table2
where table1.email = table2.email
This is not right, but I feel like I can use some of Postgres' window functions. Keep in mind that window functions don't let you specify WHERE clauses. You mostly have access to the preceding rows and the following rows:
select month, count(*) as num_transactions, count(*) over (PARTITION BY month ORDER BY month)
from
(select distinct
to_char(transaction.timestamp, 'YYYY-MM') as month,
auth_user.email
from
auth_user,
transaction
where
auth_user.id = transaction.buyer_id
order by
month
) as transactions_by_month
group by
month
Given the following test table (which you should have provided):
CREATE TEMP TABLE transaction (buyer_id int, tstamp timestamp);
INSERT INTO transaction VALUES
(1,'2012-01-03 20:00')
,(1,'2012-01-05 20:00')
,(1,'2012-01-07 20:00') -- multiple transactions this month
,(1,'2012-02-03 20:00') -- next month
,(1,'2012-03-05 20:00') -- next month
,(2,'2012-01-07 20:00')
,(2,'2012-03-07 20:00') -- not next month
,(3,'2012-01-07 20:00') -- just once
,(4,'2012-02-07 20:00'); -- just once
Table auth_user is not relevant to the problem.
Using tstamp as column name since I don't use base types as identifiers.
I am going to use the window function lag() to identify repeated buyers. To keep it short I combine aggregate and window functions in one query level. Bear in mind that window functions are applied after aggregate functions.
WITH t AS (
SELECT buyer_id
,date_trunc('month', tstamp) AS month
,count(*) AS item_transactions
,lag(date_trunc('month', tstamp)) OVER (PARTITION BY buyer_id
ORDER BY date_trunc('month', tstamp))
= date_trunc('month', tstamp) - interval '1 month'
OR NULL AS repeat_transaction
FROM transaction
WHERE tstamp >= '2012-01-01'::date
AND tstamp < '2012-05-01'::date -- time range of interest.
GROUP BY 1, 2
)
SELECT month
,sum(item_transactions) AS num_trans
,count(*) AS num_buyers
,count(repeat_transaction) AS repeat_buyers
,round(
CASE WHEN sum(item_transactions) > 0
THEN count(repeat_transaction) / sum(item_transactions) * 100
ELSE 0
END, 2) AS buyer_retention_pct
FROM t
GROUP BY 1
ORDER BY 1;
Result:
month | num_trans | num_buyers | repeat_buyers | buyer_retention_pct
---------+-----------+------------+---------------+--------------------
2012-01 | 5 | 3 | 0 | 0.00
2012-02 | 2 | 2 | 1 | 50.00
2012-03 | 2 | 2 | 1 | 50.00
I extended your question to provide for the difference between the number of transactions and the number of buyers.
The OR NULL for repeat_transaction serves to convert FALSE to NULL, so those values do not get counted by count() in the next step.
-> SQLfiddle.
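As a cross-check of the result table, the same logic can be sketched in plain Python: aggregate to one row per buyer and month, then flag a buyer as repeating when they also have a row for the month immediately before. This mirrors what the lag() comparison does, but it is a separate illustration rather than a translation of the query; the function name is made up, the date-range filter is left out, and the percentage divides repeat buyers by the number of transactions, as the query above does.

```python
from collections import Counter, defaultdict
from datetime import datetime

def monthly_retention(transactions):
    """Rows of (month, num_trans, num_buyers, repeat_buyers, pct), one per month."""
    # One entry per (buyer, month index), holding that buyer's transaction count.
    per_buyer_month = Counter()
    for buyer_id, ts in transactions:
        per_buyer_month[(buyer_id, ts.year * 12 + ts.month - 1)] += 1

    stats = defaultdict(lambda: [0, 0, 0])  # month -> [trans, buyers, repeat buyers]
    for (buyer_id, m), n in per_buyer_month.items():
        row = stats[m]
        row[0] += n
        row[1] += 1
        if (buyer_id, m - 1) in per_buyer_month:  # also bought the month before
            row[2] += 1

    rows = []
    for m in sorted(stats):
        trans, buyers, repeat = stats[m]
        label = f"{m // 12}-{m % 12 + 1:02d}"
        rows.append((label, trans, buyers, repeat, round(100 * repeat / trans, 2)))
    return rows

# Test rows from the answer above.
transactions = [
    (1, datetime(2012, 1, 3, 20)), (1, datetime(2012, 1, 5, 20)),
    (1, datetime(2012, 1, 7, 20)), (1, datetime(2012, 2, 3, 20)),
    (1, datetime(2012, 3, 5, 20)), (2, datetime(2012, 1, 7, 20)),
    (2, datetime(2012, 3, 7, 20)), (3, datetime(2012, 1, 7, 20)),
    (4, datetime(2012, 2, 7, 20)),
]
for row in monthly_retention(transactions):
    print(row)
```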
This uses CASE and EXISTS to get repeated transactions:
SELECT
*,
CASE
WHEN num_transactions = 0
THEN 0
ELSE round(100.0 * repeat_transactions / num_transactions, 2)
END AS retention
FROM
(
SELECT
to_char(timestamp, 'YYYY-MM') AS month,
count(*) AS num_transactions,
sum(CASE
WHEN EXISTS (
SELECT 1
FROM transaction AS t
JOIN auth_user AS u
ON t.buyer_id = u.id
WHERE
date_trunc('month', transaction.timestamp)
+ interval '1 month'
= date_trunc('month', t.timestamp)
AND auth_user.email = u.email
)
THEN 1
ELSE 0
END) AS repeat_transactions
FROM
transaction
JOIN auth_user
ON transaction.buyer_id = auth_user.id
GROUP BY 1
) AS summary
ORDER BY 1;
EDIT: Changed from minus 1 month to plus 1 month after reading the question again. My understanding now is that if someone buys something in 2012-02 and then buys something again in 2012-03, their transactions in 2012-02 are counted as retention for that month.