SQL Retention Rates - sql

I am trying to construct a rolling retention measure but am having trouble figuring out how to do it in Redshift.
I have defined retention as the intersection between two sets of users. The first is a cohort of distinct user ids who were active at least once in the 30-day window ending 90 days before today's date (that is, between 90 and 120 days ago). The second is the subset of those users who were also active in the last 30 days.
Retention = today's 30-day active users who were in the original cohort 90 days ago / 30-day active users 90 days ago
My sessions table looks like this:
id  created_date
--  ------------
1   2021-03-04
1   2021-01-01
1   2020-12-15
2   2021-02-17
The only way I can seem to do this is as follows:
Create a temp table and insert into it for today's date.
with t1 as (
select distinct customer_id id
from sessions
where created_date >= dateadd('day', -29, current_date)
)
, t2 as (
select distinct customer_id id
from sessions
where created_date <= dateadd('day', -89, current_date)
and created_date >= dateadd('day', -119, current_date)
)
select current_date,
count(t1.id) as original,
count(t2.id) as current,
round(cast(count(t2.id) as float) / cast(count(t1.id) as float), 2) as ratio
into temp table temp1
from t1
left join t2
on t1.id = t2.id
Run an insert statement into the temp table multiple times subtracting one day from current date in each query
insert into temp1
with t1 as (
select distinct customer_id id
from sessions
where created_date >= dateadd('day', -29, current_date-1)
)
, t2 as (
select distinct customer_id id
from sessions
where created_date <= dateadd('day', -89, current_date-1)
and created_date >= dateadd('day', -119, current_date-1)
)
select current_date-1,
count(t1.id) as original,
count(t2.id) as current,
round(cast(count(t2.id) as float) / cast(count(t1.id) as float), 2) as ratio
from t1
left join t2
on t1.id = t2.id
Obtain this table with a daily retention rate for all days so far in 2021
The column original is the user cohort of 30 day active users 90 days ago from the reference date.
The current column is the number of users from the cohort in the original column that are 30 day active users at the reference date.
Step 1 returns only the first row 2021-03-05 and step 2 gives me the other row.
date        original  current  ratio
----------  --------  -------  -----
2021-03-05  100       70       0.7
2021-03-04  100       60       0.6
This process is obviously very inefficient, and I am trying to figure out whether there is a faster, easier way to do it. The issue is that I need to take a distinct user cohort from 3 months ago and then see how many of the users from that cohort are still active today.
All help will be greatly appreciated!

If you want, for each date, the number of distinct users active in the 30 days ending on that date and in the 30 days ending 89 days earlier, the query is:
with t1 as (
-- distinct users active in the 30 days ending on each date
select
s2.created_date,
count(distinct s1.customer_id) as cnt30
from sessions s1 inner join
(select distinct created_date from sessions) s2
on dateadd('day', -29, s2.created_date)<=s1.created_date
and s1.created_date<=s2.created_date
group by s2.created_date
)
-- pair each date (a1) with the date 89 days earlier (a2)
select a1.created_date,
a1.cnt30 as original,
a2.cnt30 as current,
round(cast(a2.cnt30 as float) / cast(a1.cnt30 as float), 2) as ratio
from t1 as a1 inner join t1 as a2
on dateadd('day', -89, a1.created_date)=a2.created_date
order by 1
Using the subquery in the select-list, the query is:
with t1 as (
select
s2.created_date,
(select count(distinct s1.customer_id) from sessions s1
where dateadd('day', -29, s2.created_date)<=s1.created_date
and s1.created_date<=s2.created_date) as cnt30
from
(select distinct created_date from sessions) s2
)
select a1.created_date,
a1.cnt30 as original,
a2.cnt30 as current,
round(cast(a2.cnt30 as float) / cast(a1.cnt30 as float), 2) as ratio
from t1 as a1 inner join t1 as a2
on dateadd('day', -89, a1.created_date)=a2.created_date
order by 1
First, use joins and subqueries to calculate the number of unique IDs for the last 30 days on each date.
Next, join the same tables and output the number of unique IDs on the current day and 90 days ago.
Note that I've never used Redshift, so I've written this based on your query and common SQL syntax. I hope my answer helps you.
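The two queries above count each 30-day window separately. If the ratio should follow the question's definition (cohort members who are still active, divided by the cohort), the cohort and the active set have to be matched on the user id for each reference date. Below is a minimal sketch of that idea in Python with the standard sqlite3 module, chosen only so that it runs anywhere. SQLite's date() stands in for Redshift's dateadd(), and the ref_dates table and the function name rolling_retention are assumptions made for illustration, not part of the original schema.

```python
import sqlite3


def rolling_retention(rows, reference_dates):
    """rows: iterable of (customer_id, 'YYYY-MM-DD') session records.

    Returns a list of (date, original, current, ratio) tuples, newest first.
    Reference dates whose cohort is empty are omitted.
    """
    con = sqlite3.connect(":memory:")
    con.execute("create table sessions (customer_id integer, created_date text)")
    con.executemany("insert into sessions values (?, ?)", rows)
    # Hypothetical helper table holding the dates to report on.
    con.execute("create table ref_dates (d text)")
    con.executemany("insert into ref_dates values (?)", [(d,) for d in reference_dates])
    query = """
        with cohort as (
            -- users active 89 to 119 days before the reference date
            select distinct r.d, s.customer_id
            from ref_dates r
            join sessions s
              on s.created_date between date(r.d, '-119 days') and date(r.d, '-89 days')
        ),
        active as (
            -- users active in the 30 days ending on the reference date
            select distinct r.d, s.customer_id
            from ref_dates r
            join sessions s
              on s.created_date between date(r.d, '-29 days') and r.d
        )
        select c.d,
               count(c.customer_id) as original,
               count(a.customer_id) as current,
               round(1.0 * count(a.customer_id) / count(c.customer_id), 2) as ratio
        from cohort c
        left join active a
          on a.d = c.d and a.customer_id = c.customer_id
        group by c.d
        order by c.d desc
    """
    result = con.execute(query).fetchall()
    con.close()
    return result
```

Because the reference dates live in a table, one statement covers every day at once instead of one insert per day. In Redshift the same shape would need a real date list in place of ref_dates.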

Related

SQL - Get historic count of rows collected within a certain period by date

For many years I've been collecting data and I'm interested in knowing the historic counts of IDs that appeared in the last 30 days. The source looks like this
id   dates
---  ----------
1    2002-01-01
2    2002-01-01
3    2002-01-01
...  ...
3    2023-01-10
If I wanted to know the historic count of ids that appeared in the last 30 days I would do something like this
with total_counter as (
select id, count(id) counts
from source
group by id
),
unique_obs as (
select id
from source
where dates >= DATEADD(Day ,-30, current_date)
group by id
)
select count(distinct(id))
from unique_obs
left join total_counter
on total_counter.id = unique_obs.id;
The problem is that this returns only a single result: today's count, as determined by current_date.
I would like to see a table with such counts as if, for example, I had run this analysis yesterday, and the day before, and so on. So the expected result would be something like
counts  date
------  ----------
1235    2023-01-10
1234    2023-01-09
1265    2023-01-08
...     ...
7383    2022-12-11
so for example, let's say that if the current_date was 2023-01-10, my query would've returned 1235.
If you need a distinct count of ids from the 30 days up to and including each date, the query below should work:
WITH CTE_DATES
AS
(
--Create a list of anchor dates
SELECT DISTINCT
dates
FROM source
)
SELECT COUNT(DISTINCT s.id) AS "counts"
,D.dates AS "date"
FROM CTE_DATES D
LEFT JOIN source S ON S.dates BETWEEN DATEADD(DAY,-29,D.dates) AND D.dates --30 DAYS INCLUSIVE
GROUP BY D.dates
ORDER BY D.dates DESC
;
If the distinct count didn't matter, you could likely simplify with a rolling sum, only hitting the source table once:
SELECT S.dates AS "date"
,COUNT(1) AS "count_daily"
,SUM(COUNT(1)) OVER(ORDER BY S.dates DESC ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING) AS "count_rolling" --assumes there is at least one row for every day.
FROM source S
GROUP BY S.dates
ORDER BY S.dates DESC
;
This won't work, though, if you have gaps in your list of dates, as it will just include the latest 30 dates available. In that case, the first example without DISTINCT in the count will do the trick.
SELECT count(*) AS Counts,
dates AS Date
FROM source
WHERE dates >= DATEADD(DAY, -30, CURRENT_DATE)
GROUP BY dates
ORDER BY dates DESC
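To make the difference between a distinct count and a plain row count over the same 30-day window concrete, here is a small sketch in plain Python. It is only an illustration of the windowing logic, not a replacement for the SQL; the function name rolling_counts and the sample rows are made up for the example.

```python
from datetime import date, timedelta


def rolling_counts(rows, window_days=30):
    """rows: iterable of (id, date).

    For every date present in the data, returns
    {anchor_date: (distinct_ids_in_window, rows_in_window)}
    where the window is the window_days days up to and including the anchor.
    """
    rows = list(rows)
    out = {}
    for anchor in sorted({d for _, d in rows}, reverse=True):
        start = anchor - timedelta(days=window_days - 1)
        in_window = [i for i, d in rows if start <= d <= anchor]
        out[anchor] = (len(set(in_window)), len(in_window))
    return out
```

An id that appears several times inside the window raises the row count each time but the distinct count only once, which is why a rolling sum of daily counts cannot stand in for COUNT(DISTINCT ...).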

How to count only the working days between two dates?

I have the following table called vacations, where the employee number is displayed along with the start and end date of their vacations:
id_employee  start       end
-----------  ----------  ----------
1001         2020-12-24  2021-01-04
What I am looking for is to visualize the amount of vacation days that each employee had, but separating them by employee number, month, year and number of days; without taking into account non-business days (Saturdays, Sundays and holidays).
I have the following query, which manages to omit Saturday and Sunday from the posting:
SELECT id_employee,
EXTRACT(YEAR FROM t.Date) AS year,
EXTRACT(MONTH FROM t.Date) AS month,
SUM(WEEKDAY(`Date`) < 5) AS days
FROM (SELECT v.id_employee,
DATE_ADD(v.start, interval s.seq - 1 DAY) AS Date
FROM vacations v CROSS JOIN seq_1_to_100 s
WHERE DATE_ADD(v.start, interval s.seq - 1 DAY) <= v.end
ORDER BY v.id_employee, v.start, s.seq ) t
GROUP BY id_employee, EXTRACT(YEAR_MONTH FROM t.Date);
My question is: how could I, in addition to skipping the weekends, also skip the holidays? I suppose that I should create another table where the dates of those holidays are stored, but how could my query be adapted to perform the comparison?
If we consider that the employee 1001 took his vacations from 2020-12-24 to 2021-01-04 and we take Christmas and New Years as holidays, we should get the following result:
id_employee  month  year  days
-----------  -----  ----  ----
1001         12     2020  5
1001         1      2021  1
After you have created a table that stores the holiday dates, then you probably can do something like this:
SELECT id_employee,
EXTRACT(YEAR FROM t.Date) AS year,
EXTRACT(MONTH FROM t.Date) AS month,
SUM(CASE WHEN h.holiday_date IS NULL THEN WEEKDAY(`Date`) < 5 END) AS days
FROM (SELECT v.id_employee,
DATE_ADD(v.start, interval s.seq - 1 DAY) AS Date
FROM vacations v CROSS JOIN seq_1_to_100 s
WHERE DATE_ADD(v.start, interval s.seq - 1 DAY) <= v.end
ORDER BY v.id_employee, v.start, s.seq ) t
LEFT JOIN holidays h ON t.date=h.holiday_date
GROUP BY id_employee, EXTRACT(YEAR_MONTH FROM t.Date);
Assuming that the holidays table structure would be something like this:
CREATE TABLE holidays (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
holiday_date DATE,
holiday_description VARCHAR(255));
Then LEFT JOIN it to your current query and change the SUM() slightly by adding a CASE expression. If the condition ON t.date=h.holiday_date in the left join matches, h.holiday_date will hold a value and the day is skipped; otherwise it will be NULL, so only days where CASE WHEN h.holiday_date IS NULL holds are counted.
Here is an additional solution that is compatible with both MariaDB and any MySQL version that supports common table expressions:
WITH RECURSIVE cte AS
(SELECT id_employee, start, start lvdt, end FROM vacations
UNION ALL
SELECT id_employee, start, lvdt+INTERVAL 1 DAY, end FROM cte
WHERE lvdt+INTERVAL 1 DAY <=end)
SELECT id_employee,
YEAR(v.lvdt) AS year,
MONTH(v.lvdt) AS month,
SUM(CASE WHEN h.holiday_date IS NULL THEN WEEKDAY(v.lvdt) < 5 END) AS days
FROM cte v
LEFT JOIN holidays h
ON v.lvdt=h.holiday_date
GROUP BY id_employee,
YEAR(v.lvdt),
MONTH(v.lvdt);
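The logic of the queries above (expand each vacation into single days, drop weekends, drop dates found in the holidays table, then group by year and month) can be checked quickly outside the database. The following is a minimal sketch in Python; the function name working_days_by_month is hypothetical, and the holiday set simply mirrors the Christmas and New Year example from the question.

```python
from collections import Counter
from datetime import date, timedelta


def working_days_by_month(start, end, holidays):
    """Count working days between start and end (inclusive), per (year, month).

    Saturdays, Sundays and any date contained in holidays are skipped.
    """
    counts = Counter()
    day = start
    while day <= end:
        # date.weekday(): Monday is 0 and Sunday is 6, like MySQL's WEEKDAY()
        if day.weekday() < 5 and day not in holidays:
            counts[(day.year, day.month)] += 1
        day += timedelta(days=1)
    return dict(counts)
```

For employee 1001 this gives 5 days for December 2020 and 1 day for January 2021, matching the expected result in the question.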

Rolling unique count based on condition over last 31 days per day

I have a table orders with information regarding the time an order was placed and who made that order.
order_timestamp user_id
-------------------- ---------
1-JAN-20 02.56.12 123
3-JAN-20 12.01.01 533
23-JAN-20 08.42.18 123
12-JAN-20 02.53.59 238
19-JAN-20 02.33.72 34
Using this information, I would like to calculate on a per day basis, a count of distinct users who placed only one order in the previous 31 days, resulting in a table as
date distinct_user_count
----------- ---------------------
1-JAN-20 8
2-JAN-20 10
3-JAN-20 11
(i.e in the 31 days before and including 1st jan 2020, 8 unique users ordered only once, etc...)
Simply put: for every single day, look at the 31 days up to and including that day, count the number of orders (entries in the table) for every user in that period, and if that count is exactly 1, count that user for that day.
I can write the query to count those who ordered only once as:
with temp as (
select
user_id,
count(*) as order_count
from
orders
where
trunc(order_timestamp) >= trunc(systimestamp - interval '31' day)
group by
user_id
)
select
user_id,
order_count
from
temp
where
order_count=1
but am unsure on how to implement the counting per date. Please can you assist in helping me to complete/write the query? Thanks for supporting in advance.
You can use two levels of grouping and a self join as follows:
Select dt, count(1) as cnt
from
(Select distinct trunc(t1.order_timestamp) as dt,
t1.user_id
From your_table t1
join your_table t2
On t1.user_id = t2.user_id
And trunc(t2.order_timestamp) between trunc(t1.order_timestamp - interval '31' day)
and trunc(t1.order_timestamp)
Group by t1.user_id, trunc(t1.order_timestamp)
Having count(1) = 1)
Group by dt;
Or you can use NOT EXISTS as follows:
Select trunc(t1.order_timestamp) as dt,
Count(1) as cnt
From your_table t1
Where not exists
(Select 1
From your_table t2
Where t1.rowid <> t2.rowid
And t1.user_id = t2.user_id
And trunc(t2.order_timestamp) between trunc(t1.order_timestamp - interval '31' day)
and trunc(t1.order_timestamp))
Group by trunc(t1.order_timestamp);
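As a cross-check for either query, the requirement itself (per day, count the users with exactly one order in the window ending on that day) can be written as a short brute-force sketch in Python. The function name single_order_users_per_day is hypothetical, and the window is assumed to be the 31 days up to and including each day; adjust window_days if the window should be counted differently.

```python
from collections import Counter
from datetime import date, timedelta


def single_order_users_per_day(orders, days, window_days=31):
    """orders: iterable of (user_id, order_date); days: dates to report on.

    Returns {day: number of users with exactly one order in the
    window_days days up to and including day}.
    """
    orders = list(orders)
    result = {}
    for day in days:
        start = day - timedelta(days=window_days - 1)
        # orders per user inside the window
        per_user = Counter(u for u, d in orders if start <= d <= day)
        result[day] = sum(1 for n in per_user.values() if n == 1)
    return result
```

Unlike the SQL above, this reports on every day passed in, including days on which no order was placed.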

SQL count occurrences in window

I have user logins by date. My requirement is to track the number of users who have logged in during the past 90-day window.
I am new to both SQL in general and Teradata specifically, and I can't get the window functionality to work as I need.
I need the following result, where ACTIVE_IN_WINDOW is a count of the unique USER_IDs that appear in the 90-day window preceding the DATE.
DATES ACTIVE_IN_WINDOW
12/06/2018 20
13/06/2018 45
14/06/2018 65
15/06/2018 73
17/06/2018 24
18/06/2018 87
19/06/2018 34
20/06/2018 51
Currently my script is as follows.
It is this line here that I can't get right:
COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND 0 FOLLOWING)
I suspect I need a different set of functions to make this work.
SELECT b.DATES , a.ACTIVE_IN_WINDOW
FROM
(
SELECT
CAST(CALENDAR_DATE AS DATE) AS DATES FROM SYS_CALENDAR.CALENDAR
WHERE DATES BETWEEN ADD_MONTHS(CURRENT_DATE, - 10) AND CURRENT_DATE
) b
LEFT JOIN
(
SELECT USER_ID , EVT_DT
, COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND 0 FOLLOWING) AS ACTIVE_IN_WINDOW
FROM ENV0.R_ONBOARDING
) a
ON a.EVT_DT = b.DATES
ORDER BY b.DATES
Thank you for any assistance.
The logic is similar to Gordon's, but a non-equi join instead of a correlated scalar subquery is usually more efficient on Teradata:
SELECT b.DATES , Count(DISTINCT USER_ID)
FROM
(
SELECT CALENDAR_DATE AS DATES
FROM SYS_CALENDAR.CALENDAR
WHERE DATES BETWEEN Add_Months(Current_Date, - 10) AND Current_Date
) b
LEFT JOIN
( -- apply DISTINCT before aggregation to reduce intermediate spool
SELECT DISTINCT USER_ID, EVT_DT
FROM ENV0.R_ONBOARDING
) AS a
ON a.EVT_DT BETWEEN Add_Months(b.DATES,-3) AND b.DATES
GROUP BY 1
ORDER BY 1
Of course this will require a large spool and much CPU.
Edit:
Switching to weeks reduces the overhead, I'm using dates instead of week numbers (it's easier to modify for other ranges):
SELECT b.Week , Count(DISTINCT USER_ID)
FROM
( -- Return only Mondays instead of DISTINCT over all days
SELECT calendar_date AS Week
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN Add_Months(Current_Date, -9) AND Current_Date
AND day_of_week = 2 -- 2 = Monday
) b
LEFT JOIN
(
SELECT DISTINCT USER_ID,
-- td_monday returns the previous Monday, but we need the following monday
-- covers the previous Tuesday up to the current Monday
Td_Monday(EVT_DT+6) AS PERIOD_WEEK
FROM ENV0.R_ONBOARDING
-- You should add another condition to limit the actually covered date range, e.g.
-- where EVT_DT BETWEEN Add_Months(b.DATES,-13) AND b.DATES
) AS a
ON a.PERIOD_WEEK BETWEEN b.Week-(12*7) AND b.Week
GROUP BY 1
ORDER BY 1
Explain should show the calendar being duplicated as preparation for the product join; if not, you might need to materialize the dates in a volatile table. It is better not to use sys_calendar: it has no statistics, so the optimizer doesn't know, for example, how many days there are per week/month/year. Check your system; there should be a calendar table designed for your company's needs (with stats on all columns).
If your data is not too big, a subquery might be the simplest method:
SELECT c.dte,
(SELECT COUNT(DISTINCT o.USER_ID)
FROM ENV0.R_ONBOARDING o
WHERE o.EVT_DT > ADD_MONTHS(dte, -3) AND
o.EVT_DT <= dte
) as three_month_count
FROM (SELECT CAST(CALENDAR_DATE AS DATE) AS dte
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN ADD_MONTHS(CURRENT_DATE, - 10) AND CURRENT_DATE
) c;
You might want to start with a shorter timeframe than 3 months to see how the query performs.
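All of the queries above re-scan the events for every calendar date. Where that is too expensive, the same number can be maintained incrementally with a sliding window. The following is a rough sketch of that idea in Python, outside the database; it is not something Teradata does for you. The function name active_in_window is hypothetical, and the window is assumed to be the 90 days up to and including each date.

```python
from collections import Counter, deque
from datetime import date, timedelta


def active_in_window(events, days, window_days=90):
    """events: iterable of (user_id, event_date); days: ascending list of dates.

    Returns [(day, distinct users with an event in the window_days days
    up to and including day)].
    """
    # De-duplicate (user, date) pairs first, like SELECT DISTINCT in the join.
    pending = deque(sorted(set(events), key=lambda e: e[1]))
    in_window = deque()
    counts = Counter()  # user -> number of event days currently in the window
    result = []
    for day in days:
        # Pull in events up to and including the current day.
        while pending and pending[0][1] <= day:
            user, d = pending.popleft()
            in_window.append((user, d))
            counts[user] += 1
        # Evict events that have fallen out of the window.
        cutoff = day - timedelta(days=window_days)
        while in_window and in_window[0][1] <= cutoff:
            user, _ = in_window.popleft()
            counts[user] -= 1
            if counts[user] == 0:
                del counts[user]
        result.append((day, len(counts)))
    return result
```

Each event enters and leaves the window exactly once, so the work grows with the number of events plus the number of dates rather than with their product.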

sql count statement with multiple date ranges

I have two tables with different appointment dates.
Table 1
id start date
1 5/1/14
2 3/2/14
3 4/5/14
4 9/6/14
5 10/7/14
Table 2
id start date
1 4/7/14
1 4/10/14
1 7/11/13
2 2/6/14
2 2/7/14
3 1/1/14
3 1/2/14
3 1/3/14
If I had set date ranges I could count each appointment date just fine, but I need the date ranges to change.
For each id in table 1 I need to count the distinct appointment dates from table 2, BUT only those within
the 6 months prior to the start date from table 1.
Example: count all distinct appointment dates for id 1 (in table 2) with appointment dates between 12/1/13 and 5/1/14 (6 months prior). So the result is 2: 4/7/14 and 4/10/14 are within the range and 7/11/13 is outside of the 6 months.
So my issue is that the range changes for each record and I cannot seem to figure out how to code this. For id 2 the date range will be 9/1/14-3/2/14 and so on.
Thanks everyone in advance!
Try this out:
SELECT id,
(
SELECT COUNT(*)
FROM table2
WHERE id = table1.id
AND table2.start_date >= DATEADD(MM,-6,table1.start_date)
) AS table2records
FROM table1
The DATEADD subtracts 6 months from the date in table1 and the subquery returns the count of related records.
I think what you want is a type of join.
select t1.id, count(t2.id) as numt2dates
from table1 t1 left outer join
table2 t2
on t1.id = t2.id and
t2.startdate between dateadd(month, -6, t1.startdate) and t1.startdate
group by t1.id;
The exact syntax for the date arithmetic depends on the database.
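Since the date arithmetic differs per database, here is a small runnable sketch of the join approach using Python's built-in sqlite3 module, where date(x, '-6 months') plays the role of dateadd. It assumes the dates are stored as ISO text (YYYY-MM-DD), that the sample dates in the question are month/day/year, and that the column is called start_date; the function name count_prior_appointments is made up for the example.

```python
import sqlite3


def count_prior_appointments(table1, table2, months=6):
    """table1, table2: lists of (id, 'YYYY-MM-DD').

    Returns [(id, number of distinct table2 dates within the given number
    of months before, and up to, the table1 start date)].
    """
    con = sqlite3.connect(":memory:")
    con.execute("create table table1 (id integer, start_date text)")
    con.execute("create table table2 (id integer, start_date text)")
    con.executemany("insert into table1 values (?, ?)", table1)
    con.executemany("insert into table2 values (?, ?)", table2)
    rows = con.execute(
        """
        select t1.id, count(distinct t2.start_date) as numt2dates
        from table1 t1
        left join table2 t2
          on t2.id = t1.id
         and t2.start_date between date(t1.start_date, ?) and t1.start_date
        group by t1.id
        order by t1.id
        """,
        (f"-{months} months",),
    ).fetchall()
    con.close()
    return rows
```

The left join keeps ids from table1 that have no matching appointments, so they show up with a count of 0 rather than disappearing from the result.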
Thank you, this solved my issue. The following may not help you, since you are not attempting to group by date, but the answer gave me the insight to resolve the issue I was facing.
I was attempting to count the total users matching a date criterion that had to be evaluated against multiple fields.
WITH data AS (
SELECT generate_series(
(date '2020-01-01')::timestamp,
NOW(),
INTERVAL '1 week'
) AS date
)
SELECT d.date, (SELECT COUNT(DISTINCT h.id) AS user_count
FROM history h WHERE h.startDate < d.date AND h.endDate > d.date
ORDER BY 1 DESC) AS total_records
FROM data d ORDER BY d.date DESC
2022-05-16, 15
2022-05-09, 13
2022-05-02, 13
...