SQL Retention Cohort Analysis

I am trying to write a query for monthly retention, to calculate the percentage of users who return in each month after their initial start month.
TABLE: customer_order
fields: id, date, store_id
TABLE: customer
fields: id, person_id, job_id, first_time (bool)
This gets me the initial monthly cohorts based on each user's first order date:
SELECT first_job_month, COUNT(DISTINCT person_id) AS user_counts
FROM (
    SELECT DATE_TRUNC(MIN(CAST(date AS DATE)), month) AS first_job_month, person_id
    FROM customer_order cd
    INNER JOIN customer co ON co.job_id = cd.id
    GROUP BY 2
    ORDER BY 1
) first_d
GROUP BY 1
ORDER BY 1
first_job_month user_counts
2018-04-01 36
2018-05-01 37
2018-06-01 39
2018-07-01 45
2018-08-01 38
I have tried a bunch of things, but I can't figure out how to keep track of the original cohorts/users from the first month onwards

1) Get the first order month for every customer.
2) Join orders to the previous subquery to find the difference in months between the given order and the first order.
3) Use conditional aggregates to count customers that still order by month X.
There are alternatives, like using window functions to do (1) and (2) in the same subquery, but the easiest option is this one:
WITH
cohorts as (
SELECT person_id, DATE_TRUNC(MIN(CAST(date AS DATE)), month) as first_job_month
FROM customer_order cd
JOIN customer co
ON co.job_id = cd.id
GROUP BY 1
)
,orders as (
SELECT
*
-- approximate months between this order and the first order (difference in days / 30)
,round(1.0*(CAST(cd.date AS DATE) - c.first_job_month)/30) as months_since_first_order
FROM cohorts c
JOIN customer_order cd
USING (person_id)
)
SELECT
first_job_month as cohort
,count(distinct person_id) as size
,count(distinct case when months_since_first_order>=1 then person_id end) as m1
,count(distinct case when months_since_first_order>=2 then person_id end) as m2
,count(distinct case when months_since_first_order>=3 then person_id end) as m3
-- hardcode up to the number of months you want and the history you have
FROM orders
GROUP BY 1
ORDER BY 1
As you can see, you can use CASE expressions inside aggregate functions like COUNT to identify different subsets of rows that you'd like to aggregate within the same group. This is one of the most important BI techniques in SQL.
Note that >= rather than = is used in the conditional aggregate so that, for example, a customer who buys in m1 and m3 but not in m2 is still counted in m2. If you want your customers to buy every month, or you want the actual retention for every month and are OK with a later month's value being higher than an earlier one's, you can use = instead.
Also, if you don't want the "triangle" view this query produces, or you don't want to hardcode the "mX" columns, you can simply group by first_job_month and months_since_first_order and count distinct person_id, as sketched below. Some visualization tools can consume this simple long format and build the triangle view out of it.
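For reference, a minimal sketch of that long format: keep the cohorts and orders CTEs above and replace the final SELECT with something like this (same assumed column names).
SELECT
first_job_month as cohort
,months_since_first_order
,count(distinct person_id) as retained_users
FROM orders
GROUP BY 1, 2
ORDER BY 1, 2
Each row then holds one (cohort, month offset) pair, which a pivot or a visualization tool can turn back into the triangle view.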

Related

Get list of months from one table and counts for each from another

I'm trying to pull this through in Postgres 11.8:
SELECT count(distinct e.id) counter_employees,
(SELECT count(distinct id) FROM employees
WHERE date_trunc('month',date_hired) = period AND company = 11
) hires,
FROM employees e
WHERE period IN (SELECT DISTINCT make_date(...) FROM amounts)
I can't figure out how to make the subquery check the period that is defined outside the subquery. Also, the period is not from a table but generated, so there is no column in amounts to relate to the employees inside the subquery.
employee table:
id date_hired company
431 2020-01-03 11
422 2020-01-02 11
323 2020-02-03 11
amounts table:
payment_period amount company
202001 999 11
202002 999 11
For every payment period in amounts I want to get some data such as employee count and hires of that period:
period count hires
202001 5 1
202002 6 ...
One option uses aggregation and window functions. If you have hires for each month, then you can get the information directly from employees, like so:
select
date_trunc('month', date_hired) month_hired,
sum(count(*)) over(order by date_trunc('month', date_hired)) no_employees,
count(*) hires
from employees
group by date_trunc('month', date_hired)
On the other hand, if there are months without hires, then you could use generate_series() to create the list of months, then bring the employees with a left join, and aggregate:
select
d.month_hired,
sum(count(e.id)) over(order by d.month_hired) no_employees,
count(e.id) hires
from (
select generate_series(
date_trunc('month', min(date_hired)),
date_trunc('month', max(date_hired)),
interval '1' month
) month_hired
from employees
) d
left join employees e
on e.date_hired >= d.month_hired
and e.date_hired < d.month_hired + interval '1' month
group by d.month_hired
We could run another count for every period distilled from amounts, but that's expensive - unless there are only very few.
For more than a few, compute counts per period for the whole employees table, plus a running total, then LEFT JOIN to it; that should be pretty efficient:
SELECT mon AS period, e.mon_hired AS count, e.all_hired AS hires
FROM (
SELECT to_date(payment_period, 'YYYYMM') AS mon
FROM (SELECT DISTINCT payment_period FROM amounts) a0
) a
LEFT JOIN (
SELECT date_trunc('month', date_hired) AS mon
, count(*) AS mon_hired
, sum(count(*)) OVER (ORDER BY date_trunc('month', date_hired)) AS all_hired
FROM employees e
GROUP BY 1
) e USING (mon)
ORDER BY 1;
This assumes we can just count all employees hired so far to get the total number of hires. (Nobody ever gets fired.)
Works just fine as long as there are rows for every period. Else we need to fill in the gaps. We can compute a complete grid, or default to the latest row in case of a missing month, like this:
WITH e AS (
SELECT date_trunc('month', date_hired) AS mon
, count(*) AS mon_hired
, sum(count(*)) OVER (ORDER BY date_trunc('month', date_hired)) AS all_hired
FROM employees e
GROUP BY 1
)
SELECT mon AS period, ae.*
FROM (
SELECT to_date(payment_period, 'YYYYMM') AS mon
FROM (SELECT DISTINCT payment_period FROM amounts) a0
) a
LEFT JOIN LATERAL (
SELECT CASE WHEN e.mon = a.mon THEN e.mon_hired ELSE 0 END AS count -- ①
, e.all_hired AS hires
FROM e
WHERE e.mon <= a.mon
ORDER BY e.mon DESC
LIMIT 1
) ae ON true
ORDER BY 1;
① If nothing changed for the month, we need to fall back to the last month with data. Take the total count from there, but the monthly count is 0.
We can run a window function over an aggregate on the same query level. See:
Group and count events per time intervals, plus running total
Related:
PostgreSQL: running count of rows for a query 'by minute'
Aside: don't omit the AS keyword for a column alias. See:
Date column arithmetic in PostgreSQL query
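To illustrate that aside with a hypothetical snippet (not from the linked post), using the employees table above: omitting AS makes it easy to create an accidental alias when a comma goes missing.
-- Intended: three columns. The missing comma silently aliases date_hired as "company",
-- so this returns only two columns and raises no error:
SELECT id, date_hired company FROM employees;
-- With the AS keyword spelled out, the slip is much harder to make or to miss:
SELECT id, date_hired AS company FROM employees;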

Creating additional date rows for non existent data (Redshift SQL)

I am looking at agent data and want to create an overview of their sales performance within the last 6 months. I have cases where agents just started and others who started 3 months ago, but I want to create a view where there will always be 6 rows for each agent, no matter when they started; there just won't be any sales listed in the rows before their start. This view is important because I want to have the option to average the values and show different granularities at some point (agent level, team level, etc.).
I am working with Redshift SQL and have the agent data. This is my query:
select date, id, name, team, country, sum(sales)
from agent
where date >= date_trunc('month', dateadd(month, -6, current_date)) and date <= current_date
group by 1,2,3,4,5
order by 1
This gives me the output below (without the green rows). How could I add additional rows/months for Roman, an agent who started in February? Any ideas or suggestions?
Assuming that all dates are available in the table (as shown in your sample data), one option is to cross join the available dates with the list of agents to generate all possible combinations, then bring the original table with a left join:
select d.date, n.id, n.name, n.team, n.country, sum(a.sales)
from (select distinct date from agent) d
cross join (select distinct id, name, team, country from agent) n
left join agent a on a.date = d.date and a.id = n.id
group by 1, 2, 3, 4, 5
order by 1, 2
This assumes that id uniquely identifies an agent; otherwise, you would need additional join conditions in the left join (on name, team, country), as sketched below.
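For example (a sketch only, reusing the agent columns from the query above), the extended join could look like this:
select d.date, n.id, n.name, n.team, n.country, sum(a.sales)
from (select distinct date from agent) d
cross join (select distinct id, name, team, country from agent) n
left join agent a
  on a.date = d.date
  and a.id = n.id
  and a.name = n.name
  and a.team = n.team
  and a.country = n.country
group by 1, 2, 3, 4, 5
order by 1, 2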

Need to count unique transactions by month but ignore records that occur 3 days after 1st entry for that ID

I have a table with just two columns: user_id and fail_date. Each time somebody's card is rejected they are logged in the table, their card is automatically tried again 3 days later, and if they fail again, another entry is added to the table. I am trying to write a query that counts unique failures by month, so I only want to count the first entry, not the 3-day retries, if they exist. My data set looks like this:
user_id fail_date
222 01/01
222 01/04
555 02/15
777 03/31
777 04/02
222 10/11
so my desired output would be something like this:
month unique_fails
jan 1
feb 1
march 1
april 0
oct 1
I'll be running this in Vertica, but I'm not so much looking for perfect syntax in replies. Just help on how to approach this problem, as I can't really think of a way to make it work. Thanks!
You could use lag() to get the previous timestamp per user. If the current and the previous timestamp are less than or exactly three days apart, it's a follow up. Mark the row as such. Then you can filter to exclude the follow ups.
It might look something like:
SELECT month,
count(*) unique_fails
FROM (SELECT month(fail_date) month,
CASE
WHEN datediff(day,
lag(fail_date) OVER (PARTITION BY user_id
ORDER BY fail_date),
fail_date) <= 3 THEN
1
ELSE
0
END follow_up
FROM elbat) x
WHERE follow_up = 0
GROUP BY month;
I'm not so sure about the exact syntax in Vertica, so it might need some adaptations. I also don't know whether fail_date actually is some date/time type or just a string. If it's just a string, the date/time-specific functions may not work on it and have to be replaced, or the string has to be converted before passing it to the functions.
If the data spans several years you might also want to include the year in addition to the month, to keep months from different years apart. In the inner SELECT add a column year(fail_date) year, and add year to the column list and the GROUP BY of the outer SELECT, as sketched below.
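A sketch of that year-aware variant (same assumed table name elbat as above, and still subject to Vertica syntax adaptations):
-- same query as above, with the year added to the inner SELECT and the outer GROUP BY
SELECT year,
       month,
       count(*) unique_fails
FROM (SELECT year(fail_date) year,
             month(fail_date) month,
             CASE
               WHEN datediff(day,
                             lag(fail_date) OVER (PARTITION BY user_id
                                                  ORDER BY fail_date),
                             fail_date) <= 3 THEN
                 1
               ELSE
                 0
             END follow_up
      FROM elbat) x
WHERE follow_up = 0
GROUP BY year,
         month;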
You can add a flag about whether this is a "unique_fail" by doing:
select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) >= fail_date - 3
then 0 else 1
end) as first_failure_flag
from t;
Then, you want to count this flag by month:
select to_char(fail_date, 'Mon'), -- should always include the year
sum(first_failure_flag)
from (select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) >= fail_date - 3
then 0 else 1
end) as first_failure_flag
from t
) t
group by to_char(fail_date, 'Mon')
order by min(fail_date)
In a derived table, determine the previous fail_date (prev_fail_date) for a specific user_id and fail_date, using a correlated subquery.
Using the derived table dt, count the failure only if the difference in days between the current fail_date and prev_fail_date is greater than 3.
The DATEDIFF() function together with the IF() function is used to identify the cases that are not repeated tries.
To group this result by month, you can use the MONTH() function.
But then, the data can be from multiple years, so you need to separate them out by year as well; for that you can do a multi-level group by, using the YEAR() function.
Try the following (in MySQL); you can get the idea for other RDBMS as well:
SELECT YEAR(dt.fail_date) AS year_fail_date,
MONTH(dt.fail_date) AS month_fail_date,
-- count first failures (no previous try) and failures more than 3 days after the previous one
COUNT( IF(dt.prev_fail_date IS NULL
          OR DATEDIFF(dt.fail_date, dt.prev_fail_date) > 3, user_id, NULL) ) AS unique_fails
FROM (
SELECT
t1.user_id,
t1.fail_date,
(
SELECT t2.fail_date
FROM your_table AS t2
WHERE t2.user_id = t1.user_id
AND t2.fail_date < t1.fail_date
ORDER BY t2.fail_date DESC
LIMIT 1
) AS prev_fail_date
FROM your_table AS t1
) AS dt
GROUP BY
year_fail_date,
month_fail_date
ORDER BY
year_fail_date ASC,
month_fail_date ASC

Calculating business days in Teradata

I need help with a business days calculation.
I've two tables
1) One table ACTUAL_TABLE containing order date and contact date with timestamp datatypes.
2) The second table BUSINESS_DATES has each of the calendar dates listed and has a flag to indicate weekend days.
Using these two tables, I need to ensure business days, and not calendar days (which is the current logic), are calculated between these two fields.
My thought process was to first get a range of dates by comparing ORDER_DATE with the TABLE_DATE field and then do a similar comparison of CONTACT_DATE to the TABLE_DATE field. This would get me a range from the BUSINESS_DATES table, which I could then use to calculate COUNT(*) of days and SUM(Holiday_WKND_Flag), making the result look like:
Order# | Count(*) As DAYS | SUM(WEEKEND DATES)
100 | 25 | 8
However this only works when I use a specific order number; I can't bring all order numbers in via a subquery.
My Query:
SELECT SUM(Holiday_WKND_Flag), COUNT(*) FROM
(
SELECT
* FROM
BUSINESS_DATES
WHERE BUSINESS.Business BETWEEN (SELECT ORDER_DATE FROM ACTUAL_TABLE
WHERE ORDER# = '100'
)
AND
(SELECT CONTACT_DATE FROM ACTUAL_TABLE
WHERE ORDER# = '100'
)
) TEMP
Uploading the table structure for your reference.
SELECT ORDER#, SUM(Holiday_WKND_Flag), COUNT(*)
FROM business_dates bd
INNER JOIN actual_table at ON bd.table_date BETWEEN at.order_date AND at.contact_date
GROUP BY ORDER#
Instead of joining on a BETWEEN (which always results in a bad product join) followed by a COUNT, it's better to assign a business day number to each date (ideally this is calculated only once and added as a column to your calendar table). Then it's two equi-joins and no aggregation needed:
WITH cte AS
(
SELECT
Cast(table_date AS DATE) AS table_date,
-- assign a consecutive number to each business day, i.e. not increased during weekends, etc.
Sum(CASE WHEN Holiday_WKND_Flag = 1 THEN 0 ELSE 1 end)
Over (ORDER BY table_date
ROWS Unbounded Preceding) AS business_day_nbr
FROM business_dates
)
SELECT ORDER#,
Cast(t.contact_date AS DATE) - Cast(t.order_date AS DATE) AS #_of_days,
b2.business_day_nbr - b1.business_day_nbr AS #_of_business_days
FROM actual_table AS t
JOIN cte AS b1
ON Cast(t.order_date AS DATE) = b1.table_date
JOIN cte AS b2
ON Cast(t.contact_date AS DATE) = b2.table_date
Btw, why are table_date and order_date timestamp instead of a date?
Porting from Oracle?
You can use this query. Hope it helps:
select order#,
order_date,
contact_date,
(select count(1)
from business_dates_table
where table_date between a.order_date and a.contact_date
and holiday_wknd_flag = 0
) business_days
from actual_table a

How to get common IDs in each group from a group by SQL clause?

I have call data for customers. I want to get those customers who, between two dates, have activity on every date, i.e. they did at least one activity every day. I tried the following query:
select date_id , count (distinct customer_id) from usage_analysis
where usage_direction_type_id = 1
and date_id >= 20130608 and date_id <= 20130612
group by date_id
That returns:
DATE_ID COUNT
----------------------------
20130608 23451
20130609 9878
20130610 56122
20130611 7811
20130612 12334
But I want to get those customers that are common to every group. It may happen that a person who called on 8 June does not show up the next day. So I only want those customers that exist in every group.
Any idea how I can do that in SQL?
You can count the distinct dates for each customer. Only customers with five distinct dates would then pass the test. The following provides the list of customers:
select customer_id
from usage_analysis
where usage_direction_type_id = 1 and
date_id >= 20130608 and date_id <= 20130612
group by customer_id
having count(distinct date_id) = 5
@Gordon Linoff's answer should work fine for your situation. When you tried it with 2 days, did you make sure to change the count value from 5 to 2?
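If you'd rather not hardcode that number at all, one option (a sketch, assuming the same table and filters as above and a database that accepts a scalar subquery in HAVING) is to compare against the number of distinct dates in the range itself:
select customer_id
from usage_analysis
where usage_direction_type_id = 1 and
      date_id >= 20130608 and date_id <= 20130612
group by customer_id
having count(distinct date_id) = (select count(distinct date_id)
                                  from usage_analysis
                                  where usage_direction_type_id = 1 and
                                        date_id >= 20130608 and date_id <= 20130612)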