SQL query for user reactivations - sql

I am trying to build a query to count user reactivations per month, where "reactivation" is defined as (for e.g. March 2021):
Sent activity during, or before, January 2021
Did not send activity during February 2021
Sent activity during March 2021
(so 1 or more full calendar months of no activity as the threshold for inactive).
The source table F_ACTIVITY is a per-user per-day time series with columns:
dt (date), user_id, is_active (boolean).
The desired outcome is a table showing:
month, reactivations_this_month
The closest I can get is a count of reactivations in the current month, or something relative to the current date with more case statements (e.g. repeating for current month -2):
SELECT
COUNT(*) AS reactivations_this_month
FROM
(SELECT
* FROM
(SELECT
user_id,
SUM(current_month_active) AS cma,
SUM(last_month_active) AS lma,
SUM(historical_active) AS h_a
FROM
(SELECT
user_id,
dt,
CASE WHEN DATE_TRUNC(MONTH, DT) = ADD_MONTHS(DATE_TRUNC(MONTH, CURRENT_TIMESTAMP), 0) THEN 1 ELSE 0 END AS current_month_active,
CASE WHEN DATE_TRUNC(MONTH, DT) = ADD_MONTHS(DATE_TRUNC(MONTH, CURRENT_TIMESTAMP), -1) THEN 1 ELSE 0 END AS last_month_active,
CASE WHEN DATE_TRUNC(MONTH, DT) < ADD_MONTHS(DATE_TRUNC(MONTH, CURRENT_TIMESTAMP), -1) THEN 1 ELSE 0 END AS historical_active
FROM F_ACTIVITY
WHERE is_active = 1
) AS x
GROUP BY user_id) AS y
WHERE cma > 0
AND lma = 0
AND h_a > 0) AS z
Any help transforming this into a rolling monthly query greatly appreciated - thanks all!
Final note: I'm trying this in Snowflake, so the dialect is SnowSQL

First summarize by month and user, then use lag():
SELECT yyyymm,
SUM(CASE WHEN prev_yyyymm < yyyymm - INTERVAL '1 month' THEN 1 ELSE 0 END) as num_reactivations
FROM (SELECT user_id, DATE_TRUNC(MONTH, DT) as yyyymm,
LAG(DATE_TRUNC(MONTH, DT)) OVER (PARTITION BY user_id ORDER BY DATE_TRUNC(MONTH, DT)) as prev_yyyymm
FROM F_ACTIVITY
WHERE is_active = 1
GROUP BY user_id, DATE_TRUNC(MONTH, DT)
) um
GROUP BY yyyymm;
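To check the lag() logic on a small sample outside the warehouse, the same two steps (summarize per user and month, then compare each month with the previous active month) can be sketched in plain Python. This is only an illustration: the function names and sample rows are made up, and unlike the SQL it returns only months that have at least one reactivation.

```python
from collections import defaultdict
from datetime import date

def month_index(d):
    """Map a date to a running month number so months can be compared by subtraction."""
    return d.year * 12 + (d.month - 1)

def count_reactivations(rows):
    """rows: iterable of (user_id, dt, is_active). Returns {(year, month): reactivations}."""
    # Step 1: one entry per user and active month (the GROUP BY in the subquery).
    months_by_user = defaultdict(set)
    for user_id, dt, is_active in rows:
        if is_active:
            months_by_user[user_id].add(month_index(dt))

    # Step 2: compare each active month with the previous active month (the LAG).
    counts = defaultdict(int)
    for months in months_by_user.values():
        ordered = sorted(months)
        for prev, cur in zip(ordered, ordered[1:]):
            if cur - prev > 1:  # at least one full calendar month was skipped
                counts[(cur // 12, cur % 12 + 1)] += 1
    return dict(counts)

# Hypothetical sample data in the shape of F_ACTIVITY.
sample = [
    ("a", date(2021, 1, 5), True),
    ("a", date(2021, 3, 2), True),    # a skipped February: reactivated in March
    ("b", date(2021, 2, 1), True),
    ("b", date(2021, 3, 9), True),    # b was active in February: not a reactivation
    ("c", date(2021, 3, 9), True),    # c is new in March: not a reactivation
    ("d", date(2020, 11, 9), True),
    ("d", date(2021, 2, 9), False),   # inactive rows are ignored
    ("d", date(2021, 3, 20), True),   # d reactivated in March
]
result = count_reactivations(sample)
print(result)
```

A first-ever active month has no previous month to compare against, so new users are not counted, which matches the NULL comparison falling through to ELSE 0 in the query.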

Related

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
edit Additional data below. Data collection begins on 11/23. No data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count (say, the current week and the three preceding weeks), you can use window functions. Note that this adds up the weekly distinct counts, so a customer who is active in several of those weeks is counted once per week:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt,
sum(count(distinct customer_id)) over(
order by date_trunc('week', event_date)
range between interval '3 week' preceding and current row
) as rolling_cnt
from final
group by 1
Rolling distinct counts are quite difficult in RedShift. One method is a self-join and aggregation:
select t.date,
count(distinct case when tprev.date >= t.date - interval '6 day' then tprev.customer_id end) as trailing_7,
count(distinct tprev.customer_id) as trailing_30
from t join
t tprev
on tprev.date >= t.date - interval '29 day' and
tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
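As a reference for validating the SQL on a small extract, here is a minimal Python sketch of what a trailing 30-day distinct count per week-ending date means. The function name, the sample events, and the choice of week-ending dates are assumptions made for illustration.

```python
from datetime import date, timedelta

def trailing_distinct(events, as_of, days=30):
    """Count distinct customers with an event in the `days`-day window ending on as_of (inclusive)."""
    start = as_of - timedelta(days=days - 1)
    return len({cust for cust, d in events if start <= d <= as_of})

# Hypothetical (customer_id, event_date) rows.
events = [
    (1, date(2020, 11, 25)),
    (1, date(2020, 12, 10)),  # same customer twice in one window: counted once
    (2, date(2020, 11, 20)),  # just outside the window that ends on Dec 20
    (3, date(2020, 12, 28)),
]

# One value per week-ending date; each window is evaluated independently.
week_ends = [date(2020, 12, 20), date(2020, 12, 27), date(2021, 1, 3)]
weekly = {end: trailing_distinct(events, end) for end in week_ends}
for end, n in weekly.items():
    print(end, n)
```

Because every window is computed from scratch, a customer who appears in two overlapping windows is counted once in each, which is the behaviour described in the question.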
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and end time periods of being counted. This is a pain with two different time frames. Here is what it looks like for one.
The idea is to
Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
Do a cumulative sum of this count.
The code looks something like this:
with cd as (
select customer_id, date,
lead(date) over (partition by customer_id order by date) as next_date,
sum(sum(inc)) over (partition by customer_id order by date) as cnt
from ((select t.customer_id, t.date, 1 as inc
from t
) union all
(select t.customer_id, t.date + interval '7 day', -1
from t
)
) tt
group by customer_id, date
),
cd2 as (
select customer_id, min(date) as enter_date, max(date) as exit_date
from (select cd.*,
sum(case when prev_cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
from (select cd.*,
lag(cnt) over (partition by customer_id order by date) as prev_cnt
from cd
) cd
) cd
group by customer_id, grp
having max(cnt) > 0
)
select dte, sum(sum(inc)) over (order by dte)
from ((select customer_id, enter_date as dte, 1 as inc
from cd2
) union all
(select customer_id, exit_date as dte, -1 as inc
from cd2
)
) cd2
group by dte;
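The enter/exit idea can be easier to follow in procedural form. The sketch below is a plain-Python illustration of the four steps, not a translation of the query; the function names and sample events are hypothetical, and each event is assumed to keep a customer counted for 7 days.

```python
from collections import defaultdict
from datetime import date, timedelta

def counted_periods(dates, n_days):
    """Merge one customer's event dates into [enter, exit] periods (steps 1 and 2)."""
    periods = []
    for d in sorted(dates):
        exit_date = d + timedelta(days=n_days)
        if periods and d <= periods[-1][1]:
            # Still being counted: extend the current period.
            periods[-1][1] = max(periods[-1][1], exit_date)
        else:
            periods.append([d, exit_date])
    return periods

def running_distinct(events, n_days=7):
    """Return {date: customers being counted from that date on} (steps 3 and 4)."""
    by_customer = defaultdict(list)
    for cust, d in events:
        by_customer[cust].append(d)

    deltas = defaultdict(int)
    for dates in by_customer.values():
        for enter, exit_date in counted_periods(dates, n_days):
            deltas[enter] += 1      # +1 when the customer starts being counted
            deltas[exit_date] -= 1  # -1 when the customer stops being counted

    total, out = 0, {}
    for d in sorted(deltas):        # cumulative sum over the change points
        total += deltas[d]
        out[d] = total
    return out

events = [(1, date(2021, 1, 1)), (1, date(2021, 1, 5)), (2, date(2021, 1, 3))]
timeline = running_distinct(events)
for d, n in timeline.items():
    print(d, n)
```

Customer 1's two events overlap, so they collapse into a single period from Jan 1 to Jan 12, which is why the count never goes above 2.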

sql user retention calculation

I have a table records like this in Athena, one user one row in a month:
month, id
2020-05 1
2020-05 2
2020-05 5
2020-06 1
2020-06 5
2020-06 6
I need to calculate the percentage = (users present in both the prior month and the current month) / (total users in the prior month).
In the example above, users 1 and 5 appear in both May and June, and May has 3 users in total, so the result should be 2/3*100.
with monthly_mau AS
(SELECT month as mauMonth,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth,
count(distinct userid) AS monthly_mau
FROM records
GROUP BY month
ORDER BY month),
retention_mau AS
(SELECT
month,
count(distinct useridLeft) AS retention_mau
FROM (
(SELECT
userid as useridLeft,month as monthLeft,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth
FROM records ) AS prior
INNER JOIN
(SELECT
month ,
userid
FROM records ) AS current
ON
prior.useridLeft = current.userid
AND prior.nextMonth = current.month )
WHERE userid is not null
GROUP BY month
ORDER BY month )
SELECT *, cast(retention_mau AS double)/cast(monthly_mau AS double)*100 AS retention_mau_percentage
FROM monthly_mau as m
INNER JOIN monthly_retention_mau AS r
ON m.nextMonth = r.month
order by r.month
This gives me percentage as 100 which is not right. Any idea?
Hmmm . . . assuming you have one row per user per month, you can use window functions and conditional aggregation:
select month, count(*) as num_users,
sum(case when prev_month_date = date_add('month', -1, month_date) then 1 else 0 end) as both_months
from (select r.*,
cast(concat(month, '-01') AS date) as month_date,
lag(cast(concat(month, '-01') AS date)) over (partition by id order by month) as prev_month_date
from records r
) r
group by month;
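The expected figure for the sample data can be checked with a short Python sketch. It assumes months are 'YYYY-MM' strings with one row per user per month, as in the question; the function names are made up for the example.

```python
from collections import defaultdict

def next_month(ym):
    """'2020-12' -> '2021-01'."""
    year, month = map(int, ym.split("-"))
    year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return f"{year:04d}-{month:02d}"

def retention_pct(records):
    """records: (month, id) pairs. Returns {month: % of prior-month users who came back}."""
    users = defaultdict(set)
    for month, uid in records:
        users[month].add(uid)

    out = {}
    for month, prior_users in users.items():
        nxt = next_month(month)
        if nxt in users:
            retained = prior_users & users[nxt]   # present in both months
            out[nxt] = 100.0 * len(retained) / len(prior_users)
    return out

records = [
    ("2020-05", 1), ("2020-05", 2), ("2020-05", 5),
    ("2020-06", 1), ("2020-06", 5), ("2020-06", 6),
]
retention = retention_pct(records)
print(retention)
```

Note that the denominator is the prior month's user count, not the current month's, so in SQL the both_months figure still has to be divided by the previous month's num_users.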

How to define the filter in dates?

With the query, I basically want to compare avg_clicks at different time periods and set a filter according to the avg_clicks.
The below query gives us avg_clicks for each shop in January 2020. But I want to see the avg_clicks that is higher than 0 in January 2020.
Question 1: When I add the where avg_clicks > 0 in the query, I am getting the following error: Column 'avg_clicks' cannot be resolved. Where to put the filter?
SELECT AVG(a.clicks) AS avg_clicks,
a.shop_id,
b.shop_name
FROM
(SELECT SUM(clicks_on) AS clicks,
shop_id,
date
FROM X
WHERE site = 'com'
AND date >= CAST('2020-01-01' AS date)
AND date <= CAST('2020-01-31' AS date)
GROUP BY shop_id, date) as a
JOIN Y as b
ON a.shop_id = b.shop_id
GROUP BY a.shop_id, b.shop_name
Question 2: As I wrote, I want to compare two different times. And now, I want to see avg_clicks that is 0 in February 2020.
As a result, the desired output will show me the list of shops that had more than 0 clicks in January, but 0 clicks in February.
Hope I could explain my question. Thanks in advance.
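For your Question 1, use a HAVING clause. Reading up on the logical execution order of a SQL statement will give you a better idea of why you are getting the avg_clicks error: WHERE is evaluated before the aggregate is computed, while HAVING is evaluated after it.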
SELECT AVG(a.clicks) AS avg_clicks,
a.shop_id,
b.shop_name
FROM
(SELECT SUM(clicks_on) AS clicks,
shop_id,
date
FROM X
WHERE site = 'com'
AND date >= '2020-01-01'
AND date <= '2020-01-31'
GROUP BY shop_id, date) as a
JOIN Y as b
ON a.shop_id = b.shop_id
GROUP BY a.shop_id, b.shop_name
HAVING AVG(a.clicks) > 0
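To try the WHERE versus HAVING split on a few rows, the query can be run against an in-memory SQLite database from Python. SQLite is only a stand-in here (the question's engine is different), the sample rows are invented, and the join to Y is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE X (shop_id INTEGER, date TEXT, site TEXT, clicks_on INTEGER);
    INSERT INTO X VALUES
        (1, '2020-01-01', 'com', 4),
        (1, '2020-01-01', 'com', 2),
        (1, '2020-01-02', 'com', 0),
        (2, '2020-01-01', 'com', 0),
        (2, '2020-01-02', 'com', 0),
        (3, '2020-01-01', 'de',  9);
""")

rows = conn.execute("""
    SELECT shop_id, AVG(clicks) AS avg_clicks
    FROM (SELECT shop_id, date, SUM(clicks_on) AS clicks
          FROM X
          WHERE site = 'com'              -- row filter: applied before grouping
            AND date >= '2020-01-01'
            AND date <= '2020-01-31'
          GROUP BY shop_id, date) AS a
    GROUP BY shop_id
    HAVING AVG(clicks) > 0                -- group filter: applied after aggregation
    ORDER BY shop_id
""").fetchall()
print(rows)
conn.close()
```

Shop 2 has rows in January but an average of zero, so it is removed by HAVING; shop 3 never reaches the aggregation because WHERE drops its rows first.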
For your Question 2, you can do something like this
SELECT
jan.shop_id,
b.shop_name,
jan_avg_clicks,
feb_avg_clicks
FROM
(
SELECT
AVG(clicks) AS jan_avg_clicks,
shop_id
FROM
(
SELECT
SUM(clicks_on) AS clicks,
shop_id,
date
FROM X
WHERE site = 'com'
AND date >= '2020-01-01'
AND date <= '2020-01-31'
GROUP BY
shop_id,
date
) as a
GROUP BY
shop_id
HAVING AVG(clicks) > 0
) jan
join
(
SELECT
AVG(clicks) AS feb_avg_clicks,
shop_id
FROM
(
SELECT
SUM(clicks_on) AS clicks,
shop_id,
date
FROM X
WHERE site = 'com'
AND date >= '2020-02-01'
AND date < '2020-03-01'
GROUP BY
shop_id,
date
) as a
GROUP BY
shop_id
HAVING AVG(clicks) = 0
) feb
on jan.shop_id = feb.shop_id
join Y as b
on jan.shop_id = b.shop_id
Start with conditional aggregation:
SELECT shop_id,
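SUM(CASE WHEN DATE_TRUNC('month', date) = '2020-01-01' THEN clicks_on END) / NULLIF(COUNT(DISTINCT CASE WHEN DATE_TRUNC('month', date) = '2020-01-01' THEN date END), 0) as avg_clicks_jan,
SUM(CASE WHEN DATE_TRUNC('month', date) = '2020-02-01' THEN clicks_on END) / NULLIF(COUNT(DISTINCT CASE WHEN DATE_TRUNC('month', date) = '2020-02-01' THEN date END), 0) as avg_clicks_feb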
FROM X
WHERE site = 'com' AND
date >= '2020-01-01' AND
date < '2020-03-01'
GROUP BY shop_id;
I'm not sure what comparison you want to make. But if you want to filter based on the aggregated values, use a HAVING clause.
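For the comparison described in the question (clicks in January, none in February), a small Python sketch may help pin down the intended result before writing the HAVING condition. It rests on two assumptions that are not stated in the question: clicks are never negative, so an average above zero is the same as a total above zero, and a shop with no February rows at all counts as having zero clicks. The function name and data are hypothetical.

```python
from collections import defaultdict
from datetime import date

def dropped_off(rows, first=(2020, 1), second=(2020, 2)):
    """rows: (shop_id, date, clicks_on). Shops with clicks in `first` month and none in `second`."""
    totals = defaultdict(int)
    for shop_id, d, clicks in rows:
        totals[(shop_id, (d.year, d.month))] += clicks

    shops = {shop for shop, _ in totals}
    return sorted(
        s for s in shops
        if totals.get((s, first), 0) > 0        # some clicks in the first month
        and totals.get((s, second), 0) == 0     # zero (or no rows) in the second
    )

rows = [
    (1, date(2020, 1, 10), 5), (1, date(2020, 2, 3), 0),   # clicks in Jan, zero in Feb
    (2, date(2020, 1, 10), 3), (2, date(2020, 2, 3), 2),   # clicks in both months
    (3, date(2020, 1, 15), 4),                             # no February rows at all
    (4, date(2020, 1, 20), 0), (4, date(2020, 2, 20), 0),  # zero in both months
]
shops = dropped_off(rows)
print(shops)
```

Shop 3 is the case to watch in SQL: an inner join between a January subquery and a February subquery would silently drop it, because it has no February rows to join to.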

SQL manipulation of table (aggregate and grouping)

I would like to run a daily query (using BigQuery) to compare the sums of different metrics between yesterday and today. The sample dataset looks like this:
Assuming today is 23 Dec 2019, the query will aggregate different metrics (revenue, cost, profit) for each customer for 23 Dec (today) and 22 Dec (yesterday). If sum(yesterday)/sum(today) is not within the threshold of 0.5-1.5, the row will be labelled as anomalous.
The query will be run daily and each new result will simply be appended. Ideally the final table would look like this:
My main concern is that I am able to do this for one metric only (i.e. revenue), but I am not sure how to apply it to all metrics (and also make the query more efficient). This is the code I have written:
SELECT cust_id,
SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
THEN revenue
END) AS sum(yesterday),
SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY)
THEN revenue
END) AS sum(today),
SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
THEN revenue
END) / SUM(CASE WHEN date = DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY)
THEN revenue
END) as ratio,
FROM `dataset`
GROUP BY cust_id
and the code gives me:
Apologies in advance for the lack of clarity in the question, as I am new to this and not sure how to phrase this question more accurately
My suggestion would be to put the source data in an Excel pivot table (move the Values group to the rows to get the desired view).
If you want to stick to SQL, however, you need to unpivot the rows first by putting each measure in a separate row, and then group the intermediate results, like this:
WITH unpivoted AS
(
SELECT
date
, 'revenue' AS metrics
, SUM( revenue ) AS amount
, cust_id
FROM
`dataset`
GROUP
BY
date
, cust_id
UNION ALL
SELECT
date
, 'cost' AS metrics
, SUM( cost ) AS amount
, cust_id
FROM
`dataset`
GROUP
BY
date
, cust_id
-- add more desired metrics
)
SELECT
date as date_generated
, cust_id
, metrics
, SUM( CASE WHEN date = DATE_ADD( CURRENT_DATE() , INTERVAL 0 DAY ) THEN amount END ) AS today
, SUM( CASE WHEN date = DATE_ADD( CURRENT_DATE() , INTERVAL -1 DAY ) THEN amount END ) AS yesterday
...
FROM
unpivoted
WHERE
date >= DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY )
AND date <= DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY )
GROUP
BY
date, cust_id, metrics
You can summarize the data and then use lag() or a join to bring in the previous days data:
with t as (
select cust_id, date,
sum(revenue) as revenue,
sum(cost) as cost,
sum(profit) as profit
from dataset
where date >= date_add(current_date, interval -1 day)
group by cust_id, date
)
select today.cust_id,
today, yesterday
from t today left join
t yesterday
on yesterday.cust_id = today.cust_id and
yesterday.date = date_add(current_date, interval -1 day)
where today.date = current_date;
You can unpivot the columns first and then group the results. After that, you might need to use LAG() to show data from one day and the previous one in the same row.
WITH unpivoted AS
(
SELECT
date,
'revenue' AS metrics,
SUM( revenue ) AS amount,
cust_id
FROM
`dataset`
GROUP BY
date, metrics, cust_id
UNION ALL
SELECT
date,
'cost' AS metrics,
SUM( cost ) AS amount,
cust_id
FROM
`dataset`
GROUP BY
date, metrics, cust_id
UNION ALL
SELECT
date,
'profit' AS metrics,
SUM( profit ) AS amount,
cust_id
FROM
`dataset`
GROUP BY
date, metrics, cust_id
)
SELECT
date as date_generated,
metrics,
cust_id,
LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) yesterday,
SUM( amount ) AS today,
LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount) as ratio,
CASE WHEN LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount)<0.5 then 'TRUE'
WHEN LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount)>1.5 then 'TRUE'
WHEN LAG(SUM( amount )) OVER (PARTITION BY cust_id, metrics ORDER BY date) / SUM(amount) is NULL then 'TRUE'
ELSE 'FALSE'
END as anomalous
FROM
unpivoted
WHERE date >= DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY ) AND date <= DATE_ADD(CURRENT_DATE(), INTERVAL 0 DAY )
GROUP BY
date_generated, cust_id, metrics
ORDER BY
date_generated, metrics, cust_id
Note that the WHERE clause limits my solution to the current day and the previous day (today and yesterday). Without it, the same query could be used to compare each day with the preceding day across more than two days.
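The labelling rule itself (ratio of yesterday to today outside 0.5-1.5, or not computable) can be sketched in Python to check expected flags on a few values. The function name, thresholds as parameters, and the sample amounts are illustrative; treating a zero or missing value for today as anomalous is an assumption made here, chosen to mirror the NULL branch of the CASE expression.

```python
def flag_anomalies(today, yesterday, low=0.5, high=1.5):
    """today/yesterday: {(cust_id, metric): amount}. Returns {(cust_id, metric): (ratio, anomalous)}."""
    out = {}
    for key in sorted(set(today) | set(yesterday)):
        t, y = today.get(key), yesterday.get(key)
        # Ratio is undefined when either day is missing or today's amount is zero.
        ratio = y / t if (t and y is not None) else None
        anomalous = ratio is None or not (low <= ratio <= high)
        out[key] = (ratio, anomalous)
    return out

# Hypothetical per-customer, per-metric sums for the two days.
today = {("c1", "revenue"): 100.0, ("c1", "cost"): 40.0, ("c2", "revenue"): 10.0}
yesterday = {("c1", "revenue"): 120.0, ("c1", "cost"): 10.0}

flags = flag_anomalies(today, yesterday)
for key, (ratio, anomalous) in flags.items():
    print(key, ratio, anomalous)
```

Keying the dictionaries by (cust_id, metric) plays the same role as the unpivot: one rule is applied uniformly to every metric instead of being repeated per column.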

I need a query to compare one Saturday's total sales with the rest of the year's average Saturday's total sales

My data set's fields are ts, quantity, unit_price
I first need to run sum(quantity * unit_price) to get my sales number
ts(time stamp) is formatted like this - 2019-01-15 14:55:00 UTC
Is this what you want?
select avg(case when dte = ? then total end) as sales_your_date,
avg(case when dte <> ? then total end) as sales_other
from (select date(t.ts) as dte, sum(t.quantity * t.unit_price) as total
from t
where ts >= timestamp('2018-01-01') and
ts < timestamp('2019-01-01')
group by dte
) t
where extract(dayofweek from dte) = 7 -- Saturday (BigQuery numbers Sunday as 1)
This is not much different from your previous question. The same idea works, just with aggregating the data first.
? is for the date you care about.
Below is for BigQuery Standard SQL
#standardSQL
SELECT DATE(ts) AS sale_date, quantity * unit_price AS sale_total,
ROUND((SUM(quantity * unit_price) OVER() - quantity * unit_price) / (COUNT(1) OVER() - 1), 2) AS sale_rest_average
FROM `project.dataset.table`
WHERE EXTRACT(DAYOFWEEK FROM DATE(ts)) = 7
AND EXTRACT(YEAR FROM DATE(ts)) = 2018
If your timestamp field is of the TIMESTAMP data type (rather than STRING), you can simply use
WHERE EXTRACT(DAYOFWEEK FROM ts) = 7
AND EXTRACT(YEAR FROM ts) = 2018
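To sanity-check either query on a handful of rows, the intended calculation (total sales per Saturday first, then the chosen Saturday against the average of the others) can be sketched in Python. The function name and sample rows are hypothetical; note that Python numbers weekdays from Monday = 0, so Saturday is 5 here rather than 7.

```python
from collections import defaultdict
from datetime import date, datetime

def saturday_comparison(rows, target):
    """rows: (ts, quantity, unit_price). Returns (target day's total, average total of other Saturdays)."""
    daily = defaultdict(float)
    for ts, quantity, unit_price in rows:
        if ts.weekday() == 5:                 # keep Saturdays only
            daily[ts.date()] += quantity * unit_price

    others = [total for day, total in daily.items() if day != target]
    avg_others = sum(others) / len(others) if others else None
    return daily.get(target), avg_others

rows = [
    (datetime(2018, 1, 6, 10, 0), 2, 10.0),   # Saturday
    (datetime(2018, 1, 6, 15, 30), 1, 5.0),   # same Saturday, second sale
    (datetime(2018, 1, 7, 9, 0), 10, 10.0),   # Sunday: ignored
    (datetime(2018, 1, 13, 12, 0), 3, 10.0),  # Saturday
    (datetime(2018, 1, 20, 12, 0), 1, 50.0),  # Saturday
]
target_total, other_avg = saturday_comparison(rows, date(2018, 1, 6))
print(target_total, other_avg)
```

Aggregating to one total per day before averaging matters: averaging individual sale rows instead would weight busy Saturdays more heavily and give a different figure.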