SQL Rolling LTV (Lifetime Value)

I am trying to get a rolling calculation of customer lifetime value. The basic formula I am using is 'SUM(revenue) / COUNT(DISTINCT CUSTOMERS)', but I am running into issues when trying to compute those numbers for each day, looking backward from that day. The code below isn't correct; I also tried PARTITION-based code, which didn't work either.
CREATE TEMP TABLE customer_revenue AS
(
SELECT TRUNC(timestamp) AS "order_date", COUNT(DISTINCT customer_email) AS "customers",
SUM(revenue)-SUM(discount)-SUM(shipping)-SUM(tax) AS "revenue"
FROM public.fact_shopify_orders
GROUP BY TRUNC(timestamp)
);
SELECT TRUNC(SO.timestamp) AS "date", SUM(CR.revenue) / COUNT(customers) AS "LTV"
FROM customer_revenue CR
LEFT JOIN public.fact_shopify_orders SO ON CR.order_date = SO.timestamp
WHERE CR.order_date <= SO.timestamp
GROUP BY TRUNC(SO.timestamp)
ORDER BY TRUNC(SO.timestamp) DESC

I think you want rolling sums and count(distinct). The latter is a little tricky but you can emulate it easily using a flag based on the first time the customer is seen:
SELECT date,
( SUM(SUM(net_revenue)) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
SUM(SUM( (seqnum = 1)::int )) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
) as LTV
FROM (SELECT so.*, TRUNC(SO.timestamp) as date,
(revenue - discount - shipping - tax) as net_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY timestamp) as seqnum
FROM public.fact_shopify_orders so
) so
GROUP BY date;
EDIT:
I think Redshift supports window functions with aggregation . . . but there is some database out there that does not. You can try this:
SELECT date,
( SUM(net_revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
SUM(num_firsts) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
) as LTV
FROM (SELECT date, SUM(net_revenue) as net_revenue,
SUM( (seqnum = 1)::int ) as num_firsts
FROM (SELECT so.*, TRUNC(SO.timestamp) as date,
(revenue - discount - shipping - tax) as net_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY timestamp) as seqnum
FROM public.fact_shopify_orders so
) so
GROUP BY date
) so;
Here is a similar version running in Postgres.
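As a quick way to check the first-order flag logic, here is a minimal sketch that runs the subquery form of the query against an in-memory SQLite database from Python. SQLite only stands in for Redshift here, so TRUNC(timestamp) becomes date(ts) and the ::int cast becomes a CASE expression. The orders table, its rows, and the column names ts and order_day are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical, simplified orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer_email TEXT, ts TEXT,
                     revenue REAL, discount REAL, shipping REAL, tax REAL);
INSERT INTO orders VALUES
  ('a@example.com', '2020-01-01 09:00:00', 100, 0, 0, 0),
  ('b@example.com', '2020-01-01 10:00:00',  50, 0, 0, 0),
  ('a@example.com', '2020-01-02 11:00:00',  30, 0, 0, 0),
  ('c@example.com', '2020-01-03 12:00:00',  60, 0, 0, 0);
""")

ltv_rows = conn.execute("""
SELECT order_day,
       SUM(net_revenue) OVER (ORDER BY order_day
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
       SUM(num_firsts)  OVER (ORDER BY order_day
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ltv
FROM (SELECT order_day,
             SUM(net_revenue) AS net_revenue,
             -- a customer counts only on the day of their first order
             SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS num_firsts
      FROM (SELECT date(ts) AS order_day,
                   revenue - discount - shipping - tax AS net_revenue,
                   ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY ts) AS seqnum
            FROM orders) o
      GROUP BY order_day) d
ORDER BY order_day
""").fetchall()

for order_day, ltv in ltv_rows:
    print(order_day, ltv)
```

With this sample data the second day adds revenue from a returning customer but no new customer, so the running LTV rises; the third day adds a new customer and the denominator grows.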

Related

How can I group rows in SQL based on a condition

I am using Redshift SQL and would like to group users who have overlapping voucher periods into a single row (showing the minimum start date and the maximum end date).
For example, if I have these records,
I would like to achieve this result using Redshift.
The explanation is that since row 1 and row 2 have overlapping dates, I would like to combine them and get the min(Start_date) and max(End_Date).
I do not really know where to start. I tried using row_number to partition them, but it does not seem to work well. This is what I tried:
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.
This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date before the current date.
Use logic to determine when there is no overlap (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
sum(case when prev_end_date >= start_date then 0 else 1
end) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and current row
) as grp
from (select u.*,
max(end_date) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and 1 preceding
) as prev_end_date
from users u
) u
) u
group by id, grp;
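To see why the cumulative max matters (rather than just looking at the previous row), here is a small sketch that runs the same query shape on an in-memory SQLite database from Python. SQLite stands in for Redshift, and the sample rows are invented: voucher A spans vouchers B and C, so C overlaps A even though it does not overlap the row directly before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, voucher_code TEXT, start_date TEXT, end_date TEXT);
INSERT INTO users VALUES
  (1, 'A', '2020-01-01', '2020-01-20'),
  (1, 'B', '2020-01-05', '2020-01-08'),
  (1, 'C', '2020-01-10', '2020-01-12'),
  (1, 'D', '2020-02-01', '2020-02-05'),
  (2, 'E', '2020-01-01', '2020-01-03');
""")

merged_periods = conn.execute("""
SELECT id, MIN(start_date), MAX(end_date)
FROM (SELECT s1.*,
             -- running count of "new period starts" is the group identifier
             SUM(CASE WHEN prev_end_date >= start_date THEN 0 ELSE 1 END)
               OVER (PARTITION BY id
                     ORDER BY start_date, voucher_code
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
      FROM (SELECT u.*,
                   -- latest end_date over all earlier rows of the same id
                   MAX(end_date)
                     OVER (PARTITION BY id
                           ORDER BY start_date, voucher_code
                           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_end_date
            FROM users u) s1) s2
GROUP BY id, grp
ORDER BY id, MIN(start_date)
""").fetchall()

for row in merged_periods:
    print(row)
```

Vouchers A, B and C collapse into one period because the running maximum end date stays at 2020-01-20; a plain lag(end_date) would have split C off into its own group.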
Another approach would be to use a recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date
Iterate over them calculating group_start_date for each row (rows which have to be merged in final result would have the same group_start_date)
Finally you need to group the CTE by id and group_start_date taking max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS groupStartDate,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date as start_date, MAX(end_date) as end_date FROM Recursion group by id, group_start_date ORDER BY id, group_start_date

Redshift - Group Table based on consecutive rows

I am working right now with this table:
What I want to do is to clean up this table a little bit, grouping some consecutive rows together.
Is there any way to achieve this kind of result?
The first table is already working fine, I just want to get rid of some rows to free some disk space.
One method is to peek at the previous row to see when the value changes. Assuming that valid_to and valid_from are really dates:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
sum(case when prev_valid_to >= valid_from + interval '-1 day' then 0 else 1 end) over (partition by id order by valid_to rows between unbounded preceding and current row) as grp
from (select t.*,
lag(valid_to) over (partition by id, class order by valid_to) as prev_valid_to
from t
) t
) t
group by id, class, grp;
If they are not dates, then this gets trickier. You could convert to dates. Or, you could use the difference of row_numbers:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
row_number() over (partition by id order by valid_from) as seqnum,
row_number() over (partition by id, class order by valid_from) as seqnum_2
from t
) t
group by id, class, (seqnum - seqnum_2)
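The difference-of-row-numbers trick is easier to trust once you see the numbers. The sketch below runs the query above on an in-memory SQLite database from Python; the table contents are invented, and valid_from / valid_to are plain integers to match the "not dates" case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, class TEXT, valid_from INTEGER, valid_to INTEGER);
INSERT INTO t VALUES
  (1, 'A', 1, 1),
  (1, 'A', 2, 2),
  (1, 'B', 3, 3),
  (1, 'A', 4, 4),
  (1, 'A', 5, 5);
""")

collapsed = conn.execute("""
SELECT id, class, MIN(valid_from), MAX(valid_to)
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY valid_from) AS seqnum,
             ROW_NUMBER() OVER (PARTITION BY id, class ORDER BY valid_from) AS seqnum_2
      FROM t) x
-- seqnum - seqnum_2 is constant within a run of consecutive rows of one class
GROUP BY id, class, (seqnum - seqnum_2)
ORDER BY id, MIN(valid_from)
""").fetchall()

for row in collapsed:
    print(row)
```

For class A the difference is 0 for the first run and 1 for the second (after the single B row), which is what keeps the two A runs apart.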

Use a regular aggregative function (sum) alongside a window function

I was reading this tutorial on how to calculate running totals.
Copying the suggested approach I have a query of the form:
select
date,
sum(sales) over (order by date rows unbounded preceding) as cumulative_sales
from sales_table;
This works fine and does what I want - a running total by date.
However, in addition to the running total, I'd also like to add daily sales:
select
date,
sum(sales),
sum(sales) over (order by date rows unbounded preceding) as cumulative_sales
from sales_table
group by 1;
This throws an error:
SYNTAX_ERROR: line 6:8: '"sum"("sales") OVER (ORDER BY "activity_date" ASC ROWS UNBOUNDED PRECEDING)' must be an aggregate expression or appear in GROUP BY clause
How can I calculate both daily total as well as running total?
You can try the following, but it will repeat daily_sales on every row for the same date. This way you don't need to group by the date field.
SELECT date,
SUM(sales) OVER (PARTITION BY DATE) as daily_sales,
SUM(sales) OVER (ORDER BY DATE ROWS UNBOUNDED PRECEDING) as cumulative_sales
FROM sales_table;
Presumably, you intend an aggregation query to begin with:
select date, sum(sales) as daily_sales,
sum(sum(sales)) over (order by date rows unbounded preceding) as cumulative_sales
from sales_table
group by date
order by date;
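The nested sum(sum(sales)) reads oddly at first: the inner sum is the per-date aggregate, and the outer sum is the window function applied to those aggregated rows. Here is a minimal sketch of that on an in-memory SQLite database from Python, assuming SQLite 3.25 or later for window function support. The rows are invented and the column is named sale_date here instead of date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_table (sale_date TEXT, sales INTEGER);
INSERT INTO sales_table VALUES
  ('2020-01-01', 10),
  ('2020-01-01', 5),
  ('2020-01-02', 20),
  ('2020-01-03', 7);
""")

daily_and_running = conn.execute("""
SELECT sale_date,
       SUM(sales) AS daily_sales,                 -- aggregate per date
       SUM(SUM(sales)) OVER (ORDER BY sale_date
                             ROWS UNBOUNDED PRECEDING) AS cumulative_sales
FROM sales_table
GROUP BY sale_date
ORDER BY sale_date
""").fetchall()

for row in daily_and_running:
    print(row)
```

Each output row is one date, with the daily total next to the running total of those daily totals.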

Number of sales relative to historical date in previous year

I have a database containing sales transactions. These are in the following (simplified) format:
sales_id | customer_id | sales_date | number_of_units | total_price
The goal for my query is for each of these transactions, to get the number of sales that this specific customer_id made before the current record, during the whole history of this database, but also during the 365 days before the current record.
Lifetime sales works right now, but the last 365 days part has me stuck. My query right now can identify IF a record had at least one sale in the previous 365 days, and I do it like so:
SELECT sales_id ,customer_id,sales_date,number_of_units,total_price,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sales_date ASC) as 'LifeTimeSales' ,
CASE WHEN DATEDIFF(DAY,sales_date,LAG(sales_date, 1) OVER (PARTITION BY customer_id ORDER BY sales_date ASC)) > -365
THEN 1 ELSE 0 END as 'Last365Sales'
FROM sales_db
+ some non-important WHERE clauses. After which I aggregate the result of this query in some other ways.
But this does not tell me if this purchase is for example the 4th sale in the previous 365 days of a customer.
Note:
This is a query that runs daily on the full database with 6 million records and growing. I drop and recreate this table right now, which is obviously not efficient. Updating the table when new sales come in would be ideal, but right now this is not possible to create. Any ideas?
Some test data:
sales_id,customer_id,sales_date,number_of_units,total_price
1001,2001,2016-01-01,1,86
1002,2001,2016-08-01,3,98
1003,2001,2017-06-01,2,87
1004,2002,2017-06-01,2,15
+ expected result:
sales_id,customer_id,sales_date,number_of_units,total_price,LifeTimeSales,Last365Sales
1001,2001,2016-01-01,1,86,0,0
1002,2001,2016-08-01,3,98,1,1
1003,2001,2017-06-01,2,87,2,1
1004,2002,2017-06-01,2,15,0,0
For the count of sales before a sale you could use correlated subqueries.
SELECT s1.sales_id,
s1.customer_id,
s1.sales_date,
s1.number_of_units,
s1.total_price,
(SELECT count(*)
FROM sales_db s2
WHERE s2.customer_id = s1.customer_id
AND s2.sales_date <= s1.sales_date) - 1 lifetimesales,
(SELECT count(*)
FROM sales_db s2
WHERE s2.customer_id = s1.customer_id
AND s2.sales_date <= s1.sales_date
AND s2.sales_date >= dateadd(day, -365, s1.sales_date)) - 1 last365sales
FROM sales_db s1;
(I used s2.sales_date <= s1.sales_date and then subtracted 1 from the result, so that multiple sales on the same day, if such data exists, are also counted. But as this also counts the sale of the current row, it has to be decremented by 1.)
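As a sanity check, the correlated-subquery approach can be run against the test data from the question. This sketch uses an in-memory SQLite database from Python, so the 365-day lower bound is written with SQLite's date(x, '-365 days') modifier rather than dateadd; that substitution is mine, the rest follows the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_db (sales_id INTEGER, customer_id INTEGER, sales_date TEXT,
                       number_of_units INTEGER, total_price INTEGER);
INSERT INTO sales_db VALUES
  (1001, 2001, '2016-01-01', 1, 86),
  (1002, 2001, '2016-08-01', 3, 98),
  (1003, 2001, '2017-06-01', 2, 87),
  (1004, 2002, '2017-06-01', 2, 15);
""")

prior_sales = conn.execute("""
SELECT s1.sales_id,
       (SELECT count(*)
        FROM sales_db s2
        WHERE s2.customer_id = s1.customer_id
          AND s2.sales_date <= s1.sales_date) - 1 AS lifetimesales,
       (SELECT count(*)
        FROM sales_db s2
        WHERE s2.customer_id = s1.customer_id
          AND s2.sales_date <= s1.sales_date
          -- SQLite equivalent of "365 days before the current sale"
          AND s2.sales_date >= date(s1.sales_date, '-365 days')) - 1 AS last365sales
FROM sales_db s1
ORDER BY s1.sales_id
""").fetchall()

for row in prior_sales:
    print(row)
```

The two counts match the LifeTimeSales and Last365Sales columns of the expected result in the question.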
I would create a report view in which all the required fields are available.
Then select whatever you need from it:
with all_history_statistics as
(select customer_id, sales_id, sales_date, number_of_units, total_price,
max(sales_date) over (partition by customer_id order by (select null)) as last_sale_date,
count(sales_id) over (partition by customer_id order by (select null)) total_number_of_sales,
count(sales_id) over (partition by customer_id order by sales_date asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_of_sales_for_current_date,
sum(number_of_units) over (partition by customer_id order by (select null)) total_number_saled_units,
sum(number_of_units) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_saled_units_for_current_date,
sum(total_price) over (partition by customer_id order by (select null)) as total_earned,
sum(total_price) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) earned_for_current_date
from sales_db),
last_year_statistics as
(select customer_id, sales_id, sales_date, number_of_units, total_price,
max(sales_date) over (partition by customer_id order by (select null)) as last_sale_date,
count(sales_id) over (partition by customer_id order by (select null)) total_number_of_sales,
count(sales_id) over (partition by customer_id order by sales_date asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_of_sales_for_current_date,
sum(number_of_units) over (partition by customer_id order by (select null)) total_number_saled_units,
sum(number_of_units) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_saled_units_for_current_date,
sum(total_price) over (partition by customer_id order by (select null)) as total_earned,
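sum(total_price) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) earned_for_current_date
from sales_db
-- note: as posted there is no date filter here; restrict this CTE to the last 365 days, e.g. where sales_date >= dateadd(day, -365, getdate())
)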
select <specify list of fields which you need>
from all_history_statistics t1 inner join last_year_statistics t2
on t1.customer_id = t2.customer_id
;

Distinct in Window Functions. BigQuery

I'm trying to do something like this in BigQuery
COUNT(DISTINCT user_id) OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS
In other words, I have a table with Date, Userid, Sample and Application ID. I need to count the cumulative number of unique active users for each day starting from the beginning of the month and ending with the current day.
The function works properly without distinct, however, this gives me a total count of users and it's not what I need.
Tried some tricks with dense_rank, however it doesn't work here as well.
Are there any ways to calculate the number of distinct users using window functions?
-------------UPDATED----------------
here is the full query, so you could better understand what I need
with mtd1 as (select
'MonthToDate' as TIMELINE
,fd.date DATE
,td.SAMPLE as SAMPLE
,td.APPNAME as APP_ID
,sum(fd.revenue) as REVENUE
,td.user_id ACTIVE_USERS
from DWH.DailyUser fd
join DWH.Depositors td using (userid)
group by 1,2,3,4,6
),
mtd as (
select TIMELINE
,DATE
,SAMPLE
,APP_ID
,sum(revenue) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as REVENUE
,COUNT(distinct active_users) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS
from mtd1
)
select * from mtd
where extract(day from date) = extract(day from current_date)
group by 1,2,3,4,5,6
Distinct in Window Functions. BigQuery - Are there any ways to calculate the number of distinct users using window functions?
This specific question is a duplicate and already answered here
... here is the full query ...
As for how to apply the above to your particular query, see below (not tested, and based entirely on your code):
#standardSQL
WITH mtd1 AS (
SELECT
'MonthToDate' AS TIMELINE
,fd.date DATE
,td.SAMPLE AS SAMPLE
,td.APPNAME AS APP_ID
,SUM(fd.revenue) AS REVENUE
,td.user_id ACTIVE_USERS
FROM `DWH.DailyUser` fd
JOIN `DWH.Depositors` td USING (userid)
GROUP BY 1,2,3,4,6
), mtd2 AS (
SELECT
TIMELINE
,DATE
,SAMPLE
,APP_ID
,SUM(REVENUE) OVER (PARTITION BY DATE_TRUNC(DATE, MONTH), SAMPLE, APP_ID ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS REVENUE
,ARRAY_AGG(ACTIVE_USERS) OVER (PARTITION BY DATE_TRUNC(DATE, MONTH), SAMPLE, APP_ID ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ACTIVE_USERS
FROM mtd1
), mtd AS (
SELECT * REPLACE((SELECT COUNT(DISTINCT u) FROM UNNEST(ACTIVE_USERS) AS u) AS ACTIVE_USERS)
FROM mtd2
)
SELECT * FROM mtd
WHERE EXTRACT(day FROM DATE) = EXTRACT(day FROM CURRENT_DATE)
GROUP BY 1,2,3,4,5,6
You can use ARRAY_AGG, then count the distinct elements in each array. Note that your query will run out of memory if the arrays end up being too big, though.
with mtd1 as (select
'MonthToDate' as TIMELINE
,fd.date DATE
,td.SAMPLE as SAMPLE
,td.APPNAME as APP_ID
,sum(fd.revenue) as REVENUE
,td.user_id ACTIVE_USERS
from DWH.DailyUser fd
join DWH.Depositors td using (userid)
group by 1,2,3,4,6
),
mtd2 as (
select TIMELINE
,DATE
,SAMPLE
,APP_ID
,sum(revenue) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as REVENUE
,ARRAY_AGG(active_users) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS
from mtd1
), mtd AS (
SELECT * EXCEPT(ACTIVE_USERS),
(SELECT COUNT(DISTINCT u) FROM UNNEST(ACTIVE_USERS) AS u) AS ACTIVE_USERS
FROM mtd2
)
select * from mtd
where extract(day from date) = extract(day from current_date)
group by 1,2,3,4,5,6
One method for implementing count(distinct) uses row_number() and then counts the "1"s:
select SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id ORDER BY date) as Active_Users
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id, user_id ORDER BY DATE) as seqnum
FROM t
) t
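Here is a rough sketch of that row_number() approach on invented data, run on an in-memory SQLite database from Python. SQLite stands in for BigQuery, so DATE_TRUNC(date, month) is approximated with substr(day, 1, 7) on ISO date strings, and the table t with its column day is hypothetical. SELECT DISTINCT collapses the output to one row per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (day TEXT, sample TEXT, app_id TEXT, user_id TEXT);
INSERT INTO t VALUES
  ('2020-01-01', 's', 'app', 'u1'),
  ('2020-01-01', 's', 'app', 'u2'),
  ('2020-01-02', 's', 'app', 'u1'),
  ('2020-01-03', 's', 'app', 'u3'),
  ('2020-02-01', 's', 'app', 'u1');
""")

month_to_date_users = conn.execute("""
SELECT DISTINCT day, sample, app_id,
       -- running count of rows where the user is seen for the first time this month
       SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)
         OVER (PARTITION BY substr(day, 1, 7), sample, app_id
               ORDER BY day) AS active_users
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY substr(day, 1, 7), sample, app_id, user_id
                                ORDER BY day) AS seqnum
      FROM t) x
ORDER BY day
""").fetchall()

for row in month_to_date_users:
    print(row)
```

User u1 returns on 2020-01-02 without raising the January count, and the count restarts in February because the month is part of the partition.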