Number of sales relative to historical date in previous year - sql

I have a database containing sales transactions. These are in the following (simplified) format:
sales_id | customer_id | sales_date | number_of_units | total_price
For each transaction, my query should return the number of sales that this customer_id made before the current record, both over the whole history of the database and during the 365 days before the current record.
Lifetime sales works right now, but the last-365-days part has me stuck. My query can currently identify whether a record had at least one sale in the previous 365 days, like so:
SELECT sales_id ,customer_id,sales_date,number_of_units,total_price,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY sales_date ASC) as 'LifeTimeSales' ,
CASE WHEN DATEDIFF(DAY,sales_date,LAG(sales_date, 1) OVER (PARTITION BY customer_id ORDER BY sales_date ASC)) > -365
THEN 1 ELSE 0 END as 'Last365Sales'
FROM sales_db
+ some non-important WHERE clauses. After which I aggregate the result of this query in some other ways.
But this does not tell me whether this purchase is, for example, the customer's 4th sale in the previous 365 days.
Note:
This is a query that runs daily on the full database with 6 million records and growing. I drop and recreate this table right now, which is obviously not efficient. Updating the table when new sales come in would be ideal, but right now I am not able to build that. Any ideas?
Some test data:
sales_id,customer_id,sales_date,number_of_units,total_price
1001,2001,2016-01-01,1,86
1002,2001,2016-08-01,3,98
1003,2001,2017-06-01,2,87
1004,2002,2017-06-01,2,15
+ expected result:
sales_id,customer_id,sales_date,number_of_units,total_price,LifeTimeSales,Last365Sales
1001,2001,2016-01-01,1,86,0,0
1002,2001,2016-08-01,3,98,1,1
1003,2001,2017-06-01,2,87,2,1
1004,2002,2017-06-01,2,15,0,0

For the count of sales before a sale you could use correlated subqueries.
SELECT s1.sales_id,
s1.customer_id,
s1.sales_date,
s1.number_of_units,
s1.total_price,
(SELECT count(*)
FROM sales_db s2
WHERE s2.customer_id = s1.customer_id
AND s2.sales_date <= s1.sales_date) - 1 lifetimesales,
(SELECT count(*)
FROM sales_db s2
WHERE s2.customer_id = s1.customer_id
AND s2.sales_date <= s1.sales_date
AND s2.sales_date >= dateadd(day, -365, s1.sales_date)) - 1 last365sales
FROM sales_db s1;
(I used s2.sales_date <= s1.sales_date and then subtracted 1 from the result, so that multiple sales on the same day, if such data exists, are also counted. But as this also counts the sale of the current row, it has to be decremented by 1.)
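As a quick check of the correlated-subquery idea against the test data from the question, here is a minimal sketch that runs it in SQLite through Python's sqlite3 module. SQLite is only a stand-in here: it has no dateadd, so date(s1.sales_date, '-365 days') is used instead, and the index name idx_sales_customer_date is an illustrative assumption, not something from the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sales_db (
           sales_id INTEGER PRIMARY KEY,
           customer_id INTEGER,
           sales_date TEXT,          -- ISO dates compare correctly as text
           number_of_units INTEGER,
           total_price INTEGER)"""
)
conn.executemany(
    "INSERT INTO sales_db VALUES (?, ?, ?, ?, ?)",
    [
        (1001, 2001, "2016-01-01", 1, 86),
        (1002, 2001, "2016-08-01", 3, 98),
        (1003, 2001, "2017-06-01", 2, 87),
        (1004, 2002, "2017-06-01", 2, 15),
    ],
)
# Assumption: an index on (customer_id, sales_date) lets each subquery
# seek directly to one customer's date range instead of scanning the table.
conn.execute(
    "CREATE INDEX idx_sales_customer_date ON sales_db (customer_id, sales_date)"
)

query = """
SELECT s1.sales_id,
       (SELECT COUNT(*)
          FROM sales_db s2
         WHERE s2.customer_id = s1.customer_id
           AND s2.sales_date <= s1.sales_date) - 1 AS lifetimesales,
       (SELECT COUNT(*)
          FROM sales_db s2
         WHERE s2.customer_id = s1.customer_id
           AND s2.sales_date <= s1.sales_date
           AND s2.sales_date >= date(s1.sales_date, '-365 days')) - 1 AS last365sales
  FROM sales_db s1
 ORDER BY s1.sales_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The two counts per row line up with the LifeTimeSales and Last365Sales columns of the expected result in the question.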

I would create a report view in which all required fields are available.
Then select whatever you need from it:
with all_history_statistics as
(select customer_id, sales_id, sales_date, number_of_units, total_price,
max(sales_date) over (partition by customer_id order by (select null)) as last_sale_date,
count(sales_id) over (partition by customer_id order by (select null)) total_number_of_sales,
count(sales_id) over (partition by customer_id order by sales_date asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_of_sales_for_current_date,
sum(number_of_units) over (partition by customer_id order by (select null)) total_number_saled_units,
sum(number_of_units) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_saled_units_for_current_date,
sum(total_price) over (partition by customer_id order by (select null)) as total_earned,
sum(total_price) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) earned_for_current_date
from sales_db),
last_year_statistics as
(select customer_id, sales_id, sales_date, number_of_units, total_price,
max(sales_date) over (partition by customer_id order by (select null)) as last_sale_date,
count(sales_id) over (partition by customer_id order by (select null)) total_number_of_sales,
count(sales_id) over (partition by customer_id order by sales_date asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_of_sales_for_current_date,
sum(number_of_units) over (partition by customer_id order by (select null)) total_number_saled_units,
sum(number_of_units) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) number_saled_units_for_current_date,
sum(total_price) over (partition by customer_id order by (select null)) as total_earned,
sum(total_price) over (partition by customer_id order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) earned_for_current_date
from sales_db)
select <specify list of fields which you need>
from all_history_statistics t1 inner join last_year_statistics t2
on t1.customer_id = t2.customer_id
;
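The running counts above use ROWS frames, which count rows rather than days. A 365-day lookback can instead be written as a RANGE frame over a numeric date value. The sketch below does this in SQLite (3.28 or later) through Python's sqlite3, with julianday(sales_date) as the numeric ordering key; that function and the column aliases are choices made for this illustration. Not every SQL dialect accepts a numeric offset in a RANGE frame, so check yours before relying on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sales_db (
           sales_id INTEGER PRIMARY KEY,
           customer_id INTEGER,
           sales_date TEXT,
           number_of_units INTEGER,
           total_price INTEGER)"""
)
conn.executemany(
    "INSERT INTO sales_db VALUES (?, ?, ?, ?, ?)",
    [
        (1001, 2001, "2016-01-01", 1, 86),
        (1002, 2001, "2016-08-01", 3, 98),
        (1003, 2001, "2017-06-01", 2, 87),
        (1004, 2002, "2017-06-01", 2, 15),
    ],
)

# RANGE frames compare ORDER BY values, so "365 PRECEDING" means
# "rows whose day number is at most 365 below the current row's".
# Subtracting 1 removes the current sale itself from each count.
query = """
SELECT sales_id,
       COUNT(*) OVER (PARTITION BY customer_id
                      ORDER BY julianday(sales_date)
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - 1
           AS lifetime_sales,
       COUNT(*) OVER (PARTITION BY customer_id
                      ORDER BY julianday(sales_date)
                      RANGE BETWEEN 365 PRECEDING AND CURRENT ROW) - 1
           AS last_365_sales
  FROM sales_db
 ORDER BY sales_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because RANGE treats rows with the same date as peers, several sales on the same day all count each other, which matches the "<= then subtract 1" behaviour of the correlated-subquery answer.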

Related

Before&After purchase of a product

I have two tables:
orders_product: all the orders. Each line is a product sold with some details about the order in which it was included. So, if the order has more than 1 product, there are more than 1 line for this order.
orders_grouped: each line is an order with some details about this specific order.
I would like to know if there was a previous purchase and a following purchase for each product.
SELECT
product_name,
last_value(product_all_grouped_list) over (partition by ord.customer_id order by created_at asc rows between unbounded preceding and 1 preceding ) as last_order,
last_value(product_all_grouped_list) over (partition by ord.customer_id order by created_at desc rows between unbounded preceding and 1 preceding ) as next_order_products,
last_value(basket_size) over (partition by ord.customer_id order by created_at desc rows between unbounded preceding and 1 preceding ) as next_order_basket_size
FROM
`orders_product` ord
left join `orders_grouped` ordgroup
on ord.order_number=ordgroup.order_number
When the order has only one product (basket_size=1), everything is correct, but when basket_size>1, the result for the first product of the order is OK while the results for the remaining products of that order are wrong.
Can someone help me?
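Because an order can contain several items, and therefore several rows, the window frame has to be defined differently:
use RANGE instead of ROWS in the OVER clause, so that all rows of the current order are skipped together.
A named window can also be declared at the end of the query: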
With tbl as (
Select * from unnest(generate_timestamp_array("2022-09-01","2022-09-15",interval 1 hour)) update_time
)
SELECT
*,
LAST_VALUE(update_time) OVER (ORDER BY update_time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ),
timestamp_diff(update_time,timestamp("1999-01-01"),second) ,
LAST_VALUE(update_time) OVER SETUP_window
FROM
tbl
window SETUP_window as (ORDER BY timestamp_diff(update_time,timestamp("1999-01-01"),second) ASC RANGE BETWEEN UNBOUNDED PRECEDING AND 36000 PRECEDING )
order by update_time desc
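To make the ROWS versus RANGE difference concrete for orders with several lines, here is a small sketch run in SQLite (3.28 or later) via Python's sqlite3. The table order_lines, its integer column order_ts (standing in for an epoch-style timestamp) and the sample data are all made up for this illustration. With RANGE ... AND 1 PRECEDING the frame stops before every row that shares the current order's timestamp, so each line of a multi-product order sees the previous order rather than a sibling line.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE order_lines (
           customer_id INTEGER,
           order_ts INTEGER,               -- hypothetical numeric order time
           product_name TEXT,
           product_all_grouped_list TEXT)"""
)
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?, ?)",
    [
        (1, 1, "p1", "p1,p2"),   # order at ts=1 with two lines
        (1, 1, "p2", "p1,p2"),
        (1, 2, "p3", "p3,p4"),   # order at ts=2 with two lines
        (1, 2, "p4", "p3,p4"),
        (1, 3, "p5", "p5"),      # order at ts=3 with one line
    ],
)

query = """
SELECT product_name,
       LAST_VALUE(product_all_grouped_list) OVER (
           PARTITION BY customer_id ORDER BY order_ts ASC
           RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS previous_order,
       LAST_VALUE(product_all_grouped_list) OVER (
           PARTITION BY customer_id ORDER BY order_ts DESC
           RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS next_order
  FROM order_lines
 ORDER BY product_name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Both lines of the second order (p3 and p4) report "p1,p2" as the previous order, which is the case that goes wrong with a ROWS frame. The offset of 1 assumes distinct orders are at least one unit of order_ts apart.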

How to get min value at max date in sql?

I have a table with snapshot data. It has productid and date and quantity columns. I need to find min value in the max date. Let's say, we have product X: X had the last snapshot at Y date but it has two snapshots at Y with 9 and 8 quantity values. I need to get
product_id | date | quantity
X Y 8
So far I came up with this.
select
productid
, max(snapshot_date) max_date
, min(quantity) min_quantity
from snapshot_table
group by 1
It works, but I don't know why. Why does this not return the min value for each date?
I would use RANK here along with a scalar subquery:
WITH cte AS (
SELECT *, RANK() OVER (ORDER BY quantity) rnk
FROM snapshot_table
WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM snapshot_table)
)
SELECT productid, snapshot_date, quantity
FROM cte
WHERE rnk = 1;
Note that this solution caters to the possibility that two or more records happen to be tied for the lowest quantity among those most recent records.
Edit: We could simplify by doing away with the CTE and instead using the QUALIFY clause for the restriction on the RANK:
SELECT productid, snapshot_date, quantity
FROM snapshot_table
WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM snapshot_table)
QUALIFY RANK() OVER (ORDER BY quantity) = 1;
Also consider the approach below:
select distinct product_id,
max(snapshot_date) over product as max_date,
first_value(quantity) over(product order by snapshot_date desc, quantity) as min_quantity
from your_table
window product as (partition by product_id)
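Use row_number(), ordering by date descending and then by quantity ascending so that the smallest quantity on the latest date gets rn = 1:
with cte as (select *,
row_number() over(partition by product_id order by date desc, quantity asc) rn
from table_name) select * from cte where rn=1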
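The question also asks why the plain GROUP BY query seems to work. MAX(snapshot_date) and MIN(quantity) are computed independently over all rows of a product, so the two values need not come from the same row. The sketch below, run in SQLite via Python's sqlite3 with invented sample data, contrasts that query with a row_number() version that orders by date descending and then quantity ascending.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snapshot_table (productid TEXT, snapshot_date TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO snapshot_table VALUES (?, ?, ?)",
    [
        ("X", "2024-01-01", 5),   # older snapshot with the overall minimum
        ("X", "2024-01-02", 9),
        ("X", "2024-01-02", 8),   # minimum on the latest date
        ("Z", "2024-01-01", 3),
    ],
)

# The two aggregates are independent: min(quantity) looks at every date.
group_by_rows = conn.execute(
    """SELECT productid, MAX(snapshot_date), MIN(quantity)
         FROM snapshot_table
        GROUP BY productid
        ORDER BY productid"""
).fetchall()

# Rank rows per product: latest date first, smallest quantity first.
row_number_rows = conn.execute(
    """WITH cte AS (
           SELECT *,
                  ROW_NUMBER() OVER (PARTITION BY productid
                                     ORDER BY snapshot_date DESC, quantity ASC) AS rn
             FROM snapshot_table)
       SELECT productid, snapshot_date, quantity
         FROM cte
        WHERE rn = 1
        ORDER BY productid"""
).fetchall()

print(group_by_rows)
print(row_number_rows)
```

For product X the GROUP BY query pairs the latest date with quantity 5, which was recorded on an earlier date, while the row_number() query returns 8. The GROUP BY version only looks right when the overall minimum happens to fall on the latest date.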

SQL Rolling LTV (Lifetime Value)

I am trying to get a rolling calculation of customer lifetime value. The basic formula that I am using would be 'SUM(revenue) / COUNT(DISTINCT CUSTOMERS)', but I am running into issues when trying to get those numbers for each day, looking backward from that day. The code below isn't correct; I also tried a PARTITION-based version that didn't work either.
CREATE TEMP TABLE customer_revenue AS
(
SELECT TRUNC(timestamp) AS "order_date", COUNT(DISTINCT customer_email) AS "customers",
SUM(revenue)-SUM(discount)-SUM(shipping)-SUM(tax) AS "revenue"
FROM public.fact_shopify_orders
GROUP BY TRUNC(timestamp)
);
SELECT TRUNC(SO.timestamp) AS "date", SUM(CR.revenue) / COUNT(customers) AS "LTV"
FROM customer_revenue CR
LEFT JOIN public.fact_shopify_orders SO ON CR.order_date = SO.timestamp
WHERE CR.order_date <= SO.timestamp
GROUP BY TRUNC(SO.timestamp)
ORDER BY TRUNC(SO.timestamp) DESC
I think you want rolling sums and count(distinct). The latter is a little tricky but you can emulate it easily using a flag based on the first time the customer is seen:
SELECT date,
( SUM(SUM(net_revenue)) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
SUM(SUM( (seqnum = 1)::int )) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
) as LTV
FROM (SELECT so.*, TRUNC(SO.timestamp) as date,
(revenue - discount - shipping - tax) as net_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY timestamp) as seqnum
FROM public.fact_shopify_orders so
) so
GROUP BY date;
EDIT:
I think Redshift supports window functions with aggregation . . . but there is some database out there that does not. You can try this:
SELECT date,
( SUM(net_revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) /
SUM(num_firsts) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
) as LTV
FROM (SELECT date, SUM(net_revenue) as net_revenue,
SUM( (seqnum = 1)::int ) as num_firsts
FROM (SELECT so.*, TRUNC(SO.timestamp) as date,
(revenue - discount - shipping - tax) as net_revenue,
ROW_NUMBER() OVER (PARTITION BY customer_email ORDER BY timestamp) as seqnum
FROM public.fact_shopify_orders so
) so
GROUP BY date
) so;
Here is a similar version running in Postgres.
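As a small, self-contained check of the first-seen-flag technique, the following sketch runs the two-level variant in SQLite via Python's sqlite3. The table name orders and the sample rows are invented for the example, date(ts) stands in for TRUNC(timestamp), and a CASE expression replaces the ::int cast, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE orders (
           customer_email TEXT, ts TEXT,
           revenue REAL, discount REAL, shipping REAL, tax REAL)"""
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a@example.com", "2024-01-01 09:00:00", 12, 1, 1, 0),  # net 10
        ("b@example.com", "2024-01-01 10:00:00", 20, 0, 0, 0),  # net 20
        ("a@example.com", "2024-01-02 09:00:00", 30, 0, 0, 0),  # repeat customer
        ("c@example.com", "2024-01-03 09:00:00", 45, 5, 0, 0),  # net 40
    ],
)

query = """
SELECT order_date,
       1.0 * SUM(net_revenue) OVER (ORDER BY order_date
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           / SUM(num_firsts) OVER (ORDER BY order_date
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
           AS ltv
  FROM (SELECT order_date,
               SUM(net_revenue) AS net_revenue,
               -- seqnum = 1 marks a customer's first ever order
               SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS num_firsts
          FROM (SELECT date(ts) AS order_date,
                       revenue - discount - shipping - tax AS net_revenue,
                       ROW_NUMBER() OVER (PARTITION BY customer_email
                                          ORDER BY ts) AS seqnum
                  FROM orders) AS per_order
         GROUP BY order_date) AS per_day
 ORDER BY order_date
"""
rows = conn.execute(query).fetchall()
for order_date, ltv in rows:
    print(order_date, round(ltv, 2))
```

Cumulative net revenue is 30, 60 and 100 over the three days, and the cumulative count of first-time customers is 2, 2 and 3, so the repeat order on the second day raises revenue without raising the customer count.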

Distinct in Window Functions. BigQuery

I'm trying to do something like this in BigQuery
COUNT(DISTINCT user_id) OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS
In other words, I have a table with Date, Userid, Sample and Application ID. I need to count the cumulative number of unique active users for each day starting from the beginning of the month and ending with the current day.
The function works properly without distinct, however, this gives me a total count of users and it's not what I need.
Tried some tricks with dense_rank, however it doesn't work here as well.
Are there any ways to calculate the number of distinct users using window functions?
-------------UPDATED----------------
here is the full query, so you could better understand what I need
with mtd1 as (select
'MonthToDate' as TIMELINE
,fd.date DATE
,td.SAMPLE as SAMPLE
,td.APPNAME as APP_ID
,sum(fd.revenue) as REVENUE
,td.user_id ACTIVE_USERS
from DWH.DailyUser fd
join DWH.Depositors td using (userid)
group by 1,2,3,4,6
),
mtd as (
select TIMELINE
,DATE
,SAMPLE
,APP_ID
,sum(revenue) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as REVENUE
,COUNT(distinct active_users) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS
from mtd1
)
select * from mtd
where extract(day from date) = extract(day from current_date)
group by 1,2,3,4,5,6
Distinct in Window Functions. BigQuery - Are there any ways to calculate the number of distinct users using window functions?
This specific question is a duplicate and already answered here
... here is the full query ...
As for how to apply the above to your particular query, see below (not tested, and based entirely on your code):
#standardSQL
WITH mtd1 AS (
SELECT
'MonthToDate' AS TIMELINE
,fd.date DATE
,td.SAMPLE AS SAMPLE
,td.APPNAME AS APP_ID
,SUM(fd.revenue) AS REVENUE
,td.user_id ACTIVE_USERS
FROM `DWH.DailyUser` fd
JOIN `DWH.Depositors` td USING (userid)
GROUP BY 1,2,3,4,6
), mtd2 AS (
SELECT
TIMELINE
,DATE
,SAMPLE
,APP_ID
,SUM(REVENUE) OVER (PARTITION BY DATE_TRUNC(DATE, MONTH), SAMPLE, APP_ID ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS REVENUE
,ARRAY_AGG(ACTIVE_USERS) OVER (PARTITION BY DATE_TRUNC(DATE, MONTH), SAMPLE, APP_ID ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ACTIVE_USERS
FROM mtd1
), mtd AS (
SELECT * REPLACE((SELECT COUNT(DISTINCT u) FROM UNNEST(ACTIVE_USERS) AS u) AS ACTIVE_USERS)
FROM mtd2
)
SELECT * FROM mtd
WHERE EXTRACT(day FROM DATE) = EXTRACT(day FROM CURRENT_DATE)
GROUP BY 1,2,3,4,5,6
You can use ARRAY_AGG, then count the distinct elements in each array. Note that your query will run out of memory if the arrays end up being too big, though.
with mtd1 as (select
'MonthToDate' as TIMELINE
,fd.date DATE
,td.SAMPLE as SAMPLE
,td.APPNAME as APP_ID
,sum(fd.revenue) as REVENUE
,td.user_id ACTIVE_USERS
from DWH.DailyUser fd
join DWH.Depositors td using (userid)
group by 1,2,3,4,6
),
mtd2 as (
select TIMELINE
,DATE
,SAMPLE
,APP_ID
,sum(revenue) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as REVENUE
,ARRAY_AGG(active_users) over (partition by date_trunc(date, month), sample, app_id order by date range BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ACTIVE_USERS
from mtd1
), mtd AS (
SELECT * EXCEPT(ACTIVE_USERS),
(SELECT COUNT(DISTINCT u) FROM UNNEST(ACTIVE_USERS) AS u) AS ACTIVE_USERS
FROM mtd2
)
select * from mtd
where extract(day from date) = extract(day from current_date)
group by 1,2,3,4,5,6
One method for implementing count(distinct) uses row_number() and then counts the "1"s:
select SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id ORDER BY date) as Active_Users
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC(date, month), sample, app_id, user_id ORDER BY DATE) as seqnum
FROM t
) t
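The row_number() trick can be tried out on a tiny data set. The sketch below uses SQLite via Python's sqlite3 rather than BigQuery, so strftime('%Y-%m', ...) stands in for DATE_TRUNC(date, month); the table name activity, its columns and the sample rows are assumptions made for the example. Each user's first appearance within a month gets seqnum = 1, and a running sum of those flags gives the month-to-date distinct count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (activity_date TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?)",
    [
        ("2024-01-01", "u1"),
        ("2024-01-01", "u2"),
        ("2024-01-02", "u1"),   # already seen this month
        ("2024-01-03", "u3"),
        ("2024-02-01", "u1"),   # new month, so counted again
    ],
)

query = """
SELECT DISTINCT activity_date,
       SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY month ORDER BY activity_date) AS active_users
  FROM (SELECT activity_date,
               strftime('%Y-%m', activity_date) AS month,
               user_id,
               ROW_NUMBER() OVER (PARTITION BY strftime('%Y-%m', activity_date), user_id
                                  ORDER BY activity_date) AS seqnum
          FROM activity) AS flagged
 ORDER BY activity_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With an ORDER BY and no explicit frame, the window defaults to a RANGE frame ending at the current row, so all rows of the same day share one running total and DISTINCT collapses them to one row per day.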

Retrieve recent 5 days forecast for each cities with latest issue date

I need to retrieve the most recent 5 days of forecast info for each city.
My table looks like below
The real problem is with the issue date:
a city may have several forecasts for the same date, each with a distinct issue date.
I need to retrieve the 5 most recent records for each city, taking the latest issue date for each forecast date.
I have tried something like the query below, but it does not give the expected result:
SELECT * FROM(
SELECT
ROW_NUMBER () OVER (PARTITION BY CITY_ID ORDER BY FORECAST_DATE DESC, ISSUE_DATE DESC) AS rn,
CITY_ID, FORECAST_DATE, ISSUE_DATE
FROM
FORECAST
GROUP BY FORECAST_DATE
) WHERE rn <= 5
Any suggestion or advice will be helpful
This will get the latest issued forecast per day over the most recent 5 days for each city:
SELECT *
FROM (
SELECT f.*,
DENSE_RANK() OVER ( PARTITION BY city_id ORDER BY forecast_date DESC )
AS forecast_rank,
ROW_NUMBER() OVER ( PARTITION BY city_id, forecast_date ORDER BY issue_date DESC )
AS issue_rn
FROM Forecast f
)
WHERE forecast_rank <= 5
AND issue_rn = 1;
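The combination of DENSE_RANK and ROW_NUMBER can be checked on a small example. The sketch below runs the same query shape in SQLite via Python's sqlite3; the sample rows are invented, and the subquery alias ranked is added only for readability. DENSE_RANK numbers the distinct forecast dates per city from newest to oldest, while ROW_NUMBER picks the latest issue within each forecast date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forecast (city_id INTEGER, forecast_date TEXT, issue_date TEXT)"
)
# City 1: six forecast dates, all issued on 2023-12-30 ...
sample = [(1, f"2024-01-0{day}", "2023-12-30") for day in range(1, 7)]
# ... plus re-issued forecasts for the two latest dates.
sample += [(1, "2024-01-05", "2023-12-31"), (1, "2024-01-06", "2023-12-31")]
# City 2: a single forecast.
sample += [(2, "2024-01-01", "2023-12-30")]
conn.executemany("INSERT INTO forecast VALUES (?, ?, ?)", sample)

query = """
SELECT city_id, forecast_date, issue_date
  FROM (SELECT f.*,
               DENSE_RANK() OVER (PARTITION BY city_id
                                  ORDER BY forecast_date DESC) AS forecast_rank,
               ROW_NUMBER() OVER (PARTITION BY city_id, forecast_date
                                  ORDER BY issue_date DESC) AS issue_rn
          FROM forecast f) AS ranked
 WHERE forecast_rank <= 5
   AND issue_rn = 1
 ORDER BY city_id, forecast_date DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

City 1 returns five rows, one per forecast date from 2024-01-06 back to 2024-01-02, with the re-issued forecast chosen where one exists. Using DENSE_RANK instead of ROW_NUMBER for the date ranking is what keeps duplicate forecast dates from using up the five slots.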
Partition by works like group by but for the function only.
Try
with CTE as
(
select t1.*,
row_number() over (partition by city_id, forecast_date order by issue_date desc) as r_ord
from Forecast t1
)
select CTE.*
from CTE
where r_ord <= 5
Try this
SELECT * FROM(
SELECT
ROW_NUMBER () OVER (PARTITION BY CITY_ID, FORECAST_DATE order by ISSUE_DATE DESC) AS rn,
CITY_ID, FORECAST_DATE, ISSUE_DATE
FROM
FORECAST
) WHERE rn <= 5