How to get difference in value over a sliding time window?

I'm attempting to write a SQL query that returns every product where the most recent price on an order within the last 30 days is different from the most recent price in the previous 30 days, along with the calculated variance. I'm currently using PostgreSQL 11.
Data Model
Right now, the data is structured into three tables: orders, products, and a pivot table, order_product. Here is the simplified version of the table structure:
Orders
id  order_date
1   2022-01-15
2   2022-02-15
3   2022-03-08
Products
id  name
1   Some product
2   Another product
3   Yet another product
Order_Product
order_id  product_id  unit_price
1         1           10
1         2           20
1         3           10
2         1           12
2         2           20
2         3           5
3         1           15
Desired Output
The desired output would be something like the following:
id  name                 order_date  latest_unit_price  previous_unit_price  variance
1   Some product         2022-03-08  15                 10                   5
3   Yet another product  2022-02-15  5                  10                   -5
What I've done so far
I've been able to write a join that combines the Orders and Products via the order_product table, within the 60-day window, which is seemingly the easy part:
SELECT
"products"."id",
"products"."name",
"order_product"."unit_price",
"orders"."order_date"
FROM
products
JOIN order_product ON products.id = order_product.product_id
JOIN orders ON order_product.order_id = orders.id
WHERE
order_date BETWEEN now() - INTERVAL '60 days'
AND now()
I've been trying to work with RANK() and LAG(); however, where I'm getting stuck is ranking the rows within each 30-day time window and then calculating the variance between the two windows.
Any help would be much appreciated!
Update: Added solution
Building off of the answer by D-Shih, I had to tweak this to work based on the time window starting from the current date:
WITH CTE AS (
SELECT
"products"."id",
"products"."name",
"order_product"."unit_price",
"orders"."order_date"
FROM
products
JOIN order_product ON products.id = order_product.product_id
JOIN orders ON order_product.order_id = orders.id
WHERE
order_date BETWEEN now() - INTERVAL '60 days' AND now()
),
CTE2 AS (
SELECT
*,
EXTRACT(DAYS FROM now() - order_date :: timestamp) gap_days
FROM
CTE
),
CTE3 AS (
SELECT
*,
(CASE WHEN gap_days < 30 THEN 1 ELSE 0 END) grp
FROM
CTE2
)
SELECT
id,
name,
MAX(CASE WHEN grp = 1 THEN order_date END) order_date,
MAX(CASE WHEN grp = 1 THEN unit_price END) latest_unit_price,
MAX(CASE WHEN grp = 0 THEN unit_price END) previous_unit_price,
SUM(CASE WHEN grp = 1 THEN unit_price ELSE - unit_price END) variance
FROM
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY ID, grp ORDER BY order_date DESC) rn
FROM
CTE3
) t1
WHERE
rn = 1
GROUP BY
id,
name
HAVING
MAX(CASE WHEN grp = 1 THEN unit_price END) <> MAX(CASE WHEN grp = 0 THEN unit_price END)

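You can use EXTRACT with the LAG window function to get the difference in days between each order_date and the previous order_date for the same product.
Then use a running SUM window function over those gaps to assign each row to a group:
grp = 0: orders within the first 30 days after the product's earliest order in the 60-day window (the earlier period),
grp = 1: orders more than 30 days after that earliest order (the later period, which supplies latest_unit_price).
The query would look like this: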
WITH CTE AS (
SELECT "products"."id",
"products"."name",
"order_product"."unit_price",
"orders"."order_date"
FROM
products
JOIN order_product ON products.id = order_product.product_id
JOIN orders ON order_product.order_id = orders.id
WHERE
order_date BETWEEN now() - INTERVAL '60 days'
AND now()
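), CTE2 AS (
-- gap_days: days between this order and the product's previous order (0 for the first order)
SELECT *,EXTRACT(DAYS FROM order_date - LAG(order_date,1,order_date) OVER(PARTITION BY id ORDER BY order_date)) gap_days
FROM CTE
), CTE3 AS (
-- the running total of gap_days is the number of days since the product's first order in the window
SELECT *,(CASE WHEN SUM(gap_days) OVER(PARTITION BY id ORDER BY order_date) > 30 THEN 1 ELSE 0 END) grp
FROM CTE2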
)
SELECT id,
name,
MAX(CASE WHEN grp = 1 THEN order_date END) order_date,
MAX(CASE WHEN grp = 1 THEN unit_price END) latest_unit_price,
MAX(CASE WHEN grp = 0 THEN unit_price END) previous_unit_price,
SUM(CASE WHEN grp = 1 THEN unit_price ELSE - unit_price END) variance
FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY ID,grp ORDER BY order_date DESC) rn
FROM CTE3
) t1
WHERE rn = 1
GROUP BY id,
name
HAVING MAX(CASE WHEN grp = 1 THEN unit_price END) <> MAX(CASE WHEN grp = 0 THEN unit_price END)
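For anyone who wants to experiment with the two-window idea outside PostgreSQL, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database from Python. Several things here are assumptions and not part of the original posts: SQLite stands in for PostgreSQL, julianday() replaces EXTRACT, the function names make_sample_db and price_changes are made up, and the reference date is passed in as a parameter instead of using now() so the result is reproducible. The variance is computed as a plain subtraction of the two bucket prices.

```python
import sqlite3


def make_sample_db():
    """Build an in-memory database holding the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT);
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE order_product (order_id INTEGER, product_id INTEGER, unit_price INTEGER);
        INSERT INTO orders VALUES (1, '2022-01-15'), (2, '2022-02-15'), (3, '2022-03-08');
        INSERT INTO products VALUES (1, 'Some product'), (2, 'Another product'),
                                    (3, 'Yet another product');
        INSERT INTO order_product VALUES (1, 1, 10), (1, 2, 20), (1, 3, 10),
                                         (2, 1, 12), (2, 2, 20), (2, 3, 5), (3, 1, 15);
    """)
    return conn


def price_changes(conn, as_of):
    """Products whose latest price in the 30 days before as_of differs from
    the latest price in the 30 days before that."""
    query = """
    WITH priced AS (
        SELECT p.id, p.name, op.unit_price, o.order_date,
               CAST(julianday(:as_of) - julianday(o.order_date) AS INTEGER) AS age_days
        FROM products p
        JOIN order_product op ON op.product_id = p.id
        JOIN orders o ON o.id = op.order_id
    ),
    bucketed AS (
        -- grp 1 is the most recent 30 days, grp 0 is the 30 days before that
        SELECT *, CASE WHEN age_days < 30 THEN 1 ELSE 0 END AS grp
        FROM priced
        WHERE age_days BETWEEN 0 AND 59
    ),
    ranked AS (
        -- rn 1 marks the newest order per product inside each bucket
        SELECT *, ROW_NUMBER() OVER (PARTITION BY id, grp ORDER BY order_date DESC) AS rn
        FROM bucketed
    )
    SELECT id, name,
           MAX(CASE WHEN grp = 1 THEN order_date END) AS order_date,
           MAX(CASE WHEN grp = 1 THEN unit_price END) AS latest_unit_price,
           MAX(CASE WHEN grp = 0 THEN unit_price END) AS previous_unit_price,
           MAX(CASE WHEN grp = 1 THEN unit_price END)
             - MAX(CASE WHEN grp = 0 THEN unit_price END) AS variance
    FROM ranked
    WHERE rn = 1
    GROUP BY id, name
    HAVING MAX(CASE WHEN grp = 1 THEN unit_price END)
        <> MAX(CASE WHEN grp = 0 THEN unit_price END)
    ORDER BY id
    """
    return conn.execute(query, {"as_of": as_of}).fetchall()


if __name__ == "__main__":
    for row in price_changes(make_sample_db(), "2022-03-10"):
        print(row)
```

With the assumed reference date of 2022-03-10, this returns the two rows shown in the desired output (products 1 and 3). Product 2 drops out because its price is 20 in both windows, and a product with orders in only one window is excluded because the HAVING comparison against NULL is not true.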

Related

Presto SQL to find number of transactions in the year before the current transaction

I have a (simplified) transaction table of customer and order date. For each row/order I want to find the number of orders the year before the current order. I can do this with a self join, but when my transactions table is far bigger, it gets inefficient. I think I really want to use a window function with range between on the date field, but this isn't implemented in Presto yet. Any ideas of how I can do this more efficiently?
with
transactions as (
select
1 as customer,
date '2020-01-01' as order_date
union all
select
1 as customer,
date '2020-01-26' as order_date
union all
select
1 as customer,
date '2020-02-01' as order_date
union all
select
1 as customer,
date '2020-02-02' as order_date
)
select
t1.*,
count(case when t2.order_date between date_add('day', -14, t1.order_date) and date_add('day', -1, t1.order_date) then t2.order_date else null end) as orders_14_days_before
from
transactions t1
left join
transactions t2 on t1.customer = t2.customer
group by
t1.customer,
t1.order_date
Result:
customer order_date orders_14_days_before
1 2020-01-01 0
1 2020-01-26 0
1 2020-02-01 1
1 2020-02-02 2
Presto does not seem to fully support the range window specification, so you can do this another way: by counting "ins and outs", where each order adds 1 on its own date and subtracts 1 on the date it leaves the window:
with cd as (
select customer, order_date as dte, 1 as inc
from transactions
union all
select customer, order_date + interval '1' year, -1 inc
from transactions
)
select t.*, cd.one_year_count
from (select customer, dte,
sum(sum(inc)) over (partition by customer order by dte) as one_year_count
from cd
group by customer, dte -- group on the event date column, dte
) cd join
transactions t
on cd.customer = t.customer and cd.dte = t.order_date; -- match on customer as well as date
You should find that this is much faster.
Thanks to Gordon Linoff's answer above, I tweaked it to get the correct answer (at least in Athena). You don't need the sum(sum()) over ..., just sum() over ... is sufficient.
with
transactions as (
select
1 as customer,
date '2020-01-01' as order_date
union all
select
1 as customer,
date '2020-01-26' as order_date
union all
select
1 as customer,
date '2020-02-01' as order_date
union all
select
1 as customer,
date '2020-02-02' as order_date
),
cd as (
select
customer,
order_date as dte,
1 as inc
from
transactions
union all
select
customer,
order_date + interval '13' day,
-1 inc
from
transactions
),
cd2 as (
select
customer,
dte,
inc,
sum(inc) over (partition by customer order by dte rows between unbounded preceding and 1 preceding) as one_year_count
from
cd
)
select
t.*,
coalesce(cd2.one_year_count, 0) as one_year_count
from
cd2
inner join
transactions t
on cd2.dte = t.order_date
where
cd2.inc = 1
order by
2 asc
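To make the "ins and outs" idea easier to try, here is a small sketch that runs it on SQLite from Python. It is a variation under stated assumptions, not the exact queries above: SQLite stands in for Presto/Athena, the function name orders_in_prior_days is made up, each order enters the count one day after its date and leaves it window_days + 1 days after, and an extra zero-increment event is added on each order date so the join back to transactions is a plain equality join.

```python
import sqlite3


def orders_in_prior_days(order_rows, window_days=14):
    """For each order, count the same customer's orders in the window_days
    days before it (excluding the order's own date)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (customer INTEGER, order_date TEXT)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", order_rows)
    query = """
    WITH events AS (
        -- zero-weight marker so every order date has a row to join back to
        SELECT customer, order_date AS dte, 0 AS inc FROM transactions
        UNION ALL
        -- the order starts counting for dates after its own
        SELECT customer, date(order_date, '+1 day'), 1 FROM transactions
        UNION ALL
        -- and stops counting once it is older than the window
        SELECT customer, date(order_date, :expire), -1 FROM transactions
    ),
    per_day AS (
        SELECT customer, dte, SUM(inc) AS net
        FROM events
        GROUP BY customer, dte
    ),
    running AS (
        SELECT customer, dte,
               SUM(net) OVER (PARTITION BY customer ORDER BY dte) AS active
        FROM per_day
    )
    SELECT t.customer, t.order_date, r.active
    FROM transactions t
    JOIN running r ON r.customer = t.customer AND r.dte = t.order_date
    ORDER BY t.customer, t.order_date
    """
    expire = f"+{window_days + 1} days"
    return conn.execute(query, {"expire": expire}).fetchall()


if __name__ == "__main__":
    sample = [(1, "2020-01-01"), (1, "2020-01-26"), (1, "2020-02-01"), (1, "2020-02-02")]
    for row in orders_in_prior_days(sample):
        print(row)
```

On the four sample orders this gives counts of 0, 0, 1 and 2, the same as the self-join result in the question. Grouping the events per day before taking the running sum plays the same role as the sum(sum(inc)) pattern in the first answer.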

I am looking to find customers repurchase frequency in SQL from their first purchase date

I am trying to find customers' repurchase rates from their first order date. For example, for 2016, how many customers purchased once in days 1-365 after their initial purchase, how many purchased twice, and so on.
I have a transaction_detail table which looks like below:
txn_date Customer_ID Transaction_Number Sales
1/2/2019 1 12345 $10
4/3/2018 1 65890 $20
3/22/2019 3 64453 $30
4/3/2019 4 88567 $20
5/21/2019 4 85446 $15
1/23/2018 5 89464 $40
4/3/2019 5 99674 $30
4/3/2019 6 32224 $20
1/23/2018 6 46466 $30
1/20/2018 7 56558 $30
I am able to find the customers who shopped in 2016 and how many times they repurchased in 2016, but I need to find the customers who shopped in 2016 and how many times they came back, counted from their first purchase date.
I need a starting point for the query; I am not sure how to build this logic in my SQL code.
Any help would be appreciated.
I am using the below query:
WITH by_year
AS (SELECT
Customer_ID,
to_char(txn_date, 'YYYY') AS visit_year
FROM table
GROUP BY Customer_ID, to_char(txn_date, 'YYYY')),
with_first_year
AS (SELECT
Customer_ID,
visit_year,
FIRST_VALUE(visit_year) OVER (PARTITION BY Customer_ID ORDER BY visit_year) AS first_year
FROM by_year),
with_year_number
AS (SELECT
Customer_ID,
visit_year,
first_year,
(visit_year - first_year) AS year_number
FROM with_first_year)
SELECT
first_year AS first_year,
SUM(CASE WHEN year_number = 0 THEN 1 ELSE 0 END) AS year_0,
SUM(CASE WHEN year_number = 1 THEN 1 ELSE 0 END) AS year_1,
SUM(CASE WHEN year_number = 2 THEN 1 ELSE 0 END) AS year_2,
SUM(CASE WHEN year_number = 3 THEN 1 ELSE 0 END) AS year_3,
SUM(CASE WHEN year_number = 4 THEN 1 ELSE 0 END) AS year_4,
SUM(CASE WHEN year_number = 5 THEN 1 ELSE 0 END) AS year_5,
SUM(CASE WHEN year_number = 6 THEN 1 ELSE 0 END) AS year_6,
SUM(CASE WHEN year_number = 7 THEN 1 ELSE 0 END) AS year_7,
SUM(CASE WHEN year_number = 8 THEN 1 ELSE 0 END) AS year_8,
SUM(CASE WHEN year_number = 9 THEN 1 ELSE 0 END) AS year_9
FROM with_year_number
GROUP BY first_year
ORDER BY first_year
Use window functions and aggregation:
select cnt, count(*), min(customer_id), max(customer_id)
from (select customer_id, count(*) as cnt
from (select td.*,
min(txn_date) over (partition by Customer_ID) as min_txn_date
from transaction_detail td
) td
where txn_date >= min_txn_date and txn_date < min_txn_date + interval '365' day
group by customer_id
) c
group by cnt
order by cnt;
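As a rough check of the query above, here is a sketch that runs the same logic on the sample rows from the question using SQLite from Python. The assumptions: SQLite stands in for the asker's database, the dates are rewritten in ISO format (YYYY-MM-DD), julianday() arithmetic replaces the interval expression, the first-year filter for 2016 is left out because the sample data has no 2016 rows, and the function name repurchase_histogram is made up.

```python
import sqlite3


def repurchase_histogram(rows):
    """Return (purchase_count, number_of_customers) pairs, counting only
    purchases made within 365 days of each customer's first purchase."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transaction_detail (txn_date TEXT, customer_id INTEGER)")
    conn.executemany("INSERT INTO transaction_detail VALUES (?, ?)", rows)
    query = """
    SELECT cnt, COUNT(*) AS customers
    FROM (
        SELECT customer_id, COUNT(*) AS cnt
        FROM (
            -- attach each customer's first purchase date to every row
            SELECT td.*, MIN(txn_date) OVER (PARTITION BY customer_id) AS min_txn_date
            FROM transaction_detail td
        )
        WHERE julianday(txn_date) - julianday(min_txn_date) < 365
        GROUP BY customer_id
    )
    GROUP BY cnt
    ORDER BY cnt
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    sample = [
        ("2019-01-02", 1), ("2018-04-03", 1), ("2019-03-22", 3), ("2019-04-03", 4),
        ("2019-05-21", 4), ("2018-01-23", 5), ("2019-04-03", 5), ("2019-04-03", 6),
        ("2018-01-23", 6), ("2018-01-20", 7),
    ]
    print(repurchase_histogram(sample))
```

On the sample data, customers 3, 5, 6 and 7 have one purchase inside their first 365 days and customers 1 and 4 have two, so the result is [(1, 4), (2, 2)]. Customers 5 and 6 each have a second purchase, but it falls 435 days after the first and is not counted.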
As I understand it, you want the count of distinct customers who first purchased in 2016 and repurchased one year or more after that first purchase.
Select * from
(
Select customer_id,
-- months_between(later, earlier) is positive, so the next purchase date goes first
Floor(months_between(lead_txn_date, txn_date)/12) as num_years
From
(
Select customer_id,
txn_date,
row_number() over (partition by Customer_ID order by txn_date) as rn,
lead(txn_date) over (partition by Customer_ID order by txn_date) as lead_txn_date
From your_table
)
Where txn_date >= date '2016-01-01'
and txn_date < date '2017-01-01'
and rn = 1
And months_between(lead_txn_date, txn_date) >= 12
)
Pivot
(
-- the pivot column must match the alias num_years defined above
Count(1) for num_years in (1,2,3,4)
)
Ultimately, we are finding the number of years between first and second purchase of the customer. And first purchase must be in 2016.
Cheers!!

How to do filter for a field generated by using MAX(CASE WHEN ... END)?

Using the query below on data from January through May, I have successfully retrieved the first and second purchase for each customer.
SELECT
MAX(CASE WHEN row_num = 1 THEN month END) AS month,
customer_id,
1 AS row_num,
DATE_DIFF(MAX(CASE WHEN row_num = 2 THEN verified_date END),
MAX(CASE WHEN row_num = 1 THEN verified_date END), DAY) AS difference
FROM yourTable
GROUP BY
customer_id;
Now I would like to filter by month to get all users whose FIRST transaction is in Jan - Apr and whose SECOND transaction is at any time (Jan - May), so I tried this query:
SELECT
MAX(CASE WHEN row_num = 1 AND month IN (1,2,3,4) THEN month END) AS month,
customer_id,
1 AS row_num,
DATE_DIFF(MAX(CASE WHEN row_num = 2 THEN verified_date END),
MAX(CASE WHEN row_num = 1 THEN verified_date END), DAY) AS difference
FROM yourTable
GROUP BY
customer_id;
The query runs successfully; however, it returns NULL in the month field in addition to months 1, 2, 3, and 4.
Why is there a NULL in it?
Thank you
Assuming there's a row_num in yourTable that orders transactions of each customer chronologically, MAX(CASE WHEN row_num = 1 AND month IN (1,2,3,4) THEN month END) AS month, will end up null when the row with row_num = 1 has a month different from 1, 2, 3, or 4, i.e. "first transaction is in May".
To filter, use HAVING MAX(CASE WHEN row_num = 1 AND month IN (1,2,3,4) THEN month END) is not null.
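Here is a small sketch of the difference, run on SQLite from Python. The rows are made up for illustration (customer A's first transaction is in February, customer B's is in May), SQLite stands in for BigQuery, and julianday() subtraction replaces DATE_DIFF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourTable (customer_id TEXT, row_num INTEGER, month INTEGER, verified_date TEXT);
    -- hypothetical rows: A starts in February, B starts in May
    INSERT INTO yourTable VALUES
        ('A', 1, 2, '2020-02-10'), ('A', 2, 3, '2020-03-01'),
        ('B', 1, 5, '2020-05-03'), ('B', 2, 5, '2020-05-20');
""")

BASE = """
SELECT MAX(CASE WHEN row_num = 1 AND month IN (1,2,3,4) THEN month END) AS month,
       customer_id,
       CAST(julianday(MAX(CASE WHEN row_num = 2 THEN verified_date END))
          - julianday(MAX(CASE WHEN row_num = 1 THEN verified_date END)) AS INTEGER) AS difference
FROM yourTable
GROUP BY customer_id
"""
# The CASE only changes what is displayed; HAVING is what removes whole groups.
HAVING = " HAVING MAX(CASE WHEN row_num = 1 AND month IN (1,2,3,4) THEN month END) IS NOT NULL"
ORDER = " ORDER BY customer_id"

unfiltered = conn.execute(BASE + ORDER).fetchall()
filtered = conn.execute(BASE + HAVING + ORDER).fetchall()

print(unfiltered)  # customer B is still present, with month = None
print(filtered)    # only customer A remains
```

Without HAVING, customer B still produces a group, and the CASE expression yields NULL for it because no row satisfies both conditions. Adding HAVING drops that group entirely.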

Additional condition within partition over

https://www.db-fiddle.com/f/rgLXTu3VysD3kRwBAQK3a4/3
My problem is that I want the ROW_NUMBER() OVER (PARTITION BY ...) numbering to start counting rows only from a certain time range.
In this example, if I add rn = 1 at the end, order_id = 5 is excluded from the results (because the partition is ordered by paid_date and order_id = 6 has an earlier date), but it shouldn't be, because I want the time range for the partition to start from '2019-01-10'.
With the condition rn = 1 added, the expected output is order_id 3, 5, 11, 15; currently it is only 3, 11, 15.
It should include only orders with is_paid = 0 that are the first within the given time range (if there is a preceding order with is_paid = 1, it shouldn't be counted).
Use a correlated subquery with NOT EXISTS:
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY paid_date,order_id) rn
FROM orders o
WHERE paid_date between '2019-01-10'
and '2019-01-15'
) x where rn=1 and not exists (select 1 from orders o1 where x.order_id=o1.order_id
and is_paid=1)
OUTPUT:
order_id customer_id amount is_paid paid_date rn
3 101 30 0 10/01/2019 00:00:00 1
5 102 15 0 10/01/2019 00:00:00 1
11 104 31 0 10/01/2019 00:00:00 1
15 105 11 0 10/01/2019 00:00:00 1
If priority should be given to order_id, put it before paid_date in the ORDER BY clause of the window function; this will solve your issue.
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY order_id,paid_date) rn
FROM orders o
) x WHERE is_paid = 0 and paid_date between
'2019-01-10' and '2019-01-15' and rn=1
Since you need paid_date to be ordered first, you need to apply the WHERE condition inside the subquery, before ROW_NUMBER() is computed, so that rows outside the date range do not affect the numbering.
SELECT order_id, customer_id, amount, is_paid, paid_date, rn FROM (
SELECT o.*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY paid_date, order_id) rn
FROM orders o
where paid_date between '2019-01-10' and '2019-01-15'
) x WHERE is_paid = 0 and rn=1
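The effect of where the date filter sits can be seen with a tiny sketch on SQLite from Python. The three rows below are hypothetical and only mimic the situation described in the question (order 6 for customer 102 is dated before the range); they are not the data from the db-fiddle link.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER,
                         is_paid INTEGER, paid_date TEXT);
    -- hypothetical rows: order 6 falls before the range and precedes order 5
    INSERT INTO orders VALUES
        (3, 101, 30, 0, '2019-01-10'),
        (5, 102, 15, 0, '2019-01-10'),
        (6, 102, 20, 1, '2019-01-05');
""")

# Date filter applied AFTER numbering: order 6 takes rn = 1 for customer 102.
FILTER_OUTSIDE = """
SELECT order_id FROM (
    SELECT o.*, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY paid_date, order_id) AS rn
    FROM orders o
) x
WHERE is_paid = 0 AND paid_date BETWEEN '2019-01-10' AND '2019-01-15' AND rn = 1
ORDER BY order_id
"""

# Date filter applied BEFORE numbering: only in-range rows are numbered.
FILTER_INSIDE = """
SELECT order_id FROM (
    SELECT o.*, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY paid_date, order_id) AS rn
    FROM orders o
    WHERE paid_date BETWEEN '2019-01-10' AND '2019-01-15'
) x
WHERE is_paid = 0 AND rn = 1
ORDER BY order_id
"""

outside = [r[0] for r in conn.execute(FILTER_OUTSIDE)]
inside = [r[0] for r in conn.execute(FILTER_INSIDE)]
print(outside, inside)
```

With the filter outside, order 5 gets rn = 2 and is lost, so only order 3 comes back. With the filter inside the subquery, order 6 never enters the numbering and both orders 3 and 5 are returned.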

Row number custom coded in sql

I am using BigQuery #standardsql to work on a table. The table should flag a conversion (1) for a user who purchased something in both month 9 and month 10. A user who did not purchase in month 10 should have only 0 in their rows.
So far, this is the query for custom_coded:
(case when row_number()
over (partition by customer_id order by purchase_date asc) =
count(*) over (partition by customer_id)
then 1 else 0 END) AS custom_coded
and this is the result so far
What I expect is that customer_id = 288 has only 0 in custom_coded, since he did not purchase in the next month (month 10), and that customer_id = 879 has 1 on his latest purchase_date, since he has a purchase record in month 10.
This is the expected result
I previously asked in this thread (Decode maximum number in rows for sql); however, that dataset didn't fit the analysis I'm going to run.
Below is for BigQuery Standard SQL
#standardSQL
SELECT customer_id, item_purchased, purchase_date,
(CASE WHEN
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY purchase_date ASC) =
COUNT(*) OVER (PARTITION BY customer_id)
AND SUM(DISTINCT (CASE FORMAT_DATE('%Y%m', purchase_date)
WHEN '201709' THEN 1 WHEN '201710' THEN 2 ELSE 0 END))
OVER(PARTITION BY customer_id) = 3
THEN 1 ELSE 0
END) AS custom_coded
FROM `project.dataset.table`
You can test / play with above using dummy data from your question
#standardSQL
WITH `project.dataset.table` AS (
SELECT 288 customer_id, 'Rice' item_purchased, DATE '2017-09-02' purchase_date UNION ALL
SELECT 288, 'Rice', DATE '2017-09-02' UNION ALL
SELECT 288, 'Rice', DATE '2017-09-06' UNION ALL
SELECT 879, 'Plate', DATE '2017-09-01' UNION ALL
SELECT 879, 'Plate', DATE '2017-09-25' UNION ALL
SELECT 879, 'Plate', DATE '2017-10-25' UNION ALL
SELECT 879, 'Plate', DATE '2017-10-27'
)
SELECT customer_id, item_purchased, purchase_date,
(CASE WHEN
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY purchase_date ASC) =
COUNT(*) OVER (PARTITION BY customer_id)
AND SUM(DISTINCT (CASE FORMAT_DATE('%Y%m', purchase_date)
WHEN '201709' THEN 1 WHEN '201710' THEN 2 ELSE 0 END))
OVER(PARTITION BY customer_id) = 3
THEN 1 ELSE 0
END) AS custom_coded
FROM `project.dataset.table`
ORDER BY customer_id, purchase_date
result is
customer_id item_purchased purchase_date custom_coded
288 Rice 2017-09-02 0
288 Rice 2017-09-02 0
288 Rice 2017-09-06 0
879 Plate 2017-09-01 0
879 Plate 2017-09-25 0
879 Plate 2017-10-25 0
879 Plate 2017-10-27 1
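If you want to try the same rule without BigQuery, here is a sketch on SQLite from Python. Note that it swaps in a different technique: SQLite does not accept DISTINCT inside a window aggregate, so instead of SUM(DISTINCT ...) = 3 it checks the two months separately with MAX(...) OVER, which should be equivalent for this rule. The table name purchases and the function name custom_coded are made up for the example.

```python
import sqlite3


def custom_coded(rows):
    """Flag a customer's last purchase with 1 only when the customer bought
    in both 2017-09 and 2017-10; every other row gets 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (customer_id INTEGER, item_purchased TEXT, purchase_date TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    query = """
    SELECT customer_id, purchase_date,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY purchase_date)
                     = COUNT(*) OVER (PARTITION BY customer_id)
                 -- a comparison evaluates to 0 or 1, so MAX is 1 if any row matches
                 AND MAX(strftime('%Y%m', purchase_date) = '201709')
                     OVER (PARTITION BY customer_id) = 1
                 AND MAX(strftime('%Y%m', purchase_date) = '201710')
                     OVER (PARTITION BY customer_id) = 1
                THEN 1 ELSE 0 END AS custom_coded
    FROM purchases
    ORDER BY customer_id, purchase_date
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    sample = [
        (288, "Rice", "2017-09-02"), (288, "Rice", "2017-09-02"), (288, "Rice", "2017-09-06"),
        (879, "Plate", "2017-09-01"), (879, "Plate", "2017-09-25"),
        (879, "Plate", "2017-10-25"), (879, "Plate", "2017-10-27"),
    ]
    for row in custom_coded(sample):
        print(row)
```

On the dummy data this flags only the 2017-10-27 row of customer 879, matching the result table above. Customer 288 has no October purchase, so even the last row stays 0.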