Presto SQL to find number of transactions in the year before the current transaction - sql

I have a (simplified) transaction table of customer and order date. For each row/order I want to find the number of orders the year before the current order. I can do this with a self join, but when my transactions table is far bigger, it gets inefficient. I think I really want to use a window function with range between on the date field, but this isn't implemented in Presto yet. Any ideas of how I can do this more efficiently?
with
transactions as (
select
1 as customer,
date '2020-01-01' as order_date
union all
select
1 as customer,
date '2020-01-26' as order_date
union all
select
1 as customer,
date '2020-02-01' as order_date
union all
select
1 as customer,
date '2020-02-02' as order_date
)
select
t1.*,
count(case when t2.order_date between date_add('day', -14, t1.order_date) and date_add('day', -1, t1.order_date) then t2.order_date else null end) as orders_14_days_before
from
transactions t1
left join
transactions t2 on t1.customer = t2.customer
group by
t1.customer,
t1.order_date
Result:
customer order_date orders_14_days_before
1 2020-01-01 0
1 2020-01-26 0
1 2020-02-01 1
1 2020-02-02 2

Presto does not seem to fully support the range window specification, so you can do this another way, by counting ins and outs:
with cd as (
select customer, order_date as dte, 1 as inc
from transactions
union all
select customer, order_date + interval '1' year, -1 inc
from transactions
)
select t.*, cd.one_year_count
from (select customer, dte,
sum(sum(inc)) over (partition by customer order by dte) as one_year_count
from cd
group by customer, dte
) cd join
transactions t
on cd.dte = t.order_date and cd.customer = t.customer;
You should find that this is much faster.
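To make the ins-and-outs idea concrete, here is a minimal Python sketch of the same sweep done outside the database. It is an illustration only, not Presto code: the function name `orders_in_prior_window` and the `window_days` parameter are made up for this example. Each order adds +1 on its own date and -1 once it has fallen out of the look-back window, and a running sum over the sorted events gives the number of earlier orders still inside the window. It follows the rule from the self-join in the question (from `window_days` days before up to 1 day before).

```python
from datetime import date, timedelta


def orders_in_prior_window(order_dates, window_days):
    """For each order date t, count orders d with t - window_days <= d <= t - 1."""
    # Sort rank within one day: expire old orders, read the count, then add new ones.
    OUT, QUERY, IN = 0, 1, 2
    events = []
    for i, d in enumerate(order_dates):
        events.append((d, QUERY, i))  # read the running count for this order
        events.append((d, IN, i))     # then start counting this order
        # The order stops counting on the first day it is outside the window.
        events.append((d + timedelta(days=window_days + 1), OUT, i))

    running = 0
    counts = [0] * len(order_dates)
    for _, kind, i in sorted(events):
        if kind == OUT:
            running -= 1
        elif kind == QUERY:
            counts[i] = running
        else:
            running += 1
    return counts


# Sample data from the question (one customer).
sample = [date(2020, 1, 1), date(2020, 1, 26), date(2020, 2, 1), date(2020, 2, 2)]
print(orders_in_prior_window(sample, 14))  # [0, 0, 1, 2]
```

The work is one sort plus one pass over three events per order, which is why this pattern tends to scale better than comparing every pair of orders for a customer.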

Thanks to Gordon Linoff's answer above, I tweaked it to get the correct answer (at least in Athena). You don't need the sum(sum()) over ..., just sum() over ... is sufficient.
with
transactions as (
select
1 as customer,
date '2020-01-01' as order_date
union all
select
1 as customer,
date '2020-01-26' as order_date
union all
select
1 as customer,
date '2020-02-01' as order_date
union all
select
1 as customer,
date '2020-02-02' as order_date
),
cd as (
select
customer,
order_date as dte,
1 as inc
from
transactions
union all
select
customer,
order_date + interval '13' day,
-1 inc
from
transactions
),
cd2 as (
select
customer,
dte,
inc,
sum(inc) over (partition by customer order by dte rows between unbounded preceding and 1 preceding) as one_year_count
from
cd
)
select
t.*,
coalesce(cd2.one_year_count, 0) as one_year_count
from
cd2
inner join
transactions t
on cd2.dte = t.order_date and cd2.customer = t.customer
where
cd2.inc = 1
order by
2 asc

Related

How to get difference in value over a sliding time window?

I'm attempting to write a SQL query which returns every product where the most recent price on an order within the last 30 days is different than the most recent price in the previous 30 days, and that calculated variance. I'm currently using PostgreSQL 11.
Data Model
Right now, the data is structured into three tables: orders, products, and a pivot table, order_product. Here is the simplified version of the table structure:
Orders
id | order_date
1 | 2022-01-15
2 | 2022-02-15
3 | 2022-03-08
Products
id | name
1 | Some product
2 | Another product
3 | Yet another product
Order_Product
order_id | product_id | unit_price
1 | 1 | 10
1 | 2 | 20
1 | 3 | 10
2 | 1 | 12
2 | 2 | 20
2 | 3 | 5
3 | 1 | 15
Desired Output
The desired output would be something like the following:
id | name | order_date | latest_unit_price | previous_unit_price | variance
1 | Some product | 2022-03-08 | 15 | 10 | 5
3 | Yet another product | 2022-02-15 | 5 | 10 | -5
What I've done so far
I've been able to write a join that combines the Orders and Products via the order_product table, within the 60-day window, which is seemingly the easy part:
SELECT
"products"."id",
"products"."name",
"order_product"."unit_price",
"orders"."order_date"
FROM
products
JOIN order_product ON products.id = order_product.product_id
JOIN orders ON order_product.order_id = orders.id
WHERE
order_date BETWEEN now() - INTERVAL '60 days'
AND now()
I've been trying to work with RANK() and LAG(); however, where I'm getting stuck is ranking the rows within each 30-day time window and then calculating the variance between the two windows.
Any help would be much appreciated!
Update: Added solution
Building off of the answer by D-Shih, I had to tweak this to work based on the time window starting from the current date:
WITH CTE AS (
SELECT
"products"."id",
"products"."name",
"order_product"."unit_price",
"orders"."order_date"
FROM
products
JOIN order_product ON products.id = order_product.product_id
JOIN orders ON order_product.order_id = orders.id
WHERE
order_date BETWEEN now() - INTERVAL '60 days' AND now()
),
CTE2 AS (
SELECT
*,
EXTRACT(DAYS FROM now() - order_date :: timestamp) gap_days
FROM
CTE
),
CTE3 AS (
SELECT
*,
(CASE WHEN gap_days < 30 THEN 1 ELSE 0 END) grp
FROM
CTE2
)
SELECT
id,
name,
MAX(CASE WHEN grp = 1 THEN order_date END) order_date,
MAX(CASE WHEN grp = 1 THEN unit_price END) latest_unit_price,
MAX(CASE WHEN grp = 0 THEN unit_price END) previous_unit_price,
SUM(CASE WHEN grp = 1 THEN unit_price ELSE - unit_price END) variance
FROM
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY ID, grp ORDER BY order_date DESC) rn
FROM
CTE3
) t1
WHERE
rn = 1
GROUP BY
id,
name
HAVING
MAX(CASE WHEN grp = 1 THEN unit_price END) <> MAX(CASE WHEN grp = 0 THEN unit_price END)
You can use EXTRACT with the LAG window function to get the number of days between each order_date and the previous order_date for the same product.
Then use a running SUM window function over those gaps to assign each row to a group:
grp = 0: orders within the first 30 days after the product's earliest order in the 60-day window (the previous period)
grp = 1: orders more than 30 days after that earliest order (the most recent period)
The query would look like this:
WITH CTE AS (
SELECT "products"."id",
"products"."name",
"order_product"."unit_price",
"orders"."order_date"
FROM
products
JOIN order_product ON products.id = order_product.product_id
JOIN orders ON order_product.order_id = orders.id
WHERE
order_date BETWEEN now() - INTERVAL '60 days'
AND now()
), CTE2 AS (
SELECT *,EXTRACT(DAYS FROM order_date - LAG(order_date,1,order_date) OVER(PARTITION BY id ORDER BY order_date)) gap_days
FROM CTE
), CTE3 AS (
SELECT *,(CASE WHEN SUM(gap_days) OVER(PARTITION BY id ORDER BY order_date) > 30 THEN 1 ELSE 0 END) grp
FROM CTE2
)
SELECT id,
name,
MAX(CASE WHEN grp = 1 THEN order_date END) order_date,
MAX(CASE WHEN grp = 1 THEN unit_price END) latest_unit_price,
MAX(CASE WHEN grp = 0 THEN unit_price END) previous_unit_price,
SUM(CASE WHEN grp = 1 THEN unit_price ELSE - unit_price END) variance
FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY ID,grp ORDER BY order_date DESC) rn
FROM CTE3
) t1
WHERE rn = 1
GROUP BY id,
name
HAVING MAX(CASE WHEN grp = 1 THEN unit_price END) <> MAX(CASE WHEN grp = 0 THEN unit_price END)
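As a cross-check on the grouping logic, here is a minimal Python sketch of the comparison the question describes, with the windows measured back from a reference date as in the asker's update. The function name `price_changes` and the reference date 2022-03-15 are assumptions chosen so that the sample rows fall into the two windows; they do not come from the original post.

```python
from datetime import date


def price_changes(rows, today, window_days=30):
    """rows: iterable of (product_id, order_date, unit_price).

    Returns (product_id, order_date, latest_price, previous_price, variance)
    for products whose most recent price in the last `window_days` days differs
    from their most recent price in the `window_days` days before that.
    """
    latest = {}    # product_id -> (order_date, unit_price) in the recent window
    previous = {}  # same, for the window before it
    for product_id, order_date, unit_price in rows:
        age = (today - order_date).days
        if 0 <= age < window_days:
            bucket = latest
        elif window_days <= age < 2 * window_days:
            bucket = previous
        else:
            continue  # outside both windows
        # Keep only the most recent order per product in each window.
        if product_id not in bucket or order_date > bucket[product_id][0]:
            bucket[product_id] = (order_date, unit_price)

    changes = []
    for product_id in sorted(latest.keys() & previous.keys()):
        order_date, new_price = latest[product_id]
        old_price = previous[product_id][1]
        if new_price != old_price:
            changes.append(
                (product_id, order_date, new_price, old_price, new_price - old_price)
            )
    return changes


# Rows from the question: (product_id, order_date, unit_price).
rows = [
    (1, date(2022, 1, 15), 10), (2, date(2022, 1, 15), 20), (3, date(2022, 1, 15), 10),
    (1, date(2022, 2, 15), 12), (2, date(2022, 2, 15), 20), (3, date(2022, 2, 15), 5),
    (1, date(2022, 3, 8), 15),
]
for change in price_changes(rows, today=date(2022, 3, 15)):
    print(change)
```

With that assumed reference date, the sketch returns products 1 and 3 with variances 5 and -5, the same rows as the desired output in the question.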

sql get balance at end of year

I have a transactions table for a single year, where a negative amount indicates a debit transaction and a positive amount indicates a credit transaction.
In any given month, if the number of debit records is less than 3 or the sum of debits for the month is less than 100, I want to charge a fee of 5.
I want to build an SQL query for this in PostgreSQL:
select sum(amount), count(1), date_part('month', date) as month from transactions where amount < 0 group by month;
I am able to get the records at the month level, but I am stuck on how to proceed further and get the result.
You can start by generating the series of months with generate_series(). Then join that with an aggregate query on transactions, and finally implement the business logic in the outer query:
select sum(t.balance)
- 5 * count(*) filter(where coalesce(t.cnt, 0) < 3 or coalesce(t.debit, 0) < 100) as balance
from generate_series(date '2020-01-01', date '2020-12-01', '1 month') as d(dt)
left join (
select date_trunc('month', date) as dt, count(*) cnt, sum(amount) as balance,
sum(-amount) filter(where amount < 0) as debit
from transactions t
group by date_trunc('month', date)
) t on t.dt = d.dt
Demo on DB Fiddle:
| balance |
| ------: |
| 2746 |
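The fee rule from the question can also be spelled out in a few lines of Python, which may help when checking a query's result by hand. This is only a sketch: the function name `year_end_balance`, its parameters, and the sample transactions are invented for illustration and are not the data behind the demo above.

```python
from collections import defaultdict
from datetime import date


def year_end_balance(transactions, fee=5, min_debits=3, min_debit_total=100):
    """transactions: iterable of (date, amount); negative amounts are debits."""
    debit_count = defaultdict(int)
    debit_total = defaultdict(int)
    balance = 0
    for day, amount in transactions:
        balance += amount
        if amount < 0:
            debit_count[day.month] += 1
            debit_total[day.month] += -amount

    # Walk all 12 months so that months without any rows are charged as well.
    for month in range(1, 13):
        if debit_count[month] < min_debits or debit_total[month] < min_debit_total:
            balance -= fee
    return balance


# Made-up sample: January has 3 debits totalling 120 (no fee),
# February has a single debit (fee), the other 10 months have no rows (fee).
sample = [
    (date(2020, 1, 5), 1000),
    (date(2020, 1, 10), -50),
    (date(2020, 1, 15), -30),
    (date(2020, 1, 20), -40),
    (date(2020, 2, 3), -200),
]
print(year_end_balance(sample))  # 1000 - 120 - 200 - 11 * 5 = 625
```

Iterating over a fixed range of months plays the same role as generate_series() in the query: months with no transactions still have to be charged.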
How about this approach?
SELECT
SUM(
CASE
WHEN -usage.amount_s >= 100 -- amount_s is a sum of negative amounts
AND usage.event_c >= 3
THEN 0
ELSE 5
END
) AS YEAR_FEE
FROM (SELECT 1 AS month UNION
SELECT 2 UNION
SELECT 3 UNION
SELECT 4 UNION
SELECT 5 UNION
SELECT 6 UNION
SELECT 7 UNION
SELECT 8 UNION
SELECT 9 UNION
SELECT 10 UNION
SELECT 11 UNION
SELECT 12
) months
LEFT OUTER JOIN
(
SELECT
sum(amount) AS amount_s,
count(1) event_c,
date_part('month', date) AS month
FROM transactions
WHERE amount < 0
GROUP BY month
) usage ON months.month = usage.month;
First you must use a resultset that returns all the months (1-12) and join it with a LEFT join to your table.
Then aggregate to get the sum of each month's amount and with conditional aggregation subtract 5 from the months that meet your conditions.
Finally use SUM() window function to sum the result of each month:
SELECT DISTINCT SUM(
COALESCE(SUM(t.Amount), 0) -
CASE
WHEN SUM((t.Amount < 0)::int) < 3
OR SUM(CASE WHEN t.Amount < 0 THEN -t.Amount ELSE 0 END) < 100 THEN 5
ELSE 0
END
) OVER () total
FROM generate_series(1, 12, 1) m(month) LEFT JOIN transactions t
ON m.month = date_part('month', t.date) AND date_part('year', t.date) = 2020
GROUP BY m.month
See the demo.
Results:
> | total |
> | ----: |
> | 2746 |
I think you can use the HAVING clause.
Select ( sum(a.total) - (12- count(b.cnt ))*5 ) as result From
(Select sum(amount) as total , 'A' as name from transactions ) as a left join
(Select count(amount) as cnt , 'A' as name
From transactions
where amount <0
group by month(date)
having not(count(amount) <3 or sum(amount) >-100) ) as b
on a.name = b.name
select
sum(amount) - 5*(12-(
select count(*)
from(select month, count(amount),sum(amount)
from transactions
where amount<0
group by month
having Count(amount)>=3 And Sum(amount)<=-100))) as balance
from transactions ;

SQL Between date column filed not null

I would like to count all unique customers that were active on 2019-01-01 with the condition that they also were active in the subsequent 3 days.
Main table
date customer_id time_spent_online_min
2019-01-01 1 5
2019-01-01 2 6
2019-01-01 3 4
2019-01-02 1 7
2019-01-02 2 5
2019-01-03 3 3
2019-01-04 1 4
2019-01-04 2 6
Output table
date total_active_customers
2019-01-01 2
This is what I have tried so far:
with cte as(
select customer_id
,date
,time_spent_online_min
from main_table
where date between date '2019-01-01' and date '2019-01-04'
and customer_id is not null)
select date
,count(distinct(customer_id)) as total_active_customers
from cte
where date = date '2019-01-01'
group by 1
If you have only one record per day, you can use lead():
select date, count(*)
from (select t.*, lead(date, 3) over (partition by customer_id order by date) as date_3
from main_table t
) t
where date = '2019-01-01' and
date_3 = '2019-01-04'
group by date;
If you can have more than one record per day, then aggregate and then use lead():
select date, count(*)
from (select t.*, lead(date, 3) over (partition by customer_id order by date) as date_3
from (select customer_id, date, sum(time_spent_online_min) as time_spent_online_min
from main_table t
group by customer_id, date
) t
) t
where date = '2019-01-01' and
date_3 = '2019-01-04'
group by date;
You can also easily expand this to any dates:
select date, count(*)
from (select t.*, lead(date, 3) over (partition by customer_id order by date) as date_3
from main_table t
) t
where date_3 = date + interval '3' day
group by date;
I would use exists logic here:
SELECT COUNT(*)
FROM main_table t1
WHERE
date = '2019-01-01' AND
EXISTS (SELECT 1 FROM main_table t2
WHERE t2.customer_id = t1.customer_id AND t2.date = '2019-01-02') AND
EXISTS (SELECT 1 FROM main_table t2
WHERE t2.customer_id = t1.customer_id AND t2.date = '2019-01-03') AND
EXISTS (SELECT 1 FROM main_table t2
WHERE t2.customer_id = t1.customer_id AND t2.date = '2019-01-04');
This answer assumes that a given customer would only have one record for one date of activity.
WITH
-- your input
input(dt,customer_id,time_spent_online_min) AS (
SELECT DATE '2019-01-01',1,5
UNION ALL SELECT DATE '2019-01-01',2,6
UNION ALL SELECT DATE '2019-01-01',3,4
UNION ALL SELECT DATE '2019-01-02',1,7
UNION ALL SELECT DATE '2019-01-02',2,5
UNION ALL SELECT DATE '2019-01-03',3,3
UNION ALL SELECT DATE '2019-01-04',1,4
UNION ALL SELECT DATE '2019-01-04',2,6
)
,
-- count the active days in this row and the following 3 days
count_activity AS (
SELECT
*
, COUNT(customer_id) OVER(
PARTITION BY customer_id ORDER BY dt
RANGE BETWEEN CURRENT ROW AND INTERVAL '3 DAY' FOLLOWING
) AS act_count
FROM input
)
SELECT
dt
, COUNT(*) AS total_active_customers
FROM count_activity
WHERE dt='2019-01-01'
AND act_count > 2
GROUP BY dt
;
-- out dt | total_active_customers
-- out ------------+------------------------
-- out 2019-01-01 | 2
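The windowed count in the query above can be checked with a short Python sketch. It is an approximation under stated assumptions: the function name `active_customers` and its parameters are hypothetical, and it counts distinct active days per customer, whereas the query counts rows (the two agree when there is at most one row per customer per day). The threshold of 3 active days mirrors `act_count > 2`.

```python
from datetime import date, timedelta


def active_customers(rows, start, days_following=3, min_active_days=3):
    """rows: iterable of (date, customer_id).

    Counts customers active on `start` who have at least `min_active_days`
    distinct active days between `start` and `start + days_following`.
    """
    end = start + timedelta(days=days_following)
    days_by_customer = {}
    for day, customer_id in rows:
        if start <= day <= end:
            days_by_customer.setdefault(customer_id, set()).add(day)
    return sum(
        1
        for days in days_by_customer.values()
        if start in days and len(days) >= min_active_days
    )


# Rows from the question: (date, customer_id).
rows = [
    (date(2019, 1, 1), 1), (date(2019, 1, 1), 2), (date(2019, 1, 1), 3),
    (date(2019, 1, 2), 1), (date(2019, 1, 2), 2),
    (date(2019, 1, 3), 3),
    (date(2019, 1, 4), 1), (date(2019, 1, 4), 2),
]
print(active_customers(rows, date(2019, 1, 1)))  # 2
```

Requiring activity on every one of the four days (`min_active_days=4`) returns 0 for this sample, because no customer appears on all of 2019-01-01 through 2019-01-04.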

How to find users who made an order in any year then didn't make one the year after

My table scheme looks like this
id | user_id | price | date
1235085 | 429009 | 1301.3 | 2016-01-01
1235016 | 1106100 | 2343.6 | 2016-01-01
1235007 | 707164 | 980.7 | 2016-01-01
there are 20 million records.
I have to find users who made orders in any given year but did not make one the following year.
I tried to use this query
select user_id
from orders o1
where not exists (select user_id from orders o2
where extract(year from o2.date) + 1 > extract(year from o1.date))
but it doesn't work
Use EXCEPT:
select distinct user_id from orders
except
select distinct user_id
from orders o1
where exists(
select 1
from orders o2
where o2.user_id = o1.user_id
and extract(year from o2.date) + 1 = extract(year from o1.date)
)
Here is one method:
select user_id, yyyy
from (select user_id, date_trunc('year', date) as yyyy,
lead(date_trunc('year', date)) over (partition by user_id order by date_trunc('year', date)) as next_year
from t
group by user_id, yyyy
) u
where next_year <> yyyy + interval '1 year' or next_year is null;
This assumes that you actually want the year as well. If not, use select distinct user_id.
You might also want to add the condition yyyy <> date_trunc('year', now()) so you don't get users who made their first purchase this year. Without this condition, I think you will return all users, because every user has a "last purchase" with no purchases the following year.
EDIT:
Interestingly, you can do this with lead() as well:
select user_id, date
from (select t.*, lead(date) over (partition by user_id order by date) as next_date
from t
) t
where (next_date is null or
extract(year from next_date) <> extract(year from date) + 1
) and
date < date_trunc('year', now());
Because lead() orders the values, this should return at most one value for a given year, even when there are multiple orders in a year.
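The rule being tested here (an order in some year, none in the next) can be written as a small Python sketch for comparison. The function name `lapsed_users`, the explicit `current_year` parameter, and the sample orders are all assumptions made for the example. As in the query above, the current year is skipped because its following year has not happened yet.

```python
from datetime import date


def lapsed_users(orders, current_year):
    """orders: iterable of (user_id, order_date).

    Returns sorted (user_id, year) pairs where the user ordered in `year`
    but placed no order in `year + 1`.
    """
    years_by_user = {}
    for user_id, order_date in orders:
        years_by_user.setdefault(user_id, set()).add(order_date.year)

    lapsed = []
    for user_id, years in years_by_user.items():
        for year in years:
            # Skip the current year: the following year does not exist yet.
            if year < current_year and year + 1 not in years:
                lapsed.append((user_id, year))
    return sorted(lapsed)


# Made-up sample orders.
orders = [
    (1, date(2016, 3, 1)), (1, date(2017, 5, 9)), (1, date(2019, 7, 2)),
    (2, date(2019, 1, 15)), (2, date(2020, 2, 20)),
]
print(lapsed_users(orders, current_year=2020))  # [(1, 2017), (1, 2019)]
```

Collapsing each user's orders to a set of years first is the same step as the `group by user_id, yyyy` in the first query.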

SQL query needed - Counting 365 days backwards

I have searched the forum many times but couldn't find a solution for my situation. I am working with an Oracle database.
I have a table with all Order Numbers and Customer Numbers by Day. It looks like this:
Day | Customer Nbr | Order Nbr
2018-01-05 | 25687459 | 256
2018-01-09 | 36478592 | 398
2018-03-07 | 25687459 | 1547
and so on....
Now I need a SQL Query which gives me a table by day and Customer Nbr and counts the number of unique Order Numbers within the last 365 days starting from column 1.
For the example above the resulting table should look like:
Day | Customer Nbr | Order Cnt
2019-01-01 | 25687459 | 2
2019-01-02 | 25687459 | 2
...
2019-03-01 | 25687459 | 1
One method is to generate values for all days of interest for each customer and then use a correlated subquery:
with dates as (
select date '2019-01-01' + rownum as dte from dual
connect by date '2019-01-01' + rownum < sysdate
)
select d.dte, c.customer_nbr,
(select count(*)
from t t2
where t2.customer_nbr = c.customer_nbr and
t2.day <= d.dte and
t2.day > d.dte - 365
) as order_cnt
from dates d cross join
(select distinct customer_nbr from t) c;
Edit:
I've just seen you clarify the question, which I've interpreted to mean:
For every day in the last year, show how many orders there were for each customer between that date, and 1 year previously. Working on an answer now...
Updated Answer:
For each customer, we count the number of records between the order day, and 365 days before it...
WITH yourTable AS
(
SELECT SYSDATE - 1 Day, 'Alex' CustomerNbr FROM DUAL
UNION ALL
SELECT SYSDATE - 2, 'Alex' FROM DUAL
UNION ALL
SELECT SYSDATE - 366, 'Alex' FROM DUAL
UNION ALL
SELECT SYSDATE - 400, 'Alex' FROM DUAL
UNION ALL
SELECT SYSDATE - 500, 'Alex' FROM DUAL
UNION ALL
SELECT SYSDATE - 1, 'Joe' FROM DUAL
UNION ALL
SELECT SYSDATE - 300, 'Chris' FROM DUAL
UNION ALL
SELECT SYSDATE - 1, 'Chris' FROM DUAL
)
SELECT Day, CustomerNbr, OrdersLast365Days
FROM yourTable t
OUTER APPLY
(
SELECT COUNT(1) OrdersLast365Days
FROM yourTable t2
WHERE t.CustomerNbr = t2.CustomerNbr
AND TRUNC(t2.Day) >= TRUNC(t.Day) - 364
AND TRUNC(t2.Day) <= TRUNC(t.Day)
)
ORDER BY t.Day DESC, t.CustomerNbr;
If you want to report on just the days you have orders for, then a simple WHERE clause should be enough:
SELECT Day, CustomerNbr, COUNT(1) OrderCount
FROM <yourTable>
WHERE TRUNC(DAY) >= TRUNC(SYSDATE -364)
GROUP BY Day, CustomerNbr
ORDER BY Day Desc;
If you want to report on every day, you'll need to generate them first. This can be done with a hierarchical CONNECT BY query in a CTE, which you then join to your table:
WITH last365Days AS
(
SELECT TRUNC (SYSDATE - ROWNUM + 1) Day
FROM DUAL CONNECT BY ROWNUM <= 365
)
SELECT d.Day, COALESCE(t.CustomerNbr, 'None') CustomerNbr, SUM(CASE WHEN t.CustomerNbr IS NULL THEN 0 ELSE 1 END) OrderCount
FROM last365Days d
LEFT OUTER JOIN <yourTable> t
ON d.Day = TRUNC(t.Day)
GROUP BY d.Day, t.CustomerNbr
ORDER BY d.Day Desc;
I would probably have done it with an analytic function. In your windowing clause, you can specify either a number of rows before or a range. In this case I will use a range.
This will give you, for each customer and each day, the number of orders during the one rolling year before the date displayed:
WITH DATES AS (
SELECT * FROM
(SELECT TRUNC(SYSDATE)-(LEVEL-1) AS DAY FROM DUAL CONNECT BY TRUNC(SYSDATE)-(LEVEL-1) >= ( SELECT MIN(TRUNC(DAY)) FROM MY_TABLE ))
CROSS JOIN
(SELECT DISTINCT CUST_ID FROM MY_TABLE))
SELECT DISTINCT
DATES.DAY,
DATES.CUST_ID,
COUNT(ORDER_ID) OVER (PARTITION BY DATES.CUST_ID ORDER BY DATES.DAY RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND INTERVAL '1' SECOND PRECEDING)
FROM
DATES
LEFT JOIN
MY_TABLE
ON DATES.DAY=TRUNC(MY_TABLE.DAY) AND DATES.CUST_ID=MY_TABLE.CUST_ID
ORDER BY DATES.CUST_ID,DATES.DAY;
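For comparison with the SQL approaches above, here is a minimal Python sketch of a rolling 365-day count per customer using binary search over sorted order days. The function name `rolling_order_counts` and its parameters are hypothetical. It assumes the window includes the report day itself and that each order number appears on a single day, so dropping exact duplicate rows is enough to count unique orders.

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta


def rolling_order_counts(orders, report_days, window_days=365):
    """orders: iterable of (day, customer_nbr, order_nbr).

    Returns {(report_day, customer_nbr): count of orders in the `window_days`
    days ending on report_day, inclusive}.
    """
    days_by_customer = {}
    for day, customer_nbr, _order_nbr in set(orders):  # set() drops duplicate rows
        days_by_customer.setdefault(customer_nbr, []).append(day)
    for days in days_by_customer.values():
        days.sort()

    counts = {}
    for report_day in report_days:
        window_start = report_day - timedelta(days=window_days - 1)
        for customer_nbr, days in days_by_customer.items():
            lo = bisect_left(days, window_start)  # first order on or after window start
            hi = bisect_right(days, report_day)   # first order after the report day
            counts[(report_day, customer_nbr)] = hi - lo
    return counts


# Rows from the question: (day, customer_nbr, order_nbr).
orders = [
    (date(2018, 1, 5), 25687459, 256),
    (date(2018, 1, 9), 36478592, 398),
    (date(2018, 3, 7), 25687459, 1547),
]
report_days = [date(2019, 1, 1), date(2019, 1, 2), date(2019, 3, 1)]
counts = rolling_order_counts(orders, report_days)
for day in report_days:
    print(day, 25687459, counts[(day, 25687459)])
```

For customer 25687459 this gives 2, 2 and 1 for the three report days, matching the expected table in the question.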