How to do a one-to-one inner join - SQL

I have a transaction table of purchased and returned items, and I want to match each return transaction with the transaction where the corresponding item was purchased. (Here I used the same item ID and amount in all records for simplicity.)
| trans_ID | date       | item_ID | amt  | type     |
|----------|------------|---------|------|----------|
| 1        | 2022-01-09 | 100     | 5000 | purchase |
| 2        | 2022-01-07 | 100     | 5000 | return   |
| 3        | 2022-01-06 | 100     | 5000 | purchase |
| 4        | 2022-01-05 | 100     | 5000 | purchase |
| 5        | 2022-01-04 | 100     | 5000 | return   |
| 6        | 2022-01-03 | 100     | 5000 | return   |
| 7        | 2022-01-03 | 100     | 5000 | purchase |
| 8        | 2022-01-02 | 100     | 5000 | purchase |
| 9        | 2022-01-01 | 100     | 5000 | return   |
Matching conditions are:
1. The return date must be greater than or equal to the purchase date.
2. The return and purchase transactions must relate to the same item ID and the same transaction amount.
3. For each return, there must be only 1 purchase matched to it. (If there are many related purchases, choose the one with the most recent purchase date. But if the most recent purchase was already used for mapping to another return, choose the second-most recent purchase instead, and so on.)
4. From 3), that means each purchase must be matched with only 1 return as well.
The result should look like this.

| trans_ID | date       | trans_ID_matched | date_matched |
|----------|------------|------------------|--------------|
| 2        | 2022-01-07 | 3                | 2022-01-06   |
| 5        | 2022-01-04 | 7                | 2022-01-03   |
| 6        | 2022-01-03 | 8                | 2022-01-02   |
This is what I've tried.
with temp as (
select a.trans_ID, a.date
, b.trans_ID as trans_ID_matched
, b.date as date_matched
, row_number() over (partition by a.trans_ID, a.date order by b.date desc) as rn1
from
(
select *
from transaction_table
where type = 'return'
) a
inner join
(
select *
from transaction_table
where type = 'purchase'
) b
on a.item_ID = b.item_ID and a.amt = b.amt and a.date >= b.date
)
select * from temp where rn1 = 1
But what I got is

| trans_ID | date       | trans_ID_matched | date_matched |
|----------|------------|------------------|--------------|
| 2        | 2022-01-07 | 3                | 2022-01-06   |
| 5        | 2022-01-04 | 7                | 2022-01-03   |
| 6        | 2022-01-03 | 7                | 2022-01-03   |
Here, trans ID 7 shouldn't be used again in the last row, as it has already been matched with trans ID 5 in the second row. So is there any way to match trans ID 6 with 8 (or any way to tell SQL not to reuse an already-used purchase like 7)?

I created a fiddle. The result seems OK, but it's up to you to test whether this works in all situations..... 😉
WITH cte as (
SELECT
t1.trans_ID,
t1.[date],
t1.item_ID,
t1.amt,
t1.[type],
pur.trans_ID trans_ID_matched,
pur.[date] date_matched,
jojo.c
FROM table1 t1
CROSS APPLY (
SELECT
trans_ID,
item_ID,
[date],
amt
FROM table1 pur
WHERE pur.[type] = 'purchase' and t1.[type]='return'
and pur.item_ID = t1.item_ID
and pur.amt = t1.amt
and pur.[date] <= t1.[date]
) pur
CROSS APPLY (
SELECT count(*) as c FROM table1 WHERE trans_ID > t1.trans_ID and trans_ID < pur.trans_ID
) jojo
where jojo.c <=2
)
select
trans_ID,
[date],
item_ID,
amt,
CASE WHEN min(c)=0 then min(trans_ID_matched) else max(trans_ID_matched) end
from cte
group by
trans_ID,
[date],
item_ID,
amt
order by trans_ID;
DBFIDDLE
The count(*) detects the distance between the selected trans_ID of the return and the purchase.
This might go wrong when there are more than 2 adjacent 'returns'... (I am afraid it will break, so I did not test this 😢).
But it's a nice problem. Hopefully this will give you some other ideas to find the correct solution!
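As a cross-check of the matching rules, the greedy one-to-one assignment can also be sketched procedurally in Python. This is purely illustrative, not a SQL solution; the processing order (most recent return first) is my assumption about the tie-breaking the question implies:

```python
from datetime import date

# In-memory copy of the question's transaction_table:
# (trans_ID, date, item_ID, amt, type)
rows = [
    (1, date(2022, 1, 9), 100, 5000, 'purchase'),
    (2, date(2022, 1, 7), 100, 5000, 'return'),
    (3, date(2022, 1, 6), 100, 5000, 'purchase'),
    (4, date(2022, 1, 5), 100, 5000, 'purchase'),
    (5, date(2022, 1, 4), 100, 5000, 'return'),
    (6, date(2022, 1, 3), 100, 5000, 'return'),
    (7, date(2022, 1, 3), 100, 5000, 'purchase'),
    (8, date(2022, 1, 2), 100, 5000, 'purchase'),
    (9, date(2022, 1, 1), 100, 5000, 'return'),
]

# Process returns from the most recent backwards, and for each one
# take the most recent purchase that is still unused.
returns = sorted((r for r in rows if r[4] == 'return'),
                 key=lambda r: r[1], reverse=True)
purchases = [r for r in rows if r[4] == 'purchase']

used, matches = set(), []
for ret in returns:
    candidates = [p for p in purchases
                  if p[0] not in used
                  and p[2] == ret[2]     # same item_ID
                  and p[3] == ret[3]     # same amt
                  and p[1] <= ret[1]]    # purchased on or before the return
    if candidates:
        best = max(candidates, key=lambda p: p[1])  # most recent unused purchase
        used.add(best[0])
        matches.append((ret[0], best[0]))

print(matches)  # [(2, 3), (5, 7), (6, 8)]
```

Return 9 ends up unmatched (no earlier purchase is left), which matches the expected output in the question.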


Calculating the cumulative sum with some conditions (gaps-and-islands problem)

Sorry if the title is a bit vague; please suggest a title if you think it can articulate the problem better. I'll start with the data I have and the end result I'm trying to get, and then the TL;DR:
This is the table I have:
Each row is a transaction. Outgoing amounts are negative, incomings are positive. The transactions can either be someone spending money ('spend' event) or it can be a loan disbursement into their account (amount > 0 and event = 'loan') or it can be them paying back their loan (amount < 0 and event = 'loan').
| row number | id | created    | amount | event |
|------------|----|------------|--------|-------|
| 1          | 1  | 2022-01-01 | -200   | spend |
| 2          | 1  | 2022-01-02 | 1000   | loan  |
| 3          | 1  | 2022-01-03 | -200   | spend |
| 4          | 1  | 2022-01-04 | -500   | spend |
| 5          | 1  | 2022-01-05 | -500   | loan  |
| 6          | 1  | 2022-01-06 | 100    | spend |
| 7          | 1  | 2022-01-07 | -500   | spend |
| 8          | 1  | 2022-01-08 | 1000   | loan  |
| 9          | 1  | 2022-01-09 | -100   | spend |
I'm trying to make:
| row number | id | created    | amount | event | cumulative_sum |
|------------|----|------------|--------|-------|----------------|
| 1          | 1  | 2022-01-01 | -200   | spend | -200           |
| 2          | 1  | 2022-01-02 | 1000   | loan  | 1000           |
| 3          | 1  | 2022-01-03 | -200   | spend | 800            |
| 4          | 1  | 2022-01-04 | -500   | spend | 300            |
| 5          | 1  | 2022-01-05 | -500   | loan  | 300            |
| 6          | 1  | 2022-01-06 | 100    | spend | 300            |
| 7          | 1  | 2022-01-07 | -500   | spend | -200           |
| 8          | 1  | 2022-01-08 | 1000   | loan  | 1000           |
| 9          | 1  | 2022-01-09 | -100   | spend | 900            |
Required logic:
I want to get a special cumulative sum which sums the amount only when (the amount is < 0 AND the event is spend) OR (the amount is > 0 AND the event is loan).
The thing is, I want the cumulative sum to start at that first positive loan amount. I don't care about anything before the positive loan amount, and if those rows are counted they will obscure the results. The requirement is trying to select the rows which the loan enabled (if the loan is 1000, then we want to select the rows that add up to -1000, but only when the event is spend and the amount is < 0).
my attempt
WITH tmp AS (
SELECT
1 AS id,
'2022-01-01' AS created,
-200 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-02' AS created,
1000 AS amount,
'loan' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-03' AS created,
-200 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-04' AS created,
-500 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-05' AS created,
-500 AS amount,
'loan' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-06' AS created,
100 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-07' AS created,
-500 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-08' AS created,
1000 AS amount,
'loan' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-09' AS created,
-100 AS amount,
'spend' AS scheme
)
SELECT
*,
SUM(CASE WHEN (scheme != 'loan' AND amount<0) OR (scheme = 'loan' AND amount > 0) THEN amount ELSE 0 END)
OVER (PARTITION BY id ORDER BY created ASC) AS cumulative_sum_spend
FROM tmp
Question
How do I make the cumulative sum reset at row 2 (not conditional on the row number; the requirement is the positive loan amount)?
That's a gaps-and-islands problem if I am understanding this correctly.
Islands start with a positive loan; within each island, you want to compute a running sum over a subset of rows.
We can identify the islands in a subquery with a window count of positive loans, then do the maths in each group with a conditional expression:
select id, created, amount, event,
sum(case when (event = 'loan' and amount > 0) or (event = 'spend' and amount < 0) then amount end)
over(partition by id, grp order by created) as cumulative_sum
from (
select t.*,
sum(case when event = 'loan' and amount > 0 then 1 else 0 end)
over(partition by id order by created) grp
from tmp t
) t
order by id, created
One option would be something like this:
SELECT
*,
SUM(CASE WHEN cnt >= 1 AND ((scheme != 'loan' AND amount<0) OR (scheme = 'loan' AND amount > 0)) THEN amount ELSE 0 END)
OVER (PARTITION BY id ORDER BY created ASC) AS cumulative_sum_spend
FROM (
SELECT *, SUM(CASE WHEN amount > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY created) cnt
FROM tmp
) a
The idea here is that the inner query's window function counts the number of preceding positive values. The outer query can then do an extra check, cnt >= 1, as part of its window function, so it will only consider values after the first positive one.
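Both answers hinge on restarting the sum at each positive loan. As a plain-Python sanity check of that logic against the sample rows (not a replacement for the SQL):

```python
from datetime import date

# The sample rows from the question: (created, amount, event)
rows = [
    (date(2022, 1, 1), -200, 'spend'),
    (date(2022, 1, 2), 1000, 'loan'),
    (date(2022, 1, 3), -200, 'spend'),
    (date(2022, 1, 4), -500, 'spend'),
    (date(2022, 1, 5), -500, 'loan'),
    (date(2022, 1, 6),  100, 'spend'),
    (date(2022, 1, 7), -500, 'spend'),
    (date(2022, 1, 8), 1000, 'loan'),
    (date(2022, 1, 9), -100, 'spend'),
]

cumulative = []
running = 0
for created, amount, event in rows:
    if event == 'loan' and amount > 0:
        running = 0                      # a positive loan starts a new island
    if (event == 'loan' and amount > 0) or (event == 'spend' and amount < 0):
        running += amount                # only qualifying rows are summed
    cumulative.append(running)

print(cumulative)
# [-200, 1000, 800, 300, 300, 300, -200, 1000, 900]
```

The result matches the cumulative_sum column in the desired output, including the reset at rows 2 and 8.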

snowflake sql: sum for each day between two dates

I hope someone can help. Suppose I have this table
| id | actual_date | target_date | qty |
|----|-------------|-------------|-----|
| 1  | 2022-01-01  | 2022-01-01  | 2   |
| 2  | 2022-01-02  | 2022-01-01  | 1   |
| 3  | 2022-01-03  | 2022-01-01  | 3   |
| 4  | 2022-01-03  | 2022-01-02  | 1   |
| 5  | 2022-01-03  | 2022-01-03  | 2   |
What I would like to calculate is the qty that has to be processed on each date.
E.g. on the target date 2022-01-01 the quota qty is 6 (2+1+3).
On 2022-01-02 I would also have to add the qtys that haven't been processed the day before, which means id 2 (because the actual date is 2022-01-02, so after the target date) and id 3. The quota qty for 2022-01-02 is then 1+3+1.
And for 2022-01-03 it is 6 = 2+1+3, because id 3 has an actual date of 2022-01-03 (it wasn't processed on 01-01 or on 01-02) and id 4 wasn't processed on 01-02.
Here's what the desired output would look like:
| target_date | qty_quota |
|-------------|-----------|
| 2022-01-01  | 6         |
| 2022-01-02  | 4         |
| 2022-01-03  | 6         |
Hopefully this gets you started ... I recommend testing heaps more edge cases; the business rules don't quite feel right to me, as you don't seem to show what happens when actual > target. But I hope this helps.
WITH CTE AS( SELECT 1 ID, '2022-01-01'::DATE ACTUAL_DATE,'2022-01-01'::DATE TARGET_DATE, 2 QTY
UNION ALL SELECT 2 ID, '2022-01-02'::DATE ACTUAL_DATE,'2022-01-01'::DATE TARGET_DATE, 1 QTY
UNION ALL SELECT 3 ID, '2022-01-03'::DATE ACTUAL_DATE,'2022-01-01'::DATE TARGET_DATE, 3 QTY
UNION ALL SELECT 4 ID, '2022-01-03'::DATE ACTUAL_DATE,'2022-01-02'::DATE TARGET_DATE, 1 QTY
UNION ALL SELECT 5 ID, '2022-01-03'::DATE ACTUAL_DATE,'2022-01-03'::DATE TARGET_DATE, 2 QTY
)
,CTE2 AS(SELECT
ACTUAL_DATE D
, SUM(QTY) ACTUAL_QTY
FROM CTE GROUP BY 1)
,CTE3 AS(SELECT
TARGET_DATE D
, SUM(QTY) TARGET_QTY
FROM CTE GROUP BY 1)
SELECT
D DATE
,ACTUAL_QTY
,TARGET_QTY
,TARGET_QTY-ACTUAL_QTY DELTA
,ZEROIFNULL(LAG(DELTA) OVER(PARTITION BY 1 ORDER BY D)) GHOST
,GREATEST(TARGET_QTY, DELTA+GHOST, ACTUAL_QTY) VOILA
FROM
CTE2 FULL OUTER JOIN CTE3 USING(D);
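To sanity-check the DELTA/GHOST/GREATEST arithmetic against the sample data, here is the same computation as a small Python script (variable names mirror the query's aliases; illustrative only):

```python
from collections import defaultdict
from datetime import date

# Sample rows: (id, actual_date, target_date, qty)
rows = [
    (1, date(2022, 1, 1), date(2022, 1, 1), 2),
    (2, date(2022, 1, 2), date(2022, 1, 1), 1),
    (3, date(2022, 1, 3), date(2022, 1, 1), 3),
    (4, date(2022, 1, 3), date(2022, 1, 2), 1),
    (5, date(2022, 1, 3), date(2022, 1, 3), 2),
]

actual_qty = defaultdict(int)   # CTE2: qty summed by actual_date
target_qty = defaultdict(int)   # CTE3: qty summed by target_date
for _id, actual, target, qty in rows:
    actual_qty[actual] += qty
    target_qty[target] += qty

ghost = 0                       # ZEROIFNULL(LAG(DELTA) ...)
quota = {}
for d in sorted(set(actual_qty) | set(target_qty)):
    delta = target_qty[d] - actual_qty[d]
    quota[d] = max(target_qty[d], delta + ghost, actual_qty[d])  # GREATEST(...)
    ghost = delta

print([quota[d] for d in sorted(quota)])  # [6, 4, 6]
```

This reproduces the desired 6 / 4 / 6 output for 2022-01-01 through 2022-01-03 on this particular sample; as the answer warns, it says nothing about trickier edge cases.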

Calculating average time between customer orders and average order value in Postgres

In PostgreSQL I have an orders table that represents orders made by customers of a store:
SELECT * FROM orders
| order_id | customer_id | value  | created_at |
|----------|-------------|--------|------------|
| 1        | 1           | 188.01 | 2020-11-24 |
| 2        | 2           | 25.74  | 2022-10-13 |
| 3        | 1           | 159.64 | 2022-09-23 |
| 4        | 1           | 201.41 | 2022-04-01 |
| 5        | 3           | 357.80 | 2022-09-05 |
| 6        | 2           | 386.72 | 2022-02-16 |
| 7        | 1           | 200.00 | 2022-01-16 |
| 8        | 1           | 19.99  | 2020-02-20 |
For a specified time range (e.g. 2022-01-01 to 2022-12-31), I need to find the following:
- Average 1st order value
- Average 2nd order value
- Average 3rd order value
- Average 4th order value
E.g. the 1st purchases for each customer are:
- for customer_id 1, order_id 8 is their first purchase
- customer 2, order 6
- customer 3, order 5
So, the 1st-purchase average order value is (19.99 + 386.72 + 357.80) / 3 = $254.84
This needs to be found for the 2nd, 3rd and 4th purchases also.
I also need to find the average time between purchases:
- order 1 to order 2
- order 2 to order 3
- order 3 to order 4
The final result would ideally look something like this:
| order_number | AOV    | av_days_since_last_order |
|--------------|--------|--------------------------|
| 1            | 254.84 | 0                        |
| 2            | 300.00 | 28                       |
| 3            | 322.22 | 21                       |
| 4            | 350.00 | 20                       |
Note that average days since last order for order 1 would always be 0 as it's the 1st purchase.
Thanks.
select order_number
,round(avg(value),2) as AOV
,coalesce(round(avg(days_between_orders),0),0) as av_days_since_last_order
from
(
select *
,row_number() over(partition by customer_id order by created_at) as order_number
,created_at - lag(created_at) over(partition by customer_id order by created_at) as days_between_orders
from t
) t
where created_at between '2022-01-01' and '2022-12-31'
group by order_number
order by order_number
| order_number | aov    | av_days_since_last_order |
|--------------|--------|--------------------------|
| 1            | 372.26 | 0                        |
| 2            | 25.74  | 239                      |
| 3            | 200.00 | 418                      |
| 4            | 201.41 | 75                       |
| 5            | 159.64 | 175                      |
Fiddle
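To trace where these numbers come from, the query's window-then-filter behaviour (order numbers computed over the full history, the date range applied afterwards) can be reproduced in a quick Python script over the question's sample data (illustrative only):

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# The orders table from the question: (order_id, customer_id, value, created_at)
orders = [
    (1, 1, 188.01, date(2020, 11, 24)),
    (2, 2, 25.74,  date(2022, 10, 13)),
    (3, 1, 159.64, date(2022, 9, 23)),
    (4, 1, 201.41, date(2022, 4, 1)),
    (5, 3, 357.80, date(2022, 9, 5)),
    (6, 2, 386.72, date(2022, 2, 16)),
    (7, 1, 200.00, date(2022, 1, 16)),
    (8, 1, 19.99,  date(2020, 2, 20)),
]

history = defaultdict(list)
for oid, cust, value, created in sorted(orders, key=lambda o: o[3]):
    history[cust].append((value, created))

values_by_n = defaultdict(list)  # order_number -> order values in the window
gaps_by_n = defaultdict(list)    # order_number -> days since previous order
for hist in history.values():
    for n, (value, created) in enumerate(hist, start=1):
        # order_number is computed over the full history (like row_number()
        # in the subquery); the date filter is applied afterwards
        if date(2022, 1, 1) <= created <= date(2022, 12, 31):
            values_by_n[n].append(value)
            if n > 1:
                gaps_by_n[n].append((created - hist[n - 2][1]).days)

for n in sorted(values_by_n):
    aov = round(mean(values_by_n[n]), 2)
    gap = round(mean(gaps_by_n[n])) if gaps_by_n[n] else 0
    print(n, aov, gap)
```

Note that a customer's 2022 orders keep their all-time order numbers (customer 1's first 2022 order is their 3rd overall), which is why the aov column differs from the "average 1st order value" the question describes.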
I suppose it should be something like this:
WITH prep_data AS (
SELECT order_id,
       customer_id,
       ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY created_at) AS purchase_num,
       created_at,
       value
FROM orders
WHERE created_at BETWEEN :date_from AND :date_to
), prep_data2 AS (
SELECT pd1.order_id,
       pd1.customer_id,
       pd1.purchase_num,
       pd1.created_at - pd2.created_at AS date_diff,
       pd1.value
FROM prep_data pd1
LEFT JOIN prep_data pd2
  ON pd1.customer_id = pd2.customer_id
 AND pd1.purchase_num = pd2.purchase_num + 1
)
SELECT purchase_num,
       avg(value) AS avg_val,
       avg(date_diff) AS avg_date_diff
FROM prep_data2
GROUP BY purchase_num

PostgreSQL users and balances - use previous balance for user in time series if value is missing

Given the following tables:
users:

| name  |
|-------|
| alice |
| bob   |
balances:

| id | user_name | day        | balance |
|----|-----------|------------|---------|
| 1  | alice     | 2022-01-01 | 100     |
| 2  | alice     | 2022-01-03 | 200     |
| 3  | alice     | 2022-01-04 | 300     |
| 4  | bob       | 2022-01-01 | 400     |
| 5  | bob       | 2022-01-02 | 500     |
| 6  | bob       | 2022-01-05 | 600     |
I would like to get a full list of all days from the first available to the last for all users, replacing NULL balances with the last available balance for that user.
This is what I have so far:
select u.name, s.day, b.balance
from users u
cross join (select generate_series(min(day)::date, max(day)::date, interval '1 day')::date as day from balances) s
left join balances b on b.user_name = u.name and s.day = b.day
order by u.name, s.day
;
SQL Fiddle Here
I have tried LAG() and some other examples found here but none of them seem to get the right last balance for the user.
We group every balance with the NULLs that come after it using count() over(), and then use max() over() to give the entire group the same value.
select name
,day
,max(balance) over(partition by name, grp order by day) as balance
from
(
select u.name
,s.day
,b.balance
,count(case when b.balance is not null then 1 end) over(partition by u.name order by s.day) as grp
from users u
cross join (select generate_series(min(day)::date, max(day)::date, interval '1 day')::date as day from balances) s
left join balances b on b.user_name = u.name and s.day = b.day
order by u.name, s.day
) t
| name  | day        | balance |
|-------|------------|---------|
| alice | 2022-01-01 | 100     |
| alice | 2022-01-02 | 100     |
| alice | 2022-01-03 | 200     |
| alice | 2022-01-04 | 300     |
| alice | 2022-01-05 | 300     |
| bob   | 2022-01-01 | 400     |
| bob   | 2022-01-02 | 500     |
| bob   | 2022-01-03 | 500     |
| bob   | 2022-01-04 | 500     |
| bob   | 2022-01-05 | 600     |
Fiddle
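The count()/max() grouping can also be traced step by step in Python. The two passes below mirror the inner and outer query (a sanity check against the output above, not production code):

```python
from datetime import date, timedelta

users = ['alice', 'bob']
balances = {
    ('alice', date(2022, 1, 1)): 100,
    ('alice', date(2022, 1, 3)): 200,
    ('alice', date(2022, 1, 4)): 300,
    ('bob',   date(2022, 1, 1)): 400,
    ('bob',   date(2022, 1, 2)): 500,
    ('bob',   date(2022, 1, 5)): 600,
}

start = min(d for _, d in balances)
end = max(d for _, d in balances)
days = []
d = start
while d <= end:                  # generate_series(min, max, '1 day')
    days.append(d)
    d += timedelta(days=1)

# Pass 1 (inner query): running count of non-NULL balances = group id
grp = {}
for u in users:
    count = 0
    for d in days:
        if (u, d) in balances:
            count += 1
        grp[(u, d)] = count

# Pass 2 (outer query): max(balance) within each (name, grp) partition
result = {}
for u in users:
    for d in days:
        group_days = [x for x in days if x <= d and grp[(u, x)] == grp[(u, d)]]
        vals = [balances[(u, x)] for x in group_days if (u, x) in balances]
        result[(u, d)] = max(vals) if vals else None

print(result[('alice', date(2022, 1, 2))])  # 100
print(result[('bob', date(2022, 1, 4))])    # 500
```

Each group contains exactly one non-NULL balance (the one that opens it), so max() within the group is simply "the last known balance".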
Based on How do I efficiently select the previous non-null value?, I ended up getting successful results with the following query:
select
name,
day,
first_value(balance) over (partition by x.name, value_partition order by day) as balance
from (
select
u.name as name,
s.day as day,
b.balance as balance,
sum(case when b.balance is null then 0 else 1 end) over (partition by u.name order by s.day) as value_partition
from users u
cross join (select generate_series(min(day)::date, max(day)::date, interval '1 day')::date as day from balances) s
left join balances b on b.user_name = u.name and s.day = b.day
) x
order by x.name, x.day
DB Fiddle

How to fill missing values for missing dates with value from date before in sql bigquery? [duplicate]

This question already has an answer here:
Create Balance Sheet with every date is filled in Bigquery
(1 answer)
Closed 8 months ago.
Hi, I have a product table with daily prices. The catch here is that the table only updates if there's a price change; the dates in between are not written into the table because the price is the same as the day before.
How do I fill missing values of price with the last entry of date before?
| date       | id | price |
|------------|----|-------|
| 2022-01-01 | 1  | 5     |
| 2022-01-03 | 1  | 6     |
| 2022-01-05 | 1  | 7     |
| 2022-01-01 | 2  | 10    |
| 2022-01-02 | 2  | 11    |
| 2022-01-06 | 2  | 12    |
into
| date       | id | price |
|------------|----|-------|
| 2022-01-01 | 1  | 5     |
| 2022-01-02 | 1  | 5     |
| 2022-01-03 | 1  | 6     |
| 2022-01-04 | 1  | 6     |
| 2022-01-05 | 1  | 7     |
| 2022-01-01 | 2  | 10    |
| 2022-01-02 | 2  | 11    |
| 2022-01-03 | 2  | 11    |
| 2022-01-04 | 2  | 11    |
| 2022-01-05 | 2  | 11    |
| 2022-01-06 | 2  | 12    |
I am currently thinking of creating a table of dates, joining against it, and using the lag function. Can anyone help?
select
  date,
  id,
  case
    when price is null then nullPrice
    else price
  end as price
from (
  select *,
    lag(price, 1) over(partition by id order by date) as nullPrice
  from date_table
  left join price_table using(date)
)
Consider below:
WITH days_by_id AS (
SELECT id, GENERATE_DATE_ARRAY(MIN(date), MAX(date)) days
FROM sample
GROUP BY id
)
SELECT date, id,
IFNULL(price, LAST_VALUE(price IGNORE NULLS) OVER (PARTITION BY id ORDER BY date)) AS price
FROM days_by_id, UNNEST(days) date LEFT JOIN sample USING (id, date);
You can use the generate_date_array function for this:
with date_arr as (
  select *
  from unnest(generate_date_array('2022-01-01', '2022-05-01')) as dt
)
select da.dt, t1.*
from date_arr da
left outer join table1 t1
  on da.dt = t1.dt
You can replace the hardcoded dates with the max and min dates from the table.
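The per-id GENERATE_DATE_ARRAY plus LAST_VALUE ... IGNORE NULLS idea boils down to a forward fill over each id's own date range, which can be sketched as follows (illustrative, using the question's sample rows):

```python
from datetime import date, timedelta

# The question's sparse price table, keyed by (date, id)
prices = {
    (date(2022, 1, 1), 1): 5,
    (date(2022, 1, 3), 1): 6,
    (date(2022, 1, 5), 1): 7,
    (date(2022, 1, 1), 2): 10,
    (date(2022, 1, 2), 2): 11,
    (date(2022, 1, 6), 2): 12,
}

filled = {}
for pid in {i for _, i in prices}:
    days = sorted(d for d, i in prices if i == pid)
    last = None
    d = days[0]
    while d <= days[-1]:   # per-id range, like GENERATE_DATE_ARRAY(MIN(date), MAX(date))
        last = prices.get((d, pid), last)   # carry the last known price forward
        filled[(d, pid)] = last
        d += timedelta(days=1)

print(filled[(date(2022, 1, 4), 1)])  # 6
print(filled[(date(2022, 1, 5), 2)])  # 11
```

Note each id gets its own date range (id 1 ends on 2022-01-05, id 2 on 2022-01-06), matching the desired output.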