Calculating the cumulative sum with some conditions (gaps-and-islands problem) - sql

Sorry if the title is a bit vague; please suggest a better one if you can think of a title that articulates the problem. I'll start with the data I have and the end result I'm trying to get, and then give the TL;DR:
This is the table I have:
Each row is a transaction. Outgoing amounts are negative, incomings are positive. The transactions can either be someone spending money ('spend' event) or it can be a loan disbursement into their account (amount > 0 and event = 'loan') or it can be them paying back their loan (amount < 0 and event = 'loan').
row number  id  created     amount  event
1           1   2022-01-01  -200    spend
2           1   2022-01-02  1000    loan
3           1   2022-01-03  -200    spend
4           1   2022-01-04  -500    spend
5           1   2022-01-05  -500    loan
6           1   2022-01-06  100     spend
7           1   2022-01-07  -500    spend
8           1   2022-01-08  1000    loan
9           1   2022-01-09  -100    spend
I'm trying to make:
row number  id  created     amount  event  cumulative_sum
1           1   2022-01-01  -200    spend  -200
2           1   2022-01-02  1000    loan   1000
3           1   2022-01-03  -200    spend  800
4           1   2022-01-04  -500    spend  300
5           1   2022-01-05  -500    loan   300
6           1   2022-01-06  100     spend  300
7           1   2022-01-07  -500    spend  -200
8           1   2022-01-08  1000    loan   1000
9           1   2022-01-09  -100    spend  900
Required logic:
I want a special cumulative sum that adds the amount only when:
(the amount is < 0 AND the event is 'spend') OR (the amount is > 0 AND the event is 'loan').
The thing is, I want the cumulative sum to start at the first positive loan amount. I don't care about anything before the positive loan amount, and counting it would obscure the results. The goal is to select the rows that the loan enabled (if the loan is 1000, then we want to select the rows that add up to -1000, but only when the event is 'spend' and the amount is < 0).
My attempt:
WITH tmp AS (
SELECT
1 AS id,
'2022-01-01' AS created,
-200 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-02' AS created,
1000 AS amount,
'loan' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-03' AS created,
-200 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-04' AS created,
-500 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-05' AS created,
-500 AS amount,
'loan' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-06' AS created,
100 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-07' AS created,
-500 AS amount,
'spend' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-08' AS created,
1000 AS amount,
'loan' AS scheme
UNION ALL
SELECT
1 AS id,
'2022-01-09' AS created,
-100 AS amount,
'spend' AS scheme
)
SELECT
*,
SUM(CASE WHEN (scheme != 'loan' AND amount<0) OR (scheme = 'loan' AND amount > 0) THEN amount ELSE 0 END)
OVER (PARTITION BY id ORDER BY created ASC) AS cumulative_sum_spend
FROM tmp
Question
How do I make the cumulative sum reset at row 2 (not conditional to the row number - the requirement is the positive loan amount)?

That's a gaps-and-islands problem, if I am understanding this correctly.
Islands start with a positive loan; within each island, you want to compute a running sum over a subset of rows.
We can identify the islands in a subquery with a window count of positive loans, then do the math in each group with a conditional expression:
select id, created, amount, event,
    sum(case when (event = 'loan' and amount > 0) or (event = 'spend' and amount < 0) then amount end)
        over(partition by id, grp order by created) as cumulative_sum
from (
    select t.*,
        sum(case when event = 'loan' and amount > 0 then 1 else 0 end)
            over(partition by id order by created) grp
    from tmp t
) t
order by id, created
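To check the query against the sample rows, here is a minimal sketch that runs it through Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer (for window functions) and a table named tmp with an event column, as in the answer; the original database engine is not stated, so treat this as an illustration rather than the exact target setup.

```python
import sqlite3

# Sample rows from the question: (id, created, amount, event).
rows = [
    (1, "2022-01-01", -200, "spend"),
    (1, "2022-01-02", 1000, "loan"),
    (1, "2022-01-03", -200, "spend"),
    (1, "2022-01-04", -500, "spend"),
    (1, "2022-01-05", -500, "loan"),
    (1, "2022-01-06", 100, "spend"),
    (1, "2022-01-07", -500, "spend"),
    (1, "2022-01-08", 1000, "loan"),
    (1, "2022-01-09", -100, "spend"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (id INTEGER, created TEXT, amount INTEGER, event TEXT)")
conn.executemany("INSERT INTO tmp VALUES (?, ?, ?, ?)", rows)

query = """
SELECT id, created, amount, event,
       SUM(CASE WHEN (event = 'loan' AND amount > 0)
                  OR (event = 'spend' AND amount < 0)
                THEN amount END)
           OVER (PARTITION BY id, grp ORDER BY created) AS cumulative_sum
FROM (
    -- grp increases by one at every positive loan, which starts a new island
    SELECT t.*,
           SUM(CASE WHEN event = 'loan' AND amount > 0 THEN 1 ELSE 0 END)
               OVER (PARTITION BY id ORDER BY created) AS grp
    FROM tmp t
) t
ORDER BY id, created
"""

cumulative = [row[4] for row in conn.execute(query)]
print(cumulative)  # [-200, 1000, 800, 300, 300, 300, -200, 1000, 900]
```

The rows that do not satisfy the condition contribute NULL, which SUM skips, so the running value simply carries over on those rows.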

One option would be something like this:
SELECT
    *,
    SUM(CASE WHEN cnt >= 1 AND ((scheme != 'loan' AND amount<0) OR (scheme = 'loan' AND amount > 0)) THEN amount ELSE 0 END)
        OVER (PARTITION BY id ORDER BY created ASC) AS cumulative_sum_spend
FROM (
    SELECT *, SUM(CASE WHEN amount > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY id ORDER BY created) cnt
    FROM tmp
) a
The idea here is that the inner query's window function counts the number of previous positive values. Then the outer query can do an extra check cnt >= 1 as part of its window function, so it will only consider values after the first positive one.
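A quick check of this option on the question's sample rows, sketched with Python's sqlite3 (assuming SQLite 3.25 or newer for window functions, and the scheme column name from the question's CTE). It suggests that the cnt >= 1 filter drops the rows before the first positive amount, but the running sum then carries on across the second loan instead of restarting there, so the last two rows come out as 800 and 700 rather than the 1000 and 900 in the desired output.

```python
import sqlite3

# Sample rows from the question: (id, created, amount, scheme).
rows = [
    (1, "2022-01-01", -200, "spend"),
    (1, "2022-01-02", 1000, "loan"),
    (1, "2022-01-03", -200, "spend"),
    (1, "2022-01-04", -500, "spend"),
    (1, "2022-01-05", -500, "loan"),
    (1, "2022-01-06", 100, "spend"),
    (1, "2022-01-07", -500, "spend"),
    (1, "2022-01-08", 1000, "loan"),
    (1, "2022-01-09", -100, "spend"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp (id INTEGER, created TEXT, amount INTEGER, scheme TEXT)")
conn.executemany("INSERT INTO tmp VALUES (?, ?, ?, ?)", rows)

query = """
SELECT created,
       SUM(CASE WHEN cnt >= 1
                 AND ((scheme != 'loan' AND amount < 0)
                   OR (scheme = 'loan' AND amount > 0))
                THEN amount ELSE 0 END)
           OVER (PARTITION BY id ORDER BY created ASC) AS cumulative_sum_spend
FROM (
    -- cnt is the number of positive amounts seen so far, current row included
    SELECT *,
           SUM(CASE WHEN amount > 0 THEN 1 ELSE 0 END)
               OVER (PARTITION BY id ORDER BY created) AS cnt
    FROM tmp
) a
ORDER BY created
"""

running = [row[1] for row in conn.execute(query)]
print(running)  # [0, 1000, 800, 300, 300, 300, -200, 800, 700]
```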

Related

create additional date after and before current row and create new column based on it

Let's say I have this kind of data:
create table example
(cust_id VARCHAR, product VARCHAR, price float, datetime varchar);
insert into example (cust_id, product, price, datetime)
VALUES
('1', 'scooter', 2000, '2022-01-10'),
('1', 'skateboard', 1500, '2022-01-20'),
('1', 'beefmeat', 300, '2022-06-08'),
('2', 'wallet', 200, '2022-02-25'),
('2', 'hairdryer', 250, '2022-04-28'),
('3', 'skateboard', 1600, '2022-03-29')
I want to generate additional rows, and then build a new column based on those rows.
My expected output looks like this:
cust_id  total_price  date     is_active
1        3500         2022-01  active
1        0            2022-02  active
1        0            2022-03  active
1        0            2022-04  inactive
1        0            2022-05  inactive
1        300          2022-06  active
1        0            2022-07  active
2        0            2022-01  inactive
2        200          2022-02  active
2        0            2022-03  active
2        250          2022-04  active
2        0            2022-05  active
2        0            2022-06  active
2        0            2022-07  inactive
3        0            2022-01  inactive
3        0            2022-02  inactive
3        1600         2022-03  active
3        0            2022-04  active
3        0            2022-05  active
3        0            2022-06  inactive
3        0            2022-07  inactive
The rules are:
The first month in which the customer makes a transaction is active; the months before it are inactive.
Example: if the first transaction is in month 2, then month 2 is active and month 1 is inactive (see cust_id 2 and 3).
If there has been no transaction for more than 2 months, the following months are inactive until a new transaction makes the customer active again.
Example: if the last transaction is in month 1, then months 2 and 3 are still active, months 4 and 5 are inactive, and month 6 is active again if there is a new transaction (see cust_id 1 and 3).
My first thought was to use this code, but I don't know what the next step should be:
select *,
date_part('month', age(to_date(date, 'YYYY-MM'), to_date(lag(date) over (partition by cust_id order by date),'YYYY-MM')))date_diff
from(
select
cust_id,
sum(price)total_price,
to_char(to_date(datetime, 'YYYY-MM-DD'),'YYYY-MM')date
from example
group BY
cust_id,
date
order by
cust_id,
date)test
I'm open to any suggestion
Try the following; the explanation is in the query comments:
/* use generate_series to generate a series of dates
starting from the min date of datetime up to the
max datetime with one-month intervals, then do a
cross join with the distinct cust_id to map each cust_id
to each generated date.*/
WITH cust_dates AS
(
SELECT EX.cust_id, to_char(dts, 'YYYY-mm') dts
FROM generate_series
(
(SELECT MIN(datetime)::timestamp FROM example),
(SELECT MAX(datetime)::timestamp + '2 month'::interval FROM example),
'1 month'::interval
) dts
CROSS JOIN (SELECT DISTINCT cust_id FROM example) EX
),
/* do a left join with your table to find prices
for each cust_id/ month, and aggregate for cust_id, month_date
to find the sum of prices for each cust_id, month_date.
*/
monthly_price AS
(
SELECT CD.cust_id,
CD.dts AS month_date,
COALESCE(SUM(price), 0) total_price
FROM cust_dates CD LEFT JOIN example EX
ON CD.cust_id = EX.cust_id AND
CD.dts = to_char(EX.datetime::timestamp, 'YYYY-mm') /* datetime is stored as varchar, so cast it first */
GROUP BY CD.cust_id, CD.dts
)
/* Now, we have the sum of monthly prices for each cust_id,
we can use the max window function with "ROWS BETWEEN 2 PRECEDING AND CURRENT ROW"
to check if one of the (current month or the previous two months) has a sum of prices > 0.
*/
SELECT cust_id, month_date, total_price,
CASE MAX(total_price) OVER
(PARTITION BY cust_id ORDER BY month_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
WHEN 0 THEN 'inactive'
ELSE 'active'
END AS is_active
FROM monthly_price
ORDER BY cust_id, month_date
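The same three steps can be tried outside PostgreSQL with a small sketch using Python's sqlite3. SQLite has no generate_series by default, so this version swaps in a recursive CTE (named months here, a hypothetical name) to build the month list through one month after the last transaction; it also assumes SQLite 3.25 or newer for the window function. It is an adaptation for illustration, not the answer's exact query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE example (cust_id TEXT, product TEXT, price REAL, datetime TEXT);
INSERT INTO example (cust_id, product, price, datetime) VALUES
    ('1', 'scooter',    2000, '2022-01-10'),
    ('1', 'skateboard', 1500, '2022-01-20'),
    ('1', 'beefmeat',    300, '2022-06-08'),
    ('2', 'wallet',      200, '2022-02-25'),
    ('2', 'hairdryer',   250, '2022-04-28'),
    ('3', 'skateboard', 1600, '2022-03-29');
""")

query = """
WITH RECURSIVE months(m) AS (
    -- first day of each month, from the first transaction month
    -- through one month after the last transaction month
    SELECT date(MIN(datetime), 'start of month') FROM example
    UNION ALL
    SELECT date(m, '+1 month') FROM months
    WHERE m < (SELECT date(MAX(datetime), 'start of month', '+1 month') FROM example)
),
monthly_price AS (
    -- every customer paired with every month, with 0 where nothing was bought
    SELECT c.cust_id,
           strftime('%Y-%m', months.m) AS month_date,
           COALESCE(SUM(e.price), 0) AS total_price
    FROM months
    CROSS JOIN (SELECT DISTINCT cust_id FROM example) AS c
    LEFT JOIN example AS e
           ON e.cust_id = c.cust_id
          AND strftime('%Y-%m', e.datetime) = strftime('%Y-%m', months.m)
    GROUP BY c.cust_id, months.m
)
SELECT cust_id, month_date, total_price,
       -- active if the current month or either of the two before it has sales
       CASE WHEN MAX(total_price) OVER (PARTITION BY cust_id ORDER BY month_date
                                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) = 0
            THEN 'inactive' ELSE 'active' END AS is_active
FROM monthly_price
ORDER BY cust_id, month_date
"""

statuses = {}
for cust_id, month_date, total_price, is_active in conn.execute(query):
    statuses.setdefault(cust_id, []).append(is_active)

for cust_id, flags in statuses.items():
    print(cust_id, flags)
```

For the sample data this covers 2022-01 through 2022-07 for each customer, the same range as the expected output in the question.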

How to do one-to-one inner join

I've a transaction table of purchased and returned items, and I want to match a return transaction with the transaction where that corresponding item was purchased. (Here I used the same item ID and amount in all records for simplicity)
trans_ID  date        item_ID  amt   type
1         2022-01-09  100      5000  purchase
2         2022-01-07  100      5000  return
3         2022-01-06  100      5000  purchase
4         2022-01-05  100      5000  purchase
5         2022-01-04  100      5000  return
6         2022-01-03  100      5000  return
7         2022-01-03  100      5000  purchase
8         2022-01-02  100      5000  purchase
9         2022-01-01  100      5000  return
Matching conditions are:
1) The return date must be greater than or equal to the purchase date.
2) The return and purchase transactions must relate to the same item ID and the same transaction amount.
3) For each return, there must be only 1 purchase matched to it. (If there are many related purchases, choose the one with the most recent purchase date. But if the most recent purchase was already used for another return, choose the second-most recent purchase instead, and so on.)
4) From 3), it follows that each purchase must be matched with only 1 return as well.
The result should look like this:
trans_ID  date        trans_ID_matched  date_matched
2         2022-01-07  3                 2022-01-06
5         2022-01-04  7                 2022-01-03
6         2022-01-03  8                 2022-01-02
This is what I've tried.
with temp as (
select a.trans_ID, a.date
, b.trans_ID as trans_ID_matched
, b.date as date_matched
, row_number() over (partition by a.trans_ID, a.date order by b.date desc) as rn1
from
(
select *
from transaction_table
where type = 'return'
) a
inner join
(
select *
from transaction_table
where type = 'purchase'
) b
on a.item_ID = b.item_ID and a.amt = b.amt and a.date >= b.date
)
select * from temp where rn1 = 1
But what I got is
trans_ID  date        trans_ID_matched  date_matched
2         2022-01-07  3                 2022-01-06
5         2022-01-04  7                 2022-01-03
6         2022-01-03  7                 2022-01-03
Here, the trans ID 7 shouldn't be used again in the last row as it has been already matched with trans ID 5 in the row 2. So is there any way to match trans ID 6 with 8 (or any way to tell SQL not to use the already-used one like the purchase 7) ?
I created a fiddle; the result seems OK, but it's up to you to test whether it holds in all situations... 😉
WITH cte as (
SELECT
t1.trans_ID,
t1.[date],
t1.item_ID,
t1.amt,
t1.[type],
pur.trans_ID trans_ID_matched,
pur.[date] datE_matched,
jojo.c
FROM table1 t1
CROSS APPLY (
SELECT
trans_ID,
item_ID,
[date],
amt
FROM table1 pur
WHERE pur.[type] = 'purchase' and t1.[type]='return'
and pur.item_ID = t1.item_ID
and pur.amt = t1.amt
and pur.[date] <= t1.[date]
) pur
CROSS APPLY (
SELECt count(*) as c FROM table1 WHERE trans_ID> t1.trans_ID and trans_ID<pur.trans_ID
) jojo
where jojo.c <=2
)
select
trans_ID,
[date],
item_ID,
amt,
CASE WHEN min(c)=0 then min(trans_ID_matched) else max(trans_ID_matched) end
from cte
group by
trans_ID,
[date],
item_ID,
amt
order by trans_ID;
The count(*) measures the distance between the trans_ID of the return and the trans_ID of the candidate purchase.
This might go wrong when there are more than 2 adjacent 'returns'... (I am afraid it will break, so I did not test this 😢).
But it's a nice problem. Hopefully this will give you some other ideas for finding the correct solution!
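Because the "skip a purchase that is already used" rule depends on earlier matches, it can be easier to reason about procedurally. The following is a minimal Python sketch of a greedy pass, not a translation of the query above. It assumes returns are processed from the latest date to the earliest (an assumption, but the order that reproduces the expected result in the question), and the function name match_returns is hypothetical.

```python
from datetime import date

# Rows from the question: (trans_ID, date, item_ID, amt, type).
transactions = [
    (1, date(2022, 1, 9), 100, 5000, "purchase"),
    (2, date(2022, 1, 7), 100, 5000, "return"),
    (3, date(2022, 1, 6), 100, 5000, "purchase"),
    (4, date(2022, 1, 5), 100, 5000, "purchase"),
    (5, date(2022, 1, 4), 100, 5000, "return"),
    (6, date(2022, 1, 3), 100, 5000, "return"),
    (7, date(2022, 1, 3), 100, 5000, "purchase"),
    (8, date(2022, 1, 2), 100, 5000, "purchase"),
    (9, date(2022, 1, 1), 100, 5000, "return"),
]


def match_returns(rows):
    """Pair each return with the most recent unused purchase that qualifies."""
    # Both lists are sorted latest first (assumed processing order).
    returns = sorted((r for r in rows if r[4] == "return"),
                     key=lambda r: r[1], reverse=True)
    purchases = sorted((r for r in rows if r[4] == "purchase"),
                       key=lambda r: r[1], reverse=True)
    used = set()      # trans_IDs of purchases that are already matched
    matches = []
    for ret_id, ret_date, item_id, amt, _ in returns:
        for pur_id, pur_date, pur_item, pur_amt, _ in purchases:
            if pur_id in used:
                continue
            # Same item, same amount, purchased on or before the return date.
            if pur_item == item_id and pur_amt == amt and pur_date <= ret_date:
                used.add(pur_id)
                matches.append((ret_id, ret_date, pur_id, pur_date))
                break
    return matches


matches = match_returns(transactions)
for ret_id, ret_date, pur_id, pur_date in matches:
    print(ret_id, ret_date, pur_id, pur_date)
```

On the sample data this pairs return 2 with purchase 3, return 5 with purchase 7, and return 6 with purchase 8, and leaves return 9 unmatched because no purchase precedes it.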

Add a 0 in the next row of a column after the last data point

I have written a query which gives the output as shown below:
Date Amount
01-01-2020
01-02-2020 10000
01-03-2020 20000
01-04-2020 30000
01-05-2020 40000
01-06-2020
01-07-2020
01-08-2020
In the above table, we can see that the amount is null for 01-01-2020, 01-06-2020, 01-07-2020, 01-08-2020. Now, I want to add a 0 to the amount column for just 1 row i.e for the date- 01-06-2020 which is after the last data point - 40000. And I'm not sure how to do it. Is there any straight forward query to achieve this? Thank you.
You can use lag() and a case expression:
select date,
case when amount is null and lag(amount) over(order by date) is not null
then 0
else amount
end as amount
from mytable
If you wanted an update statement:
with cte as (
select amount,
case when amount is null and lag(amount) over(order by date) is not null
then 0
end as new_amount
from mytable
)
update cte set amount = new_amount where new_amount = 0
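To see what the lag() version of the select returns on the sample data, here is a small sketch using Python's sqlite3 (assuming SQLite 3.25 or newer for window functions). The table name mytable comes from the answer; the column types are assumptions. The date strings in the sample happen to sort correctly as text, which is all the ORDER BY needs here.

```python
import sqlite3

# Sample rows from the question; None stands for a missing amount.
rows = [
    ("01-01-2020", None),
    ("01-02-2020", 10000),
    ("01-03-2020", 20000),
    ("01-04-2020", 30000),
    ("01-05-2020", 40000),
    ("01-06-2020", None),
    ("01-07-2020", None),
    ("01-08-2020", None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (date TEXT, amount INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)

query = """
SELECT date,
       -- only a NULL that directly follows a non-NULL amount becomes 0
       CASE WHEN amount IS NULL AND LAG(amount) OVER (ORDER BY date) IS NOT NULL
            THEN 0
            ELSE amount
       END AS amount
FROM mytable
ORDER BY date
"""

amounts = [row[1] for row in conn.execute(query)]
print(amounts)  # [None, 10000, 20000, 30000, 40000, 0, None, None]
```

Only 01-06-2020 is filled with 0, because lag() looks at the original amount of the previous row, which is still NULL for 01-07-2020 and 01-08-2020.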

tricky sql interview question(only give you 15mins to solve)

The given table (order info):
client Date Product order amt
1001, 2020-01-01, Desktop1, 100
1001, 2020-01-01, Mobile2, 200
1001, 2020-01-01, Mobile2, 100
1002, 2020-01-02, Mobile1, 100
1002, 2020-01-01, Mobile1, 100
1003, 2020-01-01, Desktop1, 100
1003, 2020-01-02, Desktop2, 100
1004, 2020-01-02, Mobile, 100
The return table should give following information:
On each date, how many clients buy only one type of product (either mobile_unique or desktop_unique), and the total order amount under each type of product
AND
On each date, how many clients buy both types of product, and the total order amount.
So the return table should look like this:
Date. product type total amount number of client
2020-01-01 mobile_only 100 1
2020-01-01 desktop_only 100 1
2020-01-01 both 400 1
2020-01-02 mobile_only 200 2
2020-01-02 desktop_only 100 1
I have solved it by creating multiple tables. But the interviewer only gives 15 minutes to solve it, so I'd like to see a simpler way to solve it.
You can "classify" clients at first (Mob, Des, Bot) and then group:
select date_, class, sum(amt), count(client)
from (
select date_, client, sum(order_amt) amt,
case when min(substr(product, 1, 1)) <> max(substr(product, 1, 1)) then 'B'
else min(substr(product, 1, 1))
end class
from orders group by date_, client)
group by date_, class order by date_, class
dbfiddle for Oracle
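For readers without an Oracle instance, the classify-then-group idea can be tried with a short sketch in Python's sqlite3. The table and column names (orders, date_, order_amt) follow the answer; the subquery alias x is added here for portability and is not part of the original query.

```python
import sqlite3

# Rows from the question: (client, date_, product, order_amt).
rows = [
    (1001, "2020-01-01", "Desktop1", 100),
    (1001, "2020-01-01", "Mobile2", 200),
    (1001, "2020-01-01", "Mobile2", 100),
    (1002, "2020-01-02", "Mobile1", 100),
    (1002, "2020-01-01", "Mobile1", 100),
    (1003, "2020-01-01", "Desktop1", 100),
    (1003, "2020-01-02", "Desktop2", 100),
    (1004, "2020-01-02", "Mobile", 100),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (client INTEGER, date_ TEXT, product TEXT, order_amt INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

query = """
SELECT date_, class, SUM(amt), COUNT(client)
FROM (
    -- one row per client and date; class is 'D', 'M', or 'B' for both
    SELECT date_, client, SUM(order_amt) AS amt,
           CASE WHEN MIN(substr(product, 1, 1)) <> MAX(substr(product, 1, 1)) THEN 'B'
                ELSE MIN(substr(product, 1, 1))
           END AS class
    FROM orders
    GROUP BY date_, client
) x
GROUP BY date_, class
ORDER BY date_, class
"""

summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

The classes come out as single letters (B, D, M) rather than the both / desktop_only / mobile_only labels in the question, so a final mapping step would still be needed for the exact requested output.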
This seems to be an ill-designed table. What if product happens to be a vegetable? I think you should test the data for sanity and give an error in that case.
select Date_, product_type, sum(total_amount) as total_amount, count(*) as number_of_clients
from (
select
Date_, sum(order_amt) as total_amount,
case
when sum(case when SUBSTRING(Product,1,7) = 'Desktop'
or SUBSTRING(Product,1,6) = 'Mobile'
then 0 else 1 end) > 0 then 'Error'
when count(distinct SUBSTRING(Product,1,6)) = 2 then 'both'
when min(SUBSTRING(Product,1,6)) = 'Mobile' then 'mobile_only'
else 'desktop_only'
end as product_type
from orders
group by Date_, client
)x
group by Date_, product_type
order by Date_, product_type desc
output:
Date_ product_type total_amount number_of_clients
2020-01-01 mobile_only 100 1
2020-01-01 desktop_only 100 1
2020-01-01 both 400 1
2020-01-02 mobile_only 200 2
2020-01-02 desktop_only 100 1

Calculate Running total in a new column based Adding or Subtracting condition using SQL

I am trying to calculate running total based on the value plus/minus in another column by Account and Date.
Example
Data
ID Account Date Operation Qty Running_Total
1 A 01/01/2018 plus 10 10
2 A 01/02/2018 plus 20 30
3 A 01/03/2018 minus 5 20
4 A 01/03/2018 minus 5 20
5 A 01/04/2018 plus 30 50
6 B 01/01/2018 plus 15 15
the total
Code:
select ID, Date, Operation, Total,
case when Operation = 'Use Table B' then TableB.RunningTotalQty
else
SUM( case when Operation = 'plus' then Qty
else case when Operation = 'minus' then -Qty end)
OVER (PARTITION BY Account ORDER BY Date) end
From TableA A left Join TableB B
on A.ID = B.ID ...
THIS ANSWERS THE ORIGINAL VERSION OF THE QUESTION.
The case goes inside the sum():
select ID, Date, Operation, Total,
sum(case when Operation = 'plus' then qty else - qty end) over
(partition by Account order by Date) as Running_Total
From TableA ;
This assumes only two operations. If you have more:
select ID, Date, Operation, Total,
sum(case when Operation = 'plus' then qty
when Operation = 'minus' then - qty
else 0
end) over
(partition by Account order by Date) as Running_Total
From TableA ;
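A small sketch of the conditional running total on the sample rows, run through Python's sqlite3 (assuming SQLite 3.25 or newer for window functions; the column types are assumptions). Note that IDs 3 and 4 share the same Account and Date, so with the default window frame they are peers and both receive the total after both subtractions, which matches the 20 and 20 shown in the question.

```python
import sqlite3

# Rows from the question: (ID, Account, Date, Operation, Qty).
rows = [
    (1, "A", "01/01/2018", "plus", 10),
    (2, "A", "01/02/2018", "plus", 20),
    (3, "A", "01/03/2018", "minus", 5),
    (4, "A", "01/03/2018", "minus", 5),
    (5, "A", "01/04/2018", "plus", 30),
    (6, "B", "01/01/2018", "plus", 15),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TableA (ID INTEGER, Account TEXT, Date TEXT, Operation TEXT, Qty INTEGER)"
)
conn.executemany("INSERT INTO TableA VALUES (?, ?, ?, ?, ?)", rows)

query = """
SELECT ID,
       SUM(CASE WHEN Operation = 'plus'  THEN Qty
                WHEN Operation = 'minus' THEN -Qty
                ELSE 0
           END) OVER (PARTITION BY Account ORDER BY Date) AS Running_Total
FROM TableA
ORDER BY ID
"""

running_totals = [row[1] for row in conn.execute(query)]
print(running_totals)  # [10, 30, 20, 20, 50, 15]
```

If each row should instead get its own step (25 then 20), adding a unique tie-breaker such as ID to the ORDER BY inside the window would be one way to do it.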