Categorising transactions in past 90 days - sql

I have a table where the columns are:
Transaction_id(T_id): Distinct id generated for each transactions
Date(Dt): Date of Transaction
account-id(Ac_id): The id from which the transaction is done
Org_id(O_id): It is the id given to the organizations. One organization can have multiple accounts thereby different account id can have the same org_id
Sample table:
T_id
Dt
Ac_id
O_id
101
23/4/22
1
A
102
06/7/22
3
C
103
01/8/22
2
A
104
13/3/22
6
B
*The question is to mark the o_id where transactions are done in the past 90 days as 1 and others as 0
Output
T_id
Dt.
Ac_id.
O_id
Mark
101
23/4/22
1
A
0
102
06/7/22
3
C
1
103
01/8/22
2
A
1
104
13/3/22
6
B
0
The query I am using is:
Select *,
Case when datediff('day', Dt, current_date()) between 0 and 90 then '1'
Else '0'
End as Mark
From Table1
Desired Output:
T_id
Dt.
Ac_id.
O_id
Mark
101
23/4/22
1
A
1
102
06/7/22
3
C
1
103
01/8/22
2
A
1
104
13/3/22
6
B
0
for o_id 'A' from the output the mark I want is 1 in all cases as one transaction is done past 90 days, irrespective of other transactions done prior to 90days.
I have to join this out to another table so need all o_id where ever any one transaction is done in the past 90 days as '1'.
Please help me with it quickly.

The easisest approach is to compare date difference of current date against windowed MAX partitioned by o_id:
SELECT *,
CASE
WHEN DATEDIFF('day', (MAX(Dt) OVER(PARTITION BY o_id)), CURRENT_DATE()) <= 90
THEN 1
ELSE 0
END AS Mark
FROM Tab;
Sample data:
ALTER SESSION SET DATE_INPUT_FORMAT = 'DD/MM/YYYY';
CREATE OR REPLACE TABLE tab(t_id INT,
Dt Date,
Ac_id INT,
O_id TEXT)
AS
SELECT 101, '23/04/2022' ,1 ,'A' UNION
SELECT 102, '06/07/2022' ,3 ,'C' UNION
SELECT 103, '01/08/2022' ,2 ,'A' UNION
SELECT 104, '13/03/2022' ,6 ,'B';
Output:
Snowflake supports natively BOOLEAN data types so entire query could be just:
SELECT *,
DATEDIFF('day', (MAX(Dt) OVER(PARTITION BY o_id)), CURRENT_DATE()) <= 90 AS Mark
FROM tab

Create a subquery where you identify all the distinct o_id where there is a recent transaction, and use that to update the main result.
The subquery would be:
select o_id, dt from table1
group by o_id
having datediff('day', max(Dt), current_date()) between 0 and 90;
Then your main query becomes:
Select *,'1' as Mark
From Tab
where o_id in
(select x.o_id from (select o_id, max(Dt)
from tab
group by o_id
having datediff('day', max(Dt), current_date()) between 0 and 90) x)
union all
select *,'0' as Mark
from Tab
where o_id not in
(select x.o_id from (select o_id, max(Dt)
from tab
group by o_id
having datediff('day', max(Dt), current_date()) between 0 and 90) x);

Related

Frequency Distribution by Day

I have records of No. of calls coming to a call center. When a call comes into a call center a ticket is open.
So, let's say ticket 1 (T1) is open on 8/1/19 and it stays open till 8/5/19. So, if a person ran a query everyday then on 8/1 it will show 1 ticket open...same think on day 2 till day 5....I want to get records by day to see how many tickets were open for each day.....
In short, Frequency Distribution by Day.
Ticket Open_date Close_date
T1 8/1/2019 8/5/2019
T2 8/1/2019 8/6/2019
Result:
Result
Date # Tickets_Open
8/1/2019 2
8/2/2019 2
8/3/2019 2
8/4/2019 2
8/5/2019 2
8/6/2019 1
8/7/2019 0
8/8/2019 0
8/9/2019 0
8/10/2019 0
We can handle your requirement via the use of a calendar table, which stores all dates covering the full range in your data set.
WITH dates AS (
SELECT '2019-08-01' AS dt UNION ALL
SELECT '2019-08-02' UNION ALL
SELECT '2019-08-03' UNION ALL
SELECT '2019-08-04' UNION ALL
SELECT '2019-08-05' UNION ALL
SELECT '2019-08-06' UNION ALL
SELECT '2019-08-07' UNION ALL
SELECT '2019-08-08' UNION ALL
SELECT '2019-08-09' UNION ALL
SELECT '2019-08-10'
)
SELECT
d.dt,
COUNT(t.Open_date) AS num_tickets_open
FROM dates d
LEFT JOIN tickets t
ON d.dt BETWEEN t.Open_date AND t.Close_date
GROUP BY
d.dt;
Note that in practice if you expect to have this reporting requirement in the long term, you might want to replace the dates CTE above with a bona-fide table of dates.
This solution generates the list of dates from the tickets table using CTE recursion and calculates the count:
WITH Tickets(Ticket, Open_date, Close_date) AS
(
SELECT "T1", "8/1/2019", "8/5/2019"
UNION ALL
SELECT "T2", "8/1/2019", "8/6/2019"
),
Ticket_dates(Ticket, Dates) as
(
SELECT t1.Ticket, CONVERT(DATETIME, t1.Open_date)
FROM Tickets t1
UNION ALL
SELECT t1.Ticket, DATEADD(dd, 1, CONVERT(DATETIME, t1.Dates))
FROM Ticket_dates t1
inner join Tickets t2 on t1.Ticket = t2.Ticket
where DATEADD(dd, 1, CONVERT(DATETIME, t1.Dates)) <= CONVERT(DATETIME, t2.Close_date)
)
SELECT CONVERT(varchar, Dates, 1), count(*)
FROM Ticket_dates
GROUP by Dates
ORDER by Dates
A "general purpose" trick is to generate a series of numbers, which can be done using CTE's but there are many alternatives, and from that create the needed range of dates. Once that exists then you can left join your ticket data to this and then count by date.
CREATE TABLE mytable(
Ticket VARCHAR(8) NOT NULL PRIMARY KEY
,Open_date DATE NOT NULL
,Close_date DATE NOT NULL
);
INSERT INTO mytable(Ticket,Open_date,Close_date) VALUES ('T1','8/1/2019','8/5/2019');
INSERT INTO mytable(Ticket,Open_date,Close_date) VALUES ('T2','8/1/2019','8/6/2019');
Also note I am using a cross apply in this example to "attach" the min and max dates of your tickets to each numbered row. You would need to include your own logic on what data to select here.
;WITH
cteDigits AS (
SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
)
, cteTally AS (
SELECT
[1s].digit
+ [10s].digit * 10
+ [100s].digit * 100 /* add more like this as needed */
AS num
FROM cteDigits [1s]
CROSS JOIN cteDigits [10s]
CROSS JOIN cteDigits [100s] /* add more like this as needed */
)
select
n.num + 1 rownum
, dateadd(day,n.num,ca.min_date) as on_date
, count(t.Ticket) as tickets_open
from cteTally n
cross apply (select min(Open_date), max(Close_date) from mytable) ca (min_date, max_date)
left join mytable t on dateadd(day,n.num,ca.min_date) between t.Open_date and t.Close_date
where dateadd(day,n.num,ca.min_date) <= ca.max_date
group by
n.num + 1
, dateadd(day,n.num,ca.min_date)
order by
rownum
;
result:
+--------+---------------------+--------------+
| rownum | on_date | tickets_open |
+--------+---------------------+--------------+
| 1 | 01.08.2019 00:00:00 | 2 |
| 2 | 02.08.2019 00:00:00 | 2 |
| 3 | 03.08.2019 00:00:00 | 2 |
| 4 | 04.08.2019 00:00:00 | 2 |
| 5 | 05.08.2019 00:00:00 | 2 |
| 6 | 06.08.2019 00:00:00 | 1 |
+--------+---------------------+--------------+

SQL - Find if column dates include at least partially a date range

I need to create a report and I am struggling with the SQL script.
The table I want to query is a company_status_history table which has entries like the following (the ones that I can't figure out)
Table company_status_history
Columns:
| id | company_id | status_id | effective_date |
Data:
| 1 | 10 | 1 | 2016-12-30 00:00:00.000 |
| 2 | 10 | 5 | 2017-02-04 00:00:00.000 |
| 3 | 11 | 5 | 2017-06-05 00:00:00.000 |
| 4 | 11 | 1 | 2018-04-30 00:00:00.000 |
I want to answer to the question "Get all companies that have been at least for some point in status 1 inside the time period 01/01/2017 - 31/12/2017"
Above are the cases that I don't know how to handle since I need to add some logic of type :
"If this row is status 1 and it's date is before the date range check the next row if it has a date inside the date range."
"If this row is status 1 and it's date is after the date range check the row before if it has a date inside the date range."
I think this can be handled as a gaps and islands problem. Consider the following input data: (same as sample data of OP plus two additional rows)
id company_id status_id effective_date
-------------------------------------------
1 10 1 2016-12-15
2 10 1 2016-12-30
3 10 5 2017-02-04
4 10 4 2017-02-08
5 11 5 2017-06-05
6 11 1 2018-04-30
You can use the following query:
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
ORDER BY company_id, effective_date
to get:
id company_id status_id effective_date grp
-----------------------------------------------
1 10 1 2016-12-15 0
2 10 1 2016-12-30 1
3 10 5 2017-02-04 2
4 10 4 2017-02-08 2
5 11 5 2017-06-05 0
6 11 1 2018-04-30 0
Now you can identify status = 1 islands using:
;WITH CTE AS
(
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
)
SELECT id, company_id, status_id, effective_date,
ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) -
cnt AS grp
FROM CTE
Output:
id company_id status_id effective_date grp
-----------------------------------------------
1 10 1 2016-12-15 1
2 10 1 2016-12-30 1
3 10 5 2017-02-04 1
4 10 4 2017-02-08 2
5 11 5 2017-06-05 1
6 11 1 2018-04-30 2
Calculated field grp will help us identify those islands:
;WITH CTE AS
(
SELECT t.id, t.company_id, t.status_id, t.effective_date, x.cnt
FROM company_status_history AS t
OUTER APPLY
(
SELECT COUNT(*) AS cnt
FROM company_status_history AS c
WHERE c.status_id = 1
AND c.company_id = t.company_id
AND c.effective_date < t.effective_date
) AS x
), CTE2 AS
(
SELECT id, company_id, status_id, effective_date,
ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) -
cnt AS grp
FROM CTE
)
SELECT company_id,
MIN(effective_date) AS start_date,
CASE
WHEN COUNT(*) > 1 THEN DATEADD(DAY, -1, MAX(effective_date))
ELSE MIN(effective_date)
END AS end_date
FROM CTE2
GROUP BY company_id, grp
HAVING COUNT(CASE WHEN status_id = 1 THEN 1 END) > 0
Output:
company_id start_date end_date
-----------------------------------
10 2016-12-15 2017-02-03
11 2018-04-30 2018-04-30
All you want know is those records from above that overlap with the specified interval.
Demo here with somewhat more complicated use case.
Maybe this is what you are looking for? For these kind of questions, you need to join two instance of your table, in this case I am just joining with next record by Id, which probably is not totally correct. To do it better, you can create a new Id using a windowed function like row_number, ordering the table by your requirement criteria
If this row is status 1 and it's date is before the date range check
the next row if it has a date inside the date range
declare #range_st date = '2017-01-01'
declare #range_en date = '2017-12-31'
select
case
when csh1.status_id=1 and csh1.effective_date<#range_st
then
case
when csh2.effective_date between #range_st and #range_en then true
else false
end
else NULL
end
from company_status_history csh1
left join company_status_history csh2
on csh1.id=csh2.id+1
Implementing second criteria:
"If this row is status 1 and it's date is after the date range check
the row before if it has a date inside the date range."
declare #range_st date = '2017-01-01'
declare #range_en date = '2017-12-31'
select
case
when csh1.status_id=1 and csh1.effective_date<#range_st
then
case
when csh2.effective_date between #range_st and #range_en then true
else false
end
when csh1.status_id=1 and csh1.effective_date>#range_en
then
case
when csh3.effective_date between #range_st and #range_en then true
else false
end
else null -- ¿?
end
from company_status_history csh1
left join company_status_history csh2
on csh1.id=csh2.id+1
left join company_status_history csh3
on csh1.id=csh3.id-1
I would suggest the use of a cte and the window functions ROW_NUMBER. With this you can find the desired records. An example:
DECLARE #t TABLE(
id INT
,company_id INT
,status_id INT
,effective_date DATETIME
)
INSERT INTO #t VALUES
(1, 10, 1, '2016-12-30 00:00:00.000')
,(2, 10, 5, '2017-02-04 00:00:00.000')
,(3, 11, 5, '2017-06-05 00:00:00.000')
,(4, 11, 1, '2018-04-30 00:00:00.000')
DECLARE #StartDate DATETIME = '2017-01-01';
DECLARE #EndDate DATETIME = '2017-12-31';
WITH cte AS(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY effective_date) AS rn
FROM #t
),
cteLeadLag AS(
SELECT c.*, ISNULL(c2.effective_date, c.effective_date) LagEffective, ISNULL(c3.effective_date, c.effective_date)LeadEffective
FROM cte c
LEFT JOIN cte c2 ON c2.company_id = c.company_id AND c2.rn = c.rn-1
LEFT JOIN cte c3 ON c3.company_id = c.company_id AND c3.rn = c.rn+1
)
SELECT 'Included' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date BETWEEN #StartDate AND #EndDate
UNION ALL
SELECT 'Following' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date > #EndDate
AND LagEffective BETWEEN #StartDate AND #EndDate
UNION ALL
SELECT 'Trailing' AS RangeStatus, *
FROM cteLeadLag
WHERE status_id = 1
AND effective_date < #EndDate
AND LeadEffective BETWEEN #StartDate AND #EndDate
I first select all records with their leading and lagging Dates and then I perform your checks on the inclusion in the desired timespan.
Try with this, self-explanatory. Responds to this part of your question:
I want to answer to the question "Get all companies that have been at
least for some point in status 1 inside the time period 01/01/2017 -
31/12/2017"
Case that you want to find those id's that have been in any moment in status 1 and have records in the period requested:
SELECT *
FROM company_status_history
WHERE id IN
( SELECT Id
FROM company_status_history
WHERE status_id=1 )
AND effective_date BETWEEN '2017-01-01' AND '2017-12-31'
Case that you want to find id's in status 1 and inside the period:
SELECT *
FROM company_status_history
WHERE status_id=1
AND effective_date BETWEEN '2017-01-01' AND '2017-12-31'

SQL (Vertica) - Calculate number of users who returned to the app at least x days in the past 7 days

Suppose I have my table like:
uid day_used_app
--- -------------
1 2012-04-28
1 2012-04-29
1 2012-04-30
2 2012-04-29
2 2012-04-30
2 2012-05-01
2 2012-05-21
2 2012-05-22
Suppose I want the number of unique users who returned to the app at least 2 different days in the last 7 days (from 2012-05-03).
So as an example to retrieve the number of users who have used the application on at least 2 different days in the past 7 days:
select count(distinct case when num_different_days_on_app >= 2
then uid else null end) as users_return_2_or_more_days
from (
select uid,
count(distinct day_used_app) as num_different_days_on_app
from table
where day_used_app between current_date() - 7 and current_date()
group by 1
)
This gives me:
users_return_2_or_more_days
---------------------------
2
The question I have is:
What if I want to do this for every day up to now so that my table looks like this, where the second field equals the number of unique users who returned 2 or more different days within a week prior to the date in the first field.
date users_return_2_or_more_days
-------- ---------------------------
2012-04-28 2
2012-04-29 2
2012-04-30 3
2012-05-01 4
2012-05-02 4
2012-05-03 3
Would this help?
WITH
-- your original input, don't use in "real" query ...
input(uid,day_used_app) AS (
SELECT 1,DATE '2012-04-28'
UNION ALL SELECT 1,DATE '2012-04-29'
UNION ALL SELECT 1,DATE '2012-04-30'
UNION ALL SELECT 2,DATE '2012-04-29'
UNION ALL SELECT 2,DATE '2012-04-30'
UNION ALL SELECT 2,DATE '2012-05-01'
UNION ALL SELECT 2,DATE '2012-05-21'
UNION ALL SELECT 2,DATE '2012-05-22'
)
-- end of input, start "real" query here, replace ',' with 'WITH'
,
one_week_b4 AS (
SELECT
uid
, day_used_app
, day_used_app -7 AS day_used_1week_b4
FROM input
)
SELECT
one_week_b4.uid
, one_week_b4.day_used_app
, count(*) AS users_return_2_or_more_days
FROM one_week_b4
JOIN input
ON input.day_used_app BETWEEN one_week_b4.day_used_1week_b4 AND one_week_b4.day_used_app
GROUP BY
one_week_b4.uid
, one_week_b4.day_used_app
HAVING count(*) >= 2
ORDER BY 1;
Output is:
uid|day_used_app|users_return_2_or_more_days
1|2012-04-29 | 3
1|2012-04-30 | 5
2|2012-04-29 | 3
2|2012-04-30 | 5
2|2012-05-01 | 6
2|2012-05-22 | 2
Does that help your needs?
Marco the Sane ...
SELECT DISTINCT
t1.day_used_app,
(
SELECT SUM(CASE WHEN t.num_visits >= 2 THEN 1 ELSE 0 END)
FROM
(
SELECT uid,
COUNT(DISTINCT day_used_app) AS num_visits
FROM table
WHERE day_used_app BETWEEN t1.day_used_app - 7 AND t1.day_used_app
GROUP BY uid
) t
) AS users_return_2_or_more_days
FROM table t1

Oracle SQL overlap between begin date and end date in 2 or more records

Database my_table:
id seq start_date end_date
1 1 01-01-2017 02-01-2017
1 2 07-01-2017 09-01-2017
1 3 11-01-2017 11-01-2017
2 1 20-01-2017 20-01-2017
3 1 01-02-2017 02-02-2017
3 2 03-02-2017 04-02-2017
3 3 08-01-2017 09-02-2017
3 4 09-01-2017 10-02-2017
3 5 10-01-2017 12-02-2017
My requirement is to get the first date (normally seq 1 start date) and end date (normally last seq end date) and the number of dates occurred during all seq for each unique ID.
Date occurred:
id 1 2 3
01-01-2017 20-01-2017 01-02-2017
02-01-2017 02-02-2017
07-01-2017 03-02-2017
08-01-2017 04-02-2017
09-01-2017 08-02-2017
11-01-2017 09-02-2017
10-02-2017
11-02-2017
12-02-2017
total 6 1 9
Here is the result I want:
id start_date end_date num_date
1 01-01-2017 11-01-2017 6
2 20-01-2017 20-01-2017 1
3 01-02-2017 12-02-2017 9
I have tried
SELECT id
, MIN(start_date)
, MAX(end_date)
, SUM(end_date - start_date + 1)
FROM my_table
GROUP BY id
and this SQL statement work fine in id 1 and 2 since there is no overlap date between begin date and end date. But for id 3, the result num_date is 11. Could you please suggest the SQL statement to solve this problem? Thank you.
One more question: The date in database is in datetime format. How do I convert it to date. I tried to use TRUNC function but it sometimes convert date to yesterday instead.
You need to count how many times an end_date equals the following start_date. For this you need to use the lag() or the lead() analytic function. You can use a case expression for the comparison, but alas you can't wrap the case expression within a COUNT or SUM in the same query; you need a subquery and an outer query.
Something like this; not tested, since you didn't provide CREATE TABLE and INSERT statements to recreate your sample data.
select id, min(start_date) as start_date, max(end_date) as end_date,
sum(end_date - start_date + 1 - flag) as num_days
from ( select id, start_date, end_date,
case when start_date = lag(end_date)
over (partition by id order by end_date) then 1
else 0 end as flag
from my_table
)
group by id;
SELECT id,
MIN( start_date ) AS start_date,
MAX( end_date ) AS end_date,
SUM( end_date - start_date + 1 ) AS num_days
FROM (
SELECT id,
GREATEST(
start_date,
COALESCE(
LAG( end_date ) OVER ( PARTITION BY id ORDER BY seq ) + 1,
start_date
)
) AS start_date,
end_date
FROM your_table
)
WHERE start_date <= end_date
GROUP BY id;

Oracle SQL sum up values till another value is reached

I hope I can describe my challenge in an understandable way.
I have two tables on a Oracle Database 12c which look like this:
Table name "Invoices"
I_ID | invoice_number | creation_date | i_amount
------------------------------------------------------
1 | 10000000000 | 01.02.2016 00:00:00 | 30
2 | 10000000001 | 01.03.2016 00:00:00 | 25
3 | 10000000002 | 01.04.2016 00:00:00 | 13
4 | 10000000003 | 01.05.2016 00:00:00 | 18
5 | 10000000004 | 01.06.2016 00:00:00 | 12
Table name "payments"
P_ID | reference | received_date | p_amount
------------------------------------------------------
1 | PAYMENT01 | 12.02.2016 13:14:12 | 12
2 | PAYMENT02 | 12.02.2016 15:24:21 | 28
3 | PAYMENT03 | 08.03.2016 23:12:00 | 2
4 | PAYMENT04 | 23.03.2016 12:32:13 | 30
5 | PAYMENT05 | 12.06.2016 00:00:00 | 15
So I want to have a select statement (maybe with oracle analytic functions but I am not really familiar with it) where the payments are getting summed up till the amount of an invoice is reached, ordered by dates. If the sum of for example two payments is more than the invoice amount the rest of the last payment amount should be used for the next invoice.
In this example the result should be like this:
invoice_number | reference | used_pay_amount | open_inv_amount
----------------------------------------------------------
10000000000 | PAYMENT01 | 12 | 18
10000000000 | PAYMENT02 | 18 | 0
10000000001 | PAYMENT02 | 10 | 15
10000000001 | PAYMENT03 | 2 | 13
10000000001 | PAYMENT04 | 13 | 0
10000000002 | PAYMENT04 | 13 | 0
10000000003 | PAYMENT04 | 4 | 14
10000000003 | PAYMENT05 | 14 | 0
10000000004 | PAYMENT05 | 1 | 11
It would be nice if there is a solution with a "simple" select statement.
thx in advance for your time ...
Oracle Setup:
CREATE TABLE invoices ( i_id, invoice_number, creation_date, i_amount ) AS
SELECT 1, 100000000, DATE '2016-01-01', 30 FROM DUAL UNION ALL
SELECT 2, 100000001, DATE '2016-02-01', 25 FROM DUAL UNION ALL
SELECT 3, 100000002, DATE '2016-03-01', 13 FROM DUAL UNION ALL
SELECT 4, 100000003, DATE '2016-04-01', 18 FROM DUAL UNION ALL
SELECT 5, 100000004, DATE '2016-05-01', 12 FROM DUAL;
CREATE TABLE payments ( p_id, reference, received_date, p_amount ) AS
SELECT 1, 'PAYMENT01', DATE '2016-01-12', 12 FROM DUAL UNION ALL
SELECT 2, 'PAYMENT02', DATE '2016-01-13', 28 FROM DUAL UNION ALL
SELECT 3, 'PAYMENT03', DATE '2016-02-08', 2 FROM DUAL UNION ALL
SELECT 4, 'PAYMENT04', DATE '2016-02-23', 30 FROM DUAL UNION ALL
SELECT 5, 'PAYMENT05', DATE '2016-05-12', 15 FROM DUAL;
Query:
WITH total_invoices ( i_id, invoice_number, creation_date, i_amount, i_total ) AS (
SELECT i.*,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id )
FROM invoices i
),
total_payments ( p_id, reference, received_date, p_amount, p_total ) AS (
SELECT p.*,
SUM( p_amount ) OVER ( ORDER BY received_date, p_id )
FROM payments p
)
SELECT invoice_number,
reference,
LEAST( p_total, i_total )
- GREATEST( p_total - p_amount, i_total - i_amount ) AS used_pay_amount,
GREATEST( i_total - p_total, 0 ) AS open_inv_amount
FROM total_invoices
INNER JOIN
total_payments
ON ( i_total - i_amount < p_total
AND i_total > p_total - p_amount );
Explanation:
The two sub-query factoring (WITH ... AS ()) clauses just add an extra virtual column to the invoices and payments tables with the cumulative sum of the invoice/payment amount.
You can associate a range with each invoice (or payment) as the cumulative amount owing (paid) before the invoice (payment) was placed and the cumulative amount owing (paid) after. The two tables can then be joined where there is an overlap of these ranges.
The open_inv_amount is the positive difference between the cumulative amount invoiced and the cumulative amount paid.
The used_pay_amount is slightly more complicated but you need to find the difference between the lower of the current cumulative invoice and payment totals and the higher of the previous cumulative invoice and payment totals.
Output:
INVOICE_NUMBER REFERENCE USED_PAY_AMOUNT OPEN_INV_AMOUNT
-------------- --------- --------------- ---------------
100000000 PAYMENT01 12 18
100000000 PAYMENT02 18 0
100000001 PAYMENT02 10 15
100000001 PAYMENT03 2 13
100000001 PAYMENT04 13 0
100000002 PAYMENT04 13 0
100000003 PAYMENT04 4 14
100000003 PAYMENT05 14 0
100000004 PAYMENT05 1 11
Update:
Based on mathguy's method of using UNION to join the data, I came up with a different solution re-using some of my code.
WITH combined ( invoice_number, reference, i_amt, i_total, p_amt, p_total, total ) AS (
SELECT invoice_number,
NULL,
i_amount,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id ),
NULL,
NULL,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id )
FROM invoices
UNION ALL
SELECT NULL,
reference,
NULL,
NULL,
p_amount,
SUM( p_amount ) OVER ( ORDER BY received_date, p_id ),
SUM( p_amount ) OVER ( ORDER BY received_date, p_id )
FROM payments
ORDER BY 7,
2 NULLS LAST,
1 NULLS LAST
),
filled ( invoice_number, reference, i_prev, i_total, p_prev, p_total ) AS (
SELECT FIRST_VALUE( invoice_number ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( reference ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( i_total - i_amt ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( i_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( p_total - p_amt ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
COALESCE(
p_total,
LEAD( p_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM ),
LAG( p_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM )
)
FROM combined
),
vals ( invoice_number, reference, upa, oia, prev_invoice ) AS (
SELECT invoice_number,
reference,
COALESCE( LEAST( p_total - i_total ) - GREATEST( p_prev, i_prev ), 0 ),
GREATEST( i_total - p_total, 0 ),
LAG( invoice_number ) OVER ( ORDER BY ROWNUM )
FROM filled
)
SELECT invoice_number,
reference,
upa AS used_pay_amount,
oia AS open_inv_amount
FROM vals
WHERE upa > 0
OR ( reference IS NULL AND invoice_number <> prev_invoice AND oia > 0 );
Explanation:
The combined sub-query factoring clause joins the two tables with a UNION ALL and generates the cumulative totals for the amounts invoiced and paid. The final thing it does is order the rows by their ascending cumulative total (and if there are ties it will put the payments, in order created, before the invoices).
The filled sub-query factoring clause will fill the previously generated table so that if a value is null then it will take the value from the next non-null row (and if there is an invoice with no payments then it will find the total of the previous payments from the preceding rows).
The vals sub-query factoring clause applies the same calculations as my previous query (see above). It also adds the prev_invoice column to help identify invoices which are entirely unpaid.
The final SELECT takes the values and filters out the unnecessary rows.
Here is a solution that doesn't require a join. This is important if the amount of data is significant. I did some testing on my laptop (nothing commercial), using the free edition (XE) of Oracle 11.2. Using MT0's solution, the query with the join takes about 11 seconds if there are 10k invoices and 10k payments. For 50k invoices and 50k payments, the query took 287 seconds (almost 5 minutes). This is understandable, since joining two 50k tables requires 2.5 billion comparisons.
The alternative below uses a union. It uses lag() and last_value() to do the work the join does in the other solution. This union-based solution, with 50k invoices and 50k payments, took less than 0.5 seconds on my laptop (!)
I simplified the setup a bit; i_id, invoice_number and creation_date are all used for one purpose only: to order the invoice amounts. I use just an inv_id (invoice id) for that purpose, and similar for payments..
For testing purposes, I created tables invoices and payments like so:
create table invoices (inv_id, inv_amt) as
(select level, trunc(dbms_random.value(20, 80)) from dual connect by level <= 50000);
create table payments (pmt_id, pmt_amt) as
(select level, trunc(dbms_random.value(20, 80)) from dual connect by level <= 50000);
Then, to test the solutions, I use the queries to populate a CTAS, like this:
create table bal_of_pmts as
[select query, including the WITH clause but without the setup CTE's, comes here]
In my solution, I look to show the allocation of payments to one or more invoice, and the payment of invoices from one or more payments; the output discussed in the original post only covers half of this information, but for symmetry it makes more sense to me to show both halves. The output (for the same inputs as in the original post) looks like this, with my version of inv_id and pmt_id:
INV_ID PAID UNPAID PMT_ID USED AVAILABLE
---------- ---------- ---------- ---------- ---------- ----------
1 12 18 101 12 0
1 18 0 103 18 10
2 10 15 103 10 0
2 2 13 105 2 0
2 13 0 107 13 17
3 13 0 107 13 4
4 4 14 107 4 0
4 14 0 109 14 1
5 1 11 109 1 0
5 11 0 11
Notice how the left half is what the original post requested. There is an extra row at the end. Notice the NULL for payment id, for a payment of 11 - that shows how much of the last payment is left uncovered. If there was an invoice with id = 6, for an amount of, say, 22, then there would be one more row - showing the entire amount (22) of that invoice as "paid" from a payment with no id - meaning actually not covered (yet).
The query may be a little easier to understand than the join approach. To see what it does, it may help to look closely at intermediate results, especially the CTE c (in the WITH clause).
with invoices (inv_id, inv_amt) as (
select 1, 30 from dual union all
select 2, 25 from dual union all
select 3, 13 from dual union all
select 4, 18 from dual union all
select 5, 12 from dual
),
payments (pmt_id, pmt_amt) as (
select 101, 12 from dual union all
select 103, 28 from dual union all
select 105, 2 from dual union all
select 107, 30 from dual union all
select 109, 15 from dual
),
c (kind, inv_id, inv_cml, pmt_id, pmt_cml, cml_amt) as (
select 'i', inv_id, sum(inv_amt) over (order by inv_id), null, null,
sum(inv_amt) over (order by inv_id)
from invoices
union all
select 'p', null, null, pmt_id, sum(pmt_amt) over (order by pmt_id),
sum(pmt_amt) over (order by pmt_id)
from payments
),
d (inv_id, paid, unpaid, pmt_id, used, available) as (
select last_value(inv_id) ignore nulls over (order by cml_amt desc),
cml_amt - lead(cml_amt, 1, 0) over (order by cml_amt desc),
case kind when 'i' then 0
else last_value(inv_cml) ignore nulls
over (order by cml_amt desc) - cml_amt end,
last_value(pmt_id) ignore nulls over (order by cml_amt desc),
cml_amt - lead(cml_amt, 1, 0) over (order by cml_amt desc),
case kind when 'p' then 0
else last_value(pmt_cml) ignore nulls
over (order by cml_amt desc) - cml_amt end
from c
)
select inv_id, paid, unpaid, pmt_id, used, available
from d
where paid != 0
order by inv_id, pmt_id
;
In most cases, CTE d is all we need. However, if the cumulative sum for several invoices is exactly equal to the cumulative sum for several payments, my query would add a row with paid = unpaid = 0. (MT0's join solution does not have this problem.) To cover all possible cases, and not have rows with no information, I had to add the filter for paid != 0.