Distribute rows evenly by days - sql

I have a table where I store what I'll call manual values, which are used later in my code. The table looks like this:
subId | MonthNo | PackagesNumber | Country | EntryMethod | PaidAmount | Version
1 | 201701 | 223 | NO | BCD | 44803 | 2
2 | 201701 | 61 | NO | GHI | 11934 | 2
3 | 201701 | 929 | NO | ABC | 88714 | 2
4 | 201701 | 470 | NO | DEF | 98404 | 2
5 | 201702 | 223 | NO | BCD | 28225 | 2
All I have to do is split those values into single rows, at the level of a single package. For example, there are 223 packages in January 2017 in Country NO with EntryMethod BCD, so I want 223 separate rows. PaidAmount should also be divided by PackagesNumber.
The problem is that I have to associate a date with every record. Records should be distributed evenly through the whole month. I have a Date dimension that I can join to my table by pulling the month and year separately from MonthNo.
For example, for January 2017 and EntryMethod BCD I have 223 packages, so that's ~7 packages per day.
That's what I want:
subId | Date | Country | Packages | EntryMethod | PaidAmount | Version
1 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
2 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
3 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
4 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
5 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
6 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
7 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
8 | 02.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
Bonus: I wrote code that splits Packages into single records and assigns the first day of each month as the date.
SELECT
Date =
(
SELECT TOP 1
date
FROM dim_Date dim
WHERE dim.Month = a.Month
AND dim.Year = a.Year
)
, Country
, EntryMethod
, Deliveries = 1
, PaidAmount = NULLIF(PaidAmount, 0) / PackagesNumber
, SubscriptionId = 90000000 + ROW_NUMBER() OVER(ORDER BY n.number)
, Version
FROM
(
SELECT
[Year] = LEFT(MonthNo, 4)
, [Month] = RIGHT(MonthNo, 2)
, Country
, EntryMethod
, PackagesNumber
, PaidAmount
, Version
FROM tgm.rep_PredictionsReport_ManualValues tgm
/*WHERE MonthNo = 201701*/
) a
JOIN master..spt_values n
ON n.type = 'P'
AND n.number < CAST(PackagesNumber AS INT);
EDIT: I made some progress. I used NTILE function, to divide rows into groups.
The only thing that changed is Date from top level select. It looks like that now:
Date = concat([Year], '-', [Month], '-', case when ntile(31) over(order by n.number) < 10 then '0' + cast(ntile(31) over(order by n.number) as varchar(2)) else cast(ntile(31) over(order by n.number) as varchar(2)) end)
Explanation: I am building the Date field from the Year and Month fields, plus NTILE over the number of days in the month (a static number for now, to be changed later). The results aren't what I expected: it creates groups twice as big as they should be (14 instead of 7 rows for each date).

You can accomplish this using the modulo operator, which allows you to divide items into a set number of categories.
Here is a full test: http://rextester.com/TOROA96856
Here is the relevant query:
--recursive query to expand each row.
with expand_rows (subid,monthno,month,packagesnumber,paidamount) as (
select subid,monthno,month,packagesnumber,(paidamount+0.0000)/packagesnumber
from initial_table
union all
select subid,monthno,month,packagesnumber-1,paidamount
from expand_rows where packagesnumber >1
)
select expand_rows.*,(packagesnumber % numdays)+1 day, paidamount from expand_rows
join dayspermonth d on
d.month = expand_rows.month
order by subid, day
option (maxrecursion 0)
(packagesnumber % numdays)+1 is the modulo operation that assigns items to a day.
Note that I precomputed a table of the number of days in each month in order to use in the query. I also simplified the problem slightly for purposes of the answer (added a pure month column because I didn't want to mess around with replicating your date dimension).
You may need to tweak the modulo query if you care where the extra items end up when things don't divide evenly (e.g. if you have 32 items in January, which day has an extra item?). In this example the second day of the month tends to get the most (because of adding 1 to account for the fact that the last day of the month ends up 0). If you want the extra items to fall at the beginning of the month, you could instead use a case statement that converts 0 to the number of days in the month.
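To see where the extra items land with the `(packagesnumber % numdays)+1` rule, here is a minimal Python sketch (not part of the original answer). It assumes the recursive query numbers the packages 1 through 223, which is what the `packagesnumber-1` recursion appears to produce; the function name `assign_days_by_modulo` is made up for illustration.

```python
from collections import Counter

def assign_days_by_modulo(packages: int, num_days: int) -> Counter:
    """Count packages per day when package k (1..packages) is assigned
    to day (k % num_days) + 1, mirroring the modulo expression above."""
    return Counter((k % num_days) + 1 for k in range(1, packages + 1))

# 223 packages spread over the 31 days of January
counts = assign_days_by_modulo(223, 31)
print(counts[1], counts[2], counts[7], counts[8])  # 7 8 8 7
```

Under these assumptions, days 2 to 7 receive the extra package and every other day receives 7, which matches the remark that the surplus starts on the second day.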

To distribute 223 rows evenly over the days of January, we do this:
There are 31 days in January
223/31 is 7 (integer division)
The remainder of 223/31 is 6
So that's 7 records per day, plus 1 extra record for each of January 1-6.
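The arithmetic above can be checked with a short Python sketch (an illustration only; the helper name `packages_per_day` is hypothetical). It gives every day the integer quotient and hands one extra package to each of the first `remainder` days.

```python
def packages_per_day(packages: int, days_in_month: int) -> list[int]:
    """Return the number of packages for each day of the month:
    the quotient everywhere, plus 1 on the first `remainder` days."""
    base, extra = divmod(packages, days_in_month)
    return [base + (1 if day <= extra else 0)
            for day in range(1, days_in_month + 1)]

per_day = packages_per_day(223, 31)
print(per_day[:7])   # [8, 8, 8, 8, 8, 8, 7]
print(sum(per_day))  # 223
```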
I've used a tally table to generate the dates and a few other things, but the distribution of rows per day can be determined like this:
with
tally as
(
select row_number() over (order by n)-1 n from
(values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) n(n)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) m(m)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) l(m)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) k(m)
)
,t1 as
(
select
*
from
(values
(1 , 201701 , 223 , 'NO' , 'BCD' , 44803 , 2)
,(2 , 201701 , 61 , 'NO' , 'GHI' , 11934 , 2)
,(3 , 201701 , 929 , 'NO' , 'ABC' , 88714 , 2)
,(4 , 201701 , 470 , 'NO' , 'DEF' , 98404 , 2)
,(5 , 201702 , 223 , 'NO' , 'BCD' , 28225 , 2)
) t(subId , MonthNo , PackagesNumber , Country , EntryMethod , PaidAmount , Version)
)
,dates as
(
select dateadd(day,n,'20170101') as dt
,convert(varchar(10),dateadd(day,n,'20170101'),112)/100 mnthkey
,day(dateadd(day,-1,dateadd(month,1,cast(((convert(varchar(10),dateadd(day,n,'20170101'),112)/100)*100 +1) as varchar(10))))) DaysInMonth
from
tally
)
select
subId
,MonthNo
,dt
,PackagesNumber
,case when day(dt)<=PackagesNumber%DaysInMonth then 1 else 0 end remainder
,PackagesNumber/DaysInMonth evenlyspread
,Country
,EntryMethod
,PaidAmount
,Version
from t1 a
inner join dates b
on a.MonthNo=b.mnthkey
I join the dates to the data table on the month, and for each day in the month I assign the evenly spread count (7 in our example); for the first days of the month (6 in our example) I add 1 as the remainder.
Now we have the info from your base table, multiplied by every day in the relevant months. We just need to make multiple rows per day, and for that we use the tally table again:
with
tally as
(
select row_number() over (order by n)-1 n from
(values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) n(n)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) m(m)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) l(m)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) k(m)
)
,t1 as
(
select
*
from
(values
(1 , 201701 , 223 , 'NO' , 'BCD' , 44803 , 2)
,(2 , 201701 , 61 , 'NO' , 'GHI' , 11934 , 2)
,(3 , 201701 , 929 , 'NO' , 'ABC' , 88714 , 2)
,(4 , 201701 , 470 , 'NO' , 'DEF' , 98404 , 2)
,(5 , 201702 , 223 , 'NO' , 'BCD' , 28225 , 2)
) t(subId , MonthNo , PackagesNumber , Country , EntryMethod , PaidAmount , Version)
)
,dates as
(
select dateadd(day,n,'20170101') as dt
,convert(varchar(10),dateadd(day,n,'20170101'),112)/100 mnthkey
,day(dateadd(day,-1,dateadd(month,1,cast(((convert(varchar(10),dateadd(day,n,'20170101'),112)/100)*100 +1) as varchar(10))))) DaysInMonth
from
tally
)
,forshow as
(
select
subId
,MonthNo
,dt
,PackagesNumber
,case when day(dt)<=PackagesNumber%DaysInMonth then 1 else 0 end remainder
,PackagesNumber/DaysInMonth evenlyspread
,Country
,EntryMethod
,(PaidAmount+0.0000)/(PackagesNumber*1.0000) PaidAmount
,Version
,PaidAmount TotalPaidAmount
from t1 a
inner join dates b
on a.MonthNo=b.mnthkey
)
select
subId
,dt [Date]
,Country
,1 Packages
,EntryMethod
,PaidAmount
,Version
-- the following rows are just for control
,remainder+evenlyspread totalday
,count(*) over (partition by subId,MonthNo,dt) calctotalday
,PackagesNumber
,count(*) over (partition by subId) calcPackagesNumber
,sum(PaidAmount)over (partition by subId) calcPaidAmount
,TotalPaidAmount
from forshow
inner join tally on n<(remainder+evenlyspread )
order by subId,MonthNo,dt
I join the tally table on the number of packages per day (evenlyspread+remainder) and get one row per package.
I've added some check columns to make sure I get 8 rows for each of the first 6 days, and 223 rows in total for our example.

Related

How to bucket data based on timestamps within a certain period or previous record?

I have some data that I'm trying to bucket. Let's say the data has a user and a timestamp. I want to define a session as any rows that have a timestamp within 10 minutes of the previous timestamp for the same user.
How would I go about this in SQL?
Example
+------+---------------------+---------+
| user | timestamp | session |
+------+---------------------+---------+
| 1 | 2021-05-09 15:12:52 | 1 |
| 1 | 2021-05-09 15:18:52 | 1 | within 10 min of previous timestamp
| 1 | 2021-05-09 15:32:52 | 2 | over 10 min, new session
| 2 | 2021-05-09 16:00:00 | 1 | different user
| 1 | 2021-05-09 17:00:00 | 3 | new session
| 1 | 2021-05-09 17:02:00 | 3 |
+------+---------------------+---------+
This will give me records within 10 minutes but how would I bucket them like above?
with cte as (
select user,
timestamp,
lag(timestamp) over (partition by user order by timestamp) as last_timestamp
from table
)
select *
from cte
where datediff(minute, last_timestamp, timestamp) <= 10
Try this one. It's basically an edge problem.
Working test case for SQL Server
The SQL:
with cte as (
select user1
, timestamp1
, session1 AS session_expected
, lag(timestamp1) over (partition by user1 order by timestamp1) as last_timestamp
, CASE WHEN datediff(n, lag(timestamp1) over (partition by user1 order by timestamp1), timestamp1) <= 10 THEN 0 ELSE 1 END AS edge
from table1
)
select *, SUM(edge) OVER (PARTITION BY user1 ORDER BY timestamp1) AS session_actual
from cte
ORDER BY timestamp1
;
Additional suggestion, see ROWS UNBOUNDED PRECEDING (thanks to @Charlieface):
with cte as (
select user1
, timestamp1
, session1 AS session_expected
, lag(timestamp1) over (partition by user1 order by timestamp1) as last_timestamp
, CASE WHEN datediff(n, lag(timestamp1) over (partition by user1 order by timestamp1), timestamp1) <= 10 THEN 0 ELSE 1 END AS edge
from table1
)
select *
, SUM(edge) OVER (PARTITION BY user1 ORDER BY timestamp1 ROWS UNBOUNDED PRECEDING) AS session_actual
from cte
ORDER BY timestamp1
;
Setup:
CREATE TABLE table1 (user1 int, timestamp1 datetime, session1 int);
INSERT INTO table1 VALUES
( 1 , '2021-05-09 15:12:52' , 1 )
, ( 1 , '2021-05-09 15:18:52' , 1 ) -- within 10 min of previous timestamp
, ( 1 , '2021-05-09 15:32:52' , 2 ) -- over 10 min, new session
, ( 2 , '2021-05-09 16:00:00' , 1 ) -- different user
, ( 1 , '2021-05-09 17:00:00' , 3 ) -- new session
, ( 1 , '2021-05-09 17:02:00' , 3 )
;
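The edge-flag-plus-running-sum idea from the queries above can be sketched in Python as follows. This is only an illustration: the function name `assign_sessions` is made up, and it compares elapsed time with a `timedelta`, whereas `datediff(n, ...)` in SQL Server counts crossed minute boundaries, so results could differ at the edges.

```python
from datetime import datetime, timedelta

def assign_sessions(rows, gap=timedelta(minutes=10)):
    """rows: iterable of (user, timestamp).
    Returns (user, timestamp, session) tuples ordered by user, timestamp."""
    last_seen = {}   # user -> previous timestamp (the LAG value)
    session_no = {}  # user -> running sum of edge flags
    result = []
    for user, ts in sorted(rows):
        prev = last_seen.get(user)
        # edge = 1 when a new session starts (first row, or gap too large)
        edge = 0 if prev is not None and ts - prev <= gap else 1
        session_no[user] = session_no.get(user, 0) + edge
        last_seen[user] = ts
        result.append((user, ts, session_no[user]))
    return result

sample = [
    (1, datetime(2021, 5, 9, 15, 12, 52)),
    (1, datetime(2021, 5, 9, 15, 18, 52)),
    (1, datetime(2021, 5, 9, 15, 32, 52)),
    (2, datetime(2021, 5, 9, 16, 0, 0)),
    (1, datetime(2021, 5, 9, 17, 0, 0)),
    (1, datetime(2021, 5, 9, 17, 2, 0)),
]
sessions = [(user, session) for user, _, session in assign_sessions(sample)]
print(sessions)  # [(1, 1), (1, 1), (1, 2), (1, 3), (1, 3), (2, 1)]
```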

How to return same row multiple times with multiple conditions

My knowledge is pretty basic so your help would be highly appreciated.
I'm trying to return the same row multiple times when it meets the condition (I only have access to select query).
I have a table of more than 500000 records with Customer ID, Start Date and End Date, where end date could be null.
I am trying to add a new column called Week_No and list all rows accordingly. For example if the date range is more than one week, then the row must be returned multiple times with corresponding week number. Also I would like to count overlapping days, which will never be more than 7 (week) per row and then count unavailable days using second table.
Sample data below
t1
ID | Start_Date | End_Date
000001 | 12/12/2017 | 03/01/2018
000002 | 13/01/2018 |
000003 | 02/01/2018 | 11/01/2018
...
t2
ID | Unavailable
000002 | 14/01/2018
000003 | 03/01/2018
000003 | 04/01/2018
000003 | 08/01/2018
...
I cannot pass the stage of adding week no. I have tried using CASE and UNION ALL but keep getting errors.
declare @week01start datetime = '2018-01-01 00:00:00'
declare @week01end datetime = '2018-01-07 00:00:00'
declare @week02start datetime = '2018-01-08 00:00:00'
declare @week02end datetime = '2018-01-14 00:00:00'
...
SELECT
ID,
'01' as Week_No,
'2018' as YEAR,
Start_Date,
End_Date
FROM t1
WHERE (Start_Date <= @week01end and End_Date >= @week01start)
or (Start_Date <= @week01end and End_Date is null)
UNION ALL
SELECT
ID,
'02' as Week_No,
'2018' as YEAR,
Start_Date,
End_Date
FROM t1
WHERE (Start_Date <= @week02end and End_Date >= @week02start)
or (Start_Date <= @week02end and End_Date is null)
...
The new table should look like this
ID | Week_No | Year | Start_Date | End_Date | Overlap | Unavail_Days
000001 | 01 | 2018 | 12/12/2017 | 03/01/2018 | 3 |
000002 | 02 | 2018 | 13/01/2018 | | 2 | 1
000003 | 01 | 2018 | 02/01/2018 | 11/01/2018 | 6 | 2
000003 | 02 | 2018 | 02/01/2018 | 11/01/2018 | 4 | 1
...
From a business point of view, I cannot understand what you are trying to achieve. You can use the following code, though, to calculate your overlapping days etc. I did it the way you asked, but I would recommend a separate table, like a Time dimension, to produce a "cleaner" solution.
/*sample data set in temp table*/
select '000001' as id, '2017-12-12' as start_dt, '2018-01-03' as end_dt into #tmp union
select '000002' as id, '2018-01-13' as start_dt, null as end_dt union
select '000003' as id, '2018-01-02' as start_dt, '2018-01-11' as end_dt
/*calculate week numbers and week diff according to dates*/
select *,
DATEPART(WK,start_dt) as start_weekNumber,
DATEPART(WK,end_dt) as end_weekNumber,
case
when DATEPART(WK,end_dt) - DATEPART(WK,start_dt) > 0 then (DATEPART(WK,end_dt) - DATEPART(WK,start_dt)) +1
else (52 - DATEPART(WK,start_dt)) + DATEPART(WK,end_dt)
end as WeekDiff
into #tmp1
from
(
SELECT *,DATEADD(DAY, 2 - DATEPART(WEEKDAY, start_dt), CAST(start_dt AS DATE)) [start_dt_Week_Start_Date],
DATEADD(DAY, 8 - DATEPART(WEEKDAY, start_dt), CAST(start_dt AS DATE)) [startdt_Week_End_Date],
DATEADD(DAY, 2 - DATEPART(WEEKDAY, end_dt), CAST(end_dt AS DATE)) [end_dt_Week_Start_Date],
DATEADD(DAY, 8 - DATEPART(WEEKDAY, end_dt), CAST(end_dt AS DATE)) [end_dt_Week_End_Date]
from #tmp
) s
/*cte used to create duplicates when week diff is over 1*/
;with x as
(
SELECT TOP (10) rn = ROW_NUMBER() --modify the max you want
OVER (ORDER BY [object_id])
FROM sys.all_columns
ORDER BY [object_id]
)
/*final query*/
select --*
ID,
start_weekNumber+ (r-1) as Week,
DATEPART(YY,start_dt) as [YEAR],
start_dt,
end_dt,
null as Overlap,
null as unavailable_days
from
(
select *,
ROW_NUMBER() over (partition by id order by id) r
from
(
select d.* from x
CROSS JOIN #tmp1 AS d
WHERE x.rn <= d.WeekDiff
union all
select * from #tmp1
where WeekDiff is null
) a
)a_ext
order by id,start_weekNumber
--drop table #tmp1,#tmp
The above will produce the results you want, except for the overlap and unavailable columns. Instead of just counting weeks, I used the week number within the year based on start_dt, but you can change that if you don't like it:
ID Week YEAR start_dt end_dt Overlap unavailable_days
000001 50 2017 2017-12-12 2018-01-03 NULL NULL
000001 51 2017 2017-12-12 2018-01-03 NULL NULL
000001 52 2017 2017-12-12 2018-01-03 NULL NULL
000002 2 2018 2018-01-13 NULL NULL NULL
000003 1 2018 2018-01-02 2018-01-11 NULL NULL
000003 2 2018 2018-01-02 2018-01-11 NULL NULL

Display data for all date ranges including missing dates

I'm having an issue with dates. I have a table with given from and to dates for an employee. For an evaluation, I'd like to display each date of the month with the corresponding values from the second SQL table.
SQL Table:
EmpNr | datefrom | dateto | hours
0815 | 01.01.2019 | 03.01.2019 | 15
0815 | 05.01.2019 | 15.01.2019 | 15
0815 | 20.01.2019 | 31.12.9999 | 40
The given employee (0815) worked during 01.01.-15.01. 15 hours, and during 20.01.-31.01. 40 hours
I'd like to have the following result:
0815 | 01.01.2019 | 15
0815 | 02.01.2019 | 15
0815 | 03.01.2019 | 15
0815 | 04.01.2019 | NULL
0815 | 05.01.2019 | 15
...
0815 | 15.01.2019 | 15
0815 | 16.01.2019 | NULL
0815 | 17.01.2019 | NULL
0815 | 18.01.2019 | NULL
0815 | 19.01.2019 | NULL
0815 | 20.01.2019 | 40
0815 | 21.01.2019 | 40
...
0815 | 31.01.2019 | 40
As for the dates, we have:
declare @year int = 2019, @month int = 1;
WITH numbers
as
(
Select 1 as value
UNion ALL
Select value + 1 from numbers
where value + 1 <= Day(EOMONTH(datefromparts(@year,@month,1)))
)
SELECT b.empnr, b.hours, datefromparts(@year,@month,numbers.value) Datum FROM numbers left outer join
emptbl b on b.empnr = '0815' and (datefromparts(@year,@month,numbers.value) >= b.datefrom and datefromparts(@year,@month,numbers.value) <= case b.dateto )
This works quite well, yet I have the odd issue that this code only shows the dates between 01.01.2019 and 03.01.2019.
Thank you very much in advance!
Did you check whether datefrom and dateto are within the valid range?
The minimum value of a DateTime field is 1753-01-01 and the maximum value is 9999-12-31.
Look at your source table to check the initial values.
The recursive CTE needs to begin with MIN(datefrom) and MAX(dateto):
DECLARE @t TABLE (empnr INT, datefrom DATE, dateto DATE, hours INT);
INSERT INTO @t VALUES
(815, '2019-01-01', '2019-01-03', 15),
(815, '2019-01-05', '2019-01-15', 15),
(815, '2019-01-20', '9999-01-01', 40),
-- another employee
(999, '2018-01-01', '2018-01-31', 15),
(999, '2018-03-01', '2018-03-31', 15),
(999, '2018-12-01', '9999-01-01', 40);
WITH rcte AS (
SELECT empnr
, MIN(datefrom) AS refdate
, ISNULL(NULLIF(MAX(dateto), '9999-01-01'), CURRENT_TIMESTAMP) AS maxdate -- clamp year 9999 to today
FROM @t
GROUP BY empnr
UNION ALL
SELECT empnr
, DATEADD(DAY, 1, refdate)
, maxdate
FROM rcte
WHERE refdate < maxdate
)
SELECT rcte.empnr
, rcte.refdate
, t.hours
FROM rcte
LEFT JOIN @t AS t ON rcte.empnr = t.empnr AND rcte.refdate BETWEEN t.datefrom AND t.dateto
ORDER BY rcte.empnr, rcte.refdate
OPTION (MAXRECURSION 1000) -- approx 3 years
Demo on db<>fiddle
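The pattern behind the query above (generate every calendar day first, then left-join the ranges so uncovered days come back as NULL) can be sketched in Python. This is an illustration only: the helper name `daily_hours` is hypothetical, and the open-ended range uses 31.12.9999 as in the question's sample data.

```python
from datetime import date, timedelta

def daily_hours(ranges, first_day, last_day):
    """For each day in [first_day, last_day], return (day, hours), where
    hours is taken from the range covering that day, or None if no range does."""
    out = []
    day = first_day
    while day <= last_day:
        hours = next((h for start, end, h in ranges if start <= day <= end), None)
        out.append((day, hours))
        day += timedelta(days=1)
    return out

ranges = [
    (date(2019, 1, 1), date(2019, 1, 3), 15),
    (date(2019, 1, 5), date(2019, 1, 15), 15),
    (date(2019, 1, 20), date(9999, 12, 31), 40),  # open-ended range
]
january = dict(daily_hours(ranges, date(2019, 1, 1), date(2019, 1, 31)))
print(january[date(2019, 1, 3)], january[date(2019, 1, 4)], january[date(2019, 1, 20)])
# 15 None 40
```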
It could be in your select, try:
SELECT b.empnr, b.hours, datefromparts(@year,@month,numbers.value) Datum
FROM numbers
LEFT OUTER JOIN emptbl b ON b.empnr = '0815' AND
datefromparts(@year,@month,numbers.value) BETWEEN b.datefrom AND b.dateto
Your CTE produces only 31 numbers, and therefore it shows only January dates.
declare @year int = 2019, @month int = 1;
WITH numbers
as
(
Select 1 as value
UNion ALL
Select value + 1 from numbers
where value + 1 <= Day(EOMONTH(datefromparts(@year,@month,1)))
)
SELECT *
FROM numbers
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=a24e58ef4ce522d3ec914f90907a0a9e
You can try below code,
with t0 (i) as (select 0 union all select 0 union all select 0),
t1 (i) as (select a.i from t0 a ,t0 b ),
t2 (i) as (select a.i from t1 a ,t1 b ),
t3 (srno) as (select row_number()over(order by a.i) from t2 a ,t2 b ),
tbldt(dt) as (select dateadd(day,t3.srno-1,'01/01/2019') from t3)
select tbldt.dt
from tbldt
where tbldt.dt <= b.dateto -- put your condition here
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=b16469908b323b8d1b98d77dd09bab3d

Postgresql - How to get value from last record of each month

I have a view like this:
Year | Month | Week | Category | Value |
2017 | 1 | 1 | A | 1
2017 | 1 | 1 | B | 2
2017 | 1 | 1 | C | 3
2017 | 1 | 2 | A | 4
2017 | 1 | 2 | B | 5
2017 | 1 | 2 | C | 6
2017 | 1 | 3 | A | 7
2017 | 1 | 3 | B | 8
2017 | 1 | 3 | C | 9
2017 | 1 | 4 | A | 10
2017 | 1 | 4 | B | 11
2017 | 1 | 4 | C | 12
2017 | 2 | 5 | A | 1
2017 | 2 | 5 | B | 2
2017 | 2 | 5 | C | 3
2017 | 2 | 6 | A | 4
2017 | 2 | 6 | B | 5
2017 | 2 | 6 | C | 6
2017 | 2 | 7 | A | 7
2017 | 2 | 7 | B | 8
2017 | 2 | 7 | C | 9
2017 | 2 | 8 | A | 10
2017 | 2 | 8 | B | 11
2017 | 2 | 8 | C | 12
And I need to make a new view which needs to show average of value column (let's call it avg_val) and the value from the max week of the month (max_val_of_month). Ex: max week of january is 4, so the value of category A is 10. Or something like this to be clear:
Year | Month | Category | avg_val | max_val_of_month
2017 | 1 | A | 5.5 | 10
2017 | 1 | B | 6.5 | 11
2017 | 1 | C | 7.5 | 12
2017 | 2 | A | 5.5 | 10
2017 | 2 | B | 6.5 | 11
2017 | 2 | C | 7.5 | 12
I have use window function, over partition by year, month, category to get the avg value. But how can I get the value of the max week of each month?
Assuming that you need a month average and a value for the max week not the max value per month
SELECT year, month, category, avg_val, value max_week_val
FROM (
SELECT *,
AVG(value) OVER (PARTITION BY year, month, category) avg_val,
ROW_NUMBER() OVER (PARTITION BY year, month, category ORDER BY week DESC) rn
FROM view1
) q
WHERE rn = 1
ORDER BY year, month, category
or more verbose version without window functions
SELECT q.year, q.month, q.category, q.avg_val, v.value max_week_val
FROM (
SELECT year, month, category, avg(value) avg_val, MAX(week) max_week
FROM view1
GROUP BY year, month, category
) q JOIN view1 v
ON q.year = v.year
AND q.month = v.month
AND q.category = v.category
AND q.max_week = v.week
ORDER BY year, month, category
Here is a dbfiddle demo for both queries
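As a cross-check of what both queries compute, here is a small Python sketch (illustrative only; `monthly_summary` is a made-up name). It groups by year, month and category, averages the values, and picks the value belonging to the highest week in each group.

```python
from collections import defaultdict

def monthly_summary(rows):
    """rows: (year, month, week, category, value) tuples.
    Returns {(year, month, category): (avg_val, value_of_max_week)}."""
    groups = defaultdict(list)
    for year, month, week, category, value in rows:
        groups[(year, month, category)].append((week, value))
    summary = {}
    for key, items in groups.items():
        avg_val = sum(value for _, value in items) / len(items)
        # the row with the highest week number supplies max_val_of_month
        _, max_week_val = max(items, key=lambda item: item[0])
        summary[key] = (avg_val, max_week_val)
    return summary

# January 2017 from the question's sample data
rows = [
    (2017, 1, 1, 'A', 1), (2017, 1, 1, 'B', 2), (2017, 1, 1, 'C', 3),
    (2017, 1, 2, 'A', 4), (2017, 1, 2, 'B', 5), (2017, 1, 2, 'C', 6),
    (2017, 1, 3, 'A', 7), (2017, 1, 3, 'B', 8), (2017, 1, 3, 'C', 9),
    (2017, 1, 4, 'A', 10), (2017, 1, 4, 'B', 11), (2017, 1, 4, 'C', 12),
]
summary = monthly_summary(rows)
print(summary[(2017, 1, 'A')])  # (5.5, 10)
```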
And here is my NEW version.
My thanks to @peterm for pointing out the earlier wrong value of val_from_max_week_of_month. So, I corrected it:
SELECT
a.Year,
a.Month,
a.Category,
max(a.Week) AS max_week,
AVG(a.Value) AS avg_val,
(
SELECT b.Value
FROM decades AS b
WHERE
b.Year = a.Year AND
b.Month = a.Month AND
b.Week = max(a.Week) AND
b.Category = a.Category
) AS val_from_max_week_of_month
FROM decades AS a
GROUP BY
a.Year,
a.Month,
a.Category
;
First, you might need to check how you handle the first week in January. If the 1st of January is not a Monday, there are several interpretations, and not every one of them will fit the solutions here. You'll need to use either:
the ISO week concept, i.e. the week column should hold the ISO week and the year column should hold the ISO year (week-year, rather). Note: in this concept, the 1st of January sometimes belongs to the previous year
or your own concept, where the first week of the year is "split" into two if the 1st of January is not a Monday.
Note: the solutions below will not work if (in your table) the first week of January can be 52 or 53.
avg_val is just a simple aggregation, while max_val_of_month can be calculated with typical greatest-n-per-group queries. There are a lot of possible solutions for that in PostgreSQL, with varying performance. Fortunately, your query naturally has an easily determined selectivity: you'll always need (approx.) a quarter of your data.
Usual winners (in performance) are:
(These are no surprise though, as these 2 should perform better and better as you need a larger portion of the original data.)
array_agg() with order by variant:
select year, month, category, avg(value) avg_val,
(array_agg(value order by week desc))[1] max_val_of_month
from table_name
group by year, month, category;
distinct on variant:
select distinct on (year, month, category) year, month, category,
avg(value) over (partition by year, month, category) avg_val,
value max_val_of_month
from table_name
order by year, month, category, week desc;
The pure window function variant is not that bad either:
row_number() variant:
select year, month, category, avg_val, max_val_of_month
from (select year, month, category, value max_val_of_month,
avg(value) over (partition by year, month, category) avg_val,
row_number() over (partition by year, month, category order by week desc) rn
from table_name) w
where rn = 1;
But the LATERAL variant is only viable with an index:
LATERAL variant:
create index idx_table_name_year_month_category_week_desc
on table_name(year, month, category, week desc);
select year, month, category,
avg(value) avg_val,
max_val_of_month
from table_name t
cross join lateral (select value max_val_of_month
from table_name
where (year, month, category) = (t.year, t.month, t.category)
order by week desc
limit 1) m
group by year, month, category, max_val_of_month;
But most of the solutions above can actually utilize this index, not just this last one.
Without the index: http://rextester.com/WNEL86809
With the index: http://rextester.com/TYUA52054
with data (yr, mnth, wk, cat, val) as
(
-- begin test data
select 2017 , 1 , 1 , 'A' , 1 from dual union all
select 2017 , 1 , 1 , 'B' , 2 from dual union all
select 2017 , 1 , 1 , 'C' , 3 from dual union all
select 2017 , 1 , 2 , 'A' , 4 from dual union all
select 2017 , 1 , 2 , 'B' , 5 from dual union all
select 2017 , 1 , 2 , 'C' , 6 from dual union all
select 2017 , 1 , 3 , 'A' , 7 from dual union all
select 2017 , 1 , 3 , 'B' , 8 from dual union all
select 2017 , 1 , 3 , 'C' , 9 from dual union all
select 2017 , 1 , 4 , 'A' , 10 from dual union all
select 2017 , 1 , 4 , 'B' , 11 from dual union all
select 2017 , 1 , 4 , 'C' , 12 from dual union all
select 2017 , 2 , 5 , 'A' , 1 from dual union all
select 2017 , 2 , 5 , 'B' , 2 from dual union all
select 2017 , 2 , 5 , 'C' , 3 from dual union all
select 2017 , 2 , 6 , 'A' , 4 from dual union all
select 2017 , 2 , 6 , 'B' , 5 from dual union all
select 2017 , 2 , 6 , 'C' , 6 from dual union all
select 2017 , 2 , 7 , 'A' , 7 from dual union all
select 2017 , 2 , 8 , 'A' , 10 from dual union all
select 2017 , 2 , 8 , 'B' , 11 from dual union all
select 2017 , 2 , 7 , 'B' , 8 from dual union all
select 2017 , 2 , 7 , 'C' , 9 from dual union all
select 2018 , 2 , 7 , 'C' , 9 from dual union all
select 2017 , 2 , 8 , 'C' , 12 from dual
-- end test data
)
select * from
(
select
-- data.*: all columns of the data table
data.*,
-- avrg: partition by a combination of year,month and category to work out -
-- the avg for each category in a month of a year
avg(val) over (partition by yr, mnth, cat) avrg,
-- mwk: partition by year and month to work out -
-- the max week of a month in a year
max(wk) over (partition by yr, mnth) mwk
from
data
)
-- as OP's interest is in the max week of each month of a year, -
-- "wk" column value is matched against
-- the derived column "mwk"
where wk = mwk
order by yr,mnth,cat;

Oracle SQL sum up values till another value is reached

I hope I can describe my challenge in an understandable way.
I have two tables on an Oracle Database 12c which look like this:
Table name "Invoices"
I_ID | invoice_number | creation_date | i_amount
------------------------------------------------------
1 | 10000000000 | 01.02.2016 00:00:00 | 30
2 | 10000000001 | 01.03.2016 00:00:00 | 25
3 | 10000000002 | 01.04.2016 00:00:00 | 13
4 | 10000000003 | 01.05.2016 00:00:00 | 18
5 | 10000000004 | 01.06.2016 00:00:00 | 12
Table name "payments"
P_ID | reference | received_date | p_amount
------------------------------------------------------
1 | PAYMENT01 | 12.02.2016 13:14:12 | 12
2 | PAYMENT02 | 12.02.2016 15:24:21 | 28
3 | PAYMENT03 | 08.03.2016 23:12:00 | 2
4 | PAYMENT04 | 23.03.2016 12:32:13 | 30
5 | PAYMENT05 | 12.06.2016 00:00:00 | 15
So I want to have a select statement (maybe with oracle analytic functions but I am not really familiar with it) where the payments are getting summed up till the amount of an invoice is reached, ordered by dates. If the sum of for example two payments is more than the invoice amount the rest of the last payment amount should be used for the next invoice.
In this example the result should be like this:
invoice_number | reference | used_pay_amount | open_inv_amount
----------------------------------------------------------
10000000000 | PAYMENT01 | 12 | 18
10000000000 | PAYMENT02 | 18 | 0
10000000001 | PAYMENT02 | 10 | 15
10000000001 | PAYMENT03 | 2 | 13
10000000001 | PAYMENT04 | 13 | 0
10000000002 | PAYMENT04 | 13 | 0
10000000003 | PAYMENT04 | 4 | 14
10000000003 | PAYMENT05 | 14 | 0
10000000004 | PAYMENT05 | 1 | 11
It would be nice if there is a solution with a "simple" select statement.
thx in advance for your time ...
Oracle Setup:
CREATE TABLE invoices ( i_id, invoice_number, creation_date, i_amount ) AS
SELECT 1, 100000000, DATE '2016-01-01', 30 FROM DUAL UNION ALL
SELECT 2, 100000001, DATE '2016-02-01', 25 FROM DUAL UNION ALL
SELECT 3, 100000002, DATE '2016-03-01', 13 FROM DUAL UNION ALL
SELECT 4, 100000003, DATE '2016-04-01', 18 FROM DUAL UNION ALL
SELECT 5, 100000004, DATE '2016-05-01', 12 FROM DUAL;
CREATE TABLE payments ( p_id, reference, received_date, p_amount ) AS
SELECT 1, 'PAYMENT01', DATE '2016-01-12', 12 FROM DUAL UNION ALL
SELECT 2, 'PAYMENT02', DATE '2016-01-13', 28 FROM DUAL UNION ALL
SELECT 3, 'PAYMENT03', DATE '2016-02-08', 2 FROM DUAL UNION ALL
SELECT 4, 'PAYMENT04', DATE '2016-02-23', 30 FROM DUAL UNION ALL
SELECT 5, 'PAYMENT05', DATE '2016-05-12', 15 FROM DUAL;
Query:
WITH total_invoices ( i_id, invoice_number, creation_date, i_amount, i_total ) AS (
SELECT i.*,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id )
FROM invoices i
),
total_payments ( p_id, reference, received_date, p_amount, p_total ) AS (
SELECT p.*,
SUM( p_amount ) OVER ( ORDER BY received_date, p_id )
FROM payments p
)
SELECT invoice_number,
reference,
LEAST( p_total, i_total )
- GREATEST( p_total - p_amount, i_total - i_amount ) AS used_pay_amount,
GREATEST( i_total - p_total, 0 ) AS open_inv_amount
FROM total_invoices
INNER JOIN
total_payments
ON ( i_total - i_amount < p_total
AND i_total > p_total - p_amount );
Explanation:
The two sub-query factoring (WITH ... AS ()) clauses just add an extra virtual column to the invoices and payments tables with the cumulative sum of the invoice/payment amount.
You can associate a range with each invoice (or payment) as the cumulative amount owing (paid) before the invoice (payment) was placed and the cumulative amount owing (paid) after. The two tables can then be joined where there is an overlap of these ranges.
The open_inv_amount is the positive difference between the cumulative amount invoiced and the cumulative amount paid.
The used_pay_amount is slightly more complicated but you need to find the difference between the lower of the current cumulative invoice and payment totals and the higher of the previous cumulative invoice and payment totals.
Output:
INVOICE_NUMBER REFERENCE USED_PAY_AMOUNT OPEN_INV_AMOUNT
-------------- --------- --------------- ---------------
100000000 PAYMENT01 12 18
100000000 PAYMENT02 18 0
100000001 PAYMENT02 10 15
100000001 PAYMENT03 2 13
100000001 PAYMENT04 13 0
100000002 PAYMENT04 13 0
100000003 PAYMENT04 4 14
100000003 PAYMENT05 14 0
100000004 PAYMENT05 1 11
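The range-overlap calculation described in the explanation can be reproduced with a short Python sketch (an illustration, not the answer's code; `allocate` is a hypothetical name). It builds the cumulative totals, pairs every invoice with every payment whose cumulative range overlaps, and applies the same LEAST/GREATEST arithmetic as the query.

```python
from itertools import accumulate

def allocate(invoices, payments):
    """invoices: [(invoice_number, amount)], payments: [(reference, amount)],
    both already in date order. Returns
    (invoice_number, reference, used_pay_amount, open_inv_amount) rows."""
    inv_totals = list(accumulate(amount for _, amount in invoices))
    pay_totals = list(accumulate(amount for _, amount in payments))
    rows = []
    for (inv_no, i_amt), i_total in zip(invoices, inv_totals):
        for (ref, p_amt), p_total in zip(payments, pay_totals):
            # the cumulative ranges (total - amount, total] must overlap
            if i_total - i_amt < p_total and i_total > p_total - p_amt:
                used = min(p_total, i_total) - max(p_total - p_amt, i_total - i_amt)
                open_amt = max(i_total - p_total, 0)
                rows.append((inv_no, ref, used, open_amt))
    return rows

invoices = [(100000000, 30), (100000001, 25), (100000002, 13),
            (100000003, 18), (100000004, 12)]
payments = [('PAYMENT01', 12), ('PAYMENT02', 28), ('PAYMENT03', 2),
            ('PAYMENT04', 30), ('PAYMENT05', 15)]
result = allocate(invoices, payments)
for row in result:
    print(row)
```

Like the SQL join, this nested loop compares every invoice with every payment, so it is meant for checking the logic on small inputs rather than for large tables.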
Update:
Based on mathguy's method of using UNION to join the data, I came up with a different solution re-using some of my code.
WITH combined ( invoice_number, reference, i_amt, i_total, p_amt, p_total, total ) AS (
SELECT invoice_number,
NULL,
i_amount,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id ),
NULL,
NULL,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id )
FROM invoices
UNION ALL
SELECT NULL,
reference,
NULL,
NULL,
p_amount,
SUM( p_amount ) OVER ( ORDER BY received_date, p_id ),
SUM( p_amount ) OVER ( ORDER BY received_date, p_id )
FROM payments
ORDER BY 7,
2 NULLS LAST,
1 NULLS LAST
),
filled ( invoice_number, reference, i_prev, i_total, p_prev, p_total ) AS (
SELECT FIRST_VALUE( invoice_number ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( reference ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( i_total - i_amt ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( i_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( p_total - p_amt ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
COALESCE(
p_total,
LEAD( p_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM ),
LAG( p_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM )
)
FROM combined
),
vals ( invoice_number, reference, upa, oia, prev_invoice ) AS (
SELECT invoice_number,
reference,
COALESCE( LEAST( p_total, i_total ) - GREATEST( p_prev, i_prev ), 0 ),
GREATEST( i_total - p_total, 0 ),
LAG( invoice_number ) OVER ( ORDER BY ROWNUM )
FROM filled
)
SELECT invoice_number,
reference,
upa AS used_pay_amount,
oia AS open_inv_amount
FROM vals
WHERE upa > 0
OR ( reference IS NULL AND invoice_number <> prev_invoice AND oia > 0 );
Explanation:
The combined sub-query factoring clause joins the two tables with a UNION ALL and generates the cumulative totals for the amounts invoiced and paid. The final thing it does is order the rows by their ascending cumulative total (and if there are ties it will put the payments, in order created, before the invoices).
The filled sub-query factoring clause will fill the previously generated table so that if a value is null then it will take the value from the next non-null row (and if there is an invoice with no payments then it will find the total of the previous payments from the preceding rows).
The vals sub-query factoring clause applies the same calculations as my previous query (see above). It also adds the prev_invoice column to help identify invoices which are entirely unpaid.
The final SELECT takes the values and filters out the unnecessary rows.
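As a rough cross-check of the arithmetic in vals, the amount of a payment used on an invoice is the overlap between the invoice's cumulative range (i_prev to i_total) and the payment's cumulative range (p_prev to p_total). The Python sketch below is not part of the answer's query: it uses a plain nested loop instead of window functions, the function name allocate is hypothetical, and the sample amounts are borrowed from the setup CTEs of the union-based answer further down. It also does not emit the extra row for entirely unpaid invoices that the final WHERE clause handles.

```python
from itertools import accumulate

# Sample data taken from the second answer's setup CTEs: (id, amount) pairs,
# already in the order in which they should be allocated.
INVOICES = [(1, 30), (2, 25), (3, 13), (4, 18), (5, 12)]
PAYMENTS = [(101, 12), (103, 28), (105, 2), (107, 30), (109, 15)]


def allocate(invoices, payments):
    """Return (invoice, payment, used_pay_amount, open_inv_amount) rows."""
    i_totals = list(accumulate(amt for _, amt in invoices))
    p_totals = list(accumulate(amt for _, amt in payments))
    rows = []
    for (inv, i_amt), i_total in zip(invoices, i_totals):
        i_prev = i_total - i_amt  # cumulative total before this invoice
        for (ref, p_amt), p_total in zip(payments, p_totals):
            p_prev = p_total - p_amt  # cumulative total before this payment
            # Overlap of the two cumulative ranges, as in
            # LEAST(p_total, i_total) - GREATEST(p_prev, i_prev)
            used = min(p_total, i_total) - max(p_prev, i_prev)
            if used > 0:
                # Open amount as in GREATEST(i_total - p_total, 0)
                rows.append((inv, ref, used, max(i_total - p_total, 0)))
    return rows


for row in allocate(INVOICES, PAYMENTS):
    print(row)
```

The nested loop compares every invoice with every payment, so it has the same quadratic cost as a join; it is only meant to make the overlap formula easy to check by hand.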
Here is a solution that doesn't require a join, which matters when the amount of data is significant. I did some testing on my laptop (nothing commercial), using the free edition (XE) of Oracle 11.2. With MT0's solution, the query with the join took about 11 seconds for 10k invoices and 10k payments. For 50k invoices and 50k payments, it took 287 seconds (almost 5 minutes). This is understandable, since joining two 50k-row tables requires up to 2.5 billion (50,000 × 50,000) comparisons.
The alternative below uses a union. It uses lag() and last_value() to do the work the join does in the other solution. This union-based solution, with 50k invoices and 50k payments, took less than 0.5 seconds on my laptop.
I simplified the setup a bit: i_id, invoice_number and creation_date are all used for one purpose only, to order the invoice amounts. I use just an inv_id (invoice id) for that purpose, and similarly a pmt_id for payments.
For testing purposes, I created tables invoices and payments like so:
create table invoices (inv_id, inv_amt) as
(select level, trunc(dbms_random.value(20, 80)) from dual connect by level <= 50000);
create table payments (pmt_id, pmt_amt) as
(select level, trunc(dbms_random.value(20, 80)) from dual connect by level <= 50000);
Then, to test the solutions, I use the queries to populate a CTAS, like this:
create table bal_of_pmts as
[select query, including the WITH clause but without the setup CTEs, comes here]
In my solution, I show both the allocation of each payment to one or more invoices and the payment of each invoice from one or more payments. The output discussed in the original post only covers half of this information, but for symmetry it makes more sense to me to show both halves. The output (for the same inputs as in the original post) looks like this, with my version of inv_id and pmt_id:
    INV_ID       PAID     UNPAID     PMT_ID       USED  AVAILABLE
---------- ---------- ---------- ---------- ---------- ----------
         1         12         18        101         12          0
         1         18          0        103         18         10
         2         10         15        103         10          0
         2          2         13        105          2          0
         2         13          0        107         13         17
         3         13          0        107         13          4
         4          4         14        107          4          0
         4         14          0        109         14          1
         5          1         11        109          1          0
         5         11          0                    11
Notice how the left half is what the original post requested. There is an extra row at the end. Notice the NULL for payment id, for a payment of 11: that shows how much of the last invoice is still uncovered. If there were an invoice with id = 6, for an amount of, say, 22, then there would be one more row, showing the entire amount (22) of that invoice as "paid" from a payment with no id, meaning it is actually not covered (yet).
The query may be a little easier to understand than the join approach. To see what it does, it may help to look closely at intermediate results, especially the CTE c (in the WITH clause).
with invoices (inv_id, inv_amt) as (
select 1, 30 from dual union all
select 2, 25 from dual union all
select 3, 13 from dual union all
select 4, 18 from dual union all
select 5, 12 from dual
),
payments (pmt_id, pmt_amt) as (
select 101, 12 from dual union all
select 103, 28 from dual union all
select 105, 2 from dual union all
select 107, 30 from dual union all
select 109, 15 from dual
),
c (kind, inv_id, inv_cml, pmt_id, pmt_cml, cml_amt) as (
select 'i', inv_id, sum(inv_amt) over (order by inv_id), null, null,
sum(inv_amt) over (order by inv_id)
from invoices
union all
select 'p', null, null, pmt_id, sum(pmt_amt) over (order by pmt_id),
sum(pmt_amt) over (order by pmt_id)
from payments
),
d (inv_id, paid, unpaid, pmt_id, used, available) as (
select last_value(inv_id) ignore nulls over (order by cml_amt desc),
cml_amt - lead(cml_amt, 1, 0) over (order by cml_amt desc),
case kind when 'i' then 0
else last_value(inv_cml) ignore nulls
over (order by cml_amt desc) - cml_amt end,
last_value(pmt_id) ignore nulls over (order by cml_amt desc),
cml_amt - lead(cml_amt, 1, 0) over (order by cml_amt desc),
case kind when 'p' then 0
else last_value(pmt_cml) ignore nulls
over (order by cml_amt desc) - cml_amt end
from c
)
select inv_id, paid, unpaid, pmt_id, used, available
from d
where paid != 0
order by inv_id, pmt_id
;
In most cases, CTE d is all we need. However, if the cumulative sum for several invoices is exactly equal to the cumulative sum for several payments, my query would add a row with paid = unpaid = 0. (MT0's join solution does not have this problem.) To cover all possible cases, and not have rows with no information, I had to add the filter for paid != 0.
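To see why the union approach needs only one pass, and why ties produce a row that must be filtered out, here is a minimal Python sketch of the same idea. It is an illustration, not a translation of the query: the function name allocate_by_union is hypothetical, a running "last seen" invoice and payment stand in for last_value() ... ignore nulls, and the next lower cumulative total stands in for lead(). It assumes the ids increase in allocation order, as they do in the sample data.

```python
from itertools import accumulate

# Same sample data as the setup CTEs in the query above.
INVOICES = [(1, 30), (2, 25), (3, 13), (4, 18), (5, 12)]
PAYMENTS = [(101, 12), (103, 28), (105, 2), (107, 30), (109, 15)]


def allocate_by_union(invoices, payments):
    """Return (inv_id, paid, unpaid, pmt_id, used, available) rows."""
    # Equivalent of CTE c: one list holding both cumulative series.
    c = [("i", inv_id, cml) for (inv_id, _), cml
         in zip(invoices, accumulate(amt for _, amt in invoices))]
    c += [("p", pmt_id, cml) for (pmt_id, _), cml
          in zip(payments, accumulate(amt for _, amt in payments))]
    c.sort(key=lambda row: row[2], reverse=True)  # order by cml_amt desc

    rows = []
    cur_inv = cur_inv_cml = cur_pmt = cur_pmt_cml = None
    for pos, (kind, ident, cml) in enumerate(c):
        # Remember the most recent invoice / payment seen so far.
        if kind == "i":
            cur_inv, cur_inv_cml = ident, cml
        else:
            cur_pmt, cur_pmt_cml = ident, cml
        next_cml = c[pos + 1][2] if pos + 1 < len(c) else 0
        paid = cml - next_cml  # width of this slice of the running total
        if paid == 0:
            continue  # tie between an invoice total and a payment total
        unpaid = None if cur_inv_cml is None else cur_inv_cml - cml
        available = None if cur_pmt_cml is None else cur_pmt_cml - cml
        rows.append((cur_inv, paid, unpaid, cur_pmt, paid, available))

    rows.reverse()  # ascending cumulative total
    return rows


for row in allocate_by_union(INVOICES, PAYMENTS):
    print(row)
```

The work is dominated by a single sort of the combined list, which is consistent with the large speed difference reported above. The `paid == 0` check plays the role of the `where paid != 0` filter: when an invoice total and a payment total coincide, one of the two tied rows covers a zero-width slice and carries no information.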