Get percentiles of data-set with group by month - SQL

I have a SQL table with a whole load of records that look like this:
| Date | Score |
+ -----------+-------+
| 01/01/2010 | 4 |
| 02/01/2010 | 6 |
| 03/01/2010 | 10 |
...
| 16/03/2010 | 2 |
I'm plotting this on a chart, so I get a nice line across the graph indicating score-over-time. Lovely.
Now, what I need to do is include the average score on the chart, so we can see how that changes over time. I can simply add this to the mix:
SELECT
YEAR(SCOREDATE) 'Year', MONTH(SCOREDATE) 'Month',
MIN(SCORE) MinScore,
AVG(SCORE) AverageScore,
MAX(SCORE) MaxScore
FROM SCORES
GROUP BY YEAR(SCOREDATE), MONTH(SCOREDATE)
ORDER BY YEAR(SCOREDATE), MONTH(SCOREDATE)
That's no problem so far.
The problem is, how can I easily calculate the percentiles at each time-period? I'm not sure that's the correct phrase. What I need in total is:
A line on the chart for the score (easy)
A line on the chart for the average (easy)
A line on the chart showing the band that 95% of the scores occupy (stumped)
It's the third one that I don't get. I need to calculate the 5% percentile figures, which I can do singly:
SELECT MAX(SubQ.SCORE) FROM
(SELECT TOP 45 PERCENT SCORE
FROM SCORES
WHERE YEAR(SCOREDATE) = 2010 AND MONTH(SCOREDATE) = 1
ORDER BY SCORE ASC) AS SubQ
SELECT MIN(SubQ.SCORE) FROM
(SELECT TOP 45 PERCENT SCORE
FROM SCORES
WHERE YEAR(SCOREDATE) = 2010 AND MONTH(SCOREDATE) = 1
ORDER BY SCORE DESC) AS SubQ
But I can't work out how to get a table of all the months.
| Date | Average | 45% | 55% |
+ -----------+---------+-----+-----+
| 01/01/2010 | 13 | 11 | 15 |
| 02/01/2010 | 10 | 8 | 12 |
| 03/01/2010 | 5 | 4 | 10 |
...
| 16/03/2010 | 7 | 7 | 9 |
At the moment I'm going to have to load this lot up into my app, and calculate the figures myself. Or run a larger number of individual queries and collate the results.

Whew. This was a real brain teaser. First, my table schema for testing was:
Create Table Scores
(
Id int not null identity(1,1) primary key clustered
, [Date] datetime not null
, Score int not null
)
Now, first, I calculated the values using a CTE in SQL 2008 in order to check my answers and then I built a solution that should work in SQL 2000. So, in SQL 2008 we do something like:
;With
SummaryStatistics As
(
Select Year([Date]) As YearNum
, Month([Date]) As MonthNum
, Min(Score) As MinScore
, Max(Score) As MaxScore
, Avg(Score) As AvgScore
From Scores
Group By Month([Date]), Year([Date])
)
, Percentiles As
(
Select Year([Date]) As YearNum
, Month([Date]) As MonthNum
, Score
, NTile( 100 ) Over ( Partition By Month([Date]), Year([Date]) Order By Score ) As Percentile
From Scores
)
, ReportedPercentiles As
(
Select YearNum, MonthNum
, Min(Case When Percentile = 45 Then Score End) As Percentile45
, Min(Case When Percentile = 55 Then Score End) As Percentile55
From Percentiles
Where Percentile In(45,55)
Group By YearNum, MonthNum
)
Select SS.YearNum, SS.MonthNum
, SS.MinScore, SS.MaxScore, SS.AvgScore
, RP.Percentile45, RP.Percentile55
From SummaryStatistics As SS
Join ReportedPercentiles As RP
On RP.YearNum = SS.YearNum
And RP.MonthNum = SS.MonthNum
Order By SS.YearNum, SS.MonthNum
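One caveat worth noting (my note, not part of the original answer): NTile(100) only produces bucket numbers up to the row count, so a month with fewer than 55 scores gets a NULL Percentile55, and a month with fewer than 45 drops out of the join entirely. On SQL Server 2012 or later, PERCENTILE_CONT interpolates and avoids this; a sketch against the same Scores table:
Select Distinct
Year([Date]) As YearNum
, Month([Date]) As MonthNum
, Percentile_Cont(0.45) Within Group (Order By Score)
Over (Partition By Year([Date]), Month([Date])) As Percentile45
, Percentile_Cont(0.55) Within Group (Order By Score)
Over (Partition By Year([Date]), Month([Date])) As Percentile55
From Scores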
Now for a SQL 2000 solution. In essence, the trick is to use a couple of temporary tables to tally the occurrences of the scores.
If object_id('tempdb..#Working') is not null
DROP TABLE #Working
GO
Create Table #Working
(
YearNum int not null
, MonthNum int not null
, Score int not null
, Occurances int not null
, Constraint PK_#Working Primary Key Clustered ( MonthNum, YearNum, Score )
)
GO
Insert #Working(MonthNum, YearNum, Score, Occurances)
Select Month([Date]), Year([Date]), Score, Count(*)
From Scores
Group By Month([Date]), Year([Date]), Score
GO
If object_id('tempdb..#SummaryStatistics') is not null
DROP TABLE #SummaryStatistics
GO
Create Table #SummaryStatistics
(
MonthNum int not null
, YearNum int not null
, Score int not null
, Occurances int not null
, CumulativeTotal int not null
, Percentile float null
, Constraint PK_#SummaryStatistics Primary Key Clustered ( MonthNum, YearNum, Score )
)
GO
Insert #SummaryStatistics(YearNum, MonthNum, Score, Occurances, CumulativeTotal)
Select W2.YearNum, W2.MonthNum, W2.Score, W2.Occurances, Sum(W1.Occurances)-W2.Occurances
From #Working As W1
Join #Working As W2
On W2.YearNum = W1.YearNum
And W2.MonthNum = W1.MonthNum
Where W1.Score <= W2.Score
Group By W2.YearNum, W2.MonthNum, W2.Score, W2.Occurances
Update #SummaryStatistics
Set Percentile = SS.CumulativeTotal * 100.0 / MonthTotal.Total
From #SummaryStatistics As SS
Join (
Select SS1.YearNum, SS1.MonthNum, Max(SS1.CumulativeTotal) As Total
From #SummaryStatistics As SS1
Group By SS1.YearNum, SS1.MonthNum
) As MonthTotal
On MonthTotal.YearNum = SS.YearNum
And MonthTotal.MonthNum = SS.MonthNum
Select GeneralStats.*, Percentiles.Percentile45, Percentiles.Percentile55
From (
Select Year(S1.[Date]) As YearNum
, Month(S1.[Date]) As MonthNum
, Min(S1.Score) As MinScore
, Max(S1.Score) As MaxScore
, Avg(S1.Score) As AvgScore
From Scores As S1
Group By Month(S1.[Date]), Year(S1.[Date])
) As GeneralStats
Join (
Select SS1.YearNum, SS1.MonthNum
, Min(Case When SS1.Percentile >= 45 Then Score End) As Percentile45
, Min(Case When SS1.Percentile >= 55 Then Score End) As Percentile55
From #SummaryStatistics As SS1
Group By SS1.YearNum, SS1.MonthNum
) As Percentiles
On Percentiles.YearNum = GeneralStats.YearNum
And Percentiles.MonthNum = GeneralStats.MonthNum

Without the data, I'm not sure if I'm doing this right, but maybe this will help get you there with two queries per year instead of 24...
SELECT MAX(SubQ.SCORE), MyMonth FROM
(SELECT TOP 45 PERCENT SCORE , MONTH(SCOREDATE) as MyMonth
FROM SCORES
WHERE YEAR(SCOREDATE) = 2010
ORDER BY SCORE ASC) AS SubQ
group by MyMonth

Related

Rolling total in SQL that Resets to 0 when going above 90

First time post. Learning SQL over the past 6 months so help is appreciated. I have data structured as below:
DECLARE @tmp4 as TABLE (
AccountNumber int,
Date date,
DateRank int
)
INSERT INTO @tmp4
VALUES (001, '11/13/2018' , 1)
, (002, '12/19/2018', 2)
, (003, '1/23/2019' , 3)
, (004, '2/5/2019' , 4)
, (005, '3/10/2019' , 5)
, (006, '3/20/2019' , 6)
, (007, '4/8/2019' , 7)
, (008, '5/20/2019' , 8)
What I need to do with this data is calculate a rolling total that resets to 0 once a threshold of 90 days is reached. I have used the DateDiff function to calculate the DateDiffs between consecutive dates and have tried multiple things using LAG and other window functions, but I can't make it reset. The goal is to find "index visits", which can only occur once every 90 days. So my plan is to have a field that reads 0 on the first visit, resets to 0 for the next stay once 90 days have passed since the first visit, and then to pull only visits with a value of 0.
One solution I tried was correct for most sets but did not return the right values for the above set (rows 4 and 8 should start over as "index visits").
The results I would expect for this query would be:
Account | Date | DateRank | RollingTotal
001 |'11/13/2018' | 1 | 0
002 |'12/19/2018' | 2 | 35
003 |'1/23/2019' | 3 | 71
004 |'2/5/2019' | 4 | 84
005 |'3/10/2019' | 5 | 0 (not 117)
006 |'3/20/2019' | 6 | 10
007 |'4/8/2019' | 7 | 29
008 |'5/20/2019' | 8 | 71
Thanks for any help.
Here's the code I tried:
DECLARE @tmp2 as TABLE
(EmrNumber varchar(255)
, AdmitDateTime datetime
, DateRank int
, LagDateDiff int
, RunningTotal int
)
INSERT INTO @tmp2
SELECT tmp1.EmrNumber
, tmp1.AdmitDateTime
, tmp1.DateRank
--, LAG(tmp1.AdmitDateTime) OVER(PARTITION BY tmp1.EmrNumber ORDER BY tmp1.DateRank) as NextAdmitDate
, -DATEDIFF(DAY, tmp1.AdmitDateTime, LAG(tmp1.AdmitDateTime) OVER(PARTITION BY tmp1.EmrNumber ORDER BY tmp1.DateRank)) LagDateDiff
, IIF((SELECT SUM(sumt.total)
FROM (
SELECT -DATEDIFF(DAY, tmpsum.AdmitDateTime, LAG(tmpsum.AdmitDateTime) OVER(PARTITION BY tmpsum.EmrNumber ORDER BY tmpsum.DateRank)) total
FROM @tmp tmpsum
WHERE tmp1.EmrNumber = tmpsum.EmrNumber
AND tmpsum.AdmitDateTime <= tmp1.AdmitDateTime
) sumt) IS NULL, 0, (SELECT SUM(sumt.total)
FROM (
SELECT -DATEDIFF(DAY, tmpsum.AdmitDateTime, LAG(tmpsum.AdmitDateTime) OVER(PARTITION BY tmpsum.EmrNumber ORDER BY tmpsum.DateRank)) total
FROM @tmp tmpsum
WHERE tmp1.EmrNumber = tmpsum.EmrNumber
AND tmpsum.AdmitDateTime <= tmp1.AdmitDateTime
) sumt) ) as RunningTotal
FROM @tmp tmp1
SELECT *
, CASE WHEN LagDateDiff >90 THEN 0
WHEN RunningTotal = 0 THEN 0
ELSE LAG(LagDateDiff) OVER(PARTITION BY EmrNumber ORDER BY DateRank) + RunningTotal END AS RollingTotal
FROM @tmp2
You need a recursive query for this, because the running total has to be checked iteratively, row after row:
with cte as (
select
Account,
Date,
DateRank,
0 RollingTotal
from @tmp4
where DateRank = 1
union all
select
t.Account,
t.Date,
t.DateRank,
case when RollingTotal + datediff(day, c.Date, t.Date) > 90
then 0
else RollingTotal + datediff(day, c.Date, t.Date)
end
from cte c
inner join @tmp4 t on t.DateRank = c.DateRank + 1
)
select * from cte
The anchor of the cte selects the first record (as indicated by DateRank). Then the recursive part processes rows one by one, and resets the running total when it crosses 90.
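Since the stated goal was to pull only the index visits, a possible follow-up (my sketch, not part of the answer; it replaces the final select * from cte):
select Account, Date, DateRank
from cte
where RollingTotal = 0 -- 0 marks an index visit (rows 1 and 5 in the sample data)
option (maxrecursion 0); -- assumption: lift the default 100-level recursion cap for longer histories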

T-SQL calculate the percent increase or decrease between the earliest and latest for each project

I have a table like the one below. I am trying to run a T-SQL query that gets the earliest and latest cost for each project_id according to the date column, calculates the percent cost increase or decrease, and returns the data-set shown in the second table (I have simplified the table in this question).
project_id date cost
-------------------------------
123 7/1/17 5000
123 8/1/17 6000
123 9/1/17 7000
123 10/1/17 8000
123 11/1/17 9000
456 7/1/17 10000
456 8/1/17 9000
456 9/1/17 8000
876 1/1/17 8000
876 6/1/17 5000
876 8/1/17 10000
876 11/1/17 8000
Result:
(Edit: Fixed the result)
project_id "cost incr/decr pct"
------------------------------------------------
123 80% which is (9000-5000)/5000
456 -20%
876 0%
Whatever query I run I get duplicates.
This is what I tried:
select distinct
p1.Proj_ID, p1.date, p2.[cost], p3.cost,
(nullif(p2.cost, 0) / nullif(p1.cost, 0)) * 100 as 'OVER UNDER'
from
[PROJECT] p1
inner join
(select
[Proj_ID], [cost], min([date]) min_date
from
[PROJECT]
group by
[Proj_ID], [cost]) p2 on p1.Proj_ID = p2.Proj_ID
inner join
(select
[Proj_ID], [cost], max([date]) max_date
from
[PROJECT]
group by
[Proj_ID], [cost]) p3 on p1.Proj_ID = p3.Proj_ID
where
p1.date in (p2.min_date, p3.max_date)
Unfortunately, SQL Server does not have a first_value() aggregation function. It does have an analytic function, though. So, you can do:
select distinct project_id,
first_value(cost) over (partition by project_id order by date asc) as first_cost,
first_value(cost) over (partition by project_id order by date desc) as last_cost,
(first_value(cost) over (partition by project_id order by date desc) /
first_value(cost) over (partition by project_id order by date asc)
) - 1 as ratio
from project;
If cost is an integer, you may need to convert to a representation with decimal places.
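For instance, a sketch of that cast (my addition, reusing the query above):
select distinct project_id,
(first_value(cast(cost as decimal(18, 2))) over (partition by project_id order by date desc) /
first_value(cast(cost as decimal(18, 2))) over (partition by project_id order by date asc)
) - 1 as ratio
from project;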
Prior to SQL Server 2012, you can use row_number and OUTER APPLY with top 1:
select
min_.projectid,
latest_.cost - min_.cost [Calculation]
from
(select
row_number() over (partition by projectid order by date) rn
,projectid
,cost
from projectable) min_ -- get the first dates per project
outer apply (
select
top 1
cost
from projectable
where
projectid = min_.projectid -- get the latest cost for each project
order by date desc
) latest_
where min_.rn = 1
This might perform a little better
;with costs as (
select *,
ROW_NUMBER() over (PARTITION BY project_id ORDER BY date) mincost,
ROW_NUMBER() over (PARTITION BY project_id ORDER BY date desc) maxcost
from table1
)
select project_id,
min(case when mincost = 1 then cost end) as cost1,
max(case when maxcost = 1 then cost end) as cost2,
(max(case when maxcost = 1 then cost end) - min(case when mincost = 1 then cost end)) * 100 / min(case when mincost = 1 then cost end) as [OVER UNDER]
from costs a
group by project_id

Distribute rows evenly by days

I have a table where I put, let's call them, manual values that are used later in my code. The table looks like this:
subId | MonthNo | PackagesNumber | Country | EntryMethod | PaidAmount | Version
1 | 201701 | 223 | NO | BCD | 44803 | 2
2 | 201701 | 61 | NO | GHI | 11934 | 2
3 | 201701 | 929 | NO | ABC | 88714 | 2
4 | 201701 | 470 | NO | DEF | 98404 | 2
5 | 201702 | 223 | NO | BCD | 28225 | 2
All I have to do is divide those values into single rows, at the level of a single package. For example, there are 223 packages in January 2017 in Country NO with EntryMethod BCD, so I want 223 separate rows. PaidAmount should also be divided by PackagesNumber.
The problem is I have to associate a date with every record. Records should be distributed evenly through the whole month. I have a Date dimension that I can intersect with my table by pulling month and year separately from MonthNo.
For example, in January 2017 with EntryMethod BCD I have 223 packages, so it's ~7 packages per day.
That's what I want:
subId | Date | Country | Packages | EntryMethod | PaidAmount | Version
1 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
2 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
3 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
4 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
5 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
6 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
7 | 01.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
8 | 02.01.2017 | NO | 1 | BCD | 200.910313901345 | 2
Bonus: I wrote code that divides Packages into single records, but it puts the first day of each month as the date.
SELECT
Date =
(
SELECT TOP 1
date
FROM dim_Date dim
WHERE dim.Month = a.Month
AND dim.Year = a.Year
)
, Country
, EntryMethod
, Deliveries = 1
, PaidAmount = NULLIF(PaidAmount, 0) / PackagesNumber
, SubscriptionId = 90000000 + ROW_NUMBER() OVER(ORDER BY n.number)
, Version
FROM
(
SELECT
[Year] = LEFT(MonthNo, 4)
, [Month] = RIGHT(MonthNo, 2)
, Country
, EntryMethod
, PackagesNumber
, PaidAmount
, Version
FROM tgm.rep_PredictionsReport_ManualValues tgm
/*WHERE MonthNo = 201701*/
) a
JOIN master..spt_values n
ON n.type = 'P'
AND n.number < CAST(PackagesNumber AS INT);
EDIT: I made some progress. I used the NTILE function to divide rows into groups.
The only thing that changed is the Date in the top-level select. It looks like this now:
Date = concat([Year], '-', [Month], '-', case when ntile(31) over(order by n.number) < 10 then '0' + cast(ntile(31) over(order by n.number) as varchar(2)) else cast(ntile(31) over(order by n.number) as varchar(2)) end)
Explanation: I am creating the Date field using the Year and Month fields, and NTILE over the number of days in the month (a static number for now, to be changed later). The results aren't as good as I'd expect; it's creating groups twice as big as they should be (14 instead of 7 rows per date).
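A guess at the cause (my note, not the asker's): the ntile(31) runs over the whole result set, so rows belonging to different source rows share the same tiles. Partitioning the NTILE by the columns that identify a source row should keep each group's days separate, something like:
Date = concat([Year], '-', [Month], '-',
right('0' + cast(ntile(31) over(partition by [Year], [Month], Country, EntryMethod
order by n.number) as varchar(2)), 2))
(The right('0' + ..., 2) is just a shorter way to zero-pad than the original case expression.)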
You can accomplish this using the modulo operator, which allows you to divide items into a set number of categories.
Here is a full test: http://rextester.com/TOROA96856
Here is the relevant query:
--recursive query to expand each row.
with expand_rows (subid,monthno,month,packagesnumber,paidamount) as (
select subid,monthno,month,packagesnumber,(paidamount+0.0000)/packagesnumber
from initial_table
union all
select subid,monthno,month,packagesnumber-1,paidamount
from expand_rows where packagesnumber >1
)
select expand_rows.*,(packagesnumber % numdays)+1 day, paidamount from expand_rows
join dayspermonth d on
d.month = expand_rows.month
order by subid, day
option (maxrecursion 0)
(packagesnumber % numdays)+1 is the modulo operation that assigns items to a day.
Note that I precomputed a table of the number of days in each month in order to use in the query. I also simplified the problem slightly for purposes of the answer (added a pure month column because I didn't want to mess around with replicating your date dimension).
You may need to tweak the modulo query if you care where the extra items end up when things don't divide evenly (e.g. if you have 32 items in January, which day has an extra item?). In this example the second day of the month tends to get the most (because of adding 1 to account for the fact that the last day of the month ends up 0). If you want the extra days to fall at the beginning of the month you could use a case statement that converts 0 to the number of days in the month, instead.
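For instance, the day expression above could become (a sketch of that case-statement variant):
case packagesnumber % numdays
when 0 then numdays -- the would-be "day 0" wraps to the last day of the month
else packagesnumber % numdays
end as day
With this, the extra items fall on days 1 through (packagesnumber % numdays), at the beginning of the month.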
To distribute 223 numbers evenly over the days of January we do this:
There are 31 days in January
The remainder for 223/31 is 6
223/31 is 7 (integer division)
So that's 7 records per day, plus 1 extra record for January 1-6 (sanity-checked below).
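(A quick sanity check of that arithmetic in T-SQL, my addition:)
select 223 / 31 as records_per_day, -- integer division: 7
223 % 31 as extra_first_days; -- remainder: 6, so days 1-6 each get one extra row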
I've used a tally table to make dates and some more, but the distribution of rows per day can be determined like this:
with
tally as
(
select row_number() over (order by n)-1 n from
(values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) n(n)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) m(m)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) l(m)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) k(m)
)
,t1 as
(
select
*
from
(values
(1 , 201701 , 223 , 'NO' , 'BCD' , 44803 , 2)
,(2 , 201701 , 61 , 'NO' , 'GHI' , 11934 , 2)
,(3 , 201701 , 929 , 'NO' , 'ABC' , 88714 , 2)
,(4 , 201701 , 470 , 'NO' , 'DEF' , 98404 , 2)
,(5 , 201702 , 223 , 'NO' , 'BCD' , 28225 , 2)
) t(subId , MonthNo , PackagesNumber , Country , EntryMethod , PaidAmount , Version)
)
,dates as
(
select dateadd(day,n,'20170101') as dt
,convert(varchar(10),dateadd(day,n,'20170101'),112)/100 mnthkey
,day(dateadd(day,-1,dateadd(month,1,cast(((convert(varchar(10),dateadd(day,n,'20170101'),112)/100)*100 +1) as varchar(10))))) DaysInMonth
from
tally
)
select
subId
,MonthNo
,dt
,PackagesNumber
,case when day(dt)<=PackagesNumber%DaysInMonth then 1 else 0 end remainder
,PackagesNumber/DaysInMonth evenlyspread
,Country
,EntryMethod
,PaidAmount
,Version
from t1 a
inner join dates b
on a.MonthNo=b.mnthkey
I join the data table on the month, and for each day in the month I assign the evenly distributed rows, 7 in our example; for the first days, 6 in our example, I add 1 as remainder.
Now we have the info from your base table, multiplied by every day in the relevant months. Now we just need to make multiple rows per day; here we use the tally table again:
with
tally as
(
select row_number() over (order by n)-1 n from
(values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) n(n)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) m(m)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) l(m)
cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) k(m)
)
,t1 as
(
select
*
from
(values
(1 , 201701 , 223 , 'NO' , 'BCD' , 44803 , 2)
,(2 , 201701 , 61 , 'NO' , 'GHI' , 11934 , 2)
,(3 , 201701 , 929 , 'NO' , 'ABC' , 88714 , 2)
,(4 , 201701 , 470 , 'NO' , 'DEF' , 98404 , 2)
,(5 , 201702 , 223 , 'NO' , 'BCD' , 28225 , 2)
) t(subId , MonthNo , PackagesNumber , Country , EntryMethod , PaidAmount , Version)
)
,dates as
(
select dateadd(day,n,'20170101') as dt
,convert(varchar(10),dateadd(day,n,'20170101'),112)/100 mnthkey
,day(dateadd(day,-1,dateadd(month,1,cast(((convert(varchar(10),dateadd(day,n,'20170101'),112)/100)*100 +1) as varchar(10))))) DaysInMonth
from
tally
)
,forshow as
(
select
subId
,MonthNo
,dt
,PackagesNumber
,case when day(dt)<=PackagesNumber%DaysInMonth then 1 else 0 end remainder
,PackagesNumber/DaysInMonth evenlyspread
,Country
,EntryMethod
,(PaidAmount+0.0000)/(PackagesNumber*1.0000) PaidAmount
,Version
,PaidAmount TotalPaidAmount
from t1 a
inner join dates b
on a.MonthNo=b.mnthkey
)
select
subId
,dt [Date]
,Country
,1 Packages
,EntryMethod
,PaidAmount
,Version
-- the following rows are just for control
,remainder+evenlyspread toalday
,count(*) over (partition by subId,MonthNo,dt) calctotalday
,PackagesNumber
,count(*) over (partition by subId) calcPackagesNumber
,sum(PaidAmount)over (partition by subId) calcPaidAmount
,TotalPaidAmount
from forshow
inner join tally on n<(remainder+evenlyspread )
order by subId,MonthNo,dt
I join with the number of rows per day (evenlyspread + remainder) and get one row per package.
I've added some check columns to make sure I get 8 rows on each of the first 6 days, and 223 rows in total for our example.

Oracle SQL sum up values till another value is reached

I hope I can describe my challenge in an understandable way.
I have two tables on an Oracle Database 12c which look like this:
Table name "Invoices"
I_ID | invoice_number | creation_date | i_amount
------------------------------------------------------
1 | 10000000000 | 01.02.2016 00:00:00 | 30
2 | 10000000001 | 01.03.2016 00:00:00 | 25
3 | 10000000002 | 01.04.2016 00:00:00 | 13
4 | 10000000003 | 01.05.2016 00:00:00 | 18
5 | 10000000004 | 01.06.2016 00:00:00 | 12
Table name "payments"
P_ID | reference | received_date | p_amount
------------------------------------------------------
1 | PAYMENT01 | 12.02.2016 13:14:12 | 12
2 | PAYMENT02 | 12.02.2016 15:24:21 | 28
3 | PAYMENT03 | 08.03.2016 23:12:00 | 2
4 | PAYMENT04 | 23.03.2016 12:32:13 | 30
5 | PAYMENT05 | 12.06.2016 00:00:00 | 15
So I want a select statement (maybe with Oracle analytic functions, but I am not really familiar with them) where the payments are summed up until the amount of an invoice is reached, ordered by date. If the sum of, for example, two payments is more than the invoice amount, the rest of the last payment amount should be used for the next invoice.
In this example the result should be like this:
invoice_number | reference | used_pay_amount | open_inv_amount
----------------------------------------------------------
10000000000 | PAYMENT01 | 12 | 18
10000000000 | PAYMENT02 | 18 | 0
10000000001 | PAYMENT02 | 10 | 15
10000000001 | PAYMENT03 | 2 | 13
10000000001 | PAYMENT04 | 13 | 0
10000000002 | PAYMENT04 | 13 | 0
10000000003 | PAYMENT04 | 4 | 14
10000000003 | PAYMENT05 | 14 | 0
10000000004 | PAYMENT05 | 1 | 11
It would be nice if there were a solution with a "simple" select statement.
Thanks in advance for your time.
Oracle Setup:
CREATE TABLE invoices ( i_id, invoice_number, creation_date, i_amount ) AS
SELECT 1, 100000000, DATE '2016-01-01', 30 FROM DUAL UNION ALL
SELECT 2, 100000001, DATE '2016-02-01', 25 FROM DUAL UNION ALL
SELECT 3, 100000002, DATE '2016-03-01', 13 FROM DUAL UNION ALL
SELECT 4, 100000003, DATE '2016-04-01', 18 FROM DUAL UNION ALL
SELECT 5, 100000004, DATE '2016-05-01', 12 FROM DUAL;
CREATE TABLE payments ( p_id, reference, received_date, p_amount ) AS
SELECT 1, 'PAYMENT01', DATE '2016-01-12', 12 FROM DUAL UNION ALL
SELECT 2, 'PAYMENT02', DATE '2016-01-13', 28 FROM DUAL UNION ALL
SELECT 3, 'PAYMENT03', DATE '2016-02-08', 2 FROM DUAL UNION ALL
SELECT 4, 'PAYMENT04', DATE '2016-02-23', 30 FROM DUAL UNION ALL
SELECT 5, 'PAYMENT05', DATE '2016-05-12', 15 FROM DUAL;
Query:
WITH total_invoices ( i_id, invoice_number, creation_date, i_amount, i_total ) AS (
SELECT i.*,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id )
FROM invoices i
),
total_payments ( p_id, reference, received_date, p_amount, p_total ) AS (
SELECT p.*,
SUM( p_amount ) OVER ( ORDER BY received_date, p_id )
FROM payments p
)
SELECT invoice_number,
reference,
LEAST( p_total, i_total )
- GREATEST( p_total - p_amount, i_total - i_amount ) AS used_pay_amount,
GREATEST( i_total - p_total, 0 ) AS open_inv_amount
FROM total_invoices
INNER JOIN
total_payments
ON ( i_total - i_amount < p_total
AND i_total > p_total - p_amount );
Explanation:
The two sub-query factoring (WITH ... AS ()) clauses just add an extra virtual column to the invoices and payments tables with the cumulative sum of the invoice/payment amount.
You can associate a range with each invoice (or payment) as the cumulative amount owing (paid) before the invoice (payment) was placed and the cumulative amount owing (paid) after. The two tables can then be joined where there is an overlap of these ranges.
The open_inv_amount is the positive difference between the cumulative amount invoiced and the cumulative amount paid.
The used_pay_amount is slightly more complicated but you need to find the difference between the lower of the current cumulative invoice and payment totals and the higher of the previous cumulative invoice and payment totals.
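To make the formula concrete (my worked example from the data above): for invoice 100000001 (i_total = 55, i_amount = 25) joined with PAYMENT02 (p_total = 40, p_amount = 28):
used_pay_amount = LEAST(40, 55) - GREATEST(40 - 28, 55 - 25) = 40 - 30 = 10
open_inv_amount = GREATEST(55 - 40, 0) = 15
which matches the third data row of the output below.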
Output:
INVOICE_NUMBER REFERENCE USED_PAY_AMOUNT OPEN_INV_AMOUNT
-------------- --------- --------------- ---------------
100000000 PAYMENT01 12 18
100000000 PAYMENT02 18 0
100000001 PAYMENT02 10 15
100000001 PAYMENT03 2 13
100000001 PAYMENT04 13 0
100000002 PAYMENT04 13 0
100000003 PAYMENT04 4 14
100000003 PAYMENT05 14 0
100000004 PAYMENT05 1 11
Update:
Based on mathguy's method of using UNION to join the data, I came up with a different solution re-using some of my code.
WITH combined ( invoice_number, reference, i_amt, i_total, p_amt, p_total, total ) AS (
SELECT invoice_number,
NULL,
i_amount,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id ),
NULL,
NULL,
SUM( i_amount ) OVER ( ORDER BY creation_date, i_id )
FROM invoices
UNION ALL
SELECT NULL,
reference,
NULL,
NULL,
p_amount,
SUM( p_amount ) OVER ( ORDER BY received_date, p_id ),
SUM( p_amount ) OVER ( ORDER BY received_date, p_id )
FROM payments
ORDER BY 7,
2 NULLS LAST,
1 NULLS LAST
),
filled ( invoice_number, reference, i_prev, i_total, p_prev, p_total ) AS (
SELECT FIRST_VALUE( invoice_number ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( reference ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( i_total - i_amt ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( i_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
FIRST_VALUE( p_total - p_amt ) IGNORE NULLS OVER ( ORDER BY ROWNUM ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ),
COALESCE(
p_total,
LEAD( p_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM ),
LAG( p_total ) IGNORE NULLS OVER ( ORDER BY ROWNUM )
)
FROM combined
),
vals ( invoice_number, reference, upa, oia, prev_invoice ) AS (
SELECT invoice_number,
reference,
COALESCE( LEAST( p_total, i_total ) - GREATEST( p_prev, i_prev ), 0 ),
GREATEST( i_total - p_total, 0 ),
LAG( invoice_number ) OVER ( ORDER BY ROWNUM )
FROM filled
)
SELECT invoice_number,
reference,
upa AS used_pay_amount,
oia AS open_inv_amount
FROM vals
WHERE upa > 0
OR ( reference IS NULL AND invoice_number <> prev_invoice AND oia > 0 );
Explanation:
The combined sub-query factoring clause joins the two tables with a UNION ALL and generates the cumulative totals for the amounts invoiced and paid. The final thing it does is order the rows by their ascending cumulative total (and if there are ties it will put the payments, in order created, before the invoices).
The filled sub-query factoring clause will fill the previously generated table so that if a value is null then it will take the value from the next non-null row (and if there is an invoice with no payments then it will find the total of the previous payments from the preceding rows).
The vals sub-query factoring clause applies the same calculations as my previous query (see above). It also adds the prev_invoice column to help identify invoices which are entirely unpaid.
The final SELECT takes the values and filters out the unnecessary rows.
Here is a solution that doesn't require a join. This is important if the amount of data is significant. I did some testing on my laptop (nothing commercial), using the free edition (XE) of Oracle 11.2. Using MT0's solution, the query with the join takes about 11 seconds if there are 10k invoices and 10k payments. For 50k invoices and 50k payments, the query took 287 seconds (almost 5 minutes). This is understandable, since joining two 50k tables requires 2.5 billion comparisons.
The alternative below uses a union. It uses lag() and last_value() to do the work the join does in the other solution. This union-based solution, with 50k invoices and 50k payments, took less than 0.5 seconds on my laptop (!)
I simplified the setup a bit; i_id, invoice_number and creation_date are all used for one purpose only: to order the invoice amounts. I use just an inv_id (invoice id) for that purpose, and similarly for payments.
For testing purposes, I created tables invoices and payments like so:
create table invoices (inv_id, inv_amt) as
(select level, trunc(dbms_random.value(20, 80)) from dual connect by level <= 50000);
create table payments (pmt_id, pmt_amt) as
(select level, trunc(dbms_random.value(20, 80)) from dual connect by level <= 50000);
Then, to test the solutions, I use the queries to populate a CTAS, like this:
create table bal_of_pmts as
[select query, including the WITH clause but without the setup CTE's, comes here]
In my solution, I look to show the allocation of payments to one or more invoice, and the payment of invoices from one or more payments; the output discussed in the original post only covers half of this information, but for symmetry it makes more sense to me to show both halves. The output (for the same inputs as in the original post) looks like this, with my version of inv_id and pmt_id:
INV_ID PAID UNPAID PMT_ID USED AVAILABLE
---------- ---------- ---------- ---------- ---------- ----------
1 12 18 101 12 0
1 18 0 103 18 10
2 10 15 103 10 0
2 2 13 105 2 0
2 13 0 107 13 17
3 13 0 107 13 4
4 4 14 107 4 0
4 14 0 109 14 1
5 1 11 109 1 0
5 11 0 11
Notice how the left half is what the original post requested. There is an extra row at the end. Notice the NULL for payment id, for a payment of 11 - that shows how much of the last payment is left uncovered. If there was an invoice with id = 6, for an amount of, say, 22, then there would be one more row - showing the entire amount (22) of that invoice as "paid" from a payment with no id - meaning actually not covered (yet).
The query may be a little easier to understand than the join approach. To see what it does, it may help to look closely at intermediate results, especially the CTE c (in the WITH clause).
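For the sample data, CTE c works out as follows (computed by hand from the two running sums, shown in the descending cml_amt order that CTE d relies on):
KIND  INV_ID  INV_CML  PMT_ID  PMT_CML  CML_AMT
----  ------  -------  ------  -------  -------
i          5       98                        98
p                         109       87       87
i          4       86                        86
p                         107       72       72
i          3       68                        68
i          2       55                        55
p                         105       42       42
p                         103       40       40
i          1       30                        30
p                         101       12       12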
with invoices (inv_id, inv_amt) as (
select 1, 30 from dual union all
select 2, 25 from dual union all
select 3, 13 from dual union all
select 4, 18 from dual union all
select 5, 12 from dual
),
payments (pmt_id, pmt_amt) as (
select 101, 12 from dual union all
select 103, 28 from dual union all
select 105, 2 from dual union all
select 107, 30 from dual union all
select 109, 15 from dual
),
c (kind, inv_id, inv_cml, pmt_id, pmt_cml, cml_amt) as (
select 'i', inv_id, sum(inv_amt) over (order by inv_id), null, null,
sum(inv_amt) over (order by inv_id)
from invoices
union all
select 'p', null, null, pmt_id, sum(pmt_amt) over (order by pmt_id),
sum(pmt_amt) over (order by pmt_id)
from payments
),
d (inv_id, paid, unpaid, pmt_id, used, available) as (
select last_value(inv_id) ignore nulls over (order by cml_amt desc),
cml_amt - lead(cml_amt, 1, 0) over (order by cml_amt desc),
case kind when 'i' then 0
else last_value(inv_cml) ignore nulls
over (order by cml_amt desc) - cml_amt end,
last_value(pmt_id) ignore nulls over (order by cml_amt desc),
cml_amt - lead(cml_amt, 1, 0) over (order by cml_amt desc),
case kind when 'p' then 0
else last_value(pmt_cml) ignore nulls
over (order by cml_amt desc) - cml_amt end
from c
)
select inv_id, paid, unpaid, pmt_id, used, available
from d
where paid != 0
order by inv_id, pmt_id
;
In most cases, CTE d is all we need. However, if the cumulative sum for several invoices is exactly equal to the cumulative sum for several payments, my query would add a row with paid = unpaid = 0. (MT0's join solution does not have this problem.) To cover all possible cases, and not have rows with no information, I had to add the filter for paid != 0.

Latest value for each time period, person and category

I have tried browsing the problems & answers in this forum, but none of them fits my case sufficiently.
I have some people reporting in their status for 2 categories, which looks like this:
TimeStamp | PersonID | Category | Value
2015-07-02 01:25:00 | 2303 | CatA | 8.2
2015-07-02 01:25:00 | 2303 | CatB | 10.1
2015-07-02 03:35:00 | 2303 | CatA | 8.0
2015-07-02 03:35:00 | 2303 | CatB | 9.9
2015-07-02 02:30:00 | 4307 | CatA | 8.7
2015-07-02 02:30:00 | 4307 | CatB | 12.7
.
.
.
2015-07-31 22:15:00 | 9011 | CatA | 7.9
2015-07-31 22:15:00 | 9011 | CatB | 8.9
Some people report status several times per hour, but others only a couple of times per day.
I need to produce an output which shows the latest known status for each day, for each hour of the day, for each person and category. It should look like this:
Date |Hour| Person | Category | Value
2015-07-02 | 1 | 2307 | CatA | Null
2015-07-02 | 1 | 2307 | CatB | Null
2015-07-02 | 2 | 2307 | CatA | 8.2
2015-07-02 | 2 | 2307 | CatB | 10.2
2015-07-02 | 3 | 2307 | CatA | 8.2
2015-07-02 | 3 | 2307 | CatB | 10.2
2015-07-02 | 4 | 2307 | CatA | 8.0
2015-07-02 | 4 | 2307 | CatB | 9.9
.
.
.
2015-07-31 | 23 | 9011 | CatA | 7.9
2015-07-31 | 23 | 9011 | CatB | 8.9
The first row(s) for each person and category will probably be null, as there are no known values yet; this is the "beginning of time".
I have tried using a sub query like this:
SELECT Date
,hour
,Person
,Category
,(SELECT TOP 1 status FROM readings WHERE (readings.Date<=structure.Date) AND readings.Hour<=structure.hour)....and so forth.... order by TimeStamp DESC
FROM structure
This works, except in terms of performance: I need to do this for a month, for 2,000 persons and 2 categories, which means the sub-query must run 30*24*2000*2 = 2,880,000 times. Given that the readings table also contains hundreds of thousands of readings, this doesn't work.
I have also tried messing around with row_number(), but have not succeeded.
Any suggestions?
Edit (19-10-2015 15:34): In my query example above I am referring to a "structure" table. This is actually just (for the time being) a view, with the following SQL:
SELECT Calendar.CalendarDay, Hours.Hour, Persons.Person, Categories.Category
FROM Calendar CROSS JOIN Hours CROSS JOIN Persons CROSS JOIN Categories
This in order to produce a table containing a row for each day, for each hour for each person and each category. This table then contains (30*24*2000*2=2,880,000) rows.
For each of these rows, I need to locate the latest status from the readings table. So for each Day, for each hour, for each person and each category I need to read the latest available status from the readings table.
Let me guess.
Based on the task "to produce an output, which shows latest know status for each day, for each hour of the day, for each person and category" you need to take three steps:
(1) Find latest records for every hour;
(2) Get a table of all date and hours to show;
(3) Multiply that date-hours-table by persons and categories and left join the result with latest-records-for-every-hour.
-- Test data
declare @t table ([Timestamp] datetime2(0), PersonId int, Category varchar(4), Value decimal(3,1));
insert into @t values
('2015-07-02 01:25:00', 2303, 'CatA', 8.2 ),
('2015-07-02 01:45:00', 2303, 'CatA', 9.9 ),
('2015-07-02 01:25:00', 2303, 'CatB', 10.1 ),
('2015-07-02 03:35:00', 2303, 'CatA', 8.0 ),
('2015-07-02 03:35:00', 2303, 'CatB', 9.9 ),
('2015-07-02 02:30:00', 4307, 'CatA', 8.7 ),
('2015-07-02 02:30:00', 4307, 'CatB', 12.7 );
-- Latest records for every hour
declare @Latest table (
[Date] date,
[Hour] tinyint,
PersonId int,
Category varchar(4),
Value decimal(3,1)
primary key ([Date], [Hour], PersonId, Category)
);
insert into @Latest
select top 1 with ties
[Date] = cast([Timestamp] as date),
[Hour] = datepart(hour, [Timestamp]),
PersonId ,
Category ,
Value
from
@t
order by
row_number() over(partition by cast([Timestamp] as date), datepart(hour, [Timestamp]), PersonId, Category order by [Timestamp] desc);
-- Date-hours table
declare @FromDateTime datetime2(0);
declare @ToDateTime datetime2(0);
select @FromDateTime = min([Timestamp]), @ToDateTime = max([Timestamp]) from @t;
declare @DateDiff int = datediff(day, @FromDateTime, @ToDateTime);
declare @FromDate date = cast(@FromDateTime as date);
declare @FromHour int = datepart(hour, @FromDateTime);
declare @ToHour int = datepart(hour, @ToDateTime);
declare @DayHours table ([Date] date, [Hour] tinyint, primary key clustered ([Date], [Hour]) );
with N as
(
select n from (values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)) t(n)
),
D as (
select
row_number() over(order by (select 1))-1 as d
from
N n1, N n2, N n3
),
H as (
select top 24
row_number() over(order by (select 1)) - 1 as h
from
N n1, N n2
)
insert into @DayHours
select dateadd(day, d, @FromDate), h
from
D, h
where
@FromHour <= (d * 100 + h)
and (d * 100 + h) <= (@DateDiff * 100 + @ToHour);
-- @PersonsIds & @Categories tables (just an imitation of the real tables)
declare @PersonsIds table (Id int primary key);
declare @Categories table (Category varchar(4) primary key);
insert into @PersonsIds select distinct PersonId from @t;
insert into @Categories select distinct Category from @t;
-- The result
select
dh.[Date],
dh.[Hour],
PersonId = p.Id,
c.Category,
l.Value
from
@PersonsIds p cross join @Categories c cross join @DayHours dh
left join @Latest l on l.[Date] = dh.[Date] and l.[Hour] = dh.[Hour] and l.PersonId = p.Id and l.Category = c.Category
order by
[Date], [Hour], PersonId, Category;
Edit (1):
OK.
In order to carry the previous values forward into the empty slots,
let's replace the last select statement with this one:
select top 1 with ties
dh.[Date],
dh.[Hour],
PersonId = p.Id,
c.Category,
l.Value
from
@PersonsIds p cross join @Categories c cross join @DayHours dh
left join @Latest l
on (l.[Date] = dh.[Date] and l.[Hour] <= dh.[Hour] or l.[Date] < dh.[Date])
and l.PersonId = p.Id and l.Category = c.Category
order by
row_number()
over (partition by dh.[Date], dh.[Hour], p.Id, c.Category
order by l.[Date] desc, l.[Hour] desc);
Edit (2):
Let's try to collect the Cartesian product in a table with a clustered index on PersonId, Category, [Date], [Hour],
and then update the table, dragging the unchanged values forward:
declare @Result table (
[Date] date,
[Hour] tinyint,
PersonId int,
Category varchar(4),
Value decimal(3,1)
primary key (PersonId, Category, [Date], [Hour]) -- Important !!!
)
insert into @Result
select
dh.[Date],
dh.[Hour],
PersonId = p.Id,
c.Category,
l.Value
from
@PersonsIds p cross join @Categories c cross join @DayHours dh
left join @Latest l on l.[Date] = dh.[Date] and l.[Hour] = dh.[Hour] and l.PersonId = p.Id and l.Category = c.Category
order by
[Date], [Hour], PersonId, Category;
declare @PersonId int;
declare @Category varchar(4);
declare @Value decimal(3,1);
update @Result set
-- quirky update: walks the rows in clustered-index order and drags the last non-null Value forward
@Value = Value = isnull(Value, case when @PersonId = PersonId and @Category = Category then @Value end),
@PersonId = PersonId,
@Category = Category;
For yet better performance, consider replacing the table variables with temporary tables and applying indexes in accordance with query plan recommendations.
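A sketch of that swap (my addition; same columns and key as @Result above):
create table #Result (
[Date] date,
[Hour] tinyint,
PersonId int,
Category varchar(4),
Value decimal(3,1),
primary key clustered (PersonId, Category, [Date], [Hour]) -- same key, but a real temp table also gets statistics
);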
If I got it correctly, it should give you the desired result:
select st.Date,
case when hour =1 then NULL
else hour
end as hour
,st.Person,st.Category,
(select status from reading qualify row_number() over (partition by personid
order by status desc)=1)
from structure;
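Note that qualify is Teradata/Snowflake syntax and will not run on SQL Server. A rough T-SQL rendering of the same idea (my translation; it assumes the structure view and a readings table with TimeStamp, PersonID, Category and Value columns as described in the question):
select Date, Hour, Person, Category, Value
from (
select st.Date, st.Hour, st.Person, st.Category, r.Value,
row_number() over (partition by st.Date, st.Hour, st.Person, st.Category
order by r.TimeStamp desc) as rn
from structure st
left join readings r
on r.PersonID = st.Person
and r.Category = st.Category
and r.TimeStamp <= dateadd(hour, st.Hour, cast(st.Date as datetime))
) q
where rn = 1;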
You can achieve this in SQL, but it will be quite slow, because for every person, category, day and hour you will have to look for the latest entry for that person and category until then. Just think of the process: pick a record in your big table, find all statuses until then, order them, take the latest and pick its value. And this will be done for every record in your big table.
You might be better off simply retrieving all data with a program written in a programming language and collecting the data with a control-break algorithm.
However, let's see how it's done in SQL.
One problem is SQL Server's poor date/time functions. We want to compare date plus hour, which would be easiest with strings in 'yyyymmddhh' format, e.g. '2015101923' < '2015102001'. In your big table you have date and hour and in your status table you have datetimes. Let's see how we can get the desired strings:
convert(varchar(8), bigtable.calendarday, 112) +
right('0' + convert(varchar(2), bigtable.hour), 2)
and
convert(varchar(8), status.timestamp, 112) +
right('0' + convert(varchar(2), datepart(hour, status.timestamp)), 2)
As this is - along with person and category - our key criterion to find records, you may want to have it as computed columns and add indexes (person + category + dayhourkey) in both tables.
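A sketch of those computed columns and indexes (my addition; the table and column names are assumed from the discussion above):
alter table bigtable add dayhourkey as
convert(varchar(8), calendarday, 112) + right('0' + convert(varchar(2), [hour]), 2);
alter table status add dayhourkey as
convert(varchar(8), [timestamp], 112) + right('0' + convert(varchar(2), datepart(hour, [timestamp])), 2);
create index ix_bigtable_key on bigtable (personid, category, dayhourkey);
create index ix_status_key on status (personid, category, dayhourkey);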
You'd select from your big table and get the status value in a subquery. In order to get the latest matching record, you'd order by timestamp and limit to 1 record.
select
personid,
calendarday,
hour,
category,
(
select top 1 value
from status s
where s.personid = b.personid
and s.category = b.category
and convert(varchar(8), s.timestamp, 112) + right('0' + convert(varchar(2), datepart(hour, s.timestamp)), 2) <=
convert(varchar(8), b.calendarday, 112) + right('0' + convert(varchar(2), b.hour), 2)
order by s.timestamp desc
) as value
from bigtable b;