Speeding up cumulative sum calculation in SQL Server - sql

As part of a solution I am building, I had to implement a view that performs a running total (cumulative sum calculation). I took the simplest approach of joining the value table to a table with a list of dates, but the view is still fairly slow. Adding indexes on the table didn't help much, even though the table itself has only about 15K rows. Could someone advise on the right approach to speed it up?
There are several considerations:
I need to calculate the cumulative sum up to a date for a specific ProjectID and ContractorID. The same date may have many ProjectID and ContractorID combinations, but the combination of Date, ProjectID and ContractorID is always unique.
There is a master table with dates and ProjectIDs (but no ContractorIDs), and I need a cumulative sum for each date and ProjectID in this master dates table.
I need to calculate a cumulative sum of several columns at the same time, not just one.
To walk you through the situation in a bit more detail, the tables I have are:
dbo.Project_Reporting_Schedule holds a master list of ProjectID and date combinations. For each of these combinations I need to calculate a cumulative sum based on another table. Please note it has no ContractorID!
dbo.Project_value_delivery is the table with the actual value columns to sum cumulatively. It has its own set of dates, which may or may not match the dates in Project_Reporting_Schedule, hence we can't just join the table to itself. Please also note that it does have ContractorID!
Currently I have the following select, which joins the table of values to the table with the master date list and does the summation. The select works, but even with just 15K records it takes almost 5 seconds to run, which is fairly slow.
select
pv2.ProjectID,
pv2.ContractorID,
pv1.Date,
sum(pv2.ValuePlanned) as PlannedCumulative,
sum(pv2.ValueActual) as ActualCumulative,
sum(pv2.MobilizationPlanned) as MobilizationPlanned,
sum(pv2.MobilizationActual) as MobilizationActual,
sum(pv2.EngineeringPlanned) as EngineeringPlanned,
sum(pv2.EngineeringActual) as EngineeringActual,
sum(pv2.ProcurementPlanned) as ProcurementPlanned,
sum(pv2.ProcurementActual) as ProcurementActual,
sum(pv2.ConstructionPlanned) as ConstructionPlanned,
sum(pv2.ConstructionActual) as ConstructionActual,
sum(pv2.CommisioningTestingPlanned) as CommisioningTestingPlanned,
sum(pv2.CommisioningTestingActual) as CommisioningTestingActual
from
dbo.Project_Reporting_Schedule as pv1
join
dbo.Project_value_delivery as pv2 on pv1.Date >= pv2.Date and pv1.ProjectID = pv2.ProjectID
group by
pv2.ProjectID, pv2.ContractorID, pv1.Date
UPDATE
To clarify further, I put the execution plan here:
https://www.brentozar.com/pastetheplan/?id=H12t-O1PS
The indexes are the same on both tables: one on the (ProjectID, Date) combination, plus standalone indexes on the ProjectID and Date columns.
All indexes are unique nonclustered where applicable, otherwise just nonclustered.
The plan shows a nonclustered index seek that accounts for most of the execution cost. Maybe the index needs to be adjusted?

OK, so per the suggestion from @Alex in the comments, window functions are the way to go. The code below runs lightning-fast compared to the original:
select
pv2.ProjectID,
pv2.ContractorID,
pv1.Date,
sum(pv2.ValuePlanned) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as PlannedCumulative,
sum(pv2.ValueActual) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as ActualCumulative,
sum(pv2.MobilizationPlanned) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as MobilizationPlanned,
sum(pv2.MobilizationActual) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as MobilizationActual,
sum(pv2.EngineeringPlanned) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as EngineeringPlanned,
sum(pv2.EngineeringActual) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as EngineeringActual,
sum(pv2.ProcurementPlanned) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as ProcurementPlanned,
sum(pv2.ProcurementActual) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as ProcurementActual,
sum(pv2.ConstructionPlanned) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as ConstructionPlanned,
sum(pv2.ConstructionActual) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as ConstructionActual,
sum(pv2.CommisioningTestingPlanned) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as CommisioningTestingPlanned,
sum(pv2.CommisioningTestingActual) over (partition by pv2.ProjectID, pv2.ContractorID order by pv1.Date ROWS between unbounded preceding and current row) as CommisioningTestingActual
from
dbo.Project_Reporting_Schedule as pv1
join dbo.Project_value_delivery as pv2 on pv1.Date = pv2.Date and pv1.ProjectID = pv2.ProjectID
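To illustrate why the windowed form can replace the triangular join, here is a minimal sketch rather than the poster's actual setup. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for SQL Server, and a made-up single table named delivery with only one value column. It computes the running total both ways and checks that the results agree. Note that the query above joins on pv1.Date = pv2.Date, so it presumably only returns schedule dates that have a matching delivery row; the sketch sidesteps that by using one table.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for SQL Server; "delivery" is a toy table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE delivery (ProjectID INT, ContractorID INT, Date TEXT, ValuePlanned INT);
INSERT INTO delivery VALUES
  (1, 10, '2019-01-01', 5),
  (1, 10, '2019-02-01', 7),
  (1, 10, '2019-03-01', 1),
  (1, 20, '2019-01-01', 2),
  (1, 20, '2019-03-01', 4);
""")

# Triangular self-join: every row is matched with all earlier rows of its group.
join_sql = """
SELECT a.ProjectID, a.ContractorID, a.Date, SUM(b.ValuePlanned)
FROM delivery AS a
JOIN delivery AS b
  ON b.ProjectID = a.ProjectID
 AND b.ContractorID = a.ContractorID
 AND b.Date <= a.Date
GROUP BY a.ProjectID, a.ContractorID, a.Date
ORDER BY a.ProjectID, a.ContractorID, a.Date
"""

# Window function: one ordered pass per (ProjectID, ContractorID) partition.
window_sql = """
SELECT ProjectID, ContractorID, Date,
       SUM(ValuePlanned) OVER (
           PARTITION BY ProjectID, ContractorID
           ORDER BY Date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )
FROM delivery
ORDER BY ProjectID, ContractorID, Date
"""

join_rows = conn.execute(join_sql).fetchall()
window_rows = conn.execute(window_sql).fetchall()

for row in window_rows:
    print(row)
print("same result:", join_rows == window_rows)
```

The join pairs each row with every earlier row in its group, so its work grows roughly quadratically with group size, while the window version only needs the rows of each partition in date order.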

Take the comparison out of the JOIN clause and move it to a WHERE clause:
select
pv2.ProjectID,
pv2.ContractorID,
pv1.Date,
sum(pv2.ValuePlanned) as PlannedCumulative,
sum(pv2.ValueActual) as ActualCumulative,
sum(pv2.MobilizationPlanned) as MobilizationPlanned,
sum(pv2.MobilizationActual) as MobilizationActual,
sum(pv2.EngineeringPlanned) as EngineeringPlanned,
sum(pv2.EngineeringActual) as EngineeringActual,
sum(pv2.ProcurementPlanned) as ProcurementPlanned,
sum(pv2.ProcurementActual) as ProcurementActual,
sum(pv2.ConstructionPlanned) as ConstructionPlanned,
sum(pv2.ConstructionActual) as ConstructionActual,
sum(pv2.CommisioningTestingPlanned) as CommisioningTestingPlanned,
sum(pv2.CommisioningTestingActual) as CommisioningTestingActual
FROM
dbo.Project_Reporting_Schedule as pv1
join dbo.Project_value_delivery as pv2 on pv1.ProjectID = pv2.ProjectID
WHERE pv1.Date >= pv2.Date
GROUP BY pv2.ProjectID, pv2.ContractorID, pv1.Date

Related

Before&After purchase of a product

I have two tables:
orders_product: all the orders. Each row is a product sold, with some details about the order in which it was included. So if an order has more than one product, there is more than one row for that order.
orders_grouped: each row is an order, with some details about that specific order.
I would like to know whether there was a previous purchase and a following purchase for each product.
SELECT
product_name,
last_value(product_all_grouped_list) over (partition by ord.customer_id order by created_at asc rows between unbounded preceding and 1 preceding ) as last_order,
last_value(product_all_grouped_list) over (partition by ord.customer_id order by created_at desc rows between unbounded preceding and 1 preceding ) as next_order_products,
last_value(basket_size) over (partition by ord.customer_id order by created_at desc rows between unbounded preceding and 1 preceding ) as next_order_basket_size
FROM
`orders_product` ord
left join `orders_grouped` ordgroup
on ord.order_number=ordgroup.order_number
When the order has only one product (basket_size=1), everything is correct, but when basket_size>1, the result for the first product of the order is OK while the results for the rest of the products in that order are wrong.
Can someone help me?
Because an order can contain several items, and therefore several rows, the window frame has to be defined differently: use RANGE instead of ROWS in the OVER clause.
You can also define a named window at the end of the query:
With tbl as (
Select * from unnest(generate_timestamp_array("2022-09-01","2022-09-15",interval 1 hour)) update_time
)
SELECT
*,
LAST_VALUE(update_time) OVER (ORDER BY update_time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ),
timestamp_diff(update_time,timestamp("1999-01-01"),second) ,
LAST_VALUE(update_time) OVER SETUP_window
FROM
tbl
window SETUP_window as (ORDER BY timestamp_diff(update_time,timestamp("1999-01-01"),second) ASC RANGE BETWEEN UNBOUNDED PRECEDING AND 36000 PRECEDING )
order by update_time desc
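The difference between ROWS and RANGE for multi-row orders can be sketched on a toy table. This is an illustration under assumptions, not the asker's schema: it uses Python's sqlite3 module (SQLite 3.28 or later, which supports numeric RANGE offsets) instead of BigQuery, a made-up table order_lines, and an integer created_at so that "1 PRECEDING" is a numeric offset.

```python
import sqlite3

# Assumption: SQLite 3.28+ stands in for BigQuery; table and data are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines (customer_id INT, created_at INT, product_name TEXT, basket TEXT);
INSERT INTO order_lines VALUES
  (1, 100, 'apple', 'apple,pear'),  -- first order, two product rows
  (1, 100, 'pear',  'apple,pear'),
  (1, 200, 'milk',  'milk');        -- second order, one product row
""")

def previous_basket(frame_unit):
    """Return {product_name: basket of the 'previous' row(s)} for ROWS or RANGE."""
    sql = f"""
    SELECT product_name,
           LAST_VALUE(basket) OVER (
               PARTITION BY customer_id
               ORDER BY created_at
               {frame_unit} BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ) AS last_order
    FROM order_lines
    """
    return dict(conn.execute(sql).fetchall())

by_rows = previous_basket("ROWS")    # counts physical rows
by_range = previous_basket("RANGE")  # compares created_at values

print("ROWS :", by_rows)
print("RANGE:", by_range)
```

With ROWS, one of the two rows of the first order sees its sibling row as the "previous order". With RANGE, both rows share the same created_at, so neither falls into the other's frame and both get NULL, while the later order sees the first order's basket.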

How can I specify window frame for my sum?

Can I specify a window frame like this, and if so, how?
sum(Quantity) over (partition by AccountId, SymbolId order by Time rows between unbounded preceding and current row -1) PositionAmount
I tried to get around it with
sum(Quantity) over (partition by AccountId, SymbolId order by Time rows between unbounded preceding and -1 following)
but -1 is not allowed.
I could of course wrap it in a second select and find the previous value of PositionAmount with lag or something similar.
The documentation specifies that, when you use BETWEEN, both parts are window frame bounds; it does not force you to use FOLLOWING for the second part. Try this:
rows between unbounded preceding and 1 preceding
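As a quick check of what that frame returns, here is a small sketch. It assumes Python's sqlite3 module (SQLite 3.25 or later) in place of the asker's engine, and a made-up trades table with the column names from the question.

```python
import sqlite3

# Assumption: SQLite (3.25+) as a stand-in engine; "trades" and its rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trades (AccountId INT, SymbolId INT, Time INT, Quantity INT);
INSERT INTO trades VALUES
  (1, 1, 1, 10),
  (1, 1, 2, -3),
  (1, 1, 3,  5),
  (2, 1, 1,  7);
""")

rows = conn.execute("""
SELECT AccountId, SymbolId, Time, Quantity,
       SUM(Quantity) OVER (
           PARTITION BY AccountId, SymbolId
           ORDER BY Time
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
       ) AS PositionAmount
FROM trades
ORDER BY AccountId, SymbolId, Time
""").fetchall()

for row in rows:
    print(row)
# (1, 1, 1, 10, None)   first row of the partition: empty frame, so NULL
# (1, 1, 2, -3, 10)
# (1, 1, 3, 5, 7)
# (2, 1, 1, 7, None)
```

The first row of each partition has an empty frame, so the sum is NULL rather than 0; wrap it in COALESCE if a zero starting position is wanted.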

Faster alternative of MIN/MAX in SQL Server

I need the lowest/highest price of stocks for the past n days. The following query runs really slowly. I would appreciate a faster alternative:
SELECT
*,
MIN(Close) OVER (PARTITION BY Ticker ORDER BY PriceDate ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING) AS MinPrice14d,
MAX(Close) OVER (PARTITION BY Ticker ORDER BY PriceDate ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING) AS MaxPrice14d
FROM
(SELECT CompanyID, Ticker, PriceDate, Close
FROM price.PriceHistoryDaily) a
I need the columns specified.
It is trailing, so I need it day by day.
As for period, I will limit it to one year.
Although it doesn't affect the performance, no subquery is needed. So start with the simpler version:
SELECT phd.CompanyID, phd.Ticker, phd.PriceDate, phd.Close,
min(Close) over (partition by Ticker
order by PriceDate
rows between 14 preceding and 1 preceding
) as MinPrice14d,
max(Close) over (partition by Ticker
order by PriceDate
rows between 14 preceding and 1 preceding
) as MaxPrice14d
FROM price.PriceHistoryDaily phd;
Then try adding an index on PriceHistoryDaily(Ticker, PriceDate).
Note that this returns all rows from PriceHistoryDaily and, depending on the size of the table, that might be what is driving the performance.

Last_Value in SQL Server

with cte
as
(
SELECT
year(h.orderdate)*100+month(h.orderdate) as yearmonth,
YEAR(h.orderdate) as orderyear,
sum(d.OrderQty*d.UnitPrice) as amount
FROM [AdventureWorks].[Sales].[SalesOrderDetail] d
inner join sales.SalesOrderHeader h
on d.SalesOrderID=h.SalesOrderID
group by
year(h.orderdate)*100+month(h.orderdate),
year(h.orderdate)
)
select
c.*,
last_value(c.amount) over (partition by c.orderyear order by c.yearmonth) as lastvalue,
first_value(c.amount) over (partition by c.orderyear order by c.yearmonth) as firstvalue
from cte c
order by c.yearmonth
I am expecting to see the last value of each year (say, the December value), similar to the first value of each year (the January value). However, last_value is not working at all: it just returns the value of the current month. What did I do wrong?
Thanks for the help.
Your problem is that the default window frame when ORDER BY is specified is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the value you are getting is the current month's value (that being the last value in that frame). To get LAST_VALUE to look at all values in the partition, you need to expand the frame to include the rows after the current row as well. So you need to change your query to:
last_value(c.amount) over (partition by c.orderyear order by c.yearmonth
rows between unbounded preceding and unbounded following) as lastvalue,
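The effect of the default frame can be reproduced on a few rows. The following is a sketch under assumptions: Python's sqlite3 module (SQLite 3.25 or later, which uses the same default frame) instead of SQL Server, and a made-up monthly table standing in for the CTE.

```python
import sqlite3

# Assumption: SQLite (3.25+) as a stand-in; "monthly" mimics the CTE's output.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly (orderyear INT, yearmonth INT, amount INT);
INSERT INTO monthly VALUES
  (2011, 201110, 100),
  (2011, 201111, 200),
  (2011, 201112, 300),
  (2012, 201201,  50),
  (2012, 201202,  60);
""")

rows = conn.execute("""
SELECT yearmonth,
       -- default frame: ends at the current row
       LAST_VALUE(amount) OVER (PARTITION BY orderyear ORDER BY yearmonth) AS default_frame,
       -- explicit frame: spans the whole partition
       LAST_VALUE(amount) OVER (
           PARTITION BY orderyear ORDER BY yearmonth
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS full_frame
FROM monthly
ORDER BY yearmonth
""").fetchall()

for row in rows:
    print(row)
# (201110, 100, 300)
# (201111, 200, 300)
# (201112, 300, 300)
# (201201, 50, 60)
# (201202, 60, 60)
```

FIRST_VALUE appears to work with the default frame only because the first row of the partition is always inside a frame that starts at UNBOUNDED PRECEDING.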

SQL Server : PRECEDING with another condition

I have a query that works fine. It finds the sum and average for the last 3 months and the last year. It worked fine until I got a new request to break the results down further by AwardCode.
So how do I include that?
I mean for this section:
SUM(1.0 * InvolTerm) OVER (ORDER BY Calendar_Date ASC
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS InvolMov3Mth,
I want to find the last 3 months per AwardCode.
My original query, which works, is:
SELECT
Calendar_Date, Mth, NoOfEmp, MaleCount, FemaleCount,
SUM(1.0*InvolTerm) OVER (ORDER BY Calendar_Date ASC
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS InvolMov3Mth,
SUM(1.0*TotalTerm) OVER (ORDER BY Calendar_Date ASC
ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS TermSum12Mth
FROM #X
The result is
But now I need to add another grouping, AwardCode:
SELECT
Mth, AwardCode, NoOfEmp, MaleCount, FemaleCount,
SUM(1.0 * InvolTerm) OVER (ORDER BY Calendar_Date ASC
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS InvolMov3Mth,
SUM(1.0 * TotalTerm) OVER (ORDER BY Calendar_Date ASC
ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS TermSum12Mth
FROM #X
The result will be like this
Notice that the sums of InvolMov3Mth and TermSum12Mth for the whole period do not match the query above.
I think I found the answer to my question.
I used PARTITION BY AwardCode before ORDER BY, and it seems to be working.
SUM(1.0*TotalTerm) OVER (PARTITION BY AwardCode ORDER BY Calendar_Date ASC
ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS TermSum12Mth,
Yes, "PARTITION BY" will make it work for your requirement.
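A small sketch of the partitioned moving sum, under assumptions: Python's sqlite3 module (SQLite 3.25 or later) stands in for SQL Server, the table x and its rows are made up, and the 1.0 * cast from the question is left out to keep the numbers easy to follow.

```python
import sqlite3

# Assumption: SQLite (3.25+) as a stand-in; "x" mimics the #X temp table with toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE x (Calendar_Date TEXT, AwardCode TEXT, InvolTerm INT);
INSERT INTO x VALUES
  ('2020-01-01', 'A', 1), ('2020-01-01', 'B', 10),
  ('2020-02-01', 'A', 2), ('2020-02-01', 'B', 20),
  ('2020-03-01', 'A', 3), ('2020-03-01', 'B', 30),
  ('2020-04-01', 'A', 4), ('2020-04-01', 'B', 40);
""")

rows = conn.execute("""
SELECT AwardCode, Calendar_Date,
       SUM(InvolTerm) OVER (
           PARTITION BY AwardCode          -- restart the frame for each award code
           ORDER BY Calendar_Date
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS InvolMov3Mth
FROM x
ORDER BY AwardCode, Calendar_Date
""").fetchall()

for row in rows:
    print(row)
# ('A', '2020-01-01', 1)
# ('A', '2020-02-01', 3)
# ('A', '2020-03-01', 6)
# ('A', '2020-04-01', 9)
# ('B', '2020-01-01', 10)
# ('B', '2020-02-01', 30)
# ('B', '2020-03-01', 60)
# ('B', '2020-04-01', 90)
```

Without PARTITION BY, "2 PRECEDING" counts the two previous rows regardless of AwardCode, so rows from different award codes in the same month get mixed into one frame.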