I need help solving a performance problem with a recursive function in SQL Server. I have a table of tasks for items, each of which has a lead time. My function calls itself recursively to calculate the due date for each task from the sum of the lead times of the preceding tasks (put simplistically). The function performs slowly at large scale, I believe mainly because it must recalculate the due date of every ancestor for each subsequent task.
So I am wondering, is there a way to store a calculated value that could persist from function call to function call, that would last only the lifetime of the connection? Then my function could 'short-circuit' if it found a pre-calculated value, and avoid re-evaluating for each due date request.
The basic schema is below, with a crude representation of the function in question. (The function could also be written as a CTE, but it would still repeat the same calculations.)
Create Table Projects(id int, DueDate DateTime)
Create Table Items(id int, Parent int, Project int, Offset int)
Create Table Tasks (id int, Parent int, Leadtime Int, Sequence int)
insert into Projects Values
(100,'1/1/2021')
Insert into Items Values
(0,null, 100, 0)
,(1,12, null, 0)
,(2,15, null, 1)
Insert into Tasks Values
(10,0,1,1)
,(11,0,1,2)
,(12,0,2,3)
,(13,0,1,4)
,(14,1,1,1)
,(15,1,1,2)
,(16,2,2,1)
,(17,2,1,2);
CREATE FUNCTION GetDueDate(@TaskID int)
Returns DATETIME
AS BEGIN
    Declare @retval DateTime = null
    Declare @parent int = (Select Parent from Tasks where ID = @TaskID)
    Declare @parentConsumingOp int = (select Parent from Items where ID = @parent)
    Declare @parentOffset int = (select Offset from Items where ID = @parent)
    Declare @seq int = (Select Sequence from Tasks where ID = @TaskID)
    Declare @NextTaskID int = (select ID from Tasks where Parent = @parent and Sequence = @seq-1)
    Declare @Due DateTime = (select DueDate from Projects where ID = (Select Project from Items where ID = (Select Parent from Tasks where ID = @TaskID)))
    Declare @leadTime int = (Select LeadTime from Tasks where ID = @TaskID)
    if @NextTaskID is not null
    BEGIN
        SET @retval = DateAdd(Day,@leadTime * -1,dbo.GetDueDate(@NextTaskID))
    END ELSE IF @parentConsumingOp Is Not Null
    BEGIN
        SET @retval = DateAdd(Day,(@leadTime + @parentOffset)*-1,dbo.GetDueDate(@parentConsumingOp))
    END ELSE SET @retval = DateAdd(Day,@parentOffset*-1,@Due)
    Return @retval
END
EDIT: Sql Fiddle Here
Caveat: the following is based on the sample data you've provided rather than trying to work through the logic in your function (i.e. what you are trying to achieve rather than how you have implemented it)...
The result of the function appears to be:
for "this task"
project.due_date - (sum(tasks.leadtime) - 1) where tasks.sequence <= sequence of this task and tasks.parent = parent of this task
If this is the case then this function gives the same result as yours but is much simpler:
CREATE FUNCTION GetDueDate1(@TaskID int)
Returns DATETIME
AS BEGIN
    Declare @retval DateTime = null
    Declare @parent int = (Select Parent from Tasks where ID = @TaskID)
    Declare @seq int = (Select sequence from Tasks where ID = @TaskID)
    -- Total lead time of this task and every earlier task of the same item, less one day
    Declare @totlead int = (select Sum(Leadtime) - 1 from Tasks where parent = @parent and sequence <= @seq)
    -- Look up the project due date for the task passed in, not a hard-coded task id
    Declare @duedate DateTime = (select p.DueDate from tasks t inner join items i on t.parent = i.id inner join projects p on i.Project = p.id where t.id = @TaskID)
    SET @retval = DateAdd(Day,@totlead * -1,@duedate)
    Return @retval
END;
If I run both functions against your data:
select id
,leadtime
, sequence
, [dbo].[GetDueDate](id) "YourFunction"
, [dbo].[GetDueDate1](id) "MyFunction"
from tasks
where parent = 0;
I get the same result:
id leadtime sequence YourFunction MyFunction
10 1 1 2021-01-01 00:00:00.000 2021-01-01 00:00:00.000
11 1 2 2020-12-31 00:00:00.000 2020-12-31 00:00:00.000
12 2 3 2020-12-29 00:00:00.000 2020-12-29 00:00:00.000
13 1 4 2020-12-28 00:00:00.000 2020-12-28 00:00:00.000
Hope this helps? If it doesn't then please provide some sample data where my function doesn't produce the same result as yours
Update following Comment
Good point, the code above doesn't work for all your data.
I've been thinking this problem through and have come up with the following - please feel free to point it out if I have misunderstood anything:
Your function will, obviously, only return the Due Date for the task you have passed in as a parameter. It will also only calculate the due dates of each of the preceding tasks once during this process
Therefore there is no point "saving" the due dates calculated for other tasks. Within a single call each one is used only once in calculating the initial task ID, so holding the values gives no performance gain, and they won't be reused if you call the function again, because that's not how functions work: a function can't "know" that you called it previously and already calculated the due date for that task ID as an intermediate step.
Re-reading your initial explanation, it appears that you actually want to calculate the due dates for a number of tasks (or all of them?) - not just a single one. If this is the case then I wouldn't (just) use a function (which is inherently limited to 1 task), instead I would write a Stored Procedure that would loop through all your tasks, calculate their due date and save this to a table (either a new table or update your Tasks table).
You would need to ensure that the tasks were processed in an appropriate order so that those used in calculations for subsequent tasks were calculated first
You can re-use the logic in your function (or even call the function from within the SP) but add step(s) that check if the Due Date has already been calculated (i.e. try and select it from the table), use it if it had or calculate it (and save it to the table) if it hadn't
You would need to run this SP whenever relevant data in the tables used in the calculation was amended
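The approach in the points above (calculate each due date once, in dependency order, and reuse it) can be illustrated outside SQL. The following is only a minimal Python sketch of the same logic with a cache, not T-SQL: the dictionaries mirror the sample rows from the question, and the names `due_date`, `BY_ITEM_SEQ` and `ALL_DUE` are made up for the illustration. In the stored procedure, the cache would be the table the due dates are saved to.

```python
from datetime import date, timedelta
from functools import lru_cache

# Sample data copied from the question's INSERT statements.
PROJECT_DUE = {100: date(2021, 1, 1)}
# item id -> (consuming task id or None, project id or None, offset in days)
ITEMS = {0: (None, 100, 0), 1: (12, None, 0), 2: (15, None, 1)}
# task id -> (item id, lead time in days, sequence)
TASKS = {
    10: (0, 1, 1), 11: (0, 1, 2), 12: (0, 2, 3), 13: (0, 1, 4),
    14: (1, 1, 1), 15: (1, 1, 2), 16: (2, 2, 1), 17: (2, 1, 2),
}
# (item id, sequence) -> task id, used to find the task with sequence - 1.
BY_ITEM_SEQ = {(item, seq): tid for tid, (item, _lead, seq) in TASKS.items()}


@lru_cache(maxsize=None)  # each task's due date is computed at most once
def due_date(task_id: int) -> date:
    item, lead, seq = TASKS[task_id]
    consuming_task, project, offset = ITEMS[item]
    previous = BY_ITEM_SEQ.get((item, seq - 1))
    if previous is not None:
        # Same branch as "if @NextTaskID is not null" in the original function.
        return due_date(previous) - timedelta(days=lead)
    if consuming_task is not None:
        # First task of a child item: hang it off the consuming task.
        return due_date(consuming_task) - timedelta(days=lead + offset)
    # First task of a top-level item: start from the project due date.
    return PROJECT_DUE[project] - timedelta(days=offset)


# Due dates for every task; shared ancestors are looked up, not recomputed.
ALL_DUE = {tid: due_date(tid) for tid in TASKS}
```

For tasks 10 to 13 this reproduces the dates in the comparison table above, and `due_date.cache_info()` shows that each of the eight tasks was evaluated only once.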
Related
I am struggling to figure out how to get this stored procedure to return all appointments for the userId (the only input required). Right now it returns only one appointment. I want it to return all 5 attributes (appointment_id, host_name, visit_start, visit_end, and visit_location) for each appointment associated with the userId, with each appointment enclosed in {}. So if a visitor has 5 appointments for the day, I want all 5 appointments returned, each with its 5 attributes and each enclosed in its own {}. I also want to handle the case where a userId has no appointments associated with it. This code works, but, as I said, it returns only 1 appointment instead of all that are associated with the userId. I feel like I need to build an appointment array, although I don't believe arrays are part of SQL Server. I might need a table variable, but I am not sure how to implement one. I am open to any suggestions.
CREATE PROCEDURE [dbo].[GetMeetings]
(
@userId BIGINT,
@appointmentId BIGINT OUTPUT,
@hostName NVARCHAR(101) OUTPUT,
@startTime DATETIMEOFFSET(7) OUTPUT,
@endTime DATETIMEOFFSET(7) OUTPUT,
@location NVARCHAR(100) OUTPUT
)
AS
BEGIN
SET NOCOUNT ON
SELECT TOP 10
@appointmentId = appointment_id,
@hostName = host_name,
@startTime = visit_start,
@endTime = visit_end,
@location = visit_location
FROM
dbo.reg_visits v1 INNER JOIN dbo.reg_visitors v2 on v1.reg_visit_id = v2.reg_visits_reg_visit_id
WHERE
v2.reg_visits_reg_visit_id IN
(SELECT TOP 10
reg_visits_reg_visit_id AS id
FROM
dbo.reg_visitors
WHERE visitor_profile_visitor_id = @userId
ORDER BY
reg_visitors_id DESC)
AND v1.visit_start > GETDATE() ORDER BY v1.reg_visit_id desc
END
Your select statement should just be
CREATE PROCEDURE [dbo].[GetMeetings]
(
@userId BIGINT
)
AS
BEGIN
SET NOCOUNT ON
SELECT TOP 10
appointment_id,
host_name,
visit_start,
visit_end,
visit_location
We want to return this as a result set and not as an out parameter. OUTPUT parameters can really only have one value, so that's why you are getting only one appointment back, and it is probably always the last of the 10 rows that you are expecting. So get rid of all the OUTPUT parameters too and just send in the userid.
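To make the difference between a result set and a single set of output values concrete, here is a small Python sketch. It uses an in-memory SQLite table as a stand-in for the real SQL Server tables, so the table `visits`, its columns, and the function `get_meetings` are all hypothetical. The point is only that the caller reads every row of the result set and wraps each one in its own {} object.

```python
import json
import sqlite3

# In-memory stand-in for the appointment data (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
conn.execute(
    "CREATE TABLE visits (appointment_id INTEGER, host_name TEXT, visitor_id INTEGER)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "Ann", 7), (2, "Bob", 7), (3, "Cy", 8)],
)


def get_meetings(user_id: int) -> list:
    """Return every appointment for the user as a list of dicts."""
    rows = conn.execute(
        "SELECT appointment_id, host_name FROM visits "
        "WHERE visitor_id = ? ORDER BY appointment_id",
        (user_id,),
    ).fetchall()
    return [dict(row) for row in rows]  # an empty list when nothing matches


# One {} object per appointment.
print(json.dumps(get_meetings(7)))
```

A user with no appointments simply produces an empty list, so that case needs no special handling in the query itself.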
I have the following procedure to retrieve some data based on a year that is input by the user. However, I always get a 0 back. I'm still fairly new to SQL, but this seemed like it should work.
Create PROCEDURE [dbo].[Yearly]
@year int
AS
BEGIN
DECLARE @yearly Datetime
DECLARE @summ int
SELECT @summ = SUM([dbo].[Out].[OutPcs]), @yearly = [dbo].[Out].[DateTime]
FROM [dbo].[Out]
WHERE YEAR(@yearly) = @year
GROUP BY [Out].[DateTime]
END;
Should I have used nested select statements? I suspect something is wrong in that part of the procedure.
You have DECLARE @yearly Datetime.
You attempt to set it in SELECT ... @yearly = Out.Datetime FROM Out, but your WHERE clause is YEAR(@yearly) = @year.
This matches no rows, because @yearly is still NULL when YEAR() is evaluated.
That makes the clause equivalent to WHERE NULL = 2018,
which is never true.
To fix this, either set @yearly before using it in the WHERE clause or filter on something else.
It looks like you want YEAR(dbo.Out.Datetime) there instead.
Since it looks like you're new to SQL I will add some extra explanation. This is an oversimplification.
Most programming languages run top to bottom. Executing the line1 first, line2 second, line3 third, and so on. SQL does not do this.
The command SELECT Name FROM Employee WHERE EmpID = 1 Runs in the following order.
First - FROM Employee --> Load the Employee table
Second - WHERE EmpID = 1 --> Scan Employee for the records where EmpID = 1
Third - SELECT Name --> Display the `Name` field of the records I found.
Your command looks like this to the SQL compiler:
First - FROM dbo.Out --> Load Out table
Second - WHERE YEAR(@yearly) = @year --> Scan for records that meet this req.
Third - SELECT ... @yearly = dbo.Out.Datetime --> Set @yearly to the [Datetime] field associated to the record(s) I found.
Note that if your statement had matched multiple records, a SELECT @variable = column assignment would not raise an error: the scalar variable simply ends up holding the value from the last row processed. An error is raised only when a scalar subquery, such as SET @variable = (SELECT ...), returns more than one row:
Subquery returned more than 1 value.
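The effect of comparing against a variable that is still NULL can be reproduced in a few lines. This is a sketch under stated assumptions: it runs against an in-memory SQLite database from Python rather than SQL Server, the table `out_log` and its columns are made up, and SQLite's `strftime` stands in for T-SQL's `YEAR()`. The NULL comparison itself behaves the same way in both engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE out_log (out_pcs INTEGER, logged_at TEXT)")
conn.executemany(
    "INSERT INTO out_log VALUES (?, ?)",
    [(5, "2018-03-01"), (7, "2018-09-15"), (4, "2019-01-10")],
)

yearly = None  # plays the role of the @yearly variable that was never assigned

# Broken filter: NULL = 2018 is never true, so no row passes the WHERE clause
# and SUM over zero rows is NULL.
broken = conn.execute(
    "SELECT SUM(out_pcs) FROM out_log WHERE ? = ?", (yearly, 2018)
).fetchone()[0]

# Fixed filter: compare the year of the column itself.
fixed = conn.execute(
    "SELECT SUM(out_pcs) FROM out_log "
    "WHERE CAST(strftime('%Y', logged_at) AS INTEGER) = ?",
    (2018,),
).fetchone()[0]

print(broken, fixed)
```

The first query yields no sum at all because the filter rejects every row before the SELECT list is evaluated, which is the ordering described above.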
Why your code is not working is well explained by @Edward.
Here is a working code:
Create PROCEDURE [dbo].[Yearly]
@year int
AS
BEGIN
SELECT SUM([dbo].[Out].[OutPcs])
FROM [dbo].[Out]
WHERE YEAR([dbo].[Out].[DateTime]) = @year
END;
You forgot to return @summ.
The @yearly variable is not necessary,
and neither is the GROUP BY.
Create PROCEDURE [dbo].[Yearly]
@year int
AS
BEGIN
DECLARE @summ int
SELECT @summ = SUM([dbo].[Out].[OutPcs])
FROM [dbo].[Out]
WHERE YEAR([dbo].[Out].[DateTime]) = @year
Return @summ
END;
I have searched the Internet for days with no success; maybe I am not asking the question in the right way.
I have a sql table like this:
create table Items
(
Id int identity(1,1),
OrderNumber varchar(7),
ItemName varchar(255),
Count int
)
Then I have a stored procedure inserting items, on demand creating new OrderNumber:
create procedure spx_insertItems
@insertNewOrderNr bit,
@orderNumber varchar(7),
@itemName varchar(255),
@count int
as
begin
set nocount on;
if (@insertNewOrderNr = 1)
begin
declare @newNr varchar(7) = (select dbo.fun_getNewOrderNr())
INSERT INTO Items (OrderNumber, ItemName, Count) values (@newNr, @itemName, @count)
select @newNr
end
else
begin
INSERT INTO Items (OrderNumber, ItemName, Count) values (@orderNumber, @itemName, @count)
select scope_identity()
end
end
Finally there is a user defined function returning new OrderNumber:
create function dbo.fun_getNewOrderNr
()
returns varchar(7)
as
begin
/* this func works well */
declare @output varchar(7)
declare @currentMaxNr varchar(7)
set @currentMaxNr = isnull((select max(OrderNumber) from Items), 'some_default_value_here')
/* lets assume the @currentMaxNr is '01/2014', here comes logic that increments to @newValue='02/2014' and sets to @output, so: */
set @output = @newValue
return @output
end
Items can be inserted into the table both with and without an OrderNumber.
When an item should start a new order, the procedure is called with @insertNewOrderNr=1 and returns the new order number, which is then used to insert further items for that order with @insertNewOrderNr=0.
Occasionally two requests with @insertNewOrderNr=1 arrive simultaneously, and THERE IS THE PROBLEM: items that should belong to different orders get the same OrderNumber.
I tried using a transaction, with no success.
I cannot modify the table structure.
What is the right way to ensure that the same new OrderNumber is never used twice when simultaneous requests reach the procedure?
I have been stuck on this for a long time. Please help.
You will have that problem as long as you use MAX(OrderNumber).
You might consider using sequences:
Create sequence
CREATE SEQUENCE dbo.OrderNumbers
AS int
START WITH 1
INCREMENT BY 1
NO CACHE;
GO
CREATE SEQUENCE dbo.OrderNumberYear
AS int
START WITH 2014
INCREMENT BY 1
NO CACHE;
SELECT NEXT VALUE FOR dbo.OrderNumberYear; --Important, run this ONE time after creation, this years value must be returned one initial time to work correctly.
Insert code
DECLARE @orderNumberYear INT = (SELECT CONVERT(INT, current_value) FROM sys.sequences WHERE name = 'OrderNumberYear');
IF(@orderNumberYear < YEAR(GETDATE()))
BEGIN
SELECT @orderNumberYear = NEXT VALUE FOR dbo.OrderNumberYear;
ALTER SEQUENCE dbo.OrderNumbers RESTART WITH 1 ;
END
IF(@orderNumberYear != YEAR(GETDATE()))
RAISERROR(N'Order year sequence is out of sync.', 16, 1);
DECLARE @newNr VARCHAR(15) = CONCAT(FORMAT(NEXT VALUE FOR dbo.OrderNumbers, '00/', 'en-US'), @orderNumberYear);
INSERT INTO Items (OrderNumber, ItemName, Count) values (@newNr, @itemName, @count)
SELECT @newNr
The duplicates still occurred; not as often, but they did.
The trick I finally used to get around this was not inside SQL at all. Since the DB is only ever used by ONE web app (which has several users), this is the solution:
in all (about 5) places in my VB.NET code I wrapped the myCommand.ExecuteScalar() call in a SyncLock statement, and that DID the trick :)
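The same idea, serializing the "read the current maximum, then insert" step inside the application, can be sketched in Python, where `threading.Lock` plays the role of VB.NET's SyncLock. Everything here is illustrative: the list `orders` stands in for the Items table, and the function names are made up. Such a lock only protects callers within a single process, so it relies on the assumption stated above that one web app instance is the only writer.

```python
import threading

orders = []                      # stands in for the OrderNumber column of Items
_order_lock = threading.Lock()   # one lock shared by every caller in this process


def next_order_number(year: int = 2014) -> str:
    """Mimic MAX(OrderNumber) + 1 for values such as '01/2014'."""
    current_max = max((int(o.split("/")[0]) for o in orders), default=0)
    return f"{current_max + 1:02d}/{year}"


def insert_new_order() -> str:
    # Reading the maximum and inserting must happen as one unit; without the
    # lock, two threads could read the same maximum and insert duplicates.
    with _order_lock:
        number = next_order_number()
        orders.append(number)
    return number


threads = [threading.Thread(target=insert_new_order) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(orders), len(set(orders)))
```

With the lock in place every one of the 50 simulated requests receives a distinct order number.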
I have a stored procedure which occasionally times out when called from our website (through the website connection pool). Once it has timed out, it has always been locked into the time-out, until the procedure is recompiled using drop/create or sp_recompile from a Management Studio session.
While it is timing out, there is no time-out using the same parameters for the same procedure using Management Studio.
Doing an "ALTER PROCEDURE" through Management Studio and (fairly drastically) changing the internal execution of the procedure did NOT clear the time out - it wouldn't clear until a full sp_recompile was run.
The stored procedure ends with OPTION (RECOMPILE)
The procedure calls two functions, which are used ubiquitously throughout the rest of the product. The other procedures which use these functions (in similar ways) all work, even during a period where the procedure in question is timing out.
If anyone can offer any further advice as to what could be causing this time out it would be greatly appreciated.
The stored procedure is as below:
ALTER PROCEDURE [dbo].[sp_g_VentureDealsCountSizeByYear] (
@DateFrom AS DATETIME = NULL
,@DateTo AS DATETIME = NULL
,@ProductRegion AS INT = NULL
,@PortFirmID AS INT = NULL
,@InvFirmID AS INT = NULL
,@SpecFndID AS INT = NULL
) AS BEGIN
-- Returns the stats used for Market Overview
DECLARE @IDs AS IDLIST
INSERT INTO @IDs
SELECT IDs
FROM dbo.fn_VentureDealIDs(@DateFrom,@DateTo,@ProductRegion,@PortFirmID,@InvFirmID,@SpecFndID)
CREATE TABLE #DealSizes (VentureID INT, DealYear INT, DealQuarter INT, DealSize_USD DECIMAL(18,2))
INSERT INTO #DealSizes
SELECT vDSQ.VentureID, vDSQ.DealYear, vDSQ.DealQuarter, vDSQ.DealSize_USD
FROM dbo.fn_VentureDealsSizeAndQuarter(@IDs) vDSQ
SELECT
yrs.Years Heading
,COUNT(vDSQ.VentureID) AS Num_Deals
,SUM(vDSQ.DealSize_USD) AS DealSize_USD
FROM tblYears yrs
LEFT OUTER JOIN #DealSizes vDSQ ON vDSQ.DealYear = yrs.Years
WHERE (
((@DateFrom IS NULL) AND (yrs.Years >= (SELECT MIN(DealYear) FROM #DealSizes))) -- If no minimum year has been passed through, take all years from the first year found to the present.
OR
((@DateFrom IS NOT NULL) AND (yrs.Years >= DATEPART(YEAR,@DateFrom))) -- If a minimum year has been passed through, take all years from that specified to the present.
) AND (
((@DateTo IS NULL) AND (yrs.Years <= (SELECT MAX(DealYear) FROM #DealSizes))) -- If no maximum year has been passed through, take all years up to the last year found.
OR
((@DateTo IS NOT NULL) AND (yrs.Years <= DATEPART(YEAR,@DateTo))) -- If a maximum year has been passed through, take all years up to that year.
)
GROUP BY yrs.Years
ORDER BY Heading DESC
OPTION (RECOMPILE)
END
If you want the whole procedure recompiled each time it is executed, declare it WITH RECOMPILE; your syntax recompiles only the last SELECT:
ALTER PROCEDURE [dbo].[sp_g_VentureDealsCountSizeByYear] (
@DateFrom AS DATETIME = NULL
,@DateTo AS DATETIME = NULL
,@ProductRegion AS INT = NULL
,@PortFirmID AS INT = NULL
,@InvFirmID AS INT = NULL
,@SpecFndID AS INT = NULL
) WITH RECOMPILE
I cannot tell which part of your procedure causes the problem. You might try commenting out the final SELECT to see whether filling the temp tables from the table functions is the performance issue; if it is not, then the query itself is the problem. You might rewrite the filter as follows:
WHERE (@DateFrom IS NULL OR yrs.Years >= DATEPART(YEAR,@DateFrom))
AND (@DateTo IS NULL OR yrs.Years <= DATEPART(YEAR,@DateTo))
Or, perhaps better, declare @startYear and @endYear variables, set them accordingly and change the WHERE clause like this:
declare @startYear int
set @startYear = isnull (year(@DateFrom), (SELECT MIN(DealYear) FROM #DealSizes))
declare @endYear int
set @endYear = isnull (year(@DateTo), (SELECT MAX(DealYear) FROM #DealSizes))
...
where yrs.Years between @startYear and @endYear
If WITH RECOMPILE does not solve the problem, and removing the last query does not help either, then you need to check the table functions you use to gather the data.
I am trying to keep a rolling checksum to account for order, so take the previous 'checksum' and xor it with the current one and generate a new checksum.
Name Checksum Rolling Checksum
------ ----------- -----------------
foo 11829231 11829231
bar 27380135 checksum(27380135 ^ 11829231) = 93291803
baz 96326587 checksum(96326587 ^ 93291803) = 67361090
How would I accomplish something like this?
(Note that the calculations are completely made up and are for illustration only)
This is basically the running total problem.
Edit:
My original claim was that this is one of the few places where a cursor-based solution actually performs best. The problem with the triangular self-join solution is that it repeatedly recalculates the same cumulative checksum as a subcalculation for the next step, so it is not very scalable: the work required grows quadratically with the number of rows.
Corina's answer uses the "quirky update" approach. I've adjusted it to do the checksum and in my test found that it took 3 seconds rather than 26 seconds for the cursor solution. Both produced the same results. Unfortunately, however, it relies on an undocumented aspect of UPDATE behaviour. I would definitely read the discussion here before deciding whether to rely on this in production code.
There is a third possibility described here (using the CLR) which I didn't have time to test. From the discussion here, though, it seems to be a good option for calculating running-total-type values at display time, but it is outperformed by the cursor when the result of the calculation must be saved back.
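To show why a single ordered pass scales better than the triangular approach, here is a small Python sketch of the rolling checksum. It is an illustration only: `zlib.crc32` is an assumed stand-in for SQL Server's CHECKSUM(), so the resulting numbers will not match SQL Server's, and the function names are made up. The input values are the made-up checksums from the question.

```python
import zlib
from itertools import accumulate


def checksum(value: int) -> int:
    # Stand-in for CHECKSUM(); any deterministic 32-bit hash serves the purpose.
    return zlib.crc32(value.to_bytes(4, "little"))


def rolling_checksums(checksums):
    # Single pass: each step reuses the previous rolling value (n steps in total).
    return list(accumulate(checksums, lambda prev, cur: checksum(cur ^ prev)))


def rolling_checksums_triangular(checksums):
    # Triangular version: recomputes the whole prefix for every row,
    # i.e. 1 + 2 + ... + n steps.
    return [
        rolling_checksums(checksums[: i + 1])[-1] for i in range(len(checksums))
    ]


row_checksums = [11829231, 27380135, 96326587]  # foo, bar, baz
rolling = rolling_checksums(row_checksums)
print(rolling)
```

Both functions return the same list, but only the single pass avoids redoing earlier work, which is what the cursor and the quirky update achieve in the T-SQL below.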
CREATE TABLE TestTable
(
PK int identity(1,1) primary key clustered,
[Name] varchar(50),
[CheckSum] AS CHECKSUM([Name]),
RollingCheckSum1 int NULL,
RollingCheckSum2 int NULL
)
/*Insert some random records (753,571 on my machine)*/
INSERT INTO TestTable ([Name])
SELECT newid() FROM sys.objects s1, sys.objects s2, sys.objects s3
Approach One: Based on the Jeff Moden Article
DECLARE @RCS int
UPDATE TestTable
SET @RCS = RollingCheckSum1 =
CASE WHEN @RCS IS NULL THEN
[CheckSum]
ELSE
CHECKSUM([CheckSum] ^ @RCS)
END
FROM TestTable WITH (TABLOCKX)
OPTION (MAXDOP 1)
Approach Two - Using the same cursor options as Hugo Kornelis advocates in the discussion for that article.
SET NOCOUNT ON
BEGIN TRAN
DECLARE @RCS2 INT
DECLARE @PK INT, @CheckSum INT
DECLARE curRollingCheckSum CURSOR LOCAL STATIC READ_ONLY
FOR
SELECT PK, [CheckSum]
FROM TestTable
ORDER BY PK
OPEN curRollingCheckSum
FETCH NEXT FROM curRollingCheckSum
INTO @PK, @CheckSum
WHILE @@FETCH_STATUS = 0
BEGIN
SET @RCS2 = CASE WHEN @RCS2 IS NULL THEN @CheckSum ELSE CHECKSUM(@CheckSum ^ @RCS2) END
UPDATE dbo.TestTable
SET RollingCheckSum2 = @RCS2
WHERE @PK = PK
FETCH NEXT FROM curRollingCheckSum
INTO @PK, @CheckSum
END
-- Release the cursor explicitly once all rows are processed
CLOSE curRollingCheckSum
DEALLOCATE curRollingCheckSum
COMMIT
Test they are the same
SELECT * FROM TestTable
WHERE RollingCheckSum1<> RollingCheckSum2
I'm not sure about a rolling checksum, but for a rolling sum for instance, you can do this using the UPDATE command:
declare @a table (name varchar(2), value int, rollingvalue int)
insert into @a
select 'a', 1, 0 union all select 'b', 2, 0 union all select 'c', 3, 0
select * from @a
declare @sum int
set @sum = 0
update @a
set @sum = rollingvalue = value + @sum
select * from @a
Select Name, Checksum
, (Select Checksum_Agg(T1.Checksum)
From Table As T1
Where T1.Name < T.Name) As RollingChecksum
From Table As T
Order By T.Name
To do a rolling anything, you need some semblance of an order to the rows. That can be by name, an integer key, a date or whatever. In my example, I used name (even though the order in your sample data isn't alphabetical). In addition, I'm using the Checksum_Agg function in SQL.
In addition, you would ideally have a unique value on which you compare the inner and outer query. E.g., Where T1.PK < T.PK for an integer key or even string key would work well. In my solution if Name had a unique constraint, it would also work well enough.