Optimising CTE for recursive queries - sql

I have a table with a self-join. You can think of the structure as a standard table representing an organisational hierarchy, e.g.:
MemberId
MemberName
RelatedMemberId
This table contains 50,000 sample records. I wrote a recursive CTE query and it works absolutely fine. However, it takes around 3 minutes to process just those 50,000 records on my machine (4GB RAM, 2.4 GHz Core2Duo, 7200 RPM HDD).
How can I improve the performance? 50,000 is not such a huge number, and over time it will keep increasing. This is the query exactly as I have it in my stored procedure. Its purpose is to select all the members that come under a specific member; e.g. under the owner of the company every person comes, and for a manager every record except the owner gets returned. I hope you understand the query's purpose.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE spGetNonVirtualizedData
(
    @MemberId int
)
AS
BEGIN
    WITH MembersCTE AS
    (
        SELECT parent.MemberId AS MemberId, 0 AS Level
        FROM Members AS parent
        WHERE ISNULL(MemberId, 0) = ISNULL(@MemberId, 0)
        UNION ALL
        SELECT child.MemberId AS MemberId, Level + 1 AS Level
        FROM Members AS child
        INNER JOIN MembersCTE ON MembersCTE.MemberId = child.RelatedMemberId
    )
    SELECT Members.*
    FROM MembersCTE
    INNER JOIN Members ON MembersCTE.MemberId = Members.MemberId
    OPTION (MAXRECURSION 0)
END
GO
As you can see, to improve the performance I have made the join only in the last step, when selecting the records, so that unnecessary records do not get inserted into the temp table. If I make the joins in the base and recursive steps of the CTE (instead of in the final SELECT), the query takes 20 minutes to execute!
MemberId is primary key in the table.
Thanks in advance :)

In your anchor condition you have Where IsNull(MemberId,0) = IsNull(@MemberId,0). I assume this is just because when you pass NULL as a parameter, = doesn't work in terms of bringing back IS NULL values. This will cause a scan rather than a seek.
Use WHERE MemberId = @MemberId OR (@MemberId IS NULL AND MemberId IS NULL) instead, which is sargable.
Also, I'm assuming that you have an index on RelatedMemberId. If not, you should add one:
CREATE NONCLUSTERED INDEX ix_name ON Members(RelatedMemberId) INCLUDE (MemberId)
(though you can skip the included column bit if MemberId is the clustered index key as it will be included automatically)
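For readers who want to experiment with the shape of this fix, here is a minimal, runnable sketch of the same pattern in SQLite via Python's sqlite3 (SQLite spells the construct WITH RECURSIVE). The table mirrors the question's Members structure; the sample rows and the ix_members_related index name are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Members (
    MemberId        INTEGER PRIMARY KEY,
    MemberName      TEXT,
    RelatedMemberId INTEGER REFERENCES Members(MemberId)
);
-- The index the answer recommends: it lets the recursive step seek
-- children by parent instead of scanning the whole table each level.
CREATE INDEX ix_members_related ON Members(RelatedMemberId);
INSERT INTO Members VALUES
    (1, 'Owner',   NULL),
    (2, 'Manager', 1),
    (3, 'Lead',    2),
    (4, 'Dev',     3),
    (5, 'Dev',     3);
""")

def members_under(member_id):
    # Anchor: the requested member at level 0; recursive step: its children.
    return con.execute("""
        WITH RECURSIVE MembersCTE(MemberId, Level) AS (
            SELECT MemberId, 0 FROM Members WHERE MemberId = ?
            UNION ALL
            SELECT child.MemberId, Level + 1
            FROM Members AS child
            JOIN MembersCTE ON MembersCTE.MemberId = child.RelatedMemberId
        )
        SELECT m.MemberId, m.MemberName, c.Level
        FROM MembersCTE AS c
        JOIN Members AS m ON m.MemberId = c.MemberId
        ORDER BY c.Level, m.MemberId
    """, (member_id,)).fetchall()

print(members_under(2))  # Manager and everyone below
```

As in the answer, the extra member columns are only joined in at the end, so the recursion itself carries just the key and the level.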

Related

Performance - Select query with left join and null check

I have two different tables, called Processing (30M records for now) and EtlRecord (4.3M records for now).
As the names of the tables suggest, they will be used for normalising data with ETL.
We are trying to process records in batches of 1000 records each.
SELECT TOP 1000 P.StreamGuid
FROM [staging].[Processing] P (NOLOCK)
LEFT JOIN [core].[EtlRecord] E (NOLOCK) ON E.StreamGuid = P.StreamGuid
WHERE E.StreamGuid IS NULL
AND P.CompleteDate IS NOT NULL
AND P.StreamGuid IS NOT NULL
Execution of this query currently takes around 20 seconds, and we are expecting more and more data, especially in the EtlRecord table. To improve the performance of this query I checked the actual execution plan, which I shared below.
As you can see, the most time-consuming part is the index seek used to determine the null (unmatched) records in the EtlRecord table. I have tried several changes but couldn't improve it.
Additional notes
All the indices suggested by the execution plan have already been applied to the tables, so there are no further index suggestions.
There are 8 columns in Processing table which are mostly boolean flags and 4 columns in EtlRecord table.
EtlRecord table is only used by single procedure. So there is no issue with transaction lock.
Any suggestions to improve this query will be really helpful.
Well, in your query you need to get the records from [staging].[Processing] which have no corresponding record in [core].[EtlRecord].
You can remove the already-processed records first.
DELETE [staging].[Processing]
FROM [staging].[Processing] P
INNER JOIN [core].[EtlRecord] E
ON E.StreamGuid = P.StreamGuid;
You can do the deletion in batches if you need to. Removing these records will simplify your initial query and the nasty join by uniqueidentifier. Then you simply need to do something like this for each batch:
SELECT TOP 1000 StreamGuid
INTO #buffer
FROM [staging].[Processing]
WHERE CompleteDate IS NOT NULL
AND StreamGuid IS NOT NULL;
-- do whatever you need with these records
DELETE FROM [staging].[Processing]
WHERE StreamGuid IN (SELECT StreamGuid FROM #buffer);
Also, you have said that you have all the indexes created, but the indexes suggested by the execution plan are not always the best. This part here:
WHERE CompleteDate IS NOT NULL
AND StreamGuid IS NOT NULL;
seems like a very good candidate for a filtered index, especially if a large share of the rows have a NULL value in one of these columns.
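As a hedged illustration of the filtered-index idea: SQLite's partial indexes behave analogously to SQL Server's filtered indexes and can be tried from Python. The table, data and index name below are invented for the sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Processing (StreamGuid TEXT, CompleteDate TEXT);
-- Partial ("filtered") index: only rows that can ever match the
-- query's WHERE clause are stored in the index.
CREATE INDEX ix_processing_complete
    ON Processing(CompleteDate, StreamGuid)
    WHERE CompleteDate IS NOT NULL AND StreamGuid IS NOT NULL;
""")
con.executemany("INSERT INTO Processing VALUES (?, ?)",
                [("g1", "2020-01-01"), ("g2", None), (None, "2020-01-02")])

# Because the query's WHERE clause implies the index's WHERE clause,
# the optimizer is free to use the small, covering partial index.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT StreamGuid FROM Processing
    WHERE CompleteDate IS NOT NULL AND StreamGuid IS NOT NULL
""").fetchall()
print(plan)

rows = con.execute("""
    SELECT StreamGuid FROM Processing
    WHERE CompleteDate IS NOT NULL AND StreamGuid IS NOT NULL
""").fetchall()
print(rows)
```

The win is the same in both engines: the index only contains the non-NULL rows, so it stays small even when most of the table fails the filter.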
First, DDL and easily consumable sample data, like below, will help a great deal. You can copy/paste my solutions and run them locally to see what I'm talking about.
IF OBJECT_ID('tempdb..#processing','U') IS NOT NULL DROP TABLE #processing;
IF OBJECT_ID('tempdb..#EtlRecord','U') IS NOT NULL DROP TABLE #EtlRecord;
SELECT TOP (100)
StreamGuid = NEWID(),
CompleteDate = CASE WHEN CHECKSUM(NEWID())%3 < 2 THEN GETDATE() END
INTO #processing
FROM sys.all_columns AS a
SELECT TOP (80) p.StreamGuid
INTO #EtlRecord
FROM #Processing AS p;
ALTER TABLE #processing ALTER COLUMN StreamGuid UNIQUEIDENTIFIER NOT NULL;
ALTER TABLE #EtlRecord ALTER COLUMN StreamGuid UNIQUEIDENTIFIER NOT NULL;
GO
ALTER TABLE #processing ADD CONSTRAINT pk_processing PRIMARY KEY CLUSTERED(StreamGuid);
ALTER TABLE #etlRecord ADD CONSTRAINT pk_etlRecord PRIMARY KEY CLUSTERED(StreamGuid);
GO
Next, understand that without an ORDER BY clause your query is not guaranteed to return the same records each time. For example, if SQL Server picks a parallel execution plan you can get different rows. I have also seen cases where including the ORDER BY actually improves performance.
With that in mind, note that this...
SELECT --TOP 1000
P.StreamGuid
FROM #processing AS p
LEFT JOIN #etlRecord AS e ON e.StreamGuid = p.StreamGuid
WHERE e.StreamGuid IS NOT NULL
AND P.CompleteDate IS NOT NULL
... will return the exact same thing as this:
SELECT TOP 1000
P.StreamGuid
FROM #processing AS p
JOIN #etlRecord AS e ON e.StreamGuid = p.StreamGuid
WHERE p.CompleteDate IS NOT NULL;
Note that the predicate e.StreamGuid = p.StreamGuid already implies that both values are NOT NULL, because NULL never compares equal to anything. Note that this query...
DECLARE @X INT;
SELECT AreTheyEqual = IIF(@X=@X,'Yep','Nope');
... returns:
AreTheyEqual
------------
Nope
I agree with the filtered-index solution @gotqn posted. Using my sample data, you can add something like this:
CREATE NONCLUSTERED INDEX nc_processing ON #processing(CompleteDate,StreamGuid)
WHERE CompleteDate IS NOT NULL;
Then you can add an ORDER BY CompleteDate to the query to coerce the optimizer into choosing that index (on my system it doesn't pick the index unless I add the ORDER BY). The ORDER BY also makes your query deterministic and more predictable.
I would suggest writing this as:
SELECT TOP 1000 P.StreamGuid
FROM [staging].[Processing] P
WHERE P.CompleteDate IS NOT NULL AND
P.StreamGuid IS NOT NULL AND
NOT EXISTS (SELECT 1
FROM [core].[EtlRecord] E
WHERE E.StreamGuid = P.StreamGuid
);
I removed the NOLOCK directive. Only use it if you really know what you are doing -- and are prepared to read invalid data.
Then you definitely want an index on EtlRecord(StreamGuid).
You probably also want an index on Processing(CompleteDate, StreamGuid). This is at least a covering index for the query.
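To see that the original LEFT JOIN ... IS NULL anti-join and the NOT EXISTS form select the same rows, here is a small runnable sketch (Python + SQLite; the tables are cut down to the relevant columns and the data is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Processing (StreamGuid TEXT, CompleteDate TEXT);
CREATE TABLE EtlRecord  (StreamGuid TEXT PRIMARY KEY);
INSERT INTO Processing VALUES ('g1','2020-01-01'), ('g2','2020-01-02'),
                              ('g3', NULL), (NULL, '2020-01-03');
INSERT INTO EtlRecord VALUES ('g1');  -- g1 is already processed
""")

# Anti-join, as in the question: keep rows with no match in EtlRecord.
left_join = con.execute("""
    SELECT P.StreamGuid
    FROM Processing P
    LEFT JOIN EtlRecord E ON E.StreamGuid = P.StreamGuid
    WHERE E.StreamGuid IS NULL
      AND P.CompleteDate IS NOT NULL
      AND P.StreamGuid IS NOT NULL
""").fetchall()

# Same intent, spelled as NOT EXISTS (the rewrite suggested above).
not_exists = con.execute("""
    SELECT P.StreamGuid
    FROM Processing P
    WHERE P.CompleteDate IS NOT NULL
      AND P.StreamGuid IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM EtlRecord E
                      WHERE E.StreamGuid = P.StreamGuid)
""").fetchall()

print(left_join, not_exists)  # both: [('g2',)]
```

Both forms express the same anti-join; NOT EXISTS just states the intent directly, which often helps the optimizer.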

How to find tree nodes that don't have child nodes

Firebird Db stores chart accounts records in table:
CREATE TABLE CHARTACC
(
ACCNTNUM Char(8) NOT NULL, -- Account ID (Primary Key)
ACCPARNT Char(8), -- Parent ID
ACCCOUNT Integer, -- account count
ACCORDER Integer, -- order of children in nodes
ACCTITLE varchar(150),
ACDESCRP varchar(4000),
DTCREATE timestamp -- date and time of creation
)
I must write a query which selects only the leaf nodes from the table, i.e. the nodes which have no child nodes (child2, child3, subchild1, subchild2, subchild3 and subchild4).
The NOT IN approach suggested by Jerry typically works quite slowly in the Interbase/Firebird/Yaffil/RedDatabase family: no indices are used, etc.
The same goes for another possible representation, SELECT X FROM t1 WHERE NOT EXISTS (SELECT * FROM t2 WHERE t2.a = t1.b): it can turn out really slow too.
I agree that those queries better represent what the human wanted and hence are more readable, but they are still not recommended on Firebird. I was badly bitten in the 1990s when doing a Herbalife-like app: I chose this type of request wrapped in a loop to do monthly bottom-up tallying (UPDATE ... WHERE NOT EXISTS ...) and every iteration scaled as O(n^2) in Interbase 5.5. Granted, Firebird 3 has come a long way since then, but this "direct" approach is still not recommended.
The more SQL-traditional and FB-friendly way to express it, albeit less direct and harder to read, is SELECT t1.x FROM t1 LEFT JOIN t2 ON t1.a = t2.b WHERE t2.y IS NULL
Your query needs to work something like:
select * from CHARTACC where ACCNTNUM not in (select ACCPARNT from CHARTACC)
In plain terms: select the items from this table whose identifier is not found anywhere in the same table's parent field.
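A runnable sketch of both leaf-node queries follows (Python + SQLite, invented sample data). One caveat worth noting: with NOT IN, the root's NULL parent must be filtered out of the subquery, otherwise the whole query returns no rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE CHARTACC (
    ACCNTNUM TEXT PRIMARY KEY,   -- Account ID
    ACCPARNT TEXT                -- Parent ID (NULL for the root)
);
INSERT INTO CHARTACC VALUES
    ('root', NULL), ('child1', 'root'), ('child2', 'root'),
    ('sub1', 'child1');
""")

# NOT IN variant: NULL must be excluded from the subquery, because
# "x NOT IN (..., NULL)" is never true.
not_in = con.execute("""
    SELECT ACCNTNUM FROM CHARTACC
    WHERE ACCNTNUM NOT IN (SELECT ACCPARNT FROM CHARTACC
                           WHERE ACCPARNT IS NOT NULL)
    ORDER BY ACCNTNUM
""").fetchall()

# LEFT JOIN ... IS NULL variant, the Firebird-friendly form.
left_join = con.execute("""
    SELECT t1.ACCNTNUM
    FROM CHARTACC t1
    LEFT JOIN CHARTACC t2 ON t2.ACCPARNT = t1.ACCNTNUM
    WHERE t2.ACCNTNUM IS NULL
    ORDER BY t1.ACCNTNUM
""").fetchall()

print(not_in, left_join)  # both: [('child2',), ('sub1',)]
```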

Bad performance when not selecting a specific column from a view

Using SQL Server 2016 SP1. I have a view Users that goes like
SELECT
ROW_NUMBER() OVER (ORDER BY ID) AS DataModelID, *
FROM
(Some query) AS tbl
I then select from it
SELECT
U1.ID UserId, U1.IdentityNumber IdentityNumber,
U1.ArabicFirstName, U1.ArabicSecondName
FROM
USERS U1
LEFT JOIN
USERS U2 ON U1.IdentityNumber = U2.IdentityNumber
AND U1.ID <> U2.ID
AND U1.RoleId = 2
WHERE
U2.ID IS NOT NULL
AND U1.IdentityNumber <> ''
AND PATINDEX('[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', U1.IdentityNumber) = 1
The thing here is that with the above query, when I select * or include the DataModelID column it runs in 3 seconds, but when I select any set of columns without that one it runs for more than 2 minutes.
Why is this happening: running faster when an extra column is included?
I tried everything to clear the cache and ran it multiple times, and it gives the same results.
Without seeing the actual execution plan there is no way to say for sure but, as @mvisser mentioned, the likely cause is that the optimizer is choosing a better index when you do a
SELECT * or include the column DataModelID than when you don't. There are a number of solutions here; one suggestion would be to look at the execution plan for the queries that run in 3 seconds, note what index is being used, and use an index hint (see section G) to force the optimizer to use that index in your queries that don't reference those columns. I would not suggest this though: there are too many unanswered variables to consider this a viable option.
Here's what I recommend:
First, as @Lukasz Szozda mentioned, this is not SARGable:
AND PATINDEX( '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]',U1.IdentityNumber) = 1
But this is:
U1.IdentityNumber LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
So I'd fix that first. Next, the fastest, most sure-fire way to resolve this is to simply include DataModelID in your queries even if you don't need them. You can either filter that column out at the application level or create a stored proc that populates a temp table then, for the final result set, you can retrieve your results from that temp table excluding DataModelID.
OPTION #2
You can create an indexed view on your USERS table that looks something like this (note that an indexed view cannot contain a self-join, so the second reference to USERS stays in your query):
CREATE VIEW dbo.vwUSERS_clean
WITH SCHEMABINDING AS
SELECT U1.ID, U1.IdentityNumber, U1.ArabicFirstName,
U1.ArabicSecondName, U1.DataModelID
FROM dbo.USERS U1
WHERE U1.IdentityNumber <> ''
AND U1.IdentityNumber LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]';
GO
Then create a unique clustered index on it. Next, change the query that you posted to reference your indexed view (e.g. change both references to USERS to dbo.vwUSERS_clean WITH (NOEXPAND)).
Note that ROW_NUMBER is not allowed in indexed views but, if you make ID your clustered index (or the first column in a composite clustered index), there will be no cost to adding ROW_NUMBER() OVER (ORDER BY ID) to queries that reference that indexed view.

SQL Server query runs slower when nothing is returned

My query runs slowly when the result set is empty. When there is something to return, it is lightning fast.
;with tree(NodeId,CategoryId,ParentId) as (
select ct.NodeId, ct.CategoryId, ct.ParentId
from dbo.CategoryTree as ct
where ct.ParentId = 6
union all
select t.NodeId, t.CategoryId, t.ParentId from dbo.CategoryTree as t
inner join tree as t2 on t.ParentId = t2.NodeId
), branch(NodeId,CategoryId,ParentId) as
(
select NodeId, CategoryId, ParentId from dbo.CategoryTree as t
where t.NodeId = 6
union all
select NodeId, CategoryId, ParentId
from tree as t
),facil(FacilityId) as(
select distinct fct.FacilityId
from dbo.FacilitiesCategoryTree as fct
inner join branch b on b.NodeId = fct.CategoryNodeId
)
select top 51 f.Id, f.CityId, f.NameGEO,
f.NameENG, f.NameRUS, f.DescrGEO, f.DescrENG,
f.DescrRUS, f.MoneyMin, f.MoneyAvg, f.Lat, f.Lng, f.SortIndex,
f.FrontImgUrl from dbo.Facilities f
inner join facil t2 on t2.FacilityId = f.Id
and f.EnabledUntil > 'Jan 14 2015 10:23PM'
order by f.SortIndex
Principal tables are:
Facilities table holds facilities, 256k records.
CategoryTree is used to group categories in a hierarchy.
NodeId int,
CategoryId int,
ParentId int
FacilitiesCategoryTree is used to link CategoryTree to Facilities.
Given a NodeId, the second CTE returns all the nodes that are descendants of the given node, including the node itself. Then a third CTE returns the facility ids that belong to these nodes.
Finally, the last CTE is joined to actual facilities table. The result is ordered by SortIndex which is used to manually indicate the order of facilities.
This query runs very fast when there is something to return even if I include many more predicates including full-text search and others, but when the given branch does not have any facilities, this query takes approx. 2 seconds to run.
If I exclude the order by clause, the query runs very fast again. All these tables are indexed and the query optimizer does not suggest any improvements.
What do you think is the problem and what can be done to improve the performance of queries with empty results?
Thank you.
Update1:
I am adding execution plans.
http://www.filedropper.com/withorderby
http://www.filedropper.com/withoutorderby
Update2:
I went through the recommendations of oryol and tried to save facility IDs from tree to the table variable and join it with facilities table and order by SortIndex. It eliminated the problem with empty results, but increased the execution time of queries with a result set from 250ms to 950ms.
I also changed the query to select from facil and join to the Facilities and added option (force order). The result was the same as above.
Finally, I denormalized facility/category mapping table to include SortIndex in this table. It increased the execution time of normal queries slightly from 250ms to 300ms, but it resolved the empty result set problem. I guess, I’ll stick to this method.
The first thing: you can slightly simplify the first two CTEs into just one:
with tree(NodeId,CategoryId,ParentId) as (
select ct.NodeId, ct.CategoryId, ct.ParentId
from dbo.CategoryTree as ct
where ct.NodeId = 6
union all
select t.NodeId, t.CategoryId, t.ParentId from dbo.CategoryTree as t
inner join tree as t2 on t.ParentId = t2.NodeId
)
The main problem is that the optimizer doesn't know, or incorrectly estimates, the number of facilities that will be returned for your categories. And because you need the facilities ordered by SortIndex, the optimizer decides to:
Go through all facilities ordered by SortIndex (using the appropriate index)
Skip rows which are not covered by the other filters (EnabledUntil)
For the given facility id, look for a matching row among the facilities from the category tree; if it exists, return a result row, otherwise skip this facility
Repeat these iterations until 51 rows have been returned
So, in the worst case (if there are no 51 such facilities, or they have very big SortIndex values) it will require a scan of the whole idx_Facilities_SortIndex index, and that requires a lot of time.
There are several ways to resolve this issue (including hints that tell the optimizer about row counts or join order); to find the best way it's better to work with the real database. The first option to try is to change the query to:
Save the facility IDs from tree into a table variable
Join it with the facilities table and order by SortIndex
Another option (which can also be combined with the first one) is to try the FORCE ORDER query hint. In that case you will need to modify your select statement to select from facil, join it to Facilities, and add the OPTION (FORCE ORDER) query hint to the end of the statement.
The query without ORDER BY selects all the facilities from the tree first, and only after that extracts the other facility fields from the facilities table.
Also, it's important to know the actual number of facilities in the tree (according to the estimates in the execution plan without ORDER BY it's really big: 395,982). Is this estimate (more or less) correct?
If you really have a big number of facilities returned after joining with the category tree and the facility/category mapping table, then the best solution will be to denormalize the facility/category mapping table to include SortIndex, and to add an index to this table by NodeId and SortIndex.
So actually, we need to test the queries/indexes with real data, or to know various statistics of the data:
Categories amount
Number of facilities per category and total number of rows in facilities / categories mapping table
SortIndex distribution (is it unique?)
etc.
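The "materialize the ids first, then join and sort" option discussed above can be sketched end-to-end like this (Python + SQLite; a temp table plays the table variable, and the schema and data are trimmed-down, invented stand-ins for the question's tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE CategoryTree (NodeId INTEGER, CategoryId INTEGER, ParentId INTEGER);
CREATE TABLE FacilitiesCategoryTree (FacilityId INTEGER, CategoryNodeId INTEGER);
CREATE TABLE Facilities (Id INTEGER PRIMARY KEY, SortIndex INTEGER);
INSERT INTO CategoryTree VALUES (6, 60, NULL), (7, 70, 6), (8, 80, 7);
INSERT INTO FacilitiesCategoryTree VALUES (100, 7), (101, 8);
INSERT INTO Facilities VALUES (100, 2), (101, 1), (102, 3);
""")

# Step 1: materialize the branch's facility ids into a temp table,
# walking the category tree with a recursive CTE.
con.executescript("""
CREATE TEMP TABLE facil AS
WITH RECURSIVE tree(NodeId) AS (
    SELECT NodeId FROM CategoryTree WHERE NodeId = 6
    UNION ALL
    SELECT t.NodeId FROM CategoryTree t JOIN tree ON t.ParentId = tree.NodeId
)
SELECT DISTINCT fct.FacilityId
FROM FacilitiesCategoryTree fct
JOIN tree ON tree.NodeId = fct.CategoryNodeId;
""")

# Step 2: join the (now known-size) id set to Facilities and sort.
rows = con.execute("""
    SELECT f.Id FROM Facilities f
    JOIN facil t ON t.FacilityId = f.Id
    ORDER BY f.SortIndex LIMIT 51
""").fetchall()
print(rows)  # [(101,), (100,)]
```

Splitting the work this way means the sort only ever sees the facilities that actually belong to the branch, rather than scanning the SortIndex index hoping to find matches.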

SQL - Temp Table: Storing all columns in temp table versus only Primary key

I would need to create a temp table for paging purposes. I would be selecting all records into a temp table and then do further processing with it.
I am wondering which of the following is a better approach:
1) Select all the columns of my Primary Table into the Temp Table and then being able to select the rows I would need
OR
2) Select only the primary key of the Primary Table into the Temp Table and then joining with the Primary Table later on?
Is there any size consideration when working with approach 1 versus approach 2?
[EDIT]
I am asking because I would have done the first approach but looking at PROCEDURE [dbo].[aspnet_Membership_FindUsersByName], that was included with ASP.NET Membership, they are doing Approach 2
[EDIT2]
For people without access to the stored procedure:
-- Insert into our temp table
INSERT INTO #PageIndexForUsers (UserId)
SELECT u.UserId
FROM dbo.aspnet_Users u, dbo.aspnet_Membership m
WHERE u.ApplicationId = @ApplicationId AND m.UserId = u.UserId AND u.LoweredUserName LIKE LOWER(@UserNameToMatch)
ORDER BY u.UserName
SELECT u.UserName, m.Email, m.PasswordQuestion, m.Comment, m.IsApproved,
m.CreateDate,
m.LastLoginDate,
u.LastActivityDate,
m.LastPasswordChangedDate,
u.UserId, m.IsLockedOut,
m.LastLockoutDate
FROM dbo.aspnet_Membership m, dbo.aspnet_Users u, #PageIndexForUsers p
WHERE u.UserId = p.UserId AND u.UserId = m.UserId AND
p.IndexId >= @PageLowerBound AND p.IndexId <= @PageUpperBound
ORDER BY u.UserName
If you have a non-trivial number of rows (more than about 100), then a table variable's performance is generally going to be worse than the temp table equivalent. But test it to make sure.
Option 2 would use less resources, because there is less data duplication.
Tony's points about this being a dirty read are really something you should be considering.
With approach 1, the data in the temp table may be out of step with the real data, i.e. if other sessions make changes to the real data. This may be OK if you are just viewing a snapshot of the data taken at a certain point, but would be dangerous if you were also updating the real table based on changes made to the temporary copy.
This is exactly the approach I use for paging on the server:
Create a table variable (why incur the overhead of transaction logging?) with just the key values. Create the table with an identity column as the primary key; this will be RowNum.
Insert keys into the table based on the user's sort/filtering criteria. The identity column is now a row number which can be used for paging.
Select from the table variable joined to the other tables with the real data required, joined on the key value,
Where RowNum Between ((PageNumber-1) * PageSize) + 1 And PageNumber * PageSize
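The steps above can be sketched like this (Python + SQLite; an AUTOINCREMENT key on a temp table plays the role of the table variable's identity column, and the sample users are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Users (UserId INTEGER PRIMARY KEY, UserName TEXT);
INSERT INTO Users VALUES (1,'carol'), (2,'alice'), (3,'dave'), (4,'bob');
-- The "table variable": an autoincrement key over just the row keys.
CREATE TEMP TABLE PageIndex (RowNum INTEGER PRIMARY KEY AUTOINCREMENT,
                             UserId INTEGER);
-- Insert in the user's sort order; the identity column becomes RowNum.
INSERT INTO PageIndex (UserId)
    SELECT UserId FROM Users ORDER BY UserName;
""")

def page(page_number, page_size):
    lo = (page_number - 1) * page_size + 1
    hi = page_number * page_size
    return con.execute("""
        SELECT u.UserName
        FROM PageIndex p JOIN Users u ON u.UserId = p.UserId
        WHERE p.RowNum BETWEEN ? AND ?
        ORDER BY p.RowNum
    """, (lo, hi)).fetchall()

print(page(1, 2), page(2, 2))
```

Only the keys are duplicated; the wide columns are fetched per page by joining back to the real table.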
Think about it this way. Suppose your query would return enough records to populate 1000 pages. How many users do you think would really look at all those pages? By returning only the ids, you aren't returning a lot of information you may or may not need to see. So it should save on network and server resources. And if they really do go through a lot of pages, it would take enough time that the data details might indeed need to be refreshed.
An alternative to paging (the way my company does it) is to use CTE's.
Check out this example from http://softscenario.blogspot.com/2007/11/sql-2005-server-side-paging-using-cte.html
CREATE PROC GetPagedEmployees (@NumbersOnPage INT = 25, @PageNumb INT = 1)
AS BEGIN
WITH AllEmployees AS
(SELECT ROW_NUMBER() OVER (ORDER BY [Person].[Contact].[LastName]) AS RowID,
[FirstName],[MiddleName],[LastName],[EmailAddress] FROM [Person].[Contact])
SELECT [FirstName],[MiddleName],[LastName],[EmailAddress]
FROM AllEmployees WHERE RowID BETWEEN
((@PageNumb - 1) * @NumbersOnPage) + 1 AND @PageNumb * @NumbersOnPage
ORDER BY RowID
END
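Here is a runnable sketch of the same CTE paging pattern (Python + SQLite 3.25+, which supports ROW_NUMBER; the Contact table and sample rows are invented stand-ins for [Person].[Contact]):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Contact (FirstName TEXT, LastName TEXT, EmailAddress TEXT);
INSERT INTO Contact VALUES
    ('Ann','Adams','a@x.com'), ('Bob','Baker','b@x.com'),
    ('Cid','Clark','c@x.com'), ('Dee','Doyle','d@x.com');
""")

def paged_contacts(page_numb, numbers_on_page=25):
    # Number the rows by LastName inside the CTE, then keep one page.
    return con.execute("""
        WITH AllEmployees AS (
            SELECT ROW_NUMBER() OVER (ORDER BY LastName) AS RowID,
                   FirstName, LastName, EmailAddress
            FROM Contact
        )
        SELECT FirstName, LastName, EmailAddress
        FROM AllEmployees
        WHERE RowID BETWEEN (? - 1) * ? + 1 AND ? * ?
        ORDER BY RowID
    """, (page_numb, numbers_on_page,
          page_numb, numbers_on_page)).fetchall()

print(paged_contacts(2, 2))  # the second page of two contacts
```

Unlike the temp-table approach, the row numbering here is recomputed on every call, so there is nothing to keep in sync, at the cost of re-sorting each time.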