SQL OFFSET total row count slow with IN clause

I am using the SQL below, based on another answer. However, when the IN clause is massive, getting the total count takes too long; if I remove the total count, the query takes less than 1 second. Is there a more efficient way to get the total row count? The answers I saw were based on queries from around 2013.
DECLARE
    @PageSize INT = 10,
    @PageNum  INT = 1;

WITH TempResult AS (
    SELECT ID, Name
    FROM Table
    WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
), TempCount AS (
    SELECT COUNT(*) AS MaxRows FROM TempResult
)
SELECT *
FROM TempResult,
     TempCount -- <-- this is what is slow. Removing this and the query is super fast
ORDER BY TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;

Step one for performance-related questions is to analyze your table/index structure and review the query plans. You haven't provided that information, so I'm going to make up my own and go from there.
I'm going to assume that you have a heap, with ~10M rows (12,872,738 for me):
DECLARE @MaxRowCount bigint = 10000000,
        @Offset bigint = 0;

DROP TABLE IF EXISTS #ExampleTable;
CREATE TABLE #ExampleTable
(
    ID bigint NOT NULL,
    Name varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);

WHILE @Offset < @MaxRowCount
BEGIN
    INSERT INTO #ExampleTable ( ID, Name )
    SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )),
           ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ))
    FROM master.dbo.spt_values SV
        CROSS APPLY master.dbo.spt_values SV2;

    SET @Offset = @Offset + ROWCOUNT_BIG();
END;
If I run the query provided over #ExampleTable, it takes about 4 seconds. The resulting query plan isn't great by any means, but it is hardly awful; running with live query stats shows that the cardinality estimates were at most off by one, which is fine.
Let's feed it a massive number of items in our IN list (5000 items, from 1 to 5000). Compiling the plan alone took 4 seconds.
I can get my number up to 15000 items before the query processor stops being able to handle it, with no change in query plan (though it takes a total of 6 seconds to compile). Running both queries takes about 5 seconds a pop on my machine.
This is probably fine for analytical workloads or for data warehousing, but for OLTP-like queries we've definitely exceeded our ideal time limit.
Let's look at some alternatives. We can probably do some of these in combination.
We could cache off the IN list in a temp table or table variable.
We could use a window function to calculate the count.
We could cache off our CTE in a temp table or table variable.
If on a sufficiently high SQL Server version, we could use batch mode.
We could change the indices on the table to make this faster (sketched just below).
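On that last point: for a query shaped like WHERE ID IN (...) ORDER BY Name, an index keyed on ID that covers Name lets the optimizer seek once per IN-list item instead of scanning the heap. A minimal sketch, assuming your real table resembles #ExampleTable (the index name is invented):
-- Covering Name avoids lookups back into the base table;
-- each IN-list item becomes a cheap seek against this index.
CREATE NONCLUSTERED INDEX IX_ExampleTable_ID
    ON #ExampleTable ( ID )
    INCLUDE ( Name );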
Workflow considerations
If this is for an OLTP workflow, then we need something that is fast regardless of how many users we have. As such, we want to minimize recompiles and we want index seeks wherever possible. If this is analytic or warehousing, then recompiles and scans are probably fine.
If we want OLTP, then the caching options are probably off the table. Temp tables will always force recompiles, and table variables in queries that rely on a good estimate require you to force a recompile. The alternative would be to have some other part of your application maintain a persistent table that has paginated counts or filters (or both), and then have this query join against that.
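To make that last idea concrete, here is a purely hypothetical sketch of such a persistent table (every name below is invented): a background process keeps pre-computed counts per filter current, and the paging query joins against it instead of recounting on every page.
-- Hypothetical supporting table, maintained outside this query.
CREATE TABLE dbo.PaginatedCounts
(
    FilterHash binary(32) NOT NULL PRIMARY KEY, -- identifies a filter/IN-list
    MaxRows bigint NOT NULL,                    -- pre-computed COUNT(*) for that filter
    RefreshedAtUtc datetime2 NOT NULL           -- lets readers judge staleness
);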
If the same user would look at many pages, then caching off part of it is probably still worth it even in OLTP, but make sure you measure the impact of many concurrent users.
Regardless of workflow, updating indices is probably okay (unless your workflows are going to really mess with your index maintenance).
Regardless of workflow, batch mode will be your friend.
Regardless of workflow, window functions (especially with either indices and/or batch mode) will probably be better.
Batch mode and the default cardinality estimator
We pretty consistently get poor cardinality estimates (and resulting plans) with the legacy cardinality estimator and row-mode executions. Forcing the default cardinality estimator helps with the first, and batch-mode helps with the second.
If you can't update your database to use the new cardinality estimator wholesale, then you'll want to enable it for your specific query with the hint OPTION( USE HINT( 'FORCE_DEFAULT_CARDINALITY_ESTIMATION' ) ). For batch mode, add a join to a CCI that never returns data: LEFT OUTER JOIN dbo.EmptyCciForRowstoreBatchmode ON 1 = 0 - the mere presence of the columnstore enables SQL Server to pick batch-mode optimizations. These recommendations assume a sufficiently new SQL Server version.
What the CCI contains doesn't matter; we like to keep an empty one around for consistency, which looks like this:
CREATE TABLE dbo.EmptyCciForRowstoreBatchmode
(
__zzDoNotUse int NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);
The best plan I could get without modifying the table was to use both of them. With the same data as before, this runs in <1s.
WITH TempResult AS
(
SELECT ID,
Name,
COUNT( * ) OVER ( ) MaxRows
FROM #ExampleTable
WHERE ID IN ( <<really long LIST>> )
)
SELECT TempResult.ID,
TempResult.Name,
TempResult.MaxRows
FROM TempResult
LEFT OUTER JOIN dbo.EmptyCciForRowstoreBatchmode ON 1 = 0
ORDER BY TempResult.Name OFFSET ( @PageNum - 1 ) * @PageSize ROWS FETCH NEXT @PageSize ROWS ONLY
OPTION( USE HINT( 'FORCE_DEFAULT_CARDINALITY_ESTIMATION' ) );

As far as I know, there are 3 ways to achieve this besides the #temp table approach already mentioned. In my test cases below, I used a SQL Server 2016 Developer instance (6 CPU/16 GB RAM) and a simple table containing ~25M rows.
Method 1: CROSS JOIN
DECLARE
    @PageSize INT = 10
  , @PageNum  INT = 1;

WITH TempResult AS (
    SELECT id
         , shortDesc
    FROM dbo.TestName
    WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT TempResult.*, TheCount.MaxRows
FROM TempResult
CROSS JOIN (SELECT COUNT(1) AS MaxRows FROM TempResult) AS TheCount
ORDER BY TempResult.shortDesc OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;
Test result 1:
Method 2: COUNT(*) OVER()
DECLARE
    @PageSize INT = 10
  , @PageNum  INT = 1;

WITH TempResult AS (
    SELECT id
         , shortDesc
    FROM dbo.TestName
    WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT *, MaxRows = COUNT(*) OVER ()
FROM TempResult
ORDER BY TempResult.shortDesc OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;
Test result 2:
Method 3: 2nd CTE
Test result 3 (T-SQL used was the same as in the question):
Conclusion
The fastest method depends on your data structure (and total number of rows) in combination with your server sizing and load. In my case, COUNT(*) OVER() proved to be the fastest method. To find what is best for you, you have to test each option against your own scenario. And don't rule out that #temp table approach just yet ;-)

You can try to count the rows while filtering the table using ROW_NUMBER():
DECLARE
    @PageSize INT = 10,
    @PageNum  INT = 1;

WITH
TempResult AS (
    SELECT ID, Name, ROW_NUMBER() OVER (ORDER BY ID) N
    FROM Table
    WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
),
TempCount AS (
    SELECT TOP 1 N AS MaxRows
    FROM TempResult
    ORDER BY ID DESC
)
SELECT *
FROM
    TempResult,
    TempCount
ORDER BY
    TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;

You could try phrasing this as:
WITH TempResult AS (
    SELECT ID, Name, COUNT(*) OVER () AS maxrows
    FROM Table
    WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT *
FROM TempResult
ORDER BY Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;
However, I doubt that you will see much performance improvement. The entire table needs to be scanned to get the total count. That is probably where the performance issue is.

This might be a shot in the dark but you can try using a temp table instead of a cte.
Though the performance results and the preference of one over the other vary from use case to use case, a temp table can actually prove better, since it lets you leverage indices and dedicated statistics.
CREATE TABLE #TempResult (ID int NOT NULL PRIMARY KEY, Name varchar(50) NOT NULL); -- column types assumed

INSERT INTO #TempResult
SELECT ID, Name
FROM Table
WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
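From there, the paged read might look like the sketch below (reusing @PageNum and @PageSize from the question; the index is optional but supports the sort):
-- Index to support the ORDER BY on the cached rows.
CREATE INDEX IX_TempResult_Name ON #TempResult (Name);

SELECT *, COUNT(*) OVER () AS MaxRows
FROM #TempResult
ORDER BY Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;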

The IN statement is a notorious hurdle for the SQL Server query engine. When it gets "massive" (your words), it slows down even simple queries. In my experience, IN statements with more than 5000 items nearly always slow a query down unacceptably.
It nearly always works better to convert the items of a large IN statement into a temp table or table variable first and then join against it, as below. I tested this and found it significantly faster, even including the preparation of the temp table. I think the IN statement, even though the inner query performs well enough with it, has a detrimental effect on the combined query.
DECLARE @ids TABLE (ID int PRIMARY KEY);

-- This must be done in chunks of 1000 (the row-constructor limit per INSERT)
INSERT @ids (ID) VALUES
(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),...
...

;WITH TempResult AS
(
    SELECT tbl.ID, tbl.Name
    FROM Table tbl
    JOIN @ids ids ON ids.ID = tbl.ID
),
TempCount AS
(
    SELECT COUNT(*) AS MaxRows FROM TempResult
)
SELECT *
FROM TempResult,
     TempCount
ORDER BY TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY

CTEs are very nice, but many consecutive CTEs (two is not many, I think, but in general) have caused me performance horror many times. The simplest method, I think, is to calculate the row count once and assign it to a variable:
DECLARE
    @PageSize INT = 10,
    @PageNum  INT = 1,
    @MaxRows bigint = (SELECT COUNT(1) FROM Table WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10));

WITH TempResult AS (
    SELECT ID, Name
    FROM Table
    WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT *, @MaxRows AS MaxRows -- the slow TempCount join is gone entirely
FROM TempResult
ORDER BY TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY

I can't test this at the moment, but on glancing through, it struck me that specifying a multiply (cross join), as in:
FROM TempResult,
     TempCount
may be the issue. How does it perform when written simply as:
DECLARE
    @PageSize INT = 10,
    @PageNum  INT = 1;

WITH TempResult AS (
    SELECT ID, Name
    FROM Table
    WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT *, (SELECT COUNT(*) FROM TempResult) AS MaxRows
FROM TempResult
ORDER BY TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY

Related

How to split records evenly into 4 tables from one table?

I have to design a solution that will help me load data into 4 tables from 1 master table.
All the function or package is supposed to do is the following:
Count the total number of rows in the master table
Divide by 4
Load the quarters into tables 1, 2, 3 and 4.
Every time we run the program, this function wipes out the 4 tables and does the above process again; the names of the main table and of the destination tables are always the same.
For example, if the Master Table has 4200 records then :
Table A will get 1-1000
Table B will get 1001-2000
Table C will get 2001-3000
Table D will get 3001-4200.
Can anyone help me?
This is a very simple way to do it. There may be faster ways. Replace [TABLE] with the name of your table and [ID] with the name of a unique column in the table.
DECLARE @count int = 0;
DECLARE @numRecsPerTable int = 0;

SELECT @count = COUNT(*) FROM [TABLE];
SELECT @numRecsPerTable = @count / 4;

-- The ORDER BY [ID] on each TOP makes the slices deterministic.
SELECT TOP (@numRecsPerTable) *
INTO temp_1
FROM [TABLE]
ORDER BY [ID];

SELECT TOP (@numRecsPerTable) *
INTO temp_2
FROM [TABLE]
WHERE [ID] NOT IN (SELECT TOP (@numRecsPerTable) [ID] FROM [TABLE] ORDER BY [ID])
ORDER BY [ID];

SELECT TOP (@numRecsPerTable) *
INTO temp_3
FROM [TABLE]
WHERE [ID] NOT IN (SELECT TOP (@numRecsPerTable * 2) [ID] FROM [TABLE] ORDER BY [ID])
ORDER BY [ID];

SELECT *
INTO temp_4
FROM [TABLE]
WHERE [ID] NOT IN (SELECT TOP (@numRecsPerTable * 3) [ID] FROM [TABLE] ORDER BY [ID]);
Note: the remainder of recs / 4 will be in the 4th table.
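As a side note (not part of the original answer, just a sketch using the same [TABLE]/[ID] placeholders), NTILE can hand out bucket numbers in a single pass. Be aware that NTILE puts any remainder rows into the lower-numbered buckets rather than the 4th table:
-- NTILE(4) numbers the rows 1-4 in even slices ordered by [ID].
SELECT *, NTILE(4) OVER (ORDER BY [ID]) AS Bucket
INTO #Buckets
FROM [TABLE];

-- Note: each temp_N will also contain the Bucket column.
SELECT * INTO temp_1 FROM #Buckets WHERE Bucket = 1;
SELECT * INTO temp_2 FROM #Buckets WHERE Bucket = 2;
SELECT * INTO temp_3 FROM #Buckets WHERE Bucket = 3;
SELECT * INTO temp_4 FROM #Buckets WHERE Bucket = 4;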
The SSIS implementation is similar to Steve's answer.
Source
The first difference is that instead of division, we'll use the modulo operator, %, which generates the remainder after division. In this example I use % 4, which yields the values 0, 1, 2 and 3: four "buckets" of data. To give the modulo operator something to work on, I use the ROW_NUMBER function to generate an arbitrary, monotonically increasing sequence of numbers.
The query would look something like
SELECT
T.*
, (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) % 4 AS bucketNumber
FROM
sys.all_columns AS T;
Conditional Split
I route the data to a Conditional Split component. There you define Boolean expressions and correlate them with named outputs. I defined mine as Bucket0, Bucket1, Bucket2 and Bucket3, using expressions of the form bucketNumber == 0...
Destination
I now have 4 connectors coming out of my conditional split and wire them up to tables Bucket0 to Bucket3.

Sort user preferences through their selections and rank

Overview
I'm attempting to create a sorter that returns only the possible preferences, based on the users' ranks and their stated preferences.
I'm not really sure where to start with this. Below you'll see a SQL Fiddle of a simplified version of what I'm looking at.
http://sqlfiddle.com/#!3/40f0c5/1/0
Initial Code
CREATE TABLE selections
(
id int,
item_id int,
preference int
);
CREATE TABLE ranks
(
id int,
rank int
);
INSERT INTO selections
(id, item_id, preference)
VALUES
(14063, 1, 1),
(14063, 2, 2),
(14063, 3, 3),
(15026, 1, 2),
(15026, 2, 1),
(15026, 3, 3),
(25014, 1, 1),
(25014, 2, 2),
(25014, 3, 3);
INSERT INTO ranks
(id, rank)
VALUES
(14063, 1),
(15026, 2),
(25014, 3);
Expected Outcome
Based on the tables above, if I run the sorter, we should see the results below. Ideally, I would ONLY want to show the item each user got, based on their preference and rank.
14063(1) - item(1)
15026(2) - item(2)
25014(3) - item(3)
I was able to come up with a working solution for you, but it's far from perfect: using a WHILE loop like I'm doing here breaks one of the basic rules of SQL optimization, which is to prefer set-based queries over RBAR (row-by-agonizing-row) processing. That said, I tried coming up with a way to do this with a CTE, with ROW_NUMBER(), and with some NOT EXISTS queries, and failed each time because of the dual nature of the sort. My WHILE loops are pretty unimpressive, so hopefully someone can come along and suggest some improvements for you. There are plenty of people out there whose righteous indignation could probably motivate a criticism or two - hopefully they'll also toss in some ideas or an answer of their own. :)
With that cheerfully self-critical caveat, and wishing you the best of luck on performance, here's a query that will get you the desired resultset:
DECLARE @SortingOutcome TABLE
(
    UserID INT,
    UserRank INT,
    ItemID INT,
    ItemPreference INT
)

DECLARE @Looper INT = 1
DECLARE @Ender INT

SELECT @Ender = MAX(Rank) FROM Ranks

WHILE @Looper <= @Ender
BEGIN
    INSERT INTO @SortingOutcome
    (
        UserID,
        UserRank,
        ItemID,
        ItemPreference
    )
    SELECT TOP 1
        r.ID,
        rank,
        item_id,
        preference
    FROM
        Ranks r
        INNER JOIN
        Selections s ON
            r.id = s.ID
    WHERE
        r.rank = @Looper AND
        NOT EXISTS
        (
            SELECT 1
            FROM @SortingOutcome
            WHERE ItemID = s.item_id
        )
    ORDER BY preference

    SET @Looper = @Looper + 1
END

SELECT * FROM @SortingOutcome
SQLFiddle

This SELECT query takes 180 seconds to finish

UPDATE:
Just to mention it in a more visible place: when I changed IN to =, the query execution time went from 180 seconds down to 0.00008 seconds. A ridiculous speed difference.
This SQL query takes 180 seconds to finish! How is that possible? Is there a way to optimize it to be faster?
SELECT IdLawVersionValidFrom
FROM question_law_version
WHERE IdQuestionLawVersion IN
(
SELECT MAX(IdQuestionLawVersion)
FROM question_law_version
WHERE IdQuestionLaw IN
(
SELECT MIN(IdQuestionLaw)
FROM question_law
WHERE IdQuestion=236 AND IdQuestionLaw>63
)
)
There are only about 5000 rows in each table so it shouldn't be so slow.
(Posting my comment as an answer as apparently it did make a difference!)
Any difference if you change the IN to = ?
If anyone wants to investigate this further I've just done a test and found it very easy to reproduce.
Create Table
CREATE TABLE `filler` (
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
)
Create Procedure
CREATE PROCEDURE `prc_filler`(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
Populate Table
call prc_filler(5000)
Query 1
SELECT id
FROM filler
WHERE id = (SELECT MAX(id) FROM filler WHERE id =
( SELECT MIN(id)
FROM filler
WHERE id between 2000 and 3000
)
)
Equals Explain Output http://img689.imageshack.us/img689/5592/equals.png
Query 2 (same problem)
SELECT id
FROM filler
WHERE id in (SELECT MAX(id) FROM filler WHERE id in
( SELECT MIN(id)
FROM filler
WHERE id between 2000 and 3000
)
)
In Explain Output http://img291.imageshack.us/img291/8129/52037513.png
Here is a good explanation of why = is better than IN.
MySQL has problems with inner queries - it doesn't use indexes well (if at all).
Make sure you have indexes on all the fields in the JOIN/WHERE/ORDER BY clauses.
Get those MAX and MIN values in a separate query (use a stored procedure for the entire thing if you want to skip the overhead of multiple round trips, or just send one request with multiple queries).
Anyway:
SELECT
IdLawVersionValidFrom
FROM
question_law_version
JOIN
question_law
ON
question_law_version.IdQuestionLaw = question_law.IdQuestionLaw
WHERE
question_law.IdQuestion=236
AND
question_law.IdQuestionLaw>63
ORDER BY
IdQuestionLawVersion DESC,
question_law.IdQuestionLaw ASC
LIMIT 1
You can use EXPLAIN to find out how is it possible for a query to execute so slow.
MySQL does not really like nested subselects, so probably what happens is that it sorts on disk to get MIN and MAX and fails to reuse the results.
Rewriting them as joins would probably help it.
If just looking for a quick fix try:
SET @temp1 =
(
    SELECT MIN(IdQuestionLaw)
    FROM question_law
    WHERE IdQuestion = 236 AND IdQuestionLaw > 63
);

SET @temp2 =
(
    SELECT MAX(IdQuestionLawVersion)
    FROM question_law_version
    WHERE IdQuestionLaw = @temp1
);

SELECT IdLawVersionValidFrom
FROM question_law_version
WHERE IdQuestionLawVersion = @temp2;

SQL Server 2008 CTE And CONTAINSTABLE Statement - Why the error?

I am testing out moving our database from SQL Server 2005 to 2008. We use CTE's for paging.
When using full-text CONTAINSTABLE, the CTE will not run and generates an error.
Here's my non-working code:
WITH results AS (
SELECT ROW_NUMBER() over (ORDER BY GBU.CreateDate DESC ) as rowNum,
GBU.UserID,
NULL AS DistanceInMiles
FROM User GBU WITH (NOLOCK)
WHERE 1=1
AND GBU.CountryCode IN (SELECT [Value] FROM fn_Split('USA',','))
AND GBU.UserID IN (SELECT [KEY] FROM CONTAINSTABLE(VW_GBU_Search, *, 'COMPASS'))
)
SELECT * from results
WHERE rowNum BETWEEN 0 and 25
If I comment out the CONTAINSTABLE line, the statement executes. If I only run the SELECT statement (not the WITH), the statement executes fine.
The unhelpful error I get on this is:
Msg 0, Level 11, State 0, Line 0
A severe error occurred on the current command. The results, if any, should be discarded.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command. The results, if any, should be discarded.
Any suggestions?
Appears to be a bug. See http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=426981
Sounds like the fix should be in the next MSSQL SP.
Assuming the other answers are correct, and that the underlying issue is a bug, since you aren't referencing RANK from CONTAINSTABLE, perhaps a query something like the following would be a workaround, where "ID" is the ID column in VW_GBU_Search (untested)?
;WITH results AS (
SELECT ROW_NUMBER() OVER (ORDER BY GBU.CreateDate DESC ) AS rowNum,
GBU.UserID,
NULL AS DistanceInMiles
FROM User GBU WITH (NOLOCK)
WHERE 1=1
AND GBU.CountryCode IN (SELECT [Value] FROM fn_Split('USA',','))
AND GBU.UserID IN (SELECT ID FROM VW_GBU_Search WHERE CONTAINS(*, 'COMPASS'))
)
SELECT * FROM results
WHERE rowNum BETWEEN 0 AND 25
Also, why do you have the "1=1" clause? Can you eliminate it?
I banged my head against the wall on this problem for hours; here is a workaround:
ASSUME: a table in the database called
Items ( ItemId int PK, Content varchar(MAX) ),
which already has a full-text index applied.
GO
CREATE FUNCTION udf_SearchItemsTable ( @FreeText varchar(4000) ) -- parameter type assumed
RETURNS @SearchHits
TABLE(
    Relevance int,
    ItemId int,
    Content varchar(MAX)
)
AS
BEGIN
    INSERT @SearchHits
    SELECT Results.[RANK] AS Relevance
          ,Items.ItemId AS ItemId
          ,Items.Content AS Content
    FROM Items INNER JOIN
         CONTAINSTABLE(Items, *, @FreeText) AS Results
         ON Results.[KEY] = Items.ItemId
    RETURN
END
GO
...
GO
CREATE FUNCTION udf_SearchItems ( @SearchText varchar(4000), @StartRowNum int, @MaxRows int ) -- parameter types assumed
RETURNS @SortedItems
TABLE (
    ItemId int,
    Content varchar(MAX)
)
AS
BEGIN
    WITH Matches AS
    (
        SELECT
            ROW_NUMBER() OVER (ORDER BY Hits.Relevance DESC) AS RowNum
            ,Hits.*
        FROM dbo.udf_SearchItemsTable(@SearchText) AS Hits
    )
    INSERT @SortedItems (ItemId, Content)
    SELECT
        ItemId, Content
    FROM
        Matches
    WHERE
        Matches.RowNum BETWEEN @StartRowNum
                           AND @StartRowNum + @MaxRows
    ;
    RETURN
END
GO
SELECT * FROM dbo.udf_SearchItems('some free text stuff', 10, 20);

SQL Server 2005 recursive query with loops in data - is it possible?

I've got a standard boss/subordinate employee table. I need to select a boss (specified by ID) and all his subordinates (and their subrodinates, etc). Unfortunately the real world data has some loops in it (for example, both company owners have each other set as their boss). The simple recursive query with a CTE chokes on this (maximum recursion level of 100 exceeded). Can the employees still be selected? I care not of the order in which they are selected, just that each of them is selected once.
Added: You want my query? Umm... OK... I thought it was pretty obvious, but here it is:
with
UserTbl as -- Selects an employee and his subordinates.
(
    select a.[User_ID], a.[Manager_ID] from [User] a WHERE [User_ID] = @UserID
    union all
    select a.[User_ID], a.[Manager_ID] from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
)
select * from UserTbl
Added 2: Oh, in case it wasn't clear - this is a production system and I have to do a little upgrade (basically add a sort of report). Thus, I'd prefer not to modify the data if it can be avoided.
I know it has been a while, but I thought I should share my experience as I tried every single solution; here is a summary of my findings (maybe it will save someone some time):
Adding a column with the current path did work, but had a performance hit, so not an option for me.
I could not find a way to do it using a CTE.
I wrote a recursive SQL function which adds employee IDs to a table. To get around the circular referencing, there is a check to make sure no duplicate IDs are added to the table. The performance was average but not desirable.
Having done all of that, I came up with the idea of dumping the whole subset of [eligible] employees to code (C#) and filtering them there using a recursive method. Then I wrote the filtered list of employees to a DataTable and exported it to my stored procedure as a temp table. To my disbelief, this proved to be the fastest and most flexible method for both small and relatively large tables (I tried tables of up to 35,000 rows).
This will work for the initial recursive link, but might not work for longer links:
DECLARE @Table TABLE(
    ID INT,
    PARENTID INT
)

INSERT INTO @Table (ID, PARENTID) SELECT 1, 2
INSERT INTO @Table (ID, PARENTID) SELECT 2, 1
INSERT INTO @Table (ID, PARENTID) SELECT 3, 1
INSERT INTO @Table (ID, PARENTID) SELECT 4, 3
INSERT INTO @Table (ID, PARENTID) SELECT 5, 2

SELECT * FROM @Table

DECLARE @ID INT
SELECT @ID = 1

;WITH boss (ID, PARENTID) AS (
    SELECT ID,
           PARENTID
    FROM @Table
    WHERE PARENTID = @ID
),
bossChild (ID, PARENTID) AS (
    SELECT ID,
           PARENTID
    FROM boss
    UNION ALL
    SELECT t.ID,
           t.PARENTID
    FROM @Table t INNER JOIN
         bossChild b ON t.PARENTID = b.ID
    WHERE t.ID NOT IN (SELECT PARENTID FROM boss)
)
SELECT *
FROM bossChild
OPTION (MAXRECURSION 0)
What I would recommend is to use a WHILE loop, and only insert links into the temp table if the ID does not already exist, thus removing endless loops.
Not a generic solution, but might work for your case: in your select query modify this:
select a.[User_ID], a.[Manager_ID] from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
to become:
select a.[User_ID], a.[Manager_ID] from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
and a.[User_ID] <> @UserID
You don't have to do it recursively. It can be done in a WHILE loop, and I guarantee it will be quicker: at least, it has been every time I've timed the two techniques. This sounds inefficient, but it isn't, since the number of loop iterations equals the recursion depth. At each iteration you can check for looping and correct it where it happens. You can also put a constraint on the temporary table to fire an error if looping occurs, though you seem to prefer something that deals with loops more elegantly. And you can trigger an error when the loop iterates past a certain number of levels, to catch an undetected loop - oh boy, it sometimes happens.
The trick is to insert repeatedly into a temporary table (which is primed with the root entries), including a column with the current iteration number, and to do an inner join between the most recent results in the temporary table and the child entries in the original table. Just break out of the loop when @@ROWCOUNT = 0!
Simple, eh?
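A minimal sketch of that loop, assuming the [User]([User_ID], [Manager_ID]) table from the question (the temp table name and starting ID are made up):
DECLARE @UserID int = 1; -- the boss to start from (assumed)
DECLARE @Level int = 0;

CREATE TABLE #Tree (User_ID int PRIMARY KEY, Manager_ID int, Level int);

-- Prime the table with the root entry.
INSERT INTO #Tree (User_ID, Manager_ID, Level)
SELECT [User_ID], [Manager_ID], 0 FROM [User] WHERE [User_ID] = @UserID;

WHILE 1 = 1
BEGIN
    -- Join the most recent level to its children, skipping anyone already
    -- seen; that check is what breaks the owner <-> owner cycle.
    INSERT INTO #Tree (User_ID, Manager_ID, Level)
    SELECT u.[User_ID], u.[Manager_ID], @Level + 1
    FROM [User] u
    JOIN #Tree t ON u.[Manager_ID] = t.User_ID AND t.Level = @Level
    WHERE NOT EXISTS (SELECT 1 FROM #Tree x WHERE x.User_ID = u.[User_ID]);

    IF @@ROWCOUNT = 0 BREAK; -- nothing new was added: we're done
    SET @Level = @Level + 1;
END;

SELECT User_ID, Manager_ID FROM #Tree;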
I know you asked this question a while ago, but here is a solution that may work for detecting infinite recursive loops. I generate a path, and in the CTE condition I check whether the USER_ID is already in the path; if it is, it won't be processed again. Hope this helps.
Jose
DECLARE @Table TABLE(
    USER_ID INT,
    MANAGER_ID INT )

INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 1, 2
INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 2, 1
INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 3, 1
INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 4, 3
INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 5, 2

DECLARE @UserID INT
SELECT @UserID = 1

;with
UserTbl as -- Selects an employee and his subordinates.
(
    select
        '/' + cast(a.USER_ID as varchar(max)) as [path],
        a.[User_ID],
        a.[Manager_ID]
    from @Table a
    where [User_ID] = @UserID
    union all
    select
        b.[path] + '/' + cast(a.USER_ID as varchar(max)) as [path],
        a.[User_ID],
        a.[Manager_ID]
    from @Table a
        inner join UserTbl b
            on (a.[Manager_ID] = b.[User_ID])
    where charindex('/' + cast(a.USER_ID as varchar(max)) + '/', [path]) = 0
)
select * from UserTbl
Basically, if you have loops like this in the data, you'll have to do the retrieval logic yourself.
You could use one CTE to get only subordinates and another to get bosses.
Another idea is to have a dummy row as boss to both company owners, so they wouldn't be each other's bosses, which is ridiculous. This is my preferred option (sketched below).
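A hedged sketch of that cleanup, assuming the [User] table from the question and that 0 is a free ID:
-- Add a synthetic root that reports to nobody...
INSERT INTO [User] ([User_ID], [Manager_ID]) VALUES (0, NULL);

-- ...and point both owners at it instead of at each other.
UPDATE [User]
SET [Manager_ID] = 0
WHERE [User_ID] IN (/* the two owners' User_IDs */);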
I can think of two approaches.
1) Produce more rows than you want, but include a check to make sure it does not recurse too deep; then remove the duplicate User records.
2) Use a string to hold the Users already visited, like the NOT IN subquery idea that didn't work.
Approach 1:
; with TooMuchHierarchy as (
select "User_ID"
, Manager_ID
, 0 as Depth
from "User"
WHERE "User_ID" = #UserID
union all
select U."User_ID"
, U.Manager_ID
, M.Depth + 1 as Depth
from TooMuchHierarchy M
inner join "User" U
on U.Manager_ID = M."user_id"
where Depth < 100) -- Warning MAGIC NUMBER!!
, AddMaxDepth as (
select "User_ID"
, Manager_id
, Depth
, max(depth) over (partition by "User_ID") as MaxDepth
from TooMuchHierarchy)
select "user_id", Manager_Id
from AddMaxDepth
where Depth = MaxDepth
The line where Depth < 100 is what keeps you from getting the max recursion error. Make this number smaller and fewer records will be produced that need to be thrown away; make it too small and employees won't be returned, so make sure it is at least as large as the depth of the org chart being stored. It's a bit of a maintenance nightmare as the company grows: if the number ever needs to exceed 100, add OPTION (MAXRECURSION ...) to the whole statement to allow deeper recursion.
Approach 2:
; with Hierarchy as (
select "User_ID"
, Manager_ID
, '#' + cast("user_id" as varchar(max)) + '#' as user_id_list
from "User"
WHERE "User_ID" = #UserID
union all
select U."User_ID"
, U.Manager_ID
, M.user_id_list + '#' + cast(U."user_id" as varchar(max)) + '#' as user_id_list
from Hierarchy M
inner join "User" U
on U.Manager_ID = M."user_id"
where user_id_list not like '%#' + cast(U."User_id" as varchar(max)) + '#%')
select "user_id", Manager_Id
from Hierarchy
The preferable solution is to clean up the data and make sure you do not have any loops in the future - that can be accomplished with a trigger or a UDF wrapped in a CHECK constraint.
However, you can use a multi statement UDF as I demonstrated here: Avoiding infinite loops. Part One
You can add a NOT IN() clause in the join to filter out the cycles.
This is the code I used on a project to chase up and down hierarchical relationship trees.
User defined function to capture subordinates:
CREATE FUNCTION fn_UserSubordinates(@User_ID INT)
RETURNS @SubordinateUsers TABLE (User_ID INT, Distance INT) AS BEGIN
    IF @User_ID IS NULL
        RETURN

    INSERT INTO @SubordinateUsers (User_ID, Distance) VALUES (@User_ID, 0)

    DECLARE @Distance INT, @Finished BIT
    SELECT @Distance = 1, @Finished = 0

    WHILE @Finished = 0
    BEGIN
        -- Add the next level down, excluding anyone already collected (C2).
        INSERT INTO @SubordinateUsers
        SELECT S.User_ID, @Distance
        FROM Users AS S
            JOIN @SubordinateUsers AS C
                ON C.User_ID = S.Manager_ID
            LEFT JOIN @SubordinateUsers AS C2
                ON C2.User_ID = S.User_ID
        WHERE C2.User_ID IS NULL

        IF @@ROWCOUNT = 0
            SET @Finished = 1

        SET @Distance = @Distance + 1
    END
    RETURN
END
User defined function to capture managers:
CREATE FUNCTION fn_UserManagers(@User_ID INT)
RETURNS @User TABLE (User_ID INT, Distance INT) AS BEGIN
    IF @User_ID IS NULL
        RETURN

    DECLARE @Manager_ID INT
    SELECT @Manager_ID = Manager_ID
    FROM UserClasses WITH (NOLOCK)
    WHERE User_ID = @User_ID

    -- Recurse up the chain, then add this user at distance 0.
    INSERT INTO @User (User_ID, Distance)
    SELECT User_ID, Distance + 1
    FROM dbo.fn_UserManagers(@Manager_ID)

    INSERT INTO @User (User_ID, Distance) VALUES (@User_ID, 0)
    RETURN
END
You need some method to prevent your recursive query from adding User IDs already in the set. However, as subqueries and double mentions of the recursive table are not allowed (thank you, van), you need another solution to remove the users already in the list.
The solution is to use EXCEPT to remove these rows. This should work according to the manual: multiple recursive statements linked with UNION-type operators are allowed. Removing the users already in the list means that after a certain number of iterations the recursive result set becomes empty and the recursion stops.
with UserTbl as -- Selects an employee and his subordinates.
(
select a.[User_ID], a.[Manager_ID] from [User] a WHERE [User_ID] = @UserID
union all
(
select a.[User_ID], a.[Manager_ID]
from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
where a.[User_ID] not in (select [User_ID] from UserTbl)
EXCEPT
select a.[User_ID], a.[Manager_ID] from UserTbl a
)
)
select * from UserTbl;
The other option is to hard-code a level variable that stops the query after a fixed number of iterations, or to use the MAXRECURSION query option hint, but I guess that is not what you want.