Sort user preferences through their selections and rank - sql

Overview
I'm attempting to create a sorter that assigns each user at most one item, based on the users' ranks and their stated preferences.
I'm not really sure where to start with this. Below is a SQL Fiddle of a simplified version of what I'm looking at:
http://sqlfiddle.com/#!3/40f0c5/1/0
Initial Code
CREATE TABLE selections
(
id int,
item_id int,
preference int
);
CREATE TABLE ranks
(
id int,
rank int
);
INSERT INTO selections
(id, item_id, preference)
VALUES
(14063, 1, 1),
(14063, 2, 2),
(14063, 3, 3),
(15026, 1, 2),
(15026, 2, 1),
(15026, 3, 3),
(25014, 1, 1),
(25014, 2, 2),
(25014, 3, 3);
INSERT INTO ranks
(id, rank)
VALUES
(14063, 1),
(15026, 2),
(25014, 3);
Expected Outcome
Based on the tables above, running the sorter should produce the results below. Ideally, I would ONLY want to show the item each user got, based on their preference and rank.
14063(1) - item(1)
15026(2) - item(2)
25014(3) - item(3)

I was able to come up with a working solution for you, but it's far from perfect: using a WHILE loop as I do here breaks one of the basic rules of SQL optimization, which is to prefer set-based queries over RBAR (row-by-agonizing-row) processing. That said, I tried coming up with a way to do this with a CTE, with ROW_NUMBER(), and with some NOT EXISTS queries, and failed each time because of the dual nature of the sort: rank drives the order in which users are assigned, while preference drives which item each user gets. My WHILE loops are pretty unimpressive, so hopefully someone can come along and suggest some improvements, or toss in an answer of their own. :)
With that cheerfully self-critical caveat, and wishing you the best of luck on performance, here's a query that will get you the desired resultset:
DECLARE @SortingOutcome TABLE
(
    UserID INT,
    UserRank INT,
    ItemID INT,
    ItemPreference INT
)

DECLARE @Looper INT = 1
DECLARE @Ender INT
SELECT @Ender = MAX(rank) FROM Ranks

-- Walk the users in rank order; each iteration hands the current user
-- the most-preferred item that is still available.
WHILE @Looper <= @Ender
BEGIN
    INSERT INTO @SortingOutcome
    (
        UserID,
        UserRank,
        ItemID,
        ItemPreference
    )
    SELECT TOP 1
        r.id,
        rank,
        item_id,
        preference
    FROM
        Ranks r
        INNER JOIN Selections s ON r.id = s.id
    WHERE
        r.rank = @Looper AND
        -- Skip items that have already been assigned.
        NOT EXISTS
        (
            SELECT 1
            FROM @SortingOutcome
            WHERE ItemID = s.item_id
        )
    ORDER BY preference

    SET @Looper = @Looper + 1
END

SELECT * FROM @SortingOutcome
SQLFiddle

Related

SQL Offset total row count slow with IN Clause

I am using the SQL below, based on another answer. However, when I include the massive IN clause, getting the total count takes too long. If I remove the total count, the query takes less than 1 second. Is there a more efficient way to get the total row count? The answers I found dated from around 2013.
DECLARE
    @PageSize INT = 10,
    @PageNum  INT = 1;

WITH TempResult AS (
    SELECT ID, Name
    FROM Table
    WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
), TempCount AS (
    SELECT COUNT(*) AS MaxRows FROM TempResult
)
SELECT *
FROM TempResult,
     TempCount -- this is what is slow. Removing this and the query is super fast
ORDER BY TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY
Step one for performance-related questions is to analyze your table/index structure and review the query plans. You haven't provided that information, so I'm going to make up my own and go from there.
I'm going to assume that you have a heap, with ~10M rows (12,872,738 for me):
DECLARE @MaxRowCount bigint = 10000000,
        @Offset bigint = 0;

DROP TABLE IF EXISTS #ExampleTable;
CREATE TABLE #ExampleTable
(
    ID bigint NOT NULL,
    Name varchar(50) COLLATE DATABASE_DEFAULT NOT NULL
);

WHILE @Offset < @MaxRowCount
BEGIN
    INSERT INTO #ExampleTable
    ( ID, Name )
    SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL )),
           ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ))
    FROM master.dbo.spt_values SV
        CROSS APPLY master.dbo.spt_values SV2;
    SET @Offset = @Offset + ROWCOUNT_BIG();
END;
If I run the query provided over #ExampleTable, it takes about 4 seconds and gives me this query plan:
This isn't a great query plan by any means, but it is hardly awful. Running with live query stats shows that the cardinality estimates were at most off by one, which is fine.
Let's try a massive number of items in our IN list (5000 items, 1 through 5000). Compiling the plan alone took 4 seconds:
I can get my number up to 15000 items before the query processor stops being able to handle it, with no change in query plan (it does take a total of 6 seconds to compile). Running both queries takes about 5 seconds a pop on my machine.
This is probably fine for analytical workloads or for data warehousing, but for OLTP like queries we've definitely exceeded our ideal time limit.
Let's look at some alternatives. We can probably use some of these in combination.
We could cache off the IN list in a temp table or table variable.
We could use a window function to calculate the count.
We could cache off our CTE in a temp table or table variable.
If on a sufficiently high SQL Server version, we could use batch mode.
We could change the indices on your table to make this faster.
Workflow considerations
If this is for an OLTP workflow, then we need something that is fast regardless of how many users we have. As such, we want to minimize recompiles and we want index seeks wherever possible. If this is analytic or warehousing, then recompiles and scans are probably fine.
If we want OLTP, then the caching options are probably off the table. Temp tables will always force recompiles, and table variables in queries that rely on a good estimate require you to force a recompile. The alternative would be to have some other part of your application maintain a persistent table that has paginated counts or filters (or both), and then have this query join against that.
If the same user would look at many pages, then caching off part of it is probably still worth it even in OLTP, but make sure you measure the impact of many concurrent users.
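For illustration, here's a minimal sketch of the table-variable caching idea (table and column names follow the question, and @PageSize/@PageNum are assumed declared as before); the OPTION (RECOMPILE) is what buys you a row-count-aware estimate:

DECLARE @Ids TABLE (ID int PRIMARY KEY);

-- Cache the IN list once.
INSERT INTO @Ids (ID)
VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);

SELECT T.ID,
       T.Name,
       COUNT(*) OVER () AS MaxRows
FROM Table T
    INNER JOIN @Ids I ON I.ID = T.ID
ORDER BY T.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY
OPTION (RECOMPILE); -- without a recompile, the table variable is estimated at one row on older versions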
Regardless of workflow, updating indices is probably okay (unless your workflows are going to really mess with your index maintenance).
Regardless of workflow, batch mode will be your friend.
Regardless of workflow, window functions (especially with either indices and/or batch mode) will probably be better.
Batch mode and the default cardinality estimator
We pretty consistently get poor cardinality estimates (and resulting plans) with the legacy cardinality estimator and row-mode executions. Forcing the default cardinality estimator helps with the first, and batch-mode helps with the second.
If you can't update your database to use the new cardinality estimator wholesale, then you'll want to enable it for your specific query. To accomplish that, you can use the following query hint: OPTION( USE HINT( 'FORCE_DEFAULT_CARDINALITY_ESTIMATION' ) ) to get the first. For the second, add a join to a CCI (doesn't need to return data): LEFT OUTER JOIN dbo.EmptyCciForRowstoreBatchmode ON 1 = 0 - this enables SQL Server to pick batch mode optimizations. These recommendations assume a sufficiently new SQL Server version.
What the CCI is doesn't matter; we like to keep an empty one around for consistency, and it looks like this:
CREATE TABLE dbo.EmptyCciForRowstoreBatchmode
(
__zzDoNotUse int NULL,
INDEX CCI CLUSTERED COLUMNSTORE
);
The best plan I could get without modifying the table was to use both of them. With the same data as before, this runs in <1s.
WITH TempResult AS
(
SELECT ID,
Name,
COUNT( * ) OVER ( ) MaxRows
FROM #ExampleTable
WHERE ID IN ( <<really long LIST>> )
)
SELECT TempResult.ID,
TempResult.Name,
TempResult.MaxRows
FROM TempResult
LEFT OUTER JOIN dbo.EmptyCciForRowstoreBatchmode ON 1 = 0
ORDER BY TempResult.Name OFFSET ( @PageNum - 1 ) * @PageSize ROWS FETCH NEXT @PageSize ROWS ONLY
OPTION( USE HINT( 'FORCE_DEFAULT_CARDINALITY_ESTIMATION' ) );
As far as I know, there are 3 ways to achieve this besides the #temp table approach already mentioned. In my test cases below, I used a SQL Server 2016 Developer instance with 6 CPUs/16 GB RAM and a simple table containing ~25M rows.
Method 1: CROSS JOIN
DECLARE
    @PageSize INT = 10
    , @PageNum INT = 1;
WITH TempResult AS (SELECT
id
, shortDesc
FROM dbo.TestName
WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
SELECT
*, MaxRows
FROM TempResult
CROSS JOIN (SELECT COUNT(1) AS MaxRows FROM TempResult) AS TheCount
ORDER BY TempResult.shortDesc OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;
Test result 1:
Method 2: COUNT(*) OVER()
DECLARE
    @PageSize INT = 10
    , @PageNum INT = 1;
WITH TempResult AS (SELECT
id
, shortDesc
FROM dbo.TestName
WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT
*, MaxRows = COUNT(*) OVER()
FROM TempResult
ORDER BY TempResult.shortDesc OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;
Test result 2:
Method 3: 2nd CTE
Test result 3 (T-SQL used was the same as in the question):
Conclusion
The fastest method depends on your data structure (and total number of rows) in combination with your server sizing/load. In my case, using COUNT(*) OVER() proved to be the fastest method. To find what works best for you, you have to test each method against your own scenario. And don't rule out that #table approach just yet ;-)
You can try to count the rows while filtering the table using ROW_NUMBER():
DECLARE
    @PageSize INT = 10,
    @PageNum INT = 1;

;WITH
TempResult AS (
    SELECT ID, Name, ROW_NUMBER() OVER (ORDER BY ID) N
    FROM Table
    WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
),
TempCount AS (
    SELECT TOP 1 N AS MaxRows
    FROM TempResult
    ORDER BY ID DESC
)
SELECT *
FROM
    TempResult,
    TempCount
ORDER BY
    TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY
You could try phrasing this as:
WITH TempResult AS(
SELECT ID, Name, COUNT(*) OVER () as maxrows
FROM Table
WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
However, I doubt that you will see much performance improvement. The entire table needs to be scanned to get the total count. That is probably where the performance issue is.
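If the scan really is the bottleneck, a covering index on the filter column is worth trying; a hedged sketch (the index name is made up, and Table/ID/Name follow the question's schema):

CREATE NONCLUSTERED INDEX IX_Table_ID_Name
ON Table (ID)
INCLUDE (Name); -- lets the IN list resolve to seeks and covers the SELECT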
This might be a shot in the dark, but you can try using a temp table instead of a CTE.
Though the performance results and the preference of one over the other vary from use case to use case, a temp table can actually prove better, since it enables you to leverage indices and dedicated statistics.
CREATE TABLE #TempResult (ID int PRIMARY KEY, Name varchar(50)); -- column types assumed to match the source table

INSERT INTO #TempResult
SELECT ID, Name
FROM Table
WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
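You can then page over the temp table the same way as in the question (a sketch, assuming @PageSize and @PageNum are declared as before); the count is now taken from the already-filtered temp table:

WITH TempCount AS (
    SELECT COUNT(*) AS MaxRows FROM #TempResult
)
SELECT *
FROM #TempResult,
     TempCount
ORDER BY #TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;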
The IN statement is a notorious hurdle for the SQL Server query engine. When it gets "massive" (your words) it slows down even simple queries. In my experience, IN statements with more than 5000 items nearly always unacceptably slow down any query.
It nearly always works better to convert the items of a large IN statement into a temp table or table variable first and then join with this table, as below. I tested this and found it's significantly faster, even with the preparation of the temp table. I think that the IN statement, even though the inner query performs well enough with it, has a detrimental effect on the combined query.
DECLARE @ids TABLE (ID int PRIMARY KEY);
-- This must be done in chunks of 1000
INSERT #ids (ID) VALUES
(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),...
...
;WITH TempResult AS
(
SELECT tbl.ID, tbl.Name
FROM Table tbl
JOIN @ids ids ON ids.ID = tbl.ID
),
TempCount AS
(
SELECT COUNT(*) AS MaxRows FROM TempResult
)
SELECT *
FROM TempResult,
TempCount
ORDER BY TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY
CTEs are very nice, but many consecutive CTEs (two is not many, I think, but in general) have caused me performance horror many times. The simplest method, I think, is to calculate the number of rows once and assign it to a variable:
DECLARE
    @PageSize INT = 10,
    @PageNum INT = 1,
    @MaxRows bigint = (SELECT COUNT(1) FROM Table WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10));

WITH TempResult AS (
    SELECT ID, Name
    FROM Table
    WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT *, @MaxRows AS MaxRows -- the slow TempCount join is replaced by the precomputed variable
FROM TempResult
ORDER BY TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY
I can't test this at the moment, but glancing through it, it struck me that specifying a multiply (cross join), as in:
FROM TempResult,
     TempCount -- this is what is slow
may be the issue.
How does it perform when written simply as:
DECLARE
    @PageSize INT = 10,
    @PageNum INT = 1;

WITH TempResult AS (
    SELECT ID, Name
    FROM Table
    WHERE ID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)
SELECT *, (SELECT COUNT(*) FROM TempResult) AS MaxRows
FROM TempResult
ORDER BY TempResult.Name
OFFSET (@PageNum - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY

sql join using recursive cte

Edit: Added another case scenario in the notes and updated the sample attachment.
I am trying to write a SQL query to produce the output attached to this question, along with sample data.
There are two tables: one with distinct IDs (PK) and their current flag, and another with an active ID (FK to the PK of the first table) and an inactive ID (FK to the PK of the first table).
The final output should return two columns: the first containing all distinct IDs from the first table, and the second containing the active ID from the second table.
Below is the SQL:
IF OBJECT_ID('tempdb..#main') IS NOT NULL DROP TABLE #main;
IF OBJECT_ID('tempdb..#merges') IS NOT NULL DROP TABLE #merges
IF OBJECT_ID('tempdb..#final') IS NOT NULL DROP TABLE #final
SELECT DISTINCT id,
       [current]
INTO #main
FROM tb_ID t1
--get list of all active_id and inactive_id
SELECT DISTINCT active_id,
inactive_id,
Update_dt
INTO #merges
FROM tb_merges
-- Combine where the id from the main table matched to the inactive_id (should return all the rows from #main)
SELECT id,
active_id AS merged_to_id
INTO #final
FROM (SELECT t1.*,
t2.active_id,
Update_dt ,
Row_number()
OVER (
partition BY id, active_id
ORDER BY Update_dt DESC) AS rn
FROM #main t1
LEFT JOIN #merges t2
ON t1.id = t2.inactive_id) t3
WHERE rn = 1
SELECT *
FROM #final
This SQL partially works. It fails where an ID was once active and later becomes inactive.
Please note:
the active ID should return the most recently active ID
an ID which doesn't have any active ID should be either null or the ID itself
for an ID where current = 0, the active ID should be the ID that is current in tb_ID
IDs may get interchanged. For example, given two IDs 6 and 7, when 6 is active 7 is inactive, and vice versa; the only way to know the most recent active state is by the update date
The attached sample might make this easier to understand.
It looks like I might have to use a recursive CTE to achieve these results. Can someone please help?
Thank you for your time!
I think you're correct that a recursive CTE looks like a good solution for this. I'm not entirely certain that I've understood exactly what you're asking for, particularly with regard to the update_dt column, just because the data is a little abstract as-is, but I've taken a stab at it, and it does seem to work with your sample data. The comments explain what's going on.
declare @tb_id table (id bigint, [current] bit);
declare @tb_merges table (active_id bigint, inactive_id bigint, update_dt datetime2);
insert @tb_id values
-- Sample data from the question.
(1, 1),
(2, 1),
(3, 1),
(4, 1),
(5, 0),
-- A few additional data to illustrate a deeper search.
(6, 1),
(7, 1),
(8, 1),
(9, 1),
(10, 1);
insert @tb_merges values
-- Sample data from the question.
(3, 1, '2017-01-11T13:09:00'),
(1, 2, '2017-01-11T13:07:00'),
(5, 4, '2013-12-31T14:37:00'),
(4, 5, '2013-01-18T15:43:00'),
-- A few additional data to illustrate a deeper search.
(6, 7, getdate()),
(7, 8, getdate()),
(8, 9, getdate()),
(9, 10, getdate());
if object_id('tempdb..#ValidMerge') is not null
drop table #ValidMerge;
-- Get the subset of merge records whose active_id identifies a "current" id and
-- rank by date so we can consider only the latest merge record for each active_id.
with ValidMergeCTE as
(
select
M.active_id,
M.inactive_id,
[Priority] = row_number() over (partition by M.active_id order by M.update_dt desc)
from
@tb_merges M
inner join @tb_id I on M.active_id = I.id
where
I.[current] = 1
)
select
active_id,
inactive_id
into
#ValidMerge
from
ValidMergeCTE
where
[Priority] = 1;
-- Here's the recursive CTE, which draws on the subset of merges identified above.
with SearchCTE as
(
-- Base case: any record whose active_id is not used as an inactive_id is an endpoint.
select
M.active_id,
M.inactive_id,
Depth = 0
from
#ValidMerge M
where
not exists (select 1 from #ValidMerge M2 where M.active_id = M2.inactive_id)
-- Recursive case: look for records whose active_id matches the inactive_id of a previously
-- identified record.
union all
select
S.active_id,
M.inactive_id,
Depth = S.Depth + 1
from
#ValidMerge M
inner join SearchCTE S on M.active_id = S.inactive_id
)
select
I.id,
S.active_id
from
@tb_id I
left join SearchCTE S on I.id = S.inactive_id;
Results:
id active_id
------------------
1 3
2 3
3 NULL
4 NULL
5 4
6 NULL
7 6
8 6
9 6
10 6

SQL select join the row with max (arithmetic(value1, value2))

I am trying to make a Trade system where people can make offers on the items they want. There are two currencies in the system, gold and silver; 100 silver = 1 gold. Note that people can make offers at the same price as others, so there could be duplicate highest offer prices.
Table structure looks roughly like this
Trade table
ID
TradeOffer table
ID
UserID
TradeID references Trade(ID)
GoldOffer
SilverOffer
I want to display to the user a list of trades sorted by the highest offer price whenever they do a search with constraints.
The ideal output would be similar to this:
Trade.ID TradeOffer.ID HighestGoldOffer HighestSilverOffer UserID
where HighestGoldOffer and HighestSilverOffer are the value of GoldOffer and SilverOffer column of the Offer with highest (GoldOffer * 100 + SilverOffer) and UserID is the user who made the offer
I know I can run two separate queries: one to retrieve all the Trades that satisfy the constraints, and another over the extracted IDs to get the highest offers. But I am a perfectionist, so I would prefer to do it with one SQL query instead of two.
I could just select all offers where (GoldOffer * 100 + SilverOffer) = MAX(GoldOffer * 100 + SilverOffer), but this could return duplicate Trades if multiple people offered the same price. Also, nobody may have offered on a Trade yet, so GoldOffer and SilverOffer will be empty; I would still like to show the Trade as having no offers when that happens.
Hope I made myself clear, and thanks for any help.
Model and test data
CREATE TABLE Trade (ID INT)
CREATE TABLE TradeOffer
(
ID INT,
UserID INT,
TradeID INT,
GoldOffer INT,
SilverOffer INT
)
INSERT Trade VALUES (1), (2), (3)
INSERT TradeOffer VALUES
(1, 1, 1, 10, 15),
(2, 2, 1, 11, 15),
(3, 1, 2, 10, 16),
(4, 2, 2, 10, 16)
Query
SELECT
[TradeID],
[TradeOfferID],
[HighestGoldOffer],
[HighestSilverOffer],
[UserID]
FROM (
SELECT
t.ID AS [TradeID],
tOffer.ID AS [TradeOfferID],
tOffer.GoldOffer AS [HighestGoldOffer],
tOffer.SilverOffer AS [HighestSilverOffer],
tOffer.[UserID],
RANK() OVER (
PARTITION BY t.ID
ORDER BY (([GoldOffer] * 100) + [SilverOffer]) DESC
) AS [Rank]
FROM Trade t
LEFT JOIN TradeOffer tOffer
ON tOffer.TradeID = t.ID
) x
WHERE [Rank] = 1
Result
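Note: RANK() assigns the same rank to tied offers, so a trade whose top price is offered by two users will still appear twice. If you want exactly one row per trade, a hedged variant is to swap RANK() for ROW_NUMBER() with a deterministic tiebreaker (here, assuming the lowest offer ID wins; adjust to whatever your rule is):

SELECT
    [TradeID],
    [TradeOfferID],
    [HighestGoldOffer],
    [HighestSilverOffer],
    [UserID]
FROM (
    SELECT
        t.ID AS [TradeID],
        tOffer.ID AS [TradeOfferID],
        tOffer.GoldOffer AS [HighestGoldOffer],
        tOffer.SilverOffer AS [HighestSilverOffer],
        tOffer.[UserID],
        ROW_NUMBER() OVER (
            PARTITION BY t.ID
            ORDER BY (([GoldOffer] * 100) + [SilverOffer]) DESC, tOffer.ID ASC
        ) AS [RowNum]
    FROM Trade t
    LEFT JOIN TradeOffer tOffer
        ON tOffer.TradeID = t.ID
) x
WHERE [RowNum] = 1

Trades with no offers still survive the LEFT JOIN as a single row with NULL offer columns.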

What is an efficient SQL query to join to a double maximum in a sub query?

Sorry the title is crap - I could not think of a better way to phrase it. This is my data structure:
Widget WidgetTransition
------ ----------------
Id, WidgetId,
... TransitionTypeId,
Cost,
...
I want a query that will give me a list of details for each widget along with the details of the most expensive (max WidgetTransition.Cost) transition. If a widget has two or more transitions that are 'tied' for the most expensive transition, the details of the most recent transition (max WidgetTransition.WidgetId) should be used. If the widget has no transitions, it should not appear in the results. This is the best that I could come up with:
SELECT *
FROM Widget
JOIN WidgetTransition
ON WidgetTransition.WidgetId = Widget.Id
AND WidgetTransition.Cost = (
SELECT Max(MostExpensiveTransition.Cost)
FROM WidgetTransition MostExpensiveTransition
WHERE MostExpensiveTransition.WidgetId = Widget.Id
)
This almost works, but has two problems.
Doesn't deal with tied transitions properly. If a widget has two or more tied transitions, each transition will appear in the results, instead of the most recent.
With large data sets, the query is sloooow. The Sybase database that I am running it on will do two table scans (WidgetTransition.Cost is not in the index) on WidgetTransition for each widget. Presumably one is for the join and one to find the max cost.
Is there a better way to write this query that fixes the tied problem and/or runs more efficiently? I want to avoid using T-SQL or a stored procedure.
If you are using a database product that supports ranking functions and common table expressions such as SQL Server 2005 and later (or recent versions of Sybase), you can do something like:
With Data As
(
Select WidgetId, TransitionTypeId
, Row_Number() Over ( Partition By WidgetId
Order By Cost Desc, WidgetId Desc ) As Rnk
From WidgetTransition
)
Select ...
From Widget As W
Join Data As D
On D.WidgetId = W.WidgetId
Where D.Rnk = 1
Have you tried creating the 'MaxValues' like below:
SELECT *
FROM Widget
JOIN WidgetTransition
ON WidgetTransition.WidgetId = Widget.Id
JOIN (SELECT MostExpensiveTransition.WidgetId, Max(MostExpensiveTransition.Cost) Cost
FROM WidgetTransition MostExpensiveTransition
GROUP BY MostExpensiveTransition.WidgetId
) MaxValues
ON MaxValues.WidgetId = Widget.Id
AND WidgetTransition.Cost = MaxValues.Cost
I'm going from memory here as I don't have a sql database to play with at the moment, so sorry if it doesn't quite work.
If I understand you correctly, this will do what you want.
I don't know about performance; you'll have to test that and figure out which indexes help.
declare @Widget table (ID int)
declare @WidgetTransition table (WidgetID int, TransitionTypeID int, Cost int)

insert into @Widget values (1)
insert into @Widget values (2)
insert into @Widget values (3)
insert into @WidgetTransition values (1, 1, 1)
insert into @WidgetTransition values (1, 2, 2)
insert into @WidgetTransition values (2, 1, 2)
insert into @WidgetTransition values (2, 2, 12)
insert into @WidgetTransition values (2, 3, 12)

select *
from @Widget as Widget
inner join @WidgetTransition as WidgetTransition
    on Widget.ID = WidgetTransition.WidgetID
inner join
    ( select Widget.ID, max(WT.TransitionTypeID) as TransitionTypeID
      from @Widget as Widget
      inner join
          ( select WidgetID, max(Cost) as Cost
            from @WidgetTransition
            group by WidgetID
          ) as MaxCost
          on Widget.ID = MaxCost.WidgetID
      inner join @WidgetTransition as WT
          on Widget.ID = WT.WidgetID and
             MaxCost.Cost = WT.Cost
      group by Widget.ID
    ) as MaxTransitionTypeID
    on Widget.ID = MaxTransitionTypeID.ID and
       WidgetTransition.TransitionTypeID = MaxTransitionTypeID.TransitionTypeID

SQL Server 2005 recursive query with loops in data - is it possible?

I've got a standard boss/subordinate employee table. I need to select a boss (specified by ID) and all his subordinates (and their subordinates, etc.). Unfortunately the real-world data has some loops in it (for example, both company owners have each other set as their boss). The simple recursive query with a CTE chokes on this (maximum recursion level of 100 exceeded). Can the employees still be selected? I don't care about the order in which they are selected, just that each of them is selected once.
Added: You want my query? Umm... OK... I thought it was pretty obvious, but here it is:
with
UserTbl as -- Selects an employee and his subordinates.
(
select a.[User_ID], a.[Manager_ID] from [User] a WHERE [User_ID] = @UserID
union all
select a.[User_ID], a.[Manager_ID] from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
)
select * from UserTbl
Added 2: Oh, in case it wasn't clear - this is a production system and I have to do a little upgrade (basically add a sort of report). Thus, I'd prefer not to modify the data if it can be avoided.
I know it has been a while, but I thought I should share my experience, as I tried every single solution; here is a summary of my findings:
Adding a column with the current path did work but had a performance hit so not an option for me.
I could not find a way to do it using CTE.
I wrote a recursive SQL function which adds employee IDs to a table. To get around the circular referencing, there is a check to make sure no duplicate IDs are added to the table. The performance was average but not desirable.
Having done all of that, I came up with the idea of dumping the whole subset of [eligible] employees to code (C#) and filtering them there using a recursive method. Then I wrote the filtered list of employees to a DataTable and passed it to my stored procedure as a temp table. To my disbelief, this proved to be the fastest and most flexible method for both small and relatively large tables (I tried tables of up to 35,000 rows).
This will work for the initial recursive link, but might not work for longer chains:
DECLARE @Table TABLE(
    ID INT,
    PARENTID INT
)

INSERT INTO @Table (ID, PARENTID) SELECT 1, 2
INSERT INTO @Table (ID, PARENTID) SELECT 2, 1
INSERT INTO @Table (ID, PARENTID) SELECT 3, 1
INSERT INTO @Table (ID, PARENTID) SELECT 4, 3
INSERT INTO @Table (ID, PARENTID) SELECT 5, 2

SELECT * FROM @Table

DECLARE @ID INT
SELECT @ID = 1

;WITH boss (ID, PARENTID) AS (
    SELECT ID,
           PARENTID
    FROM @Table
    WHERE PARENTID = @ID
),
bossChild (ID, PARENTID) AS (
    SELECT ID,
           PARENTID
    FROM boss
    UNION ALL
    SELECT t.ID,
           t.PARENTID
    FROM @Table t INNER JOIN
         bossChild b ON t.PARENTID = b.ID
    WHERE t.ID NOT IN (SELECT PARENTID FROM boss)
)
SELECT *
FROM bossChild
OPTION (MAXRECURSION 0)
What I would recommend is to use a WHILE loop, and only insert links into the temp table if the ID does not already exist, thus removing endless loops.
Not a generic solution, but might work for your case: in your select query modify this:
select a.[User_ID], a.[Manager_ID] from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
to become:
select a.[User_ID], a.[Manager_ID] from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
and a.[User_ID] <> @UserID
You don't have to do it recursively. It can be done in a WHILE loop. I guarantee it will be quicker: at least, it has been for me every time I've timed the two techniques. This sounds inefficient, but it isn't, since the number of loop iterations equals the recursion depth. At each iteration you can check for looping and correct it where it happens. You can also put a constraint on the temporary table to fire an error if looping occurs, though you seem to prefer something that deals with looping more elegantly. You can also raise an error when the WHILE loop iterates past a certain number of levels (to catch an undetected loop - oh boy, it sometimes happens).
The trick is to insert repeatedly into a temporary table (which is primed with the root entries), including a column with the current iteration number, doing an inner join between the most recent results in the temporary table and the child entries in the original table. Just break out of the loop when @@rowcount = 0!
Simple, eh?
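Here's a minimal sketch of that loop, using the question's [User] table and @UserID parameter (assumed declared); the NOT EXISTS check is what stops the loops:

DECLARE @Iteration INT
SET @Iteration = 0

DECLARE @Result TABLE (User_ID INT PRIMARY KEY, Manager_ID INT, Iteration INT)

-- Prime with the root entry.
INSERT INTO @Result (User_ID, Manager_ID, Iteration)
SELECT [User_ID], [Manager_ID], @Iteration
FROM [User]
WHERE [User_ID] = @UserID

WHILE 1 = 1
BEGIN
    SET @Iteration = @Iteration + 1

    -- Join the previous iteration's rows to their children, skipping
    -- anyone already collected (this is what breaks the cycles).
    INSERT INTO @Result (User_ID, Manager_ID, Iteration)
    SELECT u.[User_ID], u.[Manager_ID], @Iteration
    FROM [User] u
        INNER JOIN @Result r
            ON u.[Manager_ID] = r.User_ID
           AND r.Iteration = @Iteration - 1
    WHERE NOT EXISTS (SELECT 1 FROM @Result r2 WHERE r2.User_ID = u.[User_ID])

    IF @@ROWCOUNT = 0 BREAK
END

SELECT User_ID, Manager_ID FROM @Result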
I know you asked this question a while ago, but here is a solution that may work for detecting infinite recursive loops. I generate a path, and in the CTE condition I check whether the USER_ID is already in the path; if it is, it won't be processed again. Hope this helps.
Jose
DECLARE @Table TABLE(
    USER_ID INT,
    MANAGER_ID INT )

INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 1, 2
INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 2, 1
INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 3, 1
INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 4, 3
INSERT INTO @Table (USER_ID, MANAGER_ID) SELECT 5, 2

DECLARE @UserID INT
SELECT @UserID = 1
;with
UserTbl as -- Selects an employee and his subordinates.
(
select
'/'+cast( a.USER_ID as varchar(max)) as [path],
a.[User_ID],
a.[Manager_ID]
from @Table a
where [User_ID] = @UserID
union all
select
b.[path] +'/'+ cast( a.USER_ID as varchar(max)) as [path],
a.[User_ID],
a.[Manager_ID]
from @Table a
inner join UserTbl b
on (a.[Manager_ID]=b.[User_ID])
where charindex('/'+cast( a.USER_ID as varchar(max))+'/',[path]) = 0
)
select * from UserTbl
Basically, if you have loops like this in the data, you'll have to do the retrieval logic yourself.
You could use one CTE to get only subordinates and another to get bosses.
Another idea is to have a dummy row as the boss of both company owners, so they wouldn't be each other's bosses, which is ridiculous. This is my preferred option (see the sketch below).
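A hedged sketch of that data fix (the owner IDs 1 and 2 and the dummy ID 0 are made up for illustration):

-- Add a dummy root user with no boss.
INSERT INTO [User] ([User_ID], [Manager_ID]) VALUES (0, NULL);

-- Point both owners at the dummy root instead of at each other.
UPDATE [User]
SET [Manager_ID] = 0
WHERE [User_ID] IN (1, 2);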
I can think of two approaches.
1) Produce more rows than you want, but include a check to make sure it does not recurse too deep. Then remove the duplicate User records.
2) Use a string to hold the Users already visited, like the NOT IN subquery idea that didn't work.
Approach 1:
; with TooMuchHierarchy as (
select "User_ID"
, Manager_ID
, 0 as Depth
from "User"
WHERE "User_ID" = @UserID
union all
select U."User_ID"
, U.Manager_ID
, M.Depth + 1 as Depth
from TooMuchHierarchy M
inner join "User" U
on U.Manager_ID = M."user_id"
where Depth < 100) -- Warning MAGIC NUMBER!!
, AddMaxDepth as (
select "User_ID"
, Manager_id
, Depth
, max(depth) over (partition by "User_ID") as MaxDepth
from TooMuchHierarchy)
select "user_id", Manager_Id
from AddMaxDepth
where Depth = MaxDepth
The line where Depth < 100 is what keeps you from getting the max recursion error. Make this number smaller, and fewer records will be produced that need to be thrown away. Make it too small and employees won't be returned, so make sure it is at least as large as the depth of the org chart being stored. A bit of a maintenance nightmare as the company grows. If it needs to be bigger, then add OPTION (MAXRECURSION ...) to the whole thing to allow more recursion.
Approach 2:
; with Hierarchy as (
select "User_ID"
, Manager_ID
, '#' + cast("user_id" as varchar(max)) + '#' as user_id_list
from "User"
WHERE "User_ID" = @UserID
union all
select U."User_ID"
, U.Manager_ID
, M.user_id_list + '#' + cast(U."user_id" as varchar(max)) + '#' as user_id_list
from Hierarchy M
inner join "User" U
on U.Manager_ID = M."user_id"
where user_id_list not like '%#' + cast(U."User_id" as varchar(max)) + '#%')
select "user_id", Manager_Id
from Hierarchy
The preferable solution is to clean up the data and to make sure you do not have any loops in the future - that can be accomplished with a trigger or a UDF wrapped in a check constraint.
However, you can use a multi statement UDF as I demonstrated here: Avoiding infinite loops. Part One
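For illustration, here's a hedged sketch of the check-constraint idea (the function name and the depth cap are mine, not the linked article's):

CREATE FUNCTION dbo.fn_ManagerChainHasLoop (@User_ID INT)
RETURNS BIT
AS
BEGIN
    DECLARE @Current INT, @Steps INT
    SELECT @Current = @User_ID, @Steps = 0

    -- Walk up the manager chain; report a loop if we come back to the
    -- starting user, or bail out once we exceed any plausible org depth.
    WHILE @Current IS NOT NULL AND @Steps < 1000
    BEGIN
        SELECT @Current = (SELECT Manager_ID FROM [User] WHERE [User_ID] = @Current)
        IF @Current = @User_ID RETURN 1
        SET @Steps = @Steps + 1
    END

    -- NULL means we reached a root; anything else means we gave up inside a cycle.
    RETURN CASE WHEN @Current IS NULL THEN 0 ELSE 1 END
END
GO

ALTER TABLE [User] WITH CHECK
ADD CONSTRAINT CK_User_NoManagerLoop CHECK (dbo.fn_ManagerChainHasLoop([User_ID]) = 0);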
You can add a NOT IN() clause in the join to filter out the cycles.
This is the code I used on a project to chase up and down hierarchical relationship trees.
User defined function to capture subordinates:
CREATE FUNCTION fn_UserSubordinates(@User_ID INT)
RETURNS @SubordinateUsers TABLE (User_ID INT, Distance INT) AS BEGIN
    IF @User_ID IS NULL
        RETURN

    INSERT INTO @SubordinateUsers (User_ID, Distance) VALUES (@User_ID, 0)

    DECLARE @Distance INT, @Finished BIT
    SELECT @Distance = 1, @Finished = 0

    WHILE @Finished = 0
    BEGIN
        INSERT INTO @SubordinateUsers
        SELECT S.User_ID, @Distance
        FROM Users AS S
        JOIN @SubordinateUsers AS C
            ON C.User_ID = S.Manager_ID
        LEFT JOIN @SubordinateUsers AS C2
            ON C2.User_ID = S.User_ID
        WHERE C2.User_ID IS NULL

        IF @@ROWCOUNT = 0
            SET @Finished = 1

        SET @Distance = @Distance + 1
    END
    RETURN
END
User defined function to capture managers:
CREATE FUNCTION fn_UserManagers(@User_ID INT)
RETURNS @User TABLE (User_ID INT, Distance INT) AS BEGIN
    IF @User_ID IS NULL
        RETURN

    DECLARE @Manager_ID INT
    SELECT @Manager_ID = Manager_ID
    FROM UserClasses WITH (NOLOCK)
    WHERE User_ID = @User_ID

    -- Recurse up the chain first, then add this user at distance 0.
    INSERT INTO @User (User_ID, Distance)
    SELECT User_ID, Distance + 1
    FROM dbo.fn_UserManagers(@Manager_ID)

    INSERT INTO @User (User_ID, Distance) VALUES (@User_ID, 0)
    RETURN
END
You need some method to prevent your recursive query from adding User IDs already in the set. However, as subqueries and double mentions of the recursive table are not allowed (thank you van), you need another solution to remove the users already in the list.
The solution is to use EXCEPT to remove these rows. This should work according to the manual: multiple recursive statements linked with union-type operators are allowed. Removing the users already in the list means that after a certain number of iterations the recursive result set comes back empty and the recursion stops.
with UserTbl as -- Selects an employee and his subordinates.
(
select a.[User_ID], a.[Manager_ID] from [User] a WHERE [User_ID] = @UserID
union all
(
select a.[User_ID], a.[Manager_ID]
from [User] a join UserTbl b on (a.[Manager_ID]=b.[User_ID])
where a.[User_ID] not in (select [User_ID] from UserTbl)
EXCEPT
select a.[User_ID], a.[Manager_ID] from UserTbl a
)
)
select * from UserTbl;
The other option is to hardcode a level variable that will stop the query after a fixed number of iterations or use the MAXRECURSION query option hint, but I guess that is not what you want.