Querying up tree for a particular value - sql

I'm a bit of a SQL novice, so I could definitely use some assistance hashing out the general design of a particular query. I'll be giving a SQL example of what I'm trying to do below. It may contain some syntax errors, and I do apologize for that- I'm just trying to get the design down before I go running and testing it!
Side note- I have 0 control over the design scheme, so redesign is not an option. My example tables may have an error due to oversight on my part, but the overall design scheme of bottom-up value searching will remain the same. I'm querying an existing database filled with tons of data already in it.
The scenario is this: There is a tree of elements. Each element has an ID and a parent ID (table layouts below). Parent ID is a recursive foreign key to itself. There is a second table that contains values. Each value has an elementID that is a foreign key to the element table. So to get the value of a particular variable for a particular element, you must join the two tables.
The variable hierarchy goes bottom-up by way of inheritance. If you have an element and want to get its variable value, you first look at that element. If it doesn't have a value, you check the element's parent. If that doesn't have one either, you check the parent's parent, and so on all the way to the top. Every variable is guaranteed to have a value by the time you reach the top! (If I search for variableID 21, I know that 21 will exist: if not at the bottom, then definitely at the top.) The lowest element on the tree gets priority, though: if the bottom element has a value for that variable, don't go any farther up!
The tables would look roughly like this:
Element_Table
--------------
elementID (PK)
ParentID (FK to elementID)
Value_Table
--------------
valueID (PK)
variableID
value (the value that we're looking for)
elementID (FK to Element_Table.elementID)
So, what I'm looking to do is create a function that cleanly (key word here: nice, clean and efficient code) searches bottom-up across the tree, looking for a variable value. Once I find it, return that value and move on!
Here is an example of what I'm thinking:
CREATE FUNCTION FindValueInTreeBottomUp
(@variableID int, @element varchar(50))
RETURNS varchar(50)
AS
BEGIN
DECLARE @result varchar(50)
DECLARE @ID int
DECLARE @parentID int
SELECT @result = NULL, @ID = @element
-- stop when a value is found or when we run out of parents
WHILE (@result IS NULL AND @ID IS NOT NULL)
BEGIN
-- LEFT JOIN so the element row (and its ParentID) comes back even when it has no value
SELECT @result = vals.value, @parentID = eles.ParentID
FROM Element_Table eles
LEFT JOIN Value_Table vals
ON vals.elementID = eles.elementID AND vals.variableID = @variableID
WHERE eles.elementID = @ID
IF (@result IS NULL)
SET @ID = @parentID
ELSE
BREAK
END
RETURN @result
END
Again, I apologize if there are any syntactical errors. Still a SQL novice and haven't run this yet! I'm especially a novice at functions- I can query all day, but functions/sprocs are still rather new to me.
So, SQL gurus out there- can you think of a better way to do this? The design of the tables won't be changing; I have NO control over that. All I can do is produce the query to check the already existing design.

I think you could do something like this (it's untested, have to try it in sql fiddle):
;with cte1 as (
select e.elementID, e.parentID, v.value
from Element_Table as e
left outer join Value_Table as v on v.elementID = e.elementID and v.variableID = @variableID
), cte2 as (
select v.value, v.parentID, 1 as aDepth
from cte1 as v
where v.elementID = @elementID
union all
select v.value, v.parentID, c.aDepth + 1
from cte2 as c
inner join cte1 as v on v.elementID = c.ParentID
where c.value is null
)
select top 1 value
from cte2
where value is not null
order by aDepth
test infrastructure:
declare @Elements table (ElementID int, ParentID int)
declare @Values table (VariableID int, ElementID int, Value nvarchar(128))
declare @variableID int, @elementID int
select @variableID = 1, @elementID = 2
insert into @Elements
select 1, null union all
select 2, 1
insert into @Values
select 1, 1, 'test'
;with cte1 as (
select e.elementID, e.parentID, v.value
from @Elements as e
left outer join @Values as v on v.elementID = e.elementID and v.variableID = @variableID
), cte2 as (
select v.value, v.parentID, 1 as aDepth
from cte1 as v
where v.elementID = @elementID
union all
select v.value, v.parentID, c.aDepth + 1
from cte2 as c
inner join cte1 as v on v.elementID = c.ParentID
where c.value is null
)
select top 1 value
from cte2
where value is not null
order by aDepth
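If you want to try the idea without a SQL Server instance at hand, here is a minimal sketch of the same two-CTE walk using Python's built-in sqlite3 module. It is only a stand-in, not the T-SQL above: SQLite wants WITH RECURSIVE and LIMIT 1 instead of TOP 1, and the sample tree, the 'root default' / 'mid override' values and the function name find_value_bottom_up are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Element_Table (elementID INTEGER PRIMARY KEY, parentID INTEGER);
    CREATE TABLE Value_Table (valueID INTEGER PRIMARY KEY, variableID INTEGER,
                              value TEXT, elementID INTEGER);
    -- Hypothetical tree: 1 is the root, 2 and 4 are children of 1, 3 is a child of 2.
    INSERT INTO Element_Table VALUES (1, NULL), (2, 1), (3, 2), (4, 1);
    INSERT INTO Value_Table (variableID, value, elementID)
    VALUES (21, 'root default', 1), (21, 'mid override', 2);
""")

LOOKUP = """
WITH RECURSIVE cte1 AS (
    -- every element, with its value for the requested variable (NULL if none)
    SELECT e.elementID, e.parentID, v.value
    FROM Element_Table AS e
    LEFT JOIN Value_Table AS v
      ON v.elementID = e.elementID AND v.variableID = :variable_id
), cte2(value, parentID, depth) AS (
    -- anchor: the element we start from
    SELECT v.value, v.parentID, 1 FROM cte1 AS v WHERE v.elementID = :element_id
    UNION ALL
    -- recursive step: climb to the parent only while no value has been found
    SELECT v.value, v.parentID, c.depth + 1
    FROM cte2 AS c JOIN cte1 AS v ON v.elementID = c.parentID
    WHERE c.value IS NULL
)
SELECT value FROM cte2 WHERE value IS NOT NULL ORDER BY depth LIMIT 1
"""

def find_value_bottom_up(conn, variable_id, element_id):
    """Return the nearest value walking from element_id up to the root, or None."""
    row = conn.execute(
        LOOKUP, {"variable_id": variable_id, "element_id": element_id}
    ).fetchone()
    return row[0] if row else None

print(find_value_bottom_up(conn, 21, 3))  # element 3 has no value of its own
print(find_value_bottom_up(conn, 21, 4))  # element 4 inherits from the root
```

The `where c.value is null` condition in the recursive part is what makes the walk stop at the first element that has a value, so the lowest element wins.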

Related

Short-circuiting tables

I'm upgrading several identical copies of a database which may already be upgraded partially, and for some reason bool values were stored in an nvarchar(5).
So in the below, (which exists inside an INSERT > SELECT block), I need to check if the column ShowCol exists, fill it with 0 if it does not, or fill it with the result of evaluating the string bool if it does:
CASE
WHEN COL_LENGTH('dbo.TableName', 'ShowCol') IS NULL THEN 0
ELSE IIF(LOWER(ShowCol) = 'false', 0, 1)
END
...but I'm getting an error "Invalid column name 'ShowCol'". I can't seem to short-circuit this, can you help?
It's worth noting that the column, if it does exist, contains a mix of "false", "False" and "FALSE", so that's the point of the LOWER(). (The true values also have occasional trailing spaces to contend with, which is why I'm only testing for false and treating everything else as true.)
I suspect that it's this wrap in LOWER() that is causing the server to always evaluate the expression.
You can’t short circuit the existence of a column (and it has nothing to do with LOWER(); if you remove it, nothing will change).
You’ll need dynamic SQL, e.g.:
DECLARE @sql nvarchar(max) = N'UPDATE trg SET
trg.col1 = src.col1,
trg.col2 = src.col2';
IF COL_LENGTH('dbo.TableName', 'ShowCol') > 0
BEGIN
SET @sql += N', trg.ShowCol = IIF(LOWER(src.ShowCol) = ''false'', 0, 1)';
END
SET @sql += N' ...
FROM dbo.TableName AS trg
INNER JOIN dbo.Origin AS src
ON ...';
EXEC sys.sp_executesql @sql; -- ,N'params', @params;
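The underlying idea (check for the column first, and only then build statement text that mentions it) can be tried out in isolation. Below is a rough sketch in Python with the built-in sqlite3 module as a stand-in for SQL Server. PRAGMA table_info plays the role of COL_LENGTH here, and the tables WithCol / WithoutCol, the id column and both helper functions are hypothetical names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: one partially upgraded copy with ShowCol, one without.
    CREATE TABLE WithCol (id INTEGER, ShowCol TEXT);
    CREATE TABLE WithoutCol (id INTEGER);
    INSERT INTO WithCol VALUES (1, 'FALSE'), (2, 'True ');
    INSERT INTO WithoutCol VALUES (1);
""")

def column_exists(conn, table, column):
    # PRAGMA table_info returns one row per column; index 1 holds the column name.
    # The table name is interpolated, so it must come from trusted code, not user input.
    rows = conn.execute(f"PRAGMA table_info({table})")
    return any(row[1].lower() == column.lower() for row in rows)

def build_select(conn, table):
    """Build the statement text only after checking whether ShowCol exists."""
    if column_exists(conn, table, "ShowCol"):
        show_expr = "CASE WHEN LOWER(ShowCol) = 'false' THEN 0 ELSE 1 END"
    else:
        show_expr = "0"  # column missing: never mention it in the SQL text
    return f"SELECT id, {show_expr} AS ShowCol FROM {table} ORDER BY id"

for table in ("WithCol", "WithoutCol"):
    print(table, conn.execute(build_select(conn, table)).fetchall())
```

Because the name ShowCol never appears in the text sent for WithoutCol, there is nothing for the engine to reject at compile time, which is the same reason the dynamic SQL above works.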
When you're selecting data, you can fool the parser a little bit by introducing constants to take the place of columns, taking advantage of SQL Server's desire to find a column reference even at a different scope than the syntax would suggest. I talk about this in Make SQL Server DMV Queries Backward Compatible. I don't know of any straightforward way to make that work with writes without dynamic SQL, as the parser does more strict checking there, so it's harder to fool.
Imagine you have these tables:
CREATE TABLE dbo.SourceTable(a int, b int, c int);
INSERT dbo.SourceTable(a,b,c) VALUES(1,2,3);
CREATE TABLE dbo.DestinationWithAllColumns(a int, b int, c int);
INSERT dbo.DestinationWithAllColumns(a,b,c) VALUES(1,2,3);
CREATE TABLE dbo.DestinationWithoutAllColumns(a int, b int);
INSERT dbo.DestinationWithoutAllColumns(a,b) VALUES(1,2);
You can write a SELECT against either of them that produces an int output column called c:
;WITH optional_columns AS
(
SELECT c = CONVERT(int, NULL)
)
SELECT trg.a, trg.b, trg.c
FROM optional_columns
CROSS APPLY
(SELECT a,b,c FROM dbo.DestinationWithAllColumns) AS trg
INNER JOIN dbo.SourceTable AS src ON src.a = trg.a;
Output:
a  b  c
1  2  3
;WITH optional_columns AS
(
SELECT c = CONVERT(int, NULL)
)
SELECT trg.a, trg.b, trg.c
FROM optional_columns
CROSS APPLY
(SELECT a,b,c FROM dbo.DestinationWithoutAllColumns) AS trg
INNER JOIN dbo.SourceTable AS src ON src.a = trg.a;
Output:
a  b  c
1  2  null
So far, so good. But as soon as you try and update:
;WITH optional_columns AS
(
SELECT c = CONVERT(int, NULL)
)
UPDATE trg SET trg.b = src.b, trg.c = src.c
FROM optional_columns
CROSS APPLY
(SELECT a,b,c FROM dbo.DestinationWithoutAllColumns) AS trg
INNER JOIN dbo.SourceTable AS src ON src.a = trg.a;
Msg 4421, Level 16, State 1
Derived table 'trg' is not updatable because a column of the derived table is derived or constant.
Example db<>fiddle

compare value with 2 different columns using the IN operator

I have a situation where I need to compare the value of a column with 2 columns from my settings table.
Currently I have this query which works
declare @t int = 3
select 1
where @t = (select s.RelationGDGMID from dbo.tblSettings s )
or
@t = (select s.RelationGTTID from dbo.tblSettings s )
But I wonder if I can make this without reading tblSettings 2 times, and then I tried this
declare @t int = 3
select 1
where @t in (select s.RelationGDGMID, s.RelationGTTID from dbo.tblSettings s )
and this does not compile; it returns
Only one expression can be specified in the select list when the
subquery is not introduced with EXISTS
So how can I do this without reading tblSettings 2 times? Well, one solution would be to use EXISTS, as the error hints:
declare @t int = 3
select 1
where exists (select 1 from dbo.tblSettings s where s.RelationGDGMID = @t or s.RelationGTTID = @t)
and yes that works, only reads tblSettings once, so I can use this.
But I still wonder if there is a way to make it work with the IN operator
After all, when I do this
declare @t int = 3
select 1
where @t in (3, 1)
that works without problems,
so why does
where @t in (select s.RelationGDGMID, s.RelationGTTID from dbo.tblSettings s )
not work, when in fact it also returns (3, 1)?
One way to do it would be to use UNION if the columns are of the same type.
where @t in (select s1.RelationGDGMID from dbo.tblSettings s1 UNION
select s2.RelationGTTID from dbo.tblSettings s2)
The reason this works is that the subquery returns a single value set (one column of values). where @t in (3, 1) works for the same reason: it is also a single value set (the values 3 and 1). Your two-column subquery, by contrast, returns one row with two columns, which is not a list of values.
That said, I would prefer EXISTS over IN, as it could produce a better query plan. Note that the UNION version still reads tblSettings twice.
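To convince yourself that the two forms agree, here is a small sketch you can run with Python's built-in sqlite3 module. SQLite is only standing in for SQL Server, the single settings row (3, 1) is taken from the question, and the helper name matches is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblSettings (RelationGDGMID INTEGER, RelationGTTID INTEGER)")
conn.execute("INSERT INTO tblSettings VALUES (3, 1)")

# IN needs a one-column subquery, so the two columns are stacked with UNION.
IN_UNION = """
SELECT 1 WHERE :t IN (
    SELECT RelationGDGMID FROM tblSettings
    UNION
    SELECT RelationGTTID FROM tblSettings)
"""

# EXISTS can test both columns while reading the table once.
EXISTS = """
SELECT 1 WHERE EXISTS (
    SELECT 1 FROM tblSettings AS s
    WHERE s.RelationGDGMID = :t OR s.RelationGTTID = :t)
"""

def matches(sql, t):
    """True when the query returns a row for the given value of t."""
    return conn.execute(sql, {"t": t}).fetchone() is not None

for t in (3, 1, 2):
    print(t, matches(IN_UNION, t), matches(EXISTS, t))
```

Both queries give the same answer for every value of t; the difference is only in how many times the settings table is read.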

Most performant way to filter on multiple values in multiple columns?

I have an application where the user can retrieve a list.
The user is allowed to add certain filters. For example:
Articles: 123, 456, 789
CustomerGroups: 1, 2, 3, 4, 5
Customers: null
ArticleGroups: null
...
When a filter is empty (or null), the query must ignore that filter.
What is the most performant way to build your query so it can handle a lot (10+) of different filters (and joins)?
My current approach is the following, but it doesn't scale very well:
CREATE PROCEDURE [dbo].[GetFilteredList]
@start datetime,
@stop datetime,
@ArticleList varchar(max), -- '123,456,789'
@ArticleGroupList varchar(max),
@CustomerList varchar(max),
@CustomerGroupList varchar(max) -- '1,2,3,4,5'
--More filters here...
AS
BEGIN
SET NOCOUNT ON
DECLARE @Articles TABLE (value VARCHAR(10));
INSERT INTO @Articles (value)
SELECT *
FROM [dko_db].[dbo].fnSplitString(@ArticleList, ',');
DECLARE @ArticleGroups TABLE (value VARCHAR(10));
INSERT INTO @ArticleGroups (value)
SELECT *
FROM [dko_db].[dbo].fnSplitString(@ArticleGroupList, ',');
DECLARE @Customers TABLE (value VARCHAR(10));
INSERT INTO @Customers (value)
SELECT *
FROM [dko_db].[dbo].fnSplitString(@CustomerList, ',');
DECLARE @CustomerGroups TABLE (value VARCHAR(10));
INSERT INTO @CustomerGroups (value)
SELECT *
FROM [dko_db].[dbo].fnSplitString(@CustomerGroupList, ',');
select * -- Some columns here
FROM [dbo].[Orders] o
LEFT OUTER JOIN [dbo].[Article] a on o.ArticleId = a.Id
LEFT OUTER JOIN [dbo].[ArticleGroup] ag on a.GroupId = ag.Id
LEFT OUTER JOIN [dbo].[Customer] c on o.CustomerId = c.Id
LEFT OUTER JOIN [dbo].[CustomerGroup] cg on c.GroupId = cg.Id
-- More joins here
WHERE o.OrderDate between @start and @stop and
(isnull(@ArticleList, '') = '' or a.ArticleCode in (select value from @Articles)) and
(isnull(@ArticleGroupList, '') = '' or ag.GroupCode in (select value from @ArticleGroups)) and
(isnull(@CustomerList, '') = '' or c.CustomerCode in (select value from @Customers)) and
(isnull(@CustomerGroupList, '') = '' or cg.GroupCode in (select value from @CustomerGroups))
ORDER BY c.Name, o.OrderDate
END
There are a lot of "low hanging fruit" performance improvements here.
First, lose ORDER BY c.Name, o.OrderDate if the caller can sort the results itself; otherwise it's just needless sorting on the server.
Second, for your "list" variables (e.g. @ArticleList): if you don't need VARCHAR(MAX), then change the data type(s) to VARCHAR(8000). VARCHAR(MAX) is much slower than VARCHAR(8000). I never use MAX data types unless I am certain it's required.
Third, you can skip dumping your split values into table variables. That's just needless overhead. You can lose all those declarations and inserts, then change THIS:
... a.ArticleCode in (select value from @Articles))
TO:
... a.ArticleCode in (SELECT value FROM dbo.fnSplitString(@ArticleList, ',')))
Fourth, if fnSplitString is not an inline table-valued function (e.g. you see BEGIN and END in the DDL), then it will be slow. An inline splitter will be much faster; consider DelimitedSplit8k or DelimitedSplit8K_LEAD.
Last, I would add OPTION (RECOMPILE), as this is a query highly unlikely to benefit from plan caching. A recompile lets the optimizer build the plan for the actual parameter values on each call.
Beyond that, when joining a bunch of tables, check the execution plan, see where most of the data is coming from and use that info to index accordingly.

Reference CTE inside CTE?

I found this stored proc in our codebase:
ALTER PROCEDURE [dbo].[MoveNodes]
(
@id bigint,
@left bigint,
@right bigint,
@parentid bigint,
@offset bigint,
@caseid bigint,
@userid bigint
)
AS
BEGIN
WITH q AS
(
SELECT id, parent, lft, rgt, title, type, caseid, userid, 0 AS level,
CAST(LEFT(CAST(id AS VARCHAR) + REPLICATE('0', 10), 10) AS VARCHAR) AS bc
FROM [dbo].DM_FolderTree hc
WHERE id = @id and caseid = @caseid
UNION ALL
SELECT hc.id, hc.parent, hc.lft, hc.rgt, hc.title, hc.type, hc.caseid, hc.userid, level + 1,
CAST(bc + '.' + LEFT(CAST(hc.id AS VARCHAR) + REPLICATE('0', 10), 10) AS VARCHAR)
FROM q
JOIN [dbo].DM_FolderTree hc
ON hc.parent = q.id
)
UPDATE [dbo].DM_FolderTree
SET lft = ((-lft) + @offset), rgt = ((-rgt) + @offset), userid = @userid
WHERE id in (select id from q) AND lft <= (-(@left)) AND rgt >= (-(@right)) AND caseid = @caseid;
UPDATE [dbo].DM_FolderTree SET parent = @parentid, userid = @userid WHERE id = @id AND caseid = @caseid;
END
where you'll notice that the CTE q is being referenced inside its own definition, after the UNION. What exactly are we calling here? Everything before the UNION, or the whole CTE? What exactly is happening here?
I'm assuming that this code is legal, since it's been in production for quite some time (FLW, I know). But still, I have no idea what's happening here.
This is a recursive query. The CTE references itself again and again until every node under the starting id has been visited.
Think about the nesting of folders in a directory. This query simply walks all the directories to get the final "file path" for all files in all folders.
Notice how level starts at 0 and then gets added to. The second time through, level is 1, then it becomes 2, then 3, and so on.
To better understand:
Grab the CTE portion (with q as ...), replace the update with select * from q, and run it, just so you can see what it does. It's a bit rough to learn at first, but walking through an example this way will help.
Specific answers to questions:
What exactly are we calling here?
You're building a baseline that denotes all the roots you wish to start with, and then traversing all the levels under that root/folder. So in essence you're crawling the entire structure via hc.parent = q.id.
Everything before the UNION, the whole CTE?
The whole CTE, by name. On each pass, though, the reference to q only sees the rows produced by the previous pass, starting with the rows from the part before the UNION ALL. Recursion is powerfully cool stuff!
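If you don't want to experiment against the production table, here is a small self-contained sketch of the same anchor-plus-recursive-part shape, run through Python's built-in sqlite3 module as a stand-in for SQL Server. The table is cut down to three columns, and the folder names and the helper subtree are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DM_FolderTree (id INTEGER PRIMARY KEY, parent INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO DM_FolderTree VALUES (?, ?, ?)",
    # Hypothetical folders: 1 is a root with children 2 and 4; 3 sits under 2; 5 is unrelated.
    [(1, None, "root"), (2, 1, "docs"), (3, 2, "2024"), (4, 1, "images"), (5, None, "other")],
)

SUBTREE = """
WITH RECURSIVE q(id, parent, title, level) AS (
    -- anchor: runs once and yields the starting folder at level 0
    SELECT id, parent, title, 0 FROM DM_FolderTree WHERE id = :start
    UNION ALL
    -- recursive part: q here means the rows found on the previous pass
    SELECT hc.id, hc.parent, hc.title, q.level + 1
    FROM q JOIN DM_FolderTree AS hc ON hc.parent = q.id
)
SELECT id, title, level FROM q ORDER BY level, id
"""

def subtree(start):
    """Return (id, title, level) for the start folder and everything below it."""
    return conn.execute(SUBTREE, {"start": start}).fetchall()

for row in subtree(1):
    print(row)
```

Starting from folder 1, the unrelated folder 5 never shows up, because it is never reachable through hc.parent = q.id. The recursion stops by itself once a pass finds no more children.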

Closure Table INSERT statement including the level/distance column

I'm referring Bill Karwin's presentation in order to implement a closure table which will help me manage hierarchies. Unfortunately, the presentation does not show how I could insert/update the Level column mentioned on slide 67; this would have been very useful. I've been giving it a thought but I couldn't come up with something concrete that I could test. Here's what I got so far:
create procedure USP_OrganizationUnitHierarchy_AddChild
@ParentId UNIQUEIDENTIFIER,
@NewChildId UNIQUEIDENTIFIER
AS
BEGIN
INSERT INTO [OrganizationUnitHierarchy]
(
[AncestorId],
[DescendantId],
[Level]
)
SELECT [AncestorId], @NewChildId, (here I need to get the count of ancestors that lead to the currently being selected ancestor through-out the tree)
FROM [OrganizationUnitHierarchy]
WHERE [DescendantId] = @ParentId
UNION ALL SELECT @NewChildId, @NewChildId
END
go
I am not sure how I could do that. Any ideas?
You know that for the self-referencing row (Ancestor = Descendant) you have Level = 0, and when you're copying the paths from the parent's ancestors, you're just increasing Level by 1:
create procedure USP_OrganizationUnitHierarchy_AddChild
@ParentId UNIQUEIDENTIFIER,
@NewChildId UNIQUEIDENTIFIER
AS
BEGIN
INSERT INTO [OrganizationUnitHierarchy]
(
[AncestorId],
[DescendantId],
[Level]
)
SELECT [AncestorId], @NewChildId, [Level] + 1
FROM [OrganizationUnitHierarchy]
WHERE [DescendantId] = @ParentId
UNION ALL
SELECT @NewChildId, @NewChildId, 0
END
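To see what the rows look like after a few inserts, here is a minimal sketch of the same INSERT ... SELECT run through Python's built-in sqlite3 module as a stand-in for SQL Server. Short text ids 'A', 'B', 'C' replace the UNIQUEIDENTIFIER values, and the helpers add_child and paths_to are made-up names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OrganizationUnitHierarchy (AncestorId TEXT, DescendantId TEXT, Level INTEGER)"
)

def add_child(parent_id, new_child_id):
    """Copy every path that ends at the parent (Level + 1) and add the self row (Level 0)."""
    conn.execute(
        """
        INSERT INTO OrganizationUnitHierarchy (AncestorId, DescendantId, Level)
        SELECT AncestorId, :child, Level + 1
        FROM OrganizationUnitHierarchy
        WHERE DescendantId = :parent
        UNION ALL
        SELECT :child, :child, 0
        """,
        {"parent": parent_id, "child": new_child_id},
    )

def paths_to(node):
    """All (ancestor, level) pairs for a node, nearest first."""
    return conn.execute(
        "SELECT AncestorId, Level FROM OrganizationUnitHierarchy "
        "WHERE DescendantId = ? ORDER BY Level",
        (node,),
    ).fetchall()

add_child(None, "A")  # root: a NULL parent matches no rows, so only the self row is added
add_child("A", "B")
add_child("B", "C")
print(paths_to("C"))
```

Each new node ends up with one row per ancestor plus its own Level 0 row, so Level is always the distance between the two ids.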