Fuzzy grouping in SQL - sql

I need to modify a SQL table to group slightly mismatched names, and assign all elements in the group a standardized name.
For instance, if the initial table looks like this:
Name
--------
Jon Q
John Q
Jonn Q
Mary W
Marie W
Matt H
I would like to create a new table or add a field to the existing one like this:
Name | StdName
--------------------
Jon Q | Jon Q
John Q | Jon Q
Jonn Q | Jon Q
Mary W | Mary W
Marie W | Mary W
Matt H | Matt H
In this case, I've chosen the first name to assign as the "standardized name," but I don't actually care which one is chosen -- ultimately the final "standardized name" will be hashed into a unique person ID. (I'm also open to alternative solutions that go directly to a numerical ID.) I will have birthdates to match on as well, so the accuracy of the name matching doesn't actually need to be all that precise in practice. I've looked into this a bit and will probably use the Jaro-Winkler algorithm (see e.g. here).
If I knew that the names were all in pairs, this would be a relatively easy query, but there can be an arbitrary number of the same name.
I can easily conceptualize how to do this query in a procedural language, but I'm not very familiar with SQL. Unfortunately I don't have direct access to the data -- it's sensitive data and so somebody else (a bureaucrat) has to run the actual query for me. The specific implementation will be SQL Server, but I'd prefer an implementation-agnostic solution.
EDIT:
In response to a comment, I had the following procedural approach in mind. It's in Python, and I replaced the Jaro-Winkler with simply matching on the first letter of the name, for the sake of having a working code example.
nameList = ['Jon Q', 'John Q', 'Jonn Q', 'Mary W', 'Marie W', 'Larry H']
stdList = nameList[:]
# loop over all names
for i1, name1 in enumerate(stdList):
# loop over later names in list to find matches
for i2, name2 in enumerate(stdList[i1+1:]):
# If there's a match, replace latter with former.
if (name1[0] == name2[0]):
stdList[i1+1+i2] = name1
print stdList
The result is ['Jon Q', 'Jon Q', 'Jon Q', 'Mary W', 'Mary W', 'Larry H'].

Just a thought, but you might be able to use the SOUNDEX() function. This will create a value for the names that are similar.
If you started with something like this:
select name, soundex(name) snd,
row_number() over(partition by soundex(name)
order by soundex(name)) rn
from yt;
See SQL Fiddle with Demo. Which would give a result for each row that is similar along with a row_number() so you could return only the first value for each group. For example, the above query will return:
| NAME | SND | RN |
-----------------------
| Jon Q | J500 | 1 |
| John Q | J500 | 2 |
| Jonn Q | J500 | 3 |
| Matt H | M300 | 1 |
| Mary W | M600 | 1 |
| Marie W | M600 | 2 |
Then you could select all of the rows from this result where the row_number() is equal to 1 and then join back to your main table on the soundex(name) value:
select t1.name,
t2.Stdname
from yt t1
inner join
(
select name as stdName, snd, rn
from
(
select name, soundex(name) snd,
row_number() over(partition by soundex(name)
order by soundex(name)) rn
from yt
) d
where rn = 1
) t2
on soundex(t1.name) = t2.snd;
See SQL Fiddle with Demo. This gives a result:
| NAME | STDNAME |
---------------------
| Jon Q | Jon Q |
| John Q | Jon Q |
| Jonn Q | Jon Q |
| Mary W | Mary W |
| Marie W | Mary W |
| Matt H | Matt H |

Assuming you copy and paste the jaro-winkler implementation from SSC (registration required), the following code will work. I tried to build a SQLFiddle for it but it kept going belly up when I was building the schema.
This implementation has a cheat---I'm using a cursor. Generally, cursors are not conducive to performance but in this case, you need to be able to compare the set against itself. There's probably a graceful number/tally table approach to eliminate the declared cursor.
DECLARE #SRC TABLE
(
source_string varchar(50) NOT NULL
, ref_id int identity(1,1) NOT NULL
);
-- Identify matches
DECLARE #WORK TABLE
(
source_ref_id int NOT NULL
, match_ref_id int NOT NULL
);
INSERT INTO
#src
SELECT 'Jon Q'
UNION ALL SELECT 'John Q'
UNION ALL SELECT 'JOHN Q'
UNION ALL SELECT 'Jonn Q'
-- Oops on matching joan to jon
UNION ALL SELECT 'Joan Q'
UNION ALL SELECT 'june'
UNION ALL SELECT 'Mary W'
UNION ALL SELECT 'Marie W'
UNION ALL SELECT 'Matt H';
-- 2 problems to address
-- duplicates in our inbound set
-- duplicates against a reference set
--
-- Better matching will occur if names are split into ordinal entities
-- Splitting on whitespace is always questionable
--
-- Mat, Matt, Matthew
DECLARE CSR CURSOR
READ_ONLY
FOR
SELECT DISTINCT
S1.source_string
, S1.ref_id
FROM
#SRC AS S1
ORDER BY
S1.ref_id;
DECLARE #source_string varchar(50), #ref_id int
OPEN CSR
FETCH NEXT FROM CSR INTO #source_string, #ref_id
WHILE (##fetch_status <> -1)
BEGIN
IF (##fetch_status <> -2)
BEGIN
IF NOT EXISTS
(
SELECT * FROM #WORK W WHERE W.match_ref_id = #ref_id
)
BEGIN
INSERT INTO
#WORK
SELECT
#ref_id
, S.ref_id
FROM
#src S
-- If we have already matched the value, skip it
LEFT OUTER JOIN
#WORK W
ON W.match_ref_id = S.ref_id
WHERE
-- Don't match yourself
S.ref_id <> #ref_id
-- arbitrary threshold, will need to examine this for sanity
AND dbo.fn_calculateJaroWinkler(#source_string, S.source_string) > .95
END
END
FETCH NEXT FROM CSR INTO #source_string, #ref_id
END
CLOSE CSR
DEALLOCATE CSR
-- Show me the list of all the unmatched rows
-- plus the retained
;WITH MATCHES AS
(
SELECT
S1.source_string
, S1.ref_id
, S2.source_string AS match_source_string
, S2.ref_id AS match_ref_id
FROM
#SRC S1
INNER JOIN
#WORK W
ON W.source_ref_id = S1.ref_id
INNER JOIN
#SRC S2
ON S2.ref_id = W.match_ref_id
)
, UNMATCHES AS
(
SELECT
S1.source_string
, S1.ref_id
, NULL AS match_source_string
, NULL AS match_ref_id
FROM
#SRC S1
LEFT OUTER JOIN
#WORK W
ON W.source_ref_id = S1.ref_id
LEFT OUTER JOIN
#WORK S2
ON S2.match_ref_id = S1.ref_id
WHERE
W.source_ref_id IS NULL
and s2.match_ref_id IS NULL
)
SELECT
M.source_string
, M.ref_id
, M.match_source_string
, M.match_ref_id
FROM
MATCHES M
UNION ALL
SELECT
M.source_string
, M.ref_id
, M.match_source_string
, M.match_ref_id
FROM
UNMATCHES M;
-- To specifically solve your request
SELECT
S.source_string AS Name
, COALESCE(S2.source_string, S.source_string) As StdName
FROM
#SRC S
LEFT OUTER JOIN
#WORK W
ON W.match_ref_id = S.ref_id
LEFT OUTER JOIN
#SRC S2
ON S2.ref_id = W.source_ref_id
query output 1
source_string ref_id match_source_string match_ref_id
Jon Q 1 John Q 2
Jon Q 1 JOHN Q 3
Jon Q 1 Jonn Q 4
Jon Q 1 Joan Q 5
june 6 NULL NULL
Mary W 7 NULL NULL
Marie W 8 NULL NULL
Matt H 9 NULL NULL
query output 2
Name StdName
Jon Q Jon Q
John Q Jon Q
JOHN Q Jon Q
Jonn Q Jon Q
Joan Q Jon Q
june june
Mary W Mary W
Marie W Marie W
Matt H Matt H
There be dragons
Over on SuperUser, I talked about my experience matching people. In this section, I'll list some things to be aware of.
Speed
As part of your matching, hooray in that you have a birthday to augment the match process. I would actually propose you generate a match based exclusively on birthdate first. That is an exact match and one that, with a proper index, SQL Server will be able to quickly include/exclude rows. Because you're going to need it. The TSQL implementation is dog slow. I've been running the equivalent match against a dataset of 28k names (names that had been listed as conference attendees). There ought to be some good overlap there and while I did fill #src with data, it is a table variable with all that that implies but it's been running now for 15 minutes and still hasn't completed.
It's slow for a number of reasons but things that jumped out at me are all the looping and string manipulation in the functions. That is not where SQL Server shines. If you have a need to do a lot of this, it might be a good idea to convert them into CLR methods so at least you can leverage the strength of the .NET libraries for some of the manipulations.
One of the matches we used to use was the Double Metaphone and it would generate a pair of possible phonetic interpretations of the name. Instead of computing that every time, compute it once and store it alongside the name. That would help speed some of the matching. Unfortunately, it doesn't look like JW lends itself to breaking it down like that.
Look at iterating too. We'd first try the algs that we knew were fast. 'John' = 'John' so there's no need to pull out the big guns so we'd try a first pass of straight name checks. If we didn't find a match, we'd try harder. The hope was that by taking various swipes at matching we'd get the low hanging fruit as fast as possible and worry about the harder matches later.
Names
In my SU answer and in the code comments, I mention nicknames. Bill and Billy are going to match. Billy, Liam and William are definitely not going to match even though they may be the same person. You might want to look at a list like this to provide translation between nickname and full name. After running a set of matches on the supplied name, maybe we'd try looking for a match based on the possible root name.
Obviously, there are draw backs to this approach. For example, my grandfather-in-law is Max. Just Max. Not Maximilian, Maximus or any other things you might thing.
Your supplied names look like it's first and last concatenated together. Future readers, if you ever have the opportunity to capture individual portions of a name, please do so. There are products out there that will split names and try to match them up against directories to try and guess whether something is first/middle name or a surname but then you have people like "Robar Mike". If you saw that name there, you'd think Robar is a last name and you'd also pronounce it like "robber." Instead, Robar (say it with a French accent) is his first name and Mike is his last name. At any rate, I think you'll have a better matching experience if you can split first and last out into separate fields and match the individual pieces together. An exact last name match plus a partial first name match might suffice, especially in cases where legally they are "Franklin Roosevelt" and you have a candidate of "F. Roosevelt" Perhaps you have a rule that an initial letter can match. Or you don't.
Noise - as referenced in the JW post and my answer, strip out crap (punctuation, stop words, etc) for matching purposes. Also watch out for honorific tites (phd, jd, etc) and generationals (II, III, JR, SR). Our rule was a candidate with/without a generational could match one in the opposite state (Bob Jones Jr == Bob Jones) or could exactly match the generation (Bob Jones Sr = Bob Jones Sr) but you'd never want to match if both records supplied them and they were conflicting (Bob Jones Sr != Bob Jones Jr).
Case sensitivity, always check your database and tempdb to make sure you aren't making case sensitive matches. And if you are, convert everything to upper or lower for purposes of matching but don't ever throw the supplied casing away. Good luck trying to determine whether latessa should be Latessa, LaTessa or something else.
My query is coming up on a hour's worth of processing with no rows returned so I'm going to kill it and turn in. Best of luck, happy matching.

Related

SQL subquery as part of LIKE search

From this Question we learned to use a subquery to find information once-removed.
Subquery we learned :
SELECT * FROM papers WHERE writer_id IN ( SELECT id FROM writers WHERE boss_id = 4 );
Now, I need to search a table, both in column values that table, and in column values related by id on another table.
Here are the same tables, but col values contain more text for our "searching" reference...
writers :
id
name
boss_id
1
John Jonno
2
2
Bill Bosworth
2
3
Andy Seaside
4
4
Hank Little
4
5
Alex Crisp
4
The writers have papers they write...
papers :
id
title
writer_id
1
Boston
1
2
Chicago
4
3
Cisco
3
4
Seattle
2
5
North
5
I can use this to search only the names on writers...
Search only writers.name : (Not what I want to do)
SELECT * FROM writers WHERE LOWER(name) LIKE LOWER('%is%');
Output for above search : (Not what I want to do)
id
name
boss_id
5
Alex Crisp
4
I want to return cols from writers (not papers), but searching text both in writers.name and the writers.id-associated papers.title.
For example, if I searched "is", I would get both:
Alex Crisp (for 'is' in the name 'Crisp')
Andy Seaside (because Andy wrote a paper with 'is' in the title 'Cisco')
Output for "is" search :
id
title
writer_id
2
Chicago
4
4
Seattle
2
Here's what I have that doesn't work:
SELECT * FROM papers WHERE LOWER(title) LIKE LOWER('%is%') OR writer_id ( writers=writer_id WHERE LOWER(name) LIKE LOWER('%$is%') );
The best way to express this criteria is by using a correlated query with exists:
select *
from writers w
where Lower(w.name) like '%is%'
or exists (
select * from papers p
where p.writer_id = w.id and Lower(p.title) like '%is%'
);
Note you don't need to use lower on the string you are providing, and you should only use lower if your collation truly is case-sensitive as using the function makes the search predicate unsargable.
Since you want to return cols from writers (not papers) you should select them first, and use stuff from papers in the criteria
select *
from writers w
where
w.name like '%is%'
or
w.id in (select p.writer_id
paper p
where p.title like '%is%'
)
You can add your LOWER functions (my sql environment is not case-sensitive, so I didn't need them)

Strict Match Many to One on Lookup Table

This has been driving me and my team up the wall. I cannot compose a query that will strict match a single record that has a specific permutation of look ups.
We have a single lookup table
room_member_lookup:
room | member
---------------
A | Michael
A | Josh
A | Kyle
B | Kyle
B | Monica
C | Michael
I need to match a room with an exact list of members but everything else I've tried on stack overflow will still match room A even if I ask for a room with ONLY Josh and Kyle
I've tried queries like
SELECT room FROM room_member_lookup
WHERE member IN (Josh, Michael)
GROUP BY room
HAVING COUNT(1) = 2
However this will still return room A even though that has 3 members I need a exact member permutation and that matches the room even not partials.
SELECT room
FROM room_member_lookup a
WHERE member IN ('Monica', 'Kyle')
-- Make sure that the room 'a' has exactly two members
and (select count(*)
from room_member_lookup b
where a.room=b.room)=2
GROUP BY room
-- and both members are in that room
HAVING COUNT(1) = 2
Depending on the SQL dialect, one can build a dynamic table (CTE or select .. union all) to hold the member set (Monica and Kyle, for example), and then look for set equivalence using MINUS/EXCEPT sql operators.

Store multidimensional array in database: relational or multidimensional?

I have read numerous posts along the lines of multidimensional to single dimension, multidimensional database, and so on, but none of the answers helped. I did find a lot of documentation on Google but that only provided background information and didn't answer the question at hand.
I have a lot of strings that are related to one another. They are needed in a PHP script. The structure is hierarchical. Here is an example.
A:
AA:
AAA
AAC
AB
AE:
AEA
AEE:
AEEB
B:
BA:
BAA
BD:
BDC:
BDCB
BDCE
BDD:
BDDA
BE:
BED:
BEDA
C:
CC:
CCB:
CCBC
CCBE
CCC:
CCCA
CCCE
CE
Each indent supposes a new level in the multidimensional array.
The goal is to retrieve an element with PHP by name and all its descendants. If for instance I query for A, I want to receive an array of string containing array('A', 'AA', 'AAA', 'AAC', 'AB', 'AE', 'AEA', 'AEE', 'AEEB'). The 'issue' is that queries can also be made to lower-level elements. If I query AEE, I want to get array('AEE', 'AEEB').
As I understand the concept of relational databases, this means that I cannot use a relational database because there is no common 'key' between elements. The solution that I thought is possible, is assigning PARENT elements to each cell. So, in a table:
CELL | PARENT
A NULL
AA A
AAA AA
AAC AA
AB A
AE A
AEA AE
AEE AE
AEEB AEE
By doing so, I think you should be able to query the given string, and all items that share this parent, and then recursively go down this path until no more items are found. However, this seems rather slow to me because the whole search space would need to be looked through on each level - which is exactly what you don't want in a multidimensional array.
So I am a bit at loss. Note that there are actually around 100,000 strings structured in this way, so speed is important. Luckily the database is static and would not change. How can I store such a data structure in a database without having to deal with long loops and search times? And which kind of database software and data type is best suited for this? It has come to my attention that PostgreSQL is already present on our servers so I'd rather stick with that.
As I said I am new to databases but I am very eager to learn. Therefore, I am looking for an extensive answer that goes into detail and provides advantages and disadvantages of a certain approach. Performance is key. An expected answer would contain the best database type and language for this use case, and also script in that language to build such a structure.
The goal is to retrieve an element with PHP by name and all its descendants.
If that is all you need, you can use a LIKE search
SELECT *
FROM Table1
WHERE CELL LIKE 'AEE%';
With an index beginning with CELL this is a range check, which is fast.
If your data doesn't look like that, you can create a path column which looks like a directory path and contains all nodes "on the way/path" from root to the element.
| id | CELL | parent_id | path |
|====|======|===========|==========|
| 1 | A | NULL | 1/ |
| 2 | AA | 1 | 1/2/ |
| 3 | AAA | 2 | 1/2/3/ |
| 4 | AAC | 2 | 1/2/4/ |
| 5 | AB | 1 | 1/5/ |
| 6 | AE | 1 | 1/6/ |
| 7 | AEA | 6 | 1/6/7/ |
| 8 | AEE | 6 | 1/6/8/ |
| 9 | AEEB | 8 | 1/6/8/9/ |
To retrieve all descendants of 'AE' (including itself) your query would be
SELECT *
FROM tree t
WHERE path LIKE '1/6/%';
or (MySQL specific concatenation)
SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = 'AE'
AND t.path LIKE CONCAT(r.path, '%');
Result:
| id | CELL | parent_id | path |
|====|======|===========|==========|
| 6 | AE | 1 | 1/6/ |
| 7 | AEA | 6 | 1/6/7/ |
| 8 | AEE | 6 | 1/6/8/ |
| 9 | AEEB | 8 | 1/6/8/9/ |
Demo
Performance
I have created 100K rows of fake data on MariaDB with the sequence plugin using the following script:
drop table if exists tree;
CREATE TABLE tree (
`id` int primary key,
`CELL` varchar(50),
`parent_id` int,
`path` varchar(255),
unique index (`CELL`),
unique index (`path`)
);
DROP TRIGGER IF EXISTS `tree_after_insert`;
DELIMITER //
CREATE TRIGGER `tree_after_insert` BEFORE INSERT ON `tree` FOR EACH ROW BEGIN
if new.id = 1 then
set new.path := '1/';
else
set new.path := concat((
select path from tree where id = new.parent_id
), new.id, '/');
end if;
END//
DELIMITER ;
insert into tree
select seq as id
, conv(seq, 10, 36) as CELL
, case
when seq = 1 then null
else floor(rand(1) * (seq-1)) + 1
end as parent_id
, null as path
from seq_1_to_100000
;
DROP TRIGGER IF EXISTS `tree_after_insert`;
-- runtime ~ 4 sec.
Tests
Count all elements under the root:
SELECT count(*)
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = '1'
AND t.path LIKE CONCAT(r.path, '%');
-- result: 100000
-- runtime: ~ 30 ms
Get subtree elements under a specific node:
SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = '3B0'
AND t.path LIKE CONCAT(r.path, '%');
-- runtime: ~ 30 ms
Result:
| id | CELL | parent_id | path |
|=======|======|===========|=====================================|
| 4284 | 3B0 | 614 | 1/4/11/14/614/4284/ |
| 6560 | 528 | 4284 | 1/4/11/14/614/4284/6560/ |
| 8054 | 67Q | 6560 | 1/4/11/14/614/4284/6560/8054/ |
| 14358 | B2U | 6560 | 1/4/11/14/614/4284/6560/14358/ |
| 51911 | 141Z | 4284 | 1/4/11/14/614/4284/51911/ |
| 55695 | 16Z3 | 4284 | 1/4/11/14/614/4284/55695/ |
| 80172 | 1PV0 | 8054 | 1/4/11/14/614/4284/6560/8054/80172/ |
| 87101 | 1V7H | 51911 | 1/4/11/14/614/4284/51911/87101/ |
PostgreSQL
This also works for PostgreSQL. Only the string concatenation syntax has to be changed:
SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = 'AE'
AND t.path LIKE r.path || '%';
Demo: sqlfiddle - rextester
How does the search work
If you look at the test example, you'll see that all paths in the result begin with '1/4/11/14/614/4284/'. That is the path of the subtree root with CELL='3B0'. If the path column is indexed, the engine will find them all efficiently, because the index is sorted by path. It's like you would want to find all the words that begin with 'pol' in a dictionary with 100K words. You wouldn't need to read the entire dictionary.
Performance
As others have already mentioned, performance shouldn't be an issue as long as you use a suitable indexed primary key and ensure that relations use foreign keys. In general, an RDBMS is highly optimised to efficiently perform joins on indexed columns and referential integrity can also provide the advantage of preventing orphans. 100,000 may sound a lot of rows but this isn't going to stretch an RDBMS as long as the table structure and queries are well designed.
Choice of RDBMS
One factor in answering this question lies in choosing a database with the ability to perform a recursive query via a Common Table Expression (CTE), which can be very useful to keep the queries compact or essential if there are queries that do not limit the number of descendants being traversed.
Since you've indicated that you are free to choose the RDBMS but it must run under Linux, I'm going to throw PostgreSQL out there as a suggestion since it has this feature and is freely available. (This choice is of course very subjective and there are advantages and disadvantages of each but a few other contenders I'd be tempted to rule out are MySQL since it doesn't currently support CTEs, MariaDB since it doesn't currently support *recursive* CTEs, SQL Server since it doesn't currently support Linux. Other possibilities such as Oracle may be dependent on budget / existing resources.)
SQL
Here's an example of the SQL you'd write to perform your first example of finding all the descendants of 'A':
WITH RECURSIVE rcte AS (
SELECT id, letters
FROM cell
WHERE letters = 'A'
UNION ALL
SELECT c.id, c.letters
FROM cell c
INNER JOIN rcte r
ON c.parent_cell_id = r.id
)
SELECT letters
FROM rcte
ORDER BY letters;
Explanation
The above SQL sets up a "Common Table Expression", i.e. a SELECT to run whenever its alias (in this case rcte) is referenced. The recursion happens because this is referenced within itself. The first part of the UNION picks the cell at the top of the hierarchy. Its descendants are all found by carrying on joining on children in the second part of the UNION until no further records are found.
Demo
The above query can be seen in action on the sample data here: http://rextester.com/HVY63888
You absolutely can do that (if I've read your question correctly).
Depending on your RDBMS you might have to choose a different way.
Your basic structure of having a parent is correct.
SQL Server use recursive common table expression (CTE) to anchor the start and work down
https://technet.microsoft.com/en-us/library/ms186243(v=sql.105).aspx
Edit: For Linux use the same in PostgreSQL https://www.postgresql.org/docs/current/static/queries-with.html
Oracle has a different approach, though I think you might be able to use the CTE as well.
https://oracle-base.com/articles/misc/hierarchical-queries
For 100k rows I don't imagine performance will be an issue, though I'd still index PK & FK because that's the right thing to do. If you're really concerned about speed then reading it into memory and building a hash table of linked lists might work.
Pros & cons - it pretty much comes down to readability and suitability for your RDBMS.
It's an already solved problem (again, assuming I've not missed anything) so you'll be fine.
I have two words for you... "RANGE KEYS"
You may find this technique to be incredibly powerful and flexible. You'll be able to navigate your hierarchies with ease, and support variable depth aggregation without the need for recursion.
In the demonstration below, we'll build the hierarchy via a recursive CTE. For larger hierarchies 150K+, I'm willing to share a much faster build in needed.
Since your hierarchies are slow moving (like mine), I tend to store them in a normalized structure and rebuild as necessary.
How about some actual code?
Declare #YourTable table (ID varchar(25),Pt varchar(25))
Insert into #YourTable values
('A' ,NULL),
('AA' ,'A'),
('AAA' ,'AA'),
('AAC' ,'AA'),
('AB' ,'A'),
('AE' ,'A'),
('AEA' ,'AE'),
('AEE' ,'AE'),
('AEEB','AEE')
Declare #Top varchar(25) = null --<< Sets top of Hier Try 'AEE'
Declare #Nest varchar(25) ='|-----' --<< Optional: Added for readability
IF OBJECT_ID('TestHier') IS NOT NULL
Begin
Drop Table TestHier
End
;with cteHB as (
Select Seq = cast(1000+Row_Number() over (Order by ID) as varchar(500))
,ID
,Pt
,Lvl=1
,Title = ID
From #YourTable
Where IsNull(#Top,'TOP') = case when #Top is null then isnull(Pt,'TOP') else ID end
Union All
Select cast(concat(cteHB.Seq,'.',1000+Row_Number() over (Order by cteCD.ID)) as varchar(500))
,cteCD.ID
,cteCD.Pt
,cteHB.Lvl+1
,cteCD.ID
From #YourTable cteCD
Join cteHB on cteCD.Pt = cteHB.ID)
,cteR1 as (Select Seq,ID,R1=Row_Number() over (Order By Seq) From cteHB)
,cteR2 as (Select A.Seq,A.ID,R2=Max(B.R1) From cteR1 A Join cteR1 B on (B.Seq like A.Seq+'%') Group By A.Seq,A.ID )
Select B.R1
,C.R2
,A.ID
,A.Pt
,A.Lvl
,Title = Replicate(#Nest,A.Lvl-1) + A.Title
Into dbo.TestHier
From cteHB A
Join cteR1 B on A.ID=B.ID
Join cteR2 C on A.ID=C.ID
Order By B.R1
Show The Entire Hier I added the Title and Nesting for readability
Select * from TestHier Order By R1
Just to state the obvious, the Range Keys are R1 and R2. You may also notice that R1 maintains the presentation sequence. Leaf nodes are where R1=R2 and Parents or rollups define the span of ownership.
To Show All Descendants
Declare #GetChildrenOf varchar(25) = 'AE'
Select A.*
From TestHier A
Join TestHier B on B.ID=#GetChildrenOf and A.R1 Between B.R1 and B.R2
Order By R1
To Show Path
Declare #GetParentsOf varchar(25) = 'AEEB'
Select A.*
From TestHier A
Join TestHier B on B.ID=#GetParentsOf and B.R1 Between A.R1 and A.R2
Order By R1
Clearly these are rather simple illustrations. Over time, I have created a series of helper functions, both Scalar and Table Value Functions. I should also state that you should NEVER hard code range key in your work because they will change.
In Summary
If you have a point (or even a series of points), you'll have its range and therefore you'll immediately know where it resides and what rolls into it.
This approach does not depend on the existence of a path or parent column. It is relational not recursive.
Since the table is static create a materialized view containing just the leaves to make searching faster:
create materialized view leave as
select cell
from (
select cell,
lag(cell,1,cell) over (order by cell desc) not like cell || '%' as leave
from t
) s
where leave;
table leave;
cell
------
CCCE
CCCA
CCBE
CCBC
BEDA
BDDA
BDCE
BDCB
BAA
AEEB
AEA
AB
AAC
AAA
A materialized view is computed once at creation not at each query like a plain view. Create an index to speed it up:
create index cell_index on leave(cell);
If eventually the source table is altered just refresh the view:
refresh materialized view leave;
The search function receives text and returns a text array:
create or replace function get_descendants(c text)
returns text[] as $$
select array_agg(distinct l order by l)
from (
select left(cell, generate_series(length(c), length(cell))) as l
from leave
where cell like c || '%'
) s;
$$ language sql immutable strict;
Pass the desired match to the function:
select get_descendants('A');
get_descendants
-----------------------------------
{A,AA,AAA,AAC,AB,AE,AEA,AEE,AEEB}
select get_descendants('AEE');
get_descendants
-----------------
{AEE,AEEB}
Test data:
create table t (cell text);
insert into t (cell) values
('A'),
('AA'),
('AAA'),
('AAC'),
('AB'),
('AE'),
('AEA'),
('AEE'),
('AEEB'),
('B'),
('BA'),
('BAA'),
('BD'),
('BDC'),
('BDCB'),
('BDCE'),
('BDD'),
('BDDA'),
('BE'),
('BED'),
('BEDA'),
('C'),
('CC'),
('CCB'),
('CCBC'),
('CCBE'),
('CCC'),
('CCCA'),
('CCCE'),
('CE');
For your scenario, I would suggest you to use Nested Sets Approach in PostgreSQL. It is XML tags based querying using Relational database.
Performance
If you index on lft and rgt columns, then you don't require recursive queries to get the data. Even though, the data seems huge, the retrieval will be very fast.
Sample
/*1A:
2 AA:
3 AAA
4 AAC
5 AB
6 AE:
7 AEA
8 AEE:
9 AEEB
10B:
*/
CREATE TABLE tree(id int, CELL varchar(4), lft int, rgt int);
INSERT INTO tree
("id", CELL, "lft", "rgt")
VALUES
(1, 'A', 1, 9),
(2, 'AA', 2, 4),
(3, 'AAA', 3, 3),
(4, 'AAC', 4, 4),
(5, 'AB', 5, 5),
(6, 'AE', 6, 9),
(7, 'AEA', 7, 7),
(8, 'AEE', 8, 8),
(9, 'AEEB', 9, 9)
;
SELECT hc.*
FROM tree hp
JOIN tree hc
ON hc.lft BETWEEN hp.lft AND hp.rgt
WHERE hp.id = 2
Demo
Querying using Nested Sets approach

Using a Table-Valued Function to Turn a Single Row into Many Within Select

Simple enough question I think.
I have a dataset, quite large with a bit of free-text name data. I need to to link this to our employee table.
There's a whole set of different ways people have entered the 'owner' in to this fields (John Smith, J.Smith, John Smith (JSMITH), Company:John Smith/Client: John Smith, ect.)
Most of these are fine, but the problem I have is with the ones where multiple names have been entered. For example; "John Smith / Joe Bloggs".
I have a pre-created Table-Valued function which takes in a string and a delimiter, then returns a table with the results of the split.
dbo.Split('John Smith / Joe Bloggs')
id val
1 John Smith
2 Joe Bloggs
The issue I have is that I need these results to come back for each row within an existing dataset. So for example, my query selecting the Owner, RefNumber and OSProjectCode fro my 'ProjectActions' table containing the following data:
RefNumber OSProjectCode Owner
1 1234 Bill Baggins
2 1234 John Smith / Joe Bloggs
would come out looking like this:
RefNumber OSProjectCode Owner
1 1234 Bill Baggins
2 1234 John Smith
2 1234 Joe Bloggs
What I've tried to far is attempt to join on the results of the function - but unsurprisingly it wont let me send in the column from ProjectsActions into the function like that.
SELECT a.val AS [Owner], pa.[RefNumber], pa.[OSProjectCode]
FROM dbo.ProjectsActions pa
INNER JOIN dbo.Split(pa.[Owner], '/') a
Msg 4104, Level 16, State 1, Line 1
The multi-part identifier "pa.Owner" could not be bound.
The only way I can think of doing this, which seems a little too bulky and messy, is the below:
;with base as(
SELECT
pa.RefNumber
, pa.OSProjectCode
, (SELECT val FROM dbo.Eval(pa.Owner) WHERE id = 1) AS [First]
, (SELECT val FROM dbo.Eval(pa.Owner) WHERE id = 2) AS [Second]
FROM ProjectsActions pa
)
SELECT
a.RefNumber
, a.OSProjectCode
, a.First AS [Owner]
FROM base a WHERE a.First IS NOT NULL
UNION ALL
SELECT
b.RefNumber
, b.OSProjectCode
, b.Second AS [Owner]
FROM base b WHERE a.First IS NOT NULL
Surely there's a better way? Something more similar to my first attempt - joining to the results within each row?
Any feedback or ideas would be much appreciated.
Cheers,
Scott.
EDIT:
FYI if anyone comes accross this with a similar issue, but are missing the 'split' part - I use a function found elsewhere on stackoverflow. https://stackoverflow.com/a/14600765/1700309
You need to use an APPLY as your join.
SELECT
a.val AS [Owner],
pa.[RefNumber],
pa.[OSProjectCode]
FROM dbo.ProjectsActions pa
CROSS APPLY dbo.Split(pa.[Owner], '/') a
The CROSS APPLY acts like an INNER JOIN passing the row-level value into your table-valued function. If you expect split function returns NULL if it can't split the value (NULL, empty, etc), you can use OUTER APPLY so that the NULL won't drop that row out of your result set. You can also add a COALESCE to fall back to the [owner].
SELECT
COALESCE(a.val, pa.[Owner]) AS [Owner],
pa.[RefNumber],
pa.[OSProjectCode]
FROM dbo.ProjectsActions pa
OUTER APPLY dbo.Split(pa.[Owner], '/') a

Sql Ordering Hiarchy

I am working on a SQL Statement that I can't seem to figure out. I need to order the results alphabetically, however, I need "children" to come right after their "parent" in the order. Below is a simple example of the table and data I'm working with. All non relevant columns have been removed. I'm using SQL Server 2005. Is there an easy way to do this?
tblCats
=======
idCat | fldCatName | idParent
--------------------------------------
1 | Some Category | null
2 | A Category | null
3 | Top Category | null
4 | A Sub Cat | 1
5 | Sub Cat1 | 1
6 | Another Cat | 2
7 | Last Cat | 3
8 | Sub Sub Cat | 5
Results of Sql Statement:
A Category
Another Cat
Some Category
A Sub Cat1
Sub Cat 1
Sub Sub Cat
Top Category
Last Cat
(The prefixed spaces in the result are just to add in understanding of the results, I don't want the prefixed spaces in my sql result. The result only needs to be in this order.)
You can do it with a hierarchical query, as below.
It looks a lot more complicated than it is, due to the lack of a PAD funciton in t-sql. The seed of the hierarchy are the categories without parents. The fourth column we select is their ranking alphabetically (converted to a string and padded). Then we union this with their children. At each recursion, the children will all be at the same level, so we can get their ranking alphabetically without needing to partition. We can concatenate these rankings together down the tree, and order by that.
;WITH Hierarchy AS (
SELECT
idCat, fldCatName, idParent,
CAST(RIGHT('00000'+
CAST(ROW_NUMBER() OVER (ORDER BY fldCatName) AS varchar(8))
, 5)
AS varchar(256)) AS strPath
FROM Category
WHERE idParent IS NULL
UNION ALL
SELECT
c.idCat, c.fldCatName, c.idParent,
CAST(h.strPath +
CAST(RIGHT('00000'+
CAST(ROW_NUMBER() OVER (ORDER BY c.fldCatName) AS varchar(8))
, 5) AS varchar(16))
AS varchar(256))
FROM Hierarchy h
INNER JOIN Category c ON c.idParent = h.idCat
)
SELECT idCat, fldCatName, idParent, strPath
FROM Hierarchy
ORDER BY strPath
With your data:
idCat fldCatName idParent strPath
------------------------------------------------
2 A Category NULL 00001
6 Another Category 2 0000100001
1 Some Category NULL 00002
4 A Sub Category 1 0000200001
5 Sub Cat1 1 0000200002
8 Sub Sub Category 5 000020000200001
3 Top Category NULL 00003
7 Last Category 3 0000300001
It can be done in CTE... Is this what you're after ?
With MyCats (CatName, CatId, CatLevel, SortValue)
As
( Select fldCatName CatName, idCat CatId,
0 Level, Cast(fldCatName As varChar(200)) SortValue
From tblCats
Where idParent Is Null
Union All
Select c.fldCatName CatName, c.idCat CatID,
CatLevel + 1 CatLevel,
Cast(SortValue + '\' + fldCatName as varChar(200)) SortValue
From tblCats c Join MyCats p
On p.idCat = c.idParent)
Select CatName, CatId, CatLevel, SortValue
From MyCats
Order By SortValue
EDIT: (thx to Pauls' comment below)
If 200 characters is not enough to hold the longest concatenated string "path", then change the value to as high as is needed... you can make it as high as 8000
I'm not aware of any SQL Server (or Ansi-SQL) inherent support for this.
I don't supposed you'd consider a temp table and recursive stored procedure an "easy" way ? J
Paul's answer is excellent, but I thought I would throw in another idea for you. Joe Celko has a solution for this in his SQL for Smarties book (chapter 29). It involves maintaining a separate table containing the hierarchy info. Inserts, updates, and deletes are a little complicated, but selects are very fast.
Sorry I don't have a link or any code to post, but if you have access to this book, you may find this helpful.