Find missing entries in a SQL table conditional on criteria


I have modest simple SQL experience (using MS SQL server 2012 here) but this evades me. I wish to output distinct names from a table (previously successfully created from a join) which have some required entries missing, but conditional on the existence of another similar entry. For anyone who has location 90, I want to check they also have locations 10 and 20...
For example, consider this table:
Name |Number |Location
--------|-------|--------
Alice |136218 |90
Alice |136218 |10
Alice |136218 |20
Alice |136218 |40
Bob |121478 |10
Bob |121478 |90
Chris |147835 |20
Chris |147835 |90
Don |138396 |20
Don |138396 |10
Emma |136412 |10
Emma |136412 |20
Emma |136412 |90
Fred |158647 |90
Gay |154221 |90
Gay |154221 |10
Gay |154221 |30
So formally, I would like to obtain the Names (and Numbers) of those entries in the table who:
Have an entry at location 90
AND do not have all the other required location entries - in this case also 10 and 20.
So in the example above
Alice and Emma are not output by this query, they have entries for 90, 10 & 20 (all present and correct, we ignore the location 40 entry).
Don is not output by this query, he does not have an entry for location 90.
Bob and Gay are output by this query, they are both missing location 20 (we ignore Gay's location 30 entry).
Chris is output by this query, he is missing location 10.
Fred is output by this query, he is missing locations 10 & 20.
The desired query output is therefore something like:
Name |Number |Location
--------|-------|--------
Bob |121478 |20
Chris |147835 |10
Fred |158647 |10
Fred |158647 |20
Gay |154221 |20
I've tried a few approaches with left/right joins where B.Key is null, and select from ... except but so far I can't quite get the logical approach correct. In the original table there are hundreds of thousands of entries and only a few tens of valid missing matches. Unfortunately I can't use anything that counts entries as the query has to be locations specific and there are other valid table entries at other locations outside of the desired ones.
I feel that the correct way to do this is something like a left outer join, but since the starting table is itself the output of another join, does this require declaring an intermediate table and then outer joining that intermediate table with itself? Note there is no requirement to fill in any gaps or enter items into the table.
Any advice would be very much appreciated.
===Answered and used code pasted here===
--STEP 0: Create a CTE of all valid actual data in the ranges that we want
WITH ValidSplits AS
(
    SELECT DISTINCT C.StartNo, S.ChipNo, S.TimingPointId
    FROM Splits AS S INNER JOIN Competitors AS C
        ON S.ChipNo = C.ChipNo
        AND (
            S.TimingPointId IN (SELECT TimingPointId FROM @TimingPointCheck)
            OR
            S.TimingPointId = @TimingPointMasterCheck
        )
),
--STEP 1: Create a CTE of the actual data that is specific to the precondition of passing @TimingPointMasterCheck
MasterSplits AS
(
    SELECT DISTINCT StartNo, ChipNo, TimingPointId
    FROM ValidSplits
    WHERE TimingPointId = @TimingPointMasterCheck
)
--STEP 2: Create table of the other data we wish to see, i.e. a representation of the StartNo, ChipNo and TimingPointId of the finishers at the locations in @TimingPointCheck
--The key part here is the CROSS JOIN which makes a copy of every Start/ChipNo for every TimingPointId
SELECT StartNo, ChipNo, Missing.TimingPointId
FROM MasterSplits
CROSS JOIN (SELECT * FROM @TimingPointCheck) AS Missing(TimingPointId)
EXCEPT
SELECT StartNo, ChipNo, TimingPointId FROM ValidSplits
ORDER BY StartNo

Welcome to Stack Overflow.
What you need is a bit challenging, since you want to see data that does not exist.
Thus, we must first create all possible rows, then subtract the ones that exist:
select ppl_with_90.Name,ppl_with_90.Number,search_if_miss.Location
from
(
select distinct Name,Number
from yourtable t
where Location=90
)ppl_with_90 -- All Name/Numbers that have the 90
cross join (values (10),(20)) as search_if_miss(Location) -- For all the previous, combine them with both 10 and 20
except -- remove the lines already existing
select *
from yourtable
where Location in (10,20)
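To check the approach against the sample table from the question, here is a minimal runnable sketch using Python's built-in sqlite3 module. SQLite stands in for SQL Server only so that the example is self-contained. The table name `entries` and the helper `missing_locations` are made up for illustration, and a `required` CTE replaces the `VALUES ... AS alias(column)` derived table, which SQLite does not accept in that form.

```python
import sqlite3

# Sample rows copied from the question: (Name, Number, Location)
ROWS = [
    ("Alice", 136218, 90), ("Alice", 136218, 10), ("Alice", 136218, 20),
    ("Alice", 136218, 40), ("Bob", 121478, 10), ("Bob", 121478, 90),
    ("Chris", 147835, 20), ("Chris", 147835, 90), ("Don", 138396, 20),
    ("Don", 138396, 10), ("Emma", 136412, 10), ("Emma", 136412, 20),
    ("Emma", 136412, 90), ("Fred", 158647, 90), ("Gay", 154221, 90),
    ("Gay", 154221, 10), ("Gay", 154221, 30),
]

# Build every (person with 90) x (required location) row, then subtract
# the rows that really exist. ORDER BY uses column positions because the
# query is a compound SELECT.
QUERY = """
WITH required(Location) AS (VALUES (10), (20))
SELECT p.Name, p.Number, r.Location
FROM (SELECT DISTINCT Name, Number FROM entries WHERE Location = 90) AS p
CROSS JOIN required AS r
EXCEPT
SELECT Name, Number, Location FROM entries
ORDER BY 1, 3
"""


def missing_locations(rows):
    """Load rows into an in-memory table (hypothetical name) and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (Name TEXT, Number INTEGER, Location INTEGER)"
    )
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


missing = missing_locations(ROWS)
for row in missing:
    print(row)
```

The rows printed match the desired output in the question: Bob and Gay lack 20, Chris lacks 10, and Fred lacks both.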

You need to generate the sets consisting of name, number, 10_and_20 for all rows where location = 90. You can then use your favorite method (left join + null, not exists, not in) to filter the rows that do not exist:
WITH name_number_location AS (
SELECT t.Name, t.Number, v.Location
FROM #yourdata AS t
CROSS JOIN (VALUES (10), (20)) AS v(Location)
WHERE t.Location = 90
)
SELECT *
FROM name_number_location AS r
WHERE NOT EXISTS (
SELECT *
FROM #yourdata AS t
WHERE r.Name = t.Name AND r.Location = t.Location
)
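The "left join + null" variant mentioned above can be sketched the same way. This is a minimal example run through Python's sqlite3 module rather than SQL Server; the table name `entries`, the reduced sample data, and the CTE names `required` and `wanted` are assumptions made for illustration.

```python
import sqlite3

# Reduced sample: Alice is complete, Bob lacks 20, Fred lacks 10 and 20,
# Don has no 90 entry and must therefore be ignored.
ROWS = [
    ("Alice", 136218, 90), ("Alice", 136218, 10), ("Alice", 136218, 20),
    ("Bob", 121478, 10), ("Bob", 121478, 90),
    ("Fred", 158647, 90),
    ("Don", 138396, 10), ("Don", 138396, 20),
]

# wanted = every required (Name, Number, Location) for people who have 90.
# The LEFT JOIN keeps all wanted rows; those without a real match come
# back with NULLs on the right-hand side and are the missing entries.
QUERY = """
WITH required(Location) AS (VALUES (10), (20)),
wanted AS (
    SELECT DISTINCT t.Name, t.Number, r.Location
    FROM entries AS t
    CROSS JOIN required AS r
    WHERE t.Location = 90
)
SELECT w.Name, w.Number, w.Location
FROM wanted AS w
LEFT JOIN entries AS t
       ON t.Name = w.Name AND t.Location = w.Location
WHERE t.Name IS NULL
ORDER BY w.Name, w.Location
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (Name TEXT, Number INTEGER, Location INTEGER)")
conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", ROWS)
missing = conn.execute(QUERY).fetchall()
conn.close()

for row in missing:
    print(row)
```

Like the NOT EXISTS query, this matches on Name and Location only; if names are not unique in the real data, joining on Number as well would be the safer choice.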

Related

Aggregating or Bundle a Many to Many Relationship in SQL Developer

So I have 1 single table with 2 columns : Sales_Order called ccso, Arrangement called arrmap
The table has distinct values for this combination and both these fields have a Many to Many relationship
1 ccso can have Multiple arrmap
1 arrmap can have Multiple ccso
All such combinations should be considered as one single bundle
Objective :
Assign a final map to each of the Sales Order as the Largest Arrangement in that Bundle
Example:
ccso : 100-10015 has 3 arrangements --> Now each of those arrangements have a set of Sales Orders --> Now those sales orders will also have a list of other arrangements and so on
(Image : 1)
Therefore the answer definitely points to some kind of recursive check. I've managed to write the code below, and it works as long as I hard-code a ccso in the where clause, but I don't know how to proceed from here. (I'm an accountant by profession but have recently been finding more passion in coding.) I've searched the forums and the web for things like
Recursive CTEs,
many to many aggregation
cartesian product etc
and I'm sure there must be a term for this which I don't know yet. I've also tried
I have to use sqldeveloper or googlesheet query and filter formulas
sqldeveloper has restrictions on some CTEs. If recursion is the way to go, I'd like to know how to do it and whether I can limit the depth to, say, 4 or 5 iterations
Ideally I'd want to update a third column with the final map if possible but if not, then a select query result is just fine
Codes I've tried
Code 1: As per Screenshot
WITH a1(ccso, amap) AS
(SELECT distinct a.ccso, a.arrmap
FROM rg_consol_map2 A
WHERE a.ccso = '100-10115' -- this condition defines the ultimate ancestors in your chain, change it as appropriate
UNION ALL
SELECT m.ccso, m.arrmap
FROM rg_consol_map2 m
JOIN a1
ON M.arrmap = a1.amap -- or m.ccso=a1.ccso
) /*if*/ CYCLE amap SET nemap TO 1 /*else*/ DEFAULT 0
SELECT DISTINCT amap FROM (SELECT ccso, amap FROM a1 ORDER BY 1 DESC) WHERE ROWNUM = 1
In this the main challenge is how to remove the hardcoded ccso and do a join for each of the ccso
Code 2 : Manual CTEs for depth
Here again the join outside the CTE gives me an error and sqldeveloper does not allow WITH clause with UPDATE statement - only works for select and cannot be enclosed within brackets as subtable
SELECT distinct ccso FROM
(
WITH ar1 AS
(SELECT distinct arrmap
FROM rg_consol_map
WHERE ccso = a.ccso
)
,so1 AS
(SELECT DISTINCT ccso
FROM rg_consol_map
WHERE arrmap IN (SELECT arrmap FROM ar1)
)
,ar2 AS
(SELECT DISTINCT ccso FROM rg_consol_map
where arrmap IN (select distinct arrmap FROM rg_consol_map
WHERE ccso IN (SELECT ccso FROM so1)
))
SELECT ar1.arrmap, NULL ccso FROM ar1
union all
SELECT null, ar2.ccso FROM ar2
UNION ALL
SELECT NULL arrmap, so1.ccso FROM so1
)
Am I Missing something here or is there an easier way to do this? I read something about MERGE and PROC SQL JOIN but was unable to get them to work but if that's the way to go ahead I will try further if someone can point me in the direction
(Image : 2)
(CSV File : [3])
Edit : Fixing CSV file link
https://github.com/karan360note/karanstackoverflow.git
I suppose can be downloaded from here IC mapping many to many.csv
Oracle 11g version is being used

Apologies in advance for the wall of text.
Your problem is a complex, multi-layered Many-to-Many query; there is no "easy" solution to this, because that is not a terribly ideal design choice. The safest bet does literally include multiple layers of CTEs or subqueries in order to reach all the depths you want, as the only ways I know to do so recursively rely on an anchor column (like "parentID") to direct the recursion in a linear fashion. We don't have that option here; we'd go in circles without a way to track our path.
Therefore, I went basic, and with several subqueries. Every level checks for a) All orders containing a particular ARRMAP item, and then b) All additional items on those orders. It's clear enough for you to see the logic and modify to your needs. It will generate a new table that contains the original CCSO, the linking ARRMAP, and the related CCSO. Link: https://pastebin.com/un70JnpA
This should enable you to go back and perform the desired updates you want, based on order # or order date, etc... in a much more straightforward fashion. Once you have an anchor column, a CTE in the future is much more trivial (just search for "CTE recursion tree hierarchy").
SELECT DISTINCT
CCSO, RELATEDORDER
FROM myTempTable
WHERE CCSO = '100-10115'; /* to find all orders by CCSO, query SELECT DISTINCT RELATEDORDER */
--WHERE ARRMAP = 'ARR10524'; /* to find all orders by ARRMAP, query SELECT DISTINCT CCSO */
EDIT:
To better explain what this table generates, let me simplify the problem.
If you have order
A with arrangements 1 and 2;
B with arrangement 2, 3; and
C with arrangement 3;
then, by your initial inquiry and image, order A should be related to orders B and C, right? The query generates the following table when you SELECT DISTINCT ccso, relatedOrder:
+-------+--------------+
| CCSO | RelatedOrder |
+----------------------+
| A | B |
| A | C |
+----------------------+
| B | C |
| B | A |
+----------------------+
| C | A |
| C | B |
+-------+--------------+
You can see here if you query WHERE CCSO = 'A' OR RelatedOrder = 'A', you'll get the same relationships, just flipped between the two columns.
+-------+--------------+
| CCSO | RelatedOrder |
+----------------------+
| A | B |
| A | C |
+----------------------+
| B | A |
+----------------------+
| C | A |
+-------+--------------+
So query only CCSO or RelatedOrder.
As for the results of WHERE CCSO = '100-10115', see image here, which includes all the links you showed in your Image #1, as well as additional depths of relations.
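In graph terms, a "bundle" is a connected component: every order and every arrangement is a node, and every table row is an edge. If the data can be exported (for example as the CSV mentioned in the question), the bundles can be computed outside the database without fixing a depth. The following is a minimal Python sketch using a union-find structure, which is a different technique from the layered subqueries in this answer. The function name `largest_arrangement_per_order` is made up, and "largest" is assumed to mean the maximum under ordinary string comparison.

```python
def largest_arrangement_per_order(pairs):
    """Map each ccso to the largest arrmap found anywhere in its bundle.

    pairs: iterable of (ccso, arrmap) rows, as in the two-column table.
    """
    pairs = list(pairs)
    parent = {}

    def find(node):
        # Return the representative of node's component (with path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag the nodes so an order id can never collide with an arrangement id.
    for ccso, arrmap in pairs:
        union(("so", ccso), ("arr", arrmap))

    # Largest arrangement per component.
    largest = {}
    for _, arrmap in pairs:
        root = find(("arr", arrmap))
        if root not in largest or arrmap > largest[root]:
            largest[root] = arrmap

    return {ccso: largest[find(("so", ccso))] for ccso, _ in pairs}


# Orders A, B, C from the simplified example above, plus an unrelated
# order D (added here for illustration) that forms its own bundle.
rows = [("A", "1"), ("A", "2"), ("B", "2"), ("B", "3"), ("C", "3"), ("D", "9")]
final_map = largest_arrangement_per_order(rows)
print(final_map)
```

Orders A, B and C end up in one bundle and share arrangement "3", while D keeps "9". The resulting mapping could then be loaded back into the third column.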

Get total count and first 3 columns

I have the following SQL query:
SELECT TOP 3 accounts.username
,COUNT(accounts.username) AS count
FROM relationships
JOIN accounts ON relationships.account = accounts.id
WHERE relationships.following = 4
AND relationships.account IN (
SELECT relationships.following
FROM relationships
WHERE relationships.account = 8
);
I want to return the total count of accounts.username and the first 3 accounts.username (in no particular order). Unfortunately accounts.username and COUNT(accounts.username) cannot coexist. The query works fine if I remove either one of them. I don't want to send the request twice with different select bodies. The count could reach 1000+, so I would prefer to calculate it in SQL rather than in code.
The current query returns the error Column 'accounts.username' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. which has not led me anywhere and this is different to other questions as I do not want to use the 'group by' clause. Is there a way to do this with FOR JSON AUTO?
The desired output could be:
+-------+----------+
| count | username |
+-------+----------+
| 1551 | simon1 |
| 1551 | simon2 |
| 1551 | simon3 |
+-------+----------+
or
+----------------------------------------------------------------+
| JSON_F52E2B61-18A1-11d1-B105-00805F49916B |
+----------------------------------------------------------------+
| [{"count": 1551, "usernames": ["simon1", "simon2", "simon3"]}] |
+----------------------------------------------------------------+

If you want to display the total count of rows that satisfy the filter conditions (and where username is not null) in an additional column in your resultset, then you could use window functions:
SELECT TOP 3
a.username,
COUNT(a.username) OVER() AS cnt
FROM relationships r
JOIN accounts a ON r.account = a.id
WHERE
r.following = 4
AND EXISTS (
SELECT 1 FROM relationships r1 WHERE r1.account = 8 AND r1.following = r.account
)
;
Side notes:
if username is not nullable, use COUNT(*) rather than COUNT(a.username): this is more efficient since it does not require the database to check every value for nullity
table aliases make the query easier to write, read and maintain
I usually prefer EXISTS over IN (but here this is mostly a matter of taste, as both techniques should work fine for your use case)
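To see the window function at work, here is a minimal sketch run through Python's sqlite3 module. It is not SQL Server: `LIMIT 3` replaces `TOP 3`, an `ORDER BY` is added only to make the output deterministic, and the sample accounts are invented. Window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE relationships (account INTEGER, following INTEGER);

-- invented sample data
INSERT INTO accounts VALUES
    (1, 'simon1'), (2, 'simon2'), (3, 'simon3'),
    (4, 'target'), (5, 'simon5'), (6, 'simon6'), (8, 'viewer');

-- accounts 1, 2, 3, 5 and 6 all follow account 4
INSERT INTO relationships VALUES (1, 4), (2, 4), (3, 4), (5, 4), (6, 4);
-- account 8 follows 1, 2, 3 and 5, but not 6
INSERT INTO relationships VALUES (8, 1), (8, 2), (8, 3), (8, 5);
""")

# COUNT(*) OVER () is evaluated over every row that passes the WHERE
# clause, before LIMIT trims the result to three rows.
rows = conn.execute("""
SELECT a.username, COUNT(*) OVER () AS cnt
FROM relationships AS r
JOIN accounts AS a ON r.account = a.id
WHERE r.following = 4
  AND EXISTS (
      SELECT 1 FROM relationships AS r1
      WHERE r1.account = 8 AND r1.following = r.account
  )
ORDER BY a.username
LIMIT 3
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Four accounts satisfy the filter, so each of the three returned rows carries a count of 4.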

SQL Spatial Subquery Issue

Greetings Benevolent Gods of Stackoverflow,
I am presently struggling to get a spatially enabled query to work for a SQL assignment I am working on. The wording is as follows:
SELECT PURCHASES.TotalPrice, STORES.GeoLocation, STORES.StoreName
FROM MuffinShop
join (SELECT SUM(PURCHASES.TotalPrice) AS StoreProfit, STORES.StoreName
FROM PURCHASES INNER JOIN STORES ON PURCHASES.StoreID = STORES.StoreID
GROUP BY STORES.StoreName
HAVING (SUM(PURCHASES.TotalPrice) > 600))
What I am trying to do with this query is perform a function query (like avg, sum etc) and get the spatial information back as well. Another example of this would be:
SELECT STORES.StoreName, AVG(REVIEWS.Rating),Stores.Shape
FROM REVIEWS CROSS JOIN
STORES
GROUP BY STORES.StoreName;
This returns a Column 'STORES.Shape' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. error message.
I know I require a sub query to perform this task, I am just having endless trouble getting it to work. Any help at all would be wildly appreciated.

There are two parts to this question, I would tackle the first problem with the following logic:
List all the store names and their respective geolocations
Get the profit for each store
With that in mind, you need to use the STORES table as your base, then bolt the profit onto it through a sub query or an apply:
SELECT s.StoreName
,s.GeoLocation
,p.StoreProfit
FROM STORES s
INNER JOIN (
SELECT pu.StoreId
,StoreProfit = SUM(pu.TotalPrice)
FROM PURCHASES pu
GROUP BY pu.StoreID
) p
ON p.StoreID = s.StoreID;
This one is a little more efficient:
SELECT s.StoreName
,s.GeoLocation
,profit.StoreProfit
FROM STORES s
CROSS APPLY (
SELECT StoreProfit = SUM(p.TotalPrice)
FROM PURCHASES p
WHERE p.StoreID = s.StoreID
GROUP BY p.StoreID
) profit;
Now for the second part, the error that you are receiving tells you that you need to GROUP BY all columns in your select statement with the exception of your aggregate function(s).
In your second example, you are asking SQL to take an average rating for each store based on an ID, but you are also trying to return another column without including that inside the grouping. I will try to show you what you are asking SQL to do and where the issue lies with the following examples:
-- Data
Id | Rating | Shape
1 | 1 | Triangle
1 | 4 | Triangle
1 | 1 | Square
2 | 1 | Triangle
2 | 5 | Triangle
2 | 3 | Square
SQL Server, please give me the average rating for each store:
SELECT Id, AVG(Rating)
FROM Store
GROUP BY Id;
-- Result
Id | Avg(Rating)
1 | 2
2 | 3
SQL Server, please give me the average rating for each store and show its shape in the result (but don't group by it):
SELECT Id, AVG(Rating), Shape
FROM Store
GROUP BY Id;
-- Result
Id | Avg(Rating) | Shape
1 | 2 | Do I show Triangle or Square ...... ERROR!!!!
2 | 3 |
It needs to be told to get the average for each store and shape:
SELECT Id, AVG(Rating), Shape
FROM Store
GROUP BY Id, Shape;
-- Result
Id | Avg(Rating) | Shape
1 | 2.5 | Triangle
1 | 1 | Square
2 | 3 | Triangle
2 | 3 | Square
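The two valid groupings can be reproduced with the Id/Rating/Shape sample rows above. This is a minimal sketch using Python's sqlite3 module, with the table name `ratings` chosen for illustration. Two differences from SQL Server are worth keeping in mind: SQLite's AVG always returns a floating-point value, whereas SQL Server's AVG over an int column returns an int, and SQLite would not raise the "invalid in the select list" error for the middle query because it tolerates ungrouped columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (Id INTEGER, Rating INTEGER, Shape TEXT)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        (1, 1, "Triangle"), (1, 4, "Triangle"), (1, 1, "Square"),
        (2, 1, "Triangle"), (2, 5, "Triangle"), (2, 3, "Square"),
    ],
)

# One row per store: Shape cannot appear because it is not grouped.
per_store = conn.execute(
    "SELECT Id, AVG(Rating) FROM ratings GROUP BY Id ORDER BY Id"
).fetchall()

# One row per (store, shape) combination: Shape is now part of the key.
per_store_shape = conn.execute(
    "SELECT Id, Shape, AVG(Rating) FROM ratings "
    "GROUP BY Id, Shape ORDER BY Id, Shape"
).fetchall()
conn.close()

print(per_store)
print(per_store_shape)
```

Adding Shape to the GROUP BY changes the question being asked: the result has one row per store and shape rather than one row per store.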

As in any spatial query you need an idea of what your final geometry will be. It looks like you are attempting to group by individual stores but delivering an average rating from the subquery. So if I'm reading it right you are just looking to get the stores shape info associated with the average ratings?
Query the stores table for the shape field and join the query you use to get the average rating
select a.shape,
b.*
from stores a inner join (your Average rating query with group by here) b
on a.StoreID = b.Storeid

The best way to keep count data in postgres

I need to create statistics for some aggregate data, split by day.
For example:
select
(select count(*) from bananas) as bananas_count,
(select count(*) from apples) as apples_count,
(select count(*) from bananas where color = 'yellow') as yellow_bananas_count;
obviously I will get:
bananas_count | apples_count | yellow_bananas_count
--------------+--------------+---------------------
          123 |          321 |                   15
but I need that data grouped by day; for example, we need to know how many bananas we had yesterday.
My first thought was to create a view, but in that case I would not be able to split by date (or I don't know how to do it).
I need a performant, database-side implementation of this task.

Fuzzy grouping in SQL

I need to modify a SQL table to group slightly mismatched names, and assign all elements in the group a standardized name.
For instance, if the initial table looks like this:
Name
--------
Jon Q
John Q
Jonn Q
Mary W
Marie W
Matt H
I would like to create a new table or add a field to the existing one like this:
Name | StdName
--------------------
Jon Q | Jon Q
John Q | Jon Q
Jonn Q | Jon Q
Mary W | Mary W
Marie W | Mary W
Matt H | Matt H
In this case, I've chosen the first name to assign as the "standardized name," but I don't actually care which one is chosen -- ultimately the final "standardized name" will be hashed into a unique person ID. (I'm also open to alternative solutions that go directly to a numerical ID.) I will have birthdates to match on as well, so the accuracy of the name matching doesn't actually need to be all that precise in practice. I've looked into this a bit and will probably use the Jaro-Winkler algorithm (see e.g. here).
If I knew that the names were all in pairs, this would be a relatively easy query, but there can be an arbitrary number of the same name.
I can easily conceptualize how to do this query in a procedural language, but I'm not very familiar with SQL. Unfortunately I don't have direct access to the data -- it's sensitive data and so somebody else (a bureaucrat) has to run the actual query for me. The specific implementation will be SQL Server, but I'd prefer an implementation-agnostic solution.
EDIT:
In response to a comment, I had the following procedural approach in mind. It's in Python, and I replaced the Jaro-Winkler with simply matching on the first letter of the name, for the sake of having a working code example.
nameList = ['Jon Q', 'John Q', 'Jonn Q', 'Mary W', 'Marie W', 'Larry H']
stdList = nameList[:]
# loop over all names
for i1, name1 in enumerate(stdList):
    # loop over later names in list to find matches
    for i2, name2 in enumerate(stdList[i1+1:]):
        # If there's a match, replace latter with former.
        if name1[0] == name2[0]:
            stdList[i1+1+i2] = name1
print(stdList)
The result is ['Jon Q', 'Jon Q', 'Jon Q', 'Mary W', 'Mary W', 'Larry H'].

Just a thought, but you might be able to use the SOUNDEX() function. This will create a value for the names that are similar.
If you started with something like this:
select name, soundex(name) snd,
row_number() over(partition by soundex(name)
order by soundex(name)) rn
from yt;
See SQL Fiddle with Demo. Which would give a result for each row that is similar along with a row_number() so you could return only the first value for each group. For example, the above query will return:
| NAME | SND | RN |
-----------------------
| Jon Q | J500 | 1 |
| John Q | J500 | 2 |
| Jonn Q | J500 | 3 |
| Matt H | M300 | 1 |
| Mary W | M600 | 1 |
| Marie W | M600 | 2 |
Then you could select all of the rows from this result where the row_number() is equal to 1 and then join back to your main table on the soundex(name) value:
select t1.name,
t2.Stdname
from yt t1
inner join
(
select name as stdName, snd, rn
from
(
select name, soundex(name) snd,
row_number() over(partition by soundex(name)
order by soundex(name)) rn
from yt
) d
where rn = 1
) t2
on soundex(t1.name) = t2.snd;
See SQL Fiddle with Demo. This gives a result:
| NAME | STDNAME |
---------------------
| Jon Q | Jon Q |
| John Q | Jon Q |
| Jonn Q | Jon Q |
| Mary W | Mary W |
| Marie W | Mary W |
| Matt H | Matt H |
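The "keep the row with rn = 1 and join it back" step boils down to: compute a key per name, then give every name the first name seen with that key. A minimal Python sketch of that idea follows. The function `standardize` is hypothetical, and the key used here (first letter plus last token) is only a stand-in for SOUNDEX, in the same spirit as the first-letter shortcut in the question.

```python
def standardize(names, key):
    """Pair every name with the first name that produced the same key."""
    first_for_key = {}
    result = []
    for name in names:
        # setdefault stores the name only if the key is new, and always
        # returns the name that was stored first for this key.
        std = first_for_key.setdefault(key(name), name)
        result.append((name, std))
    return result


def initial_and_last_token(name):
    # Stand-in key (an assumption, not SOUNDEX): 'Jon Q' -> ('J', 'Q')
    return (name[0], name.split()[-1])


names = ["Jon Q", "John Q", "Jonn Q", "Mary W", "Marie W", "Matt H"]
pairs = standardize(names, key=initial_and_last_token)
for name, std_name in pairs:
    print(name, "|", std_name)
```

Because names are grouped by an exact key, this needs a single pass over the data, but it only works for matching rules that can be expressed as "same key", which a similarity threshold such as Jaro-Winkler cannot.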

Assuming you copy and paste the jaro-winkler implementation from SSC (registration required), the following code will work. I tried to build a SQLFiddle for it but it kept going belly up when I was building the schema.
This implementation has a cheat---I'm using a cursor. Generally, cursors are not conducive to performance but in this case, you need to be able to compare the set against itself. There's probably a graceful number/tally table approach to eliminate the declared cursor.
DECLARE @SRC TABLE
(
    source_string varchar(50) NOT NULL
,   ref_id int identity(1,1) NOT NULL
);
-- Identify matches
DECLARE @WORK TABLE
(
    source_ref_id int NOT NULL
,   match_ref_id int NOT NULL
);
INSERT INTO
    @SRC
SELECT 'Jon Q'
UNION ALL SELECT 'John Q'
UNION ALL SELECT 'JOHN Q'
UNION ALL SELECT 'Jonn Q'
-- Oops on matching joan to jon
UNION ALL SELECT 'Joan Q'
UNION ALL SELECT 'june'
UNION ALL SELECT 'Mary W'
UNION ALL SELECT 'Marie W'
UNION ALL SELECT 'Matt H';
-- 2 problems to address
-- duplicates in our inbound set
-- duplicates against a reference set
--
-- Better matching will occur if names are split into ordinal entities
-- Splitting on whitespace is always questionable
--
-- Mat, Matt, Matthew
DECLARE CSR CURSOR
READ_ONLY
FOR
SELECT DISTINCT
    S1.source_string
,   S1.ref_id
FROM
    @SRC AS S1
ORDER BY
    S1.ref_id;
DECLARE @source_string varchar(50), @ref_id int
OPEN CSR
FETCH NEXT FROM CSR INTO @source_string, @ref_id
WHILE (@@fetch_status <> -1)
BEGIN
    IF (@@fetch_status <> -2)
    BEGIN
        IF NOT EXISTS
        (
            SELECT * FROM @WORK W WHERE W.match_ref_id = @ref_id
        )
        BEGIN
            INSERT INTO
                @WORK
            SELECT
                @ref_id
            ,   S.ref_id
            FROM
                @SRC S
                -- If we have already matched the value, skip it
                LEFT OUTER JOIN
                    @WORK W
                    ON W.match_ref_id = S.ref_id
            WHERE
                -- Don't match yourself
                S.ref_id <> @ref_id
                -- arbitrary threshold, will need to examine this for sanity
                AND dbo.fn_calculateJaroWinkler(@source_string, S.source_string) > .95
        END
    END
    FETCH NEXT FROM CSR INTO @source_string, @ref_id
END
CLOSE CSR
DEALLOCATE CSR
-- Show me the list of all the unmatched rows
-- plus the retained
;WITH MATCHES AS
(
    SELECT
        S1.source_string
    ,   S1.ref_id
    ,   S2.source_string AS match_source_string
    ,   S2.ref_id AS match_ref_id
    FROM
        @SRC S1
        INNER JOIN
            @WORK W
            ON W.source_ref_id = S1.ref_id
        INNER JOIN
            @SRC S2
            ON S2.ref_id = W.match_ref_id
)
, UNMATCHES AS
(
    SELECT
        S1.source_string
    ,   S1.ref_id
    ,   NULL AS match_source_string
    ,   NULL AS match_ref_id
    FROM
        @SRC S1
        LEFT OUTER JOIN
            @WORK W
            ON W.source_ref_id = S1.ref_id
        LEFT OUTER JOIN
            @WORK S2
            ON S2.match_ref_id = S1.ref_id
    WHERE
        W.source_ref_id IS NULL
        AND S2.match_ref_id IS NULL
)
SELECT
    M.source_string
,   M.ref_id
,   M.match_source_string
,   M.match_ref_id
FROM
    MATCHES M
UNION ALL
SELECT
    M.source_string
,   M.ref_id
,   M.match_source_string
,   M.match_ref_id
FROM
    UNMATCHES M;
-- To specifically solve your request
SELECT
    S.source_string AS Name
,   COALESCE(S2.source_string, S.source_string) AS StdName
FROM
    @SRC S
    LEFT OUTER JOIN
        @WORK W
        ON W.match_ref_id = S.ref_id
    LEFT OUTER JOIN
        @SRC S2
        ON S2.ref_id = W.source_ref_id
query output 1
source_string ref_id match_source_string match_ref_id
Jon Q 1 John Q 2
Jon Q 1 JOHN Q 3
Jon Q 1 Jonn Q 4
Jon Q 1 Joan Q 5
june 6 NULL NULL
Mary W 7 NULL NULL
Marie W 8 NULL NULL
Matt H 9 NULL NULL
query output 2
Name StdName
Jon Q Jon Q
John Q Jon Q
JOHN Q Jon Q
Jonn Q Jon Q
Joan Q Jon Q
june june
Mary W Mary W
Marie W Marie W
Matt H Matt H
There be dragons
Over on SuperUser, I talked about my experience matching people. In this section, I'll list some things to be aware of.
Speed
As part of your matching, hooray in that you have a birthday to augment the match process. I would actually propose you generate a match based exclusively on birthdate first. That is an exact match and one that, with a proper index, SQL Server will be able to quickly include/exclude rows. Because you're going to need it. The TSQL implementation is dog slow. I've been running the equivalent match against a dataset of 28k names (names that had been listed as conference attendees). There ought to be some good overlap there and while I did fill #src with data, it is a table variable with all that that implies but it's been running now for 15 minutes and still hasn't completed.
It's slow for a number of reasons but things that jumped out at me are all the looping and string manipulation in the functions. That is not where SQL Server shines. If you have a need to do a lot of this, it might be a good idea to convert them into CLR methods so at least you can leverage the strength of the .NET libraries for some of the manipulations.
One of the matches we used to use was the Double Metaphone and it would generate a pair of possible phonetic interpretations of the name. Instead of computing that every time, compute it once and store it alongside the name. That would help speed some of the matching. Unfortunately, it doesn't look like JW lends itself to breaking it down like that.
Look at iterating too. We'd first try the algs that we knew were fast. 'John' = 'John' so there's no need to pull out the big guns so we'd try a first pass of straight name checks. If we didn't find a match, we'd try harder. The hope was that by taking various swipes at matching we'd get the low hanging fruit as fast as possible and worry about the harder matches later.
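The "exact birthdate first, fuzzy name second" idea can be sketched outside the database as follows. This is a minimal Python example under stated assumptions: `difflib.SequenceMatcher` from the standard library stands in for Jaro-Winkler (it is a different similarity measure), the 0.75 threshold is arbitrary, and the function name and sample people are invented.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def match_within_birthdate(people, threshold=0.75):
    """Return {record index: index of its standard record}.

    people: list of (name, birthdate) tuples. Names are only compared
    with other names that share exactly the same birthdate.
    """
    # Cheap exact step: bucket record indexes by birthdate.
    blocks = defaultdict(list)
    for idx, (_, birthdate) in enumerate(people):
        blocks[birthdate].append(idx)

    standard = {}
    for indexes in blocks.values():
        representatives = []  # first record of each group in this bucket
        for idx in indexes:
            name = people[idx][0].lower()
            for rep in representatives:
                rep_name = people[rep][0].lower()
                # Expensive fuzzy step, limited to the current bucket.
                if SequenceMatcher(None, name, rep_name).ratio() >= threshold:
                    standard[idx] = rep
                    break
            else:
                # No representative was close enough: start a new group.
                representatives.append(idx)
                standard[idx] = idx
    return standard


people = [
    ("Jon Q", "1980-01-02"),
    ("John Q", "1980-01-02"),
    ("JOHN Q", "1980-01-02"),
    ("Jon Q", "1975-06-30"),   # same name, other birthdate: kept apart
    ("Mary W", "1990-03-04"),
    ("Matt H", "1990-03-04"),  # same birthdate, dissimilar name
]
groups = match_within_birthdate(people)
print(groups)
```

The fuzzy comparison runs only inside each birthdate bucket, so the number of expensive string comparisons depends on bucket sizes rather than on the square of the whole table.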
Names
In my SU answer and in the code comments, I mention nicknames. Bill and Billy are going to match. Billy, Liam and William are definitely not going to match even though they may be the same person. You might want to look at a list like this to provide translation between nickname and full name. After running a set of matches on the supplied name, maybe we'd try looking for a match based on the possible root name.
Obviously, there are drawbacks to this approach. For example, my grandfather-in-law is Max. Just Max. Not Maximilian, Maximus or any other thing you might think.
Your supplied names look like it's first and last concatenated together. Future readers, if you ever have the opportunity to capture individual portions of a name, please do so. There are products out there that will split names and try to match them up against directories to try and guess whether something is first/middle name or a surname but then you have people like "Robar Mike". If you saw that name there, you'd think Robar is a last name and you'd also pronounce it like "robber." Instead, Robar (say it with a French accent) is his first name and Mike is his last name. At any rate, I think you'll have a better matching experience if you can split first and last out into separate fields and match the individual pieces together. An exact last name match plus a partial first name match might suffice, especially in cases where legally they are "Franklin Roosevelt" and you have a candidate of "F. Roosevelt" Perhaps you have a rule that an initial letter can match. Or you don't.
Noise - as referenced in the JW post and my answer, strip out crap (punctuation, stop words, etc.) for matching purposes. Also watch out for honorific titles (PhD, JD, etc.) and generationals (II, III, JR, SR). Our rule was that a candidate with/without a generational could match one in the opposite state (Bob Jones Jr == Bob Jones) or could exactly match the generation (Bob Jones Sr == Bob Jones Sr), but you'd never want to match if both records supplied them and they were conflicting (Bob Jones Sr != Bob Jones Jr).
Case sensitivity, always check your database and tempdb to make sure you aren't making case sensitive matches. And if you are, convert everything to upper or lower for purposes of matching but don't ever throw the supplied casing away. Good luck trying to determine whether latessa should be Latessa, LaTessa or something else.
My query is coming up on an hour's worth of processing with no rows returned so I'm going to kill it and turn in. Best of luck, happy matching.
