Finding potential duplicates with hyphens or dashes in SQL Server 2008

I am trying to find potential duplicates in my database.
Some people might have a duplicate record because a "-" was added to their first or last name (for whatever reason).
My query currently does not find people whose duplicate differs only by a "-".
What is the best way to do this?
This is my query so far:
SELECT t1.FirstName, t1.LastName, t1.ID, t2.dupeCount
FROM Contact t1
INNER JOIN (
SELECT FirstName, REPLACE(LastName, '-', ' ') as LastName, COUNT(*) AS dupeCount
FROM Contact
GROUP BY FirstName, LastName
HAVING COUNT(*) > 1
) t2 ON ((SOUNDEX(t1.LastName) = SOUNDEX(t2.LastName)
OR SOUNDEX(REPLACE(t1.LastName, '-', ' ')) like '%' + SOUNDEX(t2.LastName) + '%'
OR SOUNDEX(REPLACE(t2.LastName, '-', ' ')) like '%' + SOUNDEX(t1.LastName) + '%' )
AND SOUNDEX(t1.FirstName) = SOUNDEX(t2.FirstName))
ORDER BY t1.LastName, t1.ID

This is a lot more involved than something you can fix in one SELECT statement. When I run into this, I create a stored procedure that trims leading and trailing spaces, removes punctuation that shouldn't be there (such as in middle names that are abbreviated some of the time and not other times), and checks whether phone numbers, address/zip code combinations, and/or email addresses point to the same person. Soundex helps, but it isn't enough.
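The cleanup step described above can be sketched outside the database. This is only a minimal illustration in Python, not the stored procedure itself: the function names and the sample contacts are made up, and it only normalizes names (it does not compare phone numbers, addresses, or emails).

```python
import re
from collections import defaultdict

def normalize_name(name):
    """Lowercase, replace hyphens and other punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())

def find_duplicate_groups(contacts):
    """Group contact IDs whose normalized first and last names are identical."""
    groups = defaultdict(list)
    for contact_id, first, last in contacts:
        key = (normalize_name(first), normalize_name(last))
        groups[key].append(contact_id)
    # Keep only keys shared by more than one contact.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

# Hypothetical sample rows: (ID, FirstName, LastName)
contacts = [
    (1, "Mary", "Smith-Jones"),
    (2, "Mary ", "Smith Jones"),
    (3, "Mary", "Smith"),
    (4, "J.", "Doe"),
    (5, "J", "Doe"),
]
duplicates = find_duplicate_groups(contacts)
```

Exact matching on the normalized key only catches punctuation and spacing differences; misspellings would still need something fuzzier on top.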

Something like a Levenshtein distance algorithm would be useful: it measures the number of single-character edits needed to turn one string into another. Oracle has a built-in function for this, edit_distance in the utl_match package, but I don't know of a built-in equivalent in SQL Server.
I did a quick Google search for "Levenshtein distance" and "edit distance SQL Server" and found the following Stack Overflow thread, among other results, that may be helpful:
Levenshtein distance in T-SQL
If you can create a function that returns the Levenshtein distance, you can filter the query on distance < x, setting the threshold x as you see fit.
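For reference, here is a small Python sketch of the Levenshtein distance and the threshold filter described above. It is not the T-SQL function from the linked thread; the helper name is_probable_duplicate and the default threshold of 2 are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def is_probable_duplicate(name_a, name_b, threshold=2):
    """Case-insensitive check that the edit distance is below the threshold."""
    return levenshtein(name_a.lower(), name_b.lower()) < threshold
```

With this, "Smith-Jones" and "Smith Jones" are one edit apart, so they fall under a threshold of 2, while "Smith" and "Jones" do not.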

Related

Full Text Search Using Multiple Partial Words

I have a SQL Server database that has medical descriptions in it. I've created a full-text index on it, but I'm still figuring out how this works.
The easiest example to give is a description of "Hypertensive heart disease".
Users would like to be able to type hyp hea as a search term and have it return that row.
From what I've read, it seems my query needs to be something like
DECLARE @Term VARCHAR(100)
SET @Term = 'NEAR(''Hyper*'',''hea*'')'
SELECT * FROM Icd10Codes WHERE CONTAINS(Description, @Term)
If I take the wildcards out and type the full words Hypertensive and heart, it works, but adding the wildcards back in returns nothing.
If it makes any difference, I'm using SQL Server 2017.
So it was a weird syntax issue that didn't cause an error but stopped the search from working.
I changed it to
SELECT * FROM Icd10Codes where CONTAINS(description, '"hyper*" NEAR "hea*"')
The key here is that I needed double quotes ("), not two single quotes. I had assumed it was two single quotes, the first escaping the second, but it was actually a double-quote character. The above query returns the results exactly as expected.
This might work:
SELECT * FROM Icd10Codes where SOUNDEX(description)=soundex('Hyp');
SELECT * FROM Icd10Codes where DIFFERENCE(description,'hyp hea')>=2;
You could try a LIKE statement. You can find a thorough explanation here.
Like so:
SELECT * FROM Icd10Codes WHERE Description LIKE '%hyp hea%';
Then, instead of hard-coding the string, use a variable.
If you need to search for separated partial words, as in an array of search terms, it gets a bit tricky, since you need to build the SQL statement dynamically.
MSSQL provides a few features for full-text search. You can find those here. One of them is the CONTAINS predicate, which needs an operator such as AND between the search terms:
SELECT column FROM table WHERE CONTAINS (column , 'string1 AND string2 AND string3');
For me, this had more mileage (PostgreSQL with Supabase rather than SQL Server).
Create a generated column that holds the full-text search vector.
Full name, company, and last name are all searchable.
ALTER TABLE profiles ADD COLUMN fts tsvector
GENERATED ALWAYS AS (
  to_tsvector('english',
    coalesce(profiles.company, '') || ' ' ||
    coalesce(profiles.broker, '') || ' ' ||
    coalesce(profiles.firstname, '') || ' ' ||
    coalesce(profiles.lastname, '') || ' '
  )
) STORED;
let { data, error } = await supabase.from('profiles')
  .select()
  .textSearch('fts', str)

SQL LIKE Statement with Spaces

Spaces in my phone numbers are causing issues when I search.
Select * from customers where number LIKE '%02722231%'
This returns records containing '02722231', but not records that contain a space, e.g. '027 22231'.
Can this be done with regular expressions? I need to search for 0272542155 and get all matching records, including 027 2542155.
Try this:
Select * from customers where REPLACE(number, ' ', '') LIKE '%02722231%'
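To see the REPLACE approach end to end, here is a small sketch using Python's built-in sqlite3 module. The table layout, sample numbers, and the helper name find_by_number are assumptions for illustration; the search term is also stripped of spaces so either side may contain them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, number TEXT)")
conn.executemany(
    "INSERT INTO customers (number) VALUES (?)",
    [("0272542155",), ("027 2542155",), ("021 555 0000",)],  # made-up sample data
)

def find_by_number(conn, search):
    """Match phone numbers while ignoring spaces in both the column and the search term."""
    digits = search.replace(" ", "")
    rows = conn.execute(
        "SELECT number FROM customers "
        "WHERE REPLACE(number, ' ', '') LIKE ? ORDER BY id",
        ("%" + digits + "%",),  # bound parameter instead of string concatenation in SQL
    )
    return [row[0] for row in rows]

matches = find_by_number(conn, "0272542155")
```

Note that wrapping the column in REPLACE generally prevents an index on number from being used, which may matter on large tables.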

Find similar entries in SQL column and rank by frequency

I have a column of 10k URIs in my SQLite database. I would like to identify which of these URIs are subdomains of the same website.
For instance, for the given set...
1. daiquiri.rum.cu
2. mojito.rum.cu
3. cubalibre.rum.cu
4. americano.campari.it
5. negroni.campari.it
6. hemingway.com
... I would like to run a query that returns:
Website | Occurrences
----------------------------
rum.cu | 3
campari.it | 2
hemingway.com | 1
That is, the domain names / patterns that were matched, ranked by the number of times they were found in the database.
The heuristic I would use is: for every URI with 3+ domain parts, replace the first part with '%' and execute the pseudo-query: COUNT(uris from website where uris LIKE '%.remainderofmyuri').
Note that I don't care much about execution speed (in fact, not at all). The number of entries is within the range of 10k-100k.
The only problem is finding the domain. To find an algorithm, imagine your URLs with an additional dot in front (like '.negroni.campari.it' and '.hemingway.com'). You can then see that the domain is always the string that comes after the second dot from the right. All we have to do is look for that occurrence and strip the part of the string before it. Unfortunately, SQLite's string functions are rather poor. There is no function that gives you the second occurrence of a dot, not even when counting from the left. So the algorithm is fine for most DBMSs, but not for SQLite. We need another approach. (I am writing this anyway, to show how one would usually approach the problem.)
Here is the SQLite solution: the difference between a domain and a subdomain is that a domain contains exactly one dot, whereas a subdomain contains at least two. So when there is more than one dot, we must remove the first part, including the first dot, to get closer to the domain. Moreover, we want this to work even with subdomains like abc.def.geh.ijk.com, so we must do this recursively.
with recursive cte(uri) as
(
select uri from uris
union all
select substr(uri, instr(uri, '.') + 1) as uri from cte where instr(uri, '.') > 0
)
select uri, count(*)
from cte
where length(uri) = length(replace(uri,'.','')) + 1 -- domains only
group by uri
order by count(*) desc;
Here we generate 'daiquiri.rum.cu', 'rum.cu', and 'cu' from 'daiquiri.rum.cu', etc. So for every URI we get the domain (here 'rum.cu') and some other strings. Finally, we filter with LENGTH to keep only the strings that have exactly one dot: the domains. The rest is GROUP BY and COUNT.
Here is the SQL fiddle: http://sqlfiddle.com/#!5/c1f35/37.
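If the fiddle is unavailable, the recursive query above can be tried locally with Python's built-in sqlite3 module. This is just a sketch: the uris table is filled with the six sample rows from the question, and a secondary sort on uri is added so that ties come out in a stable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uris (uri TEXT)")
sample = [
    "daiquiri.rum.cu", "mojito.rum.cu", "cubalibre.rum.cu",
    "americano.campari.it", "negroni.campari.it", "hemingway.com",
]
conn.executemany("INSERT INTO uris VALUES (?)", [(u,) for u in sample])

QUERY = """
with recursive cte(uri) as
(
    select uri from uris
    union all
    -- strip everything up to and including the first dot
    select substr(uri, instr(uri, '.') + 1) from cte where instr(uri, '.') > 0
)
select uri, count(*)
from cte
where length(uri) = length(replace(uri, '.', '')) + 1  -- exactly one dot
group by uri
order by count(*) desc, uri
"""

counts = conn.execute(QUERY).fetchall()
for site, occurrences in counts:
    print(site, occurrences)
```

The one-dot filter treats the last two labels as the domain, so a name such as example.co.uk would be counted under co.uk.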
select x.site, count(*)
from mytable a
inner join
(
select 'rum.cu' as site
union all select 'campari.it'
union all select 'hemingway.com'
) x on a.url like '%' + x.site + '%'
group by x.site -- EDIT I missed out the GROUP BY on the first go - sorry!
(This is how I'd do it in SQL Server; SQLite syntax differs, for instance it concatenates strings with || rather than +.)
'mytable' is your table, which has a column called url containing 'mojito.rum.cu' etc. I haven't put the '%.' in the LIKE because that would miss hemingway.com. However, you could get around that by using this line instead:
) x on a.url like '%.' + x.site + '%' or a.url = x.site
You may not need the final + '%'. I put it in to catch URLs like 'hemingway.com/some-page.html'. If you don't have URLs like that, you can skip it.
EDIT for dynamic names
select x.site, count(*)
from mytable a
inner join
(
select distinct substr(url, instr(url, '.') + 1) as site
from mytable
where url like '%.%.%'
union
select distinct url
from mytable
where url like '%.%' and url not like '%.%.%'
) x on a.url like '%' + x.site + '%'
group by x.site
Something like that should do it. I haven't tested that the INSTR() function is correct. You may need to add or subtract 1 from the offset it generates when you test it. It may not be the fastest query but it should work.

SQL fetch results by concatenating words in column

I have a column store_name (varchar) with entries like prime sport, best buy, and so on, each containing a space. When a user types the concatenated string, such as primesport without the space, I need to return prime sport. How can I achieve this? Please help.
SELECT *
FROM TABLE
WHERE replace(store_name, ' ', '') LIKE '%'+@SEARCH+'%' OR STORE_NAME LIKE '%'+@SEARCH +'%'
Well, I don't have a complete answer, and I am searching for one myself. But maybe what I know will work for you. You can achieve this by performing different kinds of string operations:
Mike can be Myke or Myce or Mikke and so on.
Cat can be Kat or katt or catt and so on.
For this you would write a function that generates the possible strings, then build a SQL query from all of them and query the database.
A similar kind of search is known as Soundex search, available in both Oracle and Microsoft SQL Server. Have a look at it; it may work.
And overall, make use of functions like upper and lower.
Have you tried using replace()?
You can strip the white space in the query and then use LIKE:
SELECT * FROM table WHERE replace(store_name, ' ', '') LIKE '%primesport%'
It will work for entries like 'prime sport' when querying with 'primesport'.
Or you can use regex.

Dictionary-like search using SQL Server FULL-TEXT

I am trying to create a dictionary for my website.
Searching for 'server' using FREETEXTTable & Rank DESC returns:
1. name server - A program or server that maps human-readable names..
2. server - One who serves; a waitress or waiter.
3. server - A tray for dishes; a salver.
4. ...
'server' is obviously closer to 'server' than 'name server'. How do I fix the ranking?
I cannot just reverse the order to ASC because there are even worse matches.
Top 3 results for 'God' are 'act of God', 'Lamb of God', 'Le God'..
Edit: Sorry for any confusion. 'name server', 'server', 'server', ... are in a single column called 'word'; this is the column queried with full-text search. The definitions are in the next column, 'definition', and are returned as query results.
I think you can use a UNION with an explicit rank column to fix the ordering of your results,
like
select 0 as match_rank, * from your_table_name where col_name = 'server'
union all
select 1, * from your_table_name where col_name like '%server%' and col_name <> 'server'
order by match_rank, col_name
This query should give you the exact matches first and then the partial matches. (A plain UNION does not guarantee row order, and the ORDER BY applies to the combined result, hence the match_rank column.)
Clarification:
Please note that by col_name I mean the name of the column that holds your words.
Say your table structure is
dictionary-
( c_word ,
c_definition,
c_synonyms
)
then you have to modify my query as
select 0 as match_rank, * from Dictionary where c_word = 'server'
union all
select 1, * from Dictionary where c_word like '%server%' and c_word <> 'server'
order by match_rank, c_word
so that the query shows first the rows where c_word exactly matches 'server', followed by the partial matches.
For a dynamic query, replace 'server' with the variable that holds the requested search keyword.
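The "exact match first, partial matches after" idea can also be expressed as a single query with a CASE expression in the ORDER BY. The sketch below runs it against SQLite through Python's sqlite3 module rather than SQL Server; the two-column table and the search helper are assumptions, and the rows are the three results quoted in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dictionary (c_word TEXT, c_definition TEXT)")
conn.executemany(
    "INSERT INTO Dictionary VALUES (?, ?)",
    [
        ("name server", "A program or server that maps human-readable names.."),
        ("server", "One who serves; a waitress or waiter."),
        ("server", "A tray for dishes; a salver."),
    ],
)

def search(conn, term):
    """Return matching words, exact matches first, then shorter partial matches."""
    rows = conn.execute(
        """
        SELECT c_word
        FROM Dictionary
        WHERE c_word LIKE '%' || ? || '%'
        ORDER BY CASE WHEN c_word = ? THEN 0 ELSE 1 END,  -- exact matches rank first
                 length(c_word),
                 c_word
        """,
        (term, term),
    )
    return [row[0] for row in rows]

results = search(conn, "server")
```

Passing the term as a bound parameter covers the dynamic-query case without building the SQL string by hand.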
I used the PATINDEX function to do this at one point. Something like the following:
SELECT
Word,
Definition
FROM
FREETEXTTABLE(Dictionary, Word, @search, 20) AS Matches
INNER JOIN Dictionary ON
Matches.[KEY] = Dictionary.ID
ORDER BY
CASE PATINDEX('%' + @search + '%', Word)
WHEN 0 THEN 1000
ELSE PATINDEX('%' + @search + '%', Word)
END
It doesn't perform too terribly, since you're using the full-text index to get a smaller result set (max 20 as well, in this case). PATINDEX finds the starting position of a pattern within an expression. If the pattern doesn't occur in the expression, it returns 0. This might happen if you're also searching definitions, if your search matches on a synonym or stemmed word (e.g. you search for "took" so "take" is returned), or if your search involves multiple words. The CASE expression sorts those results to the end.
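The same ordering rule (earliest match position first, non-matching rows last) can be sketched in Python to make the effect visible. The function name and the word list are illustrative assumptions; the candidate words would come from the full-text query in practice.

```python
def rank_by_match_position(words, search):
    """Sort words by where the search term first appears; words without it go last."""
    needle = search.lower()

    def sort_key(word):
        position = word.lower().find(needle)  # -1 when the term is absent
        # Tuple order: found before not found, then earlier position, then shorter word.
        return (position < 0, position, len(word))

    return sorted(words, key=sort_key)

candidates = ["act of God", "Lamb of God", "Le God", "God", "take"]
ranked = rank_by_match_position(candidates, "god")
```

Here the exact word 'God' ranks first because the term starts at position 0, and 'take' (a stemmed-style match that does not contain the term) ends up last.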