String matching using function - sql

I am looking for a way to fuzzy match strings (in my case contact names) to see where there might be possible duplicates in the database. The 'duplicates' are actually cases where the names are very similar, as each row will have unique data.
I have been looking around and think that this JaroWinkler function would best suit my needs; it works quite well on small sets of strings.
However, I am looking to compare about 260,000 distinct strings, and want to see if there is a way to avoid checking through all possible combinations (as this would give me around 29 billion rows of checking).
As it stands the query I am using for a small sample set:
CREATE TABLE #data
(
    ROW INT IDENTITY(1,1)
    ,string VARCHAR(50)
)

INSERT INTO #data SELECT 'Watts' AS string
UNION ALL SELECT 'Burns'
UNION ALL SELECT 'McLaughlan'
UNION ALL SELECT 'Darry'
UNION ALL SELECT 'Storie'
UNION ALL SELECT 'Mcluangan'
UNION ALL SELECT 'Burnsysx'

SELECT
    data1.string AS string1
    ,data1.row AS row1
    ,data2.string AS string2
    ,data2.row AS row2
    ,dbo.JaroWinkler(data1.string, data2.string) AS correlation
FROM #data data1
CROSS JOIN #data data2
WHERE data1.row < data2.row
For this sample data, this returns 21 rows, but I am only interested in rows where the correlation is above 0.7, so the majority of these can be removed from the output and, if possible, not even used as comparison points.
So for the example data above, I would want to return the following rows:
string1     row1  string2    row2  correlation
McLaughlan  3     Mcluangan  6     0.8962954
Burns       2     Burnsysx   7     0.874999125
I know that using inequality triangular joins is not a good idea, so would using a cursor be a better one? I do unfortunately need to check all records against each other to make sure duplicates don't exist.
For the purposes of testing, DIFFERENCE(data1.string, data2.string) could be used, filtering only cases where the value = 4 (so that I can at least get a sense of how best to move forward with this)!
Thanks!

The fuzzy logic feature in SSIS might be worth a shot, if you haven't tried it yet. It might be more performant than the query you have and has more "tweakable" parameters. It is relatively easy to set up.
http://msdn.microsoft.com/en-us/magazine/cc163731.aspx

If you are trying to find duplicate names, have you considered using the built-in SOUNDEX() function to find matches?
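Even where SOUNDEX alone is too coarse to declare a match, it can serve as a cheap blocking key so the expensive Jaro-Winkler function only runs within groups of phonetically similar names. A minimal sketch against the #data table from the question (assuming the dbo.JaroWinkler UDF is installed); because the blocking key is an equality, SQL Server can hash-join on it instead of evaluating every pair:

SELECT
    data1.string AS string1
    ,data1.row AS row1
    ,data2.string AS string2
    ,data2.row AS row2
    ,dbo.JaroWinkler(data1.string, data2.string) AS correlation
FROM #data data1
INNER JOIN #data data2
    ON SOUNDEX(data1.string) = SOUNDEX(data2.string)  -- cheap blocking key
    AND data1.row < data2.row                         -- each pair only once
WHERE dbo.JaroWinkler(data1.string, data2.string) > 0.7

Note the trade-off: pairs whose SOUNDEX codes differ are never compared at all, which costs recall - by the usual SOUNDEX rules 'McLaughlan' and 'Mcluangan' code to M242 and M245, so this particular key would miss one of the sample matches. A looser key (first letter plus a length band, for instance) keeps more candidates at a higher cost.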

Related

SQL Server - Generate unique ID to compare several columns altogether

In SQL Server, is it possible to generate a GUID using a specific piece of data as an input value? For example:
DECLARE @seed1 VARCHAR(10) = 'Test'
DECLARE @seed2 VARCHAR(10) = 'Testing'
SELECT NEWID(@seed1) -- will always return the same output value
SELECT NEWID(@seed2) -- will always return the same output value, and will be different to the example above
I know this completely goes against the point of GUIDs, in that the ID would not be unique. I'm looking for a way to detect duplicate records based on certain criteria (the @seed value).
I've tried generating a VARBINARY string using the HASHBYTES function, however joining between tables using VARBINARY seems extremely slow. I'm hoping to find a similar alternative that is more efficient.
Edit: some more information on why I'm looking to achieve this.
I'm looking for a fast and efficient way of detecting duplicate information that exists in two tables. For example, I have first name, last name & email. When these are concatenated, they can be used to check whether the record exists in both table A and table B.
Simply joining on these fields is possible and produces the correct result, but it is quite slow. Therefore, I was hoping to find a way of transforming the data into something such as a GUID, which would make the joins much more efficient.
I think you can use the CHECKSUM function, which returns an int.
You should use hashbytes and not checksum like this:
SELECT hashbytes('MD5', 'JOHN' + ',' + 'SMITH' + ',' + 'JSMITH@EXAMPLE.COM')
Although there is only a small chance that CHECKSUM will produce the same number for two completely different values, I've had it happen with datasets of around a million rows. As iamdave noted (thanks!), it's a good idea to throw in some kind of delimiter (a comma in my example) so that you don't treat 'JOH' + 'NSMITH' and 'JOHN' + 'SMITH' as the same.
http://www.sqlservercentral.com/blogs/microsoft-business-intelligence-and-data-warehousing/2012/02/01/checksum-vs-hashbytes/
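If the slow part is hashing on the fly during the join, a common pattern is to persist the hash once per row in an indexed computed column, so the join compares small pre-computed values. A sketch, assuming hypothetical tables TableA and TableB that both have FirstName, LastName and Email columns:

-- Hash each row once and index it (HASHBYTES is deterministic, so the
-- computed column can be persisted and indexed); repeat for TableB.
-- Wrap nullable columns in ISNULL, since NULL + anything yields NULL.
ALTER TABLE TableA ADD RowHash AS
    HASHBYTES('MD5', FirstName + ',' + LastName + ',' + Email) PERSISTED;
CREATE INDEX IX_TableA_RowHash ON TableA (RowHash);

-- The join now compares 16-byte indexed values instead of hashing per comparison
SELECT a.*
FROM TableA a
JOIN TableB b ON a.RowHash = b.RowHash;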

How can I create a temporary numbers table with SQL?

So I came upon a question where someone asked for a list of unused account numbers. The query I wrote for it works, but it is kind of hacky and relies on the existence of a table with more records than existing accounts:
WITH tmp AS (
    SELECT ROW_NUMBER() OVER (ORDER BY cusno) a
    FROM custtable
    FETCH FIRST 999999 ROWS ONLY
)
SELECT tmp.a
FROM tmp
WHERE a NOT IN (SELECT cusno FROM custtable)
This works because customer numbers are reused and there are significantly more records than unique customer numbers. But, like I said, it feels hacky and I'd like to just generate a temporary table with 1 column and x records that are numbered 1 through x. I looked at some recursive solutions, but all of it looked way more involved than the solution I wound up using. Is there an easier way that doesn't rely on existing tables?
I think the simple answer is no. To be able to make a determination of absence, the platform needs to know the expected data set. You can either generate that as a temporary table or data set at runtime - using the method you've used (or a variation thereof) - or you can create a reference table once, and compare against it each time. I'd favour the latter - a table with a single column of integers won't put much of a dent in your disk space and it doesn't make sense to compute an identical result set over and over again.
Here's a really good article from Aaron Bertrand that deals with this very issue:
https://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1
(Edit: the queries in that article are T-SQL specific, but they should be easily adaptable to DB2 - and the underlying analysis is relevant regardless of platform.)
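For reference, the core trick from that article is to stack CROSS JOINs over a small constant list so the row count multiplies, avoiding both recursion and reliance on an existing table. A T-SQL sketch yielding numbers 1 through 10,000 (add another CROSS JOIN level for more):

WITH e1(n) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),                                                 -- 10 rows
e2(n) AS (SELECT 1 FROM e1 a CROSS JOIN e1 b),     -- 10 * 10 = 100 rows
e4(n) AS (SELECT 1 FROM e2 a CROSS JOIN e2 b)      -- 100 * 100 = 10,000 rows
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
FROM e4

If you take the reference-table route, materialize this once into a permanent numbers table and index it.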
If you want to find all unused account numbers, you can do it like this:
WITH MaxNumber AS
(
    SELECT MAX(cusno) MaxID FROM custtable
),
RecurseNumber (id) AS
(
    VALUES 1
    UNION ALL
    SELECT id + 1 FROM RecurseNumber CROSS JOIN MaxNumber
    WHERE id <= MaxID
)
SELECT f1.* FROM RecurseNumber f1 EXCEPTION JOIN custtable f2 ON f1.id = f2.cusno

How to find rows in a table with similar string values

I have a Microsoft SQL Server database table with around 7 million crowd-sourced records, primarily containing a string name value with some related details. For nearly every record it seems there are a dozen similar typo records and I am trying to do some fuzzy matching to identify record groups such as "Apple", "Aple", "Apples", "Spple", etc. These names can also contain multiple words with spaces between them.
I've come up with a solution of using an edit-distance scalar function that returns the number of keystrokes required to transform string1 into string2, and using that function to join the table to itself. As you can imagine, this doesn't perform well, since it has to execute the function millions of times to evaluate the join.
So I put that in a cursor so that at least only one string1 is evaluated at a time. This at least gets results coming out, but after letting it run for weeks it has only made it through 150,000 records. With 7 million to evaluate, I don't think I have the kind of time my method is going to take.
I put full-text indexes on the string names, but couldn't really find a way to use the full-text predicates when I didn't have a static value to search for.
Any ideas how I could do something like the following in a way that wouldn't take months to run?
SELECT t1.name, t2.name
FROM names AS t1
INNER JOIN names AS t2
ON EditDistance(t1.name,t2.name) = 1
AND t1.id != t2.id
You may use the DIFFERENCE ( character_expression , character_expression ) function to evaluate the difference between the SOUNDEX codes of two character expressions. The SOUNDEX code is used to evaluate the difference between strings.
DIFFERENCE returns an integer between 0 (the SOUNDEX codes share no characters, i.e. the weakest match) and 4 (the codes are identical, i.e. the strongest match). You could use this value to determine how closely matched the strings are (e.g. a condition such as DIFFERENCE(column1, column2) = 4 would match records whose SOUNDEX codes agree on all four characters).
Here is a link to the documentation of the DIFFERENCE function: https://technet.microsoft.com/en-us/library/ms188753(v=sql.105).aspx
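One way to combine this with the question's query is to use DIFFERENCE as a cheap pre-filter, so the costly edit-distance UDF only runs on phonetically close pairs. This doesn't change the O(n²) shape of the join, but it can cut the expensive calls dramatically. A sketch (EditDistance is the question's own function, schema-qualified here as scalar UDFs require):

SELECT t1.name, t2.name
FROM names AS t1
INNER JOIN names AS t2
    ON t1.id < t2.id                          -- each unordered pair once
    AND DIFFERENCE(t1.name, t2.name) = 4      -- cheap phonetic pre-filter
WHERE dbo.EditDistance(t1.name, t2.name) = 1  -- expensive check on survivors only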
You need to find a way to avoid comparing each record to every other record. If you are just using a single field, you can use a special data structure like a trie, for example https://github.com/mattandahalfew/Levenshtein_search
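If an external structure like that is not an option, the same idea can be approximated in plain T-SQL with a blocking key: only compare strings that could plausibly be within edit distance 1, e.g. same first character and lengths differing by at most 1. A rough sketch (note the recall trade-off: a key on the first letter would miss pairs like 'Apple'/'Spple' from the question, so pick the key to fit your data):

SELECT t1.name, t2.name
FROM names AS t1
INNER JOIN names AS t2
    ON t1.id < t2.id
    AND LEFT(t1.name, 1) = LEFT(t2.name, 1)        -- blocking key: same first letter
    AND ABS(LEN(t1.name) - LEN(t2.name)) <= 1      -- lengths compatible with distance 1
WHERE dbo.EditDistance(t1.name, t2.name) = 1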

Create a function with whole columns as input and output

I have several programs written in R that I now need to translate into T-SQL to deliver to the client. I am new to T-SQL and I'm facing some difficulties in translating all my R functions.
An example is the numerical derivative function, which for two input columns (values and time) would return another column (of the same length) with the computed derivative.
My current understanding is:
I can't use SPs, because I'll need to use these functions inline within a SELECT statement, like:
SELECT Customer_ID, Date, Amount, derivative(Amount, Date) FROM Customer_Detail
I can't use UDFs, because they can take only scalar values as input parameters. I'll need a vectorised function, both for speed and because for some functions I have, like the one above, running row by row wouldn't be meaningful (each value needs the next and the previous one).
UDAs take a whole column but, as the name says, they will aggregate the column like SUM or AVG would.
If the above is correct, which other techniques would allow me to create the type of function I need? An example of a built-in SQL function similar to what I'm after is SQUARE(), which (apparently) takes a column and returns it squared. My goal is to create a library of functions which behave like SQUARE, POWER, etc., but internally it will be different, because SQUARE takes and returns each scalar as the rows are read. I would like to know if it is possible to have a user-defined function with an accumulate method (like a UDA) that can operate on all the data at the end of the import and then return a column of the same length.
NB: At the moment I'm on SQL Server 2005, but we'll switch soon to 2012 (or possibly 2014 in a few months), so answers based on any 2005+ version of SQL Server are fine.
EDIT: added the R tag for R developers who have, hopefully, already faced such difficulties.
EDIT2: Added the CLR tag. I went through CLR user-defined aggregates as described in the Pro T-SQL 2005 Programmer's Guide. I already said above that this type of function wouldn't fit my needs, but it was worth looking into. The 4 methods a UDA needs are Init, Accumulate, Merge and Terminate. My requirement needs the whole data set to be analysed together by the same instance of the UDA, so options involving Merge methods that group together partial results from multicore processing won't work.
I think you may consider changing your mind a bit. The SQL language is very good when working with sets of data, especially in modern RDBMS implementations (like SQL Server 2012), but you have to think in sets, not in rows or columns. While I still don't know your exact tasks, let's see: SQL Server 2012 has a very nice set of window functions + ranking functions + analytic functions + common table expressions, so you can write almost any query inline. You can use chains of common table expressions to turn your data any way you want, to calculate running totals, to calculate averages or other aggregates over a window, and so on.
Actually, I've always liked SQL, and when I learned a bit of functional languages (ML and Scala), my thought was that my approach to SQL is very similar to the functional paradigm - just slicing and dicing data without saving anything into variables, until you have the result set you need.
Just a quick example - here's a question from SO: How to get average of the 'middle' values in a group? The goal was to get, for each group, the average of the middle 3 values:
TEST_ID  TEST_VALUE  GROUP_ID
1        5           1        -+
2        10          1         +- these values for group_id = 1
3        15          1        -+
4        25          2        -+
5        35          2         +- these values for group_id = 2
6        5           2        -+
7        15          2
8        25          3
9        45          3        -+
10       55          3         +- these values for group_id = 3
11       15          3        -+
12       5           3
13       25          3
14       45          4         +- this value for group_id = 4
For me, it's not an easy task to do in R, but in SQL it could be a really simple query like this:
with cte as (
select
*,
row_number() over(partition by group_id order by test_value) as rn,
count(*) over(partition by group_id) as cnt
from test
)
select
group_id, avg(test_value)
from cte
where
cnt <= 3 or
(rn >= cnt / 2 - 1 and rn <= cnt / 2 + 1)
group by group_id
You can also easily expand this query to get 5 values around the middle.
Take a closer look at the analytic functions and try to rethink your calculations in terms of window functions; maybe it's not so hard to rewrite your R procedures in plain SQL.
Hope it helps.
I would solve this by passing a reference to the record(s) you want to process, and using a so-called "inline table-valued function" to return the record(s) after processing the initial records.
You can find the table-valued function reference here:
http://technet.microsoft.com/en-en/library/ms186755.aspx
A Sample:
CREATE FUNCTION Sales.CustomerExtendedInfo (@CustomerID int)
RETURNS TABLE
AS
RETURN
(
    SELECT FirstName + LastName AS CompleteName,
           DATEDIFF(Day, CreateDate, GetDate()) AS DaysSinceCreation
    FROM Customer_Detail
    WHERE CustomerID = @CustomerID
);
GO
CustomerID would be the primary key of the records you want to process.
The table-valued function can afterwards be joined to other query results if you want to process more than one record at once.
Here is a Sample:
SELECT * FROM Customer_Detail
CROSS APPLY Sales.CustomerExtendedInfo (CustomerID)
Using a normal stored procedure would do more or less the same, but it's a bit tricky to work with the results programmatically.
But keep one thing in mind: SQL Server is not really made for "functional programming". It's brilliant at working with data and sets of data, but the more you use it as an "application server", the more you will realize it's not made for that.
I don't think this is possible in pure T-SQL without using cursors. But with cursors, things will usually be very slow. Cursors process the table row by row, and some people call this "slow-by-slow".
But you can create your own aggregate function (see Technet for more details). You have to implement the function using the .NET CLR (e.g. C# or R.NET).
For a nice example see here.
I think interfacing R with SQL is a very nice solution. Oracle is offering this combo as a commercial product, so why not go the same way with SQL Server.
When integrating R using your own aggregate functions, you will only pay a small performance penalty. Custom aggregate functions are quite fast according to the Microsoft documentation: "Managed code generally performs slightly slower than built-in SQL Server aggregate functions". And the R.NET solution also seems to be quite fast, since it loads the native R DLL directly into the running process, so it should be much faster than using R over ODBC.
ORIGINAL RESPONSE:
If you already know which functions you will need, one approach I can think of is creating one inline function for each method/operation you want to apply per table.
What do I mean by that? For example, you mentioned the Customer_Detail table: when you select from it, you might need one method, derivative(Amount, Date). Let's say a second method you might need (I am just making this up for the explanation) is derivative1(Amount1, Date1).
We create two inline functions; each does its own calculation inside the function on the intended columns and also returns the remaining columns as they are. That way you get all the columns you would get from the table, and the custom calculation is performed as a set-based operation instead of a scalar operation.
Later you can combine the independent column calculations into the same function if that makes sense.
You can still use all these functions and JOIN them to get all the custom calculations in a single result set if needed, as all the functions will have the common/unprocessed columns coming through as they are.
see the example below.
IF object_id('Product','u') IS NOT NULL
DROP TABLE Product
GO
CREATE TABLE Product
(
pname sysname NOT NULL
,pid INT NOT NULL
,totalqty INT NOT NULL DEFAULT 1
,uprice NUMERIC(28,10) NOT NULL DEFAULT 0
)
GO
INSERT INTO Product( pname, pid, totalqty, uprice )
SELECT 'pen',1,100,1.2
UNION ALL SELECT 'book',2,300,10.00
UNION ALL SELECT 'lock',3,500,15.00
GO
IF object_id('ufn_Product_totalValue','IF') IS NOT NULL
DROP FUNCTION ufn_Product_totalValue
GO
CREATE FUNCTION ufn_Product_totalValue
(
@newqty int
,@newuprice numeric(28,10)
)
RETURNS TABLE AS
RETURN
(
SELECT pname,pid,totalqty,uprice,totalqty*uprice AS totalValue
FROM
(
SELECT
pname
,pid
,totalqty+@newqty AS totalqty
,uprice+@newuprice AS uprice
FROM Product
)qry
)
GO
IF object_id('ufn_Product_totalValuePct','IF') IS NOT NULL
DROP FUNCTION ufn_Product_totalValuePct
GO
CREATE FUNCTION ufn_Product_totalValuePct
(
@newqty int
,@newuprice numeric(28,10)
)
RETURNS TABLE AS
RETURN
(
SELECT pname,pid,totalqty,uprice,totalqty*uprice/100 AS totalValuePct
FROM
(
SELECT
pname
,pid
,totalqty+@newqty AS totalqty
,uprice+@newuprice AS uprice
FROM Product
)qry
)
GO
SELECT * FROM ufn_Product_totalValue(10,5)
SELECT * FROM ufn_Product_totalValuePct(10,5)
select tv.pname,tv.pid,tv.totalValue,pct.totalValuePct
from ufn_Product_totalValue(10,5) tv
join ufn_Product_totalValuePct(10,5) pct
on tv.pid=pct.pid
EDIT2:
Three-point smoothing algorithm
IF OBJECT_ID('Test3PointSmoothingAlgo','u') IS NOT NULL
DROP TABLE Test3PointSmoothingAlgo
GO
CREATE TABLE Test3PointSmoothingAlgo
(
qty INT NOT NULL
,id INT IDENTITY NOT NULL
)
GO
INSERT Test3PointSmoothingAlgo( qty ) SELECT 10 UNION SELECT 20 UNION SELECT 30
GO
IF object_id('ufn_Test3PointSmoothingAlgo_qty','IF') IS NOT NULL
DROP FUNCTION ufn_Test3PointSmoothingAlgo_qty
GO
CREATE FUNCTION ufn_Test3PointSmoothingAlgo_qty
(
@ID INT -- this is a dummy parameter
)
RETURNS TABLE AS
RETURN
(
WITH CTE_3PSA(SmoothingPoint,Coefficients)
AS --finding the ID of adjacent points
(
SELECT id,id
FROM Test3PointSmoothingAlgo
UNION
SELECT id,id-1
FROM Test3PointSmoothingAlgo
UNION
SELECT id,id+1
FROM Test3PointSmoothingAlgo
)
--Apply 3 point Smoothing algorithms formula
SELECT a.SmoothingPoint, SUM(ISNULL(b.qty,0)) / 3.0 AS Qty_Smoothed -- 3-point smoothing formula; 3.0 avoids integer division
FROM CTE_3PSA a
LEFT JOIN Test3PointSmoothingAlgo b
ON a.Coefficients=b.id
GROUP BY a.SmoothingPoint
)
GO
SELECT SmoothingPoint,Qty_Smoothed FROM dbo.ufn_Test3PointSmoothingAlgo_qty(NULL)
I think you may need to break your functionality into two parts: UDAs, which can work over windows thanks to the OVER (...) clause, and formulas which combine the resulting scalars.
What you are asking for - defining objects in such a way as to make an aggregate/scalar combo - is probably beyond the capabilities of regular SQL Server, unless you fall back to CLR code, which would effectively be the equivalent of a cursor in terms of performance, or worse.
Your best shot is probably to define an SP (I know you don't want that) that produces the whole result - like a [derivative] stored procedure that takes table and column names as parameters. You can even expand on the idea, but in the end that's not exactly what you want.
Since you mention you will be upgrading to SQL Server 2012: SQL Server 2008 introduced Table-Valued Parameters (TVPs).
This feature will do what you want. You will have to define a user-defined table type in your DB, which is like a table definition with columns and their respective types.
You can then use that table type as a parameter type for any other stored procedure or function in your DB.
You can combine these table types with CLR integration to achieve what you require.
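A minimal sketch of that combination (all names hypothetical): a table type carrying the (time, value) pairs, consumed by an inline function that sees the whole column as a set.

-- Table type: the whole series travels as one parameter
CREATE TYPE dbo.SeriesType AS TABLE
(
    rn     INT            NOT NULL PRIMARY KEY,  -- row order
    tstamp DATETIME       NOT NULL,
    val    NUMERIC(28,10) NOT NULL
);
GO
-- Inline TVF over the TVP; table parameters must be READONLY
CREATE FUNCTION dbo.Derivative (@series dbo.SeriesType READONLY)
RETURNS TABLE AS RETURN
(
    SELECT cur.rn,
           (nxt.val - cur.val)
               / NULLIF(DATEDIFF(SECOND, cur.tstamp, nxt.tstamp), 0) AS deriv
    FROM @series cur
    JOIN @series nxt ON nxt.rn = cur.rn + 1      -- forward difference
);
GO

The caller then fills a variable of the type (DECLARE @s dbo.SeriesType; INSERT @s ...) and passes it to dbo.Derivative to get the derivative column back for the whole series.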
As mentioned, SQL is not good when you are comparing rows to other rows; it's much better at set-based operations where every row is treated as an independent entity.
But before looking at cursors and CLR, you should make sure it can't be done in pure T-SQL, which will almost always be faster and scale better as your table grows.
One method for comparing rows based on order is to wrap your data in a CTE, adding a ranking function like ROW_NUMBER to set the row order, followed by a self-join of the CTE onto itself.
The join is performed on the ordered field, e.g. ROW_NUMBER = (ROW_NUMBER - 1).
Look at this article for an example
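Applied to the derivative(Amount, Date) example from the question, that pattern could look like this (a sketch against Customer_Detail; on SQL Server 2012 the LEAD window function would replace the self-join entirely):

WITH ordered AS (
    SELECT Customer_ID, [Date], Amount,
           ROW_NUMBER() OVER (PARTITION BY Customer_ID ORDER BY [Date]) AS rn
    FROM Customer_Detail
)
SELECT cur.Customer_ID, cur.[Date], cur.Amount,
       (nxt.Amount - cur.Amount) * 1.0
           / NULLIF(DATEDIFF(DAY, cur.[Date], nxt.[Date]), 0) AS derivative
FROM ordered cur
LEFT JOIN ordered nxt
       ON nxt.Customer_ID = cur.Customer_ID
      AND nxt.rn = cur.rn + 1       -- self-join on the ordered field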

Find string similarities between two dimensions in SQL

I have two tables, and I want to find matches where values from one table can be found in the other.
In table A I have a list of search queries by users, and in table B I have a selection of search queries I want to find. To make this work I want to use a method similar to:
SELECT UTL_MATCH.JARO_WINKLER_SIMILARITY('shackleford', 'shackelford') FROM DUAL
I have used the method below, but it does not work, as there can be a difference between the query and the name in the selection.
SELECT query FROM search_log WHERE query IN (SELECT navn FROM selection_table);
Are there any best practice methods for finding similarities through a query?
One approach might be something like:
SELECT
SEARCH_LOG.QUERY
FROM
SEARCH_LOG
WHERE
EXISTS
(
SELECT
NULL
FROM
SELECTION_TABLE
WHERE
UTL_MATCH.JARO_WINKLER_SIMILARITY(SEARCH_LOG.QUERY, SELECTION_TABLE.NAVN) >= 98
);
This will return rows in SEARCH_LOG that have a row in SELECTION_TABLE where NAVN matches QUERY with a score of at least 98 (out of 100). You could change the 98 to whatever threshold you prefer.
This is a "brute force" approach because it potentially looks at all combinations of rows. So, it might not be "best practice", but it might still be practical. If performance is important, you might consider a more sophisticated solution like Oracle Text.