Find sql records containing similar strings - sql

I have the following table with 2 columns: ID and Title containing over 500.000 records. For example:
ID Title
-- ------------------------
1 Aliens
2 Aliens (1986)
3 Aliens vs Predator
4 Aliens 2
5 The making of "Aliens"
I need to find records that are very similar, and by that I mean they are different by 3-6 letters, usually this difference is at the end of the Titles. So I have to design a query that returns the records no. 1,2 and 4. I already looked at levenstein distance but I don't know how to apply it. Also because of the number of records the query shouldn't take all night long.
Thanks for any idea or suggestion

If you really want to define similarity in the exact way that you have formulated in your question, then you would - as you say - have to implement the Levensthein Distance calculation. Either in code calculated on each row retrieved by a DataReader or as a SQL Server function.
The problem stated is actually more tricky than it may appear at first sight, because you cannot assume to know what the mutually shared elements between two strings may be.
So in addition to Levensthein Distance you probably also want to specify a minimum number of consecutive characters that actually have to match (in order for sufficient similarity to be concluded).
In sum: It sounds like an overly complicated and time consuming/slow approach.
Interestingly, in SQL Server 2008 you have the DIFFERENCE function which may be used for something like this.
It evaluates the phonetic value of two strings and calculates the difference. I'm unsure if you will get it to work properly for multi-word expressions such as movie titles since it doesn't deal well with spaces or numbers and puts too much emphasis on the beginning of the string, but it is still an interesting predicate to be aware of.
If what you are actually trying to describe is some sort of search feature, then you should look into the Full Text Search capabilities of SQL Server 2008. It provides built-in Thesaurus support, fancy SQL predicates and a ranking mechanism for "best matches"
EDIT: If you are looking to eliminate duplicates maybe you could look into SSIS Fuzzy Lookup and Fuzzy Group Transformation. I have not tried this myself, but it looks like a promising lead.
EDIT2: If you don't want to dig into SSIS and still struggle with the performance of the Levensthein Distance algorithm, you could perhaps try this algorithm which appears to be less complex.

For all the Googlers out there that run into this question, though it's already been marked as answered, I figured I'd share some code to help with this. If you're able to do CLR user-defined functions on your SQL Server, you can implement your own Levensthein Distance algorithm and then from there create a function that gives you a 'similarity score' called dbo.GetSimilarityScore(). I've based my score case-insensitivity, without much weight to jumbled word order and non-alphanumeric characters. You can adjust your scoring algorithm as needed, but this is a good start. Credit to this code project link for getting me started.
Option Explicit On
Option Strict On
Option Compare Binary
Option Infer On
Imports System
Imports System.Collections.Generic
Imports System.Data
Imports System.Data.SqlClient
Imports System.Data.SqlTypes
Imports System.Text
Imports System.Text.RegularExpressions
Imports Microsoft.SqlServer.Server
Partial Public Class UserDefinedFunctions
Private Const Xms As RegexOptions = RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline Or RegexOptions.Singleline
Private Const Xmsi As RegexOptions = Xms Or RegexOptions.IgnoreCase
''' <summary>
''' Compute the distance between two strings.
''' </summary>
''' <param name="s1">The first of the two strings.</param>
''' <param name="s2">The second of the two strings.</param>
''' <returns>The Levenshtein cost.</returns>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32
If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
Dim s1 As String = string1.Value
Dim s2 As String = string2.Value
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim d As Integer(,) = New Integer(n, m) {}
' Step 1
If n = 0 Then Return m
If m = 0 Then Return n
' Step 2
For i As Integer = 0 To n
d(i, 0) = i
Next
For j As Integer = 0 To m
d(0, j) = j
Next
' Step 3
For i As Integer = 1 To n
'Step 4
For j As Integer = 1 To m
' Step 5
Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)
' Step 6
d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
Next
Next
' Step 7
Return d(n, m)
End Function
''' <summary>
''' Returns a score between 0.0-1.0 indicating how closely two strings match. 1.0 is a 100%
''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings.
''' </summary>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble
If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)
Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)
If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too
Dim score1 As SqlDouble = InternalGetSimilarityScore(s1, s2)
If score1.IsNull Then Return SqlDouble.Null
Dim mod1 As String = GetSimilarityString(s1)
Dim mod2 As String = GetSimilarityString(s2)
Dim score2 As SqlDouble = InternalGetSimilarityScore(mod1, mod2)
If score2.IsNull Then Return SqlDouble.Null
If score1 = 1.0F AndAlso score2 = 1.0F Then Return 1.0F
If score1 = 0.0F AndAlso score2 = 0.0F Then Return 0.0F
' Return weighted result
Return (score1 * 0.2F) + (score2 * 0.8F)
End Function
Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As SqlDouble
Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)
Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)
If maxLen = 0 Then Return 1.0F
Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen)
End Function
''' <summary>
''' Removes all non-alpha numeric characters and then sorts
''' the words in alphabetical order.
''' </summary>
Private Shared Function GetSimilarityString(s1 As String) As String
Dim normString = Regex.Replace(If(s1, ""), "\W|_", " ", Xms)
normString = Regex.Replace(normString, "\s+", " ", Xms).Trim()
Dim words As New List(Of String)(normString.Split(" "c))
words.Sort()
Return String.Join(" ", words.ToArray())
End Function
End Class

select id, title
from my_table
where
title like 'Aliens%'
and
len(rtrim(title)) < len('Aliens') + 7

From what you've asked I imagine the differences you're looking for should not be more than a single word at the end of the original title. Is that why 1,2 and 4 are returned?
Anyway I've made a query that checks the difference at the end consists of a single word, without spaces.
declare #title varchar(20)
set #title = 'Aliens'
select id, title
from movies with (nolock)
where ltrim(title) like #title + '%'
and Charindex(' ', ltrim(right(title, len(title) - len(#title)))) = 0
and len(ltrim(right(title, len(title) - len(#title)))) < 7
hope it helps.

if you are using sql server 2008 you should be able to use the FULLTEXT functionality.
The basic steps are:
1) Create a fulltext index over the column. This will tokenise each string (stremmers, splitters, etc) and let you search for 'LIKE THIS' strings.
The disclaimer is that I've never had to use it but I think it can do what you want.
Start reading here: http://msdn.microsoft.com/en-us/library/ms142571.aspx

You can try SSIS Fuzzy Grouping and it will give you score based on string matches.

You can also use utl_match in Oracle.
enter link description here

Related

Visual Basic programming

I am new to the programing world and I am stuck on the below problem please could you assist me
Write a Visual Basic.net function to calculate the sum of all the numbers in an input field. For example, if the input string is: "ICT2611", then the numbers included in this string are: 2, 6, 1, 1 and their sum is therefore, 2+6+1+1 = 10
The below should solve your problem, it uses Regex to find any matches in the provided string to the expression (numbers 1 - 9), and then iterates through them tallying them up as it goes.
Public Function SumOfString(str As String) As Integer
Dim total As Integer = 0
For Each i As Match In Regex.Matches(str, "[1-9]")
total += i.Value
Next
Return total
End Function
Alternatively the same thing could be achieved like this, this just iterates through each character in the string and then checks to see if it's a digit. If it is a digit then it'll tally it up.
Public Function SumOfString(str As String) As Integer
Dim total As Integer = 0
For Each i As Char In str
If Char.IsDigit(i) Then total += Integer.Parse(i)
Next
Return total
End Function

Get the nth character, string, or number after delimiter in Visual Basic

In VB.net (Visual Studio 2015) how can I get the nth string (or number) in a comma-separated list?Say I have a comma-separated list of numbers like so:13,1,6,7,2,12,9,3,5,11,4,8,10How can I get, say, the 5th value in this string, in this case 12?I've looked at the Split function, but it converts a string into an array. I guess I could do that and then get the 5th element of that array, but that seems like a lot to go through just to get the 5th element. Is there a more direct way to do this, or am I pretty much limited to the Split function?
In case you are looking for an alternative method, which is more basic, you can try this:
Module Module1
Sub Main()
Dim a As String = "13,1,6,7,2,12,9,3,5,11,4,8,10"
Dim counter As Integer = 5 'the number you want (in this case, 5th one)
Dim movingcounter As Integer = 0 'how many times we have moved
Dim startofnumber, endofnumber, i As Integer
Dim numberthatIwant As String
Do Until movingcounter = counter
startofnumber = InStr(i + 1, a, ",")
i = startofnumber
movingcounter = movingcounter + 1
Loop
endofnumber = InStr(startofnumber + 1, a, ",")
numberthatIwant = (Mid(a, startofnumber + 1, endofnumber - startofnumber - 1))
Console.WriteLine("The number that I want: " + numberthatIwant)
Console.ReadLine()
End Sub
End Module
Edit: You can make this into a procedure or function if you wish to use it in a larger program, but this code run in console mode will give the output of 12.
The solution provided by Plutonix as a comment to my question is straightforward and exactly what I was looking for, to wit:result = csv.Split(","c)(5)In my case I was incrementing a variable each time my program ran and needed to get the nth character or string after the incremented value. That is, if my program had incremented the variable 5 times, then I needed the string after the 4th comma, which of course, is the 5th string. So my solution was something like this:result = WholeString.Split(","c)(IncrementedVariable)Note that this is a zero-based variable.Thanks, Plutonix.

how i can figuring the highest number in my array...visual basic

how i can figure the highest number in my array...below is the code...can someone help me to solve my problems...n i wan to show the result in the label from the other windows form....thank u... :
Public Class Frm2
Public Parties(9) As String
Public Votes(9) As String
Dim vote As Integer
Dim Party As String
Party = TParty.Text
vote = TVote.Text
For I As Integer = 0 To Parties.Length - 1
If Parties(I) = "" Then
Parties(I) = TParty.Text()
For J As Integer = 0 To Votes.Length - 1
If Votes(J) = "" Then
Votes(J) = TVote.Text()
MsgBox(TParty.Text & TVote.Text & " votes")
TParty.Clear()
TVote.Clear()
Exit Sub
End If
Next J
End If
Next
MsgBox("you can vote now")
If you want to use an algorithm to find the highest number into an array (let's say Votes), the classic is coming from the so-called Bubble Sort:
Dim max As Long 'change the type accordingly, for example if votes are 1-10 then Integer is better
max = Votes(0) 'set the first vote as the max
For j = 1 To Votes.Length - 1
If Votes(j) >= max Then max = Votes(j) 'if another element is larger, then it is the max
Next j
Now the variable max stores the highest value of the array Votes, that you can show anywhere as, for example, in MyForm.MyLabel.Text = max. More useful info here.
Please note that now you declare Public Votes(9) As String, which means they are strings so not usable as numbers. You might want to declare them with a different data type, or use the CInt() method to convert strings in integers as suggested by ja72.
I thought this would only work with a Variant array, but in quick testing it seems to work with an array of Longs as well:
Dim Votes(9) as Long
Dim Max As Long
Max=WorksheetFunction.Max(Votes)
Note that, as Matteo says, you should change Votes() to an array of numeric types. I'd use Long, as it's a native VBA type.
EDIT: As noted by Dee, the code in this question is actually VB.Net. I added that as a tag. In VBA the solution would be even simpler, as Max is an array property:
Max=Votes.Max
(I suppose it would be a good idea to change the variable name from "Max".)

Stripping out Non Numerical Characters using a Query

Light user of MS Access so not a power user by any means.
Ok, to explain what I want first of all.
I have two tables, one with a username XXX99999 ( 3 Alpha 5 Numeric ) and the other one just 99999 ( 5 numeric ).
They are one in the same, for the most part I can safely drop the first 3 letters and perform what I need to 'link' using the last 5 Numeric digits only.
I imagine doing this by a query.
My question is, how would I mask this to build my query.
All 5 Numeric are unique.
If you take the function by #paxdiablo here (VBA: How to Find Numbers from String), which is
Public Function onlyDigits(s As String) As String
' Variables needed (remember to use "option explicit"). '
Dim retval As String ' This is the return string. '
Dim i As Integer ' Counter for character position. '
' Initialise return string to empty '
retval = ""
' For every character in input string, copy digits to '
' return string. '
For i = 1 To Len(s)
If Mid(s, i, 1) >= "0" And Mid(s, i, 1) <= "9" Then
retval = retval + Mid(s, i, 1)
End If
Next
' Then return the return string. '
onlyDigits = retval
End Function
paste it into a Module and save it, you should be able to link both tables in a query like this (assuming Table1 has the only numbers field and Table2 has the alpha and numbers):
SELECT Table1.MyField1, Table2.MyField2
FROM Table1 INNER JOIN Table2
ON CStr(Table1.OnlyNumbersField) = onlyDigits(Table2.TextAndNumberField);
This will strip the alpha characters behind the scenes, make sure both datatypes are the same, then "link" them to produce the joined result in your query. I know you said you are not a power user and this may be complicated for you, but there is no "easy" way to do this. I could walk you through doing it through multiple queries, which may make more sense to you, but it is a lot more to explain.

Shortening a repeating sequence in a string

I have built a blog platform in VB.NET where the audience are very young, and for some reason like to express their commitment by repeating sequences of characters in their comments.
Examples:
Hi!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3<3
LOLOLOLOLOLOLOLOLOLOLOLOLLOLOLOLOLOLOLOLOLOLOLOLOL
..and so on.
I don't want to filter this out completely, however, I would like to shorten it down to a maximum of 5 repeating characters or sequences in a row.
I have no problem writing a function to handle a single repeating character. But what is the most effective way to filter out a repeating sequence as well?
This is what I used earlier for the single repeating characters
Private Shared Function RemoveSequence(ByVal str As String) As String
Dim sb As New System.Text.StringBuilder
sb.Capacity = str.Length
Dim c As Char
Dim prev As Char = String.Empty
Dim prevCount As Integer = 0
For i As Integer = 0 To str.Length - 1
c = str(i)
If c = prev Then
If prevCount < 10 Then
sb.Append(c)
End If
prevCount += 1
Else
sb.Append(c)
prevCount = 0
End If
prev = c
Next
Return sb.ToString
End Function
Any help would be greatly appreciated
You should be able to recursively use the 'Longest repeated substring problem' to solve this. On the first pass you will get two matching sub-strings, and will need to check if they are contiguous. Then repeat the step for one of the sub-strings. Cut off the algo, if the strings are not contiguous, or if the string size become less than a certain number of characters. Finally, you should be able to keep the last match, and discard the rest. You will need to dig around for an implementation :(
Also have a look at this previously asked question: finding long repeated substrings in a massive string