Find Text Duplicates in Excel - vba

I'm an Excel/VBA newbie and I have a question.
Is it possible to tag partial string matches between two columns in Excel?
Let's say I have two columns, A and B, that have text values in them. I want to identify rows where the A cell and the B cell have a partial match.
Here are some hypothetical cases of the 'partial matches' that I'm looking for.
Case 1: exact phrase match (Fictional Company Ltd) but one column has extra text
Cell A2: 123456789 Fictional Company Ltd
Cell B2: Fictional Company Ltd
Case 2: exact phrase match (Fictional Company Ltd) but both columns have extra text
Cell A3: 123456789 Fictional Company Ltd
Cell B3: Fictional Company Ltd, 1 Main Street, City, State 12345
Case 3: partial match
Cell A4: Fictional Ltd
Cell B4: Fictional Company Ltd
Case 4: word match
Cell A5: Fictional Company Ltd
Cell B5: Fictional
I would like to identify all of the cases above. However, I don't mind running more than one piece of code to cover them all.
Thanks a lot in advance for your help!
Update: when I first created the cases, I didn't realize that I put the first word in column B as the matching word with column A. It is not the case - sometimes it is the 3rd word in column B and the 5th word in column A that matches.. the data is all over the place!
Update 2: I also want to clarify that the cases are reversible - for example, there are some rows where it's Case 1 but cell B has more info instead of cell A.

This function returns the number of words in text2 that are contained anywhere (not just as a whole word) in text1:
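Function CountMatches(text1 As String, text2 As String) As Long
    Dim arr, x As Long
    arr = Split(text2)
    For x = 0 To UBound(arr)
        ' Skip empty tokens (from repeated spaces), which would otherwise match everything
        If Len(arr(x)) > 0 Then
            If text1 Like "*" & arr(x) & "*" Then CountMatches = CountMatches + 1
        End If
    Next x
End Function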
...and this one does the same, but also counts each word of text1 that is contained anywhere within text2:
Function CountMatches2(text1 As String, text2 As String) As Long
    Dim arr, x As Long
    ' Words of text1 found anywhere in text2
    arr = Split(text1)
    For x = 0 To UBound(arr)
        If Len(arr(x)) > 0 Then
            If text2 Like "*" & arr(x) & "*" Then CountMatches2 = CountMatches2 + 1
        End If
    Next x
    ' Words of text2 found anywhere in text1
    arr = Split(text2)
    For x = 0 To UBound(arr)
        If Len(arr(x)) > 0 Then
            If text1 Like "*" & arr(x) & "*" Then CountMatches2 = CountMatches2 + 1
        End If
    Next x
End Function
Both are susceptible to counting the same match twice, especially (obviously) the CountMatches2.
I'm curious if this suits your needs (as it's obviously not a true "fuzzy match")...
It can be easily modified to return a TRUE/FALSE (ie., TRUE = One or more matches) or to look only for entire word matches as opposed to "anywhere".
Let me know if you have any questions!
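As a rough sketch of the TRUE/FALSE, whole-word variant mentioned above (the name HasSharedWord is made up for illustration, and it assumes words are separated by single spaces and ignores upper/lower case):

```vba
' Hypothetical helper: True if at least one whole word of text1
' also appears as a whole word in text2 (case-insensitive).
Public Function HasSharedWord(ByVal text1 As String, ByVal text2 As String) As Boolean
    Dim words() As String
    Dim padded As String
    Dim i As Long

    ' Pad with spaces so every word in text2 is delimited on both sides
    padded = " " & LCase$(text2) & " "
    words = Split(LCase$(text1), " ")

    For i = LBound(words) To UBound(words)
        If Len(words(i)) > 0 Then
            If InStr(padded, " " & words(i) & " ") > 0 Then
                HasSharedWord = True
                Exit Function
            End If
        End If
    Next i
End Function
```

Used on a sheet as =HasSharedWord(A2,B2). Punctuation attached to a word (such as "Ltd,") counts as part of that word here, so you may want to strip commas first.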

Case 1 is possible when the extra text trails the shared phrase: truncate the longer string to the length of the shorter one, and then see if they are the same. Use the LEFT function to trim the longer string to the length of the shorter one. (Use the LEN function on the shorter string to work out how long it is.) When the extra text comes first, as in the sample cell A2, use RIGHT instead of LEFT, or use SEARCH as in Case 4.
Case 2 is tricky but possible, because you effectively need to search the longer string for every possible combination of ordered words from the shorter. It's kind of a 'slightly simpler' version of Case 3.
Case 3 is damn tricky: it's pretty much a Fuzzy Match which is computationally expensive, and requires something called tokenisation to do efficiently. Microsoft has a free Fuzzy Match addin but it's kinda sucky...it returns many false positives to the point that you need to eyeball each and every result to make sure it is a valid one. Which completely defeats the purpose. I'm working on putting together a commercial offering in that space myself that returns far fewer false positives, but can't share code. Suffice to say that this is a very difficult thing to do efficiently.
Case 4 is trivial: you just use the SEARCH formula.
Add a whole 'nother layer of trickiness if you have multiple words in each list.
The above answer is enough to point you in the right direction for a Google search. Note that you can simplify things by substituting out things like "Ltd" and "Limited" and other sundry terms using the SUBSTITUTE formula, but you've still got a heck of a challenge on your hands.
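For Cases 1 and 4, a minimal VBA sketch of the "one cell contains the other" test might look like the following. The function name OneContainsOther is invented for this example; it checks both directions because the question says the cases are reversible, and it ignores case the way SEARCH does.

```vba
' Hypothetical helper: True if either string contains the other in full.
Public Function OneContainsOther(ByVal a As String, ByVal b As String) As Boolean
    a = Trim$(a)
    b = Trim$(b)
    ' Blank cells never count as a match
    If Len(a) = 0 Or Len(b) = 0 Then Exit Function
    OneContainsOther = (InStr(1, a, b, vbTextCompare) > 0) _
                    Or (InStr(1, b, a, vbTextCompare) > 0)
End Function
```

It returns TRUE for the Case 1 and Case 4 samples and FALSE for Case 3 ("Fictional Ltd" vs "Fictional Company Ltd"), which still needs a word-level or fuzzy comparison.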

Related

Find cell value, match, cut, move, ...vba

I am a beginner in VBA.
I have components that always consist of two parts (a rotor and a stator, each with its own number). Working with them can damage some of these parts, and a list of damaged parts has to be kept; the resulting inventory is, e.g., 200 rotors and 150 stators with different numbers. Before I can scrap them, I need to complete them as proper sets, i.e. rotor "a" with stator "a", "b" with "b", etc. Comparing and copying that many numbers by hand to find the quantity of sets is crazy.
It should be possible to solve this with a macro, which is what I am trying to do, but I am stuck.
The task: column A holds a list of all damaged parts (a mix of rotors and stators with different numbers). Column C holds, via VLOOKUP and for information only, the number of the expected counterpart.
What I need to solve: row 5 of column A holds a component number, and I know that its counterpart is somewhere in the same column, from row 6 to xx. Using the counterpart number in column C of the same row (5), I need to find the counterpart in column A, take it out, and put it into cell B5, which gives me a complete set. Then the same action for the next row (6): the macro reads the number in C, searches in A, and when it finds it, cuts it and pastes it into B. Then rows 7, 8, 9, … The result should be a certain quantity of pairs plus some single numbers for which no second part was found.
The problem is that the loop only works as long as a related counterpart is always found. If the counterpart is not available in column A (no match between C and A), the code stops on that row.
What I need help with: if the code does not find the counterpart based on the info from C, it should just skip this row, make it red, and continue with the next row until the end, which means stopping at the first empty cell in C. Thanks a lot to everybody who is helping me.
Sub PairParts()
    Dim pn As Range
    Dim a
    Dim x
    x = 5
    Dim i As Long, radek As Long
    a = Cells(x, 3)
    For i = 1 To 500
        Range("A:A").Select
        Set pn = Selection.Find(What:=a)
        If Not pn Is Nothing Then
            pn.Select
        End If
        Selection.Cut
        Cells(x, 2).Select
        ActiveSheet.Paste
        x = x + 1
    Next
End Sub
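One possible way to get the skip-and-mark-red behaviour is sketched below. It is only a sketch under a few assumptions that are not in the post: the Sub name is made up, the data is on the active sheet starting in row 5, and the counterpart is moved by copying its value and clearing the source cell (rather than Cut/Paste) so that the VLOOKUP formulas in column C are not dragged along with the moved cell.

```vba
' Hypothetical sketch: pair each row with its counterpart from column A.
Sub PairCounterpartsSketch()
    Dim ws As Worksheet
    Dim hit As Range
    Dim r As Long
    Dim wanted As Variant

    Set ws = ActiveSheet
    r = 5                                   ' first data row, as in the question

    Do While r <= ws.Rows.Count
        wanted = ws.Cells(r, 3).Value       ' re-read column C on every row
        If IsError(wanted) Then
            ' The lookup returned an error (e.g. #N/A): nothing to search for
        ElseIf Len(CStr(wanted)) = 0 Then
            Exit Do                         ' first empty cell in column C ends the run
        Else
            Set hit = ws.Columns(1).Find(What:=wanted, LookIn:=xlValues, LookAt:=xlWhole)
            If hit Is Nothing Then
                ' No counterpart: mark the row's part number red and carry on
                ws.Cells(r, 1).Interior.Color = vbRed
            Else
                ' Move the counterpart next to its partner
                ws.Cells(r, 2).Value = hit.Value
                hit.ClearContents
            End If
        End If
        r = r + 1
    Loop
End Sub
```

The two changes that matter compared with the original loop are that the search value is read inside the loop, and that nothing is moved unless Find actually returned a cell. Note that clearing a cell in column A may change what the lookup in column C shows for that row, so test this on a copy of the data first.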

Concatenating Row Number from a match result and a Column Letter and storing the result as a Variable's Address

I'll outline the steps I'm trying to accomplish:
1) Search through a spreadsheet for an acct # via match.
2) If it exists, I'd add offset #__ cells to the right and select that cell.
3) Set the selected cell's formula to Concatenate("ColumnLetter&Match(A1:A1000"",0) + Concatenate("ColumnLetter&Match(A1:A1000"",0)
FX Debt 1,000
Fx Equity 2000
U.S Debt 4,000
U.S Loans 5,000
Recon 1 Recon 2 Diff
11111 $ Debt 0
11112 FX Debt
So, I'd search for, say account "11111" using =match(A1:1000, "11111", 0). If it exists I'd offset to the right of it and then select that cell. I'd then add a formula to the selected cell which would add Cell references.
I'm thinking it would look something along the lines of:
If Match(A1:A1000,"11111",0)=true
Select(A&(result from match))
Offset(three to right).select
edit
So to make the next step less ambiguous I'll separate it from the rest of the code sample. First let me explain the goal, though. The sample data above is divided into two tables, with the first table ending, for example, with the general account U.S Loans --- 5,000, and the second starting with the Acct # and Recon 1. My goal is to put references to certain cells that contain the values of the general accounts in the first table (not the values themselves; I want to be able to trace back to the accounts using precedents and dependents) into the selected offset cell. The way I thought I'd go about this was to search for the account name, for example "FX Debt", the same way David suggested to find the Acct #. I'd then use a similar offset method to add the cell containing 1,000, say B2, into the original offset cell to the right of the Account #.
end edit
edit 2
Dim searchRange as Range
Dim myMatch as Variant
Set searchRange = Range("A1:A1000")
myMatch = Match("11111", searchRange, 0)
If Not IsError(myMatch) Then
rng.Cells(myMatch).Offset(,3).Formula = Sum(Match("U.S Debt", searchRange, 0).Offset(,2)+(Match("U.S Debt", searchRange, 0).Offset(,2))...
End If
Does this make more sense? I'm trying to add the amounts associated with U.S Debt and U.S Loans to the master account ($ Debt).
end edit 2
Don't bother with Selecting the cell. It's unnecessary about 99% of the time (probably more). More detail, here:
How to avoid using Select in Excel VBA macros
Also, your Match syntax is wrong. You need to do:
=Match("11111", A1:A1000, 0)
So, putting it all together, something like:
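Dim searchRange as Range
Dim myMatch as Variant
Set searchRange = Range("A1:A1000")
' In VBA, call Match through the Application object; it returns an error value when nothing is found
myMatch = Application.Match("11111", searchRange, 0)
If Not IsError(myMatch) Then
    searchRange.Cells(myMatch).Offset(,3).Formula = ...
End If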
I did not attempt to interpret the formula string given below; I'm not sure I understand what it's supposed to be doing:
sum(((Column Number -->)"I" + match(A1:A1000,"",0)+("I"+match(A1:A1000,"",0))
But at the very least we can consolidate your pseudo-code using the myMatch variable:
sum(((Column Number -->)"I" + myMatch+("I"+myMatch)
(A word of caution: the + operator can be used to concatenate strings, but there are several reasons why the & operator is preferable, notably that the + operator is ambiguous and defaults to mathematical addition when one of the arguments is a numeric type. In other words, it attempts to add a number and a string, which results in a Type Mismatch error whenever the string cannot be read as a number.)
So revise to:
sum(((Column Number -->)"I" & myMatch & ("I"& myMatch)
Even after cleaning it up, I'm still not sure what you're trying to do with the above formula, but if you can try to explain then I can probably assist.
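If the goal is what "edit 2" in the question describes (a traceable sum of the U.S Debt and U.S Loans cells written next to account 11111), one way it could look is sketched below. This is a guess at the intent, not a confirmed solution: the Sub name is invented, and it assumes the account names and numbers are in column A, the amounts in column B, and the target cell three columns to the right of the account number.

```vba
' Hypothetical sketch: write a formula such as =B3+B4 next to the account row.
Sub WriteTraceableSum()
    Dim searchRange As Range
    Dim acctRow As Variant, debtRow As Variant, loansRow As Variant

    Set searchRange = ActiveSheet.Range("A1:A1000")

    ' Application.Match returns an error Variant (not a run-time error) when nothing is found.
    ' Pass 11111 without quotes if the account cells hold numbers rather than text.
    acctRow = Application.Match("11111", searchRange, 0)
    debtRow = Application.Match("U.S Debt", searchRange, 0)
    loansRow = Application.Match("U.S Loans", searchRange, 0)

    If IsError(acctRow) Or IsError(debtRow) Or IsError(loansRow) Then Exit Sub

    ' searchRange starts at row 1, so each match position is also the worksheet row.
    ' Using cell references instead of values keeps precedents/dependents traceable.
    searchRange.Cells(acctRow).Offset(0, 3).Formula = "=B" & debtRow & "+B" & loansRow
End Sub
```

The formula string is built with & throughout, so the row numbers returned by Match are concatenated rather than added.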

VBA: Multiple Keyword vlookup

I have a number of narrative descriptions that I need to categorize automatically in Excel:
Description Category
I updated the o.s.
I installed the o.s.
I cleaned valve a
I cleaned valve b
I installed valve a
Today the o.s. was updated
I have another worksheet with keywords and the category the keywords are associated with:
Keyword 1   Keyword 2   Keyword 3   Category
cleaned     valve       a           A
installed   valve       a           B
updated     os                      C
installed   os                      D
My code so far can only search one keyword at a time and therefore will report incorrect answers because some keywords are used in multiple narratives:
Public Function Test21(nar As Range, ky As Range) As String
Dim sTmp As String, vWrd As Variant, vWrds As Variant
'Splits Fsr Narrative into individual words so it can be searched for keywords'
vWrds = Split(nar)
For Each vWrd In vWrds
If Not IsError(Application.VLookup(vWrd, ky, 3, False)) Then
sTmp = Application.VLookup(vWrd, ky, 3, False)
Exit For
End If
Next vWrd
Test21 = sTmp
End Function
I've seen algorithms like this but I feel that my goal could be simpler to accomplish as all narratives are relatively simple.
Thanks for reading!
You can match multiple columns with a VLOOKUP by creating a "match column" that concatenates the multiple values together, then searching that column for a match.
So if you use this formula in column A:
=B1 & "|" & C1 & "|" & D1
You can then VLOOKUP against that match column:
=VLOOKUP("blah|bleh|ugh", Sheet2!A1:E100, 5, FALSE)
Which will match the one row that has "blah" in column B, "bleh" in column C, and "ugh" in column D, and return the value in column E.
For your data though, I think you might also want to have a step to clean up your input before trying to match a set of keywords. The method I described above works best if the keywords are in a particular order, and where you won't have any non-keywords cluttering up things. (It also works excellently for vlookups where you want to match multiple pieces of data, ie. first name, middle name, and last name in different columns)
Otherwise you could end up needing an incredibly huge number of rows in your category table to cover every possible combination and permutation of your keywords and the other random words they might be accompanied by.
This is what I was looking for:
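Public Function Test22(nar As Range, key As Range, cat As Range) As String
    Dim r As Long
    ' Rows.Count is the number of keyword rows (Height is a size in points)
    For r = 1 To key.Rows.Count
        ' Compare each InStr result with 0; a bare "InStr(...) And InStr(...)" is a bitwise And of two positions
        If InStr(nar, key(r, 1)) > 0 And InStr(nar, key(r, 2)) > 0 Then
            Test22 = cat(r)
            Exit For
        End If
    Next r
End Function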
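A slightly more general variant is sketched below, in case it helps. It is an illustration only: the function name CategoryFor is made up, it checks every keyword column and ignores blank keyword cells, it drops periods so that "o.s." in a narrative matches the keyword "os", and it only accepts whole-word hits (otherwise the keyword "a" would be found inside "valve").

```vba
' Hypothetical sketch: return the category of the first row whose
' non-blank keywords all occur as whole words in the narrative.
Public Function CategoryFor(ByVal nar As String, keyTable As Range, catCol As Range) As String
    Dim r As Long, c As Long
    Dim cleaned As String, kw As String
    Dim allFound As Boolean, anyKeyword As Boolean

    ' Lower-case, drop periods, and pad with spaces for whole-word tests
    cleaned = " " & LCase$(Replace(nar, ".", "")) & " "

    For r = 1 To keyTable.Rows.Count
        allFound = True
        anyKeyword = False
        For c = 1 To keyTable.Columns.Count
            kw = LCase$(Trim$(CStr(keyTable.Cells(r, c).Value)))
            If Len(kw) > 0 Then
                anyKeyword = True
                If InStr(cleaned, " " & kw & " ") = 0 Then
                    allFound = False
                    Exit For
                End If
            End If
        Next c
        If allFound And anyKeyword Then
            CategoryFor = CStr(catCol.Cells(r, 1).Value)
            Exit Function
        End If
    Next r
End Function
```

With the keyword table from the question, "I cleaned valve a" should come back as A, "Today the o.s. was updated" as C, and "I cleaned valve b" as an empty string because no row matches all of its keywords.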

Varying Format "Part Number" sort issue

(Current Sort Sample:)
2-1203-4
2-1206-3
2CM-
3-1610-1
3-999
…
AR3021-A-7802
AR3021-A-7802-1
B43570-
B43570-3
I am working on an 8000+ record parts list. The challenge I am running into is that different manufacturers of the parts use many varying formats for their part numbers. “Part Number” is the field I wish to sort my entire worksheet on. (There are about 10 columns of data in this worksheet.)
My methodology for attacking this challenge was to count the number of characters to the left of any “-” and count the total number of numeric characters in the field. (I also set “Part Numbers” that started with a non-numeric character to a count value of 99 for both count calculations so those would sort after the numeric values.) From this, I was able to sort on the values to the left of the “-” using the MIN of the two counts. (My “Part Numbers” are in Column B and I have a header row, which means that my first “Part Number” is in cell B2.)
This method worked up to a point. My challenge is that I need to subsequently sort the values after the “-” character, as illustrated by the erroneous sort of “3-1610-1” being followed by “3-999”.
One of the limitations I see is that sorting with Data > Sort only gives three columns to sort on. Sorting on just the characters to the left of the “-” costs me those three columns. So, I am unable to repeat the whole process of counting values after the “-” character and subsequently sorting with Data > Sort after running the primary sort.
Has the sort of many differing formats of a field such as “Part Number” been solved? Is there a macro that can be applied to this challenge? If so, I would be grateful for your input.
This data is continuously updated with new part numbers so the goal here is to be able to add those additional part numbers to the bottom of the worksheet and use a macro to correctly resort the appended list.
For the record, I am not married to my approach. After all, it didn’t solve my challenge!
Thank you,
Darrell
Place this procedure in a standard code module:
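Public Sub PartNumberSortFormat()
    Dim i&, j&, f, vIn, vOut
    ' Read the part numbers from B2 down to the last text entry in column B
    vIn = [b2:index(b:b,match("*",b:b,-1))]
    vOut = vIn
    For i = 1 To UBound(vIn)
        ' Remove spaces, then split the part number into its dash-separated segments
        f = Split(Replace(vIn(i, 1), " ", ""), "-")
        For j = 0 To UBound(f)
            If IsNumeric(f(j)) Then
                f(j) = Format$(f(j), "000000")
            ElseIf Len(f(j)) < 6 Then
                ' Left-pad short segments with zeroes; longer ones stay as they are
                f(j) = String$(6 - Len(f(j)), "0") & f(j)
            End If
        Next
        vOut(i, 1) = Join(f, "-")
    Next
    ' Insert a helper column A and write the sortable keys into it
    Columns(1).Insert xlToRight
    [a1] = "SORT COLUMN"
    [a2].Resize(UBound(vOut)) = vOut
    Columns(1).EntireColumn.AutoFit
End Sub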
After running the procedure, you will notice that it has inserted a new column A on your worksheet and your data has been scooted over to the right by one column.
This new column A will contain a copy of your part numbers, reformatted in such a fashion to allow normal sorting.
Now select all of the data INCLUDING this new column A and sort A-Z on column A.
After the sort, you may delete the new column A.
This works by left-padding each dash-separated segment of the part number with zeroes to a width of six characters.
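If you would rather keep the key as a worksheet formula than re-run a macro, the same padding idea can be sketched as a user-defined function. The name PartSortKey and the optional width parameter are inventions for this example, and unlike the procedure above it pads numeric and non-numeric segments the same way.

```vba
' Hypothetical helper: left-pad every dash-separated segment with zeroes.
Public Function PartSortKey(ByVal partNumber As String, Optional ByVal width As Long = 6) As String
    Dim parts() As String
    Dim j As Long

    parts = Split(Replace(partNumber, " ", ""), "-")
    For j = LBound(parts) To UBound(parts)
        ' Segments already at or over the width are left unchanged
        If Len(parts(j)) < width Then
            parts(j) = String$(width - Len(parts(j)), "0") & parts(j)
        End If
    Next j
    PartSortKey = Join(parts, "-")
End Function
```

For example, =PartSortKey(B2) turns 3-999 into 000003-000999 and 3-1610-1 into 000003-001610-000001, so a plain A-Z sort on the key column puts 3-999 first.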
My Thoughts:
Excel 2010 onwards lets you sort using as many columns as you like. (Not sure about 2007). Don't know which version you have!
You could use the SUBSTITUTE formula to remove all "-" from the part number and then sort on the number that remains, which gives you an order more like the one you want.
eg
Value =SUBSTITUTE(B2,"-","")
3-15 315
3-888 3888
3-999 3999
3-1610 31610
3-2610 32610
3-1610-1 316101
3-2610-3 326103
It's not exactly what you need though!
Combine this with other formulas (or a VBA function) to manipulate your part number into something more sortable.
You could use FIND to find the position of the first "-" and extract the numbers before it into one column.
Similarly, using FIND, MID and LEN you could extract the numbers between a part number's two "-" characters.
I suspect it will be best to write a VBA function to convert a part number into a "sortable value". This might involve splitting the part number into its component bits (i.e. each bit being the text between the "-").
(The VBA function Split might be useful for this; it creates an array.)
If you know the formats of ALL the part numbers that can be delivered, you can code accordingly.
I suspect your code will take numbers like these and convert them as shown:
AB123-456-78 AB12300456007800
AB12-45-7 AB12000450007000
AB12-45 AB12000450000000
ie padding with zeros each component of the part number
The key to sorting the TEXTUAL values into the order you want is understanding how textual values get sorted! Do some experiments. Then create zero (or "9") padded numbers that sort the way you require.
I hope this helps.
While not a technical answer to the Excel question, I am a logistician working with extremely large data sets of part numbers - always varying in format. The standard approach used in my field is to "ignore" (remove) special characters from the P/N and append the (clean) P/N to the 5-digit CAGE (manufacturer) code to create a "unique" CAGE + (clean) P/N code for sorting, lookup, etc. Create a column for that construct.
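In Excel terms, that construct could be built with a small function like the sketch below. The function name is made up, and it assumes "special characters" means anything other than the letters A-Z and the digits 0-9, with everything upper-cased so the key compares consistently.

```vba
' Hypothetical helper: CAGE code followed by the part number with
' everything except letters and digits removed.
Public Function CagePartKey(ByVal cage As String, ByVal partNumber As String) As String
    Dim i As Long
    Dim ch As String
    Dim cleaned As String

    For i = 1 To Len(partNumber)
        ch = Mid$(partNumber, i, 1)
        If ch Like "[A-Za-z0-9]" Then cleaned = cleaned & ch
    Next i
    CagePartKey = UCase$(Trim$(cage)) & UCase$(cleaned)
End Function
```

For instance, =CagePartKey("1ABC2","AR3021-A-7802") gives 1ABC2AR3021A7802 (the CAGE code here is a placeholder, not a real one).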

Find and Highlight Least Common Occurrence(s) in Variable Range

I have a code that has a variable range with many categories in each column that display data. I need to highlight the least commonly occurring values as a percentage of the total number of cells.
If there are 300 cells in the column, it needs to find the value (out of many possibly repeating values) that occurs least frequently. It is a bonus if the code can anticipate the total number, and give only 5% or 10% of the entire column as a result.
Currently my attempt is to use a function in the top cell that finds the least common occurrence, and the code simply highlights that value wherever it repeats down the column (highlighting every one of the least common ones).
The difficulty I am having is twofold.
There may be more than one least common value that is still below 10% of the total values
The ability to automate this search so that it may be performed and highlighted for all of more than 100 columns with different categories and different values in each column
If too vague, feel free to ask questions about what I am going for, and I will respond promptly.
This is what the data looks like. As you can see, there are merged titles for each column, with various blank spaces and sporadically placed data that matches some specific column.
This is the proposed code which is still not highlighting what I would like it to. It has two problems. 1: It will highlight ALL of the data in one range if there is no differing value in the row. 2: It will highlight the titles of the columns.
This is the highlighted data which is still insufficiently complete.
In some cases the highlighted column truly does not match the purpose of the code. For example, in one column the number 12 was highlighted down the column (67 occurrences) although other numbers occur less often (8 occurs 29 times and is not highlighted).
I just hacked together a seemingly working example. Try this here:
Sub frequenz()
    Dim col As Range, cel As Range
    Dim letter As String
    Dim lookFor As String
    Dim frequency As Long, totalRows As Long
    Dim relFrequency As Double
    Dim ran As Range
    ' Object variables must be assigned with Set
    Set ran = ActiveSheet.Range("A1:ZZ65535")
    totalRows = 65535
    For Each col In ran.Columns
        '***get column letter***
        letter = Split(ActiveSheet.Cells(1, col.Column).Address, "$")(1)
        '*******
        For Each cel In col.Cells
            lookFor = cel.Text
            ' Skip blank cells so empty rows are neither counted nor highlighted
            If Len(lookFor) > 0 Then
                frequency = Application.WorksheetFunction.CountIf(Range(letter & "2:" & letter & totalRows), lookFor)
                relFrequency = frequency / totalRows
                If relFrequency <= 0.001 Then
                    cel.Interior.Color = ColorConstants.vbYellow
                End If
            End If
        Next cel
    Next col
End Sub
It seemed to be doing just what you are looking for.
Edit: fixed the address getting.
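A variation that may be closer to the 10% requirement is sketched below. It rests on assumptions the question does not confirm: titles sit in row 1 only, the share of a value is measured against the non-blank cells of its own column, and the Scripting.Dictionary object is available (it is created late-bound with CreateObject, which normally means Windows). The Sub name and the two constants are made up for the example.

```vba
' Hypothetical sketch: highlight every value whose share of its column is at or below a threshold.
Sub HighlightRareValues()
    Const FIRST_DATA_ROW As Long = 2        ' assumption: row 1 holds the titles
    Const THRESHOLD As Double = 0.1         ' 10% of the non-blank cells in the column
    Dim ws As Worksheet
    Dim col As Range, cel As Range, data As Range
    Dim counts As Object
    Dim lastRow As Long, total As Long
    Dim txt As String

    Set ws = ActiveSheet
    For Each col In ws.UsedRange.Columns
        lastRow = ws.Cells(ws.Rows.Count, col.Column).End(xlUp).Row
        If lastRow >= FIRST_DATA_ROW Then
            Set data = ws.Range(ws.Cells(FIRST_DATA_ROW, col.Column), ws.Cells(lastRow, col.Column))
            Set counts = CreateObject("Scripting.Dictionary")
            total = 0
            ' First pass: count how often each non-blank value occurs
            For Each cel In data.Cells
                txt = cel.Text
                If Len(txt) > 0 Then
                    counts(txt) = counts(txt) + 1
                    total = total + 1
                End If
            Next cel
            ' Second pass: highlight the cells holding rare values
            For Each cel In data.Cells
                txt = cel.Text
                If Len(txt) > 0 Then
                    If counts(txt) / total <= THRESHOLD Then cel.Interior.Color = vbYellow
                End If
            Next cel
        End If
    Next col
End Sub
```

Because each column is counted once and only the used range is visited, this avoids calling CountIf for every cell of a 65,535-row range, and a column that holds a single repeated value is left unhighlighted.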