Matching two columns in Excel with slight differences in spelling - VBA

I am working with huge Excel sheets from different sources about the same thing. The sources report and write down the information differently. So, for example, one would write the location as "Khurais" whereas the other would write it as "Khorais".
Since both of these files contain important information, I would like to combine them into one Excel sheet so that I can work with them more easily. If you have any suggestion or tool that you think would be helpful, please share it here.
P.S. The words in the Excel sheets are transliterations of Arabic words.

You could use the Levenshtein distance to determine whether two words are "close" to each other, and match based on that.
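For illustration, here is a minimal sketch of the classic Levenshtein algorithm as a VBA user-defined function (the function name is my own choice, and this is a starting point rather than a tuned implementation):

' Sketch: classic Levenshtein (edit) distance between two strings.
Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Long
    Dim i As Long, j As Long, cost As Long
    Dim d() As Long
    ReDim d(0 To Len(s), 0 To Len(t))
    For i = 0 To Len(s): d(i, 0) = i: Next i
    For j = 0 To Len(t): d(0, j) = j: Next j
    For i = 1 To Len(s)
        For j = 1 To Len(t)
            If Mid$(s, i, 1) = Mid$(t, j, 1) Then cost = 0 Else cost = 1
            d(i, j) = d(i - 1, j) + 1                                               ' deletion
            If d(i, j - 1) + 1 < d(i, j) Then d(i, j) = d(i, j - 1) + 1             ' insertion
            If d(i - 1, j - 1) + cost < d(i, j) Then d(i, j) = d(i - 1, j - 1) + cost ' substitution
        Next j
    Next i
    LevenshteinDistance = d(Len(s), Len(t))
End Function

With that in a standard module, =LevenshteinDistance("Khurais","Khorais") returns 1, so you could, for instance, treat a distance of 1 or 2 as a match.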

You could use FuzzyLookup, a macro that allows you to do approximate matching. It worked really well for me in the past and is actually really well documented.
You can find it here: https://www.mrexcel.com/forum/excel-questions/195635-fuzzy-matching-new-version-plus-explanation.html including examples on how to use it.
Hope that helps!
P.S. Obviously you can also use it strictly within VBA (not via worksheet functions).

The Double Metaphone algorithm springs to mind. It attempts to convert strings into phonetic representations. For example, "Folly" and "Pholee" should have the same phonetic code.
If you could generate these codes, you could then match your records based on them, instead of the strings.
Here's an article that explains the algorithm, along with sample VBA code:
https://bytes.com/topic/access/insights/965241-fuzzy-string-matching-double-metaphone-algorithm
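To give a rough idea of how the matching could look once you have such a function in a module (the DoubleMetaphone name, sheet names and columns below are placeholders of mine, not something the article prescribes):

' Sketch only: assumes a DoubleMetaphone() function (e.g. adapted from the linked
' article) that returns a phonetic code for a string. Pairs up rows in column A of
' two sheets whose codes agree and writes the match into column B of the first sheet.
Sub MatchByPhoneticCode()
    Dim ws1 As Worksheet, ws2 As Worksheet
    Dim r1 As Long, r2 As Long
    Set ws1 = Worksheets("Source1")
    Set ws2 = Worksheets("Source2")
    For r1 = 2 To ws1.Cells(ws1.Rows.Count, "A").End(xlUp).Row
        For r2 = 2 To ws2.Cells(ws2.Rows.Count, "A").End(xlUp).Row
            If DoubleMetaphone(ws1.Cells(r1, "A").Value) = DoubleMetaphone(ws2.Cells(r2, "A").Value) Then
                ws1.Cells(r1, "B").Value = ws2.Cells(r2, "A").Value ' record the matching spelling
            End If
        Next r2
    Next r1
End Sub

One caveat: Double Metaphone is tuned primarily to English pronunciation, so it's worth testing it on a sample of your transliterated names before relying on it.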
Hope that inspires you :)

Related

Clean unstructured place name to a structured format

I have around 300k rows of unstructured data, as shown in the screenshot below. I'm trying to use Google Refine (OpenRefine) to clean this up, but I'm unable to find a proper way to do it; I'm new to this tool. Any help would be greatly appreciated. Also, the tool is quite slow at processing 300k records: whenever I try something, it takes a lot of time to process and give an output.
Or please suggest any other open-source tools and techniques to do this.
As Owen said in the comments, your question is probably too broad and cannot receive an acceptable answer. We can just provide you with a general procedure to follow.
In OpenRefine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions, but for that it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US", nor the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure it's an organisation name).
On the basis of your fifteen lines, however, some clear patterns seem to emerge. For example, it looks like you'll have to remove the tokens (character sequences without spaces) at the end of the string that contain a "#". For that, the GREL formula in OpenRefine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than four digits and one "-" between them).
Feel free to check out the Open Refine documentation in case of doubt.

Is it possible to categorize or split noun, pronoun and adjective data in VBA Excel

Here's some sample data from my dataset. I want to extract the nouns and adjectives from a sentence; can anyone help me with this problem? If it is possible, is there any tutorial or link for studying this kind of problem, or can someone help me? I want to separate out all noun, pronoun and adjective words. I already tried an Excel formula to find a specific word, but that isn't workable, because I want to get all of the possible noun, pronoun and adjective words, not just a given one. Thank you!
It will require a reference list, but it can be done with a formula rather than VBA. For example, use the following array formula
=INDEX($A$1:$A$4,SMALL(IF(NOT(ISERR(SEARCH($A$2:$A$4,$C2,1))),ROW($A$2:$A$4)),COLUMN(D1)-3),1)
entered in D2 to F4 (confirmed with Ctrl+Shift+Enter, since it is an array formula).

Excel or Numbers: how to populate an adjacent column?

So I feel like this is a pretty simple question, but I cannot for the life of me find the answer, here or elsewhere.
I'm trying to auto-populate a column with custom text; I suppose it would be the adjacent cell in each row.
Thought vlookup was the solution, but I'm rusty.
Basically it's financial: if the Description contains, say, "Amazon" or "Subway", I'd like to populate the adjacent cell with a label such as "Amazon"/"Online Shopping" or "Subway"/"Fast food".
I'm using Numbers but assume that Excel advice would apply for such a seemingly simple task.
Make sense?
Also, hope I formatted the image correctly.
Ok thanks!
Just looking at the sample data, I can see a pattern that emerges from these transactions. My first thought would be to jump to VBA for Excel, but I don't believe that is available on Mac OS.
VLOOKUP will only work with Range_Lookup set to TRUE, which means it will try to find the closest match. This might lead to incorrect matches being returned, or to problems with the requirement that the table array being queried is sorted.
The only other thing that came to mind, which would work for a single query value such as "Amazon" or "Subway", would be to use a nested formula that checks whether that substring is found in the Description column for each cell. This would be something like:
=IF(FIND("Amazon",D1)>0,"Amazon","")
The problem with this is that it only checks for one value, and it has no error handling, so each checked string that does not contain the word "Amazon" will return a #VALUE! error in Excel.
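One way around the error issue (a suggestion of mine, not part of the original answer) is to wrap the check in ISNUMBER, so that a non-match returns an empty string instead of an error; using SEARCH instead of FIND also makes the check case-insensitive:
=IF(ISNUMBER(SEARCH("Amazon",D1)),"Amazon","")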

String Algorithm Comparison VB.Net

I would like to ask for some suggestions, because I've been working on this for a week.
It's basically a data clean-up program.
I have an Excel file which contains thousands of company names, and I have a database which contains the correct names of the companies.
What I want is to read the Excel file (which I have already done) and compare each company in the Excel file with the values I have in the database. For example:
Data in Excel
Hewlett-Packard, Costa Rica
Hewlett-Packard (HP)
Hewlett-Packard Singapore (Private) Limited
Data in Database
Hewlett-Packard
It should auto-detect that those three values in Excel are Hewlett-Packard, because the Excel data is free-form text. I want to correct everything that is entered in it and find the similar value in my database. For example, if Hewlett-Packard is spelled wrong, it should automatically tell me that it is Hewlett-Packard. Any ideas?
It's like autocomplete, but with thinking: autocomplete that decides the correct value.
I'm doing it in VB.Net, by the way. I've been researching fuzzy search algorithms, Levenshtein distance and so on, but I still don't get how I can use them.
See my blog post, Solving the right problem, which covers something similar. You're probably better off doing a simple match and outputting any failures to a text file, which you then edit manually. It's drudgery, but it'll get the job done. When you start talking about Levenshtein distance and fuzzy search, you're turning a simple, if dull, task into a research project.
If your database contains only "thousands" (rather than millions) of names, then one thing you can do is load all the names into a list and sort them. Then sort the names in the Excel file and go through the two lists with a standard merge-type algorithm. For example, you might have in your database:
Hasbro
Hewlett Packard
Home Depot
and in your Excel file:
Grainger
Halliburton
Hewlet Packard, Costa Rica
Hewlett Packard (HP)
Humana
Using the merge algorithm, you'd be comparing "Hewlet Packard, Costa Rica" against "Hewlett Packard", and you might even output that as the suggested replacement. That would probably constitute the majority of your errors.
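To make the merge pass concrete, here is a rough sketch, written in VBA to match the rest of this thread (it translates almost line for line to VB.Net); the routine and array names are mine, and both arrays are assumed to be sorted ascending:

' Walk two sorted lists with a standard merge: exact matches are skipped,
' anything in the Excel list without an exact match is reported together with
' the database name it was being compared against (a likely suggested replacement).
Sub ReportMismatches(dbNames() As String, excelNames() As String)
    Dim i As Long, j As Long
    i = LBound(dbNames)
    j = LBound(excelNames)
    Do While j <= UBound(excelNames)
        If i > UBound(dbNames) Then
            Debug.Print "No match: " & excelNames(j)
            j = j + 1
        ElseIf StrComp(excelNames(j), dbNames(i), vbTextCompare) = 0 Then
            j = j + 1                                   ' exact match, next Excel name
        ElseIf StrComp(excelNames(j), dbNames(i), vbTextCompare) < 0 Then
            Debug.Print "No match: " & excelNames(j) & " (compared against: " & dbNames(i) & ")"
            j = j + 1
        Else
            i = i + 1                                   ' advance through the database list
        End If
    Loop
End Sub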
In any case, I strongly recommend using the computer to identify the mismatches, and then manually resolve them. That's usually the fastest way to solve this type of problem.

Testing phrases to see if they match each other

I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly even after trying all basic optimization tricks (indexing, etc) and trying the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing substring matching is Aho-Corasick.
Although you will have to read the data out of the database to use it, it is tremendously fast compared to more naive methods.
See here for a related question on substring matching:
And here for an Aho-Corasick implementation in Java:
It would be great to get a little more context as to why you need to see which phrases are subsets of others. For example, it seems strange that the DB would be built this way in the first place: you're having to do the work now because the DB is not in an appropriate format, so it may make more sense to 'fix' the DB, or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see whether another phrase was similar (granted, not contained within), you would break down the other phrase in the same way and count the number of entries common between them.
It has the nice side effect of still matching if you were to use, for example, "see phrases to testing": the individual words would still match, but because the order is different the pairs wouldn't. So it takes phrases (consecutive words) into account at the same time; the number of matches wouldn't be as high, which makes it good for use as a 'score' in matching.
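As a rough sketch of that scoring idea (in VBA to match the rest of this thread; the function and variable names are mine), you could count how many single words and consecutive word pairs two phrases share:

' Break each phrase into single words and consecutive word pairs, then score
' two phrases by how many of those entries they have in common.
Function PhraseOverlapScore(ByVal a As String, ByVal b As String) As Long
    Dim dictA As Object, key As Variant, score As Long
    Set dictA = CreateObject("Scripting.Dictionary")
    For Each key In TokensAndPairs(a)
        dictA(key) = True
    Next key
    For Each key In TokensAndPairs(b)
        If dictA.Exists(key) Then score = score + 1
    Next key
    PhraseOverlapScore = score
End Function

Function TokensAndPairs(ByVal phrase As String) As Collection
    Dim words() As String, result As New Collection, i As Long
    words = Split(LCase$(Trim$(phrase)), " ")
    For i = LBound(words) To UBound(words)
        result.Add words(i)                                                 ' single word
        If i < UBound(words) Then result.Add words(i) & " " & words(i + 1)  ' word pair
    Next i
    Set TokensAndPairs = result
End Function

For "Testing phrases to see" against "see phrases to testing" this gives 5 (four shared single words plus the shared pair "phrases to"), whereas the identical phrase would score 7.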
As I say, that kind of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.
Once you have the 'cleaned column' from MaasSQL's earlier answer, you could, depending on exactly how "phrase match" works (I don't know), sort this column by the length of its strings.
Then make sure you run the comparison in a converging manner in a procedure rather than as a flat query: step through your table (with a cursor), eliminating candidates for comparison through WHERE clauses and by deleting candidates that have already been tested completely. You may need a temporary table to do this.
What do I mean by the WHERE clause above? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter one.
And as for deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you can remove them from the comparison table, as no later test will ever match them.
Of course, this requires a bit more programming than a single SQL statement, and it depends on exactly how "phrase match" works.
DTS or SSIS may be your friend here as well.