Clean unstructured place name to a structured format - openrefine

I have around 300k unstructured data as below screen.I'm trying to use Google refine or OpenRefine to make this correct. However, I'm unable to find a proper way to do this. I'm new to this tool. Anyone's help would be greatly appreciated.Also, this tool is quite slow to process 300k records. If I am trying out something its taking lots of time to process and give an output.
OR Please suggest any other opensource tools and techniques do this?

As Owen said in comments, your question is probably too broad and cannot receive acceptable answer. We can just provide you with a general procedure to follow.
In Open Refine, you'll need to create a column based on the messy column and apply transformations to delete unwanted characters. You'll have to use regular expressions. But for that, it's necessary to be able to identify patterns. It's not clear to me why the "ST" of "Nat.secu ST." is important, but not the "US" in "Massy Intertech US". Not even the "36" in "Plowk 36" (Google doesn't know this word, so I'm not sure is an organisation name).
On the basis of your fifteen lines, however, we seem to distinguish some clear patterns. For example, it looks like you'll have to remove the tokens (character suites without spaces) at the end of the string that contain a #. For that, the GREL formula in Open Refine could look like this:
value.trim().replace(/\b\w+#\w+\b$/,'')
Here is a screencast if it's not clear to you.
But sometimes a company name may contain a #, in which case you will need to create more complex rules. For example, remove the token only if the string contains more than two words.
if(value.split(' ').length() > 2, value.replace(/\b\w+#\w+\b$/, ''), value)
And so on for the other patterns that you'll find (for example, any number sequence at the end that contains more than 4 numbers and one - between them)
Feel free to check out the Open Refine documentation in case of doubt.

Related

Is using comma separated field good or not

I have a table named buildings
each building has zero - n images
I have two solutions
the first one (the classic solution) using two tables:
buildings(id, name, address)
building_images(id, building_id, image_url)
and the second solution using olny one table
buildings(id, name, address, image_urls_csv)
Given I won't need to search by image URL obviously,
I think the second solution (using image_urls_csv column) is easier to use, and no need to create another table just to keep the images, also I will avoid the hassle of multiple queries or joining.
the question is, if I don't really want to filter, search or group by the filed value, can I just make it CSV?
On the one hand, by simply having a column of image_urls_list avoids joins or multiple queries, yes. A single round-trip to the db is always a plus.
On the other hand, you then have a string of urls that you need to parse. What happens when a URL has a comma in it? Oh, I know, you quote it. But now you need a parser that is beyond a simple naive split on commas. And then, three months from now, someone will ask you which buildings share a given image, and you'll go through contortions to handle quotes, not-quotes, and entries that are at the beginning or end of the string (and thus don't have commas on either side). You'll start writing some SQL to handle all this and then say to heck with it all and push it up to your higher-level language to parse each entry and tell if a given image is in there, and find that this is slow, although you'll realise that you can at least look for %<url>% to limit it, ... and now you've spent more time trying to hack around your performance improvement of putting everything into a single entry than you saved by avoiding joins.
A year later, someone will give you a building with so many URLs that it overflows the text limit you put in for that field, breaking the whole thing. Or add some extra fields to each for extra metadata ("last updated", "expires", ...).
So, yes, you absolutely can put in a list of URLs here. And if this is postgres or any other db that has arrays as a first-class field type, that may be okay. But do yourself a favour, and keep them separate. It's a moderate amount of up-front pain, and the long-term gain is probably going to make you very happy you did.
Not
"Given I won't need to search by image URL obviously" is an assumption that you cannot make about a database. Even if you never do end up searching by url, you might add other attributes of building images, such as titles, alt tags, width, height, etc, so you would end up having to serialize all this data in that one column, and then you would not be able to index any of it. Plus, if you serialize it with one language, then you or whoever comes after you using a different language will either have to install some 3rd party library to deserialize your stuff or write their own deserialization function.
The only case that I can think of where you should keep serialized data in a database is when you inherit old software that you don't have time to fix yet.

matching two columns in excel with slight difference in the spelling

I am working on huge excel sheets from different sources about the same thing. The way the sources report it and write down information is different. So, for example, one would write the location as "Khurais" whereas the other would write it as "Khorais".
Since both of these files are contain important information, I would like to combine them in one excel sheet so that I can deal with them more easily. So if you have any suggestion or tool that you think would be beneficial, please share it here.
P.s. The words in the excel sheet are translations of Arabic words.
You could use Levenshtein distance to determine if two words are "close" to each other. Based on that you could match.
You could use FuzzyLookup, a macro that allows you to do appropriate matching. It worked really well for me in the past and is actually really well documented.
You can find it here: https://www.mrexcel.com/forum/excel-questions/195635-fuzzy-matching-new-version-plus-explanation.html including examples on how to use it.
Hope that helps!
PS obviously you can also use it stricly within VBA (not using worksheet functions)
The Double Metaphone algorithm springs to mind. It attempts to convert strings into phonetic representations. For example, "Folly" and "Pholee" should have the same phonetic code.
If you could generate these codes, you could then match your records based on them, instead of the strings.
Here's an article that explains, along with sample VBA code:
https://bytes.com/topic/access/insights/965241-fuzzy-string-matching-double-metaphone-algorithm
Hope that inspires you :)

SQL like '%term%' except without letters

I'm searching against a table of news articles. The 2 relevant columns are ArticleTitle and ArticleText. When I want to search an article for a particular term, i started out with
column LIKE '%term%'.
However that gave me a lot of articles with the term inside anchor links, for example <a href="example.com/*term*> which would potentially return an irrelevant article.
So then I switched to
column LIKE '% term %'.
The problem with this query is it didn't find articles who's title or text began/ended with the term. Also it didn't match against things like term- or term's, which I do want.
It seems like the query i want should be able to do something like this
'%[^a-z]term[^a-z]%
This should exclude terms within anchor links, but everything else. I think this query still excludes strings that begin/end with the term. Is there a better solution? Does SQL-Server's FULL TEXT INDEXING solve this problem?
Additionally, would it be a good idea to store ArticleTitle and ArticleText as HTML-free columns? Then i could use '%term%' without getting anchor links. These would be 2 extra columns though, because eventually i will need the original HTML for formatting purposes.
Thanks.
SQL Server's LIKE allows you to define Regex-like patterns like you described.
A better option is to use fulltext search:
WHERE CONTAINS(ArticleTitle, 'term')
exploits the index properly (the LIKE '%term%' query is slow), and provides other benefit in the search algorithm.
Additionally, you might benefit from storing a plaintext version of the article alongside the HTML version, and run your search queries on it.
SQL is not designed to interpret HTML strings. As such, you'd only be able to postpone the problem till a more difficult issue arrives (for example, a comment node that contains your search terms as part of a plain sentence).
You can still utilize FULL TEXT as a prefilter and then run an HTML analysis on the application layer to further filter your result set.

Testing phrases to see if they match each other

I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly even after trying all basic optimization tricks (indexing, etc) and trying the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing sub-string matching is AhoCorsick.
Although you will have to read the data out of the database to use it, it is tremendously fast, when compared to more naive methods.
See here for a related question on substring matching:
And here for an AhoCorsick implementation in Java:
It would be great to get a little more context as to why you need to see which phrases are subsets of others: for example, it seems strange that the DB would be built in such a way anyway: you're having to do the work now because the DB is not in an appropriate format, so it makes sense that you should 'fix' the DB or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see if another phrase was similar (granted, not contained within) you would break down the other phrase in the same way and count the number of phrases common between them.
It has the nice side effect of still matching if you were to use (for example) "see phases to testing": because the individual words would match.. but because the order is different the pairs wouldn't, so it's taking phrases (consecutive words) into account at the same time, the number of matches wouldn't be as high, good for use as a 'score' in matching.
As I say that -kind- of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.
When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on the way "phrase match" works exactly (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What do I mean with 'WHERE' statement previously? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter string.
And with deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you'll can remove them from the comparison table, as any next test you'll do will never get a match.
Of course, this requires a bit more programming than just one SQL statement. And is dependent on the way "phrase match" works exactly.
DTS or SSIS may be your friend here as well.

First Name Variations in a Database

I am trying to determine what the best way is to find variations of a first name in a database. For example, I search for Bill Smith. I would like it return "Bill Smith", obviously, but I would also like it to return "William Smith", or "Billy Smith", or even "Willy Smith". My initial thought was to build a first name hierarchy, but I do not know where I could obtain such data, if it even exists.
Since users can search the directory, I thought this would be a key feature. For example, people I went to school with called me Joe, but I always go by Joseph now. So, I was looking at doing a phonetic search on the last name, either with NYSIIS or Double Metaphone and then searching on the first name using this name heirarchy. Is there a better way to do this - maybe some sort of graded relevance using a full text search on the full name instead of a two part search on the first and last name? Part of me thinks that if I stored a name as a single value instead of multiple values, it might facilitate more search options at the expense of being able to address a user by the first name.
As far as platform, I am using SQL Server 2005 - however, I don't have a problem shifting some of the matching into the code; for example, pre-seeding the phonetic keys for a user, since they wouldn't change.
Any thoughts or guidance would be appreciated. Countless searches have pretty much turned up empty. Thanks!
Edit: It seems that there are two very distinct camps on the functionality and I am definitely sitting in the middle right now. I could see the argument of a full-text search - most likely done with a lack of data normalization, and a multi-part approach that uses different criteria for different parts of the name.
The problem ultimately comes down to user intent. The Bill / William example is a good one, because it shows the mutation of a first name based upon the formality of the usage. I think that building a name hierarchy is the more accurate (and extensible) solution, but is going to be far more complex. The fuzzy search approach is easier to implement at the expense of accuracy. Is this a fair comparison?
Resolution: Upon doing some tests, I have determined to go with an approach where the initial registration will take a full name and I will split it out into multiple fields (forename, surname, middle, suffix, etc.). Since I am sure that it won't be perfect, I will allow the user to edit the "parts", including adding a maiden or alternate name. As far as searching goes, with either solution I am going to need to maintain what variations exists, either in a database table, or as a thesaurus. Neither have an advantage over the other in this case. I think it is going to come down to performance, and I will have to actually run some benchmarks to determine which is best. Thank you, everyone, for your input!
In my opinion you should either do a feature right and make it complete, or you should leave it off to avoid building a half-assed intelligence into a computer program that still gets it wrong most of the time ("Looks like you're writing a letter", anyone?).
In case of human names, a computer will get it wrong most of the time, doing it right and complete is impossible, IMHO. Maybe you can hack something that does the most common English names. But actually, the intelligence to look for both "Bill" and "William" is built into almost any English speaking person - I would leave it to them to connect the dots.
The term you are looking for is Hypocorism:
http://en.wikipedia.org/wiki/Hypocorism
And Wikipedia lists many of them. You could bang out some Python or Perl to scrape that page and put it in a db.
I would go with a structure like this:
create table given_names (
id int primary key,
name text not null unique
);
create table hypocorisms (
id int references given_names(id),
name text not null,
primary key (id, name)
);
insert into given_names values (1, 'William');
insert into hypocorisms values (1, 'Bill');
insert into hypocorisms values (1, 'Billy');
Then you could write a function/sproc to normalize a name:
normalize_given_name('Bill'); --returns William
One issue you will face is that different names can have the same hypocorism (Albert -> Al, Alan -> Al)
I think your basic approach is solid. I don't think fulltext is going to help you. For seeding, behindthename.com seems to have large amount of the data you want.
Are you using SQl Server 2005 Express with Advanced Services as to me it sounds you would benefit from the Full Text indexing and more specifically Contains and Containstable which you can use with specific instructions here is a link for the uses of Containstable:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
and here is the download link for SQL Server 2005 With Advanced Services:
http://www.microsoft.com/downloads/details.aspx?familyid=4C6BA9FD-319A-4887-BC75-3B02B5E48A40&displaylang=en
Hope this helps,
Andrew
You can use the SQL Server Full Text Search and do an inflectional search.
Basically like:
SELECT ProductId, ProductName
FROM ProductModel
WHERE CONTAINS(CatalogDescription, ' FORMSOF(THESAURUS, metal) ')
Check out:
http://en.wikipedia.org/wiki/SQL_Server_Full_Text_Search#Inflectional_Searches
http://msdn.microsoft.com/en-us/library/ms345119.aspx
http://www.mssqltips.com/tip.asp?tip=1491
Not sure what your application is, but if your users know at the time of sign up that people from their past might be searching the database for them, you could offer them the chance in the user profile to define other names they might be known as (including last names, women change these all the time and makes finding them much harder!) and that they want people to be able to search on. Store these in a separate related table. Then search on that. Just make the structure such that you can define one name as the main name (the one you use for everything except the search.)
You'll find that you're dabbling in an area known as "Natural Language Processing" and you'll need to do several things, most of which can be found under the topic of stemming.
Simplistic stemming simply breaks the word apart, but more advanced algorithms associate words that mean the same thing - for instance Google might use stemming to convert "cat" and "kitten" to "feline" and search for all three, weighing the actual word provided by the user as slightly heavier so exact matches return before stemmed matches.
It's a known problem, and there are open source stemmers available.
-Adam
No, Full Text searches will not help to solve your problem.
I think you might want to take a look at some of the following links: (Funny, no one mentioned SoundEx till now)
SoundEx - MSDN
SoundEx - Google results
InformIT - Tolerant Search algorithms
Basically SoundEx allows you to evaluate the level of similarity in similar sounding words. The function is also available on SQL 2005.
As a side issue, instead of returning similar results, it might prove more intuitive to the user to use a AJAX based script to deliver similar sounding names before the user initiates his/her search. That way you can show the user "similar names" or "did you mean..." kind of data.
Here's an idea for automatically finding "name synonyms" like Bill/William. That problem has been studied in the broader context of synonyms in general: inducing them from statistics of which words commonly appear in the same contexts in a large text corpus like the Web. You could try combining that approach with a list of names like Moby Names; I don't know if it's been done before.
Here are some pointers.