Is it possible to conduct 'Context Analysis' for precise entity extraction with OpenNLP? - text-mining

I'm wondering whether OpenNLP can be used to extract very specific context when using the name finder API.
For example, if I have two sentences:
Jane Smith, 26, was taken into custody for stealing biscuits at her local Sainsburies.
Jane Smith, 26, was awarded a medal of honour for bravery.
In this situation, I would like OpenNLP to detect not just the sentence structure (finding Jane Smith in both sentences), but also to conclude that the words 'custody' and 'stealing' give the first sentence a different context from the second. So if I annotate the first sentence as '[START:offender] Jane Smith [END]' and the second as '[START:hero] Jane Smith [END]', the trained model should at some point base its decision on the other words in the sentence.
I know OpenNLP uses feature extraction (from what I've read it looks at sentence structure; I could be wrong here), but I wonder whether there is some dictionary analysis as well. If I train on enough of these sentences, will I eventually get a good context split?
If there isn't, can you suggest a scalable way forward? I would like to stay with OpenNLP because of its license.

Related

How to build DBpedia locally and retrieve abstracts

I want to search for a certain word in DBpedia and get an abstract (or the full text of an article) about that word.
For example,
query: Tokyo
result: Tokyo (/ˈtoʊkioʊ/;[7] Japanese: Tokyo, Tōkyō, [toːkʲoː] (listen)), officially the Tokyo Metropolis (Tokyo-to, Tōkyō-to), is the capital and largest city of Japan.[8] Formerly known as Edo ...
(cited from https://en.wikipedia.org/wiki/Tokyo)
I plan to use the retrieved text in a program written in Python.
However, since I intend to send a large number of queries, I need to build DBpedia locally.
(I may be wrong as I am a beginner, but can this be accomplished by downloading a dump of DBpedia and doing searches in SQL, etc.?)
I would like to know the best way to achieve this.
It would be more helpful if your answer is specific.

Lucene fuzzy phrase search approach with scoring

My requirement is to generate a match score for a fuzzy phrase search.
Example
1) Input Data - Hello Sam, how are you doing? Thanks, Smith.
Indexed Document - Sam Smith (documents are always person/organization names and input data would be free-text data)
In the above case, both Sam and Smith are found in my input data, but contextually they are different persons. If my input data were "Hello Sam Smith", then I should get a relevant hit with a higher score (I also expect an OK score for "Hello Sam John Smith", and so on).
I am using Lucene for primary filtering; later I will post-process the matched documents against the input data and compute a match score (using Levenshtein), and this should also work for fuzzy matches.
Exact approach,
1) Index documents as Tri-Grams
2) Search input free text data with Tri-Gram indexes
3) Gather all matching documents (this will still contain a lot of noise)
4) Post-process every matched document: find the position of every matched tri-gram token in the input free-text data, and calculate the Levenshtein score between the token(s) at those positions and the entire document.
e.g - Hello Sam, how are you doing? Thanks, Smith.
Here my document match will be "Sam Smith". I want to look at each tri-gram of the indexed document and the position of its match in the input free-text data, like
1) token "sam" matched with 2nd position word "Sam" in input data
2) token "smi" matched with 8th position word "Smith" in input data
Later I will write logic to calculate the Levenshtein score of the tokens at positions 2 and 8 against the matched document (the score would be very low given the distance between positions 2 and 8), but if the tokens were at positions 2 and 3 (or 2 and 4) I would give a good score.
I would like feedback from experts on this approach, or better suggestions. Thanks.
I'm doing a similar sort of fuzzy phrase matching in Lucene using tokenized sequences. Token distances are computed with Levenshtein or JaroWinkler and then Smith-Waterman is used to find the best sequence alignment. If I were to adapt that approach to your case the trouble would be that alignment scoring doesn't have a way to (directly) favor token swaps (displaced token substitution). The only thing I could do would be to have a lower cost for insertions for tokens that appear in the source vs those that do not.
So I like the n-gram approach to get scoring that is less sensitive to non-local reordering. I suggest checking out BLEU, METEOR, and ROUGE which are standard n-gram metrics for sentence similarity with various approaches to dealing with order sensitivity. They can be used with either character-level n-grams as in your proposal or with token-level n-grams such as I'm doing.
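The position lookup described in the question can be sketched in a few lines of Python. This is only an illustration under assumptions: `match_positions` and `proximity` are hypothetical helper names, only the first matching word is recorded for each document word, and the span ratio in `proximity` is a simple stand-in for the Levenshtein-based scoring the question has in mind, not a replacement for it.

```python
import re

def words(text):
    """Lower-cased word tokens, punctuation dropped."""
    return re.findall(r"\w+", text.lower())

def trigrams(word):
    """Character trigrams of a word; words shorter than 3 are kept whole."""
    return {word[i:i + 3] for i in range(len(word) - 2)} or {word}

def match_positions(document, free_text):
    """For each word of the indexed document, return the 1-based position of
    the first input word that shares a trigram with it (None if there is none)."""
    text_words = words(free_text)
    positions = []
    for doc_word in words(document):
        grams = trigrams(doc_word)
        hit = next((i for i, w in enumerate(text_words, 1)
                    if grams & trigrams(w)), None)
        positions.append(hit)
    return positions

def proximity(positions):
    """Assumed scoring rule: 1.0 when the matched words are adjacent,
    falling towards 0 as they spread apart; 0.0 if any word is unmatched."""
    if not positions or None in positions:
        return 0.0
    span = max(positions) - min(positions) + 1
    return min(1.0, len(positions) / span)
```

With the example from the question, `match_positions("Sam Smith", "Hello Sam, how are you doing? Thanks, Smith.")` gives positions 2 and 8, which `proximity` scores well below the adjacent case "Hello Sam Smith".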

How did WordNet come into being

I wonder how the hierarchical relationships between words in WordNet were obtained.
Was that done manually or via computer techniques?
If it was based on computer techniques, what are they?
From the FAQ:
q.1.2 Where do you get the definitions for WordNet? (short answer) Our
lexicographers write them.
Where do you get the definitions for WordNet? (long answer) From the
foreword to WordNet: An Electronic Lexical Database, pp. xviii-xix:
People sometimes ask, "Where did you get your words?" We began in 1985
with the words in Kučera and Francis's Standard Corpus of Present-Day
Edited English (familiarly known as the Brown Corpus), principally
because they provided frequencies for the different parts of speech.
We were well launched into that list when Henry Kučera warned us that,
although he and Francis owned the Brown Corpus, the syntactic tagging
data had been sold to Houghton Mifflin. We therefore dropped our plan
to use their frequency counts (in 1988 Richard Beckwith developed a
polysemy index that we use instead). We also incorporated all the
adjectives pairs that Charles Osgood had used to develop the semantic
differential. And since synonyms were critically important to us, we
looked words up in various thesauruses: for example, Laurence Urdang's
little "Basic Book of Synonyms and Antonyms" (1978), Urdang's revision
of Rodale's "The Synonym Finder" (1978), and Robert Chapman's 4th
edition of "Roget's International Thesaurus" (1977) -- in such works,
one word quickly leads on to others. Late in 1986 we received a list
of words compiled by Fred Chang at the Naval Personnel Research and
Development Center, which we compared with our own list; we were
dismayed to find only 15% overlap.
So Chang's list became input. And in 1993 we obtained the list of
39,143 words that Ralph Grishman and his colleagues at New York
University included in their common lexicon, COMLEX; this time we were
dismayed that WordNet contained only 74% of the COMLEX words. But that
list, too, became input. In short, a variety of sources have
contributed; we were not well disciplined in building our vocabulary.
The fact is that the English lexicon is very large, and we were lucky
that our sponsors were patient with us as we slowly crawled up the
mountain.

Algorithm for almost similar values search

I have Persons table in SQL Server 2008.
My goal is to find Persons who have almost similar addresses.
The address is described with columns state, town, street, house, apartment, postcode and phone.
Because of regional differences in some states (not US) and the human factor (mistakes in addresses etc.), addresses are not filled in following a consistent pattern.
Most common mistakes in addresses
Case sensitivity
Someone wrote "apt.", another one "apartment" or "ap." (although addresses aren't written in English)
Spaces, dots, commas
Differences in writing street names, like "Dr. Jones str.", "Doctor Jones street", "D. Jon. st." or "Dr Jones st".
The main problem is that data isn't in the same pattern, so it's really difficult to find similar addresses.
Is there any algorithm for this kind of issue?
Thanks in advance.
UPDATE
As I mentioned, the address is split into separate columns. Should I build one string by concatenating the columns, or apply your steps to each column?
I assume I shouldn't concatenate the columns, but if I compare columns separately, how should I organize it? Should I find similarities for each column and then union them, intersect them, or something else?
Should I collect statistics or use some kind of learning algorithm?
Suggest approaching it thus:
Create word-level n-grams (a trigram/4-gram might do it) from the various entries
Do a many x many string comparison and cluster the entries by string distance. Someone suggested Levenshtein; for this kind of task, Jaro-Winkler distance and Smith-Waterman work better. A library such as SimMetrics would make life a lot easier
Once you have clusters of n-grams, you can resolve the whole string using the constituent subgrams i.e. D.Jones St => Davy Jones St. => DJones St.
Should not be too hard, this is an all-too-common problem.
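As a rough illustration of the first step, word-level n-grams can be produced with a short Python helper. This is a minimal sketch: `word_ngrams` is a hypothetical name, and the tokenisation (anything matching `\w+`) is an assumption that may need adjusting for the actual address data.

```python
import re

def word_ngrams(text, n=3):
    """Return the overlapping word-level n-grams of a string.

    Strings with fewer than n words are returned as a single n-gram.
    """
    words = re.findall(r"\w+", text)
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For example, `word_ngrams("Dr Jones street Anytown Somestate")` yields "Dr Jones street", "Jones street Anytown" and "street Anytown Somestate", which can then be compared with a string-distance measure.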
Update: Based on your update above, here are the suggested steps
Concatenate your columns into a single string, perhaps by creating a db "view". For example,
create view vwAddress
as
select top 10000
    state, town, street, house, apartment, postcode,
    -- join the parts with spaces so that word boundaries survive
    state + ' ' + town + ' ' + street + ' ' + house + ' ' + apartment + ' ' + postcode as Address
from ...
Write a separate application (say in Java or C#/VB.NET) that uses an algorithm like Jaro-Winkler to estimate the string distance between the combined addresses in a many x many comparison, and write the results into a separate table
address1 | address n | similarity
You can use Simmetrics to get the similarity thus:
JaroWinkler objJw = new JaroWinkler();
double sim = objJw.GetSimilarity(address1, addressN);
You could also trigram it, so that an address such as "1 Jones Street, Sometown, SomeCountry" becomes "1 Jones Street", "Jones Street Sometown", and so on, and compare the trigrams (or even 4-grams) for higher accuracy.
Finally, you can order by similarity to get a cluster of the most similar addresses and decide on an appropriate threshold. Not sure why you are stuck.
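For readers without SimMetrics at hand, the many x many comparison can be sketched in plain Python. The code below is a textbook Jaro-Winkler written from scratch, not the SimMetrics implementation; the prefix scale of 0.1 and the maximum prefix length of 4 are the commonly used defaults and are assumptions here, and `rank_pairs` is a hypothetical helper name.

```python
from itertools import combinations

def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)

def rank_pairs(addresses):
    """Compare every pair once and return (similarity, a, b), best first."""
    pairs = [(jaro_winkler(a, b), a, b) for a, b in combinations(addresses, 2)]
    return sorted(pairs, reverse=True)
```

The all-pairs loop is quadratic in the number of addresses, so for a large table it would normally be run on pre-filtered groups (for instance, addresses sharing a postcode) rather than on everything at once.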
I would try to do the following:
split up the address into multiple words, get rid of punctuation at the same time
check all the words for patterns that are typically written differently and replace them with a common name (e.g. replace apartment, ap., ... by apt, replace Doctor by Dr., ...)
put all the words back in one string alphabetically sorted
compare all the addresses using a fuzzy string comparison algorithm, e.g. Levenshtein
tweak the parameters of the Levenshtein algorithm (e.g. you want to allow more differences on longer strings)
finally do a manual check of the strings
Of course, the solution to keep your data 'in shape' is to have explicit fields for each of your characteristics in your database. Otherwise, you will end up doing this exercise every few months.
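The first five steps of the list above can be sketched roughly as follows. The `CANONICAL` table is a made-up sample of variants, and the 20% length-based tolerance in `is_similar` is an arbitrary assumption that would need tuning against real data.

```python
import re

# Hypothetical variant table; a real one would be built from the data itself.
CANONICAL = {
    "apartment": "apt", "ap": "apt", "apt": "apt",
    "doctor": "dr", "dr": "dr",
    "street": "st", "str": "st", "st": "st",
}

def normalise(address):
    """Split into words, drop punctuation, unify variants, sort alphabetically."""
    found = re.findall(r"\w+", address.lower())
    found = [CANONICAL.get(w, w) for w in found]
    return " ".join(sorted(found))

def levenshtein(a, b):
    """Classic edit distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def is_similar(a, b, tolerance=0.2):
    """Allow more edits on longer strings (tolerance is a fraction of length)."""
    na, nb = normalise(a), normalise(b)
    allowed = int(tolerance * max(len(na), len(nb)))
    return levenshtein(na, nb) <= allowed
```

With this table, "Dr. Jones str." and "Doctor Jones street" normalise to the same string, while "D. Jon. st." still differs by three edits and would be left for the manual check.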
The main problem I see here is to exactly define equality.
Even if someone writes Jon. and another Jone., you will never be able to say whether they are the same (Jon: Jonethan, Joneson, Jonedoe, whatever ;)
I work in a firm where we have to handle exactly this problem. I'm afraid I have to tell you that this kind of checking of address lists for navigation systems is done "by hand" most of the time. Abbreviations are sometimes context-dependent, and there are other things that make this difficult. Of course, string replacement and the like is done with Python, but working out the MEANING of such an abbreviation can only be done by script in a few cases ("St." can be "Saint" or "Street". How to decide? Impossible; this is human work).
Another big problem is, as you said: is "DJones" a street or a person? Or both? Which one is meant here? Is this DJones the same as Dr Jones or the same as Don Jones? It's impossible to decide!
You can do some work with lists, as presented in another answer here, but it will give you plenty of "false positives".
You have a postcode field!!!
So, why don't you just buy a postcode table for your country
and use that to clean up your street/town/region/province information?
I did a project like this in the last century. Basically, it was a consolidation of two customer files after a merger, and involved names and addresses from three different sources.
Firstly, as many posters have suggested, convert all the common words, abbreviations and spelling mistakes to a common form: "Apt.", "Apatment" etc. to "Apt".
Then look through the name and identify the first letter of the first name, plus the first surname. (Not that easy; consider "Dr. Med. Sir Henry de Baskerville Smythe".) But don't worry: where there are ambiguities, just take both! So if you're lucky you get HBASKERVILLE and HSMYTHE. Now get rid of all the vowels, as that's where most spelling variations occur, so now you have HBSKRVLL and HSMTH.
You would also get these strings from "H. Baskerville","Sir Henry Baskerville Smith" and unfortunately "Harold Smith" but we are talking fuzzy matching here!
Perform a similar exercise on the street, and apartment and postcode fields. But do not throw away the original data!
You now come to the interesting bit. First you compare each of the original strings and give, say, 50 points for each string that matches exactly. Then go through your "normalised" strings and give, say, 20 points for each one that matches exactly. Then go through all the strings and give, say, 5 points for each substring of four or more characters that they have in common. For each pair compared you will end up with some scores > 150, which you can consider a certain match, some scores less than 50, which you can consider not matched, and some in between, which have some probability of matching.
You need some more tweaking to improve this by adding various rules like "subtract 20 points for a surname of 'smith'". You really have to keep running and tweaking until you get happy with the resulting matches, but, once you look at the results you get a pretty good feel which score to consider a "match" and which are the false positives you need to get rid of.
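A rough Python sketch of the key building and point scoring described in this answer is shown below. Several details are assumptions: Y is treated as a vowel (which is what makes SMYTHE and SMITH collapse to the same key, as in the example), the "four character or more substring" rule is approximated with the matching blocks reported by `difflib.SequenceMatcher`, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

VOWELS = set("AEIOUY")  # assumption: Y counts as a vowel, so SMYTHE ~ SMITH

def consonant_key(first_name, surname):
    """First letter of the first name plus the surname without its vowels."""
    initial = first_name.strip()[0].upper()
    rest = "".join(ch for ch in surname.upper()
                   if ch.isalpha() and ch not in VOWELS)
    return initial + rest

def score_pair(originals_a, originals_b, keys_a, keys_b):
    """Points for one pair of records, using the weights from the answer:
    50 per identical original field, 20 per identical normalised key,
    5 per shared run of four or more characters."""
    score = 0
    for a, b in zip(originals_a, originals_b):
        if a == b:
            score += 50
    for a, b in zip(keys_a, keys_b):
        if a == b:
            score += 20
    for a, b in zip(originals_a, originals_b):
        blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
        score += 5 * sum(1 for block in blocks if block.size >= 4)
    return score
```

Extra rules such as the "subtract 20 points for a surname of 'smith'" adjustment would be layered on top of `score_pair` while tuning the thresholds.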
I think the amount of data could affect what approach works best for you.
I had a similar problem when indexing music from compilation albums with various artists. Sometimes the artist came first, sometimes the song name, with various separator styles.
What I did was count the number of occurrences of the same value in other entries to make an educated guess whether it was the song name or an artist.
Perhaps you can use Soundex or a similar algorithm to find entries that are similar.
EDIT: (maybe I should clarify that I assumed that artist names were more likely to be more frequently reoccurring than song names.)
One important thing that you mention in the comments is that you are going to do this interactively.
This allows you to parse the user input and, at the same time, validate guesses about any abbreviations and correct a lot of mistakes (the way, for example, phone number entry works in some contact management systems: the system makes a best effort to parse and correct the country code, area code and number, but ultimately the user is presented with the guess and has the chance to correct the input).
If you want to do it really well, then keeping databases/dictionaries of postcodes, towns, streets, abbreviations and their variations can improve data validation and pre-processing.
So, at least you would have a fully qualified address. If you can do this for all the input, you will have all the data categorized, and matching can then be strict on certain fields and less strict on others, with the matching score calculated according to the weights you assign.
After you have consistently pre-processed the input then n-grams should be able to find similar addresses.
Have you looked at SQL Server Integration Services for this? The Fuzzy Lookup component allows you to find 'Near matches': http://msdn.microsoft.com/en-us/library/ms137786.aspx
For new input, you could call the package from .NET code, passing the row to be checked as a set of parameters. You'd probably need to persist the token index for this to be fast enough for user interaction, though.
There's an example of address matching here: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
I'm assuming that response time is not critical and that the problem is finding an existing address in a database, not merging duplicates. I'm also assuming the database contains a large number of addresses (say 3 million), rather than a number that could be cleaned up economically by hand or by Amazon's Mechanical Turk.
Pre-computation - Identify address fragments with high information content.
Identify all the unique words used in each database field and count their occurrences.
Eliminate very common words and abbreviations. (Street, st., appt, apt, etc.)
When presented with an input address,
Identify the most unique word and search (Street LIKE '%Jones%') for existing addresses containing those words.
Use the pre-computed statistics to estimate how many addresses will be in the results set
If the estimated results set is too large, select the second-most unique word and combine it in the search (Street LIKE '%Jones%' AND Town LIKE '%Anytown%')
If the estimated results set is too small, select the second-most unique word and combine it in the search with OR (Street LIKE '%Aardvark%' OR Town LIKE '%Anytown%')
If the actual results set is too large/small, repeat the query adding further terms as before.
The idea is to find enough fragments with high information content in the address which can be searched for to give a reasonable number of alternatives, rather than to find the most optimal match. For more tolerance to misspelling, trigrams, tetra-grams or soundex codes could be used instead of words.
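The narrowing half of this procedure can be sketched in Python as follows. It is a simplified, in-memory illustration: addresses are treated as single strings rather than separate database fields, the stop-word set is taken from the examples in the answer, and the widening (OR) branch is left out. All function names are hypothetical.

```python
import re
from collections import Counter

# Very common words to ignore; taken from the examples in the answer.
STOPWORDS = {"street", "st", "appt", "apt"}

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def word_counts(addresses):
    """Pre-computation: in how many addresses does each word occur?"""
    counts = Counter()
    for address in addresses:
        counts.update(set(tokenize(address)))
    return counts

def rank_by_rarity(query, counts):
    """Words of the query that exist in the data, rarest first."""
    known = [w for w in set(tokenize(query))
             if w not in STOPWORDS and counts[w] > 0]
    return sorted(known, key=lambda w: (counts[w], w))

def candidates(query, addresses, counts, max_results=50):
    """AND in one more term at a time until the result set is small enough."""
    matches = list(addresses)
    for term in rank_by_rarity(query, counts):
        matches = [a for a in matches if term in tokenize(a)]
        if len(matches) <= max_results:
            break
    return matches
```

In a real database the list comprehension would be replaced by the `LIKE` queries shown above, and `counts` would be used to estimate the result size before running them.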
Obviously if you have lists of actual states / towns / streets then some data clean-up could take place both in the database and in the search address. (I'm very surprised the Armenian postal service does not make such a list available, but I know that some postal services charge excessive amounts for this information. )
As a practical matter, most systems I see in use try to look up people's accounts by their phone number if possible: obviously whether that is a practical solution depends upon the nature of the data and its accuracy.
(Also consider the lateral-thinking approach: could you find a mail-order mail-list broker company which will clean up your database for you? They might even be willing to pay you for use of the addresses.)
I've found a great article.
By adding some DLLs as SQL user-defined functions, we can use the string comparison algorithms from the SimMetrics library.
Check it out:
http://anastasiosyal.com/archive/2009/01/11/18.aspx
The possibilities for such variations are countless, and even if such an algorithm exists, it can never be foolproof. You can't have a spell checker for nouns, after all.
What you can do is provide a drop-down list of previously entered field values, so that users can select one if a particular name already exists.
It's better to have separate fields for each value, like apartments and so on.
You could throw all addresses at a web service like Google Maps (I don't know whether this one is suitable, though) and see whether they come up with identical GPS coordinates.
One method could be to apply the Levenshtein distance algorithm to the address fields. This will allow you to compare the strings for similarity.
Edit
After looking at the kinds of address differences you are dealing with, this may not be helpful after all.
Another idea is to use learning. For example you could learn, for each abbreviation and its place in the sentence, what the abbreviation means.
3 Jane Dr. -> Dr (in 3rd position (or last)) means Drive
Dr. Jones St -> Dr (in 1st position) means Doctor
You could, for example, use decision trees and have a user train the system. Probably a few examples of each use would be enough. You probably couldn't classify single-letter abbreviations like D.Jones, which could equally be David Jones or Dr. Jones. But after a first level of translation you could look up a street index of the town and see if you can expand the D. into a street name.
Again, you would run each address through the decision tree before storing it.
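To make the positional idea concrete, here is a very small Python sketch. It is not a decision tree: it uses a majority-vote lookup keyed on the abbreviation and a coarse position feature (first, middle, last), which is a deliberately simpler stand-in. The class name and the feature choice are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

def tokens(address):
    return re.findall(r"\w+", address.lower())

def position_feature(words, index):
    """Coarse position of a word within the address."""
    if index == 0:
        return "first"
    if index == len(words) - 1:
        return "last"
    return "middle"

class AbbreviationModel:
    """Learns, from user-labelled examples, what an abbreviation means
    depending on where it appears in the address."""

    def __init__(self):
        self.votes = defaultdict(Counter)

    def train(self, address, abbreviation, meaning):
        words = tokens(address)
        index = words.index(abbreviation.lower())
        key = (abbreviation.lower(), position_feature(words, index))
        self.votes[key][meaning] += 1

    def expand(self, address):
        words = tokens(address)
        expanded = []
        for i, word in enumerate(words):
            seen = self.votes.get((word, position_feature(words, i)))
            expanded.append(seen.most_common(1)[0][0] if seen else word)
        return " ".join(expanded)
```

Trained on the two examples above ("3 Jane Dr." labelled drive, "Dr. Jones St" labelled doctor), the model expands a leading "Dr" to doctor and a trailing "Dr" to drive; anything it has not seen is left unchanged.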
It feels like there should be some commercial products doing this out there.
A possibility is to have a dictionary table in the database that maps all the variants to the 'proper' version of the word:
*Value* | *Meaning*
Apt. | Apartment
Ap. | Apartment
St. | Street
Then you run each word through the dictionary before you compare.
Edit: this alone is too naive to be practical (see comment).

Finding exact match using Lucene search API

I'm working on a company search API using Lucene.
My Lucene company index has got 2 companies:
1.Abigail Adams National Bancorp, Inc.
2.National Bancorp
If the user types in National Bancorp, then only company #2 (i.e. National Bancorp) should be returned and not #1; that is, only exact matches should be returned.
How do I achieve this functionality?
Thanks for reading.
You can use KeywordAnalyzer to index and search on this field. KeywordAnalyzer will generate only one token for the entire string.
I googled a lot for the same problem without finding any help. After scratching my head for a while I found the solution: search for the string within double quotes. That will solve your problem.
National Bancorp will return both #1 and #2 but "National Bancorp" will return only #2.
This is something that may warrant the use of the shingle filter. This filter groups multiple words together. For example, Abigail Adams National Bancorp with a ShingleFilter of 3 tokens would produce (assuming a simple WhitespaceAnalyzer) [Abigail], [Abigail Adams], [Abigail Adams National], [Adams National Bancorp], [Adams National], [Adams], [National], [National Bancorp] and [Bancorp].
If a user then queries for National Bancorp, you will get an exact match on National Bancorp itself, and a lower-scored exact match on Abigail Adams National Bancorp (lower-scored because this one has many more tokens in the field, which lowers the field-length norm). I think it makes sense to return both documents on such a query.
You may want to apply the shingle filter at query time as well, depending on the use case.
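To see what the shingles look like without setting up Lucene, the token grouping can be imitated in a few lines of Python. This only mimics the output listed above for whitespace-separated tokens; it is not Lucene's ShingleFilter, and `shingles` is a hypothetical helper name.

```python
def shingles(text, max_size=3):
    """All runs of 1..max_size consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[start:start + size])
            for start in range(len(tokens))
            for size in range(1, max_size + 1)
            if start + size <= len(tokens)]
```

`shingles("Abigail Adams National Bancorp")` produces the nine shingles from the example, and "National Bancorp" appears among them just as it does for the shorter company name, which is why both documents match the query.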
You may want to reconsider your requirements, depending on whether or not I correctly understood your question. Please bear with me if I did misunderstand you.
Just a little food for thought:
If you only want exact matches returned, then why are you searching in the first place?
Are you sure that the user expects exact matches? I typically search assuming that the search engine will accommodate missing words.
Suppose the user searched for National Bank but National Bank was no longer in your index. Would you still want Abigail Adams National Bancorp, Inc to be excluded from the results simply because it was not an exact match?
In light of this, I would suggest you continue to present all possible matches (exact or not) to the user and let them decide for themselves which is most appropriate for them. I say this simply because you may not be thinking the same way as all of your users. Lucene will take care of making sure the closest matches rank highest in the results, helping them make quicker choices.
I have the same requirements of exact matching. I have used queryBuilder of org.hibernate.search.query.dsl and the query is:
query = queryBuilder.phrase().withSlop(0).onField(field)
.sentence(searchTerm).createQuery();
It's working for me.