Entity Extraction/Recognition with free tools while feeding Lucene Index - lucene

I'm currently investigating options to extract person names, locations, tech terms and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is added as metadata and should increase the precision of the search.
E.g. when someone queries 'wicket', he should be able to decide whether he means the cricket sport or the Apache project. I tried to implement this on my own with minor success so far. Now I've found a lot of tools, but I'm not sure whether they are suited for this task, which of them integrate well with Lucene, and whether the precision of entity extraction is high enough.
Dbpedia Spotlight, the demo looks very promising
OpenNLP requires training. Which training data to use?
OpenNLP tools
Stanbol
NLTK
balie
UIMA
GATE -> example code
Apache Mahout
Stanford CRF-NER
maui-indexer
Mallet
Illinois Named Entity Tagger (not open source, but free)
wikipedianer data
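For concreteness, this is roughly what I have in mind for the indexing side. A minimal sketch assuming a recent Lucene API; the field names ("entity", "entity_type") are just placeholders:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class EntityMetadataIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        // Full article text, analyzed for normal full-text search.
        doc.add(new TextField("body", "Welcome to Apache Wicket ...", Field.Store.YES));
        // Entities produced by whatever NER/disambiguation step ran earlier,
        // stored as exact-match metadata fields so queries can filter on them.
        doc.add(new StringField("entity", "Apache_Wicket", Field.Store.YES));
        doc.add(new StringField("entity_type", "SoftwareProject", Field.Store.YES));

        writer.addDocument(doc);
        writer.close();
    }
}
```

A query for 'wicket' could then be combined with a filter on the entity field to pick the intended sense.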
My questions:
Does anyone have experience with some of the tools listed above and their precision/recall? And is training data required and, if so, available?
Are there articles or tutorials where I can get started with entity extraction (NER) for each of these tools?
How can they be integrated with Lucene?
Here are some questions related to that subject:
Does an algorithm exist to help detect the "primary topic" of an English sentence?
Named Entity Recognition Libraries for Java
Named entity recognition with Java

The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, organization, location).
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I provide more explanation and mention several tools for disambiguation, including:
Zemanta
Maui-indexer
Dbpedia Spotlight
Extractiv (my company)
These tools usually expose a language-independent API such as REST, and I don't know that they provide Lucene support directly, but I hope my answer has been helpful for the problem you are trying to solve.
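For example, DBpedia Spotlight can be called over its public REST endpoint. A minimal sketch in Java 11+; the endpoint URL, the confidence parameter and the example text are assumptions based on the public demo, so check the current Spotlight documentation before relying on them:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SpotlightClient {
    public static void main(String[] args) throws Exception {
        String text = "Welcome to Apache Wicket, a component oriented Java web framework.";
        // Endpoint and parameter names are assumptions based on the public Spotlight demo;
        // check the current DBpedia Spotlight documentation before relying on them.
        String url = "https://api.dbpedia-spotlight.org/en/annotate"
                + "?text=" + URLEncoder.encode(text, StandardCharsets.UTF_8)
                + "&confidence=0.5";

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON response lists "Resources" with DBpedia URIs such as
        // http://dbpedia.org/resource/Apache_Wicket, which can be stored as index metadata.
        System.out.println(response.body());
    }
}
```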

You can use OpenNLP to extract names of people, places and organisations without training your own models. You just use the pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/
For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
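A minimal sketch along the lines of that manual page, assuming you have downloaded en-ner-person.bin from the models link above (the tokens are hard-coded here; in practice you would run OpenNLP's tokenizer first):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class OpenNlpPersonFinder {
    public static void main(String[] args) throws Exception {
        // en-ner-person.bin is one of the pre-trained models from the link above.
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            // The name finder expects pre-tokenized text; in practice run OpenNLP's
            // tokenizer first instead of hard-coding tokens like this.
            String[] tokens = { "Pierre", "Vinken", "joined", "the", "board", "." };
            Span[] names = finder.find(tokens);

            for (Span span : names) {
                String name = String.join(" ",
                        Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
                System.out.println(span.getType() + ": " + name);
            }
            // Clear adaptive data between documents, as the OpenNLP manual recommends.
            finder.clearAdaptiveData();
        }
    }
}
```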

Rosoka is a commercial product that provides a computation of "Salience", which measures the importance of a term or entity to the document. Salience is based on linguistic usage rather than frequency. Using the salience values you can determine the primary topic of the document as a whole.
The output is your choice of XML or JSON, which makes it very easy to use with Lucene.
It is written in Java.
There is an Amazon Cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0. The cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features that the full Rosoka does.
Yes, both versions perform entity and term disambiguation based on linguistic usage.
Disambiguation, whether by a human or by software, requires enough contextual information to determine the difference. The context may be contained within the document, within a corpus constraint, or within the context of the user; the former is more specific, the latter carries greater potential ambiguity. For instance, typing the keyword "wicket" into a Google search could refer to cricket, the Apache software or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a product name, etc.

Lately I have been fiddling with Stanford CRF-NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml
The good thing is that you can train your own classifier. You should follow this link, which has guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a
Unfortunately, in my case, the named entities were not extracted well from my documents; most of the entities went undetected.
Just in case you find it useful.
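For reference, this is roughly how the pre-trained classifiers are loaded and applied (a sketch following the bundled NERDemo; the model file name depends on the release you downloaded):

```java
import java.util.List;

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;

public class StanfordNerExample {
    public static void main(String[] args) throws Exception {
        // Path to one of the pre-trained classifiers shipped with Stanford NER;
        // the exact file name depends on the release you downloaded.
        String model = "classifiers/english.all.3class.distsim.crf.ser.gz";
        AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(model);

        String text = "The wicket was guarded by Sachin Tendulkar at Lord's.";

        // Each triple is (label, beginOffset, endOffset) into the original string.
        List<Triple<String, Integer, Integer>> spans = classifier.classifyToCharacterOffsets(text);
        for (Triple<String, Integer, Integer> span : spans) {
            System.out.println(span.first() + ": " + text.substring(span.second(), span.third()));
        }
    }
}
```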

Related

Define a graph in an ontology

I have a knowledge base which includes multiple graphs. I want a way to formally define these graphs in my metadata layer, but I can't seem to find a standard way to do it. More specifically: if I want to say A is a class, I can use rdfs:Class. But what if I want a collection that contains the names of all of my graphs, and I want to say that these names are named graphs? All I could think of was to define a graph as an rdf:Bag of rdf:Statement, but I don't think that is a good approach. Is there any existing vocabulary that I can use for this?
It seems to me the DCAT initiative might be worth a look.
The Data Catalog Vocabulary (DCAT) is a W3C RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web.
I am learning about this subject matter as well. Apparently an Australian regulatory body (for georesources) has bundled their graphs (concerning geology, the science, and mining, the engineering discipline) with DCAT: CGIVocPrez. You might look there for an application example, but I cannot tell how "good" it actually is.
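For what it's worth, one possible way to describe named graphs in a metadata layer is to combine DCAT with the SPARQL 1.1 Service Description vocabulary, which has an sd:NamedGraph class. A minimal Apache Jena sketch with a made-up graph IRI, purely as an illustration of the modelling choice, not a recommendation:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

public class DescribeNamedGraph {
    public static void main(String[] args) {
        String DCAT = "http://www.w3.org/ns/dcat#";
        String SD   = "http://www.w3.org/ns/sparql-service-description#";

        Model meta = ModelFactory.createDefaultModel();
        meta.setNsPrefix("dcat", DCAT);
        meta.setNsPrefix("sd", SD);
        meta.setNsPrefix("dcterms", DCTerms.getURI());

        // Hypothetical graph IRI from the knowledge base.
        Resource graph = meta.createResource("http://example.org/graph/geology");

        // Describe the graph both as a dcat:Dataset (catalog view) and as an
        // sd:NamedGraph (SPARQL service description view).
        graph.addProperty(RDF.type, meta.createResource(DCAT + "Dataset"));
        graph.addProperty(RDF.type, meta.createResource(SD + "NamedGraph"));
        graph.addProperty(DCTerms.title, "Geology graph");

        meta.write(System.out, "TURTLE");
    }
}
```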

What exactly are the UMLS and SNOMED-CT vocabularies used by cTAKES?

Very new to cTAKES and looking through the docs, I'm curious about what exactly the UMLS and SNOMEDCT "vocabularies" are. The user installation docs don't really say, and simply applying for the UMLS license and reading the language around the UMLS Metathesaurus does not really divulge much more about the structure of the data being accessed. E.g. is it some online API service? Is it some files that come with the cTAKES download that can only be unlocked with a valid UMLS password checked against an online DB?
Info on what the UMLS Metathesaurus and SNOMEDCT are can be found here (https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html) and here (https://www.ncbi.nlm.nih.gov/books/NBK9676/, specifically https://www.ncbi.nlm.nih.gov/books/NBK9684/):
The Metathesaurus is a very large, multi-purpose, and multi-lingual [relational?] vocabulary database that contains information about biomedical and health related concepts, their various names, and the relationships among them. Designed for use by system developers...
...The Metathesaurus contains concepts, concept names, and other attributes from more than 100 terminologies, classifications, and thesauri, some in multiple editions.
While I'm not sure exactly how cTAKES implements its use of the UMLS Metathesaurus (anyone who knows, please enlighten us), I assume it accesses some API for a relational database, based on the UMLS credentials you need to add to the example scripts that come with the cTAKES download (see https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+User+Install+Guide#cTAKES4.0UserInstallGuide-(Recommended)AddUMLSaccessrights).
...You may select from two relational formats: the Rich Release Format (RRF), introduced in 2004, and the Original Release Format (ORF).
(I think) this is what is used to power the UIMA analysis engines that process text in cTAKES:
UIMA is an architecture in which basic building blocks called Analysis Engines (AEs) are composed in order to analyze a document [...] How Annotators represent and share their results is an important part of the UIMA architecture. To enable composition and reuse, UIMA defines a Common Analysis Structure (CAS) precisely for these purposes. The CAS is an object-based container that manages and stores typed objects having properties and values. (https://www.ibm.com/developerworks/data/downloads/uima/#How-does-it-work)
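To make the CAS/Analysis Engine pattern a bit more concrete, here is a minimal toy annotator. This is not actual cTAKES code; the hard-coded dictionary stands in for the real UMLS/SNOMEDCT lookup:

```java
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

/**
 * Minimal sketch of a UIMA Analysis Engine: it reads the document text from the
 * CAS and writes annotations back into it. cTAKES components follow this pattern,
 * but the dictionary lookup shown here is a stand-in, not the real cTAKES logic.
 */
public class ToyDictionaryAnnotator extends JCasAnnotator_ImplBase {

    private static final String[] TERMS = { "aspirin", "hypertension" };

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        String text = jcas.getDocumentText().toLowerCase();
        for (String term : TERMS) {
            int start = text.indexOf(term);
            while (start >= 0) {
                // Annotation is the base UIMA type; real engines use richer type systems.
                Annotation a = new Annotation(jcas, start, start + term.length());
                a.addToIndexes();
                start = text.indexOf(term, start + 1);
            }
        }
    }
}
```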

Knowing what RDFa vocabulary to use

How do we know which vocabulary/namespace to use to describe data with RDFa?
I have seen a lot of examples that use xmlns:dcterms="http://purl.org/dc/terms/" or xmlns:sioc="http://rdfs.org/sioc/ns#", and then there is this video that uses the FOAF vocabulary.
This is all pretty confusing, and I am not sure what these vocabularies mean or which is best for the data I am describing. Is there some trick I am missing?
There are many vocabularies. And you could create your own, too, of course (but you probably shouldn't before you've checked possible alternatives).
You’d have to look for vocabularies for your specific needs, for example
by browsing and searching on http://lov.okfn.org/dataset/lov/ (they collect and index open vocabularies),
on W3C’s RDFa Core Initial Context (it lists vocabularies that have pre-defined prefixes for use with RDFa), or
by browsing through http://prefix.cc/ (it’s a lookup for typically used namespaces, but you might get an overview by that).
After some time you get to know the big/broad ones: Schema.org, Dublin Core, FOAF, RSS, SKOS, SIOC, vCard, DOAP, Open Graph, Ontology for Media Resources, GoodRelations, DBpedia Ontology, ….
The simplest thing is to check if schema.org covers your needs. Schema.org is backed by Google and the other major search engines and generally pretty awesome.
If it doesn't suit your needs, then enter a few of the terms you need into a vocabulary search engine. My recommendation is LOV.
Another option is to just ask the community about the best vocabularies for the specific domain you need to represent. A good place is answers.semanticweb.com, which is like StackOverflow but with more RDF experts hanging out.
Things have changed quite a bit since that video was posted. First, like Richard said, you should check whether schema.org fits your needs. Personally, when I need to describe something that's not covered by schema.org, I check LOV as well. If, and only if, I can't find anything in LOV, I will consider creating a new type or property. A quick way to do this is to use http://open.vocab.org/
A newer version of RDFa has been published since that video was released: RDFa 1.1 and RDFa Lite. If you want to use schema.org only, I'd recommend checking http://www.w3.org/TR/rdfa-lite/
Vocabularies are usually domain specific, and the xmlns: syntax is deprecated. The RDFa initial context at http://www.w3.org/profile/rdfa-1.1 lists the vocabularies that come with pre-defined prefixes. Sometimes vocabularies overlap in the context of your data; just as a maths problem can be solved with algebraic or geometric techniques, mixing vocabularies is fine. Equivalent terms can be found using http://sameas.org/, and to cater for your consumers' preferences among vocabularies, skos:closeMatch and skos:exactMatch may be used, e.g. "gr:Brand skos:closeMatch owl:Thing", with whatever terms you please. The prefix attribute can be used for vocabularies beyond those covered by the initial context, e.g. prefix="fb: http://ogp.me/ns/fb# vocab2: path2 ...".
For cross-cutting concerns such as customizing presentation in search results, microdata following the schema.org guidelines should be beneficial; however, since that has nothing to do with specialization in any particular domain, prefixes are not available in that syntax. In my experience, RDFa vocabularies appeal more to a participative, domain-specific audience, while microdata targets publishers who just want basic search-result markup. For tasks that are too simple to merit a full-fledged vocabulary but still have semantic implications, try http://microformats.org/
Interchanging the vocabulary URIs among the three syntaxes (RDFa, microdata, microformats) is technically valid but of little practical use, given the lack of manpower to implement alternative support for the vocabularies at Web scale. How and why the schema.org vocabulary merited a separate microdata syntax of its own is discussed by Google employee Ian Hickson (a.k.a. Hixie, the editor of the WHATWG HTML5 draft) at http://logbot.glob.com.au/?c=freenode%23whatwg&s=28+Nov+2012&e=28+Nov+2012#c747855 or http://krijnhoetmer.nl/irc-logs/whatwg/20121128#l-1122 Personally, I think RDFa Lite within RDFa could have served the same role (much like Core Java within Java), making a separate microdata syntax unnecessary, but ours is an imperfect world.
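As an illustration of the skos:closeMatch mechanism mentioned above, a small Apache Jena sketch that maps a GoodRelations term to the roughly corresponding schema.org term (whether closeMatch or exactMatch is the right property is a modelling judgement, not something this snippet decides):

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class VocabularyMapping {
    public static void main(String[] args) {
        String SKOS = "http://www.w3.org/2004/02/skos/core#";

        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("skos", SKOS);
        m.setNsPrefix("schema", "http://schema.org/");
        m.setNsPrefix("gr", "http://purl.org/goodrelations/v1#");

        // Assert that the GoodRelations and schema.org brand terms are close matches.
        // This only illustrates the mechanism; how strictly the two definitions line
        // up is something you would have to check against both vocabularies.
        m.createResource("http://purl.org/goodrelations/v1#Brand")
         .addProperty(m.createProperty(SKOS, "closeMatch"),
                      m.createResource("http://schema.org/Brand"));

        m.write(System.out, "TURTLE");
    }
}
```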

How can I start building a WordNet for Turkish to use in sentiment analysis?

Although I have an EE background, I never got the chance to attend natural language processing classes.
I would like to build a sentiment analysis tool for Turkish. I think it is better to create a Turkish WordNet database than to translate the text to English and analyze the buggy translated text with the existing tools. (Is it?)
So what do you guys recommend I do? First of all, taking NLP classes from an open courseware site? I really don't know where to start. Could you help me and maybe provide a step-by-step guide? I know this is an academic project, but I am interested in building skills in this area as a hobby.
Thanks in advance.
Here is the process I have used before (making Japanese, Chinese, German and Arabic semantic networks):
Gather at least two English/Turkish dictionaries. They must be independent, not derived from each other. You can use Wikipedia to auto-generate one of your dictionaries. If you need to publish your network, then you may need open source dictionaries, or license fees, or a lawyer.
Use those dictionaries to translate English WordNet, producing a confidence rating for each synset (see the toy sketch after this list).
Keep those with strong confidence; manually approve or fix those with medium or low confidence.
Finish it off manually
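A toy illustration of the agreement-based confidence idea in step 2 (not the exact algorithm from the paper; the dictionaries here are tiny in-memory maps standing in for real dictionary files):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Translate the words of an English synset with two independent dictionaries
 * and score the candidate Turkish words by agreement: both dictionaries agree
 * -> high confidence; only one provides the translation -> low confidence.
 */
public class SynsetTranslationScorer {

    static Map<String, List<String>> dictA = new HashMap<>();
    static Map<String, List<String>> dictB = new HashMap<>();

    public static void main(String[] args) {
        dictA.put("dog", List.of("köpek"));
        dictB.put("dog", List.of("köpek", "it"));
        dictA.put("hound", List.of("tazı"));
        // dictB has no entry for "hound".

        String[] synset = { "dog", "hound" };   // e.g. a WordNet synset like dog.n.01
        for (String english : synset) {
            Set<String> a = new HashSet<>(dictA.getOrDefault(english, List.of()));
            Set<String> b = new HashSet<>(dictB.getOrDefault(english, List.of()));

            Set<String> candidates = new HashSet<>(a);
            candidates.addAll(b);
            for (String turkish : candidates) {
                String confidence = (a.contains(turkish) && b.contains(turkish)) ? "high" : "low";
                System.out.println(english + " -> " + turkish + " [" + confidence + "]");
            }
        }
    }
}
```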
I expanded on this in the "Automatic Translation Of WordNet" section of my 2008 paper: http://dcook.org/mlsn/about/papers/nlp2008.MLSN_A_Multilingual_Semantic_Network.pdf
(For your stated goal of a Turkish sentiment dictionary, there are other approaches, not involving a semantic network. E.g. "Semantic Analysis and Opinion Mining", by Bing Liu, is a good round-up of research. But a semantic network approach will, IMHO, always give better results in the long run, and has so many other uses.)

Context Specific Spelling Engine

I'm sure more than a few of you will have seen the Google Wave demonstration. I was wondering specifically about the spell-checking technology. How revolutionary is a spell checker that works by figuring out where a word appears contextually within a sentence to make its suggestions?
I haven't seen this technique before, but are there examples of it elsewhere?
And if so, are there code examples and literature on its workings?
My 2 cents: given that translate.google.com is a statistical machine translation engine, and given "The Unreasonable Effectiveness of Data" by A. Halevy, P. Norvig (Director of Research at Google) and F. Pereira, I would bet that this is a statistically driven spell checker.
How it could work: you collect a very large corpus of the language you want to spell check. You store this corpus as phrase tables in adapted data structures (suffix arrays, for example, if you have to count the n-gram subsets) that keep track of the count (and so an estimated probability) of each n-gram.
For example, if your corpus consists only of:
I had bean soup last dinner.
From this entry, you will generate the following bigrams (sets of 2 words):
I had, had bean, bean soup, soup last, last dinner
and the trigrams (sets of 3 words):
I had bean, had bean soup, bean soup last, soup last dinner
But they will be pruned by tests of statistical relevance; for example, we can assume that the trigram
I had bean
will disappear from the phrase table.
Now, spell checking only has to look in these big phrase tables and check the "probabilities". (You need a good infrastructure to store the phrase tables in an efficient data structure and in RAM. Google has it for translate.google.com, so why not for this? It's easier than statistical machine translation.)
E.g. you type
I had been soup
and in the phrase table there is a
had bean soup
trigram with a much higher probability than what you just typed! Indeed, you only need to change one word (this is a "not so distant" trigram) to get a trigram with a much higher probability. There should be an evaluation function dealing with the distance/probability trade-off. This distance could even be calculated in terms of characters: we are doing spell checking, not machine translation.
This is only my hypothetical opinion. ;)
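A toy sketch of the phrase-table idea, showing just the trigram counting and the one-word-substitution lookup; there is no smoothing and no proper candidate generation via edit distance, which a real system would need:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Count trigrams from a tiny "corpus", then check whether swapping one word of
 * the user's trigram yields a trigram that is more frequent than the original.
 */
public class TrigramSpellCheck {

    static final Map<String, Integer> trigramCounts = new HashMap<>();

    public static void main(String[] args) {
        // In reality the corpus would be billions of words, not one sentence.
        count("i had bean soup last dinner");

        // The user typed "I had been soup"; check the trigram "had been soup".
        // Candidate replacements would normally come from an edit-distance search.
        suggest(new String[] { "had", "been", "soup" }, new String[] { "bean" });
    }

    static void count(String corpus) {
        String[] w = corpus.split("\\s+");
        for (int i = 0; i + 2 < w.length; i++) {
            trigramCounts.merge(w[i] + " " + w[i + 1] + " " + w[i + 2], 1, Integer::sum);
        }
    }

    static void suggest(String[] trigram, String[] candidates) {
        int observed = trigramCounts.getOrDefault(String.join(" ", trigram), 0);
        for (int pos = 0; pos < trigram.length; pos++) {
            for (String candidate : candidates) {
                String[] variant = trigram.clone();
                variant[pos] = candidate;
                int count = trigramCounts.getOrDefault(String.join(" ", variant), 0);
                if (count > observed) {
                    System.out.println("Did you mean: \"" + String.join(" ", variant) + "\"?");
                }
            }
        }
    }
}
```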
You should also watch an official video by Casey Whitelaw of the Google Wave team that describes the techniques used: http://www.youtube.com/watch?v=Sx3Fpw0XCXk
You can learn all about topics like this by diving into natural language processing. You can even go as in-depth as making a statistical guess about which word will come next after a string of given words.
If you are interested in such a topic, I highly suggest using NLTK (the Natural Language Toolkit), written entirely in Python. It is a very expansive work, with many tools and pretty good documentation.
There are a lot of papers on this subject. Here are some good resources
This doesn't use context sensitivity, but it's a good base to build from
http://norvig.com/spell-correct.html
This is probably a good and easy to understand view of a more powerful spell checker
http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Cucerzan.pdf
From here you can dive deep into the particulars. I'd recommend using Google Scholar, looking up the references in the paper above, and searching for 'spelling correction'.