How to do relation extraction for an entity-centric search engine? - lucene

I'm building an entity-centric search engine. Here is what I have done so far:
Identified all entities in a document (person, email id, etc.) using Stanford's Named Entity Recognizer.
Built an entity-based index using the Lucene indexer, e.g. "Barack Obama" under the field name "PERSON", alongside a regular keyword-based index.
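For reference, this is roughly what my indexing step looks like. It is a simplified sketch assuming the Lucene 5.x API; the field names ("PERSON", "body") and the index path are just placeholders.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class EntityIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("entity-index")),   // index directory (placeholder)
                new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();
        // one field per entity type, holding the recognized entity
        doc.add(new StringField("PERSON", "Barack Obama", Field.Store.YES));
        // plus the usual keyword-based index over the document text
        doc.add(new TextField("body", "Barack Obama was the 44th president of the United States.", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();
    }
}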
Now I need to establish relationships between those entities. For example, if the query is "Wife of Obama", I need to resolve it to Michelle Obama; that is, I want the two entities "Barack Obama" and "Michelle Obama" linked by the relation "spouse". I have read several papers on relation extraction, but in vain. I don't want to pull relations from an existing source like Freebase; I want to extract them myself using some algorithm or API.
Please suggest an idea or a way to build the relation table.
Thanks :)

Relationship extraction is a well-known problem in NLP and can be handled with kernel methods.
The problem can easily be transformed into a classification problem, and you can train a model for every relationship type.
What you have to do is first extract the entities from the page (e.g. a Wikipedia article)
and perform coreference resolution for every entity (pronoun replacement).
Then you have to extract features for the relationship type for which you want to train your model.
Let's try it with a toy example:
American Airlines, a unit of AMR, immediately matched the move,
spokesman Tim Wagner said.
Here you have two entities in the text, a person and an organization; the relation is person - spokesman - org.
For this relation you can extract these features:
1- Entity-based features:
entity1 type - ORG (can be found with a gazetteer or a named entity extraction tool)
entity1 head - airlines
entity2 type - PERSON (can be found with a gazetteer or a named entity extraction tool)
entity2 head - Wagner
concatenated types - ORGPER
2- Word-based features:
bag of words between the entities, word before entity1, word after entity2, bigram before entity1, bigram after entity1, bigram before entity2, bigram after entity2
3- Syntactic features:
POS tag of the word before entity1, POS tag of the word before entity2, POS tag of the word after entity1, POS tag of the word after entity2,
constituent path, base syntactic chunk path, typed dependency path (these you can extract using the Stanford parser and dependency parser)
You can also try some more features; a rough sketch of extracting a few of the above is below.
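A minimal sketch of the feature extraction step on the toy sentence, assuming the sentence is already tokenized and the entity spans are known (the class name, the naive "last token as head" rule, and the exact feature set are mine, not from any particular paper):

import java.util.*;

public class RelationFeatures {

    // Build entity- and word-based features for a candidate entity pair.
    // Spans are token indices, [start, end); entity1 is assumed to precede entity2.
    public static Map<String, String> extract(List<String> tokens,
                                              int e1Start, int e1End, String e1Type,
                                              int e2Start, int e2End, String e2Type) {
        Map<String, String> f = new LinkedHashMap<>();
        // entity-based features
        f.put("e1Type", e1Type);
        f.put("e2Type", e2Type);
        f.put("typeConcat", e1Type + e2Type);
        f.put("e1Head", tokens.get(e1End - 1));  // naive head = last token of the mention
        f.put("e2Head", tokens.get(e2End - 1));
        // word-based features
        f.put("wordBeforeE1", e1Start > 0 ? tokens.get(e1Start - 1) : "<S>");
        f.put("wordAfterE2", e2End < tokens.size() ? tokens.get(e2End) : "</S>");
        f.put("bowBetween", String.join("_", tokens.subList(e1End, e2Start)));
        return f;
    }

    public static void main(String[] args) {
        List<String> toks = Arrays.asList("American", "Airlines", ",", "a", "unit",
                "of", "AMR", ",", "immediately", "matched", "the", "move", ",",
                "spokesman", "Tim", "Wagner", "said", ".");
        // "American Airlines" = tokens [0,2), "Tim Wagner" = tokens [14,16)
        System.out.println(extract(toks, 0, 2, "ORG", 14, 16, "PER"));
    }
}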
Once you have extracted the features, use any (multiclass) classifier you like:
SVM (support vector machine)
MaxEnt (aka multiclass logistic regression)
Naive Bayes
Once you have trained your model, you can use it for relation extraction.
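To make the training/prediction loop concrete, here is a toy Naive Bayes over the string-valued feature maps from above. It is only a sketch of the idea; for real work you would use an existing implementation (LIBSVM, the Stanford Classifier, Weka, ...), and the class here is invented for illustration:

import java.util.*;

public class ToyNaiveBayes {
    private final Map<String, Integer> labelCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();
    private int total = 0;

    // One training example: a feature map plus its relation label, e.g. "spokesman".
    public void train(Map<String, String> features, String label) {
        labelCounts.merge(label, 1, Integer::sum);
        total++;
        Map<String, Integer> counts =
                featureCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (Map.Entry<String, String> e : features.entrySet())
            counts.merge(e.getKey() + "=" + e.getValue(), 1, Integer::sum);
    }

    // Pick the label maximizing log prior + smoothed log likelihoods.
    public String predict(Map<String, String> features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : labelCounts.keySet()) {
            int n = labelCounts.get(label);
            double score = Math.log((double) n / total);
            Map<String, Integer> counts = featureCounts.get(label);
            for (Map.Entry<String, String> e : features.entrySet()) {
                int c = counts.getOrDefault(e.getKey() + "=" + e.getValue(), 0);
                score += Math.log((c + 1.0) / (n + 2.0));  // add-one smoothing
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
}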

You can use Rosoka, available through Amazon AWS; it provides entity and relationship extraction.

Related

Model an overview object in an object-oriented way

I have an API that exposes, let's say, cities:
/cities
/city/{id}
The cities endpoint returns an overview of the city (id, city name, city area), whereas the city endpoint returns the same plus some more (population, image, thumbnail...). Now, when modelling this in the client I see different alternatives:
Have a CityOverview class which has a City subclass that adds the extra attributes.
Have a City class that has all the attributes with a CityOverview subclass that hides all the extra attributes (say, in Java, by throwing an UnsupportedOperationException on all the getters for attributes it doesn't have).
Have the above classes with no inheritance relationship.
Have a single City class that allows all the extra attributes to be nulled.
What are the pros and cons of the above approaches, and/or any other you can think of?
I would go with option 3: have two classes with no inheritance relationship. Here are the reasons for this decision:
You will need to serialize/deserialize JSON using an API such as Jackson, and for that you will need a field-mapped POJO. Since you have two separate classes, you can map a different POJO to each of the two API calls. It makes the code cleaner.
Inheritance is not an option, for two reasons:
1) At any one time an API call brings data for either the parent or the child, i.e. some of the fields will always be empty depending on which call you make. This is unnecessary waste.
2) From an OOP design point of view, the data object returned by the cities call is not a parent of the data returned by the city/{id} call, so the class design should not say otherwise. Together they make up the "City" entity.
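A quick sketch of what that looks like with Jackson (field names are guesses based on the attributes listed in the question):

import com.fasterxml.jackson.databind.ObjectMapper;

class CityOverview {          // maps the /cities response
    public long id;
    public String name;
    public String area;
}

class City {                  // maps the /city/{id} response
    public long id;
    public String name;
    public String area;
    public long population;
    public String imageUrl;
    public String thumbnailUrl;
}

class Demo {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // each endpoint deserializes into its own, unrelated class
        CityOverview[] overviews = mapper.readValue(
                "[{\"id\":1,\"name\":\"Oslo\",\"area\":\"454 km2\"}]", CityOverview[].class);
        System.out.println(overviews[0].name);
    }
}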
I think that if there is not a big difference between OverviewCity and City, you should keep only a single City class in the back end.
Your /cities API can return the full list of cities.
A list of cities with a few details on each (the OverviewCity view) can easily be derived from a City object on the client side.
In the back end I don't think there is any need to support two classes.

Linking related topics IR

How can I link terms (keywords, entities) that have some relation among them across text documents? An example is Google: when you search for a person, it shows recommendations of other people related to that person.
In this picture it figured out spouse, presidential candidate, and people of equal designation.
I am using a frequency count technique: the more often two terms occur in the same document, the higher the chance that they have some relation. But this also links unrelated terms like page marks, verbs, and page references in a text document.
How should I improve it, and is there any other easy but reliable technique?
You should look at a few techniques:
1.) Stop word filtering: it is common in text mining to filter out words that are typically not very important because they are too frequent, like "the", "a", "is" and so on. There are predefined dictionaries for this.
2.) TF/IDF: TF/IDF re-weights words by how well they separate documents (see the sketch after this list).
3.) Named Entity Recognition: for the task at hand it might be sufficient to just focus on the names; named entity recognition can extract names from documents.
4.) Latent Dirichlet Allocation: LDA finds concepts in documents. A concept is a set of words which frequently appear together.
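A minimal sketch of the TF/IDF weighting from point 2, assuming the documents are already tokenized and stop-word filtered (the class and method names are made up):

import java.util.*;

public class TfIdf {

    // idf(t) = log(N / df(t)), where df(t) = number of documents containing t
    public static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                df.merge(term, 1, Integer::sum);
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet())
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        return idf;
    }

    // tf-idf for one document: raw term count times idf
    public static Map<String, Double> tfIdf(List<String> doc, Map<String, Double> idf) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : doc)
            weights.merge(term, idf.getOrDefault(term, 0.0), Double::sum);
        return weights;
    }
}

Weighting your co-occurrence counts by these scores, instead of using raw frequencies, pushes down terms like page references that appear in almost every document.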

Vocabulary term to model number of individuals of a class in RDF

I want to model a dataset in RDF with a class name as subject and the number of individuals present in the class as object, and I am wondering which predicate would be appropriate for this information.
I searched different vocabularies like RDFS, SKOS, etc., and also http://lov.okfn.org/dataset/lov/, but couldn't find an apt one.
Any suggestions regarding which vocabulary term would be good for modelling this information?

What is the use of Lucene index files in DBpedia Spotlight?

I am trying to find named entities in a given text, and for that I have tried the DBpedia Spotlight service.
I am able to get a response out of it. However, the DBpedia dataset is limited, so I tried replacing their spotter.dict file with my own dictionary. My dictionary contains one entity per line:
Sachin Tendulkar###PERSON
Barack Obama###PERSON
.... etc
Then I parse this file and build an ExactDictionaryChunker object.
Now I am able to get the entities and their types (after modifying the DBpedia code).
My question is: DBpedia Spotlight uses Lucene index files, and I really don't understand what purpose these files serve.
Can we do it without the index files? What is the significance of the index files?
Lucene was used in the earlier implementation of DBpedia Spotlight to store a model of each entity in our KB. This model is used to give us a relatedness measure between the context (extracted from your input text) and the entity. More concretely, each entity is represented by a vector {t1: score1, t2: score2, ... }. At runtime we model your input text as a vector in the same dimensions and measure the cosine between input vector and entity vectors. In your case, you would have to add a vector for Sachin Tendulkar to the space (add a document to the Lucene index) in case it is not already there. The latest implementation, though, has moved away from Lucene to an in-house in-memory context store. https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
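The cosine measure described above is straightforward; a minimal sketch over sparse term vectors (plain Java, names invented for illustration):

import java.util.Map;

public class Cosine {

    // cosine(a, b) = a·b / (|a| * |b|) over sparse {term: weight} vectors
    public static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // only shared terms contribute
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

Spotlight computes this between the context vector built from your input text and each candidate entity's stored vector, whether those vectors live in a Lucene index or in the in-memory context store.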

Designing a solution to retrieve and classify content based on given attributes

This is a design problem I am facing. Let's say I have a cars website. Cars have the following attributes, each with different possible values:
Color: red, green, blue
Size: small, big
Based on those attributes I want to classify cars into cars for young people, cars for middle-aged people, and cars for elderly people, with the following criteria:
Cars_young: red or green
Cars_middle_age: blue and big
Cars_elder: blue and small
I'll call these criteria the target.
I have a table cars with columns: id, color and size.
I need to be able to:
a) when retrieving a car by id, tell its target (whether it is for young, middle-aged, or elderly people)
b) query the database to know how many views the cars belonging to each target had
Also, as a developer, I must implement it in a way that those criteria are easily changed.
What is the best way to implement this? Is there a design pattern for it? I can explain a few possible solutions I thought about but don't really like:
1) Create a new column in the database table called target, which makes both a) and b) easy.
Drawbacks: each time the criteria change I have to update the target column for all cars, and I also have to change the insertNewCar() function.
2) Implement it in the Cars class.
Drawback: each time the criteria change I have to change the query in b) as well as the code of getCarById in a).
3) Use TRIGGERS in SQL, but I would like to avoid this solution if possible.
I would like to have this criteria definition somewhere in the code where it can be changed easily, and hopefully also used by the Cars class. I'm thinking of some singleton or global object for the target that can be injected into some Cars methods.
Can anyone explain a nice solution, point to documentation or a post that addresses this problem, or name a design pattern that solves it?
At first sight, the specification pattern might meet your expectations. Wikipedia gives a nice explanation of how it works; a small teaser below:
OverDueSpecification OverDue = new OverDueSpecification();
NoticeSentSpecification NoticeSent = new NoticeSentSpecification();
InCollectionSpecification InCollection = new InCollectionSpecification();
ISpecification SendToCollection = OverDue.And(NoticeSent).And(InCollection.Not());

var InvoiceCollection = Service.GetInvoices();
foreach (Invoice currentInvoice in InvoiceCollection) {
    if (SendToCollection.IsSatisfiedBy(currentInvoice)) {
        currentInvoice.SendToCollection();
    }
}
You can consider combining the specification pattern with observers.
There are also a few other ideas:
extending the specification pattern to SQL generation, WHERE clauses in particular
storing the criteria configuration in the database
criteria versioning: storing the version of the rules used to assign the category, combined with the category itself
A sketch of the specification idea applied to your cars example follows.
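For the concrete criteria in your question, plain Java predicates are already enough to express composable specifications; a hedged sketch (the class and constant names are mine, not a standard API):

import java.util.function.Predicate;

class Car {
    final String color;  // "red", "green", "blue"
    final String size;   // "small", "big"
    Car(String color, String size) { this.color = color; this.size = size; }
}

class Targets {
    // the three criteria from the question, kept in one easily-changed place
    static final Predicate<Car> YOUNG =
            c -> c.color.equals("red") || c.color.equals("green");
    static final Predicate<Car> MIDDLE_AGE =
            c -> c.color.equals("blue") && c.size.equals("big");
    static final Predicate<Car> ELDER =
            c -> c.color.equals("blue") && c.size.equals("small");

    static String targetOf(Car c) {
        if (YOUNG.test(c)) return "young";
        if (MIDDLE_AGE.test(c)) return "middle_age";
        if (ELDER.test(c)) return "elder";
        throw new IllegalArgumentException("no target matches " + c.color + "/" + c.size);
    }

    public static void main(String[] args) {
        System.out.println(targetOf(new Car("blue", "small")));  // prints "elder"
    }
}

Because getCarById can call targetOf on the fly, requirement a) needs no extra column; for b) you would still have to translate the predicates into WHERE clauses or aggregate in application code, which is where the SQL-generation idea above comes in.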