Vocabulary term to model number of individuals of a class in RDF - semantics

I want to model a dataset in RDF with the class name as subject and the number of individuals in that class as object. I am trying to decide which predicate would be good for modeling this information.
I searched different vocabularies such as RDFS and SKOS, and also in http://lov.okfn.org/dataset/lov/, but couldn't find an apt one.
Any suggestions as to which vocabulary term would be a good fit?
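For what it's worth, one existing pattern that seems to fit is the VoID vocabulary's class partitions (void:classPartition pointing to a partition described with void:class and void:entities). A minimal rdflib sketch, with made-up dataset and class URIs:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

VOID = Namespace("http://rdfs.org/ns/void#")
EX = Namespace("http://example.org/")  # placeholder namespace

g = Graph()
g.bind("void", VOID)

# "This dataset contains 42 individuals of the class ex:Person."
dataset = EX["myDataset"]
partition = EX["myDataset/personPartition"]
g.add((dataset, RDF.type, VOID.Dataset))
g.add((dataset, VOID.classPartition, partition))
g.add((partition, VOID["class"], EX.Person))
g.add((partition, VOID.entities, Literal(42)))

print(g.serialize(format="turtle"))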

Related

How to assign a suitable science category to keywords, using DBPedia and SPARQL?

I have some keywords like emotion perception ability, students' motivation, self-efficacy. The goal is to map these keywords to a corresponding category (or categories) of psychology. In this case I know a priori that the answer is Educational psychology; however, I want to get the same answer using DBpedia ontologies.
Using the following query I am able to extract different branches of psychology and corresponding abstracts:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?subject ?abstract
WHERE {
  ?concept rdfs:label "Branches of psychology"@en .
  ?concept ^dct:subject ?subject .
  ?subject dbo:abstract ?abstract .
}
LIMIT 100
Now I want to add some OPTIONAL clause that would compare my keywords (using OR) with terms from the abstracts (dbo:abstract). Is it possible to do this using SPARQL? Or should I use SPARQL just to obtain the abstracts and then do all further text processing with e.g. Java or Python?
Also, ideas for other approaches that might help reach the goal are highly appreciated.
You could retrieve the data as text using SPARQL, but deciding whether a text matches a query should be done with text analysis or text mining techniques.
This is a whole field of its own, but fortunately lots of libraries for many languages (including Java and Python) exist that implement the related algorithms. Here is a list of software on Wikipedia. NLTK is well known for this job and is available as a Python library.
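To illustrate the retrieve-then-analyze pipeline, a minimal sketch assuming the SPARQLWrapper Python library and the public DBpedia endpoint (the keyword list is taken from the question; the OR comparison happens client-side):

from SPARQLWrapper import SPARQLWrapper, JSON

KEYWORDS = ["emotion perception", "motivation", "self-efficacy"]

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?subject ?abstract
    WHERE {
      ?concept rdfs:label "Branches of psychology"@en .
      ?concept ^dct:subject ?subject .
      ?subject dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }
    LIMIT 100
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Keep only branches whose abstract mentions at least one keyword (OR).
for row in results["results"]["bindings"]:
    abstract = row["abstract"]["value"].lower()
    if any(kw in abstract for kw in KEYWORDS):
        print(row["subject"]["value"])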
In your case, I can think of several approaches, but I'm far from an expert, so my ideas may be wrong:
Create a corpus of abstracts for each desired category (educational psychology, …), and, for a given abstract A, compare A to each abstract of each category C. The result of the comparison gives, for each category, a score/likelihood that A belongs to C (cf. fuzzy sets).
The comparison could be implemented with a vector space model, which works on vocabulary similarity (see the sketch after these points).
Named entity recognition could help detect names of authors, techniques, or tools related to a particular category.
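A minimal sketch of that vector space model idea, assuming scikit-learn (the toy corpora and the abstract are made up; cosine similarity over one shared TF-IDF space serves as the score):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical category corpora: category name -> concatenated abstracts.
categories = {
    "Educational psychology": "students motivation learning self-efficacy classroom instruction",
    "Clinical psychology": "therapy disorders diagnosis treatment patients intervention",
}
abstract_a = "We study students' motivation and self-efficacy in the classroom."

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(categories.values()) + [abstract_a])

# Last row is abstract A; score it against each category vector.
scores = cosine_similarity(matrix[-1], matrix[:-1])[0]
for name, score in zip(categories, scores):
    print(f"{name}: {score:.3f}")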
The main idea is the following: once you have defined what the distinguishing trait of each category is, by using its vocabulary, authors, references, or whatever, you can compute, for any abstract, a membership score for each category.
So the real question to ask is: which scoring function should I use?
The answer depends heavily on the data and the results you want. When you say that an abstract is about educational psychology, you have to know why, and then implement that as a scoring function.
As a side note, I'll add that neural networks, trained on the corpus, could perhaps bypass explicit scoring through automatic learning. I don't know enough about that field to say more.

Linking related topics IR

How can I link terms (keyword entities) that have some relation among them through text documents? An example is Google: when you search for a person, it shows recommendations of other people related to that person.
In this picture it figured out spouse, presidential candidate, and equal designation.
I am using a frequency count technique: the more often two terms occur in the same document, the higher the chance that they have some relation. But this also links unrelated terms like page marks, verbs, and page references in a text document.
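For reference, a minimal sketch of that document-level co-occurrence counting (toy documents; in practice the input would be the terms extracted from each of your documents):

from itertools import combinations
from collections import Counter

# Toy input: the set of terms found in each document.
docs = [
    {"Barack Obama", "Michelle Obama", "president"},
    {"Barack Obama", "Michelle Obama", "White House"},
    {"Barack Obama", "president", "election"},
]

cooccurrence = Counter()
for terms in docs:
    for a, b in combinations(sorted(terms), 2):
        cooccurrence[(a, b)] += 1

# The most frequent pairs are the candidate related terms.
for pair, count in cooccurrence.most_common(5):
    print(pair, count)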
How should I improve it, and is there any other easy but reliable technique?
You should look at a few techniques:
1.) Stop word filtering: it is common in text mining to filter out words which are typically not very important because they are too frequent, like "the", "a", "is", and so on. There are predefined dictionaries for this.
2.) TF-IDF: TF-IDF re-weights words by how well they separate documents (see the sketch after this list).
3.) Named entity recognition: for your task at hand it might be sufficient to just focus on names; named entity recognition can extract names from documents.
4.) Latent Dirichlet Allocation: LDA finds concepts in documents. A concept is a set of words which frequently appear together.
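A minimal sketch of points 1.) and 2.) with scikit-learn (assuming its built-in English stop word list counts as a predefined dictionary):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Barack Obama is the president and Michelle Obama is his spouse.",
    "The election was won by Barack Obama.",
]

# stop_words="english" drops "the", "a", "is", ...; TF-IDF then
# down-weights terms that appear in many documents.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Per-term weights in the first document.
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(term, round(tfidf[0, idx], 3))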

Yago ontology for entity disambiguation

I am using the property rdf:type with the value dbpedia-owl:Organisation for selecting (obviously) organizations in my SPARQL query:
SELECT ?s
WHERE {
  ?s a dbpedia-owl:Organisation .
} LIMIT 10
I would like to consider the YAGO ontology to improve my chances of getting real organizations. For example, the FBI (http://dbpedia.org/resource/Federal_Bureau_of_Investigation) is not typed as dbpedia-owl:Organisation but is tagged as yago:Organization108008335.
Note the "random" (at least for me) number at the end of the class name. Does anyone know what this number stands for? How am I supposed to know it a priori?
Moreover, when I look for more classes with this format (using the query below), I can find two more: http://dbpedia.org/class/yago/Organization101008378 and http://dbpedia.org/class/yago/Organization101136519, in addition to http://dbpedia.org/class/yago/Organization108008335.
SELECT DISTINCT ?t
WHERE {
  ?s a ?t .
  FILTER(regex(str(?t), "http://dbpedia.org/class/yago/Organization\\d+"))
}
Are they different? Why aren't they all "yago:Organization"? Should I expect "new" organization classes as new versions of the YAGO ontology are made available? Is there any other class I should consider when selecting organizations?
I have been digging into this lately, so I'll try to answer all your questions one by one:
Note the "random" (at least for me) number in the end of the class name. Does anyone know what this number stands for? How do I suppose to know it a priori?
That number corresponds to the synset ID of the word in WordNet. For example, if you look up wordnet_organization_101136519 in WordNet (the URI in DBpedia is not resolvable at the moment; maybe they have changed something in the last releases), you will see that it has the synset ID "101136519". I don't think you can know it a priori without looking into WordNet.
Are they different? Why aren't they all "yago:Organization"?
They are different because they have a different definition in wordnet. For example:
wordnet_organization_101136519: "the activity or result of distributing or disposing persons or things properly or methodically 'his organization of the work force was very efficient'". Example of an instance: Bogo-Indian_Defence. See more details here
wordnet_organization_101008378: "the act of organizing a business or an activity related to a business 'he was brought in to supervise the organization of a new department'". Example of an instance: Adam_Smith_Foundation. See more details here
If you follow the links I provided you can see more differences and common similarities.
Should I expect "new" organization classes as new versions of YAGO ontologies are made available?
When they generated YAGO, they associated every word in WordNet with a URI. If more words about organizations are added, then I guess you'll have more definitions. However, it is impossible to know beforehand.
Is there any other class I should consider when selecting Organizations?
You can look for all the classes with the label "organization" in WordNet and then add OPTIONALs to your query (or issue one query per class, retrieving the different organizations you are interested in); a sketch follows below. These are the classes with the "organization" label in WordNet.
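As one possible shape for that combined query, a sketch (again assuming the SPARQLWrapper Python library) that unions the three classes found above via VALUES instead of issuing one query per class:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT DISTINCT ?s
    WHERE {
      VALUES ?t {
        <http://dbpedia.org/class/yago/Organization108008335>
        <http://dbpedia.org/class/yago/Organization101008378>
        <http://dbpedia.org/class/yago/Organization101136519>
      }
      ?s a ?t .
    }
    LIMIT 100
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["s"]["value"])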
I hope it helps.

What is the use of Lucene index files in DBpedia Spotlight?

I am trying to find named entities in a given text. For that, I have tried using the DBpedia Spotlight service.
I am able to get a response out of that. However, the DBpedia dataset is limited, so I tried replacing their spotter.dict file with my own dictionary. My dictionary contains one entity per line:
Sachin Tendulkar###PERSON
Barack Obama ###PERSON
.... etc
Then I parse this file and build an ExactDictionaryChunker object.
Now I am able to get the entities and their types (after modification of dbpedia code).
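For illustration only, a Python sketch of what parsing that dictionary format and exact-match spotting amount to (ExactDictionaryChunker itself is a LingPipe Java class doing tokenized longest-match lookup; the file name spotter.dict is from the question):

# Parse "Entity###TYPE" lines into a lookup table.
entity_types = {}
with open("spotter.dict", encoding="utf-8") as f:
    for line in f:
        if "###" in line:
            name, etype = line.rsplit("###", 1)
            entity_types[name.strip()] = etype.strip()

def chunk(text):
    """Naive exact-match spotting: report every dictionary entry
    that occurs verbatim in the text."""
    return [(name, etype) for name, etype in entity_types.items()
            if name in text]

print(chunk("Sachin Tendulkar met Barack Obama yesterday."))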
My question is: DBpedia Spotlight is using Lucene index files, and I really don't understand what purpose these files serve.
Can we do it without using index files? What's the significance of the index files?
Lucene was used in the earlier implementation of DBpedia Spotlight to store a model of each entity in our KB. This model is used to give us a relatedness measure between the context (extracted from your input text) and the entity. More concretely, each entity is represented by a vector {t1: score1, t2: score2, ... }. At runtime we model your input text as a vector in the same dimensions and measure the cosine between input vector and entity vectors. In your case, you would have to add a vector for Sachin Tendulkar to the space (add a document to the Lucene index) in case it is not already there. The latest implementation, though, has moved away from Lucene to an in-house in-memory context store. https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Internationalization-(DB-backed-core)
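For intuition, a toy sketch of that relatedness measure (vectors as plain dicts with made-up scores):

from math import sqrt

def cosine(u, v):
    """Cosine between two sparse term->score vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Entity model {t1: score1, t2: score2, ...} and input-text vector.
entity_vec = {"cricket": 0.9, "india": 0.7, "batsman": 0.8}
context_vec = {"cricket": 1.0, "batsman": 1.0, "match": 1.0}

print(cosine(context_vec, entity_vec))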

How to do relation extraction for entity centric search engine?

I'm building an entity-centric search engine. Here is what I have done so far:
Identified all entities in a document, such as person, email id, etc., using Stanford's Named Entity Recognizer.
Built an entity-based index table using the Lucene indexer, e.g. "Barack Obama" with field name "PERSON", and also a keyword-based index.
Now I need to establish relationships between those entities. For example, if the query is "Wife of Obama", I need to resolve that to Michelle Obama. I want the two entities "Barack Obama" and "Michelle Obama" linked by the relation "spouse". I referred to several papers on relation extraction, but in vain. I don't want to extract data from an already existing source like Freebase; I want to extract it on my own using some algorithms or an API.
Please suggest an idea or a way to build relation table.
Thanks :)
Relationship extraction is a well-known problem in the NLP field and can be handled with kernel methods.
The problem can easily be transformed into a classification problem, and you can train a model for every relationship type.
What you have to do first is extract the entities from the Wikipedia page
and perform coreference resolution for every entity (pronoun replacement).
Then you have to extract features for the relationship type for which you want to train your model.
Let's try it with a toy example:
American Airlines, a unit of AMR, immediately matched the move,
spokesman Tim Wagner said.
Here you have two entities in the article, a person and an organization;
the relation is person - spokesman - org.
For this relation you can extract these features:
1- Entity-based features:
entity1 type - ORG (can be found with a gazetteer or a named entity extraction tool)
entity1 head - airline
entity2 type - PERSON (can be found with a gazetteer or a named entity extraction tool)
entity2 head - Wagner
concatenated types - ORGPER
2- Word-based features:
bag of words between the entities, word before entity1, word after entity2, bigram before entity1, bigram after entity1, bigram before entity2, bigram after entity2
3- Syntactic features:
POS tag of the word before entity1, POS tag of the word after entity1, POS tag of the word before entity2, POS tag of the word after entity2
constituent path, base syntactic chunk path, typed dependency path (these features you can extract using the Stanford parser and dependency parser)
You can also try some more features; a sketch of one such feature dict follows below.
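To make that concrete, a hypothetical Python rendering of one training instance for the toy sentence (names and values transcribed from the lists above; bigram and path features omitted for brevity):

sentence = ("American Airlines, a unit of AMR, immediately matched "
            "the move, spokesman Tim Wagner said.")

# One training instance: feature dict + relation label.
features = {
    # 1- entity-based features
    "e1_type": "ORG",
    "e1_head": "airline",
    "e2_type": "PERSON",
    "e2_head": "Wagner",
    "concat_types": "ORGPER",
    # 2- word-based features
    "bow_between": "a unit of AMR immediately matched the move spokesman",
    "word_before_e1": "<START>",
    "word_after_e2": "said",
    # 3- syntactic features (would come from a parser)
    "pos_before_e1": "<START>",
    "pos_after_e2": "VBD",
}
label = "person-spokesman-org"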
Once you have extracted the features, use any (multiclass) classifier you like:
SVM (support vector machine)
MaxEnt (aka multiclass logistic regression)
Naive Bayes
Once you have trained your model, you can use it for relation extraction; a training sketch follows below.
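And a minimal training sketch with scikit-learn, assuming feature dicts like the one above (toy data only; DictVectorizer one-hot encodes the string-valued features):

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy training set: one feature dict per entity pair.
X_dicts = [
    {"e1_type": "ORG", "e2_type": "PERSON", "concat_types": "ORGPER",
     "word_after_e2": "said"},
    {"e1_type": "PERSON", "e2_type": "PERSON", "concat_types": "PERPER",
     "word_after_e2": "married"},
]
y = ["person-spokesman-org", "spouse"]

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)  # one-hot encodes "feature=value" pairs

clf = LinearSVC()  # SVM; MaxEnt (logistic regression) or Naive Bayes work too
clf.fit(X, y)

test = vec.transform([{"e1_type": "ORG", "e2_type": "PERSON",
                       "concat_types": "ORGPER", "word_after_e2": "said"}])
print(clf.predict(test))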
You can use Rosoka, available through Amazon AWS at Rosoka; it provides entity and relationship extraction.