Yago ontology for entity disambiguation - semantics

I am using the propriety rdfs:type equal to dbpedia-owl:Organisation for selecting (obviously) organizations on my SPARQL query:
SELECT ?s
WHERE {
?s a dbpedia-owl:Organisation .
} LIMIT 10
I would like to consider the YAGO ontology for increasing my performance on getting real organizations. For example, the FBI (http://dbpedia.org/resource/Federal_Bureau_of_Investigation) is not considered as a dbpedia-owl:Organisation but is tagged as yago:Organization108008335 .
Note the "random" (at least for me) number in the end of the class name. Does anyone know what this number stands for? How do I suppose to know it a priori?
Moreover, when I look for more classes with this format (using the query below), I can find two more classes: http://dbpedia.org/class/yago/Organization108008335, http://dbpedia.org/class/yago/Organization101008378, http://dbpedia.org/class/yago/Organization101136519
SELECT DISTINCT ?t WHERE {
?s a ?t
FILTER(regex(str(?t), "http://dbpedia.org/class/yago/Organization\\d+"))
}
Are they different? Why aren't they all "yago:Organization". Should I expect "new" organization classes as new versions of YAGO ontologies are made available? Is there any other class I should consider when selecting Organizations?

I have been digging into this lately, so I'll try to answer all your questions one by one:
Note the "random" (at least for me) number in the end of the class name. Does anyone know what this number stands for? How do I suppose to know it a priori?
That number corresponds to the synset id of the word in Wordnet. For example, if you look up wordnet_organization_101136519 in wordnet (the URI in the dbpedia is not resolvable at this moment, maybe they have changed something in the last releases), you will see that it has a synsetID "101136519". I don't think you can know it a priori without looking into wordnet.
Are they different? Why aren't they all "yago:Organization".
They are different because they have a different definition in wordnet. For example:
wordnet_organization_101136519: "the activity or result of distributing or disposing persons or things properly or methodically 'his organization of the work force was very efficient'". Example of an instance: Bogo-Indian_Defence. See more details here
wordnet_organization_101008378: "the act of organizing a business or an activity related to a business 'he was brought in to supervise the organization of a new department'". Example of an instance: Adam_Smith_Foundation. See more details here
If you follow the links I provided you can see more differences and common similarities.
Should I expect "new" organization classes as new versions of YAGO ontologies are made available?
When they generated Yago they associated every word in wordnet to a URI. If more words about organizations are added, then I guess that you'll have more definitions. However it is impossible to know beforehand.
Is there any other class I should consider when selecting Organizations?
You can look for all the classes with the label "organization" in wordnet and then add optionals to your query (or issue one query per class retrieving the different organizations you are interested in). These are the classes with the "organization" label in Wordnet.
I hope it helps.

Related

The most specific rdf type of a DBpedia resource

I was wondering how DPpedia decides which rdf:type to put in the header of each resource page:
I tried computing the most "specific" category by selecting the resource category that contains the fewest members total as suggested here. But it is not a reliable heuristic, at least not for this purpose. Any suggestions how to query this entity type appreciated.

WikiData, locating entities with a specific type or subtype, located in a specific city

My Specific problem is that I have a place called "Beacon Theater". What I want to find is the best match for this in Wikidata.
A Wikidata Search will give me three results:
Live at the Beacon Theater (Q6656601)
Beacon Theatre (Q264186): performing arts venue
Beacon Theaters (Q19110809)
The first one is a movie, the second it the correct result, and the third is a Supreme Court decision.
Using this API call, I can find the id's for all three:
https://www.wikidata.org/w/api.php?action=query&format=json&list=search&srsearch=Beacon Theater
Next step is getting the details for each of these. I use this call to get the information for all three entities
"https://www.wikidata.org/w/api.php?action=wbgetentities&props=descriptions|labels|claims&ids=Q6656601|Q264186|Q19110809&languages=en&format=json"
At this point, I want to iterate over them and find the one that is a building. I may also later want to add a way to find the one located in New York.
My problem is that the correct answer is not a building (Q41176). The P31 value is Q3469910, which is a Performance Arts Venue, so I can't really sort on that (Imagine in the future I use this code to search for a museum. A museum is also a building, but not a Performing Arts Venue. The search for Beacon Theater is just an example.
So question: How can I find the correct entry, which for the purposes of this question I define as:
Being a building (or perhaps being derived from a Building)
Optional Answer: Being located in New York (In case of multiple hits, this would further limit the results)
I think I need to do a SPARQL query as the second query to do this, but from the examples I could not quite figure out how, or if that would be the correct/easiest way. Maybe even a SPARQL query that would do all of the above in one query?
In your case an exact label match might already do
SELECT DISTINCT ?loc ?locLabel ?locDescription
WHERE
{
values ?locLabel {"Beacon Theater"#en }
?loc rdfs:label ?locLabel .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"}
}
try it!
I have a project where I meet the same kind of problems but for books, which can also be comic books, or manga etc. The easiest solution I found was to keep a list of "alias entities", that is, entities that could be considered at match when looking for a book. It's not as dynamic as a SPARQL query and requires periodical updates - adding newly found matching entities, removing problematic ones - but that's so much faster and satisfies most of my needs.

How to assign a suitable science category to keywords, using DBPedia and SPARQL?

I have some keywords like emotion perception ability, students’ motivation, self-efficacy. The goal is to map these keywords to a corresponding category(-ies) of psychology. In this case I know apriori that the answer is Educational psychology, however I want to get the same answer using DBPedia ontologies.
Using the following query I am able to extract different branches of psychology and corresponding abstracts:
SELECT DISTINCT ?subject ?abstract
WHERE {
?concept rdfs:label "Branches of psychology"#en .
?concept ^dct:subject ?subject .
?subject dbo:abstract ?abstract .
}
LIMIT 100
Now I want to add some OPTIONAL clause that would compare my keywords (using OR) with terms from the abstract (dbo:abstract). Is it possible to do this using SPARQL? Or should I use SPARQL just to obtain the abstracts and then to make all further text processing using e.g. Java or Python?
Also, the ideas of some other approaches that might be useful to reach the goal are highly appreciated.
You could retrieve data as text using sparql, but deciding if a text match a query should be done with text data analysis technics, or text mining
This is a whole science, but fortunately lots of libraries for many languages (including Java and Python) exist in order to implement related algorithms. Here is a list of software on wikipedia. NLTK is well-known for the job, and have a Python binding.
In your case, i think of many ways, but i'm far from an expert, so my idea can be wrong:
Create a corpus of abstracts of each desired categories (educational psychology,…), and, for a given abstract A, compare A to each abstract of each abstract of each category C. The result of the comparison will give for each category a score/likelihood that A belongs to C. (cf fuzzy sets)
The comparison could be implemented with vector space model, that works on vocabulary similarities.
Named Entities Recognition could help to detect names of authors, technics or tools related to particular category.
Main idea is the following: once you defined what is the particular trait of each category, by using its vocabulary, authors, references or whatever, you can decide, for any abstract, a score of membership to all categories.
So, the real question to ask is which scoring function should i use ?.
The answer depends heavily of the data, and the results you want. When you say an abstract is about educational psychology, you have to know why. And then implement it as a scoring function.
As a side node, i add that neural networks, by training on the corpus, could, maybe, bypass the scoring, by automatic learning. I don't know enough in that field to say more.

How do triple stores use linked data?

Lets say I have the following scenario:
I have some different ontology files hosted somewhere on the web on different domains like _http://foo1.com/ontolgy1.owl#, _http://foo2.com/ontology2.owl# etc.
I also have a triple store in which I want to insert instances based on the ontology files mentioned like this:
INSERT DATA
{
<http://foo1.com/instance1> a <http://foo1.com/ontolgy1.owl#class1>.
<http://foo2.com/instance2> a <http://foo2.com/ontolgy2.owl#class2>.
<http://foo2.com/instance2x> a <http://foo2.com/ontolgy2.owl#class2x>.
}
Lets say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
And after the insert if I run a SPARQL query like this:
select ?a
where
{
?a rdf:type ?type.
?type rdfs:subClassOf* <http://foo2.com/ontolgy2.owl#class2> .
}
the result would be:
<http://foo2.com/instance2>
and not:
<http://foo2.com/instance2>
<http://foo2.com/instance2x>
as it should be. This is happening because the ontology file _http://foo2.com/ontolgy2.owl# is not imported into the triple store.
My question is:
Can we talk in this example about "linked" data? Because it seems to me that it is not linked at all. It has to be imported locally into a triple store, and after that you can start querying.
Lets say if you want to run a query on some complex data that is described by 20 ontology files, then all 20 ontology files would need to be imported.
Isn't this a bit disappointing?
Do I misunderstood triple stores and linked data and how they work together?
as it should be.
I'm not certain that should is the right term here. The semantics of the SPARQL query is to query the data stored in a particular graph stored at the endpoint. IRIs are more or less opaque identifiers; just because they might also be URLs from which additional data can be retrieved doesn't obligate any particular system to actually do that kind of retrieval. Doing that would easily make query behavior unpredictable: "this query worked yesterday, why doesn't it work today? oh, a remote website is no longer available…".
Lets say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
Remember, since IRIs are opaque, anyone can define a term in any ontology. It's always possible for someone else to come along and say something else about a resource. You have no way of tracking all that information. For instance, if I go and write an ontology, I can declare http://foo2.com/ontolgy2.owl#class2x as a class and assert that it's equivalent to http://dbpedia.org/ontology/Person. Should the system have some way to know about what I did someplace else, and even if it did, should it be required to go and retrieve information from it? What if I made an ontology that's 2GB in size? Surely your endpoint can't be expected to go and retrieve that just to answer a quick query?
Can we talk in this example about "linked" data? Because it seems to
me that it is not linked at all. It has to be imported locally into a
triple store, and after that you can start querying.
Lets say if wan to run a query on some complex data that is describe
by 20 ontology files, in this case I have to import all 20 ontology
files.
This is usually the case, and the point about linked data is that you have a way to get more information if you choose to, and that you don't have to do as much work in negotiating how to identify resources in that data. However, you can use the service keyword in SPARQL to reference other endpoints, and that can provide a type of linking. For instance, knowing that DBpedia has a SPARQL endpoint, I can run a local query that incorporates DBpedia with something like this:
select ?person ?localValue ?publicName {
?person :hasLocalValueOfInterest ?localValue
service <http://dbpedia.org/sparql> {
?person foaf:name ?publicName
}
}
You can use multiple service blocks to aggregate data from multiple endpoints; you're not limited to just one. That seems pretty "linked" to me.

Why 2 different vocabularies for the same attribute in DBpedia?

Why does DBpedia use multiple vocabularies for the same attributes?
I have to get data of all possible movies.
For each movie I have observed that it has a dbpedia-owl and a dbpprop vocabulary for producers, directors and so on.. I retrieve the attribute with the following query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?star_name
WHERE {
<http://dbpedia.org/resource/Goal_III:_Taking_on_the_World> dbpedia-owl:starring ?star.
?star foaf:name ?star_name
}
I'll have the page id of each movie and then I'll retrieve stars and producers. For some I think
dbpedia-owl works and for some dbpprop works.
I am puzzled about it. I have to write code in Python to run this query for each movie. Hence every time I'll have to check that the result is null and then run the code for the other vocabulary.
DBPedia's data is extracted using a mapping based language from the info boxes you see on the corresponding wikipedia pages. Different mappings are used for different info boxes so two different types of resource may be mapped completely differently which is perfectly logical if you think about it.
Now the problem you are talking about is that two resources of the same type having the same data mapped differently. I suspect (though can't confirm because you didn't give examples of two movies which map properties differently) that the problem here is the data in Wikipedia. It may be that there is more than one way to express the information you are interested in within an info box and that the mapping for the info box maps differently for the different ways. This isn't ideal but Wikipedia does not have lovely clean data so you shouldn't expect DBPedia to have clean data either.
You may consider asking a question on the DBPedia mailing list at dbpedia-discussion#lists.sf.net about this to try and find out why this happens as they will be better placed to help you.