Why 2 different vocabularies for the same attribute in DBpedia? - sparql

Why does DBpedia use multiple vocabularies for the same attributes?
I have to get data of all possible movies.
For each movie I have observed that it has both a dbpedia-owl and a dbpprop vocabulary for producers, directors and so on. I retrieve the attribute with the following query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?star_name
WHERE {
<http://dbpedia.org/resource/Goal_III:_Taking_on_the_World> dbpedia-owl:starring ?star.
?star foaf:name ?star_name
}
I'll have the page ID of each movie, and then I'll retrieve its stars and producers. For some movies I think dbpedia-owl works, and for others dbpprop works.
I am puzzled about this. I have to write code in Python to run this query for each movie, so every time I'll have to check whether the result is empty and then run the query again with the other vocabulary.

DBpedia's data is extracted from the infoboxes you see on the corresponding Wikipedia pages, using a mapping-based language. Different mappings are used for different infoboxes, so two different types of resource may be mapped completely differently, which is perfectly logical if you think about it.
The problem you are describing is that two resources of the same type have the same data mapped differently. I suspect (though I can't confirm, because you didn't give examples of two movies that map properties differently) that the problem here is the data in Wikipedia. There may be more than one way to express the information you are interested in within an infobox, and the mapping for that infobox may treat those ways differently. This isn't ideal, but Wikipedia does not have lovely clean data, so you shouldn't expect DBpedia to have clean data either.
You may want to ask about this on the DBpedia mailing list at dbpedia-discussion@lists.sf.net to find out why this happens, as they will be better placed to help you.
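To avoid running the query twice per movie from Python, both vocabularies can be checked in a single query with UNION. The following is only a sketch: it assumes that dbpedia-owl and dbpprop stand for the usual http://dbpedia.org/ontology/ and http://dbpedia.org/property/ namespaces, and build_property_query is a made-up helper name, not part of any library.

```python
# Sketch only: the prefix IRIs and the UNION approach are assumptions,
# not something prescribed by the answer above.
PREFIXES = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"
    "PREFIX dbpprop: <http://dbpedia.org/property/>\n"
)


def build_property_query(movie_iri: str, local_name: str = "starring") -> str:
    """Return one SPARQL query that looks up a property in both vocabularies."""
    lines = [
        "SELECT DISTINCT ?value ?name",
        "WHERE {",
        # Either branch may match; results from both are merged.
        f"  {{ <{movie_iri}> dbpedia-owl:{local_name} ?value }}",
        "  UNION",
        f"  {{ <{movie_iri}> dbpprop:{local_name} ?value }}",
        # ?value may be a plain literal, in which case ?name stays unbound.
        "  OPTIONAL { ?value foaf:name ?name }",
        "}",
    ]
    return PREFIXES + "\n".join(lines) + "\n"


query = build_property_query(
    "http://dbpedia.org/resource/Goal_III:_Taking_on_the_World"
)
print(query)
```

The resulting string can be sent to the endpoint with whatever SPARQL client you already use. Passing local_name="producer" or local_name="director" would reuse the same pattern, assuming both vocabularies use those local names.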

Related

The most specific rdf type of a DBpedia resource

I was wondering how DBpedia decides which rdf:type to put in the header of each resource page.
I tried computing the most "specific" category by selecting the resource category that contains the fewest members in total, as suggested here. But it is not a reliable heuristic, at least not for this purpose. Any suggestions on how to query for this entity type are appreciated.

How to assign a suitable science category to keywords, using DBPedia and SPARQL?

I have some keywords like emotion perception ability, students’ motivation, self-efficacy. The goal is to map these keywords to the corresponding category (or categories) of psychology. In this case I know a priori that the answer is Educational psychology; however, I want to get the same answer using DBpedia ontologies.
Using the following query I am able to extract different branches of psychology and corresponding abstracts:
SELECT DISTINCT ?subject ?abstract
WHERE {
?concept rdfs:label "Branches of psychology"@en .
?concept ^dct:subject ?subject .
?subject dbo:abstract ?abstract .
}
LIMIT 100
Now I want to add an OPTIONAL clause that compares my keywords (using OR) with terms from the abstract (dbo:abstract). Is it possible to do this using SPARQL? Or should I use SPARQL just to obtain the abstracts and then do all further text processing in, e.g., Java or Python?
Ideas for other approaches that might help reach the goal are also highly appreciated.
You can retrieve the data as text using SPARQL, but deciding whether a text matches a query should be done with text analysis techniques, i.e. text mining.
This is a whole science, but fortunately many libraries exist for many languages (including Java and Python) that implement the related algorithms. Here is a list of software on Wikipedia. NLTK is a well-known Python library for the job.
In your case I can think of several approaches, but I'm far from an expert, so my ideas may be wrong:
Create a corpus of abstracts for each desired category (educational psychology, …) and, for a given abstract A, compare A to each abstract of each category C. The result of the comparison gives, for each category, a score or likelihood that A belongs to C (cf. fuzzy sets).
The comparison could be implemented with a vector space model, which works on vocabulary similarity.
Named entity recognition could help to detect names of authors, techniques or tools related to a particular category.
The main idea is the following: once you have defined the particular traits of each category, using its vocabulary, authors, references or whatever, you can compute, for any abstract, a membership score for every category.
So the real question to ask is: which scoring function should I use?
The answer depends heavily on the data and on the results you want. When you say an abstract is about educational psychology, you have to know why, and then implement that as a scoring function.
As a side note, neural networks trained on the corpus could perhaps bypass the hand-written scoring through automatic learning. I don't know enough about that field to say more.
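To make the vector space idea a bit more concrete, here is a minimal sketch using plain term counts and cosine similarity. The two abstracts are invented placeholders standing in for text fetched via dbo:abstract, and the function names are made up for illustration; a real setup would likely add stop-word removal and tf-idf weighting.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts for a lowercased text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_categories(keywords: str, abstracts: dict) -> list:
    """Rank categories by similarity between the keywords and each abstract."""
    query = term_vector(keywords)
    scores = {cat: cosine(query, term_vector(text)) for cat, text in abstracts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented placeholder abstracts, NOT real DBpedia data.
abstracts = {
    "Educational psychology": (
        "The study of how students learn, including motivation "
        "and self-efficacy in the classroom."
    ),
    "Forensic psychology": (
        "The application of psychology to criminal investigation and the law."
    ),
}

ranking = score_categories("students motivation self-efficacy", abstracts)
for category, score in ranking:
    print(category, round(score, 3))
```

The scoring function here is the simplest possible choice; swapping in a different one only means replacing cosine.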

How do triple stores use linked data?

Let's say I have the following scenario:
I have some different ontology files hosted somewhere on the web on different domains like _http://foo1.com/ontolgy1.owl#, _http://foo2.com/ontology2.owl# etc.
I also have a triple store in which I want to insert instances based on the ontology files mentioned like this:
INSERT DATA
{
<http://foo1.com/instance1> a <http://foo1.com/ontolgy1.owl#class1>.
<http://foo2.com/instance2> a <http://foo2.com/ontolgy2.owl#class2>.
<http://foo2.com/instance2x> a <http://foo2.com/ontolgy2.owl#class2x>.
}
Let's say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
And after the insert if I run a SPARQL query like this:
select ?a
where
{
?a rdf:type ?type.
?type rdfs:subClassOf* <http://foo2.com/ontolgy2.owl#class2> .
}
the result would be:
<http://foo2.com/instance2>
and not:
<http://foo2.com/instance2>
<http://foo2.com/instance2x>
as it should be. This is happening because the ontology file _http://foo2.com/ontolgy2.owl# is not imported into the triple store.
My question is:
Can we talk in this example about "linked" data? Because it seems to me that it is not linked at all. It has to be imported locally into a triple store, and after that you can start querying.
Say you want to run a query on some complex data that is described by 20 ontology files: then all 20 ontology files would need to be imported.
Isn't this a bit disappointing?
Have I misunderstood triple stores and linked data and how they work together?
as it should be.
I'm not certain that should is the right term here. The semantics of the SPARQL query is to query the data stored in a particular graph stored at the endpoint. IRIs are more or less opaque identifiers; just because they might also be URLs from which additional data can be retrieved doesn't obligate any particular system to actually do that kind of retrieval. Doing that would easily make query behavior unpredictable: "this query worked yesterday, why doesn't it work today? oh, a remote website is no longer available…".
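A toy illustration of this point, in plain Python rather than a real triple store: the query can only follow rdfs:subClassOf links that are actually present in the stored data. The shortened names (foo2:instance2, onto2:class2, and so on) and the instances_of helper are made up for this sketch.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def instances_of(triples: set, target: str) -> set:
    """Mimic `?a rdf:type ?t . ?t rdfs:subClassOf* target` over stored triples only."""
    classes = {target}  # zero-length path: the target class itself
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == SUBCLASS_OF and o == current and s not in classes:
                classes.add(s)
                frontier.append(s)
    return {s for s, p, o in triples if p == RDF_TYPE and o in classes}


# Only the instance data has been inserted; the ontology has not been loaded.
store = {
    ("foo1:instance1", RDF_TYPE, "onto1:class1"),
    ("foo2:instance2", RDF_TYPE, "onto2:class2"),
    ("foo2:instance2x", RDF_TYPE, "onto2:class2x"),
}
before = instances_of(store, "onto2:class2")

# Loading the ontology adds the subclass triple to the same store.
store.add(("onto2:class2x", SUBCLASS_OF, "onto2:class2"))
after = instances_of(store, "onto2:class2")

print(sorted(before))
print(sorted(after))
```

Before the subclass triple is loaded only foo2:instance2 is found; afterwards both instances are, which matches the behaviour described in the question.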
Let's say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
Remember, since IRIs are opaque, anyone can define a term in any ontology. It's always possible for someone else to come along and say something else about a resource. You have no way of tracking all that information. For instance, if I go and write an ontology, I can declare http://foo2.com/ontolgy2.owl#class2x as a class and assert that it's equivalent to http://dbpedia.org/ontology/Person. Should the system have some way to know about what I did someplace else, and even if it did, should it be required to go and retrieve information from it? What if I made an ontology that's 2GB in size? Surely your endpoint can't be expected to go and retrieve that just to answer a quick query?
Can we talk in this example about "linked" data? Because it seems to
me that it is not linked at all. It has to be imported locally into a
triple store, and after that you can start querying.
Let's say I want to run a query on some complex data that is described
by 20 ontology files; in this case I have to import all 20 ontology
files.
This is usually the case, and the point about linked data is that you have a way to get more information if you choose to, and that you don't have to do as much work in negotiating how to identify resources in that data. However, you can use the service keyword in SPARQL to reference other endpoints, and that can provide a type of linking. For instance, knowing that DBpedia has a SPARQL endpoint, I can run a local query that incorporates DBpedia with something like this:
select ?person ?localValue ?publicName {
?person :hasLocalValueOfInterest ?localValue
service <http://dbpedia.org/sparql> {
?person foaf:name ?publicName
}
}
You can use multiple service blocks to aggregate data from multiple endpoints; you're not limited to just one. That seems pretty "linked" to me.

Yago ontology for entity disambiguation

I am using the property rdf:type with the value dbpedia-owl:Organisation to select (obviously) organizations in my SPARQL query:
SELECT ?s
WHERE {
?s a dbpedia-owl:Organisation .
} LIMIT 10
I would like to use the YAGO ontology as well, to get better coverage of real organizations. For example, the FBI (http://dbpedia.org/resource/Federal_Bureau_of_Investigation) is not typed as a dbpedia-owl:Organisation but is tagged as yago:Organization108008335.
Note the "random" (at least to me) number at the end of the class name. Does anyone know what this number stands for? How am I supposed to know it a priori?
Moreover, when I look for more classes with this format (using the query below), I find three classes in total: http://dbpedia.org/class/yago/Organization108008335, http://dbpedia.org/class/yago/Organization101008378 and http://dbpedia.org/class/yago/Organization101136519
SELECT DISTINCT ?t WHERE {
?s a ?t
FILTER(regex(str(?t), "http://dbpedia.org/class/yago/Organization\\d+"))
}
Are they different? Why aren't they all "yago:Organization"? Should I expect "new" organization classes as new versions of the YAGO ontology are made available? Is there any other class I should consider when selecting organizations?
I have been digging into this lately, so I'll try to answer all your questions one by one:
Note the "random" (at least to me) number at the end of the class name. Does anyone know what this number stands for? How am I supposed to know it a priori?
That number corresponds to the synset ID of the word in WordNet. For example, if you look up wordnet_organization_101136519 in WordNet (the URI in DBpedia is not resolvable at the moment; maybe something changed in the latest releases), you will see that it has the synset ID "101136519". I don't think you can know it a priori without looking into WordNet.
Are they different? Why aren't they all "yago:Organization"?
They are different because they have different definitions in WordNet. For example:
wordnet_organization_101136519: "the activity or result of distributing or disposing persons or things properly or methodically 'his organization of the work force was very efficient'". Example of an instance: Bogo-Indian_Defence. See more details here
wordnet_organization_101008378: "the act of organizing a business or an activity related to a business 'he was brought in to supervise the organization of a new department'". Example of an instance: Adam_Smith_Foundation. See more details here
If you follow the links I provided, you can see more of the differences and similarities.
Should I expect "new" organization classes as new versions of YAGO ontologies are made available?
When they generated YAGO, they associated every word in WordNet with a URI. If more words about organizations are added, then I guess you'll have more definitions. However, it is impossible to know beforehand.
Is there any other class I should consider when selecting Organizations?
You can look for all the classes with the label "organization" in WordNet and then add OPTIONALs to your query (or issue one query per class, retrieving the different organizations you are interested in). These are the classes with the "organization" label in WordNet.
I hope it helps.
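As an alternative to adding OPTIONALs or issuing one query per class, the candidate classes could be listed once in a SPARQL 1.1 VALUES block. This is a different technique from the one suggested above and only a sketch: build_org_query is a made-up helper, and which classes to include is still up to you.

```python
def build_org_query(class_iris, limit=10):
    """Return a SPARQL query selecting resources typed with any of the given classes."""
    values = " ".join(f"<{iri}>" for iri in class_iris)
    return (
        "SELECT DISTINCT ?s\n"
        "WHERE {\n"
        f"  VALUES ?type {{ {values} }}\n"
        "  ?s a ?type .\n"
        "}\n"
        f"LIMIT {limit}\n"
    )


# The two classes mentioned in the question, written as full IRIs.
ORG_CLASSES = [
    "http://dbpedia.org/ontology/Organisation",
    "http://dbpedia.org/class/yago/Organization108008335",
]

print(build_org_query(ORG_CLASSES))
```

Adding another class then means appending one IRI to the list rather than editing the query text.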

DBPedia queries missing certain chemical compounds

I am running this query to get a list of all compounds from the DBpedia public SPARQL endpoint.
SELECT * WHERE {
?y rdf:type dbpedia-owl:Drug.
?y rdfs:label ?Name .
OPTIONAL {?y dbpedia-owl:iupacName ?iupacname} .
OPTIONAL {?y dcterms:subject ?y1}
FILTER (langMatches(lang(?Name),"en"))
}
LIMIT 50000
I am downloading the results in batches of 50000 (2 files) using the OFFSET parameter.
Somehow Isopropyl_alcohol is not covered by this, even though its page exists at
http://live.dbpedia.org/page/Isopropyl_alcohol
and it has the properties that I am searching for.
There are two issues here. The first is that DBpedia Live and DBpedia do not have exactly the same content. According to the DBpedia Live webpage:
Wikipedia users constantly revise Wikipedia articles with updates
happening almost each second. Hence, data stored in the official
DBpedia endpoint can quickly become outdated, and Wikipedia articles
need to be re-extracted. DBpedia Live enables such a continuous
synchronization between DBpedia and Wikipedia.
That page also lists two SPARQL endpoints for DBpedia Live:
http://live.dbpedia.org/sparql
http://dbpedia-live.openlinksw.com/sparql
However, you'll run into issues on both. Isopropyl_alcohol is in DBpedia, and its URI is
http://dbpedia.org/resource/Isopropyl_alcohol (which, when using a web browser, redirects to
http://dbpedia.org/page/Isopropyl_alcohol)
Looking there, we see that Isopropyl alcohol doesn't have rdf:type dbpedia-owl:Drug, but only
owl:Thing
dbpedia-owl:ChemicalCompound
dbpedia-owl:ChemicalSubstance
yago:Alcohols
http://umbel.org/umbel/rc/ChemicalSubstanceType
yago:AlcoholSolvents
so you won't be able to find it with your query on DBpedia, because it doesn't have the type dbpedia-owl:Drug. Now, Isopropyl_alcohol also exists in DBpedia Live, and its URL is
http://live.dbpedia.org/resource/Isopropyl_alcohol, which redirects to
http://live.dbpedia.org/page/Isopropyl_alcohol
but it only has the following rdf:types:
yago:Alcohols
yago:AlcoholSolvents
http://umbel.org/umbel/rc/ChemicalSubstanceType
so it won't be found by your query on DBpedia Live, for the same reason.
The second issue is the one that AndyS pointed out. Even if the query would select Isopropyl_alcohol in DBpedia or DBpedia Live, unless you provide an ordering constraint, the limit/offset combination won't be guaranteed to return it, since without an ordering constraint, the server could legitimately return the same set of 50000 results to you every time.
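A sketch of how the batches could be generated from Python with an explicit ordering, so that the LIMIT/OFFSET slices line up. The base query below is a simplified version of the one in the question (the OPTIONAL parts are left out), and paged_queries is a made-up helper. Ordering by every selected variable is an assumption made here to avoid ties, whose relative order the server is otherwise free to choose.

```python
def paged_queries(base_query, order_vars, page_size, pages):
    """Yield base_query with ORDER BY plus LIMIT/OFFSET appended for each page."""
    order_clause = " ".join(f"?{name}" for name in order_vars)
    for page in range(pages):
        yield (
            f"{base_query.rstrip()}\n"
            f"ORDER BY {order_clause}\n"
            f"LIMIT {page_size}\n"
            f"OFFSET {page * page_size}\n"
        )


# Simplified version of the query from the question (OPTIONALs omitted).
BASE_QUERY = """SELECT * WHERE {
  ?y rdf:type dbpedia-owl:Drug .
  ?y rdfs:label ?Name .
  FILTER (langMatches(lang(?Name), "en"))
}"""

queries = list(paged_queries(BASE_QUERY, ["y", "Name"], 50000, 2))
for q in queries:
    print(q)
```

Each generated string is a complete query for one batch; the second one ends with OFFSET 50000.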
Maybe it is not finding it in the LIMIT/OFFSET combination you are using. The server is not obliged to answer queries in the same order every time unless you use ORDER BY, so the slices you have may not in fact cover all results.
Maybe the SPARQL site and live.dbpedia are not in step.
Try asking directly for Isopropyl_alcohol.