SPARQL: Extracting Unique Entities from DBpedia

Consider the following script:
PREFIX category: <http://dbpedia.org/resource/Category:>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT *
WHERE {
?s dcterms:subject category:Living_people .
?s foaf:name ?name
}
LIMIT 10000
When I run it, I get something like this in the result:
Sir Alexander Chapman Ferguson
Sir Alex Ferguson
Although they are different entries, they clearly refer to the same entity. I would like to reduce the output in the query sent to the SPARQL endpoint, i.e. I would like to avoid post-processing the output data, because that may be challenging in this case. Could you help me with that? What should I change in my query?

As you see when you run your query, both the rows that you mention refer to the same resource: <http://dbpedia.org/resource/Alex_Ferguson>. The fact that you get multiple rows in your query result is simply because there are multiple names for this person.
So if you just need to ensure that you don't get duplicates in your application, simply make sure that your application treats each unique value for "s" in your query result as a separate person.
On the other hand, if your problem is the fact that you get multiple names for a person, you could perhaps use some other properties. For example, dbpedia:fullname only has a single entry, likewise the properties dbpedia:surname and dbpedia:givenName.
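The first suggestion in this answer, treating each distinct value of ?s as one person, might look roughly like this on the client side. This is only a sketch: the rows are hand-written stand-ins for the query result, and the second URI (Example_Person) and the helper name names_by_resource are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows shaped like the (?s, ?name) bindings of the query above.
# The Example_Person URI is a placeholder, not a real DBpedia resource.
rows = [
    ("http://dbpedia.org/resource/Alex_Ferguson", "Sir Alexander Chapman Ferguson"),
    ("http://dbpedia.org/resource/Alex_Ferguson", "Sir Alex Ferguson"),
    ("http://dbpedia.org/resource/Example_Person", "Example Person"),
]


def names_by_resource(rows):
    """Group every name under the resource URI it belongs to."""
    grouped = defaultdict(list)
    for uri, name in rows:
        # Skip a name that was already recorded for this resource.
        if name not in grouped[uri]:
            grouped[uri].append(name)
    return dict(grouped)


people = names_by_resource(rows)
# One dictionary key per person, however many names each person has.
print(len(people), "people in", len(rows), "rows")
```

Keying on the URI rather than on the name is the point: two rows with different names still collapse into one person.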

Related

Sparql- Number Of Entities and Instances?

I'm working on a small task with SPARQL queries. I want to get the number of entities and the number of instances. I have basic knowledge of SPARQL and RDF. I wrote a SPARQL query to get the number of entities, but I'm not 100% sure it's right. The endpoint I'm using is DBpedia. Here's the query.
#Number of Entities
SELECT (count(?entity) AS ?Entities)
WHERE{ ?entity rdf:type ?type.
}
-----------
Output:
113715893
The query above gives me a big number. Is that the right query to get the number of entities?
I also have to get the number of instances. I'm not sure what 'instances' means. I assume it refers to subclasses or something similar.
Can anyone help me out with the task?
The problem with the terms entity and instance is that they are often used with different meanings. I assume entity means every URI that can be a subject, while instance means every entity that is an instance of an owl:Class.
For the entities the query would be:
SELECT (count(distinct ?entity) AS ?Entities)
WHERE{ ?entity ?p ?o}
For instances I would write the following query:
SELECT (COUNT(DISTINCT ?instance) AS ?Instances)
WHERE { ?instance a ?class . ?class a owl:Class }
Note the DISTINCT before the variable being counted. This is important: in your attempt, an entity can have multiple types, and for each of those types you get a separate binding for the combination of ?entity and ?type. As a result, the entity is counted once for each type found, so an entity with two types is counted twice. I assume you want to count each entity only once, so you need the DISTINCT keyword inside COUNT. It ensures that only different entities bound to the variable are counted.
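The double-counting effect can be reproduced without an endpoint. The sketch below uses a tiny, invented list of (?entity, ?type) bindings; the entity and type names are placeholders, and the two counts only mimic what COUNT(?entity) and COUNT(DISTINCT ?entity) would do over such bindings.

```python
# Hypothetical (?entity, ?type) bindings: entity "A" has two types,
# so it appears in two rows.
bindings = [
    ("A", "Person"),
    ("A", "Athlete"),
    ("B", "Person"),
]

# Analogue of COUNT(?entity): one count per row.
plain_count = len([entity for entity, _type in bindings])

# Analogue of COUNT(DISTINCT ?entity): one count per different entity.
distinct_count = len({entity for entity, _type in bindings})

print(plain_count, distinct_count)
```

Here the plain count is 3 while the distinct count is 2, because "A" contributes one row per type.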

Mapping DBpedia types to Wikipedia Categories

I am trying to map DBpedia types to Wikipedia categories. A simple example is the following SPARQL query:
select distinct ?cat where {
?s a dbpedia-owl:LacrossePlayer; dcterms:subject ?cat . filter(regex(?cat,'players','i') )
} limit 100
But this is highly inefficient as it has to first map the DBpedia types to DBpedia Named Entities(resources) and then extract their corresponding Wikipedia categories. I am trying to do this mapping for a lot of other DBpedia types.
Is there a direct or more efficient way to do this?
Improving the filter may help…
As an initial note, you may get some speedup if you remove or improve your filter. You can, of course, just remove it, but you could also make it more efficient, since you're not really using any special regular expression features. Just do
filter contains(lcase(str(?cat)),'players')
to check whether the URI for ?cat contains the string players. It might even be better (I'm not sure) to grab the English rdfs:label of ?cat and check that, since you wouldn't have to do the case or string conversions.
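As a rough analogy for why the substring test can replace the regex here, the Python sketch below applies both checks to a few category URIs. The URIs are illustrative placeholders rather than query results, and the two helper functions are only stand-ins for the SPARQL expressions named in their comments.

```python
import re

# Hypothetical category URIs standing in for values of ?cat.
categories = [
    "http://dbpedia.org/resource/Category:Lacrosse_players",
    "http://dbpedia.org/resource/Category:Canadian_Lacrosse_Players",
    "http://dbpedia.org/resource/Category:Living_people",
]


def matches_regex(uri):
    # Analogue of regex(?cat, 'players', 'i')
    return re.search("players", uri, re.IGNORECASE) is not None


def matches_contains(uri):
    # Analogue of contains(lcase(str(?cat)), 'players')
    return "players" in uri.lower()


by_regex = [c for c in categories if matches_regex(c)]
by_contains = [c for c in categories if matches_contains(c)]
print(by_regex == by_contains)
```

Because the pattern contains no regular expression metacharacters, both checks select the same URIs.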
… but there are lots of results.
"But this is highly inefficient as it has to first map the DBpedia types to DBpedia Named Entities(resources) and then extract their corresponding Wikipedia categories. I am trying to do this mapping for a lot of other DBpedia types. Is there a direct or more efficient way to do this?"
I'm not sure exactly what's inefficient about this. DBpedia types and categories are associated only through resources, which have types (via rdf:type) and categories (via dcterms:subject). If you want to find the connections, you need to find the instances of the type and the categories to which they belong. It might be possible to check whether particular infoboxes both assign categories to articles and are used in the infobox mapping to assign DBpedia types. That's the only way I can think of to relate categories and DBpedia types directly, without going through instances, and I don't know whether the current dataset has that kind of information.
In general, since Wikipedia categories are not a type hierarchy, there will be lots of categories with which instances of any particular type are associated. For instance, we can count the number of categories associated with the types Fish and LacrossePlayer with a query like this:
select ?type (count(distinct ?category) as ?nCategories) where {
values ?type { dbpedia-owl:Fish dbpedia-owl:LacrossePlayer }
?type ^a/dcterms:subject ?category
}
group by ?type
SPARQL results
type nCategories
http://dbpedia.org/ontology/LacrossePlayer 346
http://dbpedia.org/ontology/Fish 2375
That query responds pretty quickly, and you can even get those categories pretty easily, too:
select distinct ?type ?category where {
values ?type { dbpedia-owl:Fish dbpedia-owl:LacrossePlayer }
?type ^a/dcterms:subject ?category
}
order by ?type
limit 4000
When you start using types that have many more instances, though, these counts get big, and the queries take a while to return. E.g., a very common type like Place:
select ?type (count(distinct ?category) as ?nCategories) where {
values ?type { dbpedia-owl:Place }
?type ^a/dcterms:subject ?category
}
group by ?type
type nCategories
http://dbpedia.org/ontology/Place 191172
I wouldn't suggest trying to pull all that data down from the remote server. If you want to extract it, you should load the data locally.

Query dbpedia with SPARQL for films

I want to get data (movie title, director name, actor name and the Wikipedia link) for all movies present in DBpedia.
I tried this query on http://dbpedia.org/snorql/.
SELECT ?film_title ?star_name ?nameDirector ?link WHERE {
{
SELECT DISTINCT ?movies ?film_title
WHERE {
?movies rdf:type <http://dbpedia.org/ontology/Film>;
rdfs:label ?film_title.
}
}.
?movies dbpedia-owl:starring ?star;
foaf:isPrimaryTopicOf ?link;
dbpedia-owl:director ?director.
?director foaf:name ?nameDirector.
?star foaf:name ?star_name.
FILTER LANGMATCHES( LANG(?film_title), 'en')
} LIMIT 100
The responses seem correct, but the response time is slow, so I'm wondering whether I can improve my query to get a faster response.
There are a couple of things you could change in your query that might make it faster.
Firstly, what is the point of your SELECT DISTINCT subquery? Is it merely trying to eliminate duplicate film titles? Removing it may make things faster if you can live with a few duplicates.
Secondly, the FILTER clause requires the database to scan all the possible matches and evaluate the expression on each one to determine whether to keep it or throw it away. Again, if you can live with some duplicate data and don't mind labels with non-English language tags, removing the FILTER may make the query run faster.

Using DBpedia to find which Wikipedia categories are common to several given pages?

I've forgotten all I once knew about DBpedia and SPARQL, and I find all the examples too complex and hard to understand when I Google for them.
What I wish to do is pass in two or three Wikipedia pages and get back the set of Wikipedia categories that all of the pages are members of.
This seems like it should be utterly simple in SPARQL, so I would appreciate a very minimal example to get me started.
This is actually a variation of your earlier question about getting all pages belonging to two categories. The only difference is that this time, you want two/three subjects rather than objects, so you cannot use a comma-separated enumeration of values, but instead have to write out the triple pattern that you want to match.
For example, to get back all categories that both Spain and Portugal belong to, you could simply do a query like this:
SELECT ?cat
WHERE {
<http://dbpedia.org/resource/Spain> dcterms:subject ?cat .
<http://dbpedia.org/resource/Portugal> dcterms:subject ?cat .
}
This query matches both triple patterns with the same value of ?cat for the dcterms:subject relation, once for the subject 'Spain' and once for 'Portugal'. In other words, it retrieves precisely those categories that both resources are members of.
The trick is to think in terms of a graph, or triples with connected subjects and objects. It's a bit of a mental shift but once you've got that, query writing becomes a lot easier.
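One way to picture what the shared ?cat variable does is as a set intersection. The Python sketch below is only an analogy under that assumption: the category names are invented placeholders, not the actual categories of Spain or Portugal, and common_categories is a hypothetical helper.

```python
def common_categories(categories_by_page):
    """Return the categories that every page in the mapping belongs to."""
    sets = [set(cats) for cats in categories_by_page.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


# Placeholder category names, NOT real DBpedia data.
categories_by_page = {
    "Spain": {"Category:X", "Category:Y", "Category:Z"},
    "Portugal": {"Category:Y", "Category:Z", "Category:W"},
}

shared = common_categories(categories_by_page)
print(sorted(shared))
```

Adding a third page to the dictionary corresponds to adding a third triple pattern with the same ?cat variable to the query.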
The mapping between Wikipedia URLs and DBpedia URIs is as follows. For the page
http://en.wikipedia.org/wiki/Spain
the DBpedia URI is:
http://dbpedia.org/resource/Spain
So, to find the categories for the resource above:
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?categoryUri ?categoryName
WHERE {
<http://dbpedia.org/resource/Spain> dcterms:subject ?categoryUri.
?categoryUri rdfs:label ?categoryName.
FILTER (lang(?categoryName) = "en")
}
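The URL-to-URI mapping described in this answer can be sketched as a simple prefix swap. The function name wikipedia_to_dbpedia is made up for illustration, and the sketch assumes English Wikipedia URLs in exactly the http form shown above; it does not handle other languages, https, or differences in character encoding.

```python
WIKIPEDIA_PREFIX = "http://en.wikipedia.org/wiki/"
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(url):
    """Swap the English Wikipedia page prefix for the DBpedia resource prefix."""
    if not url.startswith(WIKIPEDIA_PREFIX):
        raise ValueError("not an English Wikipedia page URL: " + url)
    # Keep the page title part unchanged and prepend the DBpedia prefix.
    return DBPEDIA_PREFIX + url[len(WIKIPEDIA_PREFIX):]


uri = wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Spain")
print(uri)
```

The resulting URI can then be placed between angle brackets as the subject of the dcterms:subject pattern.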

DBpedia query returns some musicals more than once despite filters

I'm trying to use a SPARQL query against DBpedia to retrieve a list of musicals and some associated properties. However, despite using the appropriate filters (as far as I can tell), the results include many of the musicals more than once. Here is my query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?label ?abstract ?book ?music ?lyrics
WHERE {
?play <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Broadway_musicals> ;
rdfs:label ?label ;
dbo:abstract ?abstract ;
dbpprop:book ?book ;
dbpprop:lyrics ?lyrics ;
dbpprop:music ?music .
FILTER (LANG(?label) = 'en')
FILTER (LANG(?abstract) = 'en')
FILTER (LANG(?book) = 'en')
FILTER (LANG(?lyrics) = 'en')
FILTER (LANG(?music) = 'en')
}
The resulting list has many duplicate entries. If you paste the query into the DBpedia SPARQL Explorer, you'll see that starting with 'Mamma Mia!' there are a lot of duplicates in the list.
Any idea what I'm missing to get unique results with no duplicates? Thanks!
[Edited by glenn mcdonald to clarify that it's musicals which are "duplicated" here, not triples.]
SPARQL returns variable bindings. Your "duplicates" are Cartesian products of the multiple values of your projected properties. Mamma Mia has multiple music writers and multiple lyricists, so you get every possible combination of them that could produce a row in your table.
What a pain, huh? The "solution" is to use CONSTRUCT instead of SELECT, and deal with getting back a graph instead of a table. Maybe like this:
http://dbpedia.org/snorql/?query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0A++++PREFIX+dbo%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E%0D%0A++++PREFIX+dbpprop%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2F%3E%0D%0A++++CONSTRUCT+%7B%0D%0A++++++++%3Fplay+rdfs%3Alabel+%3Flabel+%3B%0D%0A++++++++++++dbo%3Aabstract+%3Fabstract+%3B%0D%0A++++++++++++dbpprop%3Abook+%3Fbook+%3B%0D%0A++++++++++++dbpprop%3Alyrics+%3Flyrics+%3B%0D%0A++++++++++++dbpprop%3Amusic+%3Fmusic+.%0D%0A++++%7D%0D%0A++++WHERE+%7B+%0D%0A++++++++%3Fplay+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3ABroadway_musicals%3E+%3B%0D%0A++++++++++++rdfs%3Alabel+%3Flabel+%3B%0D%0A++++++++++++dbo%3Aabstract+%3Fabstract+%3B%0D%0A++++++++++++dbpprop%3Abook+%3Fbook+%3B%0D%0A++++++++++++dbpprop%3Alyrics+%3Flyrics+%3B%0D%0A++++++++++++dbpprop%3Amusic+%3Fmusic+.%0D%0A++++++++FILTER+%28LANG%28%3Flabel%29+%3D+%27en%27%29++++%0D%0A++++++++FILTER+%28LANG%28%3Fabstract%29+%3D+%27en%27%29%0D%0A++++++++FILTER+%28LANG%28%3Fbook%29+%3D+%27en%27%29%0D%0A++++++++FILTER+%28LANG%28%3Flyrics%29+%3D+%27en%27%29%0D%0A++++++++FILTER+%28LANG%28%3Fmusic%29+%3D+%27en%27%29%0D%0A++++%7D
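The row multiplication described in this answer can be reproduced offline. The sketch below uses invented placeholder values for a single musical (not the real credits of any show) and itertools.product to enumerate the combinations a SELECT would return.

```python
from itertools import product

# Hypothetical property values for ONE musical; names are placeholders.
book = ["Book author"]
music = ["Composer 1", "Composer 2"]
lyrics = ["Lyricist 1", "Lyricist 2"]

# Each result row pairs one value of every property,
# so the row count is the product of the value counts.
rows = list(product(book, music, lyrics))

for row in rows:
    print(row)
print(len(rows), "rows for a single musical")
```

With one book author, two composers and two lyricists, the single musical shows up in 1 * 2 * 2 = 4 rows.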
Are the duplicates exact duplicates, i.e. is every value for every variable identical in each duplicate result?
If so, add the DISTINCT keyword after SELECT to force the SPARQL engine to discard duplicate solutions.
If not, then Glenn is entirely correct: because there are multiple values for the various properties, you will get multiple results. There are complex workarounds you can do with subqueries, GROUP BY etc., but they tend to lead to less efficient queries. Sometimes you just have to deal with the duplicates on the client side.
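Dealing with the duplicates on the client side could look something like the sketch below, which folds rows that share a label back into one record per musical. The rows are invented placeholders in the shape of the query's result, and collapse is a hypothetical helper, not part of any SPARQL library.

```python
def collapse(rows, key):
    """Merge rows sharing the same key into one record of sorted value lists."""
    merged = {}
    for row in rows:
        record = merged.setdefault(row[key], {})
        for column, value in row.items():
            if column != key:
                # A set drops values repeated across the combined rows.
                record.setdefault(column, set()).add(value)
    return {
        k: {column: sorted(values) for column, values in record.items()}
        for k, record in merged.items()
    }


# Hypothetical result rows: one musical repeated for every combination.
rows = [
    {"label": "Musical X", "music": "Composer 1", "lyrics": "Lyricist 1"},
    {"label": "Musical X", "music": "Composer 1", "lyrics": "Lyricist 2"},
    {"label": "Musical X", "music": "Composer 2", "lyrics": "Lyricist 1"},
    {"label": "Musical X", "music": "Composer 2", "lyrics": "Lyricist 2"},
]

musicals = collapse(rows, key="label")
print(musicals)
```

In practice the resource URI (?play) would be a safer grouping key than the label, since two different musicals could share a title; that would require adding ?play to the SELECT clause.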