How can I avoid timeout on a SPARQL query on Wikidata? - sparql

I am trying to extract all items of a category on Wikidata, with their respective page title in English. It works ok as long as the category does not contain many items, like this:
SELECT ?work ?workLabel
WHERE
{
?work wdt:P31/wdt:P279* wd:Q734454.
?work rdfs:label ?workLabel .
FILTER ( LANGMATCHES ( LANG ( ?workLabel ), "en" ) )
}
ORDER BY ?work
but it times out (Query timeout limit reached) as soon as I use a category with more items, such as Q2188189. See this example.
I have tried using LIMIT or OFFSET clauses but this does not change the result.
I also have tried to insert a filter like this FILTER (regex(?work, '.*Q1.*')) . to slice the query in subsets, also without success (No matching records found).
For now I have only extracted the ids - and then run queries to get the page title for each one of them, but that seems silly.
Is there a way to work around the timeout?

Standard method
If you want the page title of all the music works which have an article on en.wikipedia.org, you must use the following query:
SELECT ?work ?workTitle
WHERE
{
?work wdt:P31/wdt:P279* wd:Q2188189.
?workLink schema:about ?work ;
schema:isPartOf <https://en.wikipedia.org/> ;
schema:name ?workTitle .
}
I ran it three times, and two of those runs finished without exceeding the timeout.
Alternative method
If you don't manage to make it work, the only workaround I can imagine is to retrieve all the possible types (i.e. subclasses) of music work, and adapt the above query to the single-class case.
So, the first step is:
SELECT ?workType WHERE { ?workType wdt:P279* wd:Q2188189. }
You'll get more than a thousand results. For each of them (take for example the result Q2743), you'll then have to run the following query:
SELECT ?work ?workTitle
WHERE
{
?work wdt:P31 wd:Q2743.
?workLink schema:about ?work ;
schema:isPartOf <https://en.wikipedia.org/> ;
schema:name ?workTitle .
}
This will return all the items that are directly instances of Q2743, without caring about subclasses.
This method is a bit cumbersome, but you can use it if you don't mind running many queries. The idea is to split the complexity across many queries, so that each one is less likely to exceed the timeout.
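The per-class step can be scripted. The following is only a minimal sketch, assuming the class IDs from the first query have already been collected into a list; the helper names build_instance_query and build_all_queries are hypothetical, and actually sending each query to the endpoint is left out.

```python
import re

WIKIPEDIA_EN = "https://en.wikipedia.org/"


def build_instance_query(class_qid: str) -> str:
    """Return the single-class query for one Wikidata class ID such as 'Q2743'."""
    # Reject anything that is not a plain item ID before splicing it into the query.
    if not re.fullmatch(r"Q[1-9]\d*", class_qid):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    return (
        "SELECT ?work ?workTitle\n"
        "WHERE\n"
        "{\n"
        f"  ?work wdt:P31 wd:{class_qid}.\n"
        "  ?workLink schema:about ?work ;\n"
        f"            schema:isPartOf <{WIKIPEDIA_EN}> ;\n"
        "            schema:name ?workTitle .\n"
        "}"
    )


def build_all_queries(class_qids):
    """Build one query per distinct class ID, keeping first-seen order."""
    return [build_instance_query(qid) for qid in dict.fromkeys(class_qids)]


# Example input: two class IDs taken from the answer above, one repeated.
queries = build_all_queries(["Q2743", "Q2188189", "Q2743"])
print(len(queries))
print(queries[0])
```

Each generated string is the single-class query shown above with a different class ID, so the results can be concatenated on the client side once all of them have run.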

Related

Get movie(s) based on book(s) from DBpedia

I am new to SPARQL and trying to fetch a movie adapted from specific book from dbpedia. This is what I have so far:
PREFIX onto: <http://dbpedia.org/ontology/>
SELECT *
WHERE
{
<http://dbpedia.org/page/2001:_A_Space_Odyssey> a ?type.
?type onto:basedOn ?book .
?book a onto:Book
}
I can't get any results. How can I do that?
When using any web resource, and in your case the property :basedOn, you need to make sure that you have declared the right prefix. If you are querying the DBpedia SPARQL endpoint, you can use dbo:basedOn directly, even without declaring it, as it is among the predefined prefixes. Alternatively, if you want to use your own prefix, or if you are using another SPARQL client, make sure that whatever short name you choose for this property, you declare the prefix for http://dbpedia.org/ontology/.
Then, to get more results, avoid restricting the type of the subject of this triple pattern, since some movies may not actually be typed as such. A query like this
select distinct *
{
?movie dbo:basedOn ?book .
?book a dbo:Book .
}
will give you lots of good results, but not all of them. For example, the resource from your example will be missing. You can easily check which properties link these two resources with a query like this:
select ?p
{
{<http://dbpedia.org/resource/2001:_A_Space_Odyssey_(film)> ?p <http://dbpedia.org/resource/2001:_A_Space_Odyssey> }
UNION
{ <http://dbpedia.org/resource/2001:_A_Space_Odyssey> ?p <http://dbpedia.org/resource/2001:_A_Space_Odyssey_(film)>}
}
You'll get only one result:
http://www.w3.org/2000/01/rdf-schema#seeAlso
(note that the URI is with 'resource', not with 'page')
Then you can search for any path between the two resources, using the method described here, or find a combination of other patterns that would increase the number of results.

Get string without the language tag

A SPARQL query like:
SELECT distinct * where {
?x dc:title ?title .
}
is always returning ?title with a language tag. How can I obtain an RDF language string without its language tag, e.g. return "English"@en as "English" only?
I guess you want to show results in one language only. If that is the case, you can strip the language tag with:
SELECT distinct ?stripped_title where {
?x dc:title ?title .
BIND (STR(?title) AS ?stripped_title)
}
but it would be meaningful only after you filter your results for the desired language, e.g.
FILTER ( LANG(?title) = "en" )
Otherwise the results can be confusing to read: for example, you might get seemingly duplicate answers simply because the label happens to be the same in two different languages.
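The same filter-then-strip idea can also be applied on the client side, after the query has run. This is only a sketch, assuming the endpoint returns the standard SPARQL JSON results format, where a language-tagged literal carries an "xml:lang" key; the function name plain_titles and the sample rows are made up for illustration.

```python
def plain_titles(bindings, var="title", lang="en"):
    """Keep literals tagged exactly with `lang` and return their plain string values."""
    titles = []
    for row in bindings:
        term = row.get(var)
        if term is None:
            continue  # variable unbound in this row
        # Mirrors FILTER ( LANG(?title) = "en" ): untagged literals are skipped too.
        if term.get("xml:lang") == lang:
            # Mirrors STR(?title): only the lexical form is kept.
            titles.append(term["value"])
    return titles


# Sample rows in the shape of the "bindings" array of a SPARQL JSON result.
sample = [
    {"title": {"type": "literal", "xml:lang": "en", "value": "English"}},
    {"title": {"type": "literal", "xml:lang": "fr", "value": "English"}},
    {"title": {"type": "literal", "value": "untagged"}},
]
print(plain_titles(sample))
```

Without the language check, the first two rows would both come back as the same plain string, which is the duplicate effect described above.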

SPARQL query for all people for an institution on dbpedia

I'm trying to extract alumni lists for universities using SPARQL.
I've identified the ontologies I need:
http://mappings.dbpedia.org/server/ontology/classes/University
http://mappings.dbpedia.org/server/ontology/classes/Person
I tried this query, which you can examine here:
SELECT * WHERE {
?University dbpedia2:alumni ?Person .
}
This seemed to make sense, except that it returns counts rather than the people that the ontology says the property contains.
I found this query somewhere which seemed to do a better job finding universities, but was very slow.
SELECT * WHERE {
{ <http://dbpedia.org/ontology/University> ?property ?hasValue }
UNION
{ ?isValueOf ?property <http://dbpedia.org/ontology/University> }
}
I also tried going the other way, start with all people and look for their almae matres, in this form:
SELECT * WHERE {
?person dbpedia2:almaMater ?University
}
But this is much slower, possibly because searching through the people space is too laborious. This does actually work, but it returns a different set of results in application---namely, all people with a listed alma mater, rather than all people listed by universities as alumni. I'd prefer a syntax that gets me the alumni.
How can I phrase this to return all alumni listed for universities?
The performance of DBpedia's SPARQL endpoint can be a bit unreliable at times. After all, it's a public service, and isn't intended for huge queries. Nonetheless, I think you can get what you're looking for here without too much trouble. First, you can check how many results there are with a query like this at the public SPARQL endpoint:
select (count(*) as ?nResults) where {
?person dbpedia-owl:almaMater ?almaMater
}
SPARQL results (64928)
Now, if you just want the big list, you'd get it like this. The order by helps organize the results for easy consumption, but isn't technically necessary:
select ?almaMater ?person where {
?person dbpedia-owl:almaMater ?almaMater
}
order by ?almaMater ?person
If you need to place some additional restrictions on ?almaMater, e.g., to ensure that it's a university, then you can add them to the query. For instance:
select ?almaMater ?person where {
?person dbpedia-owl:almaMater ?almaMater .
?almaMater a dbpedia-owl:University .
}
order by ?almaMater ?person
In your last query, you are almost there. However, you are currently asking for any resource that can take the place of the ?University variable. As you only want universities to take that place, you can use another triple to further restrict that variable:
SELECT * WHERE {
?University a dbpedia-owl:University.
?person dbpedia2:almaMater ?University.
}
This means that ?University can only be an individual of class dbpedia-owl:University (where dbpedia-owl is mapped to http://dbpedia.org/ontology/).
Your first query:
SELECT * WHERE {
?University dbpedia2:alumni ?Person .
}
isn't just returning counts; it's returning both counts and individual alumni. Apparently DBpedia's data here is of poor quality, and a number of triples misuse the dbpedia2:alumni relation.
You can filter out the counts by adding a second condition requiring that whatever is bound to ?person be a member of the appropriate class:
SELECT * WHERE {
?university dbpedia2:alumni ?person .
?person rdf:type <http://dbpedia.org/ontology/Person>
}
What you see running this is that there are very few individuals tagged as alumni; the data is surprisingly scant, unfortunately.

Find number of some entity type

What is the SPARQL query that finds the count of some entity? For example, on the Linked Movie Database, if I want to find the count of actors or films, how can I get it?
I tried this
SELECT (count ( ?Film)){?entity rdf:type ?Film}
but got wrong number.
There's a whole lot missing from this question (e.g., where you ran the query, what you expected as a result, etc.) but I think we can pinpoint the problem even without those details. First, let's rewrite your query using proper syntax (the formatting is optional; the important thing is count(?Film) as ?count):
select (count(?Film) as ?count) {
?entity rdf:type ?Film
}
?Film here is a variable, so you're asking "find me things and their types, and then count how many types were found." If you were trying to count the number of things of some particular film type, though, you probably wanted a query like:
select (count(?entity) as ?numberOfFilms) {
?entity rdf:type :Film .
}
Where :Film is some particular IRI, not a variable. Also note that you can abbreviate rdf:type with a, so you can make this even shorter and fit it nicely on one line again, if you want:
select (count(?entity) as ?numberOfFilms) { ?entity a :Film }
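To see why the two patterns give different numbers, here is a toy illustration over an in-memory list of triples. The data and the ex: names are invented for the example; the list comprehensions only imitate what the two triple patterns match and are not a SPARQL engine.

```python
RDF_TYPE = "rdf:type"

# Invented sample data: two films, one actor, and one non-type triple.
triples = [
    ("ex:film1", RDF_TYPE, "ex:Film"),
    ("ex:film2", RDF_TYPE, "ex:Film"),
    ("ex:actor1", RDF_TYPE, "ex:Actor"),
    ("ex:film1", "ex:title", "Some title"),
]

# ?entity rdf:type ?Film : the object is a variable, so every type triple matches.
any_type = [(s, o) for s, p, o in triples if p == RDF_TYPE]

# ?entity rdf:type :Film : the object is fixed, so only films match.
films = [s for s, p, o in triples if p == RDF_TYPE and o == "ex:Film"]

count_any = len(any_type)
count_films = len(films)
print(count_any, count_films)
```

The first count grows with every typed resource in the dataset, whatever its class, which is why the original query reported a number that had nothing to do with films.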

Limit a SPARQL query to one dataset

I'm working with the following SPARQL query, which is an example on the web front end of my institution's SPARQL endpoint:
SELECT ?building_number ?name ?occupants WHERE {
?site a org:Site ;
rdfs:label "Highfield Campus" .
?building spacerel:within ?site ;
skos:notation ?building_number ;
rdfs:label ?name .
OPTIONAL {
?building soton:buildingOccupants ?occ .
?occ rdfs:label ?occupants .
} .
} ORDER BY ?name
The problem is that as well as getting data from 'Buildings and Places', the Dataset I'm interested in, and would expect the example to use, it also gets data from the 'Facilities and Equipment' dataset, which isn't relevant. You should see this if you follow the link.
I suspect the example may pre-date the addition of the Facilities and Equipment dataset, but even with the research I've done into SPARQL, I can't see a clear way to define which datasets to include.
Can anyone recommend a starting point to limit it to just show 'Buildings', or, more specifically, results from the 'Buildings and Places' dataset.
Thanks
First things first, you really need to use SELECT DISTINCT, as otherwise you'll get repeated results.
To answer your question, you can use GRAPH { ... } to restrict certain parts of a SPARQL query so that they only match data from a specific dataset. This only works if the data behind the SPARQL endpoint is divided up into named graphs (this one is). The solution you asked for isn't the best choice, as it assumes that things within sites in the 'places' dataset will always be restricted to buildings. That's risky, as the dataset might end up containing trees and signposts at some point in the future.
Step one is to just find out what graphs are in play:
SELECT DISTINCT ?g1 ?building_number ?name ?occupants WHERE {
?site a org:Site ;
rdfs:label "Highfield Campus" .
GRAPH ?g1 { ?building spacerel:within ?site ;
skos:notation ?building_number ;
rdfs:label ?name .
}
OPTIONAL {
?building soton:buildingOccupants ?occ .
?occ rdfs:label ?occupants .
} .
} ORDER BY ?name
Try it here: http://is.gd/WdRAGX
From this you can see that http://id.southampton.ac.uk/dataset/places/latest and http://id.southampton.ac.uk/dataset/places/facilities are the two relevant ones.
To only look for things 'within' a site according to the "places" graph, use:
SELECT DISTINCT ?building_number ?name ?occupants WHERE {
?site a org:Site ;
rdfs:label "Highfield Campus" .
GRAPH <http://id.southampton.ac.uk/dataset/places/latest> {
?building spacerel:within ?site ;
skos:notation ?building_number ;
rdfs:label ?name .
}
OPTIONAL {
?building soton:buildingOccupants ?occ .
?occ rdfs:label ?occupants .
} .
} ORDER BY ?name
Alternate solutions:
Using rdf:type
Above I've answered your question, but it's not the answer to your problem. This solution is more semantic as it actually says 'only give me buildings within the campus' which is what you really mean.
Instead of filtering by graph, which is not very 'semantic', you could also restrict ?building to be of class 'building', which research facilities are not. Facilities are still sometimes listed as 'within' a site, usually when the uni has published only which campus they are on but not which building.
?building a rooms:Building
Using FILTER
In extreme cases you may not have data in different GRAPHS and there may not be an elegant relationship to use to filter your results. In this case you can use a FILTER and turn the building URI into a string and use a regular expression to match acceptable ones:
FILTER regex(str(?building), "^http://id.southampton.ac.uk/building/")
This is by far the worst option; don't use it unless you have to.
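If you do fall back on the regular expression, it is worth checking it against a few URIs first. A rough sketch using Python's re module as a stand-in for the SPARQL regex function; the sample URIs and the helper name is_building_uri are invented for illustration, and only the pattern itself comes from the FILTER above.

```python
import re

# Same pattern as in the FILTER. The dots are unescaped, so each one matches
# any single character, which makes the filter slightly looser than it looks.
PATTERN = re.compile(r"^http://id.southampton.ac.uk/building/")

# Stricter variant with the literal prefix escaped.
STRICT = re.compile("^" + re.escape("http://id.southampton.ac.uk/building/"))


def is_building_uri(uri: str, pattern=PATTERN) -> bool:
    """True when the URI string starts with the building prefix."""
    return pattern.search(uri) is not None


# Hypothetical URIs, used only to exercise the patterns.
print(is_building_uri("http://id.southampton.ac.uk/building/32"))
print(is_building_uri("http://id.southampton.ac.uk/equipment/E0001"))
print(is_building_uri("http://idXsouthampton.ac.uk/building/32"))
print(is_building_uri("http://idXsouthampton.ac.uk/building/32", STRICT))
```

Escaping the dots is a small cost and avoids accidental matches, although with URIs from a single institution the loose pattern is unlikely to cause trouble in practice.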
Belt and Braces
You can use any of these restrictions together; a combination of restricting the GRAPH and ensuring that every ?building really is a building would be my recommended solution.