Ways to reduce query time of SPARQL query on Wikidata?

I want to create a histogram of births and deaths for people on English Wikipedia, but I am running into query time limits on Wikidata.
I formed the following query:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
SELECT ?item ?article ?_date_of_birth ?_date_of_death WHERE {
?item wdt:P31 wd:Q5.
?article schema:about ?item.
?article schema:isPartOf <https://en.wikipedia.org/>.
OPTIONAL { ?item wdt:P569 ?_date_of_birth. }
OPTIONAL { ?item wdt:P570 ?_date_of_death. }
}
LIMIT 10000
This works fine on its own, but I am trying to get the whole list, and when I start adding offsets I hit the query time limit at around OFFSET 500000. According to the Wikidata manual, I should try to optimize my query, but is there a way to optimize this one? There are definitely more than 500,000 people on Wikipedia: just finding transclusions of the 'birth date' template yields over 600,000.
I have also tried DBpedia, but some of its data is out of date; for example, Muhammad Ali has no death date on DBpedia.
I've also tried not restricting the query to English articles, i.e. asking for all of them and filtering on my end, but similar scaling issues remain, albeit at a much higher offset.
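Whichever way the rows are eventually fetched, the histogram step itself can be done on the client. Below is a minimal sketch, assuming the date values arrive as strings in the xsd:dateTime form that the Wikidata Query Service usually returns (for example 1942-01-17T00:00:00Z, with a leading minus sign for BCE years). The helper name year_histogram is made up for illustration.

```python
import re
from collections import Counter

# Leading (optionally signed) year of an xsd:dateTime-style string,
# e.g. "1942-01-17T00:00:00Z" or "-0427-01-01T00:00:00Z".
_YEAR = re.compile(r"^([+-]?\d+)-\d{2}-\d{2}")


def year_histogram(dates):
    """Count how many date strings fall in each year.

    `dates` is any iterable of strings; None, empty strings and values
    that do not look like dates are skipped (OPTIONAL columns are often
    unbound, so missing values are expected).
    """
    counts = Counter()
    for value in dates:
        if not value:
            continue
        match = _YEAR.match(value)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = ["1942-01-17T00:00:00Z", "1942-05-01T00:00:00Z", None]
    print(year_histogram(sample))
```

The same function can be applied separately to the ?_date_of_birth and ?_date_of_death columns of each downloaded page of results, and the per-page dictionaries summed afterwards.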

Related

Wikidata COUNT(*) query times out

I have a straightforward query that counts how many humans have an English Wikipedia page.
prefix schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?article
WHERE
{
?item wdt:P31 wd:Q5 . # Must be a human
?article schema:about ?item ; # Must have a Wikipedia article
schema:inLanguage "en" ; # Article must be in English
schema:isPartOf <https://en.wikipedia.org/> . # Wikipedia article must be regular article
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } # Helps get the label in your language, if not, then en language
}
I get expected output as follows:
wd:Q11124 <https://en.wikipedia.org/wiki/Stephen_Breyer>
wd:Q10727 <https://en.wikipedia.org/wiki/Steve_Leo_Beleck>
wd:Q10065 <https://en.wikipedia.org/wiki/Taichang_Emperor>
wd:Q9605 <https://en.wikipedia.org/wiki/Sarah_Allen_(software_developer)>
However, if I change the SELECT statement from
SELECT ?item ?article
to
SELECT (count(?item) as ?count)
I get a timeout error. Note that the count works if I specify only the "human" condition and leave out the English Wikipedia article condition. So, clearly, some kind of background join is causing the query to time out.
However, this is a fairly trivial join, so the timeout is surprising.
Please let me know what I may be missing here.
Thanks!

SPARQL wikidata query: Obtain number of languages that associated wikipedia article is available in

Is it possible to use SPARQL on Wikidata to extract the number of languages in which the Wikipedia article associated with a Wikidata item is available?
I am new to SPARQL and Wikidata. I have tried to find examples online, but no luck so far. I am able to extract the URL of the Wikipedia article, but I wonder whether it is possible to count the number of languages for which such a URL exists. Here is my code so far:
prefix schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?museum ?museumLabel ?article WHERE {
?museum wdt:P17 wd:Q39; # country (P17) is Switzerland (Q39)
wdt:P31 ?type.
?type (wdt:P279*) wd:Q207694.
# If available, get the "en" entry, use native language as fallback:
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
# get wikipedia article in english
OPTIONAL {
?article schema:about ?museum .
?article schema:inLanguage "en" .
FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/")
}
}
order by ?museum
I would like to add another column to my output that gives me the number of languages a wikipedia article exists in.
PREFIX schema: <http://schema.org/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?museum ?museumLabel (COUNT(DISTINCT ?article) as ?wiki_article_count) WHERE {
?museum wdt:P17 wd:Q39; # country (P17) is Switzerland (Q39)
wdt:P31 ?type.
?type (wdt:P279*) wd:Q207694.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
OPTIONAL {
?article schema:about ?museum .
}
}
GROUP BY ?museum ?museumLabel
ORDER BY DESC( ?wiki_article_count )
Further developing the query given in the question, one can obtain the number of linked wiki-pages (I am absolutely not sure about the nomenclature).
Please try the Wikidata Query Service here: https://w.wiki/334g
For example, the Wikidata item https://www.wikidata.org/wiki/Q194626 (Kunstmuseum Basel) has 22 linked wiki pages: 21 Wikipedia language editions and one multilingual site.
I would be thankful for any comments on this answer and could improve this answer accordingly.

Wikidata sparql query returns 0 result

I’m new to query languages and linked data, so thanks a lot for the help. I also have a similar question about SPARQL on DBpedia:
dbpedia sparql query returns 0 result
I would like to look up all the art movements in wikidata with the associated artists (founder/inventor/creator, known for), date start, date end, country. Here is my query:
PREFIX wdno: <http://www.wikidata.org/prop/novalue/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?art ?artLabel ?start ?end ?countryLabel ?influencebyLabel WHERE {
?art wdt:P31 wd:Q968159 ;
wdt:P571 ?start ;
wdt:P576 ?end;
wdt:P17 ?country ;
wdt:P737 ?influenceby .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
When I run the query for artLabel only, it returns 300-odd results. But only a few of them have a start date, so when I include the start date my result set shrinks, and when I include the remaining properties, very few records have all the information. How can I generate results in which the empty cells are kept instead of the whole row being discarded?
Also, what’s the difference between this Wikidata result and the DBpedia result?
Thanks
Answered in the comments...
Ettore Rizza wrote:
You need to add optional clauses for the properties that are not always filled.
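The advice from the comment can be sketched as a small Python helper that assembles the query with each sparsely filled property wrapped in its own OPTIONAL block, so that a missing value leaves the cell empty instead of dropping the row. This is only an illustration: the helper name build_query and the OPTIONAL_PROPS table are hypothetical, the property IDs are the ones from the question, and the wd:, wdt:, wikibase: and bd: prefixes are assumed to be predefined, as they are on the Wikidata Query Service.

```python
# (property id, variable name, project the label instead of the raw value)
# Property ids and variable names are taken from the question's query.
OPTIONAL_PROPS = [
    ("P571", "start", False),
    ("P576", "end", False),
    ("P17", "country", True),
    ("P737", "influenceby", True),
]


def build_query(optional_props=OPTIONAL_PROPS):
    """Return a SPARQL query string with one OPTIONAL block per property."""
    projected = ["?art", "?artLabel"]
    # Only the class membership is mandatory for a row to appear.
    body = ["  ?art wdt:P31 wd:Q968159 ."]
    for prop, var, use_label in optional_props:
        projected.append(f"?{var}Label" if use_label else f"?{var}")
        # Separate OPTIONAL blocks: each property may be missing independently.
        body.append(f"  OPTIONAL {{ ?art wdt:{prop} ?{var} . }}")
    body.append(
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }'
    )
    return "SELECT " + " ".join(projected) + " WHERE {\n" + "\n".join(body) + "\n}"


if __name__ == "__main__":
    print(build_query())
```

Keeping each property in a separate OPTIONAL block matters: putting all four patterns into a single OPTIONAL would bind none of them unless every one of them is present.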

How to search for rdfs:labels in dbpedia which are partial matches to a given term using SPARQL?

I am using this query to search for all labels that contain the word "Medi"
select distinct ?label where
{
?concept rdfs:label ?label
filter contains(?label,"Medi")
filter(langMatches(lang(?label),"en"))
}
However, as soon as I change the term from "Medi" to "Medicare", it stops working and times out. How do I get it to work with longer words like Medicare, i.e. extract all labels that contain the word Medicare?
Your query has to iterate over all labels in DBpedia, which is quite a large number, and then apply a string containment check to each one. This is indeed expensive.
Even a count query leads to an "estimated timeout error":
select count(?label) where
{
?concept rdfs:label ?label
filter(regex(str(?label),"Medi"))
filter(langMatches(lang(?label),"en"))
}
Two options:
Virtuoso has some fulltext search support:
SELECT DISTINCT ?label WHERE {
?concept rdfs:label ?label .
?label bif:contains "Medicare"
FILTER(langMatches(lang(?label),"en"))
}
Since the public DBpedia endpoint is a shared endpoint, the solution is to load the DBpedia dataset into your own triple store, e.g. Virtuoso. There you can adjust the max. estimated execution timeout parameter.

How to query for publication date of books using SPARQL in DBPEDIA

I am trying to retrieve the publication date and the number of pages for books in DBpedia. I tried the following query and it gives empty results. I see that these are properties of the Book class (http://mappings.dbpedia.org/server/ontology/classes/Book), but I could not retrieve them.
I would like to know if there is an error in the code or if dbpedia does not store these dates related to books.
SELECT ?book ?genre ?date ?numberOfPages
WHERE {
?book rdf:type dbpedia-owl:Book .
?book dbp:genre ?genre .
?book dbp:firstPublicationDate ?date .
OPTIONAL {?book dbp:numberOfPages ?numberOfPages .}
}
The dbp:firstPublicationDate pattern does not work for two reasons:
First, as pointed out in the first answer, you used the wrong prefix.
Second, even after you correct it, you still get no results. The best thing to do then is to test with the minimum number of patterns: in your case, ask only for books with a first publication date, which takes just two triple patterns. If you still don't get results, check how <http://dbpedia.org/ontology/firstPublicationDate> is actually used, with a query like this:
SELECT ?class (COUNT (DISTINCT ?s) AS ?instances)
WHERE {
?s <http://dbpedia.org/ontology/firstPublicationDate> ?date ;
a ?class
}
GROUP BY ?class
ORDER BY DESC(?instances)
LIMIT 1000
Mapping based properties are using the namespace http://dbpedia.org/ontology/, thus, the prefix must be dbo instead of dbp, which stands for http://dbpedia.org/property/.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?book ?genre ?date ?numberOfPages
WHERE {
?book a dbo:Book ;
dbp:genre ?genre ;
dbo:firstPublicationDate ?date .
OPTIONAL {?book dbp:numberOfPages ?numberOfPages .}
}
Some additional comments:
Include the prefix declarations in the SPARQL query so that others here can run it without errors (also in the future). The current query uses dbpedia-owl, which is no longer predefined on the official DBpedia endpoint; the prefix is now called dbo.
Which brings me to the second point: if you're using a public SPARQL endpoint, show its URL.
You can start debugging your own SPARQL query by running only part of it and then adding more triple patterns. For example, in your case you could check whether there is any triple with the property at all:
PREFIX dbp: <http://dbpedia.org/property/>
SELECT * WHERE {?book dbp:firstPublicationDate ?date } LIMIT 10
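The incremental debugging approach described above could also be scripted. What follows is a rough sketch, assuming the triple patterns are kept as plain strings; incremental_queries is a hypothetical helper, and actually sending each query to an endpoint is left out.

```python
def incremental_queries(patterns, prefixes="", limit=10):
    """Yield one SPARQL query per step, adding one triple pattern each time.

    The first query contains only patterns[0], the second patterns[0:2],
    and so on. The step at which results disappear points at the pattern
    that matches nothing.
    """
    for n in range(1, len(patterns) + 1):
        body = "\n".join(f"  {p}" for p in patterns[:n])
        yield f"{prefixes}SELECT * WHERE {{\n{body}\n}} LIMIT {limit}"


if __name__ == "__main__":
    prefixes = (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbp: <http://dbpedia.org/property/>\n"
    )
    patterns = [
        "?book a dbo:Book .",
        "?book dbo:firstPublicationDate ?date .",
        "?book dbp:genre ?genre .",
    ]
    for query in incremental_queries(patterns, prefixes):
        print(query)
        print("-" * 40)
```

Each generated query can be pasted into the endpoint's query form in turn; a small LIMIT keeps every step cheap.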
Update
As Ivo Velitchkov noticed in his answer below, the property dbo:firstPublicationDate is only used for mangas, etc., i.e. written work that was published periodically. Thus, the result will be empty.