Optimizing an aggregation query against Wikidata - sparql

I am running an aggregation query against Wikidata. The query tries to calculate the average duration of films, grouped by their genre and their year of publication.
The nested groupings/subqueries are there to keep a many-to-one (n-1) relationship from film to the grouping criteria (year and genre) and a one-to-one (1-1) relationship between a film and its duration. The reason is to get approximately correct aggregations (n-1 relationships will be familiar to OLAP and data warehousing practitioners).
More explanation is embedded in the query as comments. Because of this requirement, I cannot drop the groupings done in the subqueries, the IF expressions, or the group concatenation. This query times out on the Wikidata SPARQL endpoint.
QUESTION
I need suggestions for improving performance. Any optimization hints? If that is not possible, is anyone aware of an authenticated way (so that they know I am not playing around) to query Wikidata with a longer timeout, or a way to increase the timeout in general?
# Average duration of films, grouped by their genre and the year of publication
SELECT
  ?genre1                     # film genre
  ?year1                      # film year of publication
  (AVG(?duration1) AS ?avg)   # film average duration
WHERE
{
  # Calculate the average duration for each single film.
  # As some films have multiple durations, these durations are
  # averaged by grouping the durations by film.
  # Hence, a single duration for each film is projected out of the subquery.
  {
    select ?film (avg(?duration) as ?duration1)
    where {
      ?film <http://www.wikidata.org/prop/direct/P2047> ?duration .
    } group by ?film
  }
  # Here the grouping criteria (genre and year) are calculated.
  # The criteria are grouped by film, so that if multiple
  # genres/years exist for a single film, all of them are
  # concatenated into a single value.
  # If a film lacks a year or genre value, a dummy value
  # "OtherYear"/"OtherGenre" is generated.
  {
    select ?film (
      IF
      (
        group_concat(distinct ?year ; separator="-- ") != "",
        # If multiple years exist for a single film, all of them are concatenated into a single value.
        group_concat(distinct ?year ; separator="-- "),
        # If a film has no year value, the dummy value "OtherYear" is generated.
        "OtherYear"
      )
      as ?year1
    )
    (
      IF
      (
        group_concat(distinct ?genre ; separator="-- ") != "",
        # If multiple genres exist for a single film, all of them are concatenated into a single value.
        group_concat(distinct ?genre ; separator="-- "),
        # If a film has no genre value, the dummy value "OtherGenre" is generated.
        "OtherGenre"
      )
      as ?genre1
    )
    where
    {
      ?film <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q11424> .
      optional {
        ?film <http://www.wikidata.org/prop/direct/P577> ?date .
        BIND(year(?date) AS ?year)
      }
      optional {
        ?film <http://www.wikidata.org/prop/direct/P136> ?genre .
      }
    } group by ?film
  }
} GROUP BY ?year1 ?genre1

The query seems to work after replacing the two IF expressions with a simple sample (which picks an arbitrary value from the group):
(sample(?year) as ?year1)
(sample(?genre) as ?genre1)
So it appears that the expense of group_concat is the main problem. I don't find that very intuitive and have no explanation.
Maybe the version with sample is good enough, or at least it may give you a baseline point for further improvements.
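For reference, here is a minimal sketch of the sample variant written out in full. The COALESCE wrappers are not part of the answer above; they are an assumption about how the "OtherYear"/"OtherGenre" placeholders could be kept once the IF expressions are gone. The wd: and wdt: prefixes simply abbreviate the full IRIs used in the question. Note that a film with several genres or years is then counted under one arbitrary value only, and there is no guarantee that this finishes within the endpoint's timeout.

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?genre1 ?year1 (AVG(?duration1) AS ?avg)
WHERE {
  # One averaged duration per film (same as in the question).
  {
    SELECT ?film (AVG(?duration) AS ?duration1)
    WHERE { ?film wdt:P2047 ?duration . }
    GROUP BY ?film
  }
  # One arbitrary year and genre per film instead of GROUP_CONCAT.
  # COALESCE (an assumption, not from the answer) falls back to the
  # placeholder when SAMPLE has no bound value to pick.
  {
    SELECT ?film
           (COALESCE(SAMPLE(?year), "OtherYear") AS ?year1)
           (COALESCE(SAMPLE(?genre), "OtherGenre") AS ?genre1)
    WHERE {
      ?film wdt:P31 wd:Q11424 .
      OPTIONAL { ?film wdt:P577 ?date . BIND(YEAR(?date) AS ?year) }
      OPTIONAL { ?film wdt:P136 ?genre . }
    }
    GROUP BY ?film
  }
}
GROUP BY ?year1 ?genre1
```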

Related

Wikidata: an effective way to count items that share two properties

I would like to count the number of Wikidata items that have two properties at the same time. For example, a VIAF ID and a BnF ID, or a LoC ID and a SUDOC ID. The first way that comes to my mind would be a query like this:
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
?item wdt:P214 ?viaf.
?item wdt:P268 ?bnf.
}
But this query is inefficient (23 seconds) and, to apply it to 10 properties, would require 90 comparisons two by two. Is there a more efficient way to perform these calculations?

SPARQL Query to find results which do not meet a certain criteria

I am trying to write a SPARQL query which will return a set of patient identifier codes (?Crid) which have associated with them a specific diagnosis code (?ICD9) and DO NOT have associated with them a specific medication AND which have an order date (?OrderDate) prior to their recruitment date (?RecruitDate). I have incorporated the OBIB ontology into my graph.
Here is what I have so far (a bit simplified and with a few steps through the graph omitted for readability/sensitivity):
SELECT DISTINCT ?Crid WHERE
{
  ?Crid a obib:CRID .
  #-- Return CRIDs with a diagnosis
  ?Crid obib:hasPart ?ICD9 .
  ?ICD9 a obib:diagnosis .
  #-- Return CRIDs with a medical prescription record
  ?Crid obib:hasPart ?medRecord .
  ?medRecord a obib:medicalRecord .
  #-- Return CRIDs with an order date
  ?medRecord obib:hasPart ?OrderDate .
  ?OrderDate a obib:dateOfDataEntry .
  #-- Return CRIDs with a recruitment date
  ?Crid obib:hasPart ?FormFilling .
  ?FormFilling a obib:formFilling .
  ?RecruitDate obib:isAbout ?FormFilling .
  ?RecruitDate a obib:dateOfDataEntry .
  #-- Filter results for specific ICD9 codes
  FILTER (?ICD9 = '1')
  #-- Subtract Results with Certain Medication and Order Date Prior to Recruitment
  #-- This is the part that I think is giving me a problem
  MINUS {
    FILTER (regex (?medRecord, "medication_1", "i"))
    FILTER (?RecruitDate-?OrderDate < "P0D"^^xsd:dayTimeDuration)
  }
}
My gut feeling is that I am not using MINUS correctly. This query returns mostly the right results: I am expecting 10 results and it is returning 12. The extraneous 2 results did take "medication_1" and have order dates before their recruitment dates, so I do not want them to be included in the set.
In case it matters, I am using a Stardog endpoint to run this query and to store my graph data.
Instead of
#-- Subtract Results with Certain Medication and Order Date Prior to Recruitment
#-- This is the part that I think is giving me a problem
MINUS {
FILTER (regex (?medRecord, "medication_1", "i"))
FILTER (?RecruitDate-?OrderDate < "P0D"^^xsd:dayTimeDuration)
}
}
I'd probably just write this without MINUS as:
FILTER (!regex(?medRecord, "medication_1", "i"))
FILTER (?RecruitDate-?OrderDate >= "P0D"^^xsd:dayTimeDuration)
I'd also probably consider whether REGEX is the right tool here (would a simple string comparison work?), but that's a different issue.
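As for why the MINUS block in the question removes nothing: in SPARQL 1.1, MINUS only discards a solution when the right-hand pattern shares at least one bound variable with it, and a group that contains nothing but FILTERs binds no variables at all. If the goal is to exclude a patient whenever any of their records matches both conditions (an assumption about the intent), FILTER NOT EXISTS is one way to sketch it, since its pattern is evaluated with the outer bindings substituted in. In the sketch below, the obib: prefix IRI is a placeholder, ?otherRecord and ?otherOrderDate are names introduced for illustration, str() is added on the assumption that the record is an IRI rather than a literal, and the date subtraction is carried over unchanged from the question.

```sparql
PREFIX obib: <http://example.org/obib#>   # placeholder IRI (assumption)
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?Crid WHERE {
  ?Crid a obib:CRID .
  ?Crid obib:hasPart ?ICD9 .
  ?ICD9 a obib:diagnosis .
  ?Crid obib:hasPart ?medRecord .
  ?medRecord a obib:medicalRecord .
  ?medRecord obib:hasPart ?OrderDate .
  ?OrderDate a obib:dateOfDataEntry .
  ?Crid obib:hasPart ?FormFilling .
  ?FormFilling a obib:formFilling .
  ?RecruitDate obib:isAbout ?FormFilling .
  ?RecruitDate a obib:dateOfDataEntry .
  FILTER (?ICD9 = '1')

  # Exclude the patient if ANY of their medical records matches both
  # conditions. ?Crid and ?RecruitDate are visible inside the block
  # because NOT EXISTS substitutes the outer bindings into its pattern.
  FILTER NOT EXISTS {
    ?Crid obib:hasPart ?otherRecord .
    ?otherRecord a obib:medicalRecord .
    ?otherRecord obib:hasPart ?otherOrderDate .
    ?otherOrderDate a obib:dateOfDataEntry .
    FILTER (regex(str(?otherRecord), "medication_1", "i"))
    FILTER (?RecruitDate - ?otherOrderDate < "P0D"^^xsd:dayTimeDuration)
  }
}
```

The two plain FILTERs above work row by row instead, so a patient with one matching record and one non-matching record would still be returned by them.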

Stardog not returning expected results upon query involving division

This query is supposed to return the proportion of players in a specific competition who belong to a certain team.
However, when I run it on my Stardog database, nothing is returned.
Stardog doesn't even indicate that there were 0 results, nor does it fill in the column headers.
I pasted the query into yasgui.org's interface, and the query appears to be well formed (no syntax errors).
Does anyone have any idea why it doesn't return the expected results?
select ?competition ?team (COUNT(distinct ?team_1_player)/COUNT(distinct ?player) as ?proportion)
where {
?team_1_player prop:competesIn ?competition.
?team_1_player prop:memberOf ?team.
?player prop:competesIn ?competition.
}
group by ?competition ?team
order by desc(?proportion)
The following similar query does return the expected results. It does exactly the same thing, except that it returns the number of team players and the number of all players in the competition, instead of the proportion of players on a certain team.
select distinct ?competition ?team (COUNT(distinct ?team_1_player) as ?num_team_players) (COUNT(distinct ?player) as ?num_players)
where {
?team_1_player prop:competesIn ?competition.
?team_1_player prop:memberOf ?team.
?player prop:competesIn ?competition.
}
group by ?competition ?team
order by desc(?num_team_players)
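Nothing here explains why the first query comes back empty. One thing that might be worth trying, offered as an untested sketch and not as a diagnosis, is to keep the working query as a subquery and do the division in the outer SELECT, so that the division operates on already-bound counts. The prop: prefix IRI below is a placeholder; everything else reuses the names from the two queries above. Dividing two integers in SPARQL yields an xsd:decimal, so the result should not be truncated.

```sparql
PREFIX prop: <http://example.org/prop#>   # placeholder IRI (assumption)

SELECT ?competition ?team
       ((?num_team_players / ?num_players) AS ?proportion)
WHERE {
  # The working counting query, unchanged apart from being nested.
  {
    SELECT ?competition ?team
           (COUNT(DISTINCT ?team_1_player) AS ?num_team_players)
           (COUNT(DISTINCT ?player) AS ?num_players)
    WHERE {
      ?team_1_player prop:competesIn ?competition .
      ?team_1_player prop:memberOf ?team .
      ?player prop:competesIn ?competition .
    }
    GROUP BY ?competition ?team
  }
}
ORDER BY DESC(?proportion)
```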

How to get all the entities that do not have a given attribute?

I need to formulate a SPARQL query that returns me all entities that have a given number of values for a given attribute. For example, I want to have all the countries that border with exactly two other countries.
I also might want to find all countries that do not border any other country (so the number of values of the attribute "hasBorderWith" is equal to zero). In this context, it is not clear to me whether there is a difference between the following two cases:
An entity has zero values for the given attribute.
An entity does not have the given attribute.
For example, I can imagine that a country that has no borders with any other country simply does not have the "hasBorderWith" attribute. Will that cause a problem?
There are a couple of questions embedded here. To find countries bordered by exactly two countries, you'd need to group by the country match and get the count. Then use HAVING, which is executed after the aggregate has been calculated to filter by the count criteria:
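SELECT ?country (count(?bordered) AS ?borderCount)
WHERE {
  ?country a :Country .
  ?country :hasBorderWith ?bordered
} GROUP BY ?country
# The aggregate is repeated here: referring to the ?borderCount alias inside HAVING is not portable.
HAVING (count(?bordered) = 2)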
For the second question, I don't see a difference between 0 and no property, and this can be computed with a negation query:
SELECT ?country
WHERE {
?country a :Country .
FILTER NOT EXISTS {
?country :hasBorderWith ?x
}
}
EDIT: to find a count of 0
Per the questions and @ASKW's suggestion, the following would get a count of 0 if there are no hasBorderWith properties:
SELECT ?country (count(?bordered) AS ?borderCount)
WHERE {
  ?country a :Country .
  OPTIONAL {
    ?country :hasBorderWith ?bordered
  }
} GROUP BY ?country
# The aggregate is repeated here: referring to the ?borderCount alias inside HAVING is not portable.
HAVING (count(?bordered) = 0)
The OPTIONAL clause allows the match to occur, but will not contribute to the count(?bordered) aggregate if ?bordered is not bound, hence members of :Country without a :hasBorderWith property will get a count of 0.
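To see the OPTIONAL behaviour without any data loaded, here is a small self-contained sketch that uses inline VALUES blocks in place of the real triples. The namespace and the four countries :A to :D are made up for illustration; :D is the one with no borders.

```sparql
PREFIX : <http://example.org/>   # placeholder namespace for the toy data

SELECT ?country (COUNT(?bordered) AS ?borderCount)
WHERE {
  # Toy stand-in for "?country a :Country".
  VALUES ?country { :A :B :C :D }
  OPTIONAL {
    # Toy stand-in for "?country :hasBorderWith ?bordered".
    VALUES (?country ?bordered) { (:A :B) (:A :C) (:B :A) (:C :A) }
  }
}
GROUP BY ?country
ORDER BY ?country
```

On this toy data, :A should come back with a count of 2, :B and :C with 1, and :D with 0, because the row kept for :D leaves ?bordered unbound and COUNT skips it. Adding HAVING (COUNT(?bordered) = 0) after the GROUP BY would leave only :D.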

How to form SPARQL queries that refers to multiple resources

My question is a follow-up to my first question about SPARQL.
I have SPARQL query results for Mountain objects.
From those results I picked a certain object resource.
Now I want to get the values of the "is dbpedia-owl:highestPlace of" records for this chosen Mountain object.
That is, the names of the mountain ranges of which this mountain is the highest place.
This is, as far as I can tell, complex. Not only because I do not know the required syntax, but also because I get two objects here.
One of them is the Mont Blanc massif, which is of type "place".
The other is the Western Alps, which is of type "mountain range" and is the record I want.
I need record 2 above, but not record 1. I know record 1 is also relevant, but sometimes it doesn't follow the same pattern. Sometimes the records appear to be of a YAGO type, which can be totally misleading. To be safe, I simply want to discard those records whenever there is a type mismatch.
How can I form my SPARQL query to get these "is dbpedia-owl:highestPlace of" records and also have the type filtering?
You can use this query. Note, however, that Mont_Blanc_massif in your example is both a dbpedia-owl:Place and a dbpedia-owl:MountainRange.
select * where {
?place dbpedia-owl:highestPlace :Mont_Blanc.
?place rdf:type dbpedia-owl:MountainRange.
}
Edit after comment: filter
It is not really clear what you want to filter (yago?). Technically, you can filter like this, for example:
select * where {
  ?place dbpedia-owl:highestPlace :Mont_Blanc.
  ?place rdf:type dbpedia-owl:MountainRange.
  FILTER NOT EXISTS {
    ?place ?pred ?obj
    # str() turns an IRI into a string; regex is only defined for string arguments
    Filter (regex(str(?obj), "yago"))
  }
}
This filters out results that have any object with 'yago' in its URL.
Extending the result from the previous answer, the appropriate query would be
select * where {
?mountain a dbpedia-owl:Mountain ;
dbpedia-owl:abstract ?abstract ;
foaf:depiction ?depiction .
?range a dbpedia-owl:MountainRange ;
dbpedia-owl:highestPlace ?mountain .
FILTER(langMatches(lang(?abstract),"EN"))
}
LIMIT 10
This selects mountains with English abstracts that have at least one depiction (or else the pattern wouldn't match) and for which there is some mountain range of which the mountain is the highest place. Without the parts from the earlier question, if you just want to retrieve mountains that are the highest place of a range, you can use a query like this:
select * where {
?mountain a dbpedia-owl:Mountain .
?range a dbpedia-owl:MountainRange ;
dbpedia-owl:highestPlace ?mountain .
}
LIMIT 10