Speeding up SPARQL query on GraphDB - sparql

I'm trying to speed up and to optimize this query
select distinct ?root where {
?root a :Root ;
:hasnode* ?node ;
:hasnode* ?node2 .
?node a :Node ;
:hasAnnotation ?ann .
?ann :hasReference ?ref .
?ref a :ReferenceType1 .
?node2 a :Node ;
:hasAnnotation ?ann2 .
?ann2 :hasReference ?ref2 .
?ref2 a :ReferenceType2 .
}
Basically, I'm analyzing some trees and I want to get all trees (i.e., trees' roots) which do have at least a couple of underlying nodes with a pattern like this one:
?node_x a :Node ;
:hasAnnotation ?ann_x .
?ann_x :hasReference ?ref_x .
?ref_x a :ReferenceTypex .
one with x = 1 and the other with x = 2.
Since in my graph one node may have at most one :hasAnnotation predicate, I do not have to specify that those nodes must be different.
The problem
The aforementioned query describes what I need but does have a very bad performance. After minutes and minutes of execution, it is still running.
My (ugly) solution: breaking it in half
I noticed that if I look for one node pattern at a time, I get my result in a few seconds(!).
Sadly enough, my current approach consists of running the following type of query twice:
select distinct ?root where {
?root a :Root ;
:hasnode* ?node .
?node a :Node ;
:hasAnnotation ?ann_x .
?ann_x :hasReference ?ref_x .
?ref_x a :ReferenceTypex .
}
one with x = 1 and the other with x = 2.
I save the partial results (i.e., the ?root bindings) in two sets, say R1 and R2, and finally compute the intersection of those result sets.
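The client-side step of this workaround can be sketched roughly as follows. This is only a minimal Python illustration: the root IRIs and the helper name `roots_matching_both` are made up, and the two lists stand in for whatever the two query runs return.

```python
def roots_matching_both(r1_rows, r2_rows):
    """Return the roots present in both result sets (R1 intersected with R2), sorted."""
    return sorted(set(r1_rows) & set(r2_rows))

# Hypothetical results of the x = 1 run and the x = 2 run.
r1 = ["ex:rootA", "ex:rootB", "ex:rootC"]
r2 = ["ex:rootB", "ex:rootC", "ex:rootD"]

both = roots_matching_both(r1, r2)
print(both)
```

Only roots that show up in both runs survive, which is exactly what the single combined query is supposed to return.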
Is there a way to speed up my initial approach to get results just by leveraging SPARQL?
PS: I'm working with GraphDB.

Without knowing the specific dataset, I can only give you some general directions on how to optimize the query:
Avoid using DISTINCT for large datasets
The GraphDB query optimiser will not automatically rewrite the query to use EXISTS for patterns that do not participate in the projection. What the query means semantically is "check that at least one such pattern exists", not "give me all bindings and then eliminate the duplicate results".
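To make the difference concrete, here is a small Python sketch. It is not GraphDB's actual query plan; the toy data and the names `matches`, `roots_via_distinct` and `roots_via_exists` are invented for illustration. The first function enumerates every (root, node, node2) combination before de-duplicating, the second one only checks that a match of each kind exists.

```python
from itertools import product

# Hypothetical toy data: for each root, the nodes matching pattern 1 and pattern 2.
matches = {
    "rootA": {"type1": ["n1", "n2", "n3"], "type2": ["n4", "n5"]},
    "rootB": {"type1": ["n6"], "type2": []},
}

def roots_via_distinct(data):
    """Enumerate every (root, node, node2) binding, then de-duplicate the roots."""
    rows = [(root, a, b)
            for root, m in data.items()
            for a, b in product(m["type1"], m["type2"])]
    return len(rows), sorted({root for root, _, _ in rows})

def roots_via_exists(data):
    """Keep a root as soon as one match of each kind is known to exist."""
    return sorted(root for root, m in data.items()
                  if any(m["type1"]) and any(m["type2"]))

row_count, distinct_roots = roots_via_distinct(matches)
exists_roots = roots_via_exists(matches)
print(row_count, distinct_roots, exists_roots)
```

Both functions return the same roots, but the DISTINCT-style version has to build one row per pair of matching nodes first, and that number grows multiplicatively with the size of each tree.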
Materialize the property paths
GraphDB has a very efficient forward-chaining reasoner and a comparatively less optimised property path expansion. If you are not concerned about write/data update performance, I suggest declaring :hasnode as a transitive property (see owl:TransitiveProperty), which eliminates the property path wildcard. This will speed up the query many times over.
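What materialising a transitive property amounts to can be sketched in a few lines of Python. This is a naive illustration of forward chaining, not how GraphDB implements its reasoner; the edge data and the function name `materialize_transitive` are assumptions made for the example.

```python
def materialize_transitive(edges):
    """Forward-chain a transitive property: add (a, d) whenever (a, b) and (b, d) hold."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Hypothetical asserted :hasnode edges of one small tree.
asserted = {("root", "n1"), ("n1", "n2"), ("n2", "n3")}
inferred = materialize_transitive(asserted)
print(sorted(inferred))
```

After materialisation every descendant is one direct edge away from the root, so the query no longer needs a path operator. Note that `*` in a property path also matches the zero-length path (the root itself), which a materialised transitive property does not cover; that only matters if a root can itself be a matching node.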
Your final query should look like:
select ?root where {
?root a :Root ;
:hasnode ?node ;
:hasnode ?node2 .
FILTER (?node != ?node2)
FILTER EXISTS {
?node a :Node ;
:hasAnnotation ?ann .
?ann :hasReference ?ref .
?ref a :ReferenceType1 .
}
FILTER EXISTS {
?node2 a :Node ;
:hasAnnotation ?ann2 .
?ann2 :hasReference ?ref2 .
?ref2 a :ReferenceType2 .
}
}

Well, putting together auto-hint :) and Stanislav's suggestion I came up with a solution.
Solution 1: nested query
Nesting the query in the following way, I get the result in 15s.
select distinct ?root where {
?root a :Root ;
:hasnode* ?node .
?node a :Node ;
:hasAnnotation ?ann .
?ann :hasReference ?ref .
?ref a :ReferenceType1 .
{
select distinct ?root where {
?root a :Root ;
:hasnode* ?node2 .
?node2 a :Node ;
:hasAnnotation ?ann2 .
?ann2 :hasReference ?ref2 .
?ref2 a :ReferenceType2 .
}
}
}
Solution 2: groups into {}
Grouping parts into {}, as suggested by Stanislav, took 60s.
select distinct ?root where {
{
?root a :Root ;
:hasnode* ?node .
?node a :Node ;
:hasAnnotation ?ann .
?ann :hasReference ?ref .
?ref a :ReferenceType1 .
}
{
?root a :Root ;
:hasnode* ?node2 .
?node2 a :Node ;
:hasAnnotation ?ann2 .
?ann2 :hasReference ?ref2 .
?ref2 a :ReferenceType2 .
}
}
Probably GraphDB's optimizer builds a more effective query plan for my data in the first case (explanations are welcome).
I've always thought about SPARQL in a 'declarative' way, but it seems there is massive variability in performance depending on the way you write your SPARQL. Coming from SQL, it seems to me that this variability is much greater than what happens in the relational world.
However, reading this post, it seems I'm not sufficiently aware of SPARQL optimizer dynamics. :)

Related

Questions Re: AnzoGraph Tutorial SPARQL Queries

I'm going through the AnzoGraph tutorial. I have 2 questions: 1) They often use desc, e.g. desc(?sum) in the Order By or other parts of the query. I'm not familiar with this and didn't find much help in the SPARQL documentation. What does desc do and why is it needed? 2) Sometimes, when I would expect to see a number in the output I see an alphanumeric string. E.g., in the query:
SELECT ?location ?kind ?name ?list_date (((?selldate - ?list_date)) as ?sale_age)
FROM <tickit>
WHERE {
?sale <saletime> ?selldate .
?sale <eventid> ?event .
?listing <eventid> ?event .
?listing <listtime> ?list_date .
?event <eventname> ?name .
?event <venueid> ?venue .
?event <catid> ?cat .
?cat <catname> ?kind .
?venue <venuename> ?location .
}
ORDER BY desc(?selldate) desc(?list_date) ?location ?kind ?name
I expect to see ?sale_age as an integer, but it is displayed as an alphanumeric string such as: P1DT1H50M.
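For what it's worth, a value such as P1DT1H50M has the shape of an xsd:duration (ISO 8601 duration) literal, here 1 day, 1 hour and 50 minutes, which is what one would expect from subtracting two date-time values. A minimal Python sketch for turning such a string into a number follows; it assumes only day, hour, minute and second components occur, and the helper name `parse_day_time_duration` is made up.

```python
import re
from datetime import timedelta

# Day/time part of an ISO 8601 duration, e.g. "P1DT1H50M" (no years or months).
_DURATION = re.compile(
    r"^(?P<sign>-)?P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)

def parse_day_time_duration(text):
    """Convert a day/time duration string into a datetime.timedelta."""
    m = _DURATION.match(text)
    if not m:
        raise ValueError(f"not a day/time duration: {text!r}")
    delta = timedelta(
        days=int(m["days"] or 0),
        hours=int(m["hours"] or 0),
        minutes=int(m["minutes"] or 0),
        seconds=float(m["seconds"] or 0),
    )
    return -delta if m["sign"] else delta

sale_age = parse_day_time_duration("P1DT1H50M")
print(sale_age.total_seconds())
```

From the resulting timedelta you can derive whole days, hours or seconds, whichever "integer" you were expecting.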

SPARQL property paths based on a new property defined in a CONSTRUCT subquery

Given the following schema, "driver-passenger" lineages can be easily seen:
tp:trip a owl:Class ;
rdfs:label "trip"@en ;
rdfs:comment "an 'asymmetric encounter' where someone is driving another person."@en .
tp:driver a owl:ObjectProperty ;
rdfs:label "driver"@en ;
rdfs:comment "has keys."@en ;
rdfs:domain tp:trip ;
rdfs:range tp:person .
tp:passenger a owl:ObjectProperty ;
rdfs:label "passenger"@en ;
rdfs:comment "has drinks."@en ;
rdfs:domain tp:trip ;
rdfs:range tp:person .
Consider the following data:
<alice> a tp:person .
<grace> a tp:person .
<tim> a tp:person .
<ruth> a tp:person .
<trip1> a tp:trip ;
tp:participants <alice> , <grace> ;
tp:driver <alice> ;
tp:passenger <grace> .
<trip2> a tp:trip ;
tp:participants <alice> , <tim> ;
tp:driver <alice> ;
tp:passenger <tim> .
<trip3> a tp:trip ;
tp:participants <tim> , <grace> ;
tp:driver <tim> ;
tp:passenger <grace> .
<trip4> a tp:trip ;
tp:participants <grace> , <ruth> ;
tp:driver <grace> ;
tp:passenger <ruth> .
<trip5> a tp:trip ;
tp:participants <grace> , <tim> ;
tp:driver <grace> ;
tp:passenger <tim> .
Now let a "driver-passenger descendent" be any tp:passenger at the end of a trip sequence where the tp:passenger of one trip is the tp:driver of the next trip
Ex. <ruth> is a descendent of <alice> according to the following sequence of trips:
<trip2> -> <trip3> -> <trip4>.
Question:
How to get the (ancestor,descendent) pairs of all driver-passenger lineages?
Attempt 1:
I initially tried the following CONSTRUCT subquery to define an object property: tp:drove, which can be easily used in a property path. However, this did not work on my actual data:
SELECT ?originalDriver ?passengerDescendent
WHERE {
?originalDriver tp:drove+ ?passengerDescendent .
{
CONSTRUCT { ?d tp:drove ?p . }
WHERE { ?t a tp:trip .
?t tp:driver ?d .
?t tp:passenger ?p .}
}
}
Attempt 2:
I tried to create a property path which expresses an ancestor as the driver of a passenger, but I don't think I've properly understood how this is supposed to work:
(tp:driver/^tp:passenger)+
Regarding MWE: Is there some kind of RDF sandbox that would allow me to create an MWE by defining a simple ontology like tp above, along with some sample data? The following "playgrounds" are available but none of them seem to support defining a toy ontology: SPARQL Playground, SPARQL Explorer.
Notes on related content:
This question is directly related to a previous question, but no longer requires saving the paths themselves, a feature not directly supported by SPARQL 1.1.
This answer by Joshua Taylor seems relevant, but doesn't address the identification of specific types of paths, such as the lineages defined above.
This one seems to do the trick:
select ?driver ?passenger where {
?driver (^tp:driver/tp:passenger)+ ?passenger .
filter( ?driver != ?passenger)
}
The filter condition can be removed if you want to also see relationships that lead back to the same person.
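To see which pairs this path should produce on the sample data, here is a rough Python equivalent. It is only a sketch: the list `trips` restates the (driver, passenger) pairs of trip1 to trip5 from the question, and the function name `lineage_pairs` is made up. One step of `^tp:driver/tp:passenger` goes from a driver to a passenger of the same trip, and `+` repeats that step one or more times.

```python
# (driver, passenger) of trip1..trip5 from the sample data.
trips = [
    ("alice", "grace"),
    ("alice", "tim"),
    ("tim", "grace"),
    ("grace", "ruth"),
    ("grace", "tim"),
]

def lineage_pairs(trips):
    """Return all (ancestor, descendent) pairs reachable via one or more 'drove' steps."""
    drove = {}
    for driver, passenger in trips:
        drove.setdefault(driver, set()).add(passenger)
    pairs = set()
    for start in drove:
        stack = list(drove[start])
        seen = set()
        while stack:
            person = stack.pop()
            if person in seen:
                continue
            seen.add(person)
            stack.extend(drove.get(person, ()))
        # Same effect as filter(?driver != ?passenger).
        pairs.update((start, p) for p in seen if p != start)
    return pairs

pairs = lineage_pairs(trips)
print(sorted(pairs))
```

On this data the result contains seven pairs, including (alice, ruth), which corresponds to the trip2 -> trip3 -> trip4 sequence mentioned in the question.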

Length of path between nodes, allowing for multiple relation types (SPARQL)

I have to write a SPARQL query that returns the length of the path between two nodes (:persA and :persD) that are connected by these relationships:
@prefix : <http://www.example.org/> .
:persA :knows :knowRelation1 .
:knowRelation1 :hasPerson :persB .
:persB :knows :knowRelation2 .
:knowRelation2 :hasPerson :persC .
:persC :knows :knowRelation3 .
:knowRelation3 :hasPerson :persD .
I tried with this query:
PREFIX : <http://www.example.org/>
SELECT (COUNT(?mid) AS ?length)
WHERE
{
:persA (:knows | :hasPerson)* ?mid .
?mid (:knows | :hasPerson)+ :persD .
}
The result seems to be an infinite loop.
Any advice/examples of how this can be done?
After fixing some syntax errors in your initial post, the provided triples and query work for me in GraphDB Free 8.2 and BlazeGraph 2.1.1. I have since applied these edits to your post itself.
added a trailing / to your definition of the empty prefix
added a trailing . to your prefix definition line (required if you want to start with a @)
fixed the spelling of length (OK, that's just a cosmetic fix)
@prefix : <http://www.example.org/> .
:persA :knows :knowRelation1 .
:knowRelation1 :hasPerson :persB .
:persB :knows :knowRelation2 .
:knowRelation2 :hasPerson :persC .
:persC :knows :knowRelation3 .
:knowRelation3 :hasPerson :persD .
PREFIX : <http://www.example.org/>
SELECT (COUNT(?mid) AS ?length)
WHERE
{ :persA (:knows|:hasPerson)* ?mid .
?mid (:knows|:hasPerson)+ :persD
}
result:
length
"6"^^xsd:integer

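Why the count comes out as 6 can be checked with a small Python sketch. It mirrors the two path patterns on the sample triples; the dictionary `graph` restates those triples with both predicates treated alike, and the helper name `reachable` is an assumption. Every ?mid is a node reachable from :persA in zero or more steps that can still reach :persD in one or more steps.

```python
from collections import deque

# The sample triples, with :knows and :hasPerson treated as the same kind of edge.
graph = {
    "persA": ["knowRelation1"],
    "knowRelation1": ["persB"],
    "persB": ["knowRelation2"],
    "knowRelation2": ["persC"],
    "persC": ["knowRelation3"],
    "knowRelation3": ["persD"],
}

def reachable(graph, start, include_start):
    """Nodes reachable from start; include_start=True mimics '*', False mimics '+'."""
    seen = {start} if include_start else set()
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen

mids = sorted(
    mid for mid in reachable(graph, "persA", True)
    if "persD" in reachable(graph, mid, False)
)
print(len(mids), mids)
```

Every node on the chain except :persD qualifies, so the count equals the number of edges between the two endpoints. This only works as a path length when there is a single linear path between them.
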
Query evaluation took too long

I am using GraphDB Free 7.1 and have created a repository with the default settings. I have uploaded a ttl file with 2.7 million triples. I am trying to issue a moderately complex query that should return 200k answers, but the Workbench displays just 1k answers and the GraphDB log shows an exception:
10:52:19.580 [repositories/PaaSport] INFO c.o.f.sesame.RepositoryController - POST query -1325396809
10:52:29.594 [repositories/PaaSport] ERROR o.o.h.s.r.TupleQueryResultView - Query interrupted
org.openrdf.query.QueryInterruptedException: Query evaluation took too long
...
10:52:29.594 [repositories/PaaSport] INFO o.o.h.s.r.TupleQueryResultView - Request for query -1325396809 is finished
The query I'm using is:
SELECT DISTINCT ?offering ?Value
WHERE {
?offering a paasport:Offering ;
DUL:satisfies ?groundDescription .
?groundDescription paasport:offers ?characteristic .
?characteristic a paasport:Storage ;
DUL:hasParameter ?par .
?par a paasport:StorageCapacity ;
DUL:hasParameterDataValue ?Value ;
DUL:parametrizes ?qualityValue .
?qualityValue uomvocab:measuredIn ?Units .
?Units a ?AppParMeasureUnitType .
ucum:GB a ?AppParMeasureUnitType .
?Units a uomvocab:SimpleDerivedUnit .
ucum:GB a uomvocab:SimpleDerivedUnit .
ucum:GB uomvocab:derivesFrom ?BasicUnit .
?Units uomvocab:derivesFrom ?BasicUnit .
ucum:GB uomvocab:modifierPrefix ?prefix1 .
?Units uomvocab:modifierPrefix ?prefix2 .
?prefix1 uomvocab:factor ?Factor1 .
?prefix2 uomvocab:factor ?Factor2 .
FILTER( xsd:double(?Factor2)*?Value = xsd:double(?Factor1)*4)
}
Since the query timeout is set to 0, I am not sure what causes the query interruption exception; most probably memory problems?
Very simple queries (e.g. return all instances of a certain class) work OK.
Are there any hints? Any help would be appreciated.
I can provide more details if needed.
Best,
Nick
Actually I have managed to cut down the query to the minimum, in order to be answered. The problem was mainly due to the following triple patterns:
ucum:GB rdf:type ?AppParMeasureUnitType .
ucum:GB rdf:type uomvocab:SimpleDerivedUnit .
ucum:GB uomvocab:derivesFrom ?BasicUnit .
If these are omitted and the corresponding variables in the original query are replaced by constant resources, then the query is answered.
Here is the resulting query:
SELECT DISTINCT ?offering ?Value
WHERE {
?offering rdf:type paasport:Offering .
?offering DUL:satisfies ?groundDescription .
?groundDescription paasport:offers ?characteristic .
?characteristic rdf:type paasport:Storage .
?characteristic DUL:hasParameter ?par .
?par rdf:type paasport:StorageCapacity .
?par DUL:hasParameterDataValue ?Value .
?par DUL:parametrizes ?qualityValue .
?qualityValue uomvocab:measuredIn ?Units .
?Units rdf:type ucum:UnitOf-infotech .
?Units rdf:type uomvocab:SimpleDerivedUnit .
?Units uomvocab:derivesFrom <http://purl.oclc.org/NET/muo/ucum/unit/amount-of-information/byte> .
ucum:GB uomvocab:modifierPrefix ?prefix1 .
?Units uomvocab:modifierPrefix ?prefix2 .
?prefix1 uomvocab:factor ?Factor1 .
?prefix2 uomvocab:factor ?Factor2 .
FILTER( xsd:double(?Factor2)*?Value >= xsd:double(?Factor1)*2)
}
Do you really need DISTINCT? That always slows down things, since it has to fetch all results in memory, sort them and uniq them before starting to serve.
Do you need = or >= in the FILTER? If = then replace the filter with a BIND(...as ?Factor2), and put these things before searching for ?prefix2
Have you profiled the query? http://graphdb.ontotext.com/documentation/standard/explain-plan.html
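The idea behind the BIND suggestion can be sketched in Python. Everything here is hypothetical (the unit factors, the offerings and the function names), and it says nothing about how GraphDB actually evaluates the query: the point is only that with an equality you can compute the required factor first and look it up, instead of pairing every candidate with every unit and testing afterwards.

```python
# Hypothetical data standing in for the prefix factors and the offerings.
factor_of_unit = {"MB": 1e6, "GB": 1e9, "TB": 1e12}
offerings = {"offer1": (4.0, "GB"), "offer2": (4000.0, "MB"), "offer3": (1.0, "TB")}
target = factor_of_unit["GB"] * 4  # right-hand side of the filter: Factor1 * 4

def match_with_filter():
    """Pair every offering with every unit, then test the equality on each row."""
    return sorted(
        offer
        for offer, (value, unit) in offerings.items()
        for candidate, factor in factor_of_unit.items()
        if candidate == unit and factor * value == target
    )

# Index that lets a known factor be looked up directly.
units_by_factor = {factor: unit for unit, factor in factor_of_unit.items()}

def match_with_bind():
    """Compute the required factor first (the BIND), then look it up directly."""
    hits = []
    for offer, (value, unit) in offerings.items():
        needed = target / value  # plays the role of BIND(... AS ?Factor2)
        if units_by_factor.get(needed) == unit:
            hits.append(offer)
    return sorted(hits)

filtered = match_with_filter()
bound = match_with_bind()
print(filtered, bound)
```

Both functions return the same offerings here. The values were chosen so that the floating-point arithmetic is exact; with real data, exact equality on doubles deserves some care.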

How do I fit that SPARQL calculation?

this is my actual problem:
?var0 is a group variable and ?var1 is not. But whenever I try to validate the syntax, there comes the following error message:
Non-group key variable in SELECT: ?var1 in expression ( sum(?var0) / ?var1 )
The complete Query:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cz: <http://www.vs.cs.hs-rm.de/ontostor/SVC#Cluster>
PREFIX n: <http://www.vs.cs.hs-rm.de/ontostor/SVC#Node>
SELECT ( (SUM(?var0) / ?var1) AS ?result)
WHERE{
?chain0 rdf:type rdfs:Property .
?chain0 rdfs:domain <http://www.vs.cs.hs-rm.de/ontostor/SVC#Cluster> .
?chain0 rdfs:range <http://www.vs.cs.hs-rm.de/ontostor/SVC#Node> .
?this ?chain0 ?arg0 .
?arg0 n:node_realtime_cpu ?var0 .
?this cz:node_count ?var1 .
}
My question is how to correct that calculation to fit the SPARQL syntax?
The immediate problem is that ?var1 is not grouped on, so a fix would be to simply append
GROUP BY ?var1
at the end of your query.
However, whether that gives you the calculation you actually want is another matter.
It's not quite clear what you're trying to calculate, but it looks as if you're attempting to determine the average node_realtime_cpu for a cluster. If that is the case, you can probably do your calculation by just using SPARQL's AVG function instead:
SELECT ( AVG(?var0) AS ?result)
WHERE{
?chain0 rdf:type rdfs:Property .
?chain0 rdfs:domain <http://www.vs.cs.hs-rm.de/ontostor/SVC#Cluster> .
?chain0 rdfs:range <http://www.vs.cs.hs-rm.de/ontostor/SVC#Node> .
?this ?chain0 ?arg0 .
?arg0 n:node_realtime_cpu ?var0 .
}
GROUP BY ?this # grouping on the cluster identifier so we get an average _per cluster_
Yet another alternative would be to keep your query as-is, but group on two variables:
GROUP BY ?this ?var1
Which is best depends on what your data looks like and what, exactly, you're trying to calculate.
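As a rough illustration of what grouping on the cluster does, here is a Python sketch. The rows are invented and the function name `average_cpu_per_cluster` is an assumption; each row stands for one (?this, ?var0) binding.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical bindings: (cluster, node_realtime_cpu of one of its nodes).
rows = [
    ("cluster1", 0.25),
    ("cluster1", 0.75),
    ("cluster2", 1.0),
]

def average_cpu_per_cluster(rows):
    """Group the rows by cluster, then average the CPU values in each group."""
    grouped = defaultdict(list)
    for cluster, cpu in rows:
        grouped[cluster].append(cpu)
    return {cluster: mean(values) for cluster, values in grouped.items()}

averages = average_cpu_per_cluster(rows)
print(averages)
```

Without a group key, an aggregate collapses all rows into a single group, which is why a plain variable such as ?var1 cannot appear next to SUM(?var0) unless it is part of the GROUP BY.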