I have a rather complex SPARQL query, which is executed thousands of times in parallel threads (400 threads). The query is here somewhat simplified (namespaces, properties, and variables have been reduced) for readability, but the complexity is left untouched (unions, number of graphs, etc.). The query is run against 4 graphs, the biggest of which contains 5,561,181 triples.
PREFIX graphA: <GraphABaseURI:>
ASK
FROM NAMED <GraphBURI>
FROM NAMED <GraphCURI>
FROM NAMED <GraphABaseURI>
FROM NAMED <GraphDBaseURI>
WHERE{
{
GRAPH <GraphABaseURI>{
?variableA a graphA:ClassA .
?variableA graphA:propertyA ?variableB .
?variableB dcterms:title ?variableC .
?variableA graphA:propertyB ?variableD .
?variableL <GraphABaseURI:propertyB> ?variableD .
?variableD <propertyBURI> ?variableE
}
.
GRAPH <GraphBURI>{
?variableF <propertyCURI>/<propertyDURI> ?variableG .
?variableF <propertyEURI> ?variableH
}
.
GRAPH <GraphCURI>{
?variableI <http://www.w3.org/2004/02/skos/core#notation> ?variableJ .
?variableI <http://www.w3.org/2004/02/skos/core#prefLabel> ?variableK .
FILTER (isLiteral(?variableK) && REGEX(?variableK, "literalA", "i"))
}
.
FILTER (isLiteral(?variableJ) && ?variableG = ?variableJ) .
FILTER (?variableE = ?variableH)
}
UNION
{
GRAPH <GraphABaseURI>{
?variableA a graphA:ClassA .
?variableA graphA:propertyA ?variableB .
?variableB dcterms:title ?variableC .
?variableA graphA:propertyB ?variableD .
?variableL <propertyBURI> ?variableE .
?variableL <propertyFURI> ?variableD .
}
.
GRAPH <GraphDBaseURI>{
?variableM <propertyGURI> ?variableN .
?variableM <propertyHURI> ?variableO .
FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i"))
}
.
FILTER (?variableE = ?variableN) .
}
UNION
{
GRAPH <GraphABaseURI>{
?variableA a graphA:ClassA .
?variableA graphA:propertyA ?variableB .
?variableB dcterms:title ?variableC .
?variableA graphA:propertyB ?variableD .
?variableL <propertyBURI> ?variableE .
?variableL <propertyIURI> ?variableD .
}
.
GRAPH <GraphDBaseURI>{
?variableM <propertyGURI> ?variableN .
?variableM <propertyHURI> ?variableO .
FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i"))
}
.
FILTER (?variableE = ?variableN) .
}
. FILTER (isLiteral(?variableC) && REGEX(?variableC, "literalB", "i")) .
}
I would not expect someone to transform the above query (of course...). I am only posting the query to demonstrate the complexity and all the SPARQL structures used.
My questions:
Would I gain regarding performance if I had all my triples in one graph? This way I would avoid unions and simplify my query, however, would this also benefit in terms of performance?
Are there any kinds of indexes that I could build that would help with the above query? I am not really confident about data indexing; however, having read the RDF Index Scheme section of RDF Performance Tuning, I wonder whether Virtuoso 7's default indexing scheme is suitable for queries like the above. While the predicates are defined in the above query's SPARQL triple patterns, there are many triple patterns that have not defined subject or predicate. Could this be a major problem regarding performance?
Perhaps there is a SPARQL syntax structure that I am not aware of and could be of great help in the above query. Could you suggest something? For example, I have already improved performance by removing STR() casts and using the isLiteral() function. Could you suggest anything else?
Perhaps you could suggest overusing a complex SPARQL syntax structure?
Please note that I use Virtuoso Open source edition, built on Ubuntu, Version: 07.20.3214, Build: Oct 14 2015.
Regards,
Pantelis Natsiavas
First thing -- your Virtuoso build is long outdated; updating to 7.20.3217 as of April 2016 (or later) is strongly recommended.
Optimization suggestions are naturally limited when looking at a simplified query, but here are several thoughts, in no particular order.
Index Scheme Selection, the RDF Performance Tuning doc section following RDF Index Scheme, offers a couple of alternative and/or additional indexes which may make sense for your queries and data. As you say that some of your patterns will have defined graph and object, and undefined subject and predicate, some other indexes may also make sense (e.g., GOPS, GOSP), depending on some other factors.
Depending on how much your data has changed since the original load, it may be worth rebuilding the free-text indexes with this SQL command (which may be issued through any SQL interface, such as iSQL, ODBC, or JDBC):
VT_INC_INDEX_DB_DBA_RDF_OBJ ()
Using the bif:contains predicate can result in substantially better performance than regex() filters. For instance, replace this:
FILTER (isLiteral(?variableO) && REGEX(?variableO, "literalA", "i")) .
with this:
?variableO bif:contains "'literalA'" .
FILTER ( isLiteral(?variableO) ) .
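If the same rewrite has to be applied to many generated queries, the replacement can be produced mechanically. The following is only a minimal sketch: the helper name contains_pattern is hypothetical, and it assumes a simple word or phrase without quote characters, since quote escaping inside the free-text expression is not covered here.

```python
def contains_pattern(variable: str, phrase: str) -> str:
    """Build the two-line bif:contains replacement for a REGEX filter.

    `variable` is the SPARQL variable name without the leading '?'.
    """
    if "'" in phrase or '"' in phrase:
        # Escaping rules for quotes are deliberately out of scope in this sketch.
        raise ValueError("phrase must not contain quote characters")
    return (
        f"?{variable} bif:contains \"'{phrase}'\" .\n"
        f"FILTER ( isLiteral(?{variable}) ) ."
    )


if __name__ == "__main__":
    # Prints the same two lines shown in the answer above.
    print(contains_pattern("variableO", "literalA"))
```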
Explain() and profile() can be helpful in query optimization efforts. Much of this output is meant for analysis by Development, so it may not mean much to you, but providing it to other Virtuoso users can still yield helpful suggestions.
For a number of reasons, the rdf:type predicate (often expressed as a, thanks to SPARQL/Turtle syntactic sugar) can be a performance killer. Removing those predicates from your graph pattern is likely to boost performance substantially. If needed, there are other ways to limit the solution set (such as testing for attributes possessed only by entities of your desired rdf:type) which do not have such negative performance impacts.
(ObDisclaimer: OpenLink Software produces Virtuoso, and employs me.)
I am looking to run a SPARQL query over any dataset. We don't know the names of the named graphs in the datasets.
There is lots of documentation, and there are many examples, for selecting from named graphs when you know the names of the named graphs. There are also examples showing how to list named graphs.
We are running Jena from Java, so it would be possible to run two queries: the first gets the named graphs, and we inject these into the second.
But surely you can write a single query that reads from all named graphs when you don't know their names?
Note: we are looking to stay away from using default graph/s as their behaviour seems implementation dependent.
Example:
{
?s foaf:name ?name ;
vCard:nickname ?nickName .
}
If you want the pattern to match within one graph and wish to try each graph, use the GRAPH ?g form.
GRAPH ?g
{ ?s foaf:name ?name ;
vc:nickname ?nickName .
}
If you want the pattern to match across named graphs (e.g. foaf:name in one graph and vCard:nickname in another, with the same subject), set the union default graph option, tdb2:unionDefaultGraph true. The default graph as seen by the query is then the union (actually, the RDF merge, with no duplicates) of all the named graphs. Use the pattern as originally given.
Fuseki configuration file extract:
:dataset_tdb2 rdf:type tdb2:DatasetTDB2 ;
tdb2:location "DB2" ;
## Optional - with union default for query and update WHERE matching.
tdb2:unionDefaultGraph true ;
.
In code, not Fuseki, the application can use Dataset.getUnionModel().
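The difference between the two options can be illustrated without a triplestore. This is a toy sketch, not Jena code: the dataset, the graph names, and the function names are all made up for illustration, and the "pattern" is hard-coded as a join of foaf:name and vc:nickname on the same subject.

```python
# Toy dataset: graph name -> set of (subject, predicate, object) triples.
# All names and values here are illustrative assumptions.
DATASET = {
    "urn:g1": {("ex:alice", "foaf:name", "Alice")},
    "urn:g2": {
        ("ex:alice", "vc:nickname", "Ally"),
        ("ex:bob", "foaf:name", "Bob"),
        ("ex:bob", "vc:nickname", "Bobby"),
    },
}


def name_and_nick(triples):
    """Join foaf:name and vc:nickname on the same subject within one triple set."""
    names = {s: o for s, p, o in triples if p == "foaf:name"}
    nicks = {s: o for s, p, o in triples if p == "vc:nickname"}
    return sorted((s, names[s], nicks[s]) for s in names.keys() & nicks.keys())


def per_graph(dataset):
    """Like GRAPH ?g { ... }: the whole pattern must match inside a single graph."""
    return [(g, row) for g in sorted(dataset) for row in name_and_nick(dataset[g])]


def union_graph(dataset):
    """Like a union default graph: match against the merge of all named graphs."""
    merged = set().union(*dataset.values())
    return name_and_nick(merged)


if __name__ == "__main__":
    print(per_graph(DATASET))    # only ex:bob, whose triples share one graph
    print(union_graph(DATASET))  # ex:alice and ex:bob
```

In this toy data, ex:alice has her name in one graph and her nickname in another, so she only appears in the union result.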
I have an RDF ontology in a triplestore that users can make changes to. I want to increment/alter the ontology version with each change. One way to do it would be to use a hash of the ontology graph as the version. For that I need to calculate a hash of the RDF graph using SPARQL, as that is the only available API.
I realized this can be addressed in 2 steps:
generating a stable serialization of the graph (e.g. N-Triples with a stable ordering) using SPARQL
applying one of the SPARQL hash functions to get the actual hash value
Following suggestions on Twitter, I came up with this query that seems to generate valid N-Triples:
SELECT (GROUP_CONCAT(?tripleStr ; separator=' \n') AS ?nTriples)
WHERE
{ ?s ?p ?o
BIND(if(isURI(?s), concat("<", str(?s), ">"), concat("_:", str(?s))) AS ?sStr)
BIND(concat("<", str(?p), ">") AS ?pStr)
BIND(if(isURI(?o), concat("<", str(?o), ">"), if(isBlank(?o), concat("_:", str(?o)), concat("\"", str(?o), "\"", if(( lang(?o) != "" ), concat("@", str(lang(?o))), concat("^^<", str(datatype(?o)), ">"))))) AS ?oStr)
BIND(concat(?sStr, " ", ?pStr, " ", ?oStr, " .") AS ?tripleStr)
}
ORDER BY ?s ?p ?o datatype(?o) lcase(lang(?o))
Please do give it a try and report bugs, if any. The approach could be easily extended to N-Quads. As was pointed out on Twitter, it will not scale to large datasets.
Now, getting the actual hash of the graph is as easy as changing the SELECT criteria to:
SELECT (SHA1(GROUP_CONCAT(?tripleStr; separator=" \n")) AS ?graphHash)
Here we used the SHA1 hash function.
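For comparison, the same two steps (stable serialization, then hashing) can be sketched outside the triplestore, for cases where the N-Triples lines are already available on the client. This is an assumption-laden sketch rather than an equivalent of the query above: the function name graph_hash is hypothetical, lines are sorted as plain strings rather than by ?s ?p ?o, and blank node labels are assumed to be stable between runs, which many stores do not guarantee.

```python
import hashlib


def graph_hash(ntriples_lines):
    """Return a SHA-1 hex digest over the sorted, de-duplicated N-Triples lines."""
    # Sorting makes the result independent of the order the triples arrive in;
    # the set() drops duplicate lines, mirroring RDF's set semantics.
    canonical = " \n".join(sorted(set(line.strip() for line in ntriples_lines)))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    lines = [
        '<http://example.org/a> <http://example.org/p> "x"@en .',
        '<http://example.org/a> <http://example.org/q> <http://example.org/b> .',
    ]
    # The same triples in a different order give the same hash.
    print(graph_hash(lines) == graph_hash(list(reversed(lines))))
```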
I recently built a Virtuoso database (version 07.10.3207) using dbpedia data. I'm trying to build some queries for it, and encountering very strange results. For one example:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?s, ?p, ?o where {
?s ?p ?o .
?s rdfs:label "Almond"@en .
?o bif:contains "mythical"
}
This yields a hit. One might expect it to mean that the comment field (the field that matches "mythical") for Almond contains the word "mythical". However, it does not. It is, in fact:
"The almond (/??m?nd/) (Prunus dulcis, syn. Prunus amygdalus, Amygdalus communis, Amygdalus dulcis) (or badam in Indian English, from Persian: ??????) is a species of tree native to the Middle East and South Asia. "Almond" is also the name of the edible and widely cultivated seed of this tree."@en
Many other queries yield similarly strange results.
Trying the same queries on the public dbpedia endpoint does not yield these bizarre results, so I know it's somehow a problem with my database. I guessed that it might have to do with some corruption of the full text indices.
I tried the following, without a super-clear understanding of what exactly they might do, based on other notes I was able to find:
DB.DBA.RDF_OBJ_FT_RULE_ADD(null, null, 'All');
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();
DB.DBA.RDF_OBJ_FT_RECOVER();
DB.DBA.VT_INDEX_DB_DBA_RDF_OBJ();
Thus far, no dice. I'm sort of wondering if it might have to do with the mangled characters in the comment field - the online dbpedia endpoint renders them properly, while my Virtuoso installation just gives question marks, as seen above. No idea even how to begin approaching this though.
I did include SQL_UTF8_EXECS = 1 in virtuoso.ini (and subsequently restarted the server), which still left me with question marks in the results.
Actually, it does not appear to have anything to do with those question marks; I ran the following query:
select ?s, ?p, ?o where {
?s ?p ?o .
?o bif:contains "mythical" .
FILTER (!regex(?o, "mythical", "i"))
}
A pseudorandom selection of hits, none of which contain "mythical" or "?":
"Asgrrr"
"403 BC"
"Potential infinity"
"Beauty and the Beast (talk show)"
"Alberta highway highway 22"
The same query, run at http://dbpedia.org/sparql, returns nothing (as it should).
Any ideas?
Rebuilding the database did not correct the problem. I was, however, able to get a working version by taking the following steps. Some of them may have been unnecessary, but given how long it takes, I haven't done controlled experiments to narrow things down to the minimum.
First, delete the database and associated files, to start with a blank slate.
Edit virtuoso.ini to uncomment/include:
SQL_UTF8_EXECS = 1
Start up Virtuoso, and then from isql issue the following commands:
DB.DBA.RDF_OBJ_FT_RULE_ADD (null, null, 'All');
DB.DBA.VT_BATCH_UPDATE ('DB.DBA.RDF_OBJ', 'OFF', null);
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();
DB.DBA.RDF_OBJ_FT_RECOVER ();
COMMIT WORK;
CHECKPOINT;
CHECKPOINT_INTERVAL(60000);
Then, load the data.
Then, call:
COMMIT WORK;
CHECKPOINT;
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ();
CHECKPOINT;
COMMIT WORK;
CHECKPOINT;
CHECKPOINT_INTERVAL(60);
COMMIT WORK;
Enjoy your fully-text-searchable database!
I've been testing Sesame 2.7.2, and I got a big surprise when faced with the fact that DESCRIBE queries do not include the blank node closure. [EDIT: the right term for this is CBD, for concise bounded description.]
If I correctly understand, the SPARQL spec is quite loose on that and says that what is returned is actually up to the provider, but I'm still surprised at the choice, since bnodes (in the results of the describe query) cannot be used in subsequent SPARQL queries.
So the question is: how can I get a closed description of a resource <uri1> without doing:
query DESCRIBE <uri1>
iterate over the result to determine which objects are blank nodes
then DESCRIBE ?b WHERE { <uri1> pred_relating_to_bnode_ ?b }
do it recursively and chaining over as long as bnodes are found
If I'm not mistaken, depth-2 bnodes would have to be described with
DESCRIBE ?b2 WHERE {<uri1> <p1> ?b . ?b <p2> ?b2 }
unless there is a simpler way to do this?
Finally, would it not be better and simpler to let DESCRIBE return a closed description of a resource where you can still obtain the currently returned result with something like the following?
CONSTRUCT {<uri1> ?p ?o} WHERE {<uri1> ?p ?o}
EDIT: here is an example of a closed result I want to get back from Sesame
<urn:sites#1> a my:WebSite .
<urn:sites#1> my:domainName _:autos1 .
<urn:sites#1> my:online "true"^^xsd:boolean .
_:autos1 a rdf:Alt .
_:autos1 rdf:_1 _:autos2 .
_:autos2 my:url "192.168.2.111:15001"@fr .
_:autos2 my:url "192.168.2.111:15002"@en .
Currently: DESCRIBE <urn:sites#1> returns me the same result as the query CONSTRUCT WHERE {<urn:sites#1> ?p ?o}, so I get only that
<urn:sites#1> a my:WebSite .
<urn:sites#1> my:domainName _:autos1 .
<urn:sites#1> my:online "true"^^xsd:boolean .
Partial solutions using SPARQL
Based on your comments, this isn't an exact solution yet, but note that you can describe multiple things in a given describe query. For instance, given the data:
@prefix : <http://example.org/> .
:Alice :named "Alice" ;
:likes :Bill, [ :named "Carl" ;
:likes [ :named "Daphne" ]].
:Bill :likes :Elaine ;
:named "Bill" .
you can run the query:
PREFIX : <http://example.org/>
describe :Alice ?object where {
:Alice :likes* ?object .
FILTER( isBlank( ?object ) )
}
and get the results:
@prefix : <http://example.org/> .
:Alice
:likes :Bill ;
:likes [ :likes [ :named "Daphne"
] ;
:named "Carl"
] ;
:named "Alice" .
That's not a complete description of course, because it's only following :likes out from :Alice, not arbitrary predicates. But it does get the blank nodes named "Carl" and "Daphne", which is a start.
The larger issue in Sesame
It looks like you're going to have to do something like what's described above, and possibly with multiple searches, or you're going to have to modify Sesame. The alternative to writing some creative SPARQL is to change the way that Sesame implements describe queries. Some endpoints make this relatively easy, but Sesame doesn't seem to be one of them. There's a mailing list thread from 2011, Custom SPARQL DESCRIBE Implementation, that seems addressed at this same problem.
Roberto García asks:
I'm trying to customise the behaviour of SPARQL DESCRIBE queries.
I'm willing to get something similar to CBD (i.e. all properties and
values for the described resource plus all properties and values for
the blank nodes connected to it).
I have tried to reproduce a similar behaviour using a CONSTRUCT query
but the performance is not good and the query gets quite complex if I
try to consider long chains of properties pointing to blank nodes
starting from the described resource.
Jeen Broekstra replies:
The implementation of DESCRIBE in Sesame is hardcoded in the query
parser. It can only be changed by adapting the parser itself, and even
then it will be tricky, as the query model has no easy way to express it
either: it needs an extension of the algebra.
> If this is not possible, any advice about how to implement it using CONSTRUCT
queries?
I'm not sure it's technically possible to do this in a single query.
CBDs are recursive in nature, and while SPARQL does have some support
for recursivity (property chains), the problem is that you have to do an
intermediate check in every step of the property chain to see if the
bound value is a blank node or not. This is not something that SPARQL
supports out of the box: property chains are defined to have only length
of the path as the stop condition.
Perhaps something is possible using a convoluted combination of
subqueries, unions and optionals, but I doubt it.
I think the best workaround is instead to use the standard DESCRIBE
format that Sesame supports, and for each blank node value in that
result do a separate consecutive query. In other words: you solve it by
hand.
The only other option is to log a feature request for support of CBDs in
Sesame. I can't give any guarantees about if/when that will be followed
up on though.
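The "solve it by hand" workaround amounts to a small graph traversal on the client side. The sketch below shows the idea over an in-memory list of triples modelled on the example from the question. It is a simplification under stated assumptions: terms are plain strings, blank nodes are recognised only by a "_:" prefix, language tags are dropped, and the reification part of the full CBD definition is ignored. The function names are hypothetical, and against Sesame each lookup would be a separate query rather than a list scan.

```python
# Triples modelled on the question's example; the last one is unrelated noise.
TRIPLES = [
    ("urn:sites#1", "rdf:type", "my:WebSite"),
    ("urn:sites#1", "my:domainName", "_:autos1"),
    ("urn:sites#1", "my:online", "true"),
    ("_:autos1", "rdf:type", "rdf:Alt"),
    ("_:autos1", "rdf:_1", "_:autos2"),
    ("_:autos2", "my:url", "192.168.2.111:15001"),
    ("_:autos2", "my:url", "192.168.2.111:15002"),
    ("urn:sites#2", "rdf:type", "my:WebSite"),
]


def is_bnode(term):
    """Toy convention: blank nodes are strings starting with '_:'."""
    return term.startswith("_:")


def closed_description(triples, resource):
    """Collect the triples of `resource` and, transitively, of every
    blank node reached as an object along the way."""
    result, todo, seen = set(), [resource], set()
    while todo:
        node = todo.pop()
        if node in seen:
            continue  # guards against cycles between blank nodes
        seen.add(node)
        for s, p, o in triples:
            if s == node:
                result.add((s, p, o))
                if is_bnode(o):
                    todo.append(o)
    return result


if __name__ == "__main__":
    for triple in sorted(closed_description(TRIPLES, "urn:sites#1")):
        print(triple)
```

The stop condition is the point Jeen Broekstra describes: the traversal continues only while the object is a blank node, which is the check a plain property path cannot express.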
I have the following SPARQL query (from the book A Semantic Web Primer):
select ?n
where
{
?x rdf:type uni:Course;
uni:isTaughtBy :949352
?c uni:name ?n .
FILTER(?c=?x) .
}
In this case, I guess this code is the same as the following:
Select ?n
Where
{
?x rdf:type uni:course;
uni:isTaughtBy :949352 .
?x uni:name ?n .
}
Does this query lead to coding error?
No, I don't see why it should give you an error or produce wrong results. Just make sure to always use the right case (uni:Course vs. uni:course), as SPARQL is case sensitive.
To be honest, the first version seems rather obscure as it uses a FILTER without a real need for it. That said, you may further slim down your query if you wish:
SELECT ?n
WHERE
{
?x rdf:type uni:Course;
uni:isTaughtBy :949352;
uni:name ?n .
}
However, keep in mind that saving characters does not always lead to improved readability.
For your example, yes, the queries are identical, and there would be no value in using a FILTER over a join.
However, the reason you might use the FILTER form is the difference in semantics between joins and the = operator.
Joins require that the values of the variables be exactly the same RDF term, whereas = tests value equality: do the RDF terms represent the same value? This is primarily a concern when one or both of the values may be literals.
It's easier to see with a specific example. Assume ?x = 4 and ?c = 4.0 (which is a bad example for your query but illustrates the point).
?x = ?c would give true, while a join would give no results because they are not the exact same term.
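The distinction can be mimicked in a few lines. This is a rough model, not how any SPARQL engine is implemented: a literal is represented as a (lexical form, datatype IRI) pair, only xsd:integer and xsd:decimal are treated as numeric, and the function names are made up for illustration.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
NUMERIC = {XSD + "integer", XSD + "decimal"}


def same_term(a, b):
    """Join semantics: lexical form and datatype must both be identical."""
    return a == b


def value_equal(a, b):
    """Rough model of '=': numeric literals are compared by value."""
    (lex_a, dt_a), (lex_b, dt_b) = a, b
    if dt_a in NUMERIC and dt_b in NUMERIC:
        return Decimal(lex_a) == Decimal(lex_b)
    # Anything else falls back to term equality in this sketch.
    return a == b


if __name__ == "__main__":
    x = ("4", XSD + "integer")    # the SPARQL literal 4
    c = ("4.0", XSD + "decimal")  # the SPARQL literal 4.0
    print(same_term(x, c))    # False: a join on these would find nothing
    print(value_equal(x, c))  # True: FILTER(?x = ?c) would pass
```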