I am playing with Blazegraph. I insert triples representing 'events'; each 'event' consists of 3 triples and looks like this:
<%event-iri%> <http://predicates/timestamp> '2020-01-02T03:04:05.000Z'^^xsd:dateTime .
<%event-iri%> <http://predicates/a> %RANDOM_UUID% .
<%event-iri%> <http://predicates/b> %RANDOM_UUID% .
Timestamps represent consecutive moments in time; each event is 1 minute later than the previous one.
I ran two sets of tests: one with 1 million events (so 3 million triples), and one with 3 million events (9 million triples).
I run queries like the following:
select ?event ?a ?v
where {
?event <http://predicates/timestamp> ?timestamp .
filter (?timestamp >= '2020-01-02T03:04:05.000Z'^^xsd:dateTime && ?timestamp < '2020-01-02T03:05:05.000Z'^^xsd:dateTime)
?event ?a ?v .
}
I started with queries returning 1000 events (3000 triples) and then went down to queries that match only 1 event (and return 3 triples), to make sure that the result set size does not influence the range query performance itself too much.
I also tried adding a hint found here https://sourceforge.net/p/bigdata/discussion/676946/thread/2cf9a1e8/?limit=25 to tell Blazegraph to use the range query optimization, by placing the following right after the filter clause:
hint:Prior hint:rangeSafe "true" .
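For the dateTime case, the query with the hint then looked roughly like this (reconstructed from the description above; the exact range bounds are illustrative):
select ?event ?a ?v
where {
?event <http://predicates/timestamp> ?timestamp .
filter (?timestamp >= '2020-01-02T03:04:05.000Z'^^xsd:dateTime && ?timestamp < '2020-01-02T03:05:05.000Z'^^xsd:dateTime)
hint:Prior hint:rangeSafe "true" .
?event ?a ?v .
}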
It was also mentioned there that range queries work for some types but not for others (for johpfe they worked with ints), so I ran another set of tests where timestamps are represented as ints (Unix timestamps):
<%event-iri%> <http://predicates/timestamp> 1606528746 .
<%event-iri%> <http://predicates/a> %RANDOM_UUID% .
<%event-iri%> <http://predicates/b> %RANDOM_UUID% .
The final query I tried was
select ?event ?a ?v
where {
?event <http://predicates/timestamp> ?timestamp .
filter (?timestamp >= 1606528746 && ?timestamp < 1606528806)
hint:Prior hint:rangeSafe "true" .
?event ?a ?v .
}
Whatever I try, I get the following results: for the smaller dataset (1 million timestamps/ints) queries take 1 second, sometimes more, but not less; for the bigger dataset (3 million timestamps/ints) queries take at least 3 seconds.
The difference is 3x, which matches the 3x increase in data volume, so it looks like the range optimization is not working.
I also compared against MongoDB. With an index on the 'timestamp' field, it always executes an analogous query in 30-50 ms, regardless of the data size.
What am I doing wrong? Is there a way to make Blazegraph apply the optimization here?
PS. I also tried putting the hint right after the triple pattern rather than after the filter statement, as per https://github.com/blazegraph/database/wiki/QueryHints, which says the following about the rangeSafe hint:
Declare that the data touched by the query for a specific triple pattern is strongly typed, thus allowing a range filter to be pushed down onto an index.
So the query became
select ?event ?a ?v
where {
?event <http://predicates/timestamp> ?timestamp .
hint:Prior hint:rangeSafe "true" .
filter (?timestamp >= 1606528746 && ?timestamp < 1606528806)
?event ?a ?v .
}
But this query finds nothing, so the hint just breaks it.
Here are the queries where the optimization does work.
This one is for the case of integers:
select ?event ?a ?v
where {
?event <http://predicates/timestamp> ?timestamp .
hint:Prior hint:rangeSafe "true" .
filter (?timestamp >= "1606528746"^^xsd:int && ?timestamp < "1606528806"^^xsd:int)
?event ?a ?v .
}
And this one is for the case when timestamps are date-times:
select ?event ?a ?v
where {
?event <http://predicates/timestamp> ?timestamp .
hint:Prior hint:rangeSafe true .
filter (?timestamp >= '2021-11-07T22:08:24.022+04:00'^^xsd:dateTime && ?timestamp < '2021-11-07T22:10:24.022+04:00'^^xsd:dateTime)
?event ?a ?v .
}
What prevented the optimization from kicking in:
For the case of integers, I did not specify the literal type explicitly, so the filter used ?timestamp >= 1606528746 instead of ?timestamp >= "1606528746"^^xsd:int. Strangely enough, that breaks the optimization (a bare number in SPARQL is an xsd:integer literal, which apparently does not match what the optimizer needs here).
Also, the hint must be placed right after the triple pattern and NOT after the filter statement.
Finally, it turned out that it does not matter whether the hint value is written as true or "true": both options work.
Many thanks to @StanislavKralin for giving a working example, which I was able to use to transform my queries into a working form.
The DAWG test query two-nested-opt.rq is like this:
PREFIX : <http://example/>
SELECT *
{
:x1 :p ?v .
OPTIONAL
{
:x3 :q ?w .
OPTIONAL { :x2 :p ?v }
}
}
The test data is:
@prefix : <http://example/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:x1 :p "1"^^xsd:integer .
:x2 :p "2"^^xsd:integer .
:x3 :q "3"^^xsd:integer .
:x3 :q "4"^^xsd:integer .
If I run the query, the expected result is just one record with ?v=1 (in short). I don't really understand this, because if I comment out the second OPTIONAL, the result is two records:
v=1, w=3
v=1, w=4
Some possible explanations I found mention that the second OPTIONAL is a no-op since the ?v bindings don't match between the second OPTIONAL and the main BGP. But I don't get how that explains it. Shouldn't the results of the first OPTIONAL always be included in the solution, whatever the result of the second OPTIONAL is?
This test case is about evaluation being functional (also called bottom-up) and about how the early (inner-most) setting of ?v affects the outcome. In the test, the inner-most setting of ?v blocks the first OPTIONAL, so the ?w = 3 and ?w = 4 rows are not in the results.
It is an instance of the general situation of having an outer ?v (on the left side of the OPTIONAL), then an OPTIONAL part that does not mention ?v, which itself has an OPTIONAL using ?v.
If the query were evaluated top-down, the answers would be different.
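In SPARQL algebra terms, the query corresponds (roughly, using the spec's LeftJoin operator and ignoring the trivial filter arguments) to:
LeftJoin( BGP(:x1 :p ?v), LeftJoin( BGP(:x3 :q ?w), BGP(:x2 :p ?v) ) )
and bottom-up evaluation starts with the inner-most LeftJoin.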
Evaluation is:
:x3 :q ?w leftjoin :x2 :p ?v
Two rows:
?w = 3; ?v = 2
?w = 4; ?v = 2
:x1 :p ?v
?v = 1
Now left-join "?v = 1" with the rows "?w = 3; ?v = 2" and "?w = 4; ?v = 2" -- neither row is compatible (the ?v values differ), so the OPTIONAL does not join and the result is one row, "?v = 1", with no ?w binding.
If the ":x2 :p ?v" is omitted, the first
:x3 :q ?w
Two rows:
?w = 3
?w = 4
both of which join with ?v = 1, giving two rows of ?v and ?w.
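For reference, this is the query with the inner OPTIONAL removed (the variant just described), which should return those two rows:
PREFIX : <http://example/>
SELECT *
{
:x1 :p ?v .
OPTIONAL { :x3 :q ?w }
}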
I need to search for items in two graphs with the same string but different language codes ("xx"@en and "xx"@eng, the latter from WordNet). Obviously
"xx"@en is not equal to "xx"@eng.
It can be done with (prefix nlp suitably defined):
select * where {
?a nlp:lemma ?as .
?b rdfs:label ?bs .
filter (str(?as)=str(?bs)) .
# more code using ?a and ?b
}
However, this query takes too long and is wasteful. It should be possible to do something like:
?a nlp:lemma ?s .
?b rdfs:label ?s .
but I cannot see how, short of manually changing all the @eng tags in the WordNet triples to @en, which I would rather not do.
Any solution?
Thank you!
You could cut your search space down by filtering only for en and eng, but the only way to compare the string portion of a language-tagged literal is to convert it to a plain string.
I.e., the following could be more efficient if there are language-tagged strings with tags other than en and eng:
select * where {
?a nlp:lemma ?as .
?b rdfs:label ?bs .
filter (lang(?as) = "en" || lang(?as) = "eng")
filter (str(?as)=str(?bs)) .
# more code using ?a and ?b
}
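If both predicates can carry several language tags, you could restrict both variables. A sketch along the same lines (assuming the nlp:lemma values are tagged en and the WordNet labels eng; swap the tags if it is the other way around):
select * where {
?a nlp:lemma ?as .
?b rdfs:label ?bs .
filter (lang(?as) = "en")
filter (lang(?bs) = "eng")
filter (str(?as)=str(?bs)) .
# more code using ?a and ?b
}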
I have two graphs with similar data but a slight difference.
My goal is to merge them using SPARQL and perform the integration: I want the two RDF graphs, which differ only slightly, combined into a single RDF graph using SPARQL.
For example, one graph is:
ns0:BaseRoleClassLib
a ns0:ExternalReference ;
ns0:externalReferenceAlias "BaseRoleClassLib" .
ns0:maxTransportationWeight
a ns0:Attribute ;
schema:name "maxTransportationWeight" ;
ns0:hasValue "35" .
The second graph is:
ns0:BaseRoleClassLib
a ns0:ExternalReference ;
ns0:externalReferenceAlias "BaseRoleClassLib" .
ns0:maxTransportationWeight
a ns0:Attribute ;
schema:name "maxTransportationWeight" ;
ns0:hasValue "35.0" .
The only difference is that one graph has the transportation weight value as an integer ("35") and the other as a float ("35.0").
So I wrote a query to generalise them:
select distinct ?integer
from <graph1>
from <graph2>
where {
?s ns0:hasValue ?y .
Bind(xsd:integer(xsd:decimal(?y)) as ?integer)
}
This converts the differing values into the generalised integer form.
Now my next goal is to integrate these files into a single RDF graph using the above result.
I want an RDF file which is the union of the two graphs, with the float-to-integer generalisation applied.
S1, S2 -> generalization -> integration -> S3 RDF
How can I achieve this using a SPARQL CONSTRUCT / INSERT?
Thanks so much
This can be done pretty straightforwardly with CONSTRUCT. SPARQL Update doesn't support FROM (its counterpart there is USING), so for an update you'd need USING clauses or a UNION of GRAPH patterns (see the INSERT sketch below). The following CONSTRUCT should get the merge you are looking for - basically filter out the old ns0:hasValue triple and construct the new one:
CONSTRUCT {
?s ?p ?o .
?s ns0:hasValue ?intValue .
}
FROM <graph1>
FROM <graph2>
WHERE {
?s ?p ?o .
OPTIONAL{?s ns0:hasValue ?origValue .}
BIND(IF(CONTAINS(str(?origValue), "."), xsd:integer(STRBEFORE(str(?origValue), ".")), xsd:integer(str(?origValue))) AS ?intValue)
FILTER (?p != ns0:hasValue)
}
Note that conversion of float to integer isn't straightforward: you'd have to live with rounding down (truncation) or add logic to round based on the decimal part.
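If you want to materialize the merged result into a new graph rather than just return it, here is a rough INSERT sketch along the same lines, using GRAPH patterns instead of FROM; <graph3> is a hypothetical target graph, and the ns0: and xsd: prefixes are assumed to be declared as in the data:
INSERT {
GRAPH <graph3> {
?s ?p ?o .
?s ns0:hasValue ?intValue .
}
}
WHERE {
{ GRAPH <graph1> { ?s ?p ?o } } UNION { GRAPH <graph2> { ?s ?p ?o } }
OPTIONAL { { GRAPH <graph1> { ?s ns0:hasValue ?origValue } } UNION { GRAPH <graph2> { ?s ns0:hasValue ?origValue } } }
BIND(IF(CONTAINS(str(?origValue), "."), xsd:integer(STRBEFORE(str(?origValue), ".")), xsd:integer(str(?origValue))) AS ?intValue)
FILTER (?p != ns0:hasValue)
}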
I wrote this query, which returns a list of couples matching a particular condition (on http://live.dbpedia.org/sparql):
SELECT DISTINCT ?actor ?person2 ?cnt
WHERE
{
{
select DISTINCT ?actor ?person2 (count (?film) as ?cnt)
where {
?film dbo:starring ?actor .
?actor dbo:spouse ?person2.
?film dbo:starring ?person2.
}
order by ?actor
}
FILTER (?cnt >9)
}
The problem is that some rows are duplicated. For example:
http://dbpedia.org/resource/George_Burns http://dbpedia.org/resource/Gracie_Allen 12
http://dbpedia.org/resource/Gracie_Allen http://dbpedia.org/resource/George_Burns 12
How can I remove these duplicates?
I added gender to ?actor, but it damaged the current result.
Natan Cox's answer shows the typical way to exclude this kind of pseudo-duplicate. The results aren't actually duplicates, because in one, e.g., George Burns is the ?actor, and in the other he is the ?person2. In many cases, you can add a filter to require that the two things are ordered, and that will remove the duplicate cases. E.g., when you have data like:
:a :likes :b .
:a :likes :c .
and you search for
select ?x ?y where {
:a :likes ?x, ?y .
}
you can add filter(?x < ?y) to enforce an ordering between ?x and ?y, which will remove these pseudo-duplicates. However, in this case it's a bit trickier, since ?actor and ?person2 aren't found using the same criteria. If DBpedia contains
:PersonB dbo:spouse :PersonA
but not
:PersonA dbo:spouse :PersonB
then the simple filter won't work, because you'll never find the triple where the subject PersonA is less than the object PersonB. So in this case, you also need to modify your query a bit to make the criteria symmetric:
select distinct ?actor ?spouse (count(?film) as ?count) {
?film dbo:starring ?actor, ?spouse .
?actor dbo:spouse|^dbo:spouse ?spouse .
filter(?actor < ?spouse)
}
group by ?actor ?spouse
having (count(?film) > 9)
order by ?actor
(This query also shows that you don't need a subquery here, you can use having to "filter" on aggregate values.) But the important part is using the property path dbo:spouse|^dbo:spouse to find a value for ?spouse such that either ?actor dbo:spouse ?spouse or ?spouse dbo:spouse ?actor. This makes the relationship symmetric, so that you're guaranteed to get all the pairs, even if the relationship is only declared in one direction.
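For clarity, that property path is just shorthand for an explicit union of the two directions, i.e. it behaves like:
{ ?actor dbo:spouse ?spouse } UNION { ?spouse dbo:spouse ?actor }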
These are not actual duplicates, of course, since you can look at each pair from both directions. The way to fix it, if you want to, is to add a filter. It is a bit of a dirty hack, but it keeps only one of the 2 rows that are the "same".
SELECT DISTINCT ?actor ?person2 ?cnt
WHERE
{
{
select DISTINCT ?actor ?person2 (count (?film) as ?cnt)
where {
?film dbo:starring ?actor .
?actor dbo:spouse ?person2.
?film dbo:starring ?person2.
FILTER (?actor < ?person2)
}
order by ?actor
}
FILTER (?cnt >9)
}
I am writing a SPARQL query for a dataset containing thousands of graphs, and I want to compute on the fly which graphs are to be included in my search. As an example, I might only want to include graphs written by me this year. This can be done easily enough by a restriction on the graph URI when the query is straightforward:
SELECT ?name WHERE {
GRAPH ?g { [ foaf:name ?name ] }
?g dc:creator ex:me ; dc:created ?date
FILTER( xsd:dateTime(?date) >= xsd:dateTime("2015-01-01") )
}
But suppose my query is more complex and I want a list of pairs of acquaintances. The naïve implementation is something like this:
SELECT ?name WHERE {
GRAPH ?g { ?name1 ^foaf:name/foaf:knows/foaf:name ?name2 }
?g dc:creator ex:me ; dc:created ?date
FILTER( xsd:dateTime(?date) >= xsd:dateTime("2015-01-01") )
}
This works fine if all three FOAF triples are in the same graph. But if any is in a different graph, it fails because ?g binds to a single graph in each result. I can explicitly write each of the three FOAF triples in their own GRAPH block, but then I have to associate each with their own graph URI variable, and repeat the graph restriction for each:
SELECT ?name WHERE {
GRAPH ?g1 { ?p1 foaf:name ?name1 }
?g1 dc:creator ex:me ; dc:created ?date1
FILTER( xsd:dateTime(?date1) >= xsd:dateTime("2015-01-01") )
GRAPH ?g2 { ?p1 foaf:knows ?p2 }
?g2 dc:creator ex:me ; dc:created ?date2
FILTER( xsd:dateTime(?date2) >= xsd:dateTime("2015-01-01") )
GRAPH ?g3 { ?p2 foaf:name ?name2 }
?g3 dc:creator ex:me ; dc:created ?date3
FILTER( xsd:dateTime(?date3) >= xsd:dateTime("2015-01-01") )
}
That code now does the right thing, but it rapidly becomes untenable as the query becomes more complex. If the main query has m triples and the graph restriction has n, the complete query ends up with m×n triples.
Is there a better solution within standard SPARQL 1.1? I'm aware that some SPARQL engines will fetch a graph from its URI, and then you can make that URI the URL of a GET request to a SPARQL endpoint, but that's not standard. I had hoped the federated query mechanism might help, but it doesn't seem to.
As I don't have your data at hand, I can't test my query reliably. However, you seem to need subqueries.
Subqueries are a way to embed SPARQL queries within other queries, normally to achieve results which cannot otherwise be achieved, such as limiting the number of results from some sub-expression within the query.
In your case, the objective is:
Get the list of graphs for which you are the dc:creator.
For each of those graphs, find some possibly useful triples.
What you want as a result of your query isn't very clear, since you mention only ?name after SELECT even though it appears nowhere in the rest of the query. Here is how you could try using subqueries; maybe it will provide inspiration to solve your problem:
SELECT ?name1 ?name2 WHERE {
{GRAPH ?graph { ?p1 foaf:name ?name1 }}
UNION {GRAPH ?graph { ?p1 foaf:knows ?p2 }}
UNION {GRAPH ?graph { ?p2 foaf:name ?name2 }}
{ select ?graph where { #here we get the list of graphs you created
?graph dc:creator ex:me ; dc:created ?date
FILTER( xsd:dateTime(?date) >= xsd:dateTime("2015-01-01") )
}
}
}