How to determine whether two SPARQL queries are identical using Python?

When querying an RDF dataset with SPARQL, the same query can be written in many different ways. For example, a SPARQL query is permutation-invariant with respect to some of its clauses, and its variables can be renamed. How can we identify such identical queries? Ideally, there would be a Python package that parses a SPARQL query (i.e., a string) into a query object, such that different strings sharing the same underlying query are parsed into the same object; we could then simply compare the parsed objects to determine whether two SPARQL queries are identical. Is there any tool like this (prepareQuery() in rdflib does not seem to work this way)? If not, what should I do?
Semantically identical queries example:
SELECT ?x WHERE { ?x foaf:haha ?k . ?person foaf:knows ?x . }
SELECT ?s WHERE { ?person foaf:knows ?s . ?s foaf:haha ?k . }

The paper "Generating SPARQL Query Containment Benchmarks using the SQCFramework" by Muhammad Saleem et al. mentions "SPARQL query containment solvers", where
"Query containment is the problem of deciding if the result set of a query Q1 is included in the result set of another query Q2."
If you use such a solver to show that the result set of Q1 is always a subset of that of Q2 and vice versa, you have established that the two queries are semantically identical.
As for your "off-the-shelf tool": the paper notes that such solvers are evaluated in another paper, "Evaluating and benchmarking SPARQL query containment solvers" by M. W. Chekol et al.
As for the complexity and computability, the latter paper mentions:
The query containment problem for full SPARQL is undecidable [15, 1].
Hence, it is necessary to reduce SPARQL in order to consider it. A
double exponential upper bound has been proven for the containment and
equivalence problems of SPARQL queries without OPTIONAL , FILTER and
under set semantics [7].
However, query containment in both directions is only one way to determine identity of queries. I am unaware whether there is a proof of a better complexity/computability for query identity than for query containment (or a proof on the contrary).
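For the restricted case in the question (a single basic graph pattern, no OPTIONAL or FILTER, set semantics), containment can be checked by searching for a homomorphism between the two patterns, which is the classical technique for conjunctive queries. The following is a minimal Python sketch under those assumptions, not a full solver. The tuple-based query representation and the function names are made up for illustration, and the queries are assumed to be parsed already.

```python
from itertools import product

def is_var(term):
    # Assumption: variables are written as strings starting with "?".
    return term.startswith("?")

def contained_in(q1, q2):
    """Return True if a homomorphism from q2 into q1 exists.

    Under set semantics this means every answer of q1 is also an answer of q2.
    A query is a pair (projected variables, list of triple patterns).
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    terms1 = sorted({t for triple in body1 for t in triple})
    vars2 = sorted({t for triple in body2 for t in triple if is_var(t)})
    body1_set = set(body1)
    # Brute force: try every assignment of q2's variables to terms of q1.
    for image in product(terms1, repeat=len(vars2)):
        mapping = dict(zip(vars2, image))

        def subst(term):
            return mapping.get(term, term)

        if tuple(subst(v) for v in head2) != tuple(head1):
            continue
        if all(tuple(subst(t) for t in triple) in body1_set for triple in body2):
            return True
    return False

def equivalent(q1, q2):
    # Containment in both directions.
    return contained_in(q1, q2) and contained_in(q2, q1)

# The two queries from the question.
Q1 = (("?x",), [("?x", "foaf:haha", "?k"), ("?person", "foaf:knows", "?x")])
Q2 = (("?s",), [("?person", "foaf:knows", "?s"), ("?s", "foaf:haha", "?k")])
print(equivalent(Q1, Q2))
```

The search is exponential in the number of variables, so this is only practical for small patterns, and it says nothing about queries that use features outside this fragment.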

Related

MarkLogic seems not to follow the RDF specification

Write the following RDF data into MarkLogic:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://John> <http://have> 10 .
<http://Peter> <http://have> "10"^^xsd:double .
Then execute the following query:
SELECT *
WHERE { ?s ?p 10. }
The query result is:
?s              ?p
<http://John>   <http://have>
<http://Peter>  <http://have>
In the RDF specification, two literals are the same RDF term if and only if both the lexical form and the datatype are the same:
Literal term equality: Two literals are term-equal (the same RDF literal) if and only if the two lexical forms, the two datatype IRIs, and the two language tags (if any) compare equal, character by character. Thus, two literals can have the same value without being the same RDF term.
https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
It seems that MarkLogic does not follow the RDF specification?
The expected query result is:
?s              ?p
<http://John>   <http://have>
Alternatives
Executing the SPARQL query in Query Console and through the Java Client API both return the unexpected result.
The same query returns the expected result in Apache Jena and RDF4J.
Can someone give me an answer or a hint about it?
A triple store that does that is not automatically wrong.
Try:
{ ?s ?p ?x . FILTER (sameTerm(?x, 10)) }
{ ?s ?p ?x . FILTER (?x = 10) }
Also try with 010 and +10, and with "010"^^xsd:integer,
and explore the lexical forms:
FILTER(str(?x) = "10")
The use cases for both value-based matching and term-based matching exist.
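To make the distinction concrete, here is a small Python sketch of the two notions of equality. It is only an illustration: literals are modelled as (lexical form, datatype IRI) pairs, language tags are ignored, only three numeric datatypes are handled, and the helper names are invented for this example.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# Assumption: only these three numeric datatypes are handled in this sketch.
PARSERS = {
    XSD + "integer": int,
    XSD + "decimal": Decimal,
    XSD + "double": float,
}

def term_equal(a, b):
    # Same lexical form and same datatype, compared character by character.
    return a == b

def value_equal(a, b):
    # Compare the numeric values when both datatypes are known numerics.
    (lex_a, dt_a), (lex_b, dt_b) = a, b
    if dt_a in PARSERS and dt_b in PARSERS:
        return PARSERS[dt_a](lex_a) == PARSERS[dt_b](lex_b)
    return term_equal(a, b)

john = ("10", XSD + "integer")    # the plain 10 in the data is an xsd:integer
peter = ("10", XSD + "double")
padded = ("010", XSD + "integer")

print(term_equal(john, peter), value_equal(john, peter))
print(term_equal(john, padded), value_equal(john, padded))
```

In these terms, sameTerm corresponds to term_equal and the = operator on numeric literals corresponds to value_equal.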
A SPARQL query does not say whether the data loaded is exactly the data queried (by design). Several stores do some variation of value processing as default behaviour. Jena does for TDB2 but keeps datatypes apart (TDB1 conflating datatypes was less popular).
It depends on the use case they are targeting. If you want a particular effect, use a FILTER form of the query, assuming the data was not pre-processed on loading.
See the answer on Unexpected data writing result in MarkLogic
See https://www.w3.org/TR/sparql11-entailment/#CanonicalLit for the discussion that D-entailment gives many, many answers even with a restriction to canonical lexical forms.

Is there a way to sort SPARQL query results by relevance score in MarkLogic 8?

We're running SPARQL queries on some clinical ontology data in our MarkLogic server. Our queries look like the following:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cts: <http://marklogic.com/cts#>
SELECT *
FROM <http://example/ontologies/snomedct>
WHERE {
?s rdfs:label ?o .
FILTER cts:contains(?o, cts:word-query("Smoke*", "wildcarded"))
}
LIMIT 10
We expected the results to be sorted by relevance score, but instead they seem to be in some random order. We tried many variations of the query, but nothing worked. After some research we found this statement in the MarkLogic docs:
When understanding the order an expression returns in, there are two
main rules to consider:
cts:search expressions always return in relevance order (the most relevant to the least relevant).
XPath expressions always return in document order.
Does this mean that cts:contains is an XPath expression that always returns in document order? If that's the case, how can we construct a SPARQL query that returns in relevance order?
Thanks,
Kevin
In your example, the language you are using is SPARQL, with cts:contains acting as a fragment filter.
In this case, cts:contains is only useful for isolating the fragment IDs that match, thus filtering the candidate documents used by the SPARQL query. Therefore, I do not believe that the cts relevance is taken into account.
However, you could possibly get the results you are looking for in a different way: do an actual cts:search on the documents in question, then filter them using a cts:triple-range-query.
https://docs.marklogic.com/cts:triple-range-query

Jena TDB physical query plan with multiple FROM clauses

I am trying to figure out how Jena TDB handles SPARQL queries with multiple FROM clauses on the physical query plan level.
I would like to know how Jena TDB handles executing a query over different graphs.
I have made some small experiments and looked at the query algebra, however, it is not clear to me how the FROM clauses affect the algebra.
It looks like the FROM clauses are discarded in the algebra. I expect that the algebra is evaluated over the union of the graphs, but I would like to be sure.
I have the following quads:
<http://example.com/book2/> <http://example.com/price> "5"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.com/A> .
<http://example.com/book2/> <http://example.com/title> "Lord of the Rings" <http://example.com/B> .
and the following query:
SELECT (AVG(?price) as ?total)
FROM <http://example.com/A>
FROM <http://example.com/B>
WHERE {
?book <http://example.com/price> ?price .
?book <http://example.com/title> ?title .
}
./tdbquery --loc test --query test.sparql --explain
The query algebra looks as follows:
INFO exec :: ALGEBRA
(project (?total)
(extend ((?total ?.0))
(group () ((?.0 (avg ?price)))
(bgp (triple ?book <http://example.com/price> ?price)))))
When I execute the query over the data I receive the expected result.
FROM (and FROM NAMED) aren't really part of the query, but indications of what the dataset to be queried ought to be. These clauses don't alter what the query will do, only what it operates on, so you don't see them in the algebra.
What a particular processor does with that information varies:
- Some processors will build the requested dataset (even downloading data).
- It is also common to provide a dataset explicitly through an API (e.g. query(query_string, dataset)), in which case the processor will ignore the FROM clauses since a dataset has already been provided.
- A dataset might also be supplied in a SPARQL protocol request, in which case, as with the API call, the processor will ignore the FROM and FROM NAMED clauses.
Now, a TDB database is a dataset, but TDB has a special feature called 'dynamic datasets', which uses FROM and FROM NAMED to form what is in effect a sub-dataset, limiting the graphs queried to those mentioned in the FROM clauses.
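The effect can be sketched in a few lines of Python. This is a toy model, not how TDB is implemented: quads are plain tuples, the default graph is taken to be the union of the graphs named in FROM, and default_graph, match_bgp and unify are hypothetical helpers written for this example.

```python
QUADS = [
    ("http://example.com/book2/", "http://example.com/price", "5",
     "http://example.com/A"),
    ("http://example.com/book2/", "http://example.com/title", "Lord of the Rings",
     "http://example.com/B"),
]

PATTERNS = [
    ("?book", "http://example.com/price", "?price"),
    ("?book", "http://example.com/title", "?title"),
]

def default_graph(quads, from_graphs):
    # The default graph is the union of the graphs listed in FROM.
    return {(s, p, o) for (s, p, o, g) in quads if g in from_graphs}

def unify(pattern_term, data_term, binding):
    # Variables (assumed to start with "?") bind consistently; constants must match.
    if pattern_term.startswith("?"):
        return binding.setdefault(pattern_term, data_term) == data_term
    return pattern_term == data_term

def match_bgp(patterns, triples):
    # Naive nested-loop evaluation of a basic graph pattern.
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                if all(unify(p, t, candidate) for p, t in zip(pattern, triple)):
                    extended.append(candidate)
        solutions = extended
    return solutions

both = default_graph(QUADS, {"http://example.com/A", "http://example.com/B"})
only_a = default_graph(QUADS, {"http://example.com/A"})

rows = match_bgp(PATTERNS, both)
prices = [int(row["?price"]) for row in rows]
print(sum(prices) / len(prices))          # the AVG(?price) of the query
print(len(match_bgp(PATTERNS, only_a)))   # no join partner without graph B
```

The pattern itself never mentions a graph, which matches what the algebra shows: only the set of triples it is evaluated against changes.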

Retrieve smallest subgraph for given subjects from RDF graph via SPARQL

As you might see from the formulation of my question title, I am new to SPARQL and RDF in general, so please excuse me for not being able to formulate the question more precisely.
I am looking for a SPARQL query that returns a list of triples describing a connected subgraph (assuming that the queried graph is completely connected). I found a solution at "SPARQL query to construct sub-graph with select paths (paths have different lengths)", but it only works if the given nodes are directly connected. If, for example, I pass nodes that are not directly connected, I get no result. I would like to get a subgraph (as triples) that is allowed to include additional nodes but has to be connected.
Example: Given Graph G=(V,E), with V={A,B,C,D,E,F,G,H} and E as described in the following picture (predicates are not displayed as they are not important). I would like to find a query which when given {C,E,H} (see picture 1) returns the triples that describe the connected subgraph given in picture 2, that is {(C,E),(E,G),(G,H)}.
Obviously there might be multiple ways in which this can be achieved in an arbitrary graph, so in this case there are two possibilities: 1. A structure should be returned that is the minimum spanning tree of those vertices ({C,E,G,H}) or 2. If this is not possible, a structure should be returned from which a minimum spanning tree can be calculated afterwards
Thanks for any help.
The primary misunderstanding, wrt SPARQL, seems to be in the second paragraph. The predicates ARE important. They are the driving force behind RDF and SPARQL - i.e. naming the "relationships", which are the names of edges. As you'll see below, it's not possible to answer the question without knowing something about the predicate names.
But let me try anyway. Assuming all predicates are named :p1, I think what you may want is something like the following:
CONSTRUCT {
:C :p1 :E .
:E :p1 ?G .
?G :p1 :H .
}
WHERE {
:C :p1 :E .
:E :p1 ?G .
?G :p1 :H .
}
The WHERE clause will find the graph pattern, in your specific example finding a solution for ?G. The CONSTRUCT statement then creates your minimal spanning tree.
Another tip for going from graph theory to RDF and SPARQL is to use URIs to uniquely name nodes and edges. The above takes a specific XML-based shortcut that assumes a default prefix has been defined. For example, if the default prefix were <http://example.com/graphTheory/>, then replace the : in the above with the prefix, e.g. <http://example.com/graphTheory/p1>.
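As a rough illustration of that substitution, the following Python sketch expands a prefixed name into a full URI. The expand function and the prefix table are invented for this example; a real parser handles more cases.

```python
def expand(name, prefixes):
    # Split "prefix:local"; the default prefix is the empty string before ":".
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        return name  # variables such as ?G and unknown prefixes are left alone
    return "<" + prefixes[prefix] + local + ">"

PREFIXES = {"": "http://example.com/graphTheory/"}

print(expand(":p1", PREFIXES))
print(expand("?G", PREFIXES))
```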

SPARQL query returns no results when a property path is present

The following query returns some results which have skos:broader set to category:History:
select ?subject
where
{
?subject skos:broader category:History .
}
However, replacing skos:broader with skos:broader+ or skos:broader* returns no results. Why is this? I would expect either to fetch at least the results returned by the first query.
I'm using the SPARQL front end here: http://dbpedia.org/sparql
Virtuoso (the endpoint that DBpedia uses) has some idiosyncrasies, supports some non-standard syntax (which often leads people to wonder why a query that worked on DBpedia doesn't work with other libraries), and (I think) doesn't support all of SPARQL 1.1. This may be a case where you've run into some internal limitations. You can approximate the results that you want with a query like the following, though:
select ?category { ?category skos:broader{,7} category:History }
This only follows paths of length seven or less. The {m,n} notation for property paths isn't part of SPARQL 1.1, but was in early drafts, and Virtuoso supports it. It is convenient for limiting the resources used in answering a query, and this is a good use case for it.
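What a bounded path computes can be sketched as a depth-limited breadth-first search. The Python below is a toy illustration only: the category names are made up, the broader relation is a plain dictionary, and it assumes the omitted lower bound in {,7} means zero steps, so the target itself is included.

```python
def within_steps(broader, target, max_steps):
    # Invert child -> parents into parent -> children so we can walk downwards.
    narrower = {}
    for child, parents in broader.items():
        for parent in parents:
            narrower.setdefault(parent, set()).add(child)
    found = {target}      # the zero-length path
    frontier = {target}
    for _ in range(max_steps):
        frontier = {c for node in frontier for c in narrower.get(node, ())} - found
        if not frontier:
            break
        found |= frontier
    return found

# Hypothetical data: each category lists its skos:broader parents.
BROADER = {
    "category:Ancient_history": ["category:History"],
    "category:Ancient_Rome": ["category:Ancient_history"],
    "category:Roman_law": ["category:Ancient_Rome"],
}

print(sorted(within_steps(BROADER, "category:History", 2)))
```

Capping the depth bounds the work done even when the category graph is large or contains cycles, which is the point of using the bounded form.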