MarkLogic seems not to follow the RDF specification

Write the following RDF data into MarkLogic:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://John> <http://have> 10 .
<http://Peter> <http://have> "10"^^xsd:double .
Then execute the following query:
SELECT *
WHERE { ?s ?p 10. }
The query result is:
?s
?p
<http://John>
<http://have>
<http://Peter>
<http://have>
According to the RDF Concepts document, two literals are the same RDF term if and only if their lexical form and datatype are both the same:
Literal term equality: Two literals are term-equal (the same RDF literal) if and only if the two lexical forms, the two datatype IRIs, and the two language tags (if any) compare equal, character by character. Thus, two literals can have the same value without being the same RDF term.
https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal
It seems that MarkLogic does not follow the RDF specification?
The expected query result is:
?s
?p
<http://John>
<http://have>
Alternatives
Executing the SPARQL query in Query Console and through the Java Client API both give the unexpected query result.
This query can return the expected query result in Apache Jena and RDF4j.
Can someone give me an answer or a hint about it?

A triple store that does that is not automatically wrong.
Try:
{ ?s ?p ?x . FILTER (sameTerm(?x, 10)) }
{ ?s ?p ?x . FILTER (?x = 10) }
Also try with 010, +10, and "010"^^xsd:integer,
and explore the lexical forms:
FILTER(str(?x) = "10")
The use cases for both value-based matching and term-based matching exist.
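The difference between the two kinds of matching can be sketched outside any triple store. The Python below is only an illustration: the Literal tuple and the helper names are made up for this example, and it covers just xsd:integer and xsd:double. It is not how MarkLogic or Jena implement matching.

```python
from typing import NamedTuple, Optional

XSD = "http://www.w3.org/2001/XMLSchema#"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    lexical: str
    datatype: str
    lang: Optional[str] = None


def same_term(a: Literal, b: Literal) -> bool:
    # Term equality: lexical form, datatype IRI and language tag must all match.
    return a == b


def value_of(lit: Literal):
    # Map the lexical form to a value; only two datatypes are covered here.
    if lit.datatype == XSD + "integer":
        return int(lit.lexical)
    if lit.datatype == XSD + "double":
        return float(lit.lexical)
    raise ValueError(f"unsupported datatype: {lit.datatype}")


def value_equal(a: Literal, b: Literal) -> bool:
    # Value equality: compare the numbers the literals denote.
    return value_of(a) == value_of(b)


john = Literal("10", XSD + "integer")      # the plain 10 in the data
peter = Literal("10", XSD + "double")      # "10"^^xsd:double
padded = Literal("010", XSD + "integer")   # "010"^^xsd:integer

print(same_term(john, peter), value_equal(john, peter))    # False True
print(same_term(john, padded), value_equal(john, padded))  # False True
```

A store that matches by value behaves like value_equal here, while sameTerm in a FILTER corresponds to same_term.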
By design, the SPARQL query specification does not say whether the data loaded is exactly the data queried. Several stores do some variation of value processing as their default behaviour. Jena does so for TDB2 but keeps datatypes apart (TDB1 conflated datatypes, which was less popular).
It depends on what use case they are targeting. If you want a particular effect, use a FILTER form of the query, assuming the data was not pre-processed on loading.
See the answer on Unexpected data writing result in MarkLogic
See https://www.w3.org/TR/sparql11-entailment/#CanonicalLit for the discussion that D-entailment gives many, many answers even with a restriction to canonical lexical forms.

Related

How to determine whether two SPARQL queries are identical using Python?

When using SPARQL to query an RDF dataset, the same query can be written in many different ways. For example, SPARQL queries are permutation-invariant with respect to some of the clauses inside them, and the variables inside a query can be renamed. But how can we identify such identical SPARQL queries? Ideally, there would be a Python package that parses a SPARQL query (i.e., a string object) into a query object, so that different strings that share the same underlying query are parsed into the same object. We could then simply compare the parsed query objects to determine whether two SPARQL queries are identical. Is there any tool like this (prepareQuery() in rdflib doesn't seem to work this way)? If not, what should I do?
Semantically identical queries example:
SELECT ?x WHERE { ?x foaf:haha ?k .\n ?person foaf:knows ?x .}
SELECT ?s WHERE { ?person foaf:knows ?s .\n ?s foaf:haha ?k .}
The paper "Generating SPARQL Query Containment Benchmarks using the SQCFramework" by Muhammad Saleem et al. mentions "SPARQL query containment solvers", where
Query containment is the problem of deciding if the result set of a query Q1 is included
in the result set of another query Q2
If you use such a solver to test whether the result set of Q1 is a subset of Q2 and vice versa, you have established that they are semantically identical.
As for your "off-the-shelf tool": the former paper mentions that such solvers are tested in another paper, "Evaluating and benchmarking SPARQL query containment solvers" by M.W. Chekol et al.
As for the complexity and computability, the latter paper mentions:
The query containment problem for full SPARQL is undecidable [15, 1]. Hence, it is necessary to reduce SPARQL in order to consider it. A double exponential upper bound has been proven for the containment and equivalence problems of SPARQL queries without OPTIONAL, FILTER and under set semantics [7].
However, query containment in both directions is only one way to determine identity of queries. I am unaware whether there is a proof of a better complexity/computability for query identity than for query containment (or a proof on the contrary).
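For the narrow case shown in the question (triple patterns reordered and variables renamed, nothing else), a brute-force check is possible without a containment solver. The sketch below is an assumption-laden illustration, not an rdflib feature: queries are given as hand-written lists of triple-pattern tuples rather than parsed from strings, and the function name is made up. It tries every one-to-one variable renaming, so it is only practical for a handful of variables, and it says nothing about queries that are equivalent for deeper semantic reasons.

```python
from itertools import permutations


def is_var(term: str) -> bool:
    return term.startswith("?")


def same_up_to_renaming(select1, patterns1, select2, patterns2) -> bool:
    """True if query 1 becomes query 2 under some one-to-one variable renaming.

    Pattern order is ignored; the order of the projected variables is kept.
    """
    vars1 = sorted({t for p in patterns1 for t in p if is_var(t)} | set(select1))
    vars2 = sorted({t for p in patterns2 for t in p if is_var(t)} | set(select2))
    if len(vars1) != len(vars2):
        return False
    target = set(patterns2)
    for perm in permutations(vars2):
        rename = dict(zip(vars1, perm))
        renamed = {tuple(rename.get(t, t) for t in p) for p in patterns1}
        if renamed == target and [rename[v] for v in select1] == list(select2):
            return True
    return False


q1 = (["?x"], [("?x", "foaf:haha", "?k"), ("?person", "foaf:knows", "?x")])
q2 = (["?s"], [("?person", "foaf:knows", "?s"), ("?s", "foaf:haha", "?k")])
q3 = (["?k"], [("?person", "foaf:knows", "?s"), ("?s", "foaf:haha", "?k")])

print(same_up_to_renaming(*q1, *q2))  # True
print(same_up_to_renaming(*q1, *q3))  # False: a different variable is projected
```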

Is there a better way to do case insensitive queries? [duplicate]

I am using Jena ARQ to write a SPARQL query against a large ontology being read from Jena TDB in order to find the types associated with concepts based on rdfs label:
SELECT DISTINCT ?type WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> "aspirin" .
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
This works pretty well and is actually quite speedy (<1 second). Unfortunately, for some terms, I need to perform this query in a case-insensitive way. For instance, because the label "Tylenol" is in the ontology, but not "tylenol", the following query comes up empty:
SELECT DISTINCT ?type WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> "tylenol" .
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
I can write a case-insensitive version of this query using FILTER syntax like so:
SELECT DISTINCT ?type WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?term .
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
FILTER ( regex (str(?term), "tylenol", "i") )
}
But now the query takes over a minute to complete! Is there any way to write the case-insensitive query in a more efficient manner?
Of all the string operators that you can use in SPARQL, regex is probably the most expensive one. Your query might run faster if you avoid regex and instead use UCASE or LCASE so that both sides of the test are in the same case. Something like:
SELECT DISTINCT ?type WHERE {
?x <http://www.w3.org/2000/01/rdf-schema#label> ?term .
?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
FILTER (lcase(str(?term)) = "tylenol")
}
This might be faster but in general do not expect great performance for text search with any triple store. Triple stores are very good at graph matching and not so good at string matching.
The query with the FILTER runs slower because ?term is unbound: the engine has to scan the PSO or POS index to find all statements with the rdfs:label predicate and then test each one against the regex. When the object was bound to a concrete value (as in your first example), it could use an OPS or POS index to scan only the statements with the rdfs:label predicate and the specified object, which has a much lower cardinality.
The common solution to this type of text searching problem is to use an external text index. In this case, Jena provides a free text index called LARQ, which uses Lucene to perform the search and joins the results with the rest of the query.
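The trade-off between scanning every label and looking a term up in a prepared index can be sketched in plain Python. This is only an illustration over a made-up in-memory list of triples; it does not use Jena, TDB or LARQ, and the function names are hypothetical. It also shows that an unanchored regex and an LCASE equality test are not the same query: the regex matches substrings as well.

```python
import re
from collections import defaultdict

LABEL, TYPE = "rdfs:label", "rdf:type"

# Made-up triples for illustration only.
triples = [
    ("ex:d1", LABEL, "Tylenol"),
    ("ex:d1", TYPE, "ex:Drug"),
    ("ex:d2", LABEL, "Tylenol PM"),
    ("ex:d2", TYPE, "ex:Brand"),
    ("ex:d3", LABEL, "aspirin"),
    ("ex:d3", TYPE, "ex:Drug"),
]


def types_of(subjects):
    return {o for s, p, o in triples if p == TYPE and s in subjects}


def types_by_regex_scan(term):
    # Like FILTER(regex(str(?term), "...", "i")): every label is tested.
    pattern = re.compile(term, re.IGNORECASE)
    subjects = {s for s, p, o in triples if p == LABEL and pattern.search(o)}
    return types_of(subjects)


# Built once, up front: lower-cased label -> subjects carrying that label.
label_index = defaultdict(set)
for s, p, o in triples:
    if p == LABEL:
        label_index[o.lower()].add(s)


def types_by_lowercase_index(term):
    # Like an exact match on LCASE(label), answered by a single lookup.
    return types_of(label_index.get(term.lower(), set()))


print(sorted(types_by_regex_scan("tylenol")))       # ['ex:Brand', 'ex:Drug']
print(sorted(types_by_lowercase_index("tylenol")))  # ['ex:Drug']
```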

How to list all graphs in Virtuoso?

When I go to http://localhost:8890/sparql/, there are two fields: Default Data Set Name (Graph IRI) and query. How can I list all the graphs (the values that go in the former field) that are available in my DB? The field is not mandatory and I can just run a query against all namespaces, but I would like to know how to list the graphs available.
The only non-empty graph I was able to run was http://localhost:8890/sparql
For example, in a relational database environment, I believe this kind of info could be retrieved from system tables.
As noted in comments, this query will get you a list of all Named Graphs (which, as also noted, are not the same as "namespaces") in the targeted store --
SELECT DISTINCT ?g
WHERE { GRAPH ?g {?s ?p ?o} }
ORDER BY ?g
You can see live results (limited here to 100 graph names) on the DBpedia endpoint (a very short list, as you would expect) and on URIBurner (a much longer and more varied list).
I know this is an old question, but I was facing the same problem and thought someone else might benefit from the solution I found.
I was trying to execute this query to list all graphs and it was taking about 5 minutes to return a result:
SELECT DISTINCT ?g
WHERE {
GRAPH ?g {?s ?p ?o}
}
Instead, you should try the following query suggested in this documentation from Virtuoso:
SELECT DISTINCT ?g
WHERE {
GRAPH ?g {?s a ?o}
}
This query finished in less than 1 second and returned the list of graphs in the store. Of course, this query would only return graphs which have at least one triple with the predicate rdf:type, but it's still a big improvement on the one suggested by TallTed.
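The caveat about rdf:type can be made concrete with a small Python sketch over made-up quads (graph, subject, predicate, object). The data and function names are hypothetical and nothing here talks to Virtuoso; it only mirrors what the two queries select.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Made-up quads: (graph, subject, predicate, object).
quads = [
    ("http://example.org/g1", "ex:s1", RDF_TYPE, "ex:Thing"),
    ("http://example.org/g1", "ex:s1", "ex:name", "one"),
    ("http://example.org/g2", "ex:s2", "ex:name", "two"),
]


def graphs_any_triple(quads):
    # Mirrors GRAPH ?g { ?s ?p ?o }: every graph holding at least one triple.
    return sorted({g for g, _s, _p, _o in quads})


def graphs_with_type(quads):
    # Mirrors GRAPH ?g { ?s a ?o }: only graphs holding an rdf:type triple.
    return sorted({g for g, _s, p, _o in quads if p == RDF_TYPE})


print(graphs_any_triple(quads))  # ['http://example.org/g1', 'http://example.org/g2']
print(graphs_with_type(quads))   # ['http://example.org/g1']
```

Here g2 is missed by the faster form because it contains no rdf:type triple.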

SPARQL query returns no results when a property path is present

The following query returns some results which have skos:broader set as category:History
select ?subject
where
{
?subject skos:broader category:History .
}
However, replacing skos:broader with skos:broader+ or skos:broader* returns no results. Why is this? I would expect either to fetch at least the results returned in the first query.
I'm using the SPARQL front end here: http://dbpedia.org/sparql
Virtuoso (the endpoint that DBpedia uses) has some idiosyncrasies, supports some non-standard syntax (which often leads people to wonder why a query that worked on DBpedia doesn't work with other libraries), and (I think) doesn't support all of SPARQL 1.1. This may be a case where you've run into some internal limitations. You can approximate the results that you want with a query like the following, though:
select ?category { ?category skos:broader{,7} category:History }
This only follows paths of length seven or less. The {m,n} notation for property paths isn't part of SPARQL 1.1, but it was in early drafts, and Virtuoso supports it. It is convenient for limiting the resources used in answering a query, and this is a good use case for it.
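What a length-bounded path does can be sketched as a depth-limited breadth-first search. The Python below is an illustration over a made-up skos:broader mapping, not a description of Virtuoso's evaluation strategy; the function name is hypothetical, and the sketch only counts paths of one hop or more, leaving zero-length paths aside.

```python
from collections import deque


def narrower_within(broader, root, max_hops):
    """Categories that reach root via 1..max_hops skos:broader steps.

    broader maps each category to the set of its direct broader categories.
    """
    # Invert the mapping so the search can walk from root towards narrower ones.
    children = {}
    for child, parents in broader.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)

    found = set()
    frontier = deque([(root, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop budget used up on this branch
        for child in children.get(node, ()):
            if child not in found:
                found.add(child)
                frontier.append((child, depth + 1))
    return found


# Made-up hierarchy: C -> B -> A -> History
broader = {"A": {"History"}, "B": {"A"}, "C": {"B"}}
print(sorted(narrower_within(broader, "History", 2)))  # ['A', 'B']
print(sorted(narrower_within(broader, "History", 7)))  # ['A', 'B', 'C']
```

The hop limit bounds the work done even when the category graph is large or contains cycles.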

DBpedia query returns some musicals more than once despite filters

I'm trying to use a SPARQL query against DBpedia to retrieve a list of musicals and some associated properties. However, despite using the appropriate filters (as far as I can tell), the results include many of the musicals more than once. Here is my query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?label ?abstract ?book ?music ?lyrics
WHERE {
?play <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Broadway_musicals> ;
rdfs:label ?label ;
dbo:abstract ?abstract ;
dbpprop:book ?book ;
dbpprop:lyrics ?lyrics ;
dbpprop:music ?music .
FILTER (LANG(?label) = 'en')
FILTER (LANG(?abstract) = 'en')
FILTER (LANG(?book) = 'en')
FILTER (LANG(?lyrics) = 'en')
FILTER (LANG(?music) = 'en')
}
The resulting list has many duplicate entries. If you paste the query into the DBpedia SPARQL Explorer, you'll see that, starting with 'Mamma Mia!', there are a lot of duplicates in the list.
Any idea what I'm missing to get unique results with no duplicates? Thanks!
[Edited by glenn mcdonald to clarify that it's musicals which are "duplicated" here, not triples.]
SPARQL returns variable-bindings. Your "duplicates" are cartesian products of multiples in your projected properties. Mamma Mia has multiple music writers and multiple lyricists, so you get every possible combination of them that could produce a row in your table.
What a pain, huh? The "solution" is to use CONSTRUCT instead of SELECT, and deal with getting back a graph instead of a table. Maybe like this:
http://dbpedia.org/snorql/?query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0A++++PREFIX+dbo%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fontology%2F%3E%0D%0A++++PREFIX+dbpprop%3A+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2F%3E%0D%0A++++CONSTRUCT+%7B%0D%0A++++++++%3Fplay+rdfs%3Alabel+%3Flabel+%3B%0D%0A++++++++++++dbo%3Aabstract+%3Fabstract+%3B%0D%0A++++++++++++dbpprop%3Abook+%3Fbook+%3B%0D%0A++++++++++++dbpprop%3Alyrics+%3Flyrics+%3B%0D%0A++++++++++++dbpprop%3Amusic+%3Fmusic+.%0D%0A++++%7D%0D%0A++++WHERE+%7B+%0D%0A++++++++%3Fplay+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FCategory%3ABroadway_musicals%3E+%3B%0D%0A++++++++++++rdfs%3Alabel+%3Flabel+%3B%0D%0A++++++++++++dbo%3Aabstract+%3Fabstract+%3B%0D%0A++++++++++++dbpprop%3Abook+%3Fbook+%3B%0D%0A++++++++++++dbpprop%3Alyrics+%3Flyrics+%3B%0D%0A++++++++++++dbpprop%3Amusic+%3Fmusic+.%0D%0A++++++++FILTER+%28LANG%28%3Flabel%29+%3D+%27en%27%29++++%0D%0A++++++++FILTER+%28LANG%28%3Fabstract%29+%3D+%27en%27%29%0D%0A++++++++FILTER+%28LANG%28%3Fbook%29+%3D+%27en%27%29%0D%0A++++++++FILTER+%28LANG%28%3Flyrics%29+%3D+%27en%27%29%0D%0A++++++++FILTER+%28LANG%28%3Fmusic%29+%3D+%27en%27%29%0D%0A++++%7D
Are the duplicates exact duplicates, i.e. is every value for every variable of each duplicate result identical?
If so, add the DISTINCT keyword after SELECT to force the SPARQL engine to discard duplicate solutions.
If not, then Glenn is entirely correct: because there are multiple values for the various properties, you will get multiple results. There are complex workarounds using subqueries, GROUP BY, etc., but they tend to lead to less efficient queries. Sometimes you just have to deal with the duplicates on the client side.
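Dealing with it on the client side could look like the following Python sketch. The rows are made up to mimic the shape of a SELECT result (the names are placeholders, not DBpedia data), and merge_rows is a hypothetical helper: it first shows how two composers and two lyricists multiply into four rows, then folds the rows back into one record per label.

```python
from collections import defaultdict
from itertools import product

# Placeholder values standing in for multi-valued properties of one musical.
music = ["Composer A", "Composer B"]
lyrics = ["Lyricist A", "Lyricist B"]

# A SELECT yields one row per combination, hence the apparent duplicates.
rows = [
    {"label": "Mamma Mia!", "music": m, "lyrics": l}
    for m, l in product(music, lyrics)
]
print(len(rows))  # 4


def merge_rows(rows, key="label"):
    """Collapse rows sharing the same key into one record of sorted value lists."""
    merged = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for column, value in row.items():
            if column != key:
                merged[row[key]][column].add(value)
    return {
        k: {column: sorted(values) for column, values in columns.items()}
        for k, columns in merged.items()
    }


print(merge_rows(rows))
```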