SPARQL queries with blank nodes can be complex

I read the blog article Problems of the RDF model: Blank Nodes, which mentions that using blank nodes can complicate the handling of data.
Can you give me an example of why blank nodes make it difficult to perform a SPARQL query?
I do not understand the complexity of blank nodes.
Can you explain the meaning and semantics of an existential variable?
I do not clearly understand the explanation given in the RDF Semantics Recommendation, section 1.5, Blank Nodes as Existential Variables.

Existential Variables
In the (first-order) predicate calculus, there is existential quantification, which lets us make assertions about things that exist, without saying (or, possibly, knowing) which specific individuals in the domain we're actually talking about. For instance, a sentence like
hasUserId(JoshuaTaylor,1281433)
entails the sentence
∃x.hasUserId(x,1281433)
Of course, there are lots of scenarios in which the second sentence could be true without the first one being true. In that sense, the second sentence gives us less information than the first. It's also important to note that the variable x in the second sentence doesn't provide any way to find out which element in the domain of discourse actually has the given userId. It also doesn't make any claim that there's only one such thing that has the given user id. To make that clearer, we might use an example:
∃y.hasAge(y,29)
This is presumably true, since someone or something out there is age 29. Note that we can't talk about y as the individual that is age 29, though, because there could be lots of them. All this sentence tells us is that there is at least one.
Even though we used different variables in the two sentences, there's nothing to say that the individuals with the specified properties might not be the same. This is particularly important in nested quantification, e.g.,
∃x.∃y.likes(x, y)
This sentence could be true because there is one individual in the domain that likes itself. Just because x and y have different names in the sentence doesn't mean that they might not refer to the same individual.
Blank Nodes as Existential Variables
There is an RDF entailment model defined in RDF Semantics. This has been described more in another Stack Overflow question, RDF Graph Entailment. The idea is that an RDF graph is treated as a big existential quantification over the blank nodes mentioned in the graph. E.g., if the triples in the graph are t1, …, tn, and the blank nodes that appear in those triples are b1, …, bm, then the graph is a formula:
∃b1, …, bm.(t1 ∧ … ∧ tn)
Based on the discussion of the existential variables above, note that this means that blank nodes in the data might refer to the same element of the domain or to different elements, and that it's not required that exactly one element could take the place of a blank node. This means that a graph with blank nodes, when interpreted in this manner, provides much less information than you might expect.
Blank Nodes in Real Data
Now, the discussion above is useful if people are using blank nodes as existential variables. In many cases, though, authors think of them more as anonymous but definite and distinct objects. E.g., if we casually write
@prefix : <https://stackoverflow.com/q/20629437/1281433/> .

:Carol :hasAddress [ :hasNumber 4222 ;
                     :hasStreet :Clinton_Way ] .
we may well be trying to say that there is a single address out there with the specified properties, but according to the RDF entailment model, that's not what we're doing.
In practice, this isn't so much of a problem, because we're usually not using RDF entailment. What is a problem, though, is that since the scope of a blank node is local to its graph, we can't run a SPARQL query against an endpoint asking for Carol's address and get back an IRI that we can reuse. If we run a query like this:
prefix : <https://stackoverflow.com/q/20629437/1281433/>

construct {
  :Mike :hasAddress ?address
}
where {
  :Carol :hasAddress ?address
}
then we get back the following (unhelpful) graph as a result:
@prefix : <https://stackoverflow.com/q/20629437/1281433/> .
:Mike :hasAddress [] .
We won't have a way to get more information about the address because all we have now is a blank node. If we had used IRIs, e.g.,
@prefix : <https://stackoverflow.com/q/20629437/1281433/> .

:Carol :hasAddress :address1267389 .
:address1267389 :hasNumber 4222 ;
                :hasStreet :Clinton_Way .
then the query would have produced something more helpful:
@prefix : <https://stackoverflow.com/q/20629437/1281433/> .
:Mike :hasAddress :address1267389 .
Why is this more useful? The first case is like having the data
∃x.(hasAddress(Carol,x) ∧ hasNumber(x,4222) ∧ hasStreet(x,ClintonWay))
and getting back a result
∃y.hasAddress(Mike,y)
Sure, it's possible that Mike and Carol have the same address, but from these sentences there's no way to know for sure. It's much more helpful to have data like
hasAddress(Carol,address1267389)
hasNumber(address1267389,4222)
hasStreet(address1267389,ClintonWay)
and get back a result
hasAddress(Mike,address1267389)
From this, you know that they have the same address, and you can ask things about it.
Conclusion
How much this will affect your data and its consumers depends on what the typical use cases are. For automatically constructed graphs, it may be hard to know in advance just what kind of data you'll need to be able to refer to later, so it's a good idea to generate IRIs for as many of your resources as you can. Since IRIs are free-form, it's usually not too hard to do this. For instance, if you've got some sensible “base” IRI, e.g.,
http://example.org/myData/
then you can easily append suffixes to identify your resources. E.g.,
http://example.org/myData/addresses/addr1
http://example.org/myData/addresses/addr2
http://example.org/myData/addresses/addr3
http://example.org/myData/individuals/ind34
http://example.org/myData/individuals/ind35
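For instance, here is a minimal rdflib sketch of this minting pattern (rdflib comes up again below; the base IRI, the addr1/ind34 suffixes, and the property names are just illustrative, echoing the address example above):

from rdflib import Graph, Literal, Namespace

# Illustrative base IRI, as suggested above.
DATA = Namespace("http://example.org/myData/")

g = Graph()

# Mint a stable IRI for the address instead of creating a blank node,
# so that later queries can return a reusable identifier.
addr = DATA["addresses/addr1"]
g.add((DATA["individuals/ind34"], DATA.hasAddress, addr))
g.add((addr, DATA.hasNumber, Literal(4222)))
g.add((addr, DATA.hasStreet, DATA["streets/Clinton_Way"]))

A query for the address now returns the IRI http://example.org/myData/addresses/addr1, which other graphs and queries can reuse.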

Related

Functions in SPARQL to Manipulate IRIs?

I want to write some reusable SPARQL queries to do things like take an IRI, get the name part (typically after the # sign), modify it (e.g., replace underscores with spaces), and put it in the rdfs:label property. This would be useful for Protege, which doesn't fill in the rdfs:label if you use user-defined IRIs. Or take an IRI with a user-defined name, do the same, and then replace the user-defined IRI with a UUID. I looked in the SPARQL spec for functions to manipulate IRIs, and either they don't exist or I'm missing something obvious. I'm posting this to make sure it isn't the latter. I know it is easy to do the equivalent with things like SUBSTR, but I'm surprised that there aren't predefined operators for things like getting the name part of an IRI or getting the base, and I want to double-check before I roll my own.
In case anyone else wants to do this, I figured it out. There are some answers on this site, but they are all for SQL or other languages rather than SPARQL. The following is for classes, and it should be obvious how to adapt it for other entities. Note: this works in the Snap SPARQL Plugin for Protege (which is why I use CONSTRUCT rather than INSERT); however, there is a bug in its implementation of SUBSTR so that it uses 0-based indexing rather than 1-based as the spec says. The query below uses SUBSTR(?namewbs, 1) to suit Snap SPARQL; in a standards-compliant engine, change the 1 to a 2.
CONSTRUCT { ?c rdfs:label ?lblname . }
WHERE {
  ?c rdfs:subClassOf owl:Thing .
  BIND(STRAFTER(STR(?c), '#') AS ?name)
  BIND(REPLACE(?name, "([A-Z])", " $1") AS ?namewbs)
  BIND(IF(STRSTARTS(?namewbs, " "), SUBSTR(?namewbs, 1), ?namewbs) AS ?lblname)
  FILTER(?c != owl:Thing && ?c != owl:Nothing)
}

Is it possible to use variables as integers in SPARQL property paths?

I am currently trying to create pointers to datatype values, as they cannot be linked to directly. However, I would like to be able to evaluate the pointers from within the SPARQL environment, which raised some questions for me, specifically in the case where the desired value is part of an ordered rdf:List. My approach is to use property paths within a SPARQL query, in which I can use the individual, the property, and the list index that the pointer has attached to it.
Given the following example data (using Turtle's shorthand syntax for ordered lists):
ex:myObject ex:someProperty ("1" "2" "3") .

ex:myPointer ex:lookAtIndividual ex:myObject ;
             ex:lookAtProperty ex:someProperty ;
             ex:lookAtIndex "3"^^xsd:integer .
Now I would like to create a SPARQL query that -- based on the pointer -- returns the value at the given index. To my understanding the query could/should look something like this:
SELECT ?value
WHERE {
  ex:myPointer ex:lookAtIndividual ?individual ;
               ex:lookAtProperty ?prop ;
               ex:lookAtIndex ?index .
  ?individual ?prop/rdf:rest{?index-1}/rdf:first ?value .
}
But if I try to execute this query with TopBraid, it shows an error message saying that ?index was found where <INTEGER> was expected. I also tried binding the index in the SPARQL query via BIND(?index-1 AS ?i), again without success. If the pointed-to value is not stored in a list, the query works fine without the property path.
Is it in general possible to use a value that is connected via datatype property within a SPARQL query as path length for property paths?
This syntax: rdf:rest{<number>} is not standard SPARQL. So the short answer is, regrettably: no, you can't use variables as integers in SPARQL property paths, for the simple reason that you can't use integers in SPARQL property paths at all.
In an earlier draft of the SPARQL standard, there was a proposal to use this kind of syntax to allow specifying the min and max length of a property path, e.g. rdf:rest{1, 3} would match any paths using rdf:rest properties between length 1 and 3. But this was never fully standardized and most SPARQL engines don't implement it.
If you happen to use a SPARQL engine that does implement it, you will have to get in touch with the developers directly to ask if they can extend the mechanism to allow use of variables in this position (the error message suggests to me that it's currently just not possible).
As an aside: there's a SPARQL 1.2 community initiative going on. It only just got started but one of the proposals on the table is re-introducing this particular piece of functionality to the standard.
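In the meantime, one workaround is to resolve the pointer outside of property paths. Here is a minimal rdflib sketch, under the assumption that ex: expands to http://example.org/ and that the example data above lives in a file called pointer.ttl (both names are hypothetical):

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # assumed expansion of ex:

g = Graph()
g.parse("pointer.ttl", format="turtle")  # hypothetical file with the data above

for pointer in g.subjects(EX.lookAtIndividual, None):
    individual = g.value(pointer, EX.lookAtIndividual)
    prop = g.value(pointer, EX.lookAtProperty)
    index = int(g.value(pointer, EX.lookAtIndex))
    head = g.value(individual, prop)   # head node of the rdf:List
    items = list(g.items(head))        # follows rdf:rest/rdf:first links
    print(items[index - 1])            # the pointer's index is 1-based

This sidesteps the property-path limitation entirely by walking the list in application code rather than in the query.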

What is the benefit of defining datatypes for literals in an RDF graph?

I am using rdflib in Python to build my first RDF graph. However, I do not understand the explicit purpose of defining Literal datatypes. I have pored over the documentation and did my due diligence with Google and the Stack Overflow search, but I cannot seem to find an actual explanation for this. Why not just leave everything as a plain old Literal?
From what I have experimented with, is this so that you can search for explicit terms in your SPARQL query with BIND? Does this also help with FILTERing, i.e., FILTER(?var1 > ?var2), where ?var1 and ?var2 should represent integers/floats/etc.? Does it help with querying speed? Or am I just way off altogether?
Specifically, why add the following triple to mygraph
mygraph.add((amazingrdf, ns['hasValue'], Literal('42.0', datatype=XSD.float)))
instead of just this?
mygraph.add((amazingrdf, ns['hasValue'], Literal("42.0")))
I suspect that there must be some purpose I am overlooking. I appreciate your help and explanations - I want to learn this right the first time! Thanks!
Comparing two xsd:integer values in SPARQL:
ASK { FILTER (9 < 15) }
Result: true. Now with xsd:string:
ASK { FILTER ("9" < "15") }
Result: false, because when sorting strings, 9 comes after 1.
Some equality checks with xsd:decimal:
ASK { FILTER (+1.000 = 01.0) }
Result is true, it’s the same number. Now with xsd:string:
ASK { FILTER ("+1.000" = "01.0") }
False, because they are clearly different strings.
Doing some maths with xsd:integer:
SELECT (1+1 AS ?result) {}
It returns 2 (as an xsd:integer). Now for strings:
SELECT ("1"+"1" AS ?result) {}
It returns "11" as an xsd:string, because adding strings is interpreted as string concatenation (at least in Jena where I tried this; in other SPARQL engines, adding two strings might be an error, returning nothing).
As you can see, using the right datatype is important to communicate your intent to code that works with the data. The SPARQL examples make this very clear, but when working directly with an RDF API, the same kind of issues crop up around object identity, ordering, and so on.
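For instance, in rdflib (mentioned in the question), the same distinctions show up directly as term identity and value conversion; a small sketch:

from rdflib import Literal
from rdflib.namespace import XSD

# Same lexical form, different datatypes: these are different RDF
# terms, so they do not compare equal.
print(Literal("42.0") == Literal("42.0", datatype=XSD.float))  # False

# A typed literal converts to a native Python value; an untyped
# literal stays a string.
print(Literal("42.0", datatype=XSD.float).toPython())  # 42.0 (a float)
print(Literal("42.0").toPython())                      # '42.0' (a str)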
As shown in the examples above, SPARQL offers convenient syntax for xsd:string, xsd:integer and xsd:decimal (and, not shown, for xsd:boolean and for language-tagged strings). That elevates those datatypes above the rest.

Fast lookup of tree with placeholders?

For an application I'm considering, there would be a large (100,000+) 'database' of trees (think expressions in a programming language, or S-expressions), and I would need to query that database for expressions that match a specific given expression.
Before giving the details of what I'd like to have, note that I'd appreciate any information related to indexing a large set of trees for optimizing lookup by a subtree.
In my specific situation (which would be for a backend to be used by Metamath proof assistants), expressions have the following structure (in Haskell-like notation):
data Expression = Placeholder Id | VarName Id | ConstName Id [Expression]
or as a BNF for an S-expression form:
Expression = '?' Id | Id | '(' Id Expression* ')'
where Id is some kind of identifier.
For example, I could have a database with expressions like
(equiv ?ph ?ps)
(not (in (appl (sqrt) (2)) (Q)))
(equiv (eq ?A ?B) (forall ?x (equiv (in ?x ?A) (in ?x ?B))))
In this context, two expressions match if they can be made equal by substitution of expressions for placeholders. So looking up (equiv (eq A (emptyset)) ?ph) in the above mini-database would result in the first and last expressions.
So again: how would I implement fast lookups in a large set of (expression) trees with placeholders? What kind of index data structure could I use?
I would implement the lookup with a trie. Each key would consist of one of the following:
ConstName Identifier
Variable w/ context info
ConstValue
Placeholder
These should be ordered in some fashion: possibly Placeholder, then all ConstNames (alphabetical), then variables (scope ordering, then argument order), then ConstValues (numerical order). As long as there's a concrete ordering for usage in the trie, you're fine.
Traverse the expression's tree, injecting the appropriate keys into the trie as they are encountered. Do this for all the expressions you want to insert into your data structure. When it comes time to query it, you can traverse the trie in a similar fashion, but with a few new rules.
Everything matches a placeholder node. If it matches some other key as well, then you'll need to explore both branches (easily done via a recursive DFS-like approach).
A placeholder matches everything. This is not equivalent to the previous point: here we are talking about placeholders in the query, whereas the previous bullet is about placeholders as trie keys.
Now, this does mean that the search space can somewhat "explode" as you encounter placeholders, but there is one thing you can do to mitigate this in practice: traverse the expression's tree in a breadth-first fashion (both in construction of the trie and in querying). This means that if one of the arguments is a placeholder, you won't have to do a full-depth search of every single subtree that matches the expression so far; instead you jump ahead to the next argument, which may not be a placeholder and will thus greatly prune the search space (compared to matching "everything").
For completeness' sake, let's take one of your examples,
(not (in (appl (sqrt) (2)) (Q)))
and make a trie entry from that-
not -> in -> appl -> "Q" -> sqrt -> 2
adding (not (in ?ph E)) to this would result in:
not -> in -> appl -> "Q" -> sqrt -> 2
         \-> ?ph -> "E"
Continue in this fashion injecting expressions into the trie. Also traverse in this fashion for querying until you reach the ends of your searches into the trie, and return those that matched.
Note: the uniqueness of these entries is based on the assumption that you do not have to support variadic functions. If you do, attach some context info to each key (read the next paragraphs for how to do this) to distinguish which arguments go to which functions.
There is one detail I glossed over: variables. If you only want a match when variables have the exact same name, then no extra work is necessary. But this likely isn't what you want; you probably want it to match generic variables as long as they are "consistent" with each other. The way to do this is to assign each variable an identifier that represents the scope in which it was first defined.
The easiest way to do this is just to compose an identifier from the concatenation of the argument ordering of its ancestors. That is, if a variable is first defined as the second argument to a function which is the fifth argument to the root function, then we might label it as (5, 2) or (2, 5), whichever makes more sense intuitively. Either way, this ensures the variable is given a consistent identifier regardless of other variables and functions elsewhere. Then proceed as normal with this new variable name.
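To make this concrete, here is a minimal Python sketch of the trie described above. For brevity it flattens expressions in preorder (depth-first) rather than the breadth-first order recommended earlier, and it skips the variable-scoping refinement, so all placeholders act alike; the names and the nested-list encoding are just illustrative:

END = object()  # terminal marker; holds the stored expressions

def flatten(expr):
    # Preorder token list; operator tokens carry their arity so the
    # extent of a subtree can be recovered from the token stream.
    if isinstance(expr, list):                     # (id arg ...)
        toks = [('op', expr[0], len(expr) - 1)]
        for arg in expr[1:]:
            toks += flatten(arg)
        return toks
    if expr.startswith('?'):                       # placeholder
        return [('?',)]
    return [('atom', expr)]                        # bare identifier

def insert(trie, expr):
    node = trie
    for tok in flatten(expr):
        node = node.setdefault(tok, {})
    node.setdefault(END, []).append(expr)

def skip_subtree(toks):
    # Drop exactly one complete subtree from a token list.
    pending, i = 1, 0
    while pending:
        tok = toks[i]
        pending += (tok[2] if tok[0] == 'op' else 0) - 1
        i += 1
    return toks[i:]

def subtree_ends(node):
    # All trie nodes reachable by consuming one complete stored subtree.
    ends = []
    for tok, child in node.items():
        if tok is END:
            continue
        frontier = [child]
        if tok[0] == 'op':
            for _ in range(tok[2]):                # one pass per argument
                frontier = [e for n in frontier for e in subtree_ends(n)]
        ends += frontier
    return ends

def lookup(trie, query):
    results = []
    def match(node, toks):
        if not toks:
            results.extend(node.get(END, []))
            return
        tok = toks[0]
        if tok == ('?',):                  # query placeholder matches
            for nxt in subtree_ends(node): # any stored subtree
                match(nxt, toks[1:])
            return
        if tok in node:                    # exact key match
            match(node[tok], toks[1:])
        if ('?',) in node:                 # stored placeholder matches
            match(node[('?',)], skip_subtree(toks))
    match(trie, flatten(query))
    return results

Using the mini-database from the question:

db = {}
for e in [['equiv', '?ph', '?ps'],
          ['not', ['in', ['appl', ['sqrt'], ['2']], ['Q']]],
          ['equiv', ['eq', '?A', '?B'],
           ['forall', '?x', ['equiv', ['in', '?x', '?A'],
                                      ['in', '?x', '?B']]]]]:
    insert(db, e)

print(lookup(db, ['equiv', ['eq', 'A', ['emptyset']], '?ph']))
# -> the first and third (last) expressions, as expected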

What's the meaning of hash sign (#) in SPARQL?

In SPARQL, I often see usage of # at the end of prefix definitions, like this:
@prefix dt: <http://example.org/datatype#> .
What's the purpose? I couldn't find this in the SPARQL documentation.
Your example seems to be in Turtle, as in SPARQL the syntax would be:
PREFIX dt: <http://example.org/datatype#>
But it’s the same idea: Instead of having to use full IRIs in your query, you can use prefixed names:
In your example, the prefix label is dt. It’s mapped to the IRI http://example.org/datatype#.
In your query, it might get used as dt:foobar, where foobar is called the local part.
The mapped IRI from the prefix label and the local part get concatenated to form the "actual" IRI:
http://example.org/datatype# + foobar =
http://example.org/datatype#foobar
(Instead of using dt:foobar, you could also use <http://example.org/datatype#foobar>.)
So the # just happens to be part of the IRI design. It’s a popular way to structure vocabulary IRIs in the Semantic Web. The other popular way is using a /. See HashVsSlash.
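The same expansion happens in RDF libraries; for example, a tiny rdflib sketch (the dt label is from the example above):

from rdflib import Namespace

# The prefix label maps to an IRI; attribute access appends the local
# part, performing exactly the concatenation described above.
dt = Namespace("http://example.org/datatype#")
print(dt.foobar)  # http://example.org/datatype#foobar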