SPARQL 1.1 entailment regimes and query with FROM clause - sparql

I'm currently documenting and testing the SPARQL 1.1 entailment regimes, and the recommendation repeatedly states that
The scoping graph is graph-equivalent to the active graph
but it does not specify what the active graph refers to: is it the dataset used in the query, or a union of all graphs in the store?
As a test, I loaded the following graph, named <http://www.example.org/>, into a Sesame memory store with RDF Schema and direct type inferencing (v2.7.14):
@prefix ex: <http://www.example.org/> .
ex:book1 rdf:type ex:Publication .
ex:book2 rdf:type ex:Article .
ex:Article rdfs:subClassOf ex:Publication .
ex:publishes rdfs:range ex:Publication .
ex:MITPress ex:publishes ex:book3 .
I first tried the following query (no FROM clause, so it runs against the default graph and thus uses the inference engine):
SELECT ?s WHERE { ?s a ex:Publication . }
As expected, it returns all three instances:
<http://www.example.org/book1>
<http://www.example.org/book2>
<http://www.example.org/book3>
while the query :
SELECT ?s FROM ex: WHERE { ?s a ex:Publication . }
returns only
<http://www.example.org/book1>
Under these circumstances, shouldn't both results be the same?
What should happen (according to the recommendation) if data and schema are split between two graphs in the store (like <urn:rdfs-schema> and <urn:data>, or even scattered across more graphs) and the query uses both graphs (or a subset of schema-related graphs) in the FROM clause instead of the default graph?
In other words, should the inferencing be global throughout the store, or does it depend on the query dataset?
Or is the recommendation loose enough to make this an implementation-dependent issue?
Thanks for your insights,
Max.
EDIT this question is being redirected to SPARQL 1.1 entailment regimes and query with FROM clause (follow-up)

Your second query returns only book1 because in Sesame's RDFS inferencer, entailed statements are inserted in the default graph, not in the named graph(s) from which the premises for the entailment come. So the entailed results are simply not present in the graph you are querying.
The reason for this design choice is at least partly historic, as the Sesame RDFS inference engine predates the W3C notion of entailment regimes. The rationale at the time was that in the case of inferencing over several named graphs (where e.g. one premise comes from graph A and another from B), insertion in the default graph (rather than in either A, or B, or both) was simplest and caused the least confusion.
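The behaviour described above can be illustrated with a toy model. This is only a sketch in plain Python, not Sesame code: the dictionary-based store, the two hard-coded RDFS rules, and the assumption that a query without FROM sees the union of all graphs are simplifications made for illustration.

```python
# Toy quad store: graph name -> set of triples (None stands for the default graph).
RDF_TYPE, SUBCLASS, RANGE = "rdf:type", "rdfs:subClassOf", "rdfs:range"
EX = "http://www.example.org/"

store = {
    EX: {
        ("ex:book1", RDF_TYPE, "ex:Publication"),
        ("ex:book2", RDF_TYPE, "ex:Article"),
        ("ex:Article", SUBCLASS, "ex:Publication"),
        ("ex:publishes", RANGE, "ex:Publication"),
        ("ex:MITPress", "ex:publishes", "ex:book3"),
    },
    None: set(),  # default graph, initially empty
}

def all_triples(store):
    """Union of every graph in the store (assumed to be what a query without FROM sees)."""
    return set().union(*store.values())

def infer_into_default_graph(store):
    """Apply the subclass and range rules to a fixpoint; write results to the default graph only."""
    changed = True
    while changed:
        known = all_triples(store)
        new = set()
        for (c, p, d) in known:
            if p == SUBCLASS:   # x rdf:type c  =>  x rdf:type d
                new |= {(s, RDF_TYPE, d) for (s, p2, o) in known if p2 == RDF_TYPE and o == c}
            elif p == RANGE:    # x c y  =>  y rdf:type d
                new |= {(o, RDF_TYPE, d) for (s, p2, o) in known if p2 == c}
        new -= known
        store[None] |= new      # entailed triples never go into the named graph
        changed = bool(new)

def instances_of(triples, cls):
    return sorted(s for (s, p, o) in triples if p == RDF_TYPE and o == cls)

infer_into_default_graph(store)
without_from = instances_of(all_triples(store), "ex:Publication")  # all three books
with_from = instances_of(store[EX], "ex:Publication")              # only ex:book1
```

In this model the named graph still holds only the asserted triples after inferencing, which is why restricting the query to that graph drops book2 and book3.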
Sesame currently does not explicitly support the W3C entailment regimes specification. However, if you feel that a simple improvement would be possible to make it more compatible, by all means log a feature request.
(disclosure: Sesame developer)

What exactly is in the default graph isn't specified by the SPARQL 1.1 standard. In particular, see 13.1 Examples of RDF Datasets which mentions that:
The definition of RDF Dataset does not restrict the relationships of
named and default graphs. Information can be repeated in different
graphs; relationships between graphs can be exposed. Two useful
arrangements are:
to have information in the default graph that includes provenance information about the named graphs
to include the information in the named graphs in the default graph as well.
However, you can control this yourself: use a FROM clause to specify which graph should be the default graph, or use multiple FROM clauses to specify which graphs should be merged to form the default graph.
That's all concerning the default graph. The active graph is another term you'll see in the SPARQL 1.1 spec:
The graph that is used for matching a basic graph pattern is the
active graph. In the previous sections, all queries have been shown
executed against a single graph, the default graph of an RDF dataset
as the active graph. The GRAPH keyword is used to make the active
graph one of all of the named graphs in the dataset for part of the
query.
So, you can use FROM (possibly multiple times) to control the default graph, and thereby the initial active graph, and then GRAPH { … } within a query to change the active graph.
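As a rough sketch of that idea in plain Python: the helper names `build_dataset` and `match` are hypothetical, the graph names <urn:rdfs-schema> and <urn:data> are borrowed from the question, and the merge is a plain set union that ignores blank node renaming.

```python
def build_dataset(store, from_graphs=(), from_named=()):
    """Build (default_graph, named_graphs) for one query from FROM / FROM NAMED lists."""
    default = set()
    for name in from_graphs:            # several FROM clauses are merged into one graph
        default |= store.get(name, set())
    named = {name: store.get(name, set()) for name in from_named}
    return default, named

def match(active_graph, predicate):
    """Match a single triple pattern '?s <predicate> ?o' against the active graph."""
    return sorted((s, o) for (s, p, o) in active_graph if p == predicate)

store = {
    "urn:rdfs-schema": {("ex:Article", "rdfs:subClassOf", "ex:Publication")},
    "urn:data": {("ex:book2", "rdf:type", "ex:Article")},
}

default, named = build_dataset(
    store,
    from_graphs=["urn:data", "urn:rdfs-schema"],  # FROM <urn:data> FROM <urn:rdfs-schema>
    from_named=["urn:data"],                      # FROM NAMED <urn:data>
)

# Outside any GRAPH block the active graph is the merged default graph.
in_default = match(default, "rdfs:subClassOf")
# Inside GRAPH <urn:data> { ... } the active graph is that named graph only.
in_named = match(named["urn:data"], "rdfs:subClassOf")
```

Here the schema triple is visible while the default graph is active, but not inside the GRAPH block, because the active graph has switched to <urn:data>.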

Related

Realizing and Retrieving information from a producer-consumer OWL model

I have the following scenario modelled in OWL:
Producer ----producesResource---> Resource <------consumesResource ---- Consumer
Producer, Resource and Consumer are OWL classes, while producesResource and consumesResource are object properties. The scenario is quite intuitive in that each producer produces one or more resources that are consumed by one or more consumers. Conversely, each consumer can consume one or more resources. The ontology is populated with instances / individuals accordingly.
I would like to check if there exists a resource that is consumed by a consumer that is not produced by a Producer. What is an elegant way to get this information via a:
Query in SPARQL
SHACL Shapes graph (if possible).
Negation is possible in SPARQL 1.0 using OPTIONAL with a !BOUND filter, or more easily in SPARQL 1.1 using MINUS:
SELECT ?resource WHERE
{
?resource a :Resource.
?consumer a :Consumer;
:consumesResource ?resource.
MINUS {?producer a :Producer; :producesResource ?resource.}
}
You can also use ASK to get a boolean result but SELECT allows easier debugging to verify if your query is working correctly.
As SHACL allows integrating SPARQL queries, this answers your second question too but in that case it is easier to just use the SPARQL query on its own.
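To make the intent of the MINUS query easier to check by hand, here is the same logic as a set difference in plain Python. It is only a sketch: the individuals :r1, :r2, :p1 and :c1 are made up, and "a" stands in for rdf:type.

```python
triples = {
    (":r1", "a", ":Resource"),
    (":r2", "a", ":Resource"),
    (":p1", "a", ":Producer"),
    (":c1", "a", ":Consumer"),
    (":p1", ":producesResource", ":r1"),
    (":c1", ":consumesResource", ":r1"),
    (":c1", ":consumesResource", ":r2"),
}

def consumed_but_not_produced(triples):
    """Resources consumed by some Consumer for which no Producer produces them."""
    def typed(cls):
        return {s for (s, p, o) in triples if p == "a" and o == cls}

    resources, consumers, producers = typed(":Resource"), typed(":Consumer"), typed(":Producer")
    consumed = {o for (s, p, o) in triples
                if p == ":consumesResource" and s in consumers and o in resources}
    produced = {o for (s, p, o) in triples
                if p == ":producesResource" and s in producers}
    return sorted(consumed - produced)   # the part that MINUS removes is `produced`

orphans = consumed_but_not_produced(triples)
```

With this sample data only :r2 is left, because :r1 has a producer.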

How to make a selection of a giant ontology, built from several aligned reference ontologies?

My organization has an information requirement spanning several information domains. To capture this, we are building a large organization ontology in which we align several domain-specific reference ontologies / vocabularies (think of Dublin Core, GeoSPARQL, industry-specific information models, etc.) and, where necessary, we add concepts in an 'extension' ontology (which is then also aligned with the reference ontologies).
The totality of this aligned ontology (>3000 classes and >10000 ObjectProperties) contains both unused concepts and semantic duplicates, and is impossible for a newcomer to navigate. Furthermore, the organization wishes to standardize the use of specific concepts, so duplicates are extremely undesirable. We are therefore looking for a way to construct the SuperAwesomeOntology that contains all concepts (and their OWL-related predicates like subClassOf, domain/range etc.) that have been labeled (maybe by something like dcterms:isRequiredBy "SuperAwesomeOntology"). The result should be a correct OWL ontology that can be stored in a single file.
One constraint: it has to be done programmatically (the copy/move/delete axioms interface of Protégé won't do), because if one of the reference ontologies gets an update, we want to be able to render the SuperAwesomeOntology again from its most up-to-date reference ontologies and find out if there are any conflicts.
How would we go about this? Could SPARQL do this, and if so, how? Alternative suggestions to the isRequiredBy labeling are also welcome.
If I understand you correctly, you want to programmatically remove unused concepts from a large ontology or a collection of ontologies/graphs and you also want to remove concepts/classes that you identified as duplicates via interlinking.
Identified duplicates are easy to remove:
Define what a duplicate is for you. For example, nodes at either end of an owl:sameAs or skos:closeMatch link that are outside of your core graph (so you don't remove the "original").
Construct the new graph using a SPARQL query:
construct {?s ?p ?o.}
{
?s ?p ?o.
filter not exists {graph ?g {?s owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
filter not exists {graph ?g {?o owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
}
I tested this query for syntax and performance but not for correctness.
Unused concepts are more difficult to remove:
First, again, you need to define, what "unused" means for you. This criterion will however certainly involve reachability, or "connectedness" in the combined graph, where you want to only select the graph component that contains your core ontology. The problem is that, if you treat the triples as undirected edges, you will probably get a connected graph (that is only a single component and no nodes to remove) because the type hierarchy often connects everything. You could take the direction of edges into account, that is include only resources Y where there is a directed path from any resource X in your core ontology to Y.
This would ensure that you can go up the subclass hierarchy of the target ontology until e.g. owl:Thing but not down again. The problem is that you don't know what other type of edges are in the target ontology and in which direction they go but you could use only rdfs:subClassOf edges for now.
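A minimal sketch of that directed reachability check, in plain Python and following only rdfs:subClassOf edges. The function name `reachable_upwards` and the class names in the sample data are hypothetical.

```python
from collections import deque

def reachable_upwards(triples, core_classes, edge="rdfs:subClassOf"):
    """Return every class reachable from the core classes by following `edge` in its own direction."""
    parents = {}
    for s, p, o in triples:
        if p == edge:
            parents.setdefault(s, set()).add(o)

    seen = set(core_classes)
    queue = deque(core_classes)
    while queue:                        # breadth-first search, never walking an edge backwards
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

triples = {
    ("core:Pump", "rdfs:subClassOf", "ref:Device"),
    ("ref:Valve", "rdfs:subClassOf", "ref:Device"),
    ("ref:Device", "rdfs:subClassOf", "owl:Thing"),
}

kept = reachable_upwards(triples, {"core:Pump"})
```

In this example ref:Valve is not kept: it shares a superclass with core:Pump, but there is no directed path from the core class down to it.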
If you have sufficiently defined your "unused concept" or want to try it with some experimental definition, you can either use a graph library or graph analysis application and import your data there.
Here is an example of how to import a SPARQL endpoint into the Cytoscape.js JavaScript graph visualization library, it can be used in node as well. You need to heavily adapt the code however.
Or you do it again in SPARQL using SPARQL 1.1 property paths.
The problem is that those can have a large performance impact (or even a complexity that is way too large to ever complete) especially when applied to a large number of resources and an unrestricted path length. So it is possible that a query such as that times out but feel free to try and adapt it:
construct {?s ?p ?o.}
{
{?s ?p ?o.}
graph <http://my.core.graph> {?x rdfs:subClassOf ?X.}
{?x (<>|!<>)* ?s.}
}
The ?x rdfs:subClassOf ?X statement just identifies which resources of your core ontology you want to use as source points; I couldn't get a valid query without it. When I apply a graph statement to the path expression, I get a syntax error from Virtuoso.

Grouping by blank nodes

I have the following data:
@prefix f: <http://example.org#> .
_:a f:trait "Rude"@en .
_:a f:name "John" .
_:a f:surname "Roy" .
_:b f:trait "Crude"@en .
_:b f:name "Mary" .
_:b f:surname "Lestern" .
However, if I execute the following query in Blazegraph:
PREFIX f: <http://example.org#>
SELECT ?s ?o
WHERE
{
?s f:trait ?o .
}
I get six results:
s o
t32 Crude
t37 Crude
t39 Crude
t31 Rude
t36 Rude
t38 Rude
If blank nodes _:a and _:b are distinct nodes, how should I write a SPARQL query to return only two distinct results? I have tried SELECT DISTINCT, but it still returns six results. I have tried grouping by ?o, but Blazegraph returns an error, saying it's a bad aggregate. Why does this kind of output of repeating tuples happen? And how to avoid it?
The problem is that you have inserted data containing blank nodes several times. This operation is not idempotent.
Useful quotes
From RDF 1.1 Concepts and Abstract Syntax:
Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes.
From RDF 1.1 Semantics:
RDF graphs can be viewed as conjunctions of simple atomic sentences in first-order logic, where blank nodes are free variables which are understood to be existential. Taking the union of two graphs is then analogous to syntactic conjunction in this syntax. RDF syntax has no explicit variable-binding quantifiers, so the truth conditions for any RDF graph treat the free variables in that graph as existentially quantified in that graph. Taking the union of graphs which share a blank node changes the implied quantifier scopes.
From SPARQL 1.1 Query Language:
Blank node labels are scoped to a result set.
There need not be any relation between a label _:a in the result set and a blank node in the data graph with the same label.
An application writer should not expect blank node labels in a query to refer to a particular blank node in the data.
From SPARQL 1.1 Update:
Blank nodes... are assumed to be disjoint from the blank nodes in the Graph Store, i.e., will be inserted with "fresh" blank nodes.
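The effect of those rules can be reproduced with a small model in plain Python. This is a sketch only: the `Store` class and the `_:tN` naming scheme are invented for illustration and do not reflect Blazegraph internals.

```python
import itertools

class Store:
    """Toy triple store that gives blank nodes fresh identifiers on every load."""

    def __init__(self):
        self.triples = set()
        self._ids = itertools.count(1)

    def load(self, document):
        mapping = {}  # blank node label -> fresh id, valid for this load only

        def fresh(term):
            if not term.startswith("_:"):
                return term                     # IRIs and literals are kept as-is
            if term not in mapping:
                mapping[term] = f"_:t{next(self._ids)}"
            return mapping[term]

        for s, p, o in document:
            self.triples.add((fresh(s), p, fresh(o)))

document = [
    ("_:a", "f:trait", '"Rude"@en'), ("_:a", "f:name", '"John"'), ("_:a", "f:surname", '"Roy"'),
    ("_:b", "f:trait", '"Crude"@en'), ("_:b", "f:name", '"Mary"'), ("_:b", "f:surname", '"Lestern"'),
]

store = Store()
for _ in range(3):          # the same file inserted three times
    store.load(document)

trait_rows = {(s, o) for (s, p, o) in store.triples if p == "f:trait"}
distinct_traits = {o for (_, o) in trait_rows}
```

Each load creates two new subjects, so three loads give six (?s, ?o) rows that DISTINCT cannot collapse, since the subjects really are different nodes. Projecting only ?o would collapse them to two values; loading the data once, or using IRIs instead of blank nodes, avoids the duplication in the first place.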
Some discussion
Problems of the RDF Model: Blank Nodes
Blank nodes considered harmful
Different triplestores provide solutions for the "problems" described. E.g., Jena allows the use of pseudo-URIs like <_:b1>.

Virtuoso SPARQL not retrieving values

I am uploading an RDF file to a Virtuoso repository via the graphical interface (ODS-Briefcase). The file is uploaded successfully. However, every time I make a SPARQL query, an empty result is returned.
I have tried with many other files, and I did not have this problem.
This file size is bigger than the previous ones (14MB), so I guess this could be the cause, but I am not sure about it.
Any help in this matter will be appreciated :)
UPDATE
I have tried uploading a smaller file (2KB) and the SPARQL returns results as expected.
However, I uploaded the 14 MB file again and it seems it was not uploaded correctly.
When I try to read it from the ODS-Briefcase of Virtuoso, this happens:
To solve problems like this, you have to fundamentally understand the task you are performing and how it is interpreted by Virtuoso.
Task at hand:
Loading an RDF document into Virtuoso's WebDAV repository (for which ODS-Briefcase provides a front-end), in a manner that results in the content of said RDF document being loaded into the Quad Store (where RDF data is indexed and made available to SPARQL Queries etc.).
How you achieve your goal:
Use the ODS-Briefcase UI to create a DET Folder (a folder that provides an automatic conduit between WebDAV storage and the Virtuoso Quad Store) of type: Linked Data Import. Among the attributes (characteristics) of this kind of folder are a Named Graph IRI and a Named Graph IRI Base:
With your Linked Data Import DET folder in place, you simply upload RDF documents to the newly created folder.
To verify that RDF statements have been imported from an RDF document placed in this folder, simply execute one of the following:
SELECT COUNT (*)
FROM {target-named-graph-iri}
WHERE {?s ?p ?o}
OR
SELECT DISTINCT *
FROM {target-named-graph-iri}
WHERE {?s ?p ?o}
You can also leverage Virtuoso's in-built RDF data import middleware (a/k/a Sponger) within a SPARQL query using the pattern:
DEFINE get:soft "replace"
SELECT DISTINCT *
FROM {rdf-document-uri}
WHERE {?s ?p ?o}
I hope this brings clarity to the options available for importing RDF document content into the Virtuoso Quad Store (the engine that manages data represented as RDF property/predicate graphs).
It sounds like you have loaded your files into the Virtuoso WebDAV (file) repository, but you may not have loaded the RDF therein into the Virtuoso (RDF) Quad Store.
See this guide to the bulk loader, and this page of RDF loading methods.
(ObDisclaimer: I work for OpenLink Software, producer of Virtuoso.)

How do triple stores use linked data?

Let's say I have the following scenario:
I have some different ontology files hosted somewhere on the web on different domains like _http://foo1.com/ontolgy1.owl#, _http://foo2.com/ontology2.owl# etc.
I also have a triple store in which I want to insert instances based on the ontology files mentioned like this:
INSERT DATA
{
<http://foo1.com/instance1> a <http://foo1.com/ontolgy1.owl#class1>.
<http://foo2.com/instance2> a <http://foo2.com/ontolgy2.owl#class2>.
<http://foo2.com/instance2x> a <http://foo2.com/ontolgy2.owl#class2x>.
}
Lets say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
And after the insert if I run a SPARQL query like this:
select ?a
where
{
?a rdf:type ?type.
?type rdfs:subClassOf* <http://foo2.com/ontolgy2.owl#class2> .
}
the result would be:
<http://foo2.com/instance2>
and not:
<http://foo2.com/instance2>
<http://foo2.com/instance2x>
as it should be. This is happening because the ontology file _http://foo2.com/ontolgy2.owl# is not imported into the triple store.
My question is:
Can we talk in this example about "linked" data? Because it seems to me that it is not linked at all. It has to be imported locally into a triple store, and after that you can start querying.
Let's say you want to run a query on some complex data that is described by 20 ontology files; then all 20 ontology files would need to be imported.
Isn't this a bit disappointing?
Did I misunderstand triple stores and linked data and how they work together?
as it should be.
I'm not certain that should is the right term here. The semantics of the SPARQL query is to query the data stored in a particular graph stored at the endpoint. IRIs are more or less opaque identifiers; just because they might also be URLs from which additional data can be retrieved doesn't obligate any particular system to actually do that kind of retrieval. Doing that would easily make query behavior unpredictable: "this query worked yesterday, why doesn't it work today? oh, a remote website is no longer available…".
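To see why the result depends only on what is stored locally, here is a small sketch in plain Python of how the rdfs:subClassOf* path behaves over a set of triples. The helper names are hypothetical, and the IRIs are abbreviated with made-up foo1:/foo2: prefixes.

```python
def subclass_closure(triples, cls):
    """All classes C with 'C rdfs:subClassOf* cls' in the given triples (zero or more steps)."""
    result = {cls}                      # the zero-length path matches cls itself
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == "rdfs:subClassOf" and o == current and s not in result:
                result.add(s)
                frontier.append(s)
    return result

def instances(triples, cls):
    classes = subclass_closure(triples, cls)
    return sorted(s for (s, p, o) in triples if p == "rdf:type" and o in classes)

# What the INSERT DATA put into the store: instance data only.
local = {
    ("foo1:instance1", "rdf:type", "foo1:class1"),
    ("foo2:instance2", "rdf:type", "foo2:class2"),
    ("foo2:instance2x", "rdf:type", "foo2:class2x"),
}
# The axiom that lives only in the remote ontology file.
ontology = {("foo2:class2x", "rdfs:subClassOf", "foo2:class2")}

without_ontology = instances(local, "foo2:class2")
with_ontology = instances(local | ontology, "foo2:class2")
```

Without the subclass triple the path can only match class2 itself, so only instance2 is found; once the ontology triple is in the queried graph, instance2x appears as well.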
Lets say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
Remember, since IRIs are opaque, anyone can define a term in any ontology. It's always possible for someone else to come along and say something else about a resource. You have no way of tracking all that information. For instance, if I go and write an ontology, I can declare http://foo2.com/ontolgy2.owl#class2x as a class and assert that it's equivalent to http://dbpedia.org/ontology/Person. Should the system have some way to know about what I did someplace else, and even if it did, should it be required to go and retrieve information from it? What if I made an ontology that's 2GB in size? Surely your endpoint can't be expected to go and retrieve that just to answer a quick query?
Can we talk in this example about "linked" data? Because it seems to
me that it is not linked at all. It has to be imported locally into a
triple store, and after that you can start querying.
Lets say if wan to run a query on some complex data that is describe
by 20 ontology files, in this case I have to import all 20 ontology
files.
This is usually the case, and the point about linked data is that you have a way to get more information if you choose to, and that you don't have to do as much work in negotiating how to identify resources in that data. However, you can use the service keyword in SPARQL to reference other endpoints, and that can provide a type of linking. For instance, knowing that DBpedia has a SPARQL endpoint, I can run a local query that incorporates DBpedia with something like this:
select ?person ?localValue ?publicName {
?person :hasLocalValueOfInterest ?localValue
service <http://dbpedia.org/sparql> {
?person foaf:name ?publicName
}
}
You can use multiple service blocks to aggregate data from multiple endpoints; you're not limited to just one. That seems pretty "linked" to me.
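Conceptually, the SERVICE block is evaluated at the remote endpoint and its solutions are joined with the local ones on the shared variables. A rough sketch of that join in plain Python follows; the bindings are invented sample data, not real DBpedia results, and no network access is involved.

```python
def join(left, right):
    """Natural join of two lists of solution mappings (dicts) on their shared variables."""
    joined = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[v] == r[v] for v in shared):
                joined.append({**l, **r})
    return joined

# Solutions of the local pattern: ?person :hasLocalValueOfInterest ?localValue
local = [
    {"person": "dbr:Ada_Lovelace", "localValue": 42},
    {"person": "ex:unknown", "localValue": 7},
]
# Solutions assumed to come back from the remote pattern: ?person foaf:name ?publicName
remote = [
    {"person": "dbr:Ada_Lovelace", "publicName": "Ada Lovelace"},
]

rows = join(local, remote)
```

Only the person known on both sides survives the join, which is the sense in which shared IRIs do the "linking" between the two datasets.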