How to make a selection of a giant ontology, built from several aligned reference ontologies? - sparql

My organization has an information requirement spanning several information domains. In order to capture this, we are building a large organization ontology in which we align several domain specific reference ontologies / vocabularies (think of dublin core, geosparql, industry specific information models etc) and where necessary, we add concepts in an ` extension' ontology (which is then also aligned with the reference ontologies).
The totality of this aligned ontology (>3000 classes and >10000 ObjectProperties) contains both unused concepts and semantic doubles, and for the newcomer is impossible to navigate. Further more the organization wishes to standardize the use of specific concepts, so doubles are extremely undesirable. We are therefore looking for a way to construct the SuperAwesomeOntology that contains all concepts (and their owl related predicates like subClassOf, domain/range etc) that have been labeled (maybe by something like dcterms:isRequiredBy "SuperAwesomeOntology"). The result should be a correct OWL-ontology that can be stored in a single file.
One constraint: it has to be done programmatically,(the copy/move/delete axioms interface of protege wont do), because if one of the reference ontologies gets an update, we want to be able to render the SuperAwesomeOntology again from its most up-to-date reference ontologies and find out if there are any conflicts.
How would we go about this? could SPARQL do this, how? Alternative suggestions to the isRequiredBy labeling are also welcome.

If I understand you correctly, you want to programmatically remove unused concepts from a large ontology or a collection of ontologies/graphs and you also want to remove concepts/classes that you identified as duplicates via interlinking.
Identified duplicates are easy to remove:
Define what a duplicate is for you. For example, nodes at either end of a owl:sameAs or skos:closeMatch link that are outside of your core graph (so you don't remove the "original").
Construct the new graph using a SPARQL query:
construct {?s ?p ?o.}
{
?s ?p ?o.
filter not exists {graph ?g {?s owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
filter not exists {graph ?g {?o owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
}
I tested this query for syntax and performance but not for correctness.
Unused concepts are more difficult to remove:
First, again, you need to define, what "unused" means for you. This criterion will however certainly involve reachability, or "connectedness" in the combined graph, where you want to only select the graph component that contains your core ontology. The problem is that, if you treat the triples as undirected edges, you will probably get a connected graph (that is only a single component and no nodes to remove) because the type hierarchy often connects everything. You could take the direction of edges into account, that is include only resources Y where there is a directed path from any resource X in your core ontology to Y.
This would ensure that you can go up the subclass hierarchy of the target ontology until e.g. owl:Thing but not down again. The problem is that you don't know what other type of edges are in the target ontology and in which direction they go but you could use only rdfs:subClassOf edges for now.
If you have sufficiently defined your "unused concept" or want to try it with some experimental definition, you can either use a graph library or graph analysis application and import your code there.
Here is an example of how to import a SPARQL endpoint into the Cytoscape.js JavaScript graph visualization library, it can be used in node as well. You need to heavily adapt the code however.
Or you do it again in SPARQL using SPARQL 1.1 property paths.
The problem is that those can have a large performance impact (or even a complexity that is way too large to ever complete) especially when applied to a large number of resources and an unrestricted path length. So it is possible that a query such as that times out but feel free to try and adapt it:
construct {?s ?p ?o.}
{
{?s ?p ?o.}
graph <http://my.core.graph> {?x rdfs:subClassOf ?X.}
{?x (<>|!<>)* ?s.}
}
The ?x rdfs:subClassOf ?X statement is just an identifier for which resources of your core ontology you want to use a source points, I couldn't get a valid query without that. When I apply a graph statement to the path expression, I get a syntax error from Virtuoso.

Related

Grouping by blank nodes

I have the following data:
#prefix f: <http://example.org#> .
_:a f:trait "Rude"#en .
_:a f:name "John" .
_:a f:surname "Roy" .
_:b f:trait "Crude"#en .
_:b f:name "Mary" .
_:b f:surname "Lestern" .
However, if I execute the following query in Blazegraph:
PREFIX f: <http://example.org#>
SELECT ?s ?o
WHERE
{
?s f:trait ?o .
}
I get six results:
s o
t32 Crude
t37 Crude
t39 Crude
t31 Rude
t36 Rude
t38 Rude
If blank nodes _:a and _:b are distinct nodes, how should I write a SPARQL query to return only two distinct results? I have tried SELECT DISTINCT, but it still returns six results. I have tried grouping by ?o, but Blazegraph returns an error, saying it's a bad aggregate. Why does this kind of output of repeating tuples happen? And how to avoid it?
The problem is that you have inserted data containing blank nodes several times. This operation is not idempotent.
Useful quotes
From RDF 1.1 Concepts and Abstract Syntax:
Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes.
From RDF 1.1 Semantics:
RDF graphs can be viewed as conjunctions of simple atomic sentences in first-order logic, where blank nodes are free variables which are understood to be existential. Taking the union of two graphs is then analogous to syntactic conjunction in this syntax. RDF syntax has no explicit variable-binding quantifiers, so the truth conditions for any RDF graph treat the free variables in that graph as existentially quantified in that graph. Taking the union of graphs which share a blank node changes the implied quantifier scopes.
From SPARQL 1.1 Query Language:
Blank node labels are scoped to a result set.
There need not be any relation between a label _:a in the result set and a blank node in the data graph with the same label.
An application writer should not expect blank node labels in a query to refer to a particular blank node in the data.
From SPARQL 1.1 Update:
Blank nodes... are assumed to be disjoint from the blank nodes in the Graph Store, i.e., will be inserted with "fresh" blank nodes.
Some discussion
Problems of the RDF Model: Blank Nodes
Blank nodes considered harmful
Different triplestores provides solutions for the "problems" described. E.g., Jena allows to use pseudo-URIs like <_:b1> etc.

The total number of classes and properties in DBpedia

Okay this seems like a really basic question, but for some reason I am not able to figure this out. I have the DBpedia 2014 owl file from here. Now when I load this in Protégé and look at the Ontology metrics tab, I see that the class count is 814, Object property count is 1310, Data property count is 1725. Is this the right number? Out of curiosity I tried to check the numbers on the Virtuoso endpoint and for the query
select ?p (count(?p) as ?totalCount) where {?s ?p ?o } group by ?p order by DESC(?totalCount)
i.e. trying to find the properties and the total number of times they appear in the graph, I find that the total is 10,000. Now I am not sure if this is the right way to check the properties and the number of times they appear in the graph.
For classes when I issue this query :
SELECT ?class
WHERE {
?class rdf:type rdfs:Class.
}
I don't get any results at all. Now using the default query in Virtuoso i.e.
Select count(distinct ?Concept) where {[] a ?Concept}
I get the value as 369857. So I am a bit confused. Is this large number because of the fact that the graph has concepts from yago,umbel,schema.org and purl or am I looking at something wrongly? Are the Concepts completely different from the classes? (interpreted differently, which I haven't thought of).
Now honestly I got waylaid by these numbers because I needed them to calculate the selectivity as defined in this paper
Here the say that for a triple pattern, the selectivity of a subject is 1/R, where R is the number of resource, so do the resources mean the Class count or the Concept count? or the count of ?s in the ?s ?p ?o . triple pattern?
The DBpedia ontology contains just the axioms for classes and properties with the namespace http://dbpedia.org/ontology.
The DBpedia SPARQL endpoint contains much more data:
At first, it contains triples with properties that have the namespace http://dbpedia.org/property . Those properties are untyped (i.e. of typerdf:Property, which in fact means that the value can be both a resource or a literal. In OWL we have typed properties, i.e. object and data properties.
Other information that are loaded into the SPARQL endpoint are, among others, links to external datasets like YAGO or the upper-level ontology UMBEL. You can find more details here [1], [2].
By the way, you can see that easily from your first query. There are much more properties with different namespaces.
According to your first query: It's the correct query if you want the number of triples for each property. It returns only 10000 because that's the default result set limit of the Virtuoso triple store in which DBpedia is loaded. For more results you have to use pagination. The total number of properties used in triples can be found with
SELECT (COUNT(DISTINCT ?p) AS ?cnt)
WHERE
{ ?s ?p ?o}
Your second query with all classes of type rdf:Class returns nothing because no class in DBpedia is of that type. It's more usual to query for classes of type owl:Class for OWL ontologies. The third query in fact returns all resources that ever occur in rdf:type triples in object position, which is slightly different as it works on instance data. That means it return all classes that are really used in the data.
For your last question. I haven't read the paper, but a common metric in many research papers is often to use the distinct subjects that use a given property.

Downloaded DBpedia data does not include all instance-types triples

I downloaded some data from http://downloads.dbpedia.org/2015-04/core/,
including: instance-type_en.nt, mappingbased-properties_en.nt, and some others.
I have successfully loaded them into OpenLink Virtuoso DB, but when I run some sample SPARQL query, e.g., a query to see all the triples about a subject Xiamen_University, the problem appears.
select ?s ?p ?o
where
{
?s rdfs:label "Xiamen University"#en .
?s ?p ?o .
}
From the DBPedia SPARQL endpoint, there are heaps of triples about iamen_University; while in my db, there are only 4 or 5 of them.
Especially, there is no triple in db indicating Xiamen_University is a type of University, or any instance-type triples at all. I found similar cases on some other subjects as well.
I think the instance-types_en.nt file does not include all the instance-types triple from Wikipedia, same problem with mappingbased-properties. Is that right? If so, where could I find the right source file?
There's a whole list of the datasets available at on the downloads page. I don't see lots of documentation about exactly what's in each one of them, but the names are fairly descriptive, and the question mark links next to each one show a preview of what kind of information is in each one of them. Hovering over each title will provide a short description. E.g.:
It looks like to get most of the interesting properties, you'd probably want the mappingbased datasets, along with the labels dataset (since the query that you wrote identifies objects by label).

How do triple stores use linked data?

Lets say I have the following scenario:
I have some different ontology files hosted somewhere on the web on different domains like _http://foo1.com/ontolgy1.owl#, _http://foo2.com/ontology2.owl# etc.
I also have a triple store in which I want to insert instances based on the ontology files mentioned like this:
INSERT DATA
{
<http://foo1.com/instance1> a <http://foo1.com/ontolgy1.owl#class1>.
<http://foo2.com/instance2> a <http://foo2.com/ontolgy2.owl#class2>.
<http://foo2.com/instance2x> a <http://foo2.com/ontolgy2.owl#class2x>.
}
Lets say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
And after the insert if I run a SPARQL query like this:
select ?a
where
{
?a rdf:type ?type.
?type rdfs:subClassOf* <http://foo2.com/ontolgy2.owl#class2> .
}
the result would be:
<http://foo2.com/instance2>
and not:
<http://foo2.com/instance2>
<http://foo2.com/instance2x>
as it should be. This is happening because the ontology file _http://foo2.com/ontolgy2.owl# is not imported into the triple store.
My question is:
Can we talk in this example about "linked" data? Because it seems to me that it is not linked at all. It has to be imported locally into a triple store, and after that you can start querying.
Lets say if you want to run a query on some complex data that is described by 20 ontology files, then all 20 ontology files would need to be imported.
Isn't this a bit disappointing?
Do I misunderstood triple stores and linked data and how they work together?
as it should be.
I'm not certain that should is the right term here. The semantics of the SPARQL query is to query the data stored in a particular graph stored at the endpoint. IRIs are more or less opaque identifiers; just because they might also be URLs from which additional data can be retrieved doesn't obligate any particular system to actually do that kind of retrieval. Doing that would easily make query behavior unpredictable: "this query worked yesterday, why doesn't it work today? oh, a remote website is no longer available…".
Lets say that _http://foo2.com/ontolgy2.owl#class2x is a subclass of _http://foo2.com/ontolgy2.owl#class2 defined within the same ontology.
Remember, since IRIs are opaque, anyone can define a term in any ontology. It's always possible for someone else to come along and say something else about a resource. You have no way of tracking all that information. For instance, if I go and write an ontology, I can declare http://foo2.com/ontolgy2.owl#class2x as a class and assert that it's equivalent to http://dbpedia.org/ontology/Person. Should the system have some way to know about what I did someplace else, and even if it did, should it be required to go and retrieve information from it? What if I made an ontology that's 2GB in size? Surely your endpoint can't be expected to go and retrieve that just to answer a quick query?
Can we talk in this example about "linked" data? Because it seems to
me that it is not linked at all. It has to be imported locally into a
triple store, and after that you can start querying.
Lets say if wan to run a query on some complex data that is describe
by 20 ontology files, in this case I have to import all 20 ontology
files.
This is usually the case, and the point about linked data is that you have a way to get more information if you choose to, and that you don't have to do as much work in negotiating how to identify resources in that data. However, you can use the service keyword in SPARQL to reference other endpoints, and that can provide a type of linking. For instance, knowing that DBpedia has a SPARQL endpoint, I can run a local query that incorporates DBpedia with something like this:
select ?person ?localValue ?publicName {
?person :hasLocalValueOfInterest ?localValue
service <http://dbpedia.org/sparql> {
?person foaf:name ?publicName
}
}
You can use multiple service blocks to aggregate data from multiple endpoints; you're not limited to just one. That seems pretty "linked" to me.

SPARQL 1.1 entailment regimes and query with FROM clause

I'm currently documenting/testing about SPARQL 1.1 entailment regimes and the recommendation repeatedly states that
The scoping graph is graph-equivalent to the active graph
but it does not specifies what is the active graph referring to : is it the data-set used in the query ? a union of all graphs in the store ?
As a test to determine this , I got this graph URIed <http://www.example.org/> in a Sesame Memory store with RDF Schema and direct type inferencing store (v2.7.14)
#prefix ex:<http://www.example.org/> .
ex:book1 rdf:type ex:Publication .
ex:book2 rdf:type ex:Article .
ex:Article rdfs:subClassOf ex:Publication .
ex:publishes rdfs:range ex:Publication .
ex:MITPress ex:publishes ex:book3 .
I've been trying the following query (which means using the default graph thus the inference engine)
SELECT ?s WHERE { ?s a ex:Publication . }
As expected, it returns me all three instances
<http://www.example.org/book1>
<http://www.example.org/book2>
<http://www.example.org/book3>
while the query :
SELECT ?s FROM ex: WHERE { ?s a ex:Publication . }
returns only
<http://www.example.org/book1>
Under the said circumstances, shouldn't both results be the same?
What should happen (according to the recommendation) if data and schema are split between two graphs in the store (like <urn:rdfs-schema> and <urn:data>, or even scattered across more graphs) and the query uses both graphs (or a subset of schema-related graphs) in the FROM clause instead of the default graph ?
Meaning should the inferencing be global throughout the store or does it depend on the query dataset ?
Or maybe is the recommendation loose enough to make this an implementation dependent issue ?
Thanks for your lights,
Max.
EDIT this question is being redirected to SPARQL 1.1 entailment regimes and query with FROM clause (follow-up)
Your second query only returns only book1 because in Sesame's RDFS inferencer, entailed statements are inserted in the default graph, not in the named graph(s) from which the premises for the entailment come. So the entailed results are simply not present in the graph you are querying.
The reason for this design choice is at least partly historic, as the Sesame RDFS inference engine predates the W3C notion of entailment regimes. The rationale at the time was that in the case of inferencing over several named graphs (where e.g. one premise comes from graph A and another from B), insertion in the default graph (rather than in either A, or B, or both) was simplest with the least amount for confusion.
Sesame currently does not explicitly support the W3C entailment regimes specification. However, if you feel that a simple improvement would be possible to make it more compatible, by all means log a feature request.
(disclosure: Sesame developer)
What exactly is in the default graph isn't specified by the SPARQL 1.1 standard. In particular, see 13.1 Examples of RDF Datasets which mentions that:
The definition of RDF Dataset does not restrict the relationships of
named and default graphs. Information can be repeated in different
graphs; relationships between graphs can be exposed. Two useful
arrangements are:
to have information in the default graph that includes provenance information about the named graphs
to include the information in the named graphs in the default graph as well.
However, by using FROM clauses to specify which graph should be the default graph, or by using multiple FROM clauses to specify which graphs should be merged to be the default graph.
That's all concerning the default graph. The active graph is another term you'll see in the SPARQL 1.1 spec:
The graph that is used for matching a basic graph pattern is the
active graph. In the previous sections, all queries have been shown
executed against a single graph, the default graph of an RDF dataset
as the active graph. The GRAPH keyword is used to make the active
graph one of all of the named graphs in the dataset for part of the
query.
So, you can use from (possible multiple times) to control the default graph, and thereby the initial active graph, and then graph { … } within a query to change the active graph.