Grouping by blank nodes


I have the following data:
@prefix f: <http://example.org#> .
_:a f:trait "Rude"@en .
_:a f:name "John" .
_:a f:surname "Roy" .
_:b f:trait "Crude"@en .
_:b f:name "Mary" .
_:b f:surname "Lestern" .
However, if I execute the following query in Blazegraph:
PREFIX f: <http://example.org#>
SELECT ?s ?o
WHERE
{
?s f:trait ?o .
}
I get six results:
s o
t32 Crude
t37 Crude
t39 Crude
t31 Rude
t36 Rude
t38 Rude
If blank nodes _:a and _:b are distinct nodes, how should I write a SPARQL query that returns only two distinct results? I have tried SELECT DISTINCT, but it still returns six results. I have tried grouping by ?o, but Blazegraph returns an error saying it's a bad aggregate. Why are the tuples repeated in the output, and how can I avoid it?

The problem is that you have inserted data containing blank nodes several times. This operation is not idempotent.
Useful quotes
From RDF 1.1 Concepts and Abstract Syntax:
Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes.
From RDF 1.1 Semantics:
RDF graphs can be viewed as conjunctions of simple atomic sentences in first-order logic, where blank nodes are free variables which are understood to be existential. Taking the union of two graphs is then analogous to syntactic conjunction in this syntax. RDF syntax has no explicit variable-binding quantifiers, so the truth conditions for any RDF graph treat the free variables in that graph as existentially quantified in that graph. Taking the union of graphs which share a blank node changes the implied quantifier scopes.
From SPARQL 1.1 Query Language:
Blank node labels are scoped to a result set.
There need not be any relation between a label _:a in the result set and a blank node in the data graph with the same label.
An application writer should not expect blank node labels in a query to refer to a particular blank node in the data.
From SPARQL 1.1 Update:
Blank nodes... are assumed to be disjoint from the blank nodes in the Graph Store, i.e., will be inserted with "fresh" blank nodes.
Some discussion
Problems of the RDF Model: Blank Nodes
Blank nodes considered harmful
Different triplestores provide workarounds for the "problems" described. E.g., Jena allows the use of pseudo-URIs like <_:b1> to refer to existing blank nodes.
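The non-idempotence described in this answer can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Blazegraph's actual loader: the `ToyStore` class and its methods are hypothetical, and the only behaviour modelled is that every load maps each blank node label to a fresh node.

```python
import itertools

class ToyStore:
    """Hypothetical toy triple store: blank node labels are renamed on every load."""

    def __init__(self):
        self.triples = set()
        self._fresh = itertools.count(1)

    def load(self, triples):
        # Blank node labels are scoped to one document, so each load gets
        # its own mapping from label to a freshly minted node.
        # (Only subjects are handled; the sample data has no blank objects.)
        renamed = {}
        for s, p, o in triples:
            if s.startswith("_:"):
                if s not in renamed:
                    renamed[s] = f"_:t{next(self._fresh)}"
                s = renamed[s]
            self.triples.add((s, p, o))

    def select(self, predicate):
        # Rough equivalent of: SELECT ?s ?o WHERE { ?s <predicate> ?o }
        return sorted((s, o) for s, p, o in self.triples if p == predicate)

DATA = [
    ("_:a", "f:trait", "Rude"), ("_:a", "f:name", "John"), ("_:a", "f:surname", "Roy"),
    ("_:b", "f:trait", "Crude"), ("_:b", "f:name", "Mary"), ("_:b", "f:surname", "Lestern"),
]

# Loading the same file three times creates six distinct subjects.
store = ToyStore()
for _ in range(3):
    store.load(DATA)
rows_after_three_loads = store.select("f:trait")

# Loading it once into an empty store gives the expected two rows.
clean = ToyStore()
clean.load(DATA)
rows_after_one_load = clean.select("f:trait")

print(len(rows_after_three_loads), len(rows_after_one_load))
```

In this model SELECT DISTINCT cannot help, because the six subjects really are different nodes. The practical way out is to load data containing blank nodes only once, or to clear the store (or the target graph) before reloading.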

Related

How to make a selection of a giant ontology, built from several aligned reference ontologies?

My organization has an information requirement spanning several information domains. In order to capture this, we are building a large organization ontology in which we align several domain-specific reference ontologies / vocabularies (think of Dublin Core, GeoSPARQL, industry-specific information models etc.) and, where necessary, we add concepts in an 'extension' ontology (which is then also aligned with the reference ontologies).
The totality of this aligned ontology (>3000 classes and >10000 ObjectProperties) contains both unused concepts and semantic doubles, and is impossible for a newcomer to navigate. Furthermore, the organization wishes to standardize the use of specific concepts, so doubles are extremely undesirable. We are therefore looking for a way to construct the SuperAwesomeOntology that contains all concepts (and their OWL-related predicates like subClassOf, domain/range etc.) that have been labeled (maybe by something like dcterms:isRequiredBy "SuperAwesomeOntology"). The result should be a correct OWL ontology that can be stored in a single file.
One constraint: it has to be done programmatically (the copy/move/delete axioms interface of Protégé won't do), because if one of the reference ontologies gets an update, we want to be able to render the SuperAwesomeOntology again from its most up-to-date reference ontologies and find out if there are any conflicts.
How would we go about this? Could SPARQL do this, and if so, how? Alternative suggestions to the isRequiredBy labeling are also welcome.

If I understand you correctly, you want to programmatically remove unused concepts from a large ontology or a collection of ontologies/graphs and you also want to remove concepts/classes that you identified as duplicates via interlinking.
Identified duplicates are easy to remove:
Define what a duplicate is for you. For example, nodes at either end of an owl:sameAs or skos:closeMatch link that are outside of your core graph (so you don't remove the "original").
Construct the new graph using a SPARQL query:
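PREFIX owl: <http://www.w3.org/2002/07/owl#>
# Copy every triple whose subject and object have no owl:sameAs statement
# in any graph other than the core graph.
construct {?s ?p ?o.}
where {
  ?s ?p ?o.
  filter not exists {graph ?g {?s owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
  filter not exists {graph ?g {?o owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
}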
I tested this query for syntax and performance but not for correctness.
Unused concepts are more difficult to remove:
First, again, you need to define what "unused" means for you. This criterion will certainly involve reachability, or "connectedness", in the combined graph, where you want to select only the graph component that contains your core ontology. The problem is that, if you treat the triples as undirected edges, you will probably get a connected graph (that is, only a single component and no nodes to remove) because the type hierarchy often connects everything. You could take the direction of edges into account, that is, include only resources Y where there is a directed path from some resource X in your core ontology to Y.
This would ensure that you can go up the subclass hierarchy of the target ontology until e.g. owl:Thing but not down again. The problem is that you don't know what other types of edges are in the target ontology and in which direction they go, but you could use only rdfs:subClassOf edges for now.
Once you have sufficiently defined your "unused concept", or want to try it with some experimental definition, you can either use a graph library or graph analysis application and import your data there.
Here is an example of how to import a SPARQL endpoint into the Cytoscape.js JavaScript graph visualization library. It can be used in Node.js as well, but you will need to adapt the code heavily.
Or you do it again in SPARQL, using SPARQL 1.1 property paths.
The problem is that those can have a large performance impact (or even a complexity that is too large to ever complete), especially when applied to a large number of resources and an unrestricted path length. So it is possible that a query such as the following times out, but feel free to try and adapt it:
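PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
construct {?s ?p ?o.}
where {
  {?s ?p ?o.}
  # ?x ranges over the classes of the core ontology (the source points).
  graph <http://my.core.graph> {?x rdfs:subClassOf ?X.}
  # (<>|!<>)* matches a directed path of any length over any property.
  {?x (<>|!<>)* ?s.}
}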
The ?x rdfs:subClassOf ?X statement merely identifies which resources of your core ontology you want to use as source points; I couldn't get a valid query without it. When I apply a graph statement to the path expression, I get a syntax error from Virtuoso.
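If the property path query times out, the directed reachability criterion described above can also be computed outside the triple store. The following Python sketch is one possible way to do that, assuming the rdfs:subClassOf edges have already been exported as (subclass, superclass) pairs. The class names in the sample data are made up for illustration.

```python
from collections import deque

def reachable(edges, sources):
    """Return every node reachable from `sources` by following directed edges."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
    seen = set(sources)
    queue = deque(sources)
    while queue:  # plain breadth-first search
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical rdfs:subClassOf edges as (subclass, superclass) pairs.
SUBCLASS_OF = [
    ("core:Pump", "ref:Device"),
    ("ref:Device", "owl:Thing"),
    ("ref:Valve", "ref:Device"),    # sibling class: only reachable by going *down*
    ("ref:Invoice", "owl:Thing"),   # unrelated branch
]

# Start from the classes of the core ontology and walk upwards only.
kept = reachable(SUBCLASS_OF, ["core:Pump"])
print(sorted(kept))
```

Because the edges are followed in one direction only, the walk climbs from core:Pump to owl:Thing but never descends into ref:Valve or ref:Invoice, which is the behaviour the undirected view would lose.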

The total number of classes and properties in DBpedia

Okay this seems like a really basic question, but for some reason I am not able to figure this out. I have the DBpedia 2014 owl file from here. Now when I load this in Protégé and look at the Ontology metrics tab, I see that the class count is 814, Object property count is 1310, Data property count is 1725. Is this the right number? Out of curiosity I tried to check the numbers on the Virtuoso endpoint and for the query
select ?p (count(?p) as ?totalCount) where {?s ?p ?o } group by ?p order by DESC(?totalCount)
i.e. trying to find the properties and the total number of times they appear in the graph, I find that the total is 10,000. Now I am not sure if this is the right way to check the properties and the number of times they appear in the graph.
For classes when I issue this query :
SELECT ?class
WHERE {
?class rdf:type rdfs:Class.
}
I don't get any results at all. Now using the default query in Virtuoso i.e.
Select count(distinct ?Concept) where {[] a ?Concept}
I get the value 369857. So I am a bit confused. Is this number so large because the graph has concepts from YAGO, UMBEL, schema.org and purl, or am I looking at something wrongly? Are the Concepts completely different from the classes (interpreted differently, in a way I haven't thought of)?
Honestly, I got waylaid by these numbers because I needed them to calculate the selectivity as defined in this paper.
There they say that for a triple pattern, the selectivity of a subject is 1/R, where R is the number of resources. So do the resources mean the Class count, the Concept count, or the count of ?s in the ?s ?p ?o . triple pattern?

The DBpedia ontology contains just the axioms for classes and properties with the namespace http://dbpedia.org/ontology.
The DBpedia SPARQL endpoint contains much more data:
First, it contains triples with properties that have the namespace http://dbpedia.org/property . Those properties are untyped (i.e. of type rdf:Property), which in fact means that the value can be either a resource or a literal. In OWL we have typed properties, i.e. object and data properties.
Other information loaded into the SPARQL endpoint includes, among other things, links to external datasets like YAGO or the upper-level ontology UMBEL. You can find more details here [1], [2].
By the way, you can easily see that from your first query: there are many more properties, with different namespaces.
Regarding your first query: it's the correct query if you want the number of triples for each property. It returns only 10,000 rows because that's the default result set limit of the Virtuoso triple store in which DBpedia is loaded. For more results you have to use pagination. The total number of properties used in triples can be found with
SELECT (COUNT(DISTINCT ?p) AS ?cnt)
WHERE
{ ?s ?p ?o}
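The pagination mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: `run_query` stands for whatever function sends a SPARQL query to the endpoint and returns a list of rows, and `fake_endpoint` is a made-up stand-in so that the sketch runs without network access. Note that the base query should contain an ORDER BY clause (the per-property query in the question does), otherwise the pages are not guaranteed to be stable.

```python
import re

def paged_queries(base_query, page_size):
    """Yield copies of base_query with LIMIT/OFFSET for successive pages."""
    offset = 0
    while True:
        yield f"{base_query}\nLIMIT {page_size} OFFSET {offset}"
        offset += page_size

def fetch_all(run_query, base_query, page_size):
    """Collect all rows by requesting page after page until a short page arrives."""
    rows = []
    for query in paged_queries(base_query, page_size):
        page = run_query(query)
        rows.extend(page)
        if len(page) < page_size:  # a short (or empty) page means we are done
            return rows

# Hypothetical stand-in for a real endpoint: 25 rows, served by LIMIT/OFFSET.
FAKE_ROWS = [(f"p{i}", 100 - i) for i in range(25)]
issued = []

def fake_endpoint(query):
    issued.append(query)
    limit, offset = map(int, re.search(r"LIMIT (\d+) OFFSET (\d+)", query).groups())
    return FAKE_ROWS[offset:offset + limit]

BASE = ("SELECT ?p (COUNT(?p) AS ?totalCount) WHERE { ?s ?p ?o } "
        "GROUP BY ?p ORDER BY DESC(?totalCount)")
all_rows = fetch_all(fake_endpoint, BASE, page_size=10)
print(len(all_rows), len(issued))
```

Against the public endpoint, the page size would be chosen to stay at or below the server's result limit.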
Your second query, for all classes of type rdfs:Class, returns nothing because no class in DBpedia is of that type. For OWL ontologies it's more usual to query for classes of type owl:Class. The third query in fact returns all resources that ever occur in rdf:type triples in object position, which is slightly different as it works on instance data. That means it returns all classes that are actually used in the data.
As for your last question: I haven't read the paper, but a common metric in research papers is the number of distinct subjects that use a given property.

Downloaded DBpedia data does not include all instance-types triples

I downloaded some data from http://downloads.dbpedia.org/2015-04/core/,
including: instance-types_en.nt, mappingbased-properties_en.nt, and some others.
I have successfully loaded them into OpenLink Virtuoso, but when I run some sample SPARQL queries, e.g., a query to see all the triples about the subject Xiamen_University, the problem appears.
select ?s ?p ?o
where
{
?s rdfs:label "Xiamen University"@en .
?s ?p ?o .
}
On the DBpedia SPARQL endpoint, there are heaps of triples about Xiamen_University, while in my db there are only 4 or 5 of them.
In particular, there is no triple in my db indicating that Xiamen_University is of type University, or any instance-type triples at all. I found similar cases for some other subjects as well.
I think the instance-types_en.nt file does not include all the instance-type triples from Wikipedia, and that mappingbased-properties has the same problem. Is that right? If so, where could I find the right source file?

There's a whole list of the datasets available on the downloads page. I don't see much documentation about exactly what's in each one of them, but the names are fairly descriptive, and the question mark links next to each one show a preview of what kind of information it contains. Hovering over each title will provide a short description.
It looks like to get most of the interesting properties, you'd probably want the mappingbased datasets, along with the labels dataset (since the query that you wrote identifies objects by label).

SPARQL 1.1 entailment regimes and query with FROM clause

I'm currently reading up on and testing SPARQL 1.1 entailment regimes, and the recommendation repeatedly states that
The scoping graph is graph-equivalent to the active graph
but it does not specify what the active graph refers to: is it the dataset used in the query? A union of all graphs in the store?
As a test to determine this, I put the following graph, named <http://www.example.org/>, in a Sesame memory store with RDF Schema and direct type inferencing (v2.7.14):
@prefix ex: <http://www.example.org/> .
ex:book1 rdf:type ex:Publication .
ex:book2 rdf:type ex:Article .
ex:Article rdfs:subClassOf ex:Publication .
ex:publishes rdfs:range ex:Publication .
ex:MITPress ex:publishes ex:book3 .
I tried the following query (which uses the default graph and thus the inference engine):
SELECT ?s WHERE { ?s a ex:Publication . }
As expected, it returns all three instances:
<http://www.example.org/book1>
<http://www.example.org/book2>
<http://www.example.org/book3>
while the query :
SELECT ?s FROM ex: WHERE { ?s a ex:Publication . }
returns only
<http://www.example.org/book1>
Under these circumstances, shouldn't both results be the same?
What should happen (according to the recommendation) if data and schema are split between two graphs in the store (like <urn:rdfs-schema> and <urn:data>, or even scattered across more graphs) and the query uses both graphs (or a subset of schema-related graphs) in the FROM clause instead of the default graph?
In other words, should the inferencing be global throughout the store, or does it depend on the query dataset?
Or is the recommendation loose enough to make this an implementation-dependent issue?
Thanks for any insight,
Max.
EDIT this question is being redirected to SPARQL 1.1 entailment regimes and query with FROM clause (follow-up)

Your second query returns only book1 because in Sesame's RDFS inferencer, entailed statements are inserted in the default graph, not in the named graph(s) from which the premises for the entailment come. So the entailed results are simply not present in the graph you are querying.
The reason for this design choice is at least partly historic, as the Sesame RDFS inference engine predates the W3C notion of entailment regimes. The rationale at the time was that in the case of inferencing over several named graphs (where e.g. one premise comes from graph A and another from B), insertion in the default graph (rather than in either A, or B, or both) was the simplest option, with the least amount of confusion.
Sesame currently does not explicitly support the W3C entailment regimes specification. However, if you feel that a simple improvement would be possible to make it more compatible, by all means log a feature request.
(disclosure: Sesame developer)
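The behaviour described in this answer can be reproduced with a toy model in Python. This is a sketch of the idea only, not Sesame's implementation: the store is a plain dictionary from graph name to triples, `DEFAULT` is a hypothetical marker for the default graph, and only two RDFS rules (range typing and subclass typing) are applied.

```python
RDF_TYPE, SUBCLASS, RANGE = "rdf:type", "rdfs:subClassOf", "rdfs:range"
EX_GRAPH = "http://www.example.org/"
DEFAULT = None  # hypothetical marker for the context-less (default) graph

# The explicit statements from the question, all in the named graph.
explicit = {
    ("ex:book1", RDF_TYPE, "ex:Publication"),
    ("ex:book2", RDF_TYPE, "ex:Article"),
    ("ex:Article", SUBCLASS, "ex:Publication"),
    ("ex:publishes", RANGE, "ex:Publication"),
    ("ex:MITPress", "ex:publishes", "ex:book3"),
}

def entail(triples):
    """Apply two RDFS rules until nothing new appears; return only the new triples."""
    known = set(triples)
    while True:
        new = set()
        for s, p, o in known:
            for s2, p2, o2 in known:
                if p2 == RANGE and s2 == p:  # range typing (rdfs3)
                    new.add((o, RDF_TYPE, o2))
                if p == RDF_TYPE and p2 == SUBCLASS and s2 == o:  # subclass typing (rdfs9)
                    new.add((s, RDF_TYPE, o2))
        if new <= known:
            return known - set(triples)
        known |= new

# Entailed statements go to the default graph, not to the named graph.
store = {EX_GRAPH: explicit, DEFAULT: entail(explicit)}

def instances_of(cls, from_graph=None):
    """Without from_graph, query everything; with it, only that named graph."""
    graphs = [store[from_graph]] if from_graph else store.values()
    return sorted({s for g in graphs for s, p, o in g if p == RDF_TYPE and o == cls})

without_from = instances_of("ex:Publication")
with_from = instances_of("ex:Publication", EX_GRAPH)
print(without_from, with_from)
```

Restricting the query to the named graph leaves out the two entailed rdf:type statements, which is why only book1 comes back in the FROM variant.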

What exactly is in the default graph isn't specified by the SPARQL 1.1 standard. In particular, see 13.1 Examples of RDF Datasets which mentions that:
The definition of RDF Dataset does not restrict the relationships of
named and default graphs. Information can be repeated in different
graphs; relationships between graphs can be exposed. Two useful
arrangements are:
to have information in the default graph that includes provenance information about the named graphs
to include the information in the named graphs in the default graph as well.
However, you can control this in a query: use a FROM clause to specify which graph should be the default graph, or multiple FROM clauses to specify which graphs should be merged to form the default graph.
That's all concerning the default graph. The active graph is another term you'll see in the SPARQL 1.1 spec:
The graph that is used for matching a basic graph pattern is the
active graph. In the previous sections, all queries have been shown
executed against a single graph, the default graph of an RDF dataset
as the active graph. The GRAPH keyword is used to make the active
graph one of all of the named graphs in the dataset for part of the
query.
So, you can use FROM (possibly multiple times) to control the default graph, and thereby the initial active graph, and then GRAPH { … } within a query to change the active graph.

Why 2 different vocabularies for the same attribute in DBpedia?

Why does DBpedia use multiple vocabularies for the same attributes?
I have to get data of all possible movies.
For each movie I have observed that it has both a dbpedia-owl and a dbpprop vocabulary for producers, directors and so on. I retrieve the attribute with the following query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?star_name
WHERE {
<http://dbpedia.org/resource/Goal_III:_Taking_on_the_World> dbpedia-owl:starring ?star.
?star foaf:name ?star_name
}
I'll have the page id of each movie and then I'll retrieve stars and producers. I think dbpedia-owl works for some movies and dbpprop works for others.
I am puzzled about it. I have to write code in Python to run this query for each movie. Hence every time I'll have to check whether the result is empty and then run the query again with the other vocabulary.

DBpedia's data is extracted using a mapping-based language from the infoboxes you see on the corresponding Wikipedia pages. Different mappings are used for different infoboxes, so two different types of resource may be mapped completely differently, which is perfectly logical if you think about it.
Now the problem you are talking about is that two resources of the same type have the same data mapped differently. I suspect (though I can't confirm, because you didn't give examples of two movies which map properties differently) that the problem here is the data in Wikipedia. It may be that there is more than one way to express the information you are interested in within an infobox, and that the mapping for the infobox maps differently for the different ways. This isn't ideal, but Wikipedia does not have lovely clean data, so you shouldn't expect DBpedia to have clean data either.
You may consider asking a question on the DBpedia mailing list at dbpedia-discussion@lists.sf.net about this to try and find out why this happens, as they will be better placed to help you.
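On the practical side of the question (avoiding a second query whenever the first vocabulary yields nothing), one option is to ask for both vocabularies at once with a SPARQL UNION. The Python sketch below only builds such a query string. It assumes that the dbpprop prefix stands for http://dbpedia.org/property/ and that the property has the same local name in both namespaces, which will not hold for every attribute. The function name is hypothetical.

```python
def both_vocabularies_query(movie_uri, local_name):
    """Build one SPARQL query that tries dbpedia-owl and dbpprop in a single request."""
    return f"""PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT DISTINCT ?value
WHERE {{
  {{ <{movie_uri}> dbpedia-owl:{local_name} ?value . }}
  UNION
  {{ <{movie_uri}> dbpprop:{local_name} ?value . }}
}}"""

query = both_vocabularies_query(
    "http://dbpedia.org/resource/Goal_III:_Taking_on_the_World", "starring"
)
print(query)
```

The query returns ?value as is rather than following foaf:name, because values under the dbpprop namespace can be plain literals as well as resources.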
