The total number of classes and properties in DBpedia - sparql

Okay this seems like a really basic question, but for some reason I am not able to figure this out. I have the DBpedia 2014 owl file from here. Now when I load this in Protégé and look at the Ontology metrics tab, I see that the class count is 814, Object property count is 1310, Data property count is 1725. Is this the right number? Out of curiosity I tried to check the numbers on the Virtuoso endpoint and for the query
select ?p (count(?p) as ?totalCount) where {?s ?p ?o } group by ?p order by DESC(?totalCount)
i.e. trying to find the properties and the total number of times they appear in the graph, I find that the total is 10,000. Now I am not sure if this is the right way to check the properties and the number of times they appear in the graph.
For classes when I issue this query :
SELECT ?class
WHERE {
?class rdf:type rdfs:Class.
}
I don't get any results at all. Now using the default query in Virtuoso i.e.
Select count(distinct ?Concept) where {[] a ?Concept}
I get the value as 369857. So I am a bit confused. Is this large number because of the fact that the graph has concepts from yago,umbel,schema.org and purl or am I looking at something wrongly? Are the Concepts completely different from the classes? (interpreted differently, which I haven't thought of).
Now honestly I got waylaid by these numbers because I needed them to calculate the selectivity as defined in this paper
Here they say that for a triple pattern, the selectivity of a subject is 1/R, where R is the number of resources. So do the resources mean the Class count, the Concept count, or the count of ?s in the ?s ?p ?o . triple pattern?

The DBpedia ontology contains just the axioms for classes and properties with the namespace http://dbpedia.org/ontology.
The DBpedia SPARQL endpoint contains much more data:
First, it contains triples with properties in the namespace http://dbpedia.org/property . Those properties are untyped (i.e. of type rdf:Property), which in fact means that the value can be either a resource or a literal. In OWL we have typed properties, i.e. object and data properties.
Other information loaded into the SPARQL endpoint includes, among other things, links to external datasets like YAGO or the upper-level ontology UMBEL. You can find more details here [1], [2].
By the way, you can see that easily from your first query: there are many more properties, with different namespaces.
Regarding your first query: it is the correct query if you want the number of triples for each property. It returns only 10000 rows because that is the default result set limit of the Virtuoso triple store in which DBpedia is loaded. For more results you have to use pagination. The total number of properties used in triples can be found with
SELECT (COUNT(DISTINCT ?p) AS ?cnt)
WHERE
{ ?s ?p ?o}
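The pagination mentioned above could be sketched roughly like this. It is a minimal Python sketch and not part of the original answer: the helper name paged_queries and the page count are hypothetical, and the ORDER BY (count first, then property IRI) is added on the assumption that you want stable, non-overlapping pages, since SPARQL does not guarantee any result order without it.

```python
def paged_queries(page_size, num_pages):
    """Yield one SPARQL query string per page of the per-property counts."""
    base = (
        "SELECT ?p (COUNT(?p) AS ?totalCount) "
        "WHERE { ?s ?p ?o } "
        "GROUP BY ?p "
        # Total order: count first, then the property IRI as a tie-breaker.
        "ORDER BY DESC(?totalCount) ?p"
    )
    for page in range(num_pages):
        # Each page skips the rows already covered by the previous pages.
        yield f"{base} LIMIT {page_size} OFFSET {page * page_size}"


# Print the first three pages for a page size of 10000.
for query in paged_queries(page_size=10000, num_pages=3):
    print(query)
```

Each generated string can be sent to the endpoint in turn; you would stop once a page comes back with fewer rows than the page size.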
Your second query, which asks for all classes of type rdfs:Class, returns nothing because no class in DBpedia is declared with that type. For OWL ontologies it is more usual to query for classes of type owl:Class. The third query in fact returns all resources that ever occur in the object position of rdf:type triples, which is slightly different because it works on the instance data. That means it returns all classes that are actually used in the data.
As for your last question: I haven't read the paper, but a common metric in many research papers is the number of distinct subjects that use a given property.
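To make the two candidate readings of R concrete, here is a small Python sketch on made-up triples. It assumes that R means the number of distinct subjects, which is only one possible reading, since the paper's exact definition is not available here. The function names and the ex: data are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not DBpedia data.
triples = [
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Berlin", "ex:country", "ex:Germany"),
    ("ex:Paris", "rdf:type", "ex:City"),
    ("ex:Paris", "ex:country", "ex:France"),
    ("ex:Germany", "rdf:type", "ex:Country"),
]


def subject_selectivity(triples):
    """Return 1/R, taking R as the number of distinct subjects (an assumption)."""
    distinct_subjects = {s for s, _, _ in triples}
    return 1 / len(distinct_subjects)


def subjects_per_property(triples):
    """Count the distinct subjects that use each property."""
    subjects = defaultdict(set)
    for s, p, _ in triples:
        subjects[p].add(s)
    return {p: len(ss) for p, ss in subjects.items()}


print(subject_selectivity(triples))
print(subjects_per_property(triples))
```

On the real endpoint the same numbers would come from COUNT(DISTINCT ?s) queries, either over all triples or restricted to one property.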

Related

How to make a selection of a giant ontology, built from several aligned reference ontologies?

My organization has an information requirement spanning several information domains. In order to capture this, we are building a large organization ontology in which we align several domain-specific reference ontologies / vocabularies (think of Dublin Core, GeoSPARQL, industry-specific information models etc.) and, where necessary, add concepts in an 'extension' ontology (which is then also aligned with the reference ontologies).
The totality of this aligned ontology (>3000 classes and >10000 ObjectProperties) contains both unused concepts and semantic doubles, and is impossible for a newcomer to navigate. Furthermore, the organization wishes to standardize the use of specific concepts, so doubles are extremely undesirable. We are therefore looking for a way to construct the SuperAwesomeOntology that contains all concepts (and their OWL-related predicates like subClassOf, domain/range etc.) that have been labeled (maybe by something like dcterms:isRequiredBy "SuperAwesomeOntology"). The result should be a correct OWL ontology that can be stored in a single file.
One constraint: it has to be done programmatically (the copy/move/delete axioms interface of Protégé won't do), because if one of the reference ontologies gets an update, we want to be able to render the SuperAwesomeOntology again from its most up-to-date reference ontologies and find out if there are any conflicts.
How would we go about this? Could SPARQL do this, and if so, how? Alternative suggestions to the isRequiredBy labeling are also welcome.
If I understand you correctly, you want to programmatically remove unused concepts from a large ontology or a collection of ontologies/graphs and you also want to remove concepts/classes that you identified as duplicates via interlinking.
Identified duplicates are easy to remove:
Define what a duplicate is for you. For example, nodes at either end of an owl:sameAs or skos:closeMatch link that are outside of your core graph (so you don't remove the "original").
Construct the new graph using a SPARQL query:
construct {?s ?p ?o.}
{
?s ?p ?o.
filter not exists {graph ?g {?s owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
filter not exists {graph ?g {?o owl:sameAs ?x.} filter(?g!=<http://my.core.graph>)}
}
I tested this query for syntax and performance but not for correctness.
Unused concepts are more difficult to remove:
First, again, you need to define, what "unused" means for you. This criterion will however certainly involve reachability, or "connectedness" in the combined graph, where you want to only select the graph component that contains your core ontology. The problem is that, if you treat the triples as undirected edges, you will probably get a connected graph (that is only a single component and no nodes to remove) because the type hierarchy often connects everything. You could take the direction of edges into account, that is include only resources Y where there is a directed path from any resource X in your core ontology to Y.
This would ensure that you can go up the subclass hierarchy of the target ontology until e.g. owl:Thing but not down again. The problem is that you don't know what other type of edges are in the target ontology and in which direction they go but you could use only rdfs:subClassOf edges for now.
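The directed-reachability idea above can be sketched outside of SPARQL as a plain breadth-first search. This is only an illustration in Python under the experimental definition just described (follow rdfs:subClassOf upward only). The function name and the ex: class names are hypothetical.

```python
from collections import deque


def reachable_classes(core, subclass_of):
    """Return every class reachable from `core` by following
    rdfs:subClassOf edges upward (child -> parent) only."""
    seen = set(core)
    queue = deque(core)
    while queue:
        cls = queue.popleft()
        for parent in subclass_of.get(cls, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical edges: child -> set of direct superclasses.
subclass_of = {
    "ex:Bridge": {"ex:Structure"},
    "ex:Tunnel": {"ex:Structure"},
    "ex:Structure": {"owl:Thing"},
    "ex:Invoice": {"owl:Thing"},
}

kept = reachable_classes({"ex:Bridge"}, subclass_of)
print(sorted(kept))
```

Here ex:Tunnel and ex:Invoice are dropped even though they share superclasses with the core class, because the search never walks back down the hierarchy.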
If you have sufficiently defined your "unused concept" criterion, or want to try it with some experimental definition, you can either use a graph library or graph analysis application and import your data there.
Here is an example of how to import a SPARQL endpoint into the Cytoscape.js JavaScript graph visualization library; it can be used in Node.js as well. You will need to adapt the code heavily, however.
Or you do it again in SPARQL using SPARQL 1.1 property paths.
The problem is that those can have a large performance impact (or even a complexity that is way too large to ever complete) especially when applied to a large number of resources and an unrestricted path length. So it is possible that a query such as that times out but feel free to try and adapt it:
construct {?s ?p ?o.}
{
{?s ?p ?o.}
graph <http://my.core.graph> {?x rdfs:subClassOf ?X.}
{?x (<>|!<>)* ?s.}
}
The ?x rdfs:subClassOf ?X statement is just an identifier for which resources of your core ontology you want to use as source points; I couldn't get a valid query without it. When I apply a graph statement to the path expression, I get a syntax error from Virtuoso.

No Data Available When Query on Apache Jena Fuseki

I'm new to Apache Jena Fuseki and SPARQL. I have a problem when I try to query data on Fuseki. The data I used is from DBpedia, named 'Topical Concepts' (can be found here). I uploaded the data through the control panel in a browser (through the default port 3030) and used the query below:
SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?subject
}
LIMIT 25
I got an empty table and the message "no data available in this table". However, when I installed Fuseki and did the same thing on my Mac (the problem above happened on my desktop running Ubuntu 16), I successfully got 25 entries of the data. I don't think the operating system is the problem, but I really have no idea why it happened. Has anybody encountered the same problem?
In your SPARQL query, you have the following pattern:
?subject ?predicate ?subject
Notice that you repeat ?subject. This query effectively asks: "give me all RDF triples for which the subject is the same value as the object". It's likely that the reason you're not getting a result is simply that no such triples exist in your database.
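To illustrate the effect of the repeated variable, here is a toy Python sketch (not Fuseki or Jena code) that mimics the two patterns on made-up triples. The function names and the ex: data are hypothetical.

```python
# Hypothetical toy triples; only the last one has subject == object.
triples = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:c", "ex:sameAs", "ex:c"),
]


def match_any(triples):
    """?subject ?predicate ?object : three independent variables."""
    return list(triples)


def match_repeated_subject(triples):
    """?subject ?predicate ?subject : both positions must bind the same term."""
    return [(s, p, o) for s, p, o in triples if s == o]


print(len(match_any(triples)))
print(match_repeated_subject(triples))
```

With typical data there are few or no triples whose subject equals their object, so the repeated-variable pattern returns an empty table while the three-variable pattern returns everything.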
As for why this didn't happen on a Mac, without more info about your setup we can only speculate. It's possible that you configured your database slightly differently there (e.g. enabling a reasoner which would result in additional RDF triples that do match the query), or it might be as simple as you did a slightly different query there.
I am making two assumptions in answering your question:
You have two different instances of Jena installed, one on your laptop and one on your desktop.
You are sure you have uploaded the data, possibly into a named graph, so that the default graph is empty.
Fuseki (I haven't tried this on TDB) is often set by default to query only the default graph. If you activate tdb:unionDefaultGraph true ; in the config settings, it will query all the graphs. Before changing the settings, please check that this is really the cause. You can check by executing this query:
SELECT distinct ?g
WHERE {
graph ?g{
?s ?p ?o
}
}
If you get a result, that means you need to change the settings for it to work, or be mindful of this fact and always call your queries with graphs (as in the above query).
For more explanation please refer to https://jena.apache.org/documentation/serving_data/
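As a rough mental model of the behaviour described above, here is a toy Python sketch. It is not Fuseki's API: the dataset dictionary, the graph name and the function are all hypothetical. It only mimics how a plain ?s ?p ?o pattern sees nothing when the data sits in a named graph, and how a union default graph changes that.

```python
# Hypothetical dataset: graph name -> set of triples; None is the default graph.
dataset = {
    None: set(),
    "http://example.org/topical": {
        ("ex:Category1", "rdf:type", "skos:Concept"),
        ("ex:Category2", "rdf:type", "skos:Concept"),
    },
}


def query_default_graph(dataset, union_default_graph=False):
    """Return the triples a plain `?s ?p ?o` pattern would see."""
    if union_default_graph:
        # Mimics the union setting: the default graph becomes the
        # union of all named graphs.
        named = (g for name, g in dataset.items() if name is not None)
        return set().union(*named)
    return set(dataset[None])


print(len(query_default_graph(dataset)))
print(len(query_default_graph(dataset, union_default_graph=True)))
```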

Grouping by blank nodes

I have the following data:
@prefix f: <http://example.org#> .
_:a f:trait "Rude"@en .
_:a f:name "John" .
_:a f:surname "Roy" .
_:b f:trait "Crude"@en .
_:b f:name "Mary" .
_:b f:surname "Lestern" .
However, if I execute the following query in Blazegraph:
PREFIX f: <http://example.org#>
SELECT ?s ?o
WHERE
{
?s f:trait ?o .
}
I get six results:
s o
t32 Crude
t37 Crude
t39 Crude
t31 Rude
t36 Rude
t38 Rude
If blank nodes _:a and _:b are distinct nodes, how should I write a SPARQL query to return only two distinct results? I have tried SELECT DISTINCT, but it still returns six results. I have tried grouping by ?o, but Blazegraph returns an error, saying it's a bad aggregate. Why does this kind of output of repeating tuples happen? And how to avoid it?
The problem is that you have inserted data containing blank nodes several times. This operation is not idempotent.
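Why repeated loading multiplies the results can be shown with a toy model. The Python sketch below is not Blazegraph code. ToyStore is a hypothetical class that, like a real store, mints fresh blank nodes on every load, and for brevity it only handles blank nodes in the subject position.

```python
import itertools


class ToyStore:
    """Toy triple store that mints fresh blank nodes on every load."""

    def __init__(self):
        self.triples = set()
        self._ids = itertools.count(1)

    def load(self, triples):
        fresh = {}  # blank node label -> new node, scoped to this one load
        for s, p, o in triples:
            if s.startswith("_:"):
                if s not in fresh:
                    fresh[s] = f"t{next(self._ids)}"
                s = fresh[s]
            self.triples.add((s, p, o))


data = [
    ("_:a", "f:trait", "Rude"), ("_:a", "f:name", "John"),
    ("_:b", "f:trait", "Crude"), ("_:b", "f:name", "Mary"),
]

store = ToyStore()
for _ in range(3):  # loading the same file three times
    store.load(data)

traits = sorted((s, o) for s, p, o in store.triples if p == "f:trait")
print(len(traits))
```

Three loads give six distinct trait rows, which matches the shape of the output in the question. SELECT DISTINCT cannot collapse them because the six subjects really are different nodes.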
Useful quotes
From RDF 1.1 Concepts and Abstract Syntax:
Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes.
From RDF 1.1 Semantics:
RDF graphs can be viewed as conjunctions of simple atomic sentences in first-order logic, where blank nodes are free variables which are understood to be existential. Taking the union of two graphs is then analogous to syntactic conjunction in this syntax. RDF syntax has no explicit variable-binding quantifiers, so the truth conditions for any RDF graph treat the free variables in that graph as existentially quantified in that graph. Taking the union of graphs which share a blank node changes the implied quantifier scopes.
From SPARQL 1.1 Query Language:
Blank node labels are scoped to a result set.
There need not be any relation between a label _:a in the result set and a blank node in the data graph with the same label.
An application writer should not expect blank node labels in a query to refer to a particular blank node in the data.
From SPARQL 1.1 Update:
Blank nodes... are assumed to be disjoint from the blank nodes in the Graph Store, i.e., will be inserted with "fresh" blank nodes.
Some discussion
Problems of the RDF Model: Blank Nodes
Blank nodes considered harmful
Different triplestores provide solutions for the "problems" described. E.g., Jena allows you to use pseudo-URIs like <_:b1> etc.

Downloaded DBpedia data does not include all instance-types triples

I downloaded some data from http://downloads.dbpedia.org/2015-04/core/,
including: instance-types_en.nt, mappingbased-properties_en.nt, and some others.
I have successfully loaded them into OpenLink Virtuoso DB, but when I run some sample SPARQL queries, e.g., a query to see all the triples about the subject Xiamen_University, the problem appears.
select ?s ?p ?o
where
{
?s rdfs:label "Xiamen University"@en .
?s ?p ?o .
}
On the DBpedia SPARQL endpoint there are heaps of triples about Xiamen_University, while in my db there are only 4 or 5 of them.
In particular, there is no triple in my db indicating that Xiamen_University is of type University, or any instance-type triples at all. I found similar cases for some other subjects as well.
I think the instance-types_en.nt file does not include all the instance-type triples from Wikipedia, and the same goes for mappingbased-properties. Is that right? If so, where could I find the right source file?
There's a whole list of the available datasets on the downloads page. I don't see much documentation about exactly what's in each one, but the names are fairly descriptive, and the question mark link next to each one shows a preview of the kind of information it contains. Hovering over each title shows a short description.
It looks like to get most of the interesting properties, you'd probably want the mappingbased datasets, along with the labels dataset (since the query that you wrote identifies objects by label).

DBPedia queries missing certain chemical compounds

I am running this query to get a list of all compounds from the DBpedia public SPARQL endpoint.
SELECT * WHERE {
?y rdf:type dbpedia-owl:Drug.
?y rdfs:label ?Name .
OPTIONAL {?y dbpedia-owl:iupacName ?iupacname} .
OPTIONAL {?y dcterms:subject ?y1}
FILTER (langMatches(lang(?Name),"en"))
}
LIMIT 50000
I am downloading in batches of 50000 (2 files) using the offset parameter.
Somehow Isopropyl_alcohol is not covered by this, even though a page exists at
http://live.dbpedia.org/page/Isopropyl_alcohol
and it has the properties that I am searching for?
There are two issues here. The first is that DBpedia Live and DBpedia do not have exactly the same content. According to the DBpedia Live webpage:
Wikipedia users constantly revise Wikipedia articles with updates
happening almost each second. Hence, data stored in the official
DBpedia endpoint can quickly become outdated, and Wikipedia articles
need to be re-extracted. DBpedia Live enables such a continuous
synchronization between DBpedia and Wikipedia.
That page also lists two SPARQL endpoints for DBpedia Live:
http://live.dbpedia.org/sparql
http://dbpedia-live.openlinksw.com/sparql
However, you'll run into issues on both. Isopropyl_alcohol is in DBpedia, and its URI is
http://dbpedia.org/resource/Isopropyl_alcohol (which, when using a web browser, redirects to
http://dbpedia.org/page/Isopropyl_alcohol)
Looking there, we see that Isopropyl alcohol doesn't have rdf:type dbpedia-owl:Drug, but only
owl:Thing
dbpedia-owl:ChemicalCompound
dbpedia-owl:ChemicalSubstance
yago:Alcohols
http://umbel.org/umbel/rc/ChemicalSubstanceType
yago:AlcoholSolvents
so you won't be able to find it with your query on DBpedia, because it doesn't have the type dbpedia-owl:Drug. Now, Isopropyl_alcohol also exists in DBpedia Live, and its URL is
http://live.dbpedia.org/resource/Isopropyl_alcohol, which redirects to
http://live.dbpedia.org/page/Isopropyl_alcohol
but it only has the following rdf:types:
yago:Alcohols
yago:AlcoholSolvents
http://umbel.org/umbel/rc/ChemicalSubstanceType
so it won't be found by your query on DBpedia Live, for the same reason.
The second issue is the one that AndyS pointed out. Even if the query would select Isopropyl_alcohol in DBpedia or DBpedia Live, unless you provide an ordering constraint, the limit/offset combination won't be guaranteed to return it, since without an ordering constraint, the server could legitimately return the same set of 50000 results to you every time.
Maybe it is not being found in the LIMIT/OFFSET combination you are using. The server is not obliged to return results in the same order every time unless you use ORDER BY, so the slices you have may not in fact cover all results.
Maybe the SPARQL site and live.dbpedia are not in step.
Try asking directly for Isopropyl_alcohol.
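Asking directly could look something like the following. This is a minimal Python sketch: the helper names types_query and request_url are hypothetical, and it only builds the query and a SPARQL Protocol GET URL for the public endpoint at http://dbpedia.org/sparql without actually sending it.

```python
from urllib.parse import urlencode


def types_query(resource_iri):
    """Build a query listing every rdf:type of one resource."""
    return (
        "SELECT ?type WHERE { "
        f"<{resource_iri}> a ?type "
        "}"
    )


def request_url(endpoint, query):
    """Encode the query as a SPARQL Protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


query = types_query("http://dbpedia.org/resource/Isopropyl_alcohol")
url = request_url("http://dbpedia.org/sparql", query)
print(query)
print(url)
```

If dbpedia-owl:Drug is not among the returned types, the original query cannot match the resource, whatever LIMIT/OFFSET slices are used.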