Finding common superclasses in SPARQL

We need a SPARQL query to find the common superclasses for two Wikidata entities. I tried it like this:
select ?commonBase #?commonBaseLabel
where {
wd:Q39798 wdt:P31/wdt:P279* ?commonBase.
wd:Q26868 wdt:P31/wdt:P279* ?commonBase.
#SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Here's a direct link to the query.
This always times out, even with the SERVICE line commented out.
However, when I run the same query with just one of the WHERE clauses, it runs very fast, yielding either 130 or 12 results.
Why is it so much slower with both WHERE clauses? In theory, the database could just perform both WHERE clauses separately and then do an INTERSECT, i.e. deliver those results that have occurred in both queries.
If this is a flaw in the design of the SPARQL engine, then how can I adapt the query to make it run faster?
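One possible adaptation is sketched below. It assumes the endpoint is the Blazegraph-based Wikidata Query Service, where the `wd:`, `wdt:` and `hint:` prefixes are predefined and the engine accepts query hints. The `hint:Prior hint:gearing "forward"` lines ask the engine to evaluate each property path starting from the fixed item rather than from the unbound `?commonBase`. Whether this avoids the timeout for these particular items is not verified here.

```sparql
SELECT ?commonBase
WHERE {
  # Walk upwards from the first item: instance-of, then any number of subclass-of steps.
  wd:Q39798 wdt:P31/wdt:P279* ?commonBase .
  hint:Prior hint:gearing "forward" .

  # Same walk from the second item; the shared variable keeps only common classes.
  wd:Q26868 wdt:P31/wdt:P279* ?commonBase .
  hint:Prior hint:gearing "forward" .
}
```

The hint is a Blazegraph extension, so this sketch is not portable to other SPARQL engines.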

Related

Jena seems to cache BIND(RAND()) Queries

I am running the following query thousands of times in a row in a Java program using Apache Jena (in order to generate random walks).
SELECT ?p ?o
WHERE {
$ENTITY$ ?p ?o .
FILTER(!isLiteral(?o)).
BIND(RAND() AS ?sortKey)
} ORDER BY ?sortKey LIMIT 1
However, I always get the same set of properties and objects (even though this seems to be extremely unlikely). I suppose Jena is caching the result to the query (despite the RAND() component).
What is the best and most efficient way to solve this problem?

No Data Available When Query on Apache Jena Fuseki

I'm new to Apache Jena Fuseki and SPARQL. I have a problem when I try to query data on Fuseki. The data I used is the DBpedia dataset named 'Topical Concepts' (can be found here). I uploaded the data through the control panel in a browser (on the default port 3030) and used the query below:
SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?subject
}
LIMIT 25
I got an empty table and the message "no data available in this table". However, when I installed Fuseki and did the same thing on my Mac (the problem above happened on my desktop running Ubuntu 16), I successfully got 25 entries of the data. I don't think the operating system is the problem, but I really have no idea why this happened. Has anybody encountered the same problem?
In your SPARQL query, you have the following pattern:
?subject ?predicate ?subject
Notice that you repeat ?subject. This query effectively asks: "give me all RDF triples for which the subject is the same value as the object". It's likely that the reason you're not getting a result is simply that no such triples exist in your database.
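If the intent was simply to list some triples, a minimal sketch of the query with a distinct variable in the object position would be:

```sparql
SELECT ?subject ?predicate ?object
WHERE {
  # Three different variables: matches any triple, not only self-referencing ones.
  ?subject ?predicate ?object
}
LIMIT 25
```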
As for why this didn't happen on a Mac, without more info about your setup we can only speculate. It's possible that you configured your database slightly differently there (e.g. enabling a reasoner which would result in additional RDF triples that do match the query), or it might be as simple as you did a slightly different query there.
I am making two assumptions in answering your question:
You have two different instances of Jena installed, one on your laptop and one on your desktop.
You are sure you have uploaded the data, possibly into a named graph, leaving the default graph empty.
Fuseki (I haven't tried this on TDB) is often configured by default to query only the default graph. If you set tdb:unionDefaultGraph true ; in the config, it will query all the graphs. Before changing the settings, please check that your data really is in a named graph. You can check by executing this query:
SELECT distinct ?g
WHERE {
graph ?g{
?s ?p ?o
}
}
If you get a result, that means you need to change the settings for it to work, or be mindful of this fact and always call your queries with graphs (as in the above query).
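As a sketch of the second option (always addressing graphs explicitly), the original "show me some triples" query could be written so that it looks inside every named graph and also reports which graph each triple came from:

```sparql
SELECT ?g ?s ?p ?o
WHERE {
  # GRAPH ?g matches triples in named graphs only, never in the default graph.
  GRAPH ?g {
    ?s ?p ?o
  }
}
LIMIT 25
```

If this returns rows while the same pattern without the GRAPH clause returns nothing, the data was loaded into a named graph.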
For more explanation please refer to https://jena.apache.org/documentation/serving_data/

Fetching entities label data using SPARQL in Wikidata

I am using the Wikidata Query Service to fetch data: https://query.wikidata.org/
I have already managed to use an entity's label using 2 methods:
Using wikibase label service. For example:
SELECT ?spouse ?spouseLabel WHERE {
wd:Q1744 wdt:P26 ?spouse.
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
}
}
Using the rdfs:label property:
SELECT ?spouse ?spouseLabel WHERE {
wd:Q1744 wdt:P26 ?spouse.
?spouse rdfs:label ?spouseLabel. filter(lang(?spouseLabel) = "en").
}
However, it seems that for complex queries, the second method performs faster, contrary to what the MediaWiki user manual states:
The service is very helpful when you want to retrieve labels, as it
reduces the complexity of SPARQL queries that you would otherwise need
to achieve the same effect.
(https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Label_service)
What does the wikibase label service add that I can't achieve using just rdfs:label?
It seems odd, since both seemingly achieve the same purpose, but the rdfs:label method seems faster (which is logical, because the query does not need to join data from external sources).
Thanks!
As I understand from the documentation, the wikibase label service simplifies the query by removing the need to explicitly search for labels. In that regard it reduces complexity of the query you need to write, in terms of syntax.
I would assume that the query is then expanded to another representation, maybe with the rdfs namespace like in your second option, before actually being resolved.
As for the second option being faster, have you done any systematic benchmarking? In a couple of tries of mine, the first option was faster. I would assume that the performance of the public endpoint is in any case subject to fluctuation based on demand, caching, etc., so it may be tricky to draw conclusions about performance for similar queries.
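One behavioural difference between the two methods is worth illustrating: the plain rdfs:label pattern drops any spouse that has no English label, while the label service still returns a row for it. A sketch of an rdfs:label version that keeps such rows, assuming the same item and property as in the question, needs an explicit OPTIONAL:

```sparql
SELECT ?spouse ?spouseLabel WHERE {
  wd:Q1744 wdt:P26 ?spouse .
  # Without OPTIONAL, spouses lacking an English label would vanish from the result.
  OPTIONAL {
    ?spouse rdfs:label ?spouseLabel .
    FILTER(LANG(?spouseLabel) = "en")
  }
}
```

This is the kind of extra syntax the manual's remark about reduced query complexity refers to; it says nothing about which variant runs faster.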

The total number of classes and properties in DBpedia

Okay this seems like a really basic question, but for some reason I am not able to figure this out. I have the DBpedia 2014 owl file from here. Now when I load this in Protégé and look at the Ontology metrics tab, I see that the class count is 814, Object property count is 1310, Data property count is 1725. Is this the right number? Out of curiosity I tried to check the numbers on the Virtuoso endpoint and for the query
select ?p (count(?p) as ?totalCount) where {?s ?p ?o } group by ?p order by DESC(?totalCount)
i.e. trying to find the properties and the total number of times they appear in the graph, I find that the total is 10,000. Now I am not sure if this is the right way to check the properties and the number of times they appear in the graph.
For classes when I issue this query :
SELECT ?class
WHERE {
?class rdf:type rdfs:Class.
}
I don't get any results at all. Now using the default query in Virtuoso i.e.
Select count(distinct ?Concept) where {[] a ?Concept}
I get the value 369857. So I am a bit confused. Is this number so large because the graph has concepts from YAGO, UMBEL, schema.org and purl, or am I looking at something wrongly? Are the Concepts completely different from the classes (i.e. interpreted in some way I haven't thought of)?
Now honestly I got waylaid by these numbers because I needed them to calculate the selectivity as defined in this paper
Here they say that for a triple pattern, the selectivity of a subject is 1/R, where R is the number of resources. So do the resources mean the Class count or the Concept count? Or the count of ?s in the ?s ?p ?o . triple pattern?
The DBpedia ontology contains just the axioms for classes and properties with the namespace http://dbpedia.org/ontology.
The DBpedia SPARQL endpoint contains much more data:
First, it contains triples with properties in the namespace http://dbpedia.org/property . Those properties are untyped (i.e. of type rdf:Property), which in fact means that the value can be either a resource or a literal. In OWL we have typed properties, i.e. object and data properties.
Other information loaded into the SPARQL endpoint includes, among others, links to external datasets like YAGO or the upper-level ontology UMBEL. You can find more details here [1], [2].
By the way, you can easily see that from your first query: there are many more properties, with different namespaces.
Regarding your first query: it's the correct query if you want the number of triples for each property. It returns only 10000 rows because that's the default result set limit of the Virtuoso triple store in which DBpedia is loaded. For more results you have to use pagination. The total number of properties used in triples can be found with
SELECT (COUNT(DISTINCT ?p) AS ?cnt)
WHERE
{ ?s ?p ?o}
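The pagination mentioned above for the per-property counts can be sketched with LIMIT and OFFSET. The page size of 10000 matches the endpoint limit described in this answer; the secondary sort key ?p is an addition here so that properties with equal counts keep a stable order across pages.

```sparql
SELECT ?p (COUNT(?p) AS ?totalCount)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?totalCount) ?p
LIMIT 10000
OFFSET 10000   # second page; use 0, 10000, 20000, ... for successive pages
```

Repeat the query with increasing OFFSET values until a page comes back with fewer than 10000 rows.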
Your second query, for all classes of type rdfs:Class, returns nothing because no class in DBpedia is of that type. For OWL ontologies it's more usual to query for classes of type owl:Class. The third query in fact returns all resources that ever occur in the object position of rdf:type triples, which is slightly different as it works on instance data. That means it returns all classes that are actually used in the data.
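A minimal sketch of the owl:Class variant of the class query is shown below. On the public endpoint the result may include classes from other loaded vocabularies as well, so it will not necessarily match the count Protégé reports for the ontology file alone.

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?class
WHERE {
  # "a" is shorthand for rdf:type.
  ?class a owl:Class .
}
```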
For your last question: I haven't read the paper, but a common metric in many research papers is the number of distinct subjects that use a given property.
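As an illustration of that metric, the following sketch counts the distinct subjects that use one property. The property dbo:birthPlace is only an example chosen here, not something taken from the paper; substitute the property of the triple pattern you are analysing.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT (COUNT(DISTINCT ?s) AS ?subjects)
WHERE {
  # Each subject is counted once, however many values it has for the property.
  ?s dbo:birthPlace ?o .
}
```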

Unable to use SPARQL SERVICE with FactForge

I am trying to access FactForge from Sesame triplestore. This is the query:
select *
where{
SERVICE <http://factforge.net/sparql>{
?s ?p ?o
}
}
LIMIT 100
The query doesn't get executed. The same structure works with DBpedia. FactForge's SPARQL endpoint on the web is working. What do I need to do to access the endpoint successfully from Sesame?
What you need to do is write a more meaningful (or at least more constrained) query. Your query is just selecting all possible triples, which presumably puts a lot of stress on the factforge endpoint (which contains about 3 billion triples). The reason your query "does not get executed" (which probably means that you are just waiting forever for the query to return a result) is that it takes the SPARQL endpoint a very long time to return its response.
The LIMIT 100 you put on the query is outside the scope of the SERVICE clause, and therefore not actually communicated to the remote endpoint you're querying. While in this particular case it would be possible for Sesame's optimizer to add that (since there are no additional constraints in your query outside the scope of the SERVICE clause), unfortunately it's currently not that smart - so the query sent to factforge is without a limit, and the actual limit is only applied after you get back the result (which, in the case of your "give me all your triples" query, naturally takes a while).
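One way to get a limit into the part of the query that is sent to the remote endpoint is to put a subquery with its own LIMIT inside the SERVICE clause, which SPARQL 1.1 permits. This is a sketch only: it assumes Sesame forwards the subquery unchanged, and the remote endpoint may still be slow on such an unconstrained pattern.

```sparql
SELECT *
WHERE {
  SERVICE <http://factforge.net/sparql> {
    # The LIMIT is now part of the pattern evaluated by the remote endpoint.
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 100
  }
}
```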
However, demonstrably the SERVICE clause does work for FactForge when used from Sesame, because if you try a slightly more constrained query, for example a query selecting all companies:
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select *
where{
SERVICE <http://factforge.net/sparql>{
?s a dbp-ont:Company
}
} LIMIT 100
it works fine, and you get a response.
More generally speaking, I would recommend that if you want to run queries against one particular SPARQL endpoint, you use a SPARQL endpoint proxy (which is one of the repository types available in Sesame) instead of the SERVICE clause. SERVICE is only really useful when trying to combine data from your local repository with that of a remote endpoint in a single query. Using a SPARQL endpoint proxy makes sure LIMIT clauses are actually communicated to the endpoint, and will generally give you better performance than a SERVICE query.