When the triple count is very large, why is a SPARQL federated query so slow, but a local query so fast?

I set up SPARQL endpoints on several Linux servers (RDF database: Fuseki 4.4.0; number of triples: 6,000,000) and then queried them through a SPARQL federated query.
Result: the federated query is very slow, while the local query is very fast.
Federated query (very slow: several hours passed with no response):
SELECT * WHERE {
  {
    SERVICE SILENT <fuseki endpoint 1> {
      SELECT * WHERE {
        ?s ?p ?o .
      }
    }
  }
  UNION
  {
    SERVICE SILENT <fuseki endpoint 2> {
      SELECT * WHERE {
        ?s ?p ?o .
      }
    }
  }
}
OFFSET 0 LIMIT 5
Local query (very fast, used 0.02 s):
SELECT * WHERE {
  ?s ?p ?o .
}
OFFSET 0 LIMIT 5
However, running the same SPARQL statement against Virtuoso endpoints such as DBpedia is very fast, even though they hold hundreds of millions of triples.

SERVICE returns all results for the SERVICE block in a single HTTP request. It does not know there is an overall query limit, and a more complex query may locally filter or join the SERVICE results, so more than 5 of them may be needed.
Apache Jena 4.6.1 adds support for enhancing SERVICE: https://jena.apache.org/documentation/query/service_enhancer.html
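If only a handful of rows is needed from each endpoint, one workaround (a sketch; the endpoint URLs are placeholders from the question) is to push the limit into each SERVICE sub-select, so that each remote endpoint returns only a few rows instead of all 6,000,000:

```sparql
SELECT * WHERE {
  {
    SERVICE SILENT <fuseki endpoint 1> {
      # the LIMIT inside the sub-select is applied by the remote endpoint
      SELECT * WHERE { ?s ?p ?o . } LIMIT 5
    }
  }
  UNION
  {
    SERVICE SILENT <fuseki endpoint 2> {
      SELECT * WHERE { ?s ?p ?o . } LIMIT 5
    }
  }
}
LIMIT 5
```

Note that this changes the query's meaning slightly: each endpoint contributes at most 5 rows before the union, which is fine for sampling but not for queries that must join or filter the full remote result.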

Related

Finding common super classes in SPARQL

We need a SPARQL query to find the common superclasses for two Wikidata entities. I tried it like this:
SELECT ?commonBase #?commonBaseLabel
WHERE {
  wd:Q39798 wdt:P31/wdt:P279* ?commonBase .
  wd:Q26868 wdt:P31/wdt:P279* ?commonBase .
  #SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
This always times out, even with the SERVICE line commented out.
However, when I run the query with just one of the two triple patterns, it runs very fast, yielding either 130 or 12 results.
Why is it so much slower with both patterns? In theory, the database could evaluate the two patterns separately and then intersect the results, i.e. deliver those bindings that occur in both.
If this is a flaw in the design of the SPARQL engine, then how can I adapt the query to make it run faster?
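One way to express that intersection idea directly in SPARQL (a sketch; whether it actually runs faster depends on the engine's optimizer) is to compute each set in its own sub-select and let the engine join the two small result sets:

```sparql
SELECT ?commonBase
WHERE {
  { SELECT DISTINCT ?commonBase WHERE { wd:Q39798 wdt:P31/wdt:P279* ?commonBase . } }
  { SELECT DISTINCT ?commonBase WHERE { wd:Q26868 wdt:P31/wdt:P279* ?commonBase . } }
}
```

Each sub-select corresponds to one of the fast single-pattern queries from the question; the outer join then plays the role of the INTERSECT.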

Jena seems to cache BIND(RAND()) Queries

I am running the following query thousands of times in a row in a Java program using Apache Jena (in order to generate random walks).
SELECT ?p ?o
WHERE {
  $ENTITY$ ?p ?o .
  FILTER(!isLiteral(?o))
  BIND(RAND() AS ?sortKey)
}
ORDER BY ?sortKey
LIMIT 1
However, I always get the same set of properties and objects, even though this seems extremely unlikely. I suspect Jena is caching the result of the query, despite the RAND() component.
What is the best and most efficient way to solve this problem?
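One workaround, assuming the per-entity result sets are small enough to fetch whole, is to remove the randomization from the query entirely and pick one row client-side; the query then becomes a plain, deterministic lookup, so any caching is harmless (and even helpful):

```sparql
SELECT ?p ?o
WHERE {
  $ENTITY$ ?p ?o .
  FILTER(!isLiteral(?o))
}
```

The Java side would then draw one binding at random from the returned result set using its own random number generator.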

How to import and query 4-column N-Quad quadruples into Blazegraph?

I'm relatively new to RDF/SPARQL and I'm trying to import and query some 4-column N-Quad data in Blazegraph (see schemaOrgEvent.sample.txt).
When I import the quads data (web console), I'm left with only 3-column, triple-based data, i.e. Blazegraph allows SPARQL queries SELECTing {?s ?p ?o}, but quad selections such as {?s ?p ?o ?c} do not work. What am I doing wrong?
Can Blazegraph store 4-column quad data natively, or am I misunderstanding the nature of RDF triple/quad storage?
My Blazegraph schema/knowledge database (to import into), is configured for QUADs with:
com.bigdata.rdf.store.AbstractTripleStore.quads=true
I've imported the quads file into Blazegraph using the web-based UPDATE console at http://127.0.0.1:9999/bigdata/#update. I haven't tried direct bulk upload yet.
Also, besides becoming triples (normalised?), the imported quad data appears to have its first column converted to another, Blazegraph-provided (?) identifier. The original (4-column) quad format is:
_:node03f536f724c9d62eb9acac3ef91faa9 <http://schema.org/PostalAddress/addressRegion> "Kentucky"@en <http://concerts.eventful.com/Lauren-Alaina> .
And after import (3-columns):
t1702 schema:PostalAddress/addressRegion Kentucky
The query used was:
SELECT * WHERE {
  ?s ?p ?o
  #?s ?p ?o ?c - won't work :-(
  FILTER(STR(?o) = "Kentucky")
}
The value 't1702' is a 'foreign key' of sorts and can be used to link to other triples (i.e. it is repeated within the imported data).
SPARQL queries RDF triples; the fourth element of a quad is the graph. RDF data can be divided into separate named graphs of triples, analogous to tables in a database. You may be able to query the quad store with the following:
SELECT *
WHERE {
  GRAPH ?g {
    ?s ?p ?o
  }
}
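Applying the same idea to the 'Kentucky' example above (a sketch), binding the graph variable ?g exposes the fourth column that appeared to be lost on import:

```sparql
SELECT ?g ?s ?p ?o
WHERE {
  GRAPH ?g {
    ?s ?p ?o .
    FILTER(STR(?o) = "Kentucky")
  }
}
```

Here ?g should bind to the fourth element of each imported quad (the context IRI, e.g. <http://concerts.eventful.com/Lauren-Alaina> in the sample line).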

Filter by datatype

I'm trying to filter by datatype in DBpedia. For example:
SELECT *
WHERE {
  ?s ?p ?o .
  FILTER ( datatype(?o) = xsd:integer )
}
LIMIT 10
But I get no results, although there are certainly integer values. I get the same from other endpoints running Virtuoso, yet I do get results from endpoints on other engines. What could be the problem? If Virtuoso does not implement this SPARQL function properly, what should I use instead?
This is an expensive query, since it has to traverse all triples (?s ?p ?o .). The query's execution time exceeds the maximum time configured for the Virtuoso instance that serves DBpedia's SPARQL endpoint at http://dbpedia.org/sparql.
If you don't use the timeout parameter, then you will get a time out error (Virtuoso S1T00 Error SR171: Transaction timed out). When you use the timeout (by default set to 30000 for the DBpedia endpoint), you will get incomplete results that will contain HTTP headers like these:
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 1.389M rnd 5.146M seq 0 same seg 55.39K same pg 195 same par 0 disk 0 spec disk 0B / 0 me
Empty results may thus simply be incomplete; they do not necessarily indicate that there are no xsd:integer literals in DBpedia. A related discussion about partial results in Virtuoso can be found here.
As a solution, you can load DBpedia from its dumps and analyze it locally.
As a side note, your query is syntactically invalid, because it is missing the namespace declaration for the xsd prefix. You can check the syntax of SPARQL queries with the SPARQLer Query Validator and find the namespaces of common prefixes at Prefix.cc. The Virtuoso instance behind the DBpedia SPARQL endpoint tolerates the missing xsd namespace, but it is good practice to stick to syntactically valid SPARQL queries for greater interoperability.
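For reference, a syntactically valid version of the query simply declares the standard XSD namespace:

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT *
WHERE {
  ?s ?p ?o .
  FILTER ( datatype(?o) = xsd:integer )
}
LIMIT 10
```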

Unable to use SPARQL SERVICE with FactForge

I am trying to access FactForge from Sesame triplestore. This is the query:
SELECT *
WHERE {
  SERVICE <http://factforge.net/sparql> {
    ?s ?p ?o
  }
}
LIMIT 100
The query doesn't get executed. The same structure works with DBpedia. FactForge's SPARQL endpoint on the web is working. What do I need to do to access the endpoint successfully from Sesame?
What you need to do is write a more meaningful (or at least more constrained) query. Your query selects all possible triples, which presumably puts a lot of stress on the FactForge endpoint (which contains about 3 billion triples). The reason your query "does not get executed" (which probably means you are waiting forever for a result) is that it takes the SPARQL endpoint a very long time to produce its response.
The LIMIT 100 you put on the query is outside the scope of the SERVICE clause, and is therefore not communicated to the remote endpoint. While in this particular case Sesame's optimizer could in principle add it (since there are no additional constraints outside the SERVICE clause), it is currently not that smart. So the query sent to FactForge has no limit, and the actual limit is only applied after the results come back, which, for a "give me all your triples" query, naturally takes a while.
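One way to make sure the limit actually reaches the remote endpoint is to wrap it in a sub-select inside the SERVICE clause (a sketch of the same query):

```sparql
SELECT *
WHERE {
  SERVICE <http://factforge.net/sparql> {
    # the sub-select, including its LIMIT, is part of the query
    # sent to the remote endpoint
    SELECT * WHERE { ?s ?p ?o } LIMIT 100
  }
}
```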
However, demonstrably the SERVICE clause does work for FactForge when used from Sesame, because if you try a slightly more constrained query, for example a query selecting all companies:
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT *
WHERE {
  SERVICE <http://factforge.net/sparql> {
    ?s a dbp-ont:Company
  }
}
LIMIT 100
it works fine, and you get a response.
More generally, if your queries target one particular SPARQL endpoint, I would recommend using a SPARQL endpoint proxy (one of the repository types available in Sesame) instead of the SERVICE clause. SERVICE is only really useful for combining data from your local repository with data from a remote endpoint in a single query. A SPARQL endpoint proxy ensures that LIMIT clauses are actually communicated to the endpoint and will generally give you better performance than a SERVICE query.