Performance of `OFFSET ... LIMIT ...` query in Virtuoso - sparql

I'm trying to query dbpedia on a local installation of Virtuoso (a little over a billion triples), and would like to be able to read the entire thing in pages of about 1000 triples at a time. The following query seemed promising:
SELECT *
WHERE {
?s ?o ?p.
}
LIMIT 1000
OFFSET 10000000
until I realized that queries of this type run in time proportional to the OFFSET value.
Looking into the query plan it seems that queries such as this get translated into SQL that looks like this:
SELECT TOP 100000000, 1 __id2in ( "s_7_2_t0"."S") AS "s",
__id2in ( "s_7_2_t0"."P") AS "o",
__ro2sq ( "s_7_2_t0"."O") AS "p"
FROM DB.DBA.RDF_QUAD AS "s_7_2_t0"
OPTION (QUIETCAST)
which confirms my observation.
Is it possible to run such queries in constant time, either in SPARQL or directly in SQL against the underlying table? Since it's all SQL under the hood, I had hoped it would be a straightforward matter of writing the corresponding SQL query, but for some reason the query select * from DB.DBA.RDF_QUAD limit 1; fails with a syntax error, which leaves me more confused than ever.
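One workaround worth considering (not from this thread, and not guaranteed constant-time, but it avoids the large-offset scan) is keyset paging: order the results by subject and resume each page from the last subject seen on the previous page. A sketch, where `<http://example.org/lastSeen>` is a hypothetical placeholder for the last subject of the previous page:

```sparql
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  # Resume past the previous page; STR() because > is not
  # defined for IRIs in SPARQL 1.1.
  FILTER ( STR(?s) > STR(<http://example.org/lastSeen>) )
}
ORDER BY ?s
LIMIT 1000
```

Whether the store turns the filter into an index seek rather than a scan depends on the engine, so this should be verified against the query plan.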

Related

Limit records returned by range

Hi, I have a particular SPARQL query which returns all the triples in my graph. This takes some time to run, so I was wondering if I can limit the number of triples and rerun the query every time the user presses "next" in the interface, e.g., first they see the first 10, then the next 10, and so on. Right now I am getting all the results, saving them, and then traversing them. Can I instead fetch just the first 10, then the next 10, etc.?
For a SELECT query, to get only the first ten rows:
SELECT ...
WHERE {
...
}
LIMIT 10
To get the next ten:
SELECT ...
WHERE {
...
}
OFFSET 10 LIMIT 10
For the next ten, increase the OFFSET to 20, and so on.
You say that your query is returning triples, so is it a CONSTRUCT query? LIMIT and OFFSET also work with CONSTRUCT, but the number of triples returned will depend on the number of triple patterns in the construct template.
Edit: When using LIMIT and OFFSET, on some SPARQL stores you will need to also use ORDER BY ?x, where ?x is one of the query variables. This ensures a predictable order of the results.
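Putting the pieces together, the third page of a paged query would look like this (a sketch, with `?s ?p ?o` standing in for your actual patterns):

```sparql
SELECT ?s
WHERE {
  ?s ?p ?o .
}
ORDER BY ?s
OFFSET 20
LIMIT 10
```

Without the ORDER BY, nothing guarantees that two executions return rows in the same order, so consecutive pages could overlap or skip results.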

Count properties in a large data set via SPARQL

I'd like to have a list of most used properties in a SPARQL endpoint. The most straightforward query would be:
select ?p ( count ( distinct * ) as ?ct )
{
?s ?p ?o.
}
group by ?p
order by desc ( ?ct )
limit 1000
The problem is that there are too many triples (1.6 billion) and the server times out. So, after googling, I've also tried the following, to get at least sample statistics (yes, it's Virtuoso-specific, and that's fine in my case):
select ?p ( count ( distinct * ) as ?ct )
{
?s ?p ?o.
FILTER ( 1 > <SHORT_OR_LONG::bif:rnd> (0.0001, ?s, ?p, ?o) )
}
group by ?p
order by desc ( ?ct )
limit 1000
But it times out anyway, I guess because it still has to group, count and then order. So, how can I do it? I have access to the Virtuoso relational DB (i.e., iSQL), but I cannot find docs about SQL syntax and how to select random triples from the table db.dba.rdf_quad.
EDIT: I've fixed the queries, initially they were wrong, thanks for the comments. The versions above still don't work.
OK, I've found a way, at least a partial one: Virtuoso has a command-line administration tool, isql. It accepts SPARQL queries as well, in the form SPARQL <query>;, and they're executed without timeout or result-size restrictions.
This still doesn't help if you can only access an endpoint via HTTP; I don't know whether it's possible at all in that case.
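For illustration, an isql session for the property-count query might look like this (the port 1111 and dba/dba credentials are Virtuoso's defaults and may differ on your installation):

```
$ isql 1111 dba dba
SQL> SPARQL select ?p ( count ( distinct * ) as ?ct )
     { ?s ?p ?o }
     group by ?p order by desc ( ?ct ) limit 1000 ;
```

Alternatively, since you have iSQL access, the same count can be attempted directly against DB.DBA.RDF_QUAD in Virtuoso SQL; this is a sketch under the assumption that `id_to_iri()` is available to translate the internal predicate IDs back to IRIs:

```sql
SELECT TOP 1000 id_to_iri(P) AS p, COUNT(*) AS ct
FROM DB.DBA.RDF_QUAD
GROUP BY P
ORDER BY ct DESC;
```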

BigQuery returning "query too large" on short query

I've been trying to run this query:
SELECT
created
FROM
TABLE_DATE_RANGE(
program1_insights.insights_,
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-02-09')
)
LIMIT
10
And BigQuery complains that the query is too large.
I've experimented with writing the table names out manually:
SELECT
created
FROM program1_insights.insights_20160101,
program1_insights.insights_20160102,
program1_insights.insights_20160103,
program1_insights.insights_20160104,
program1_insights.insights_20160105,
program1_insights.insights_20160106,
program1_insights.insights_20160107,
program1_insights.insights_20160108,
program1_insights.insights_20160109,
program1_insights.insights_20160110,
program1_insights.insights_20160111,
program1_insights.insights_20160112,
program1_insights.insights_20160113,
program1_insights.insights_20160114,
program1_insights.insights_20160115,
program1_insights.insights_20160116,
program1_insights.insights_20160117,
program1_insights.insights_20160118,
program1_insights.insights_20160119,
program1_insights.insights_20160120,
program1_insights.insights_20160121,
program1_insights.insights_20160122,
program1_insights.insights_20160123,
program1_insights.insights_20160124,
program1_insights.insights_20160125,
program1_insights.insights_20160126,
program1_insights.insights_20160127,
program1_insights.insights_20160128,
program1_insights.insights_20160129,
program1_insights.insights_20160130,
program1_insights.insights_20160131,
program1_insights.insights_20160201,
program1_insights.insights_20160202,
program1_insights.insights_20160203,
program1_insights.insights_20160204,
program1_insights.insights_20160205,
program1_insights.insights_20160206,
program1_insights.insights_20160207,
program1_insights.insights_20160208,
program1_insights.insights_20160209
LIMIT
10
And not surprisingly, BigQuery returns the same error.
This Q&A says that "query too large" means that BigQuery is generating an internal query that's too large to be processed. But in the past, I've run queries over way more than 40 tables with no problem.
My question is: what is it about this query in particular that's causing this error, when other, larger-seeming queries run fine? Is it that doing a single union over this number of tables is not supported?
Answering the question: what is it about this query in particular that's causing this error?
The problem is not in the query itself; the query looks good.
I just ran a similar query against ~400 daily tables with a total of 5.8B (billion) rows and a total size of 5.7TB, and it finished with:
Query complete (150.0s elapsed, 21.7 GB processed)
The query was:
SELECT
Timestamp
FROM
TABLE_DATE_RANGE(
MyEvents.Events_,
TIMESTAMP('2015-01-01'),
TIMESTAMP('2016-02-12')
)
LIMIT
10
So you should look elsewhere for the cause. By the way, are you sure you are not over-simplifying the query in your question?
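If legacy SQL's TABLE_DATE_RANGE keeps hitting internal limits, BigQuery's standard SQL dialect offers wildcard tables as an alternative; a sketch, assuming the same dataset and table naming as in the question:

```sql
-- Standard SQL: one wildcard table instead of a generated union
SELECT created
FROM `program1_insights.insights_*`
WHERE _TABLE_SUFFIX BETWEEN '20160101' AND '20160209'
LIMIT 10
```

The `_TABLE_SUFFIX` pseudo-column restricts the scan to the matching daily tables, so no per-table union needs to be expanded internally.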

Combine SELECT and ASK queries in a same query

Context: 14M triples, Blazegraph workbench. I'm currently trying to design queries that combine SELECT and ASK. More precisely, I want to select results in my graph for which an assumption holds.
For my example, imagine I have many books, each with one author and one editor. I want to select the books by a given author that are linked, through a property path of arbitrary length, to client #1.
In my case, with my data, it takes a long time to run the query directly like this:
SELECT ?id_book
WHERE {?id_book prefix:hasAuthor :author#1.
?id_book prefix:linkedToEditor*/prefix:hasClient :client#1}
ORDER by ?id_book
To reduce the computation time (by a factor of roughly 1000), I'm using a script that runs these queries successively. The script first selects the books whose author is author #1:
SELECT ?id_book
WHERE {?id_book prefix:hasAuthor :author#1}
ORDER by ?id_book
Then, for each result i from 1 to n (id_book#1, id_book#2, ..., id_book#n), I ask whether it's linked to client #1:
ASK {id_book#i prefix:linkedToEditor*/prefix:hasClient :client#1}
The SELECT query followed by the ASK queries is far faster than the first SELECT query, for the same results. I don't want to explore all the possibilities of ?id_book prefix:linkedToEditor*/prefix:hasClient :client#1; I just want to keep the results where the link exists. I tried FILTER EXISTS and two nested SELECT queries, but the query times are similarly long:
SELECT ?id_book
WHERE {?id_book prefix:hasAuthor :author#1.
FILTER EXISTS {?id_book prefix:linkedToEditor*/prefix:hasClient :client#1}
}
ORDER by ?id_book
or
SELECT ?id_book
WHERE {?id_book prefix:linkedToEditor*/prefix:hasClient :client#1.
{SELECT ?id_book
WHERE {?id_book prefix:hasAuthor :author#1.}
}
}
How can I optimise my queries into one query?
It's a bit surprising that there's such a difference in your query times; a SPARQL engine should be able to optimize the query to perform the simple part first, and then evaluate the more complicated property path afterward. The ORDER BY could also add some time, and it's really not important if you're just interested in boolean results.
At any rate, since nested queries are performed innermost first, you can sort of force the "do this first, then do that" by nesting the queries like this:
select ?id_book {
#-- first, get the books by author one
{ select ?id_book { ?id_book prefix:hasAuthor :author#1 } }
#-- then, then check that the book is related to client one
?id_book prefix:linkedToEditor*/prefix:hasClient :client#1
}
order by ?id_book
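Blazegraph in particular also supports query hints that pin the evaluation order, which can achieve the same "do this first" effect without nesting. A sketch using the Blazegraph-specific hint vocabulary (disabling the join-order optimizer so patterns run in written order):

```sparql
PREFIX hint: <http://www.bigdata.com/queryHints#>

select ?id_book {
  hint:Query hint:optimizer "None" .
  #-- evaluated first, as written
  ?id_book prefix:hasAuthor :author#1
  #-- then the property path over the restricted set of books
  ?id_book prefix:linkedToEditor*/prefix:hasClient :client#1
}
order by ?id_book
```

This trades portability for control: the hint is ignored (or rejected) by other SPARQL engines, so the nested-subquery form above is the safer default.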

SPARQL query coming up empty

I'd like to query DBPedia with SPARQL to select car manufacturers (of France to start), but I have no idea how to do this.
I started with this query:
select distinct * where {
?carManufacturer dbpedia-owl:product <http://dbpedia.org/page/Automobile> .
?carManufacturer dbpprop:locationCountry "France"@en
} LIMIT 100
However, it returns an empty result set. What am I doing wrong?
You’re using the wrong URI to identify DBPedia resources. On DBPedia, http://dbpedia.org/resource/Automobile refers to the noun ‘automobile’, while http://dbpedia.org/page/Automobile refers to a page describing the noun ‘automobile’. Thus,
SELECT DISTINCT * WHERE {
?carManufacturer dbpedia-owl:product <http://dbpedia.org/resource/Automobile>.
?carManufacturer dbpprop:locationCountry "France"@en.
} LIMIT 100
works just fine.
However, if you want to be a bit more idiomatic, you can use a bit of syntactic sugar to eliminate the subject repetition in your query. DBPedia also loads http://dbpedia.org/resource/ as prefix dbpedia, so you can actually eliminate all URIs from your query entirely:
SELECT DISTINCT * WHERE {
?carManufacturer dbpedia-owl:product dbpedia:Automobile;
dbpprop:locationCountry "France"@en.
} LIMIT 100
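Note that these queries rely on the DBPedia endpoint predefining the dbpedia-owl, dbpprop, and dbpedia prefixes. Against another endpoint you'd declare them explicitly; a portable version of the same query:

```sparql
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop:     <http://dbpedia.org/property/>
PREFIX dbpedia:     <http://dbpedia.org/resource/>

SELECT DISTINCT * WHERE {
  ?carManufacturer dbpedia-owl:product dbpedia:Automobile ;
                   dbpprop:locationCountry "France"@en .
} LIMIT 100
```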