Combine SELECT and ASK queries in the same query - SPARQL

Context: 14M triples, Blazegraph workbench. I'm currently trying to design queries that combine SELECT and ASK. More precisely, I want to select results in my graph for which an assumption holds.
For my example, imagine I have many books, each with one author and one editor. I want to select the books by a given author that are linked to client #1 through a property path of arbitrary length.
With my data, running the query directly like this takes a long time:
SELECT ?id_book
WHERE {?id_book prefix:hasAuthor :author#1.
?id_book prefix:linkedToEditor*/prefix:hasClient :client#1}
ORDER by ?id_book
To reduce the computation time (by a factor of roughly 1000), I'm using a script that runs these queries successively. The script first selects the books whose author is author #1:
SELECT ?id_book
WHERE {?id_book prefix:hasAuthor :author#1}
ORDER by ?id_book
Then, for each result i from 1 to n (id_book#1, id_book#2, ..., id_book#n), I ask whether it is linked to client #1:
ASK {id_book#i prefix:linkedToEditor*/prefix:hasClient :client#1}
The SELECT query followed by the ASK queries is far faster than the first SELECT query, for the same results. I don't want to explore all the possibilities of ?id_book prefix:linkedToEditor*/prefix:hasClient :client#1; I just want to keep the results where the link exists. I tried FILTER EXISTS and two nested SELECT queries, but the query times are similarly long:
SELECT ?id_book
WHERE {?id_book prefix:hasAuthor :author#1.
FILTER EXISTS {?id_book prefix:linkedToEditor*/prefix:hasClient :client#1}}
ORDER BY ?id_book
or
SELECT ?id_book
WHERE {?id_book prefix:linkedToEditor*/prefix:hasClient :client#1.
{SELECT ?id_book
WHERE {?id_book prefix:hasAuthor :author#1.}
}
}
How can I combine these into a single, optimised query?

It's a bit surprising that there's such a difference in your query times; a SPARQL engine should probably be able to optimize the query to perform the simple part first, and then evaluate the more complicated property path afterward. The ordering could also add some overhead, and it's really not important if you're just interested in boolean results.
At any rate, since nested queries are performed innermost first, you can sort of force the "do this first, then do that" by nesting the queries like this:
select ?id_book {
#-- first, get the books by author one
{ select ?id_book { ?id_book prefix:hasAuthor :author#1 } }
#-- then check that the book is related to client one
?id_book prefix:linkedToEditor*/prefix:hasClient :client#1
}
order by ?id_book
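If the engine still insists on expanding the whole property path, a variant that more closely mirrors your per-book ASK is to keep the subquery and wrap the path in an EXISTS filter, which succeeds as soon as one path is found. This is only a sketch of the same idea, reusing your prefixes and placeholder IRIs; whether Blazegraph evaluates it as lazily as your scripted ASK loop is something you would have to measure:
select ?id_book {
  #-- first, get the books by author one
  { select ?id_book { ?id_book prefix:hasAuthor :author#1 } }
  #-- then keep only the books for which a path to client one exists
  filter exists { ?id_book prefix:linkedToEditor*/prefix:hasClient :client#1 }
}
order by ?id_book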

Related

Performance of `OFFSET ... LIMIT ...` query in Virtuoso

I'm trying to query dbpedia on a local installation of Virtuoso (a little over a billion triples), and would like to be able to read the entire thing in pages of about 1000 triples at a time. The following query seemed promising:
SELECT *
WHERE {
?s ?o ?p.
}
LIMIT 1000
OFFSET 10000000
until I realized that queries of this type run in time proportional to the OFFSET value.
Looking into the query plan it seems that queries such as this get translated into SQL that looks like this:
SELECT TOP 100000000, 1 __id2in ( "s_7_2_t0"."S") AS "s",
__id2in ( "s_7_2_t0"."P") AS "o",
__ro2sq ( "s_7_2_t0"."O") AS "p"
FROM DB.DBA.RDF_QUAD AS "s_7_2_t0"
OPTION (QUIETCAST)
which confirms my observation.
Is it possible to run such queries in constant time, either in SPARQL or directly in SQL on the underlying table? Since it's all SQL under the hood, I had hoped it would be a straightforward matter of writing the corresponding SQL query, but for some reason the query select * from DB.DBA.RDF_QUAD limit 1; fails with a syntax error, which leaves me more confused than ever.
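The translated plan quoted above uses TOP rather than LIMIT, which suggests the direct SQL attempt fails simply because Virtuoso's SQL dialect expects TOP (e.g. select top 1 * from DB.DBA.RDF_QUAD); treat that as an educated guess rather than a confirmed diagnosis. As for paging without the linear OFFSET cost, the usual workaround is keyset pagination: order by a term and start each page after the last value seen on the previous one. A rough SPARQL sketch of that idea, with <http://example.org/lastSubject> as a placeholder your paging code would substitute, and with no guarantee that Virtuoso can answer it from an index without sorting everything:
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  FILTER (STR(?s) > STR(<http://example.org/lastSubject>))
}
ORDER BY ?s
LIMIT 1000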

PostgreSQL - Query keyword patterns in columns in table

We all know that in SQL we can query a column (let's say, column "breeds") for a certain word like "dog" via a query like this:
select breeds
from myStackOverflowDBTable
where breeds = 'dog'
However, say I had many more columns with much more data, say millions of records, and I did not want to find a specific word but rather the most common keyword pattern or wildcard expression, as in a query like this:
SELECT *
FROM myStackOverflowDBTable
WHERE address LIKE '%alb%'
Is there an efficient way to find these 'patterns' inside the columns using SQL? I need to find the most common substring, so to speak. Per the query above, say the wildcard string "alb" appeared the most in a "location" column containing words like Albany, Albuquerque, Alabama: querying the words directly would yield 0 results, but querying on that wildcard keyword pattern would yield many. I want to find the most frequent wildcard/keyword pattern/regex expression/substring (however you want to define it) for a given column. Is there an easy way to do this without manually running a million test queries?
Well, if you want to find three character patterns, you could extract all 3-character patterns, aggregate and count:
select substr(t.address, gs.i, 3) as ngram_3, count(*)
from t cross join lateral
     -- one starting position per 3-character substring of address
     generate_series(1, length(t.address) - 2, 1) gs(i)
group by ngram_3
order by count(*) desc
limit 100;
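If three-character fragments are really what you're after, the pg_trgm extension already knows how to split text into trigrams, so the same aggregation can lean on it. A sketch under the same assumptions (a table t with an address column); note that show_trgm() lower-cases the text and pads word boundaries with spaces, so its counts will not match the substr() version exactly:
-- requires: CREATE EXTENSION IF NOT EXISTS pg_trgm;
select trigram, count(*)
from t cross join lateral
     unnest(show_trgm(t.address)) as trigram
group by trigram
order by count(*) desc
limit 100;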

SQLite alias (AS) not working in the same query

I'm stuck on an (apparently) extremely trivial task that I can't make work, and I really feel I have no choice but to ask for advice.
I used to deal with PHP/MySQL more than 10 years ago and I might be quite rusty now that I'm dealing with an SQLite DB using Qt5.
Basically I'm selecting some records while wanting to make some math operations on the fetched columns. I recall (and re-read some documentation and examples) that the keyword "AS" is going to conveniently rename (alias) a value.
So for example I have this query, where "X" is an integer number that I render into this big Qt string before executing it with a QSqlQuery. This query lets me select all the electronic components used in a Project and calculate how many of them to order (rounding to the nearest multiple of 5) and the total price per component.
SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category,
Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer,
Inventory.price, UsedItems.qty_used as used_qty,
UsedItems.qty_used*X AS To_Order,
ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5,
Inventory.price*Nearest5 AS TotPrice
FROM Inventory
LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
WHERE UsedItems.pid='1'
ORDER BY RefDes, value ASC
So, for example, I aliased UsedItems.qty_used as used_qty. At first I tried to use it in the next field, multiplying it by X, writing "used_qty*X AS To_Order"... the query failed. Well, no worries: I just put back the original table.field name and it worked.
Going further, I have a complex calculation and I want to use its result in the next field, but the same issue popped up: if I alias the ROUND(...) expression AS Nearest5 and then try to use this value by multiplying it in the next field, the query fails.
Please note: the query WORKS, but ONLY if I don't use aliases in the following fields, namely if I don't use the alias Nearest5 in the TotPrice field. I just want to avoid re-writing the whole ROUND(...) thing for the TotPrice field.
What am I missing/doing wrong? Either SQLite does not support aliases within the same query, or I am using the wrong syntax and am just too stuck/confused to see the mistake (which I'm sure must be something really stupid).
Column aliases defined in a SELECT cannot be used:
in other expressions in the same SELECT,
for filtering in the WHERE clause, or
in join conditions in the FROM clause.
Many databases also restrict their use in GROUP BY and HAVING.
All databases support them in ORDER BY.
This is how SQL works. The issue comes down to two things:
The logical order in which the clauses of the query are processed (i.e. how they are compiled). This determines the scope in which identifiers are resolved.
The order in which expressions within the SELECT are evaluated. This is indeterminate; there is no requirement on their ordering.
For a simple example, what should x refer to in the following query?
select x as a, y as x
from t
where x = 2;
By not allowing the alias to be used there, SQL engines do not have to make a choice. The value is always t.x.
You can try nested queries.
A SELECT query can be nested in another SELECT query within the FROM clause;
multiple queries can be nested, for example following this pattern:
SELECT *, [your last Expression] AS LastExp
FROM (SELECT *, [your Middle Expression] AS MidExp
      FROM (SELECT *, [your first Expression] AS FirstExp
            FROM yourTables));
Obviously, the ordering matters: expressions defined in the innermost SELECT can be used by all the outer queries, while expressions defined at an intermediate level are only visible to the queries that wrap that level.
For your case, your query may be:
SELECT *, PRC*Nearest5 AS TotPrice
FROM (SELECT *, ROUND((used_qty*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5
      FROM (SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name,
                   Inventory.category, Inventory.type, Inventory.package, Inventory.value,
                   Inventory.manufacturer, Inventory.price AS PRC,
                   UsedItems.qty_used AS used_qty,
                   UsedItems.qty_used*X AS To_Order
            FROM Inventory
            LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
            WHERE UsedItems.pid='1'
            ORDER BY RefDes, value ASC))
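If you'd rather not count parentheses, the same layering can be written with a WITH clause (a common table expression), which SQLite supports. This is only a sketch along the lines of the original query; X is still the integer spliced into the string from Qt, and the column list is abridged to the parts involved in the calculation:
WITH base AS (
    SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.value,
           Inventory.price,
           UsedItems.qty_used AS used_qty,
           UsedItems.qty_used*X AS To_Order,
           ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5
    FROM Inventory
    LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
    WHERE UsedItems.pid='1'
)
SELECT base.*, price*Nearest5 AS TotPrice
FROM base
ORDER BY RefDes, value ASC;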

Wikidata: an effective way to count items that share two properties

I would like to count the number of Wikidata items that have two properties at the same time. For example, a VIAF ID and a BnF ID, or a LoC ID and a SUDOC ID. The first way that comes to my mind would be a query like this:
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
?item wdt:P214 ?viaf.
?item wdt:P268 ?bnf.
}
But this query is inefficient (23 seconds), and applying it to 10 properties would require 90 pairwise comparisons. Is there a more efficient way to perform these calculations?
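One way to at least fold all the pairwise checks into a single request is to put the properties in VALUES clauses and group by the pair. Sketched below for the two example properties; extending both VALUES lists to the 10 properties yields every pair in one query, though each pair still has to be joined, so this does not by itself make any individual count cheaper:
SELECT ?p1 ?p2 (COUNT(DISTINCT ?item) AS ?count) WHERE {
  VALUES ?p1 { wdt:P214 wdt:P268 }
  VALUES ?p2 { wdt:P214 wdt:P268 }
  FILTER(STR(?p1) < STR(?p2))
  ?item ?p1 ?v1 .
  ?item ?p2 ?v2 .
}
GROUP BY ?p1 ?p2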

Vague count in sql select statements

I guess this has been asked on the site before, but I can't find it.
I've seen on some sites that there is a vague count over the results of a search. For example, here on Stack Overflow, when you search for a question, it sometimes says 5000+ results; in Gmail, when you search by keywords, it says "hundreds"; and Google says approx. X results. Is this just a way to show the user an easy-to-understand huge number, or is it actually a fast way of counting results that can be used in a database [I'm learning Oracle at the moment, 10g version]? Something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do exactly what you are asking for in SQL: COUNT does not have an option for counting up to some limit and then stopping.
I also would not assume this is coming from SQL in either Gmail or Stack Overflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
  SELECT * FROM (
    SELECT inner.*, ROWNUM AS RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) AS TOTAL_ROWS
    FROM (
      [... your own, sorted search query ...]
    ) inner
  )
  WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always apply an absolute upper limit, such as 5000, to your query using a ROWNUM <= 5000 filter, and then just indicate that there are 5000+ results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
The vague count is a quickly displayed estimate based on a first buffer of results; if the user wants to see more results, they can request more.
It's a performance feature: after displaying the first results, sites like Google keep searching for more in the background.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If the result contains 1001 rows, then the total number of records is greater than 1000.
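A sketch of how you might consume that from a single query, reusing the same your_tables and your_condition placeholders: wrap the limited subquery in a COUNT, so a result of 1001 simply means "more than 1000":
SELECT COUNT(*) AS capped_count
FROM (SELECT NULL
      FROM your_tables
      WHERE your_condition
      AND ROWNUM <= 1001)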
This question gives some pretty good information.
When you run an SQL query you can set, for example,
LIMIT 0, 100
and you will only get the first hundred results, so you can then tell your viewer that there are 100+ answers to their request.
As for Google, I couldn't say whether they really know there are more than 27'000'000'000 answers to a request, but I believe they do. There are some standard requests whose results are stored and updated in the background.