Slow performance with NOT ANY and ORDER BY in same query - CrateDB

There seems to be a strange performance hit when a query includes both NOT 'some string' = ANY(array_column) and an ORDER BY clause.
The following is a simplified table structure illustrating the behavior, where taggers is an array of UUIDs (v4):
CREATE TABLE IF NOT EXISTS "doc"."test" (
"id" STRING,
"last_active" TIMESTAMP,
"taggers" ARRAY(STRING)
)
The taggers array can grow somewhat large, with hundreds and in some cases thousands of individual strings.
The following queries all perform well and resolve within 0.03 seconds:
SELECT id FROM test ORDER BY last_active DESC LIMIT 10;
SELECT id FROM test WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) LIMIT 10;
SELECT id FROM test WHERE ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) ORDER BY last_active DESC LIMIT 10;
However, combining both parts in one query pushes the runtime to around 2-3 seconds:
SELECT id FROM test WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers)) ORDER BY last_active LIMIT 10;
What's very strange is that the last of the fast queries above is almost exactly the same as the slow one, just without the negation. Negating the ANY on its own is also very fast. It's only when the negated ANY is combined with ORDER BY that things slow down. Any help would be greatly appreciated.

The query with only ORDER BY doesn't apply any filtering, so of course it's much faster.
The query with only the NOT ... ANY() filter and no ORDER BY applies the filter to just enough records to satisfy the LIMIT (10 in this case), then stops.
The last query (NOT ... ANY() filtering plus ORDER BY) is significantly slower because it has to do much more work: it must apply the filter to every record in the table, then sort the matches, and finally return the first 10 (LIMIT).
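If you want to see where the time goes, CrateDB can show the query plan; a minimal sketch, assuming a CrateDB version with EXPLAIN support:
EXPLAIN
SELECT id FROM test
WHERE NOT ('da10187a-408d-4dfc-ae46-857fd23a574a' = ANY(taggers))
ORDER BY last_active DESC
LIMIT 10;
Comparing this plan against the plans of the fast queries should make the extra collect-everything-then-sort work visible.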

Related

Two queries with equal results take vastly different times to execute?

I'm currently using Redshift. I was trying to execute a query to compute a column called id_number (data type INTEGER) from a VARCHAR column called id, to speed up further queries by using id_number instead of id.
Here is the first query I tried:
select rank() over (order by id) id_number, id, sid1, sid2
from table
limit 10000
However, noticing that this query was taking quite some time, I tried the next query:
with A as (
select id, sid1, sid2
from table
limit 10000
)
select rank() over (order by id) id_number, id, sid1, sid2
from A
which was over in a flash.
How is it that the second query took so much less time to execute, when the two queries seem to do exactly the same thing?
If it is because of the position of limit 10000, how does that position contribute to the difference in execution time?
Your two queries are quite different.
The first one has to sort the complete table to get the rank() and then emits the first 10000 rows of the result (with no particular ordering enforced).
The second one selects 10000 rows (without a particular ordering enforced) and then sorts those to calculate rank() on them.
If the table is significantly larger than 10000 rows, it is unsurprising that the first query, which has to sort it all, is much slower.
Look at the EXPLAIN output to understand this better (Redshift supports plain EXPLAIN; the ANALYZE and BUFFERS options are stock PostgreSQL).
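To see it yourself, compare the two plans side by side; a sketch, reusing the placeholder table name from the question:
EXPLAIN
select rank() over (order by id) id_number, id, sid1, sid2
from table
limit 10000;
EXPLAIN
with A as (
select id, sid1, sid2
from table
limit 10000
)
select rank() over (order by id) id_number, id, sid1, sid2
from A;
The first plan should show a sort over the entire table before the limit; the second should show the limit applied first, with the sort running over only 10000 rows.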

efficiently find subset of records as well as total count

I'm writing a function in ColdFusion that returns the first couple of records that match the user's input, as well as the total count of matching records in the entire database. The function will be used to feed an autocomplete, so speed/efficiency are its top concerns. For example, if the function receives input "bl", it might return {sampleMatches:["blue", "blade", "blunt"], totalMatches:5000}
I attempted to do this in a single query for speed purposes, and ended up with something that looked like this:
select record, count(*) over ()
from table
where criteria like :criteria
and rownum <= :desiredCount
The problem with this solution is that count(*) over () always returns the value of :desiredCount. I saw a similar question to mine here, but my app will not have permissions to create a temp table. So is there a way to solve my problem in one query? Is there a better way to solve it? Thanks!
I'm writing this off the top of my head, so you should definitely time it, but I believe that the following CTE
only requires you to write the conditions once
only returns the amount of records you specify
has the correct total count added to each record
and is evaluated only once
SQL Statement
WITH q AS (
SELECT record
FROM table
WHERE criteria like :criteria
)
SELECT q1.*, q2.*
FROM q q1
CROSS JOIN (
SELECT COUNT(*) AS totalMatches FROM q
) q2
WHERE rownum <= :desiredCount
A nested subquery should return the results you want
select record, cnt
from (select record, count(*) over () cnt
from table
where criteria like :criteria)
where rownum <= :desiredCount
This will, however, force Oracle to completely process the query in order to generate the accurate count. That seems unlikely to be what you want for an autocomplete, particularly since Oracle may decide a full table scan is more efficient when :criteria isn't selective enough (say, just 'b%'). Are you really sure you need a completely accurate count of the number of results? Is your table small enough, your system fast enough, and your predicates selective enough for that to be a requirement you can realistically meet? Would it be possible to return a less expensive (but less accurate) estimate of the number of rows? Or to cap the count at something smaller (say, 100) and have the UI display something like "and 100+ more results"?
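A minimal sketch of that capped-count idea, assuming the placeholder names from the question and a cap of 100:
select record, cnt
from (select record, count(*) over () cnt
from (select record
from table
where criteria like :criteria
and rownum <= 101)) -- stop scanning after 101 matching rows
where rownum <= :desiredCount
If cnt comes back as 101, the UI can display "and 100+ more results" instead of an exact total, and Oracle never has to process more than 101 matches.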

MAX vs Top 1 - which is better?

I was reviewing some code and came across a query someone else wrote. I can't think of a reason why my way is better, and it probably isn't, so: which is better/safer/more efficient?
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5 GROUP BY event_id
OR
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date
I would have gone with the 2nd option, but I'm not sure why, and if that's right.
1) When there is a clustered index on the table and the column to be queried, both the MAX() operator and the query SELECT TOP 1 will have almost identical performance.
2) When there is no clustered index on the table and the column to be queried, the MAX() operator offers the better performance.
Reference: http://www.johnsansom.com/performance-comparison-of-select-top-1-verses-max/
Performance is generally similar, if your table is indexed.
Worth considering though: Top usually only makes sense if you're ordering your results (otherwise, top of what?)
Ordering a result requires more processing.
MAX/MIN don't always require ordering. (It depends, but often you don't need ORDER BY or GROUP BY, etc.)
In your two examples, I'd expect speed / x-plan to be very similar. You can always turn to your stats to make sure, but I doubt the difference would be significant.
They are different queries.
The first one returns many records (the biggest a_date for each event_id found within a_primary_key = 5)
The second one returns one record (the smallest a_date found within a_primary_key = 5).
For the queries to have the same result you would need:
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date DESC
The best way to know which is faster is to check the query plan and do your benchmarks. There are many factors that would affect the speed, such as table/heap size, etc. And even different versions of the same database may be optimized to favor one query over the other.
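For example, on SQL Server (implied by the use of TOP), you can benchmark both side by side; a sketch using the question's table and column names:
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = 5;
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = 5 ORDER BY a_date DESC;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
The Messages output will then show elapsed time and logical reads for each statement.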
I ran MAX and TOP against a table with 2,000,000+ records
and found that TOP with ORDER BY gives faster results than the MAX or MIN functions.
So the best way is to execute both of your queries several times each and compare the elapsed time for each.
MAX and TOP function differently. Your first query returns the maximum a_date with a_primary_key = 5 for each distinct event_id found. The second query simply grabs the first a_date with a_primary_key = 5 found in the (ordered) result set.
To add to the otherwise brilliant responses noting that the queries do very different things indeed, I'd like to point out that the results will also be very different if no rows match the criteria in the select.
SELECT MAX() will return one result with a NULL value
SELECT TOP 1 will return zero results
These are very different things.
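A quick way to see the difference, assuming -1 matches no rows in a_table:
SELECT MAX(a_date) FROM a_table WHERE a_primary_key = -1;
-- one row containing NULL
SELECT TOP 1 a_date FROM a_table WHERE a_primary_key = -1 ORDER BY a_date DESC;
-- zero rows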
I ran an experiment: with an aggregate like MIN/MAX the clustered index scan cost was 98%, but with TOP and ORDER BY it dropped to 45%. When querying large datasets, the TOP and ORDER BY combination will be less expensive and give faster results.

Efficient SQL to count an occurrence in the latest X rows

For example I have:
create table a (i int);
Assume there are 10k rows.
I want to count 0's in the last 20 rows.
Something like:
select count(*) from (select i from a limit 20) where i = 0;
Is it possible to make it more efficient? Like a single SQL statement or something?
PS. DB is SQLite3 if that matters at all...
UPDATE
PPS. No need to group by anything in this instance; assume the table is literally one column (plus, presumably, the internal DB row ID or something). I'm just curious whether this is possible to do without the nested selects.
You'll need to order by something in order to determine the last 20 rows. When you say last, do you mean by date, by ID, ...?
Something like this should work:
select count(*)
from (
select i
from a
order by j desc
limit 20
) where i = 0;
If you do not remove rows from the table, you may try the following hacky query:
SELECT COUNT(*) as cnt
FROM A
WHERE
ROWID > (SELECT MAX(ROWID)-20 FROM A)
AND i=0;
It operates with ROWIDs only. As the documentation says: Rows are stored in rowid order.
You need to remember to order by when you use limit, otherwise the result is indeterminate. To get the latest rows added, you need to include a column with the insertion date, then you can use that. Without this column you cannot guarantee that you will get the latest rows.
To make it efficient you should ensure that there is an index on the column you order by, possibly even a clustered index.
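A sketch of that advice, assuming a hypothetical created_at column is added to record insertion time:
CREATE TABLE a (i INT, created_at TEXT);
CREATE INDEX idx_a_created_at ON a(created_at);
SELECT COUNT(*)
FROM (SELECT i FROM a ORDER BY created_at DESC LIMIT 20)
WHERE i = 0;
With the index in place, SQLite can walk the index backwards to find the 20 newest rows instead of sorting the whole table.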
I'm afraid you need a nested select to be able to count and restrict to the last X rows at the same time, because something like this
SELECT count(*) FROM a GROUP BY i HAVING i = 0
will count 0's, but across ALL the table's records, because a LIMIT in this query would basically have no effect.
However, you can optimize by using COUNT(i), as it is faster to count a single field than two or more (in this case your table has two fields: i, plus the rowid that SQLite creates automatically in tables without a primary key).

Are the results deterministic, if I partition SQL SELECT query without ORDER BY?

I have a SQL SELECT query which returns a lot of rows, and I have to split it into several partitions. I.e., set max results to 10000 and iterate over the rows, calling the query several times with an increasing first result (0, 10000, 20000). All the queries are done in the same transaction, and the data my queries fetch does not change during the process (other data in those tables can change, though).
Is it ok to use just plain select:
select a from b where...
Or do I have to use order by with the select:
select a from b where ... order by c
In order to be sure that I get all the rows? In other words, is it guaranteed that a query without order by will always return the rows in the same order?
Adding order by to the query drops performance of the query dramatically.
I'm using Oracle, if that matters.
EDIT: Unfortunately I cannot take advantage of scrollable cursor.
Order is definitely not guaranteed without an order by clause, but whether or not your results will be deterministic (aside from the order) would depend on the where clause. For example, if you have a unique ID column and your where clause included a different filter range each time you access it, then you would have non-ordered deterministic results, i.e.:
select a from b where ID between 1 and 100
select a from b where ID between 101 and 200
select a from b where ID between 201 and 300
would all return distinct result sets, but order would not be any way guaranteed.
No, without order by it is not guaranteed that the query will ALWAYS return the rows in the same order.
No guarantees unless you have an order by on the outermost query.
A quick SQL Server example, but the same rules apply elsewhere: order is not guaranteed even with an ORDER BY in an inner query
SELECT
*
FROM
(
SELECT
*
FROM
Mytable
ORDER BY SomeCol
) foo
Use Limit
So you would do:
SELECT * FROM table ORDER BY id LIMIT 0,100
SELECT * FROM table ORDER BY id LIMIT 100,100
SELECT * FROM table ORDER BY id LIMIT 200,100
The first number in LIMIT is the offset (the position to start from) and the second is how many rows to return.
It's a good pagination trick.
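One caveat: LIMIT offset,count is MySQL/SQLite syntax, and the question mentions Oracle. On Oracle 12c and later the standard equivalent is the OFFSET ... FETCH clause; a sketch using the question's column names:
SELECT a FROM b ORDER BY c OFFSET 0 ROWS FETCH NEXT 10000 ROWS ONLY;
SELECT a FROM b ORDER BY c OFFSET 10000 ROWS FETCH NEXT 10000 ROWS ONLY;
On older Oracle versions the same effect needs a nested ROWNUM wrapper instead.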