SQLite chooses wrong query plan - sql

Consider the following example:
DROP TABLE IF EXISTS t1;
CREATE TABLE t1(a INTEGER PRIMARY KEY, b) WITHOUT ROWID;
WITH RECURSIVE
cnt(x) AS (VALUES(1000) UNION ALL SELECT x+1 FROM cnt WHERE x<2000)
INSERT INTO t1(a,b) SELECT x, x FROM cnt;
CREATE INDEX t1b ON t1(b);
These statements create a WITHOUT ROWID table and insert the rows (x, x) for
1000 <= x <= 2000. To help the query planner, let's run ANALYZE.
ANALYZE;
EXPLAIN QUERY PLAN
SELECT * FROM t1 WHERE b BETWEEN 500 AND 2500;
EXPLAIN QUERY PLAN
SELECT * FROM t1 WHERE b BETWEEN 2900 AND 3000;
The output in both cases is:
0|0|0|SEARCH TABLE t1 USING COVERING INDEX t1b (b>? AND b<?)
However, it makes no sense to use the index for the first query: we have to iterate through the whole table anyway, so an ordinary SCAN TABLE seems more efficient. This is exactly how tables with a rowid behave:
DROP TABLE IF EXISTS t1;
CREATE TABLE t1(a, b);
WITH RECURSIVE
cnt(x) AS (VALUES(1000) UNION ALL SELECT x+1 FROM cnt WHERE x<2000)
INSERT INTO t1(a,b) SELECT x, x FROM cnt;
CREATE INDEX t1a ON t1(a);
ANALYZE;
EXPLAIN QUERY PLAN
SELECT * FROM t1 WHERE a BETWEEN 500 AND 2500;
EXPLAIN QUERY PLAN
SELECT * FROM t1 WHERE a BETWEEN 2900 AND 3000;
In this case the output is:
0|0|0|SCAN TABLE t1
and
0|0|0|SEARCH TABLE t1 USING INDEX t1a (a>? AND a<?)
So, could anybody explain how the query planner optimizes queries for WITHOUT ROWID tables?

The output in both cases is:
0|0|0|SEARCH TABLE t1 USING COVERING INDEX t1b (b>? AND b<?)
However, there is no sense to use index (for
the first query) for the reason that anyway we have to iterate through
whole table, so ordinary SCAN TABLE seems to be more efficient.
You missed the COVERING INDEX part: it means the query is answered from the index alone, without accessing the table at all.
You are right that a regular index access (without "COVERING") might be slower than a full table scan if all rows are needed, but this is not the case for an index-only scan.
Read more about index-only scans here: http://use-the-index-luke.com/sql/clustering/index-only-scan-covering-index
EDIT
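WITHOUT ROWID tables in SQLite are what other databases call clustered indexes: the table itself is stored as a B-tree keyed on the PRIMARY KEY, and every secondary index stores the PRIMARY KEY columns next to its indexed columns. The index t1b therefore holds both b and a, which is every column of t1, so there is no need to visit the table, even if you select all columns (as in select *).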
Read more about clustered indexes here: http://use-the-index-luke.com/sql/clustering/index-organized-clustered-index
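To see where the covering behaviour ends, here is a minimal sketch. The table t2, its extra column c and the index t2b are hypothetical names, not part of the question. Comparing the two plans should show whether the planner still treats the index as covering once a column is requested that the index does not store; the exact plan text depends on the SQLite version.

```sql
-- Hypothetical table t2: like t1, plus one extra column c that t2b does not index.
DROP TABLE IF EXISTS t2;
CREATE TABLE t2(a INTEGER PRIMARY KEY, b, c) WITHOUT ROWID;
WITH RECURSIVE
  cnt(x) AS (VALUES(1000) UNION ALL SELECT x+1 FROM cnt WHERE x<2000)
INSERT INTO t2(a,b,c) SELECT x, x, x FROM cnt;
CREATE INDEX t2b ON t2(b);
ANALYZE;

-- Needs only a and b: both are stored in t2b (b is indexed, a is the primary key).
EXPLAIN QUERY PLAN
SELECT a, b FROM t2 WHERE b BETWEEN 500 AND 2500;

-- Needs c as well, which t2b does not hold, so each match would need a primary-key lookup.
EXPLAIN QUERY PLAN
SELECT * FROM t2 WHERE b BETWEEN 500 AND 2500;
```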

Related

Difference between SQL query and use of index on Datetime column

I have a simple table with an index on a DateTime column.
Can someone explain to me which of these two queries will use the index?
CREATE TABLE exams
(
name VARCHAR(50),
grade INT,
date DATETIME
);
CREATE INDEX date_idx ON exams(date);
SELECT *
FROM exams
WHERE date = '2018-01-01'; -- doesn't use index?
SELECT *
FROM exams
WHERE MONTH(date) = 1; -- uses index?
The SQL engine can resolve this in different ways. Let's insert some sample data into your table:
DROP TABLE if exists exams;
CREATE TABLE exams(
name varchar(50),
grade INT,
date datetime
);
INSERT exams
SELECT TOP (5000) CONCAT('name', row_number() over(order by t1.number))
,6
,'2019-07-01'
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
INSERT exams
SELECT TOP (5) CONCAT('name', row_number() over(order by t1.number))
,6
,'2019-07-02'
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CREATE INDEX date_idx ON exams(date);
As you can see, we have inserted:
5 000 rows for date 2019-07-01
5 rows for date 2019-07-02
Let's execute the following queries, now:
SELECT * FROM exams WHERE date= '2019-07-01';
SELECT * FROM exams WHERE date= '2019-07-02';
SELECT * FROM exams WHERE MONTH(date)=1;
and check the execution plans:
In the first query, the engine knows (because of the statistics) that almost all of the data is going to be read, so it performs a table scan on your heap.
In the second query, the engine sees that only a few records are going to be returned, so there is no need to read all the data: it uses the index and performs an index seek.
In the last case, the index can't be used, because the query is not sargable.
So, the engine decides whether to use an index, and whether to perform a seek or a scan. The only thing you can do is make sure your indexes are covering, your statistics are up to date and your queries are sargable.
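As a sketch of what "sargable" means here: a filter on a single month can be written as a half-open range on the bare column instead of wrapping the column in MONTH(). This assumes only July 2019 is wanted. MONTH(date) = 7 would also match July of any other year, so the two queries are not equivalent in general.

```sql
-- Not sargable: MONTH() has to be evaluated for every row.
SELECT * FROM exams WHERE MONTH(date) = 7;

-- Sargable: a half-open range on the bare column (assumption: only July 2019 is wanted).
SELECT *
FROM exams
WHERE date >= '2019-07-01'
  AND date <  '2019-08-01';
```

Even with the sargable form, the engine may still choose a scan when the statistics say most rows match, as with the sample data above.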
Well, I believe the truth is the reverse:
The first query uses the index, while the second DOES NOT.
Why doesn't the second query use the index? Because the indexed column is wrapped in a function, which prevents SQL Server from using the index.
An index can be thought of as a way of storing records in sorted order. Applying a function to the indexed column may change that order, so the index may no longer be usable for the lookup.

Efficiently selecting distinct (a, b) from big table

I have a table with around 54 million rows in a Postgres 9.6 DB and would like to find all distinct pairs of two columns (there are around 4 million such values). I have an index over the two columns of interest:
create index ab_index on tbl (a, b)
What is the most efficient way to get such pairs? I have tried:
select a,b
from tbl
where a>$previouslargesta
group by a,b
order by a,b
limit 1000
And also:
select distinct(a,b)
from tbl
where a>previouslargesta
order by a,b
limit 1000
Also this recursive query:
with recursive t AS (
select min(a) AS a from tbl
union all
select (select min(a) from tickets where a > t.a)
FROM t)
select a FROM t
But all are slooooooow.
Is there a faster way to get this information?
Your table has 54 million rows and ...
there are around 4 million such values
7.4 % of all rows is a high percentage. An index can mostly only help by providing pre-sorted data, ideally in an index-only scan. There are more sophisticated techniques for smaller result sets, and much faster ways for paging that return far fewer rows at a time (see below for both), but for the general case a plain DISTINCT may be among the fastest:
SELECT DISTINCT a, b -- *no* parentheses
FROM tbl;
-- ORDER BY a, b -- ORDER BY was not mentioned as a requirement ...
Don't confuse it with DISTINCT ON, which would require parentheses. See:
Select first row in each GROUP BY group?
The B-tree index ab_index you have on (a, b) is already the best index for this. It has to be scanned in its entirety, though. The challenge is to have enough work_mem to process all in RAM. With standard settings it occupies at least 1831 MB on disk, typically more with some bloat. If you can afford it, run the query with a work_mem setting of 2 GB (or more) in your session. See:
Configuration parameter work_mem in PostgreSQL on Linux
SET work_mem = '2 GB';
SELECT DISTINCT a, b ...
RESET work_mem;
A read-only table helps. Otherwise you need aggressive enough VACUUM settings to allow an index-only scan. More RAM (with appropriate settings) would also help to keep the index cached.
Also upgrade to the latest version of Postgres (11.3 as of writing). There have been many improvements for big data.
Paging
If you want to add paging as indicated by your sample query, urgently consider ROW value comparison. See:
Optimize query with OFFSET on large table
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
SELECT DISTINCT a, b
FROM tbl
WHERE (a, b) > ($previous_a, $previous_b) -- !!!
ORDER BY a, b
LIMIT 1000;
Recursive CTE
This also may or may not be faster for the general big query as well. For the small subset, it becomes much more attractive:
WITH RECURSIVE cte AS (
( -- parentheses required due to LIMIT 1
SELECT a, b
FROM tbl
WHERE (a, b) > ($previous_a, $previous_b) -- !!!
ORDER BY a, b
LIMIT 1
)
UNION ALL
SELECT x.a, x.b
FROM cte c
CROSS JOIN LATERAL (
SELECT t.a, t.b
FROM tbl t
WHERE (t.a, t.b) > (c.a, c.b) -- lateral reference
ORDER BY t.a, t.b
LIMIT 1
) x
)
TABLE cte
LIMIT 1000;
This can make perfect use of your index and should be as fast as it gets.
Further reading:
Optimize GROUP BY query to retrieve latest row per user
For repeated use and no or little write load on the table, consider a MATERIALIZED VIEW, based on one of the above queries - for much faster read performance.
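A minimal sketch of the materialized-view approach, based on the plain DISTINCT query. The names tbl_ab_distinct and tbl_ab_distinct_idx are hypothetical, and the sketch assumes a and b are NOT NULL.

```sql
-- tbl_ab_distinct and tbl_ab_distinct_idx are hypothetical names.
CREATE MATERIALIZED VIEW tbl_ab_distinct AS
SELECT DISTINCT a, b
FROM   tbl;

-- A unique index is required for REFRESH ... CONCURRENTLY; it also serves paging on (a, b).
CREATE UNIQUE INDEX tbl_ab_distinct_idx ON tbl_ab_distinct (a, b);

-- Re-run after writes to tbl; CONCURRENTLY keeps the view readable while it is rebuilt.
REFRESH MATERIALIZED VIEW CONCURRENTLY tbl_ab_distinct;
```

Reads then go against tbl_ab_distinct, which holds about 4 million rows instead of 54 million. The view is only as fresh as its last refresh.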
I cannot vouch for the performance on Postgres, but this is a technique I have used on SQL Server in a similar case, where it proved faster than the alternatives:
get distinct A into a temp a
get distinct B into a temp b
cross a and b temps to Cartesian into a temp abALL
rank the abALL (optionally)
create a view myview as select top 1 a,b from tbl (your_main_table)
join temp abALL with myview into a temp abCLEAN
rank abCLEAN here if you haven't ranked above

Which Select statement is faster 1)cluster 2)nonClustered 3)with no indexes

In Sybase 12.5
there are 3 tables, indexed as follows:
employee_1 table with clustered index on id column
employee_2 table with Non-clustered index on id column
employee_3 table with No index on id column
Which select will be faster, and why?
select * from employee_1
select * from employee_2
select * from employee_3
I think the 3rd select should be faster, as it is not using any keys and the server scans the data pages directly instead of going through index pages. Please let me know. Thanks.
As long as you are not restricting your query by id, there is no difference.
But if you want to look up ids and you have a huge number of records, then a clustered index would help.

Query with rownum got slow

I have a table with 21 million records, of which 20 million meet the criterion col1 = 'text'. I then started iteratively setting col2 to a non-NULL value. After I had updated 10 million records, the following query, which was fast in the beginning, became slow:
SELECT T_PK
FROM (SELECT T_PK
FROM table
WHERE col1= 'text' AND col2 IS NULL
ORDER BY T_PK DESC)
WHERE ROWNUM < 100;
I noticed that the query is fast again (a couple of seconds, < 10s) as soon as I remove the DESC, the whole ORDER BY T_PK DESC clause, or the whole outer query with the condition WHERE ROWNUM < 100.
The execution plan shows an index full scan descending on the PK index of the table. Besides the index on the PK, I have an index defined on col2.
What could be the reason that the query was fast and then got slow? How can I make the query fast regardless of how many records are already set to non-null value?
For this query:
SELECT T_PK
FROM (SELECT T_PK
FROM table
WHERE col1= 'text' AND col2 IS NULL
ORDER BY T_PK DESC
) t
WHERE ROWNUM < 100;
The optimal index is table(col1, col2, t_pk).
I think the problem is that the optimizer has a choice of two indexes -- either for the where clause (col1 and -- probably -- col2) or one on t_pk. If you have a single index that handles both clauses, then performance should improve.
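A minimal sketch of that index. Here my_table stands in for the question's table name, and my_table_c1_c2_pk is a hypothetical index name.

```sql
-- my_table and my_table_c1_c2_pk are placeholder names.
-- Equality on col1 first, then col2 (IS NULL), then t_pk for the ORDER BY.
CREATE INDEX my_table_c1_c2_pk ON my_table (col1, col2, t_pk);
```

Oracle leaves a row out of a B-tree index only when all indexed columns are NULL. Because t_pk is the primary key and never NULL, rows with col2 IS NULL are still present in this composite index.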
One reason that the DESC might make a difference is where the matching rows lie. If all the matching rows are in the first 100,000 rows of the table, then when you order descending, the query might have to throw out 20.9 million rows before finding a match.
I think Burleson explained this quite nicely:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
Beware!
This use of rownum< can cause performance problems. Using rownum may change the all_rows optimizer mode for a query to first_rows, causing unexpected sub-optimal execution plans. One solution is to always include an all_rows hint when using rownum to perform a top-n query.

Query using Rownum and order by clause does not use the index

I am using Oracle (Enterprise Edition 10g) and I have a query like this:
SELECT * FROM (
SELECT * FROM MyTable
ORDER BY MyColumn
) WHERE rownum <= 10;
MyColumn is indexed, however, Oracle is for some reason doing a full table scan before it cuts the first 10 rows. So for a table with 4 million records the above takes around 15 seconds.
Now consider this equivalent query:
SELECT MyTable.*
FROM
(SELECT rid
FROM
(SELECT rowid as rid
FROM MyTable
ORDER BY MyColumn
)
WHERE rownum <= 10
)
INNER JOIN MyTable
ON MyTable.rowid = rid
ORDER BY MyColumn;
Here Oracle scans the index and finds the top 10 rowids, and then uses nested loops to find the 10 records by rowid. This takes less than a second for a 4 million table.
My first question is why is the optimizer taking such an apparently bad decision for the first query above?
And my second and most important question is: is it possible to make the first query perform better? I have a specific need to use the first query as unmodified as possible. I am looking for something simpler than my second query above. Thank you!
Please note that for particular reasons I am unable to use the /*+ FIRST_ROWS(n) */ hint, or the ROW_NUMBER() OVER (ORDER BY column) construct.
If this is acceptable in your case, adding a WHERE ... IS NOT NULL clause will help the optimizer to use the index instead of doing a full table scan when using an ORDER BY clause:
SELECT * FROM (
SELECT * FROM MyTable
WHERE MyColumn IS NOT NULL
-- ^^^^^^^^^^^^^^^^^^^^
ORDER BY MyColumn
) WHERE rownum <= 10;
The rationale is that Oracle does not store NULL values in the index. As your query was originally written, the optimizer chose a full table scan because, if there were fewer than 10 non-NULL values, it would have to retrieve some "NULL rows" to fill in the remaining rows. Apparently it is not smart enough to check first whether the index contains enough rows...
With the added WHERE MyColumn IS NOT NULL, you tell the optimizer that you never want a row having NULL in MyColumn. So it can blindly use the index without worrying about hypothetical rows having NULL in MyColumn.
For the same reason, declaring the ORDER BY column as NOT NULL should prevent the optimizer from doing a full table scan. So, if you can change the schema, a cleaner option would be:
ALTER TABLE MyTable MODIFY (MyColumn NOT NULL);
See http://sqlfiddle.com/#!4/e3616/1 for various comparisons (click on view execution plan)
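To check locally whether either change has the intended effect, the plan can be inspected with EXPLAIN PLAN and DBMS_XPLAN. This is only a sketch; the exact plan lines depend on your data and statistics.

```sql
-- Store the plan for the rewritten query without executing it.
EXPLAIN PLAN FOR
SELECT * FROM (
  SELECT * FROM MyTable
  WHERE MyColumn IS NOT NULL
  ORDER BY MyColumn
) WHERE rownum <= 10;

-- Display the plan that was just stored.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

An index scan under a COUNT STOPKEY step, rather than a TABLE ACCESS FULL followed by a sort, would suggest that the optimizer is now reading the first rows straight from the index.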