SQLite `explain query plan` does not show every step?

This is some pseudo SQL in which the 'problem' is easily replicated:
create table Child (
childId text primary key,
some_int int not null
);
create table Person (
personId text primary key,
childId text,
foreign key (childId) references Child (childId) on delete cascade
);
create index Person_childId on Person (childId);
explain query plan select count(1)
from Person
left outer join Child on Child.childId = Person.childId
where Person.childId is null or Child.some_int = 0;
The result of the query plan is this:
SCAN Person USING COVERING INDEX Person_childId
SEARCH Child USING INDEX sqlite_autoindex_Child_1 (childId=?) LEFT-JOIN
This looks great, right? But I am curious whether this is the 'full' plan, because some_int does not have an index. The query plan does not reveal this: I don't see the filtering anywhere. The database must filter on this field, right?
When I filter on the some_int field in a separate query, it shows a SCAN, exactly as I thought I would see in the previous query plan, because there is no index:
explain query plan select * from Child where some_int = 0;
Gives:
SCAN Child
Now my questions:
Why isn't there SCAN Child shown in the first query plan?
Why is there a SCAN on Person and not a SEARCH?
Is the first query plan 'quick' or do I still need to add an index?

You should take a look at this page. It explains the sqlite query planner in depth and you can find answers to all your questions.
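Note that a filtering condition like WHERE some_int=0 is not displayed in the query plan when it cannot be used to choose an access path: here it is simply evaluated against each row the plan already visits, so it affects only the result set, not the plan.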
In brief:
Why isn't there SCAN Child shown in the first query plan?
Because, due to the LEFT JOIN, sqlite needs to SCAN Person and, for every row of Person, use the index on ChildId to find the corresponding records in Child.
Why is there a SCAN on Person and not a SEARCH?
A SCAN means reading all rows of a table, in the order in which they are stored. A SEARCH is a lookup of a single value in the table, using an index to find the rowid and then using the rowid to get to that row of the table, without the need to scan the whole table to find the row.
Since your query needs to read all Person.childId, it does a full SCAN.
Is the first query plan 'quick' or do I still need to add an index?
Your query is already using all the indexes it could use, so it's already as fast as you could get it.
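To see this for yourself, here is a minimal sketch that replays the schema and query from the question through Python's built-in sqlite3 module and prints the plan. The helper name plan_details is my own; the exact wording of the plan lines varies between SQLite versions (older ones print "SCAN TABLE Person ..."), so treat the output as illustrative.

```python
import sqlite3

# Schema and query copied from the question.
SCHEMA = """
create table Child (
  childId text primary key,
  some_int int not null
);
create table Person (
  personId text primary key,
  childId text,
  foreign key (childId) references Child (childId) on delete cascade
);
create index Person_childId on Person (childId);
"""

QUERY = """
select count(1)
from Person
left outer join Child on Child.childId = Person.childId
where Person.childId is null or Child.some_int = 0
"""

def plan_details(conn, sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN (hypothetical helper)."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return [row[3] for row in rows]  # the 4th column holds the description

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
details = plan_details(conn, QUERY)
for line in details:
    print(line)
```

The plan lists one access step per table (a SCAN of Person, a SEARCH of Child); the some_int comparison is applied to the rows those steps produce and gets no line of its own.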

Related

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
id BIGSERIAL PRIMARY KEY,
joincol VARCHAR
);
CREATE TABLE test2 (
joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
SELECT test1.id,
test1.joincol AS t1charcol,
test2.joincol AS t2charcol
FROM test1, test2
WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100ms. As far as I understand the execution plan, the runtime is independent of the rowcount, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a join on a row is possible and immediately returns.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on rowcount), since the two tables are "joined completely", before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
In my real environment, test1 contains only a handful of rows (< 100), with unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use a Index Scan Backward on row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
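The difference between the three queries can be sketched without a database. The following is a small Python simulation of the semantics described above, not Postgres itself: None stands in for NULL, and the function names pg_max and order_desc_limit_1 are made up for illustration.

```python
from typing import List, Optional

def pg_max(values: List[Optional[int]]) -> Optional[int]:
    """max() semantics: NULLs (None) are ignored; all-NULL input yields NULL."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None

def order_desc_limit_1(values: List[Optional[int]], nulls_last: bool = False) -> Optional[int]:
    """ORDER BY ... DESC [NULLS LAST] LIMIT 1; DESC defaults to NULLS FIRST."""
    if not values:
        return None
    null_rank = 1 if nulls_last else 0  # position of NULLs relative to non-NULLs

    def sort_key(v: Optional[int]):
        if v is None:
            return (null_rank, 0)
        return (1 - null_rank, -v)  # negate for descending order

    return sorted(values, key=sort_key)[0]

ids = [3, None, 7, 1]
print(pg_max(ids))                               # 7
print(order_desc_limit_1(ids))                   # None (a NULL sorts first)
print(order_desc_limit_1(ids, nulls_last=True))  # 7
```

Only the NULLS LAST variant agrees with max() once a NULL is present, which is why the planner cannot blindly swap one form for the other.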
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly. In that case, regardless of index sort order, an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from the actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has 10M rows in table1 and < 100 rows in table2, every value in table1.joincol has a match in table2.joincol, both are defined NOT NULL, and table2.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on table2.joincol and a FK constraint t1.joincol --> t2.joincol.
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem and a good test case.
I tested it in Postgres 9.3; perhaps 13 handles it faster.
I used Occam's razor and excluded some possibilities:
The view (the query is slow without the view, too)
The JOIN filtering out some rows (unfortunately it does not in your test, but with a longer md5 length of 5-6 it does)
Other basically equivalent SELECT statements (inner query or EXISTS) do not solve your problem
I managed to get a plan that uses only the indexes, but because the tables are not bigger than the indexes, that was not the solution.
I think
CREATE INDEX on "test" ("id");
is useless, because of the PK!
If you change this
CREATE INDEX on "test" ("joincol");
to this
CREATE INDEX ON TEST (joincol, id);
Then the second query uses only indexes.
After you run this
REINDEX table test;
REINDEX table test2;
VACUUM ANALYZE test;
VACUUM ANALYZE test2;
you can gain some performance, because you created the indexes before the inserts.
I think the reason is that the database has two optimization goals.
The first goal is to optimize for returning just a few rows, so it runs a Nested Loop. You can force this with LIMIT x.
The second goal is to optimize for the whole table: run the query fast over all rows.
In this situation the Postgres optimizer did not notice that a simple MAX can run with a Nested Loop. Or perhaps Postgres cannot apply a limit inside an aggregate (the aggregate runs over the whole result of the filtered select).
And this is very expensive. But you can also write other aggregates there, like SUM, MIN, AVG, etc.
Perhaps window functions can help you too.

Optimize the Clustered Index Scan into Clustered Index Seek

Here is the scenario: I have a table with 40 columns and I have to select all data from the table (including all columns). I have created a clustered index on the table, and the plan includes a Clustered Index Scan when fetching the full data set from the table.
I know that without any filter or join key, SQL Server will choose a Clustered Index Scan instead of a Clustered Index Seek. But I want to optimize the execution plan by turning the Clustered Index Scan into a Clustered Index Seek. Is there any way to achieve this? Please share.
Below is the screenshot of the execution plan:
Something is not quite right in the question / request, because what you are asking for will perform badly. I suspect it comes from mis-understanding what a clustered index is.
The clustered index - which is perhaps better described as a clustered table - is the table of data; it's not separate from the table, it is the table. If the order of the data in the table is already based on ITEM ID, then the scan is the most efficient access method for your query (especially given the select *). You do not want to seek in this scenario at all, and I don't believe that it is your scenario, due to the sort operator.
If the clustered table is ordered based on another field, then you would need an additional non-clustered index to provide the correct order. You would then try to force a plan which was a non-clustered index scan, nested loop to a clustered index seek. That can be achieved using query hints, most likely an INNER LOOP JOIN would cause the seek - but a FORCESEEK also exists which can be used.
Performance-wise, this second option is never going to win; you are in effect looking at a tipping point notion (https://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/).
Well, I was trying to achieve the same, I wanted an index seek instead of an index scan on my top query.
SELECT TOP 5 id FROM mytable
Here is the execution plan being shown for the query:
I even tried the Offset Fetch Next approach, the plan was same.
To avoid an index scan, I included a fake primary key filter like below:
SELECT TOP 5 id FROM mytable where id != 0
I know, I won't have a 0 value in my primary key, so I added it in top query, which was resolved to an index seek instead of index scan:
However, the query plan comparison shows a similar operation cost for the index seek and the index scan here. I also think that achieving an index seek this way adds an extra operation for the database, because it has to check whether the id is 0 or not, which is entirely unnecessary if we only want the top few records.

Database Sql Query

I want to extract the record of a person from a table (employ), so I wrote this query:
SELECT *
FROM employ
WHERE employ_Id=some_specific_id
Now my question is what this query does first: does it go to the table (employ), select all the records and then apply the filter, or does it go to the table (employ) and find just the record of the employee with the specific id given after the WHERE clause?
1) Table records are mostly stored in order of the primary key (known as a clustered index). So, when you use the primary key as the WHERE condition, the RDBMS doesn't need to scan the table (all records).
2) For columns other than the primary key, the RDBMS checks whether an index exists on the table and whether it can be used for your WHERE condition, so that it can avoid a full table scan.
3) If none of the above is possible, then a full table scan is performed.
When executing a query, it will look through ALL ROWS to see if they match your condition. This is why the more data you have, the longer the query will take.
If your condition is on an indexed column, as I believe is the case in your query (assuming employ_Id is the primary key of that table), then the search will only touch that sorted index, which is much faster because not all the rows need to be checked.
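The difference between a key lookup and a full scan is easy to observe. Below is a rough sketch using Python's built-in sqlite3 module as a stand-in (the question does not say which database is in use, so other engines will word their plans differently). The name and city columns and the plan helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# employ_Id comes from the question; name and city are hypothetical columns.
conn.execute(
    "create table employ (employ_Id integer primary key, name text, city text)"
)

def plan(sql):
    """Return the description column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("explain query plan " + sql)]

# Filter on the primary key: the engine can jump straight to the row.
by_key = plan("select * from employ where employ_Id = 42")
# Filter on a column without an index: every row has to be checked.
by_city = plan("select * from employ where city = 'Oslo'")

print(by_key)
print(by_city)
```

With SQLite, the first plan reports a SEARCH using the primary key, while the second reports a SCAN of the whole table.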
1-> At first, control checks for the table in the user_tab data dictionary.
2-> Then it checks that the column exists in the table; if the column exists, it checks the WHERE condition.
3-> The condition may or may not be true; control then goes on to select the columns.

oracle execution plan, trying to understand

EXPLAIN PLAN FOR
SELECT sightings.sighting_id, spotters.spotter_name,
sightings.sighting_date
FROM sightings
INNER JOIN spotters
ON sightings.spotter_id = spotters.spotter_id
WHERE sightings.spotter_id = 1255;
SELECT plan_table_output
FROM table(dbms_xplan.display('plan_table',null,'basic'));
Id  Operation                      Name
 0  select statement
 1  nested loops
 2  table access by index rowid    spotters
 3  index unique scan              pk_spotter_ID
 4  table access full              sightings
I'm trying to understand what exactly is going on here. Does this sound right?
First the select statement is evaluated and attributes not in the select list are ignored for the output
Nested loop then computes the inner join on spotters.spotter_id = sightings.spotter_id
Table access by index rowid retrieves the rows with the rowids that were returned by step 3 from the spotters table
Index unique scan scans spotter_id in the PK_SPOTTER_ID index and finds the rowids of the associated rows in the spotters table
Table access full then scans through sightings completely until spotter_id = 1255 is found
The steps seem to be basically correct, but they should be read bottom-up.
The projection (choosing the relevant columns) is optimally done as early as possible at the scan phase.
The index operation is SEEK (you are not scanning the whole index)
NOTE: THIS ANSWER REFERS TO THE ORIGINAL VERSION OF THE QUESTION.
Oracle is reading the two tables in their entirety.
It is hashing each of the tables based on the join keys -- "re-ordering" the tables so similar keys appear near each other.
It is doing the join.
It is then doing the calculations for the final select and returning the results to the user.
This is what happens, informally, in the right order:
-- The index pk_spotter_id is scanned for at most one row that satisfies spotter_id = 1255
3 index unique scan pk_spotter_ID
-- The spotter_name column is fetched from the table spotters for the previously found row
2 table access by index rowid spotters
-- A nested loop is run for each (i.e. at most one) of the previously found rows
1 nested loops
-- That nested loop will scan the entire sightings table for rows that match the join
-- predicate sightings.spotter_id = spotters.spotter_id
4 table access full sightings
-- That'll be it for your select statement
0 select statement
In general (there are tons of exceptions), Oracle execution plans can be read:
Bottom-up
First sibling first
This means that you go down the tree until you find the first leaf operation (e.g. #3), which will be executed "first". Its results are fed to the parent (e.g. #2), all the siblings are then executed top down, and all the siblings' results are also fed to the parent. Then the parent result is fed to the grandparent (e.g. #1), and so on until you reach the top operation.
This is a very informal explanation of what happens. Do note there will be many exceptions to these rules once statements become more complex.
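The "bottom-up, first sibling first" rule is essentially a post-order walk of the plan tree. Here is a small Python sketch of that idea; the PlanNode class and the tree shape (0 over 1, with 2 and 4 as children of 1, and 3 under 2) are my own reconstruction from the plan above, since the flattened output lost its indentation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    """One row of the plan table (hypothetical model, not an Oracle API)."""
    op_id: int
    operation: str
    children: List["PlanNode"] = field(default_factory=list)

def execution_order(node: PlanNode) -> List[int]:
    """Post-order walk: children first (first sibling first), then the parent."""
    order: List[int] = []
    for child in node.children:
        order.extend(execution_order(child))
    order.append(node.op_id)
    return order

plan = PlanNode(0, "select statement", [
    PlanNode(1, "nested loops", [
        PlanNode(2, "table access by index rowid spotters", [
            PlanNode(3, "index unique scan pk_spotter_ID"),
        ]),
        PlanNode(4, "table access full sightings"),
    ]),
])

print(execution_order(plan))  # [3, 2, 4, 1, 0]
```

This is only the informal reading order. In practice the nested loop (#1) drives the scan in #4 once per row coming from #2, so the steps interleave rather than run strictly one after another.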

How do I make the database perform an index scan?

Simple question I think. I want to do an index scan on a table but it's not doing it. So I have a table with a unique clustered index on ID column and have 2 other columns, first_name and last_name. The following was my query...
SELECT FIRST_NAME
FROM TABLE_A
WHERE FIRST_NAME LIKE 'GUY'
I thought since I wasn't searching on the column with the index it should do it.
Why isn't it working and how do I make sure that I can get this to work every time I want it to?
Since first_name is not part of any index, there's no point in the database using an index - it will have to scan all of it, access the actual table row for each entry, and evaluate the first_name value there. Since it's accessing all the table's rows anyway, the optimizer just prefers to perform a full table scan, and save the (useless) index accesses.
If you want to use an index to speed up your query, you should create one that covers this column. E.g.:
CREATE INDEX table_a_first_name_ind ON table_a(first_name)
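To check the effect of such an index, you can compare the plan before and after creating it. The sketch below uses Python's built-in sqlite3 module as a convenient stand-in, so the plan wording is SQLite's and not that of the database in the question. It also uses = instead of LIKE, because SQLite only uses an index for LIKE under extra collation and case-sensitivity conditions; the plan helper is a made-up name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table table_a (id integer primary key, first_name text, last_name text)"
)

# Equality is used here instead of LIKE (an assumption made for the demo).
QUERY = "select first_name from table_a where first_name = 'GUY'"

def plan(sql):
    """Return the description column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("explain query plan " + sql)]

before = plan(QUERY)  # no index on first_name yet
conn.execute("create index table_a_first_name_ind on table_a(first_name)")
after = plan(QUERY)   # the new index is now available

print(before)
print(after)
```

Before the index exists the plan is a full SCAN of table_a; afterwards it becomes a SEARCH that names table_a_first_name_ind.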