Why is this query not using an index sort? - sql

Note: Table/Column/Index names are made up.
Background
I am having some trouble figuring out how to query one of my database tables efficiently (the table has about a million rows). The query in question involves a WHERE clause with a foreign key and an ORDER BY clause with another column.
The database generates an index on the FK and I create an index on the column I will use for ordering:
CREATE INDEX ab ON a(b);
Problem
When I run the query without filtering on the FK:
EXPLAIN SELECT * FROM a ORDER BY b;
The database properly uses the index to sort. I know this because the (truncated) output of this query is:
FROM PUBLIC.A
/* PUBLIC.AB */
ORDER BY 3
/* index sorted */
However, when the query is modified to filter on the FK:
EXPLAIN SELECT * FROM a WHERE a_fk_id = 3 ORDER BY b;
Only the FK index is used:
FROM PUBLIC.A
/* PUBLIC.A_FK_INDEX_NAME: A_FK_ID = 3 */
WHERE A_FK_ID = 3
ORDER BY 3
Note that the /* index sorted */ comment is gone: the index is used only for filtering, and the sort is performed separately rather than via the index.
Questions
What is going on here?
I thought perhaps it had something to do with the separate indexes, but even creating a multi-column index like:
CREATE INDEX a_fk_id_b ON a(a_fk_id, b);
did nothing to resolve the problem (neither did reversing the order of those columns in the index, though I didn't expect that to help anyway).
Any suggestions would be greatly appreciated. I am by no means a database or SQL expert, but I was surprised to get these results. Perhaps I just need to query for this information differently, but I figured this was a relatively simple case.

Turns out the answer is very simple, although not entirely obvious (at least not to me). The table needed a composite index covering the FK and the order column:
CREATE INDEX a_fk_id_b ON a(a_fk_id, b);
And in order for the database to utilize that index instead of the generated FK index, the a_fk_id column needed to be included in the ORDER BY clause:
EXPLAIN SELECT * FROM a WHERE a_fk_id = 3 ORDER BY a_fk_id, b;
This results in our composite index being used for filtering as well as for ordering, as shown in this truncated explain plan:
FROM PUBLIC.A
/* PUBLIC.A_FK_ID_B: A_FK_ID = 3 */
WHERE A_FK_ID = 3
ORDER BY 25, 19
/* index sorted */
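For reference, here is a minimal end-to-end sketch of the fix, reusing the made-up names from the question (the exact schema is assumed):
-- assumed table: a PK, the FK column from the WHERE clause, and the sort column
CREATE TABLE a (
    id      INT PRIMARY KEY,
    a_fk_id INT,
    b       INT
);
-- composite index covering both the filter and the sort
CREATE INDEX a_fk_id_b ON a(a_fk_id, b);
-- include the FK column in ORDER BY so the index can satisfy the sort
EXPLAIN SELECT * FROM a WHERE a_fk_id = 3 ORDER BY a_fk_id, b;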

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
    id BIGSERIAL PRIMARY KEY,
    joincol VARCHAR
);
CREATE TABLE test2 (
    joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
    SELECT test1.id,
           test1.joincol AS t1charcol,
           test2.joincol AS t2charcol
    FROM   test1, test2
    WHERE  test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100 ms. As far as I understand the execution plan, the runtime is independent of the row count, since Postgres walks the index row by row, starting at the highest id, until it finds a row that can be joined, and returns immediately.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on the row count), since the two tables are joined completely before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
In my real environment, test1 contains only a handful of rows (< 100), with unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward, row by row, for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use an Index Scan Backward, row by row, for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
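A minimal sketch of such a matching index (the index name is made up):
CREATE INDEX test1_id_desc_nulls_last_idx ON test1 (id DESC NULLS LAST);
-- with a matching sort order available, this variant can get the fast plan too:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;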
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN, which cannot introduce NULL values for id.
We also know that the index on test1.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly; regardless of index sort order, an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from the actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has ~10M rows in test2 and < 100 rows in test1, every value in test2.joincol has a match in test1.joincol, both are defined NOT NULL, and test1.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on test1.joincol and a FK constraint test2.joincol --> test1.joincol.
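A sketch of those constraints (the constraint names are made up):
ALTER TABLE test1 ADD CONSTRAINT test1_joincol_uni UNIQUE (joincol);
ALTER TABLE test2 ADD CONSTRAINT test2_joincol_fk
    FOREIGN KEY (joincol) REFERENCES test1 (joincol);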
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem and a good test case.
I tested it on Postgres 9.3; perhaps 13 can handle it faster.
I applied Occam's razor and excluded some possibilities:
The view (the query is just as slow without it)
The JOIN filtering out rows (unfortunately it doesn't in your test data, although with longer md5 values of 5-6 characters it would)
Other basically equivalent SELECT statements, like a subquery or EXISTS, which don't solve your problem either
I managed to get index-only scans, but because the tables aren't bigger than the indexes, that was not the solution.
I think
CREATE INDEX ON test1 (id);
is useless, because it duplicates the PK.
If you change this
CREATE INDEX ON test1 (joincol);
to this
CREATE INDEX ON test1 (joincol, id);
then the second query uses only indexes.
After you run
REINDEX TABLE test1;
REINDEX TABLE test2;
VACUUM ANALYZE test1;
VACUUM ANALYZE test2;
you can gain some performance, because you created the indexes before the inserts.
I think the reason is that the database has two optimization goals.
The first goal is to optimize for just a few rows, so it runs a nested loop; you can force this with LIMIT x.
The second goal is to optimize for the whole table, i.e. to run the query fast over all rows.
In this situation the Postgres optimizer did not notice that a simple MAX could run with a nested loop. Or perhaps Postgres cannot push a LIMIT into an aggregate (the aggregate runs over the whole filtered subselect), and that is very expensive. On the other hand, you have the option of writing other aggregates there, like SUM, MIN, AVG, etc.
Perhaps window functions could help you too.

db2 10.5 multi-column index explanation

This is my first time working with indexes in a database, and so far I've learned that if you have a multi-column index such as index('col1', 'col2', 'col3'), and you run a query that uses WHERE col2='col2' AND col3='col3', that index will not be used.
I also learned that if a column has very low selectivity, indexing it is useless.
However, from my tests, it seems none of the above is true at all. Can someone explain this?
I have a table with more than 16 million records. Let's say claimID is the primary key, there is a historynumber column that has only 3 distinct values (1, 2, 3), and a storeNumber column with about 1 million distinct values.
I have an index on claimID alone, another index (historynumber, claimID), another index (historynumber, storeNumber), and finally an index (storeNumber, historynumber).
My guess was that if I do:
select * from my_table where claimId='123456' and historynumber = 1
would be much faster than
select * from my_table where historynumber = 1 and claimId = '123456'
However, the two have exactly the same performance (instant). So I thought the primary key index can work with any column order. Therefore, I tried the same thing on historynumber and storeNumber instead. The result is exactly the same. Then I started trying columns that have no indexes, and of course the result is the same as well.
Finally, I do a
select * from my_table where historynumber = 1
and the query took so long that I had to cancel it.
So my conclusion is that the column order in the WHERE clause is completely irrelevant, and so is the column order in the index definition, since the database seems to be smart enough to tell which column has the highest selectivity.
Could someone give me an example that could prove otherwise?
Index explanation is a huge topic.
Don't worry about the sequence of the different attributes in the SQL - it has no effect whether you specify
...where claimId='123456' and historynumber = 1
or the other way round. Every SQL statement is checked and optimized by the optimizer. To prove how the data gets accessed, you can do an EXPLAIN. Check the documentation for more details.
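A hedged sketch of what that could look like on DB2 LUW (this assumes the explain tables already exist, e.g. created from EXPLAIN.DDL):
EXPLAIN PLAN FOR
SELECT * FROM my_table WHERE claimID = '123456' AND historynumber = 1;
-- the populated explain tables can then be formatted with the db2exfmt tool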
For your other problem
select * from my_table where historynumber = 1
with an index of (storeNumber, historynumber).
Have you ever tried to look up the name of a caller (knowing only the telephone number) in a telephone book?
Well, it is pretty much the same for an index - so the column order when creating the index matters!
There are techniques which could help - e.g. an index jump scan - but there is no guarantee.
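To make the telephone-book analogy concrete, a hypothetical sketch (the index name is made up) of which predicates can use an index led by storeNumber:
CREATE INDEX idx_store_hist ON my_table (storeNumber, historynumber);
SELECT * FROM my_table WHERE storeNumber = '42';                        -- leading column: index usable
SELECT * FROM my_table WHERE storeNumber = '42' AND historynumber = 1;  -- both columns: index usable
SELECT * FROM my_table WHERE historynumber = 1;                         -- trailing column only: likely a full scan, barring a jump scan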
Check out the following sites to learn a little more about DB2 indexes:
http://db2commerce.com/2013/09/19/db2-luw-basics-indexes/
http://use-the-index-luke.com/sql/where-clause/the-equals-operator/concatenated-keys

Oracle: Full text search with condition

I've created an Oracle Text index like the following:
create index my_idx on my_table (text) indextype is ctxsys.context;
And I can then do the following:
select * from my_table where contains(text, '%blah%') > 0;
But let's say we have another column in this table, say group_id, and I wanted to do the following query instead:
select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;
With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.
Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:
create index my_idx on my_table (group_id, text) indextype is ctxsys.context;
Kind of like a normal index, so a separate text search can be done for each group_id.
Is there a way to do something like this in Oracle (I'm using 10g if that is important)?
Edit (clarification)
Consider a table with one million rows and, among others, two numeric columns A and B. Let's say there are 500 distinct values of A and 2000 distinct values of B, and each row is unique.
Now let's consider select ... where A = x and B = y
Separate indexes on A and B will, as far as I can tell, do an index search on B, which will return 500 rows, and then do a join/scan on these rows. In any case, at least 500 rows have to be looked at (unless the database gets lucky and finds the required row early).
An index on (A,B), on the other hand, is much more effective: it finds the one row in a single index search.
Putting separate indexes on group_id and the text, I feel, only leaves the query optimizer with three options:
(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.
Whereas I want:
(4) Use the (group_id, text) index to find the text sub-index for the particular group_id and search that text index for the particular row(s) I need. No scanning-and-checking or joining required, much like when using an index on (A,B).
Oracle Text
1 - You can improve performance by creating the CONTEXT index with FILTER BY:
create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;
In my tests the FILTER BY definitely improved performance, but it was still slightly faster to just use a btree index on group_id.
2 - CTXCAT indexes use "sub-indexes", and seem to work similarly to a multi-column index. This seems to be the option (4) you're looking for:
begin
    ctx_ddl.create_index_set('my_table_index_set');
    ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/
create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
    parameters('index set my_table_index_set');
select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0;
This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.
(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't guarantee this isn't an inappropriate use of these indexes, as @NullUserException said.)
Multi-column indexes vs. index joins
For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.
The reason for this is because Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Whether or not Oracle has to join together a few thousand rows is an inconsequential amount of CPU time.
However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.
I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.
Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.
Short version: There's no need to do that. The query optimizer is smart enough to decide what's the best way to select your data. Just create a btree index on group_id, ie:
CREATE INDEX my_group_idx ON my_table (group_id);
Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.
DESC my_table;
Name     Null     Type
-------- -------- ---------
ID       NOT NULL NUMBER(4)
GROUP_ID          NUMBER(4)
TEXT              CLOB
There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:
EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);
Here's how many rows each group_id has and the corresponding percentage:
GROUP_ID  COUNT  PCT
--------  -----  ---
       1      1    1
       2      2    1
       3      4    3
       4      8    6
       5     16   12
       6     32   24
       7     64   47
       8      9    7
Note that the query optimizer will use an index only if it thinks it's a good idea - that is, you are retrieving up to a certain percentage of rows. So, if you ask it for a query plan on:
SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;
You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.
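One standard way to inspect those plans yourself (using Oracle's DBMS_XPLAN; nothing here is specific to this example):
EXPLAIN PLAN FOR SELECT * FROM my_table WHERE group_id = 7;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);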
Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).
SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';
Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:
SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;
Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:
SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';
Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.
I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?
This inline view gets you the PK values of the rows in group 43:
(
    select T.pkcol
    from   T
    where  group = 43
)
If group has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:
select * from T
inner join
(
    select T.pkcol
    from   T
    where  group = 43
) MyGroup
on T.pkcol = MyGroup.pkcol
where contains(text, '%blah%') > 0
Hopefully the optimizer would be able to use the PK index to optimize the join and then apply the contains predicate only to the group 43 rows. (Note that Oracle does not allow AS for table aliases, hence MyGroup without it.)

index with multiple columns - ok when doing query on only one column?

If I have a table
create table sv ( id integer, data text )
and an index:
create index myindex_idx on sv (id, data)
would this still be useful if I did a query
select * from sv where id = 10
My reason for asking is that I'm looking through a set of tables without any indexes and seeing different combinations of select queries. Some use just one column, others more than one. Do I need indexes for both sets, or is an all-inclusive index OK?
I am adding the indexes for faster lookups than full table scans.
Example (based on the answer by Matt Huggins):
select * from table where col1 = 10
select * from table where col1 = 10 and col2=12
select * from table where col1 = 10 and col2=12 and col3 = 16
could all be covered by an index on (col1, col2, col3), but
select * from table where col2=12
would need another index?
It should be useful, since an index on (id, data) first indexes by id, then by data within each id.
If you query by id, this index will be used.
If you query by id & data, this index will be used.
If you query by data alone, this index will NOT be used.
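As a concrete sketch against the sv table from the question (describing likely behavior, not a guaranteed plan):
select * from sv where id = 10;                 -- can use myindex_idx
select * from sv where id = 10 and data = 'x';  -- can use myindex_idx
select * from sv where data = 'x';              -- leading column missing: index generally not used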
Edit: when I say it's "useful", I mean it's useful in terms of query speed/optimization. As Sune Rievers pointed out, it does not mean you will get a unique record given just an id (unless you specify id as unique in your table definition).
Oracle supports a number of ways of using an index, and you ought to start by understanding all of them so have a quick read here: http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/optimops.htm#sthref973
Your query select * from table where col2=12 could usefully leverage an index skip scan if the leading column is of very low cardinality, or a fast full index scan if it is not. These would probably be fine for running reports; however, for an OLTP query it is likely that you would do better to create an index with col2 as the leading column.
I assume id is the primary key. There is no point in adding the primary key to another index, as it is already unique; adding columns to something unique just yields another unique combination.
Add a unique index to data if you really need it; otherwise just rely on id for the table's uniqueness.
If id is not your primary key, then you will not be guaranteed a unique result from your query.
Regarding your last example with the lookup on col2, I think you will need another index. Indexes are not a cure-all for performance problems, though; sometimes your database design or your queries need to be optimized, for instance rewritten into stored procedures (Oracle's equivalent being PL/SQL procedures).
If the driver behind your question is that you have a table with several columns and any combination of these columns may be used in a query, then you should look at BITMAP indexes.
Looking at your example:
select * from mytable where col1 = 10 and col2=12 and col3 = 16
You could create 3 bitmap indexes:
create bitmap index ix_mytable_col1 on mytable(col1);
create bitmap index ix_mytable_col2 on mytable(col2);
create bitmap index ix_mytable_col3 on mytable(col3);
These bitmap indexes have the great benefit that they can be combined as required.
So, each of the following queries would use one or more of the indexes:
select * from mytable where col1 = 10;
select * from mytable where col2 = 10 and col3 = 16;
select * from mytable where col3 = 16;
So, bitmap indexes may be an option for you. However, as David Aldridge pointed out, depending on your particular data set a single index on (col1,col2,col3) might be preferable. As ever, it depends. Take a look at your data, the likely queries against that data, and make sure your statistics are up to date.
Hope this helps.

Creating Indexes for Group By Fields?

Do you need to create an index for the GROUP BY fields in an Oracle database?
For example:
select *
from some_table
where field_one is not null and field_two = ?
group by field_three, field_four, field_five
I was testing the indexes I created for the above and the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?
It could be correct, but that would depend on how much data you have. Typically I would create an index for the columns used in a GROUP BY, but in your case the optimizer may have decided that, after using the field_two index, there wouldn't be enough data returned to justify using another index for the GROUP BY.
No, this can be incorrect.
If you have a large table, Oracle can prefer deriving the fields from the indexes rather than from the table, even if there is no single index that covers all the values.
In the latest article on my blog, NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle, there is a query in which Oracle does not use a full table scan but rather joins two indexes to get the column values:
SELECT  l.id, l.value
FROM    t_left l
WHERE   NOT EXISTS
        (
        SELECT  value
        FROM    t_right r
        WHERE   r.value = l.value
        )
The plan is:
SELECT STATEMENT
  HASH JOIN ANTI
    VIEW, 20090917_anti.index$_join$_001
      HASH JOIN
        INDEX FAST FULL SCAN, 20090917_anti.PK_LEFT_ID
        INDEX FAST FULL SCAN, 20090917_anti.IX_LEFT_VALUE
    INDEX FAST FULL SCAN, 20090917_anti.IX_RIGHT_VALUE
As you can see, there is no TABLE SCAN on t_left here.
Instead, Oracle takes the indexes on id and value, joins them on rowid and gets the (id, value) pairs from the join result.
Now, to your query:
SELECT *
FROM some_table
WHERE field_one is not null and field_two = ?
GROUP BY
field_three, field_four, field_five
First, it will not compile, since you are selecting * from a table with a GROUP BY clause.
You need to replace * with expressions based on the grouping columns and aggregates of the non-grouping columns.
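For instance, a hypothetical rewrite that does compile (the COUNT(*) is just an example aggregate):
SELECT field_three, field_four, field_five, COUNT(*) AS cnt
FROM   some_table
WHERE  field_one IS NOT NULL
       AND field_two = ?
GROUP BY field_three, field_four, field_five;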
You will most probably benefit from the following index:
CREATE INDEX ix_sometable_23451 ON some_table (field_two, field_three, field_four, field_five, field_one)
, since it will contain everything needed: filtering on field_two, ordering by field_three, field_four, field_five (useful for GROUP BY), and making sure that field_one is NOT NULL.
Do you need to create an index for the GROUP BY fields in an Oracle database?
No. You don't need to, in the sense that a query will run irrespective of whether any indexes exist or not. Indexes are provided to improve query performance.
It can, however, help; but I'd hesitate to add an index just to help one query, without thinking about the possible impact of the new index on the database.
...the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?
Not always. Often a GROUP BY will require Oracle to perform a sort (but not always), and you can eliminate that sort operation by providing a suitable index on the column(s) to be sorted.
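For example, an index matching the grouping columns (the index name is made up) may let Oracle read the rows in order and skip the sort:
CREATE INDEX ix_sometable_grp ON some_table (field_three, field_four, field_five);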
Whether you actually need to worry about the GROUP BY performance, however, is an important question for you to think about.