Index scan for multicolumn comparison - non-uniform index column ordering - sql

This question is closely related to Enforcing index scan for multicolumn comparison.
The solution there is perfect, but it seems to work only if all index columns have the same ordering. This question is different because column b is descending here, which prevents using the row-value syntax to solve the same problem. That is why I'm looking for another solution.
Suppose an index is built on 3 columns (a ASC, b DESC, c ASC); a possible definition is sketched after the list below. I want Postgres to:
find the key [a=10, b=20, c=30] in that B-tree,
scan the next 10 entries and return them.
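For reference, the assumed index definition might look like this (the index name is made up):
create index ix_a_bdesc_c on table1 (a, b desc, c);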
If the index has only one column the solution is obvious:
select * from table1 where a >= 10 order by a limit 10
But if there are more columns the solution becomes much more complex. For 3 columns:
select * from table1
where a > 10 or (a = 10 and (b < 20 or b = 20 and c <= 30))
order by a, b DESC, c
limit 10;
How can I tell Postgres that I want this operation?
And can I be sure that even for these more complex queries on 2+ columns the optimizer will always understand that it should perform a range scan? Why?

PostgreSQL implements row values (tuples) very thoroughly, unlike the half-implementations found in Oracle, DB2, SQL Server, etc. You can write your condition using a "tuple inequality", as in:
select *
from table1
where (a, -b, c) >= (10, -20, 30)
order by a, -b, c
limit 10
Please note that since the second column is in descending order, you must "invert" its value during the comparison. That's why it's expressed as -b and, correspondingly, as -20. This can be tricky for non-numeric columns such as dates, varchars, LOBs, etc.
Finally, the use of an index is still possible with the -b column value if you create an ad-hoc index, such as:
create index ix1 on table1 (a, (-b), c);
However, you can never force PostgreSQL to use an index. SQL is a declarative language, not an imperative one. You can entice it to do so by keeping table stats up to date, and also by selecting a small number of rows. If your LIMIT is too big, PostgreSQL may be inclined to use a full table scan instead.
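To see whether the planner actually picks the expression index, you can check the plan (a sketch; the actual output depends on your data and statistics):
explain (analyze, buffers)
select *
from table1
where (a, -b, c) >= (10, -20, 30)
order by a, -b, c
limit 10;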

Strictly speaking, your index on (a ASC, b DESC, c ASC) can still be used, but only based on the leading expression a. See:
Is a composite index also good for queries on the first field?
Working of indexes in PostgreSQL
Its usefulness is limited, though, and Postgres will only use it if the predicate on a alone is selective enough (roughly: fewer than 5% of all rows have a >= 10), or possibly to profit from an index-only scan where possible. But all index tuples qualifying on a alone have to be read, and you will see a Filter step in the query plan to discard non-qualifying rows - both add extra cost. An index on just (a) typically does a better job as it's smaller and cheaper to maintain.
I have tried and failed in the past to make full use of an index with non-uniform sort order (ASC | DESC) like the one you display for a ROW value comparison. I am pretty certain it's not possible. Think about it: Postgres compares whole row values, which can either be greater or smaller, but not both at the same time.
There are workarounds for data types with a defined negator (like - for numeric types). See the solution provided by "The Impaler"! The trick is to invert the values and wrap them in an expression index to get a uniform sort order for all index expressions after all - which is currently the only way to tap into the full potential of row comparison. Be sure to make both the WHERE conditions and the ORDER BY match the special index.

Related

Can PostgreSQL use an index for row-wise comparison?

Let's say that we have the following SQL:
SELECT a, b, c
FROM example_table
WHERE a = '12345' AND (b, c) <= ('2020-08-15'::date, '2020-08-15 00:40:33'::timestamp)
LIMIT 20
Can PostgreSQL efficiently use a B-Tree index defined on (a, b, c) to answer this query?
To elaborate a little bit on the use case: this SQL query is part of my cursor-pagination implementation. Since I'm using a UUID as the primary key, I have to resort to the date/timestamp columns for the cursor, which more closely fits my actual needs anyway. I'm new to PostgreSQL and this row-wise comparison feature, so I'm unsure how I can use an index to speed it up. In my testing using "explain analyze" I wasn't able to make the query use the index, but I assume this may be because a table scan is more efficient given that there aren't many rows in the table.
Well, it should use the index, but only the first two columns. It can scan the rows with the value of a you have specified. If the index is up to date with the table, then Postgres can pull the values of b and c from the index. That will allow it to scan a range of values for b.
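One way to test this yourself is to define the (a, b, c) index and inspect the plan (the index name is made up; the actual plan depends on your data and statistics):
create index example_table_a_b_c_idx on example_table (a, b, c);

explain analyze
select a, b, c
from example_table
where a = '12345'
  and (b, c) <= ('2020-08-15'::date, '2020-08-15 00:40:33'::timestamp)
limit 20;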

Oracle multiple vs single column index

Imagine I have a table with the following columns:
Column: A (number(10)) (PK)
Column: B (number(10))
Column: C (number(10))
CREATE TABLE schema_name.table_name (
  column_a number(10) primary key,
  column_b number(10),
  column_c number(10)
);
Column A is my PK.
Imagine my application now has a flow that queries by B and C. Something like:
SELECT * FROM SCHEMA.TABLE WHERE B=30 AND C=99
If I create an index using only column B, will this already improve my query?
Would the strategy behind this query benefit from an index on column B?
Q1 - If so, why should I create an index with those two columns?
Q2 - If I decide to create an index with B and C, and I query selecting only B, would that query be affected by the index?
The simple answers to your questions.
For this query:
SELECT *
FROM SCHEMA.TABLE
WHERE B = 30 AND C = 99;
The optimal index is either (B, C) or (C, B). The order does not matter, because both comparisons are =.
An index on either column can be used, but all the matching values will need to be scanned to compare to the second value.
If you have an index on (B, C), then this can be used for a query on WHERE B = 30. Oracle also implements a skip-scan optimization, so it is possible that the index could also be used for WHERE C = 99 -- but it probably would not be.
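A minimal sketch of that index and the queries it serves (the index name is arbitrary):
CREATE INDEX ix_b_c ON schema_name.table_name (column_b, column_c);

-- Can use the full index:
SELECT * FROM schema_name.table_name WHERE column_b = 30 AND column_c = 99;

-- Can use the leading column of the index:
SELECT * FROM schema_name.table_name WHERE column_b = 30;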
I think the documentation for MySQL has a good introduction to multi-column indexes. It doesn't cover the skip-scan but is otherwise quite applicable to Oracle.
Short answer: always check the real performance, not the theoretical one. That means my answer requires verification against a real database.
In SQL databases (Oracle, Postgres, MS SQL, etc.) the primary key is used for at least two purposes:
Ordering of rows (e.g. if the PK only ever increments, new values are appended at the end)
Linking to rows. Any additional index contains the whole PK, so that it is possible to jump from that index to the actual table row.
If I create an index using only column B, will this already improve my query?
Would the strategy behind this query benefit from an index on column B?
It depends. If your table is small, Oracle may just do a full scan of it. For a large table, Oracle can (and in the common scenario will) use the index on column B and then do a range scan. In that case Oracle checks all rows with B=30. Therefore, if only one row has B=30, you get good performance. If millions of rows have it, Oracle will need to do millions of reads. Oracle gets this information from the statistics.
Q1 - If so, why should I create an index with those two columns?
It enables direct access to the row. In that case Oracle needs only a few jumps to find your row. Moreover, you can apply the UNIQUE modifier to help Oracle; then it knows that no more than a single row will be returned.
However, if your table has other columns, the real execution plan will also include an access via the PK to retrieve the remaining columns.
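If (and only if) the combination of B and C really is unique in your data, the UNIQUE modifier mentioned above would look like this (the index name is arbitrary):
CREATE UNIQUE INDEX ux_b_c ON schema_name.table_name (column_b, column_c);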
If I decide to create an index with B and C, and I query selecting only B, would that query be affected by the index?
Yes. Please check the details here. If an index has several columns, then Oracle sorts the entries according to the column order. E.g. if you create an index on columns B, C, then Oracle is able to use it for predicates like B=30, i.e. when you restrict only B.
Well, it all depends.
If that table is tiny, you won't see any benefit regardless of any indexes you might create - it is just too small and Oracle returns the data immediately.
If the table is huge, then it depends on the columns' selectivity. There's no guarantee that Oracle will ever use that index. If the optimizer decides (based on the information it has - don't forget to regularly collect statistics!) that the index should not be used, then you created it in vain (though you can choose to use a hint, but - unless you know what you're doing - don't).
How will you know what's going on? See the explain plan.
But, generally speaking, yes - indexes help.
Q1 - If so, why should I create an index with those two columns?
Which "two columns"? A? If A is the primary key column, Oracle automatically creates an index for it; you don't have to do that.
Q2 - If I decide to create an index with B and C, and I query selecting only B, would that query be affected by the index?
If you are talking about a composite index (containing both B and C, in that order), and the query uses column B, then yes - the index will (OK, might) be used. But if the query uses only column C, then this index will be completely useless.
In spite of this question being answered and one answer being accepted already, I'll just throw in some more information :-)
An index is an offer to the DBMS that it can use to access data quicker in some situations. Whether it actually uses the index is a decision made by the DBMS.
Oracle has a built-in optimizer that looks at the query and tries to find the best execution plan to get the results you are after.
Let's say that 90% of all rows have B = 30 AND C = 99. Why then should Oracle laboriously walk through the index only to have to access almost every row in the table at last? So, even with an index on both columns, Oracle may decide not to use the index at all and even perform the query faster because of the decision against the index.
Now to the questions:
If I create an index using only column B, will this already improve my query?
It may. If Oracle thinks that B = 30 immensely reduces the number of rows it will have to read from the table, it will.
If so, why should I create an index with those two columns?
If the combination of B = 30 AND C = 99 limits the rows to read from the table further, it's a good idea to use this index instead.
If I decide to create an index with B and C, and I query selecting only B, would that query be affected by the index?
If the index is on (B, C), i.e. B first, then Oracle may find it useful, yes. In the extreme case that there are only the two columns in the table, that would even be a covering index (i.e. containing all columns accessed in the query) and the DBMS wouldn't have to read any table row, as all the information is already in the index itself. If the index is (C, B), i.e. C first, it is quite unlikely that the index would be used. In some edge-case situations, Oracle might do so, though.
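To make the covering-index idea concrete for a table that has more columns than just B and C: if a query only ever reads A, B and C, an index containing all three (names are just examples) lets Oracle answer it from the index alone:
CREATE INDEX ix_b_c_a ON schema_name.table_name (column_b, column_c, column_a);

-- Every referenced column is in the index, so no table access is needed:
SELECT column_a, column_b, column_c
FROM schema_name.table_name
WHERE column_b = 30 AND column_c = 99;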

Maximum number of useful indexes a table can have?

The Meeting
In a meeting last week the client was discussing how to make an important search page faster. The page searches a single table (12 columns, 20 million rows) by asking for values (strings) on any field; it returns 50 rows (with pagination), starting at the specified criteria (each column can be ascending or descending). When the criteria don't match the existing indexes, the search becomes slow, and the client is not happy.
And then -- in the middle of the meeting -- the semi-technical analyst threw this one into the air: Why don't we create all possible indexes on the table to make everything fast?
I responded at once: "No, there are too many, and that would make the table really slow to modify, so we need to create a few cleverly chosen indexes instead." We ended up creating the most useful ones, and the page is now much faster. Problem solved.
The Question
But still... I keep thinking about that question and I wanted to have a better understanding of it, so here it is:
In theory, how many possible useful indexes can I create on a table with N columns?
I think that by useful we should consider the following (I may be wrong):
Indexes not already covered by other ones: for example (a, b) should not be counted if (a, b, c) is included.
In order to show multiple rows (not just equality) ascending and descending indexes should be counted as separate ones when they are part of a composite index. That is: (a) serves the same purpose of (a DESC), but (a, b) serves a different purpose than (a DESC, b).
So, a table with a single column (a) can have only a single index:
(a)
With two columns (a, b) I can have four useful indexes:
(a, b)
(b, a)
(a DESC, b)
(b DESC, a)
(a) -- already covered by #1
(b) -- already covered by #2
(a, b DESC) -- already covered by #1 (reading index in reverse)
(b, a DESC) -- already covered by #2
(a DESC, b DESC) -- already covered by #3
(b DESC, a DESC) -- already covered by #4
(a DESC) -- already covered by #3
(b DESC) -- already covered by #4
With three columns (a, b, c):
(a, b, c)
(a, c, b)
(b, c, a)
(b, a, c)
(c, a, b)
(c, b, a)
...
Let's say you have a table t with columns a, b, and c.
For the query
select a from t where b = 1 order by c;
the best index is on t(b,c,a), because you first look up values using b, then order results by c and have a in the results.
For this query:
select a from t where c = 1 order by b;
the best index is on t(c,b,a).
For this query:
select b from t where c = 1 order by a;
the best index is on t(c,a,b).
With more columns a query could look like this:
select a from t where b = 1 order by c, d, e;
and you'd want an index on t(b,c,d,e,a).
While for
select a from t where b = 1 order by e, d, c;
you'd want an index on t(b,e,d,c,a).
So the maximal number of useful indexes for n columns is n!, i.e. all permutations.
This is for indexes on the mere columns alone. As Gordon Linoff has mentioned in the comments section of your question, you may also want function indexes (e.g. on t(upper(a), lower(b))). The number of useful function indexes is theoretically unlimited. And yes, Gordon is also right about further index types.
So the final answer is that theoretically the number of useful indexes per table is unlimited.
All the other answers contain something valuable, but there is enough that I have to say about it to warrant a third one.
There is no exact answer to the question like you put it. In a way, it is like asking “What's the limit beyond which you would call a person crazy?” There is a large grey area.
My points are:
What would happen if you add too many indexes:
Modifying the table gets substantially slower. Even with a few indexes, data manipulation will already become an order of magnitude slower. If you ever want to INSERT, UPDATE or DELETE, a table with all conceivable indexes would make such an operation glacially slow.
With many indexes, the query planner has to consider many different access paths, so planning the query will become slightly slower with any index you add. With very many indexes, it may well be that the planning overhead will make the query too slow even before the executor has started working.
What can you do to reduce the number of indexes needed:
Look at the operators. If the operators <, <=, >= and > are never used, there is no point in adding indexes with descending columns.
Remember that an index on (a, b, c) can also be used for a query that only uses a in its condition, so you don't need an extra index on (a).
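For example (generic table and column names):
CREATE INDEX t_a_b_c_idx ON t (a, b, c);

-- Both of these queries can use t_a_b_c_idx; a separate index on (a) is not needed:
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;
SELECT * FROM t WHERE a = 1;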
What is a practical way forward for you?
I have two suggestions:
One way is to add a simple index on each of your twelve columns.
Twelve indexes are already quite a lot, but you are still not in the crazy range.
PostgreSQL can use these indexes efficiently in a query with conditions on more than one column, and even if none of the conditions alone would be selective enough to warrant an index scan.
This is because PostgreSQL has bitmap index scans. See this example from the documentation:
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;
                                      QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=25.08..60.21 rows=10 width=244)
   Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
   ->  BitmapAnd  (cost=25.08..25.08 rows=10 width=0)
         ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
               Index Cond: (unique1 < 100)
         ->  Bitmap Index Scan on tenk1_unique2  (cost=0.00..19.78 rows=999 width=0)
               Index Cond: (unique2 > 9000)
Each index is scanned and a bitmap is formed that contains 1 for each row that matches the condition. Then the bitmaps are combined, and finally the rows are fetched from the table.
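Applied to the search page from the question, this suggestion amounts to one single-column index per searchable column, roughly like this (table and column names are placeholders):
CREATE INDEX ON search_table (col1);
CREATE INDEX ON search_table (col2);
CREATE INDEX ON search_table (col3);
-- ... and so on, one per searchable column.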
The other idea is to use a Bloom filter.
If the only operator in your conditions is =, you can
CREATE EXTENSION bloom;
and create a single index USING bloom over all table columns.
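A concrete definition could look roughly like this (placeholder table and column names; note that the bloom module ships operator classes only for int4 and text columns):
CREATE INDEX search_table_bloom_idx ON search_table
  USING bloom (col1, col2, col3, col4, col5);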
Such an index can be used for queries with any combination of columns in the WHERE clause. The downside is that it is a lossy index, so you will get false positive results that have to be fetched and filtered out.
It depends on your case, but this might be an elegant (and underestimated!) solution that balances query and update speed.
In theory, how many possible useful indexes can I create on a table with N columns?
Rather than answering this question theoretically, a practical answer is much better.
The first point to note is that all sequential searches should be avoided (unless the table is very small). By "very small", I mean, just a few rows (say, max 10). (However, even in such a table, a primary key is encouraged, to enforce uniqueness. This would, of course, be implemented as an index.)
Therefore, if the client has a valid search path, an index is required. If an existing index serves the purpose, that's OK; else, in all probability, an additional index is needed.
One transaction table in one application in my experience had 8 indexes. The client insisted on certain search paths, and so we had no choice but to provide them. Of course, we informed the client that updates would slow down, but the client found that acceptable. In reality, the slowdown in speed during updates wasn't appreciable.
So that is the approach suggested - warn the client accordingly.
It is important to verify, during design, that a SQL statement uses indexed search paths (for every accessed table), rather than searching sequentially. ORACLE has a tool for this, called EXPLAIN PLAN. Other DBs should also have similar tools.

Finding the "next 25 rows" in Oracle SQL based on an indexed column

I have a large table (~200M rows) that is indexed on a numeric column, Z. There is also an index on the key column, K.
K Z
= ==========================================
1 0.6508784068583483336644518457703156855132
2 0.4078768075307567089075462518978907890789
3 0.5365440453204830852096396398565048002638
4 0.7573281573257782352853823856682368153782
What I need to be able to do is find the 25 records "surrounding" a given record. For instance, the "next" record starting at K=3 would be K=1, followed by K=4.
I have been led by several sources (most notably this paper from some folks at Florida State University) to believe that SQL like the following should work. It's not hard to imagine that scanning along the indexed column in ascending or descending order would be efficient.
select * from (
  select *
  from T
  where Z >= [origin's Z value]
  order by Z asc
) where rownum <= 25;
In theory, this should find the 25 "next" rows, and a similar variation would find the 25 "previous" rows. However, this can take minutes and the explain plan consistently contains a full table scan. A full table scan is simply too expensive for my purpose, but nothing I do seems to prompt the query optimizer to take advantage of the index (short, of course, of changing the ">=" above to an equals sign, which indicates that the index is present and operational). I have tried several hints to no avail (index, index_asc in several permutations).
Is what I am trying to do impossible? If I were trying to do this on a large data structure over which I had more control, I'd build a linked list on the indexed column's values and a tree to find the right entry point. Then traversing the list would be very inexpensive (yes I might have to run all over the disk to find the records I'm looking for, but I surely wouldn't have to scan the whole table).
I'll add, in case it's important, that the database I'm using is Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.
I constructed a small test case with 10K rows. When I populated the table such that the Z values were already ordered, the exact query you gave tended to use the index. But when I populated it with random values, and refreshed the table statistics, it started doing full table scans, at least for some values of n larger than 25. So there is a tipping point at which the optimizer decides that the amount of work it will do to look up index entries then find the corresponding rows in the table is more than the amount of work to do a full scan. (It might be wrong in its estimate, of course, but that is what it has to go on.)
I noticed that you are using SELECT *, which means the query is returning both columns. This means that the actual table rows must be accessed, since neither index includes both columns. This might push the optimizer towards preferring a full table scan for larger samples. If the query could be fulfilled from the index alone, it would be more likely to use the index.
One possibility is that you don't really need to return the values of K at all. If so, I'd suggest that you change both occurrences of SELECT * to SELECT z. In my test, this change caused a query that had been doing a full table scan to use an index scan instead (and not access the table itself at all).
If you do need to include K in the result, then you might try creating an index on (Z, K). This index could be used to satisfy the query without accessing the table.
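A minimal sketch of that approach (the index name is arbitrary, and :origin_z stands for the origin row's Z value):
CREATE INDEX t_z_k_ix ON T (Z, K);

SELECT * FROM (
  SELECT Z, K
  FROM T
  WHERE Z >= :origin_z
  ORDER BY Z ASC
) WHERE ROWNUM <= 25;
Because both referenced columns are in the index, Oracle can satisfy this without touching the table at all.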

Is an index on A, B redundant if there is an index on A, B, C?

Having years of experience as a DBA, I do believe I know the answer to the question, but I figured it never hurts to check my bases.
Using SQL Server, assuming I have a table which has an index on column A and column B, and a second index on columns A, B, and C, would it be safe to drop the first index, as the second index basically would satisfy queries that would benefit from the first index?
It depends, but the answer is often 'Yes, you could drop the index on (A,B)'.
The counter-case (where you would not drop the index on (A,B)) is when the index on (A,B) is a unique index that is enforcing a constraint; then you do not want to drop the index on (A,B). The index on (A,B,C) could also be unique, but the uniqueness is redundant because the (A,B) combination is unique because of the other index.
But in the absence of such unusual cases (for example, if both (A,B) and (A,B,C) allow duplicate entries), the (A,B) index is logically redundant. However, if column C is 'wide' (a CHAR(100) column perhaps), whereas A and B are small (say INTEGER), then the (A,B) index is more efficient than the (A,B,C) index because you get more information per page read of the (A,B) index. So, even though (A,B) is redundant, it may be worth keeping. You also need to consider the volatility of the table; if the table seldom changes, the extra indexes don't matter much; if the table changes a lot, extra indexes slow down modifications to the table. Whether that's significant is difficult to guess; you probably need to do the performance measurements.
The first index covers queries that look up on A or on A, B; the second index can be used to cover queries that look up on A, on A, B, or on A, B, C, which is clearly a superset of the first case.
If C is very wide however the index on A,B may still be useful as it can satisfy certain queries with fewer reads.
e.g. if C were a char(800) column, the following query might benefit significantly from having the narrower index available:
SELECT a,b
FROM YourTable
ORDER BY a,b
Yes, this is a common optimization. Any query that would benefit from the index on A,B can also benefit just as well from the index on A,B,C.
In the MySQL community, there's even a tool to search your whole schema for redundant indexes: http://www.percona.com/doc/percona-toolkit/pt-duplicate-key-checker.html
The possible exception case would be if the index on A,B were more compact and used much more frequently, and you wanted to control which index was kept loaded in memory.
Much of what I was thinking has already been written by Jonathan in a previous answer: uniqueness, faster access, and one other thing I think he missed.
If the first index is (A DESC, B ASC) and the second is (A ASC, B ASC, C ASC), then deleting the first index isn't really an option, because the second one isn't a superset of the first, and a query that needs the ordering given by the first index cannot benefit from the second.
With the first index you can order by A DESC, B ASC (of course) and, by scanning it backwards, by A ASC, B DESC; you can also write a query that uses only a leading part of that index, like ORDER BY A DESC.
But a query like ORDER BY A ASC, B ASC will not be 'covered' by the first index (see the sketch below).
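To make the mixed-order case concrete (SQL Server syntax, hypothetical index name):
CREATE INDEX ix_a_desc_b_asc ON YourTable (A DESC, B ASC);

-- Served by ix_a_desc_b_asc (forward scan):
SELECT A, B FROM YourTable ORDER BY A DESC, B ASC;

-- Served by ix_a_desc_b_asc scanned backwards:
SELECT A, B FROM YourTable ORDER BY A ASC, B DESC;

-- Not served by ix_a_desc_b_asc; this needs a sort or an (A ASC, B ASC) index:
SELECT A, B FROM YourTable ORDER BY A ASC, B ASC;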
So, to sum up: you can usually delete the first index, but that depends on your table configuration and your queries (and, of course, your indexes).
I would typically find this kind of "almost" duplicate index in tables that contain historical data. If column C is a date or integer column, be careful. It is most likely there to satisfy a MAX lookup, as in WHERE tblA.C = MAX(tblB.C), which skips the table altogether and uses an index-only access path.