Why isn't Postgres using the index? - sql

I have a table with an integer column called account_id. I have an index on that column.
But it seems Postgres doesn't want to use my index:
EXPLAIN ANALYZE SELECT "invoices".* FROM "invoices" WHERE "invoices"."account_id" = 1;
Seq Scan on invoices (cost=0.00..6504.61 rows=117654 width=186) (actual time=0.021..33.943 rows=118027 loops=1)
Filter: (account_id = 1)
Rows Removed by Filter: 51462
Total runtime: 39.917 ms
(4 rows)
Any idea why that would be?

Because of:
Seq Scan on invoices (...) (actual ... rows=118027 <— this
Filter: (account_id = 1)
Rows Removed by Filter: 51462 <— vs this
Total runtime: 39.917 ms
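The filter keeps 118027 rows and removes only 51462, so the query returns about 70% of the table. You're selecting so many rows that it's cheaper to read the entire table sequentially than to go through the index.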
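One way to check that this is a cost decision rather than a broken index is to discourage sequential scans for the current session and compare the two plans. A minimal sketch, reusing the query from the question; enable_seqscan is a standard planner setting, and this is meant for experimenting, not for production use:

```sql
-- Discourage sequential scans for this session only (diagnostic use).
SET enable_seqscan = off;

-- Re-run the query from the question; the planner should now pick the
-- index on account_id if it is usable at all.
EXPLAIN ANALYZE
SELECT "invoices".* FROM "invoices" WHERE "invoices"."account_id" = 1;

-- Restore the default planner behaviour.
RESET enable_seqscan;
```

If the forced index plan turns out slower than the sequential scan, the planner made the right call.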
Related earlier questions and answers from today for further reading:
Why doesn't Postgresql use index for IN query?
Postgres using wrong index when querying a view of indexed expressions?
(See also Craig's longer answer on the second one for additional notes on index subtleties.)

Related

PostgreSQL not using any index in regex search

I have the following SQL statement to filter data with a regex search:
select * from others.table
where vintage ~* '(17|18|19|20)[0-9]{2,}'
After some research, I found that I need to create a GIN/GiST index for better performance:
create index idx_vintage_gist on others.table using gist (vintage gist_trgm_ops);
create index idx_vintage_gin on others.table using gin (vintage gin_trgm_ops);
create index idx_vintage_varchar on others.table using btree (vintage varchar_pattern_ops);
Looking at the explain plan, it is not using any index but a seq scan:
Seq Scan on table t (cost=0.00..45412.25 rows=1070800 width=91) (actual time=0.038..8518.830 rows=1075980 loops=1)
Filter: (vintage ~* '(17|18|19|20)[0-9]{2,}'::text)
Rows Removed by Filter: 25400
Planning Time: 0.481 ms
Execution Time: 8767.998 ms
There are total 1101380 rows in the table.
My question is why is it not using any index for the regex search?
(Answer was in comments; posting as community wiki.)
From the execution plan, 1070800 rows were expected to be returned, which is 1070800/1101380 ≈ 97.2% of the table. With so much of the table being in the results, using an index wouldn't be advantageous, so a sequential scan is performed.
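To check the planner's estimate against the data, the matching fraction can be measured directly. A rough sketch using the table and pattern from the question; the FILTER clause assumes PostgreSQL 9.4 or later, and the query itself has to scan the whole table:

```sql
-- Count how many rows match the regex versus the total row count.
SELECT count(*) FILTER (WHERE vintage ~* '(17|18|19|20)[0-9]{2,}') AS matching_rows,
       count(*) AS total_rows,
       -- nullif avoids a division-by-zero error on an empty table
       round(100.0 * count(*) FILTER (WHERE vintage ~* '(17|18|19|20)[0-9]{2,}')
             / nullif(count(*), 0), 1) AS matching_pct
FROM others.table;
```

A percentage close to the 97.2% derived from the plan would confirm that the estimate is accurate and the sequential scan is a sensible choice.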

Why is this SQL statement so fast?

The table in question has 3.8M records. The data column is indexed on a different field: "idx2_table_on_data_id" btree ((data ->> 'id'::text)). I assumed the sequential scan would be very slow but it is completing in just over 1 second. data->'array' does not exist in many of the records, fyi. Why is this running so quickly? Postgres v10
db=> explain analyze select * from table where jsonb_array_length(data->'array') != 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on table (cost=0.00..264605.21 rows=3797785 width=681) (actual time=0.090..1189.997 rows=1762 loops=1)
Filter: (jsonb_array_length((data -> 'array'::text)) <> 0)
Rows Removed by Filter: 3818154
Planning time: 0.561 ms
Execution time: 1190.492 ms
(5 rows)
We could tell for sure if you had run EXPLAIN (ANALYZE, BUFFERS), but odds are that most of the data were cached in RAM.
Also jsonb_array_length(data->'array') is not terribly expensive if the JSON is short.
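A sketch of the check suggested above. The table name my_table is a placeholder (the table name in the question was anonymized); everything else is taken from the question:

```sql
-- BUFFERS adds block-level I/O counters to each plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM my_table  -- placeholder name, not from the original question
WHERE jsonb_array_length(data->'array') != 0;
```

In the output, "shared hit" counts blocks found in PostgreSQL's shared buffer cache, and "read" counts blocks that had to be requested from the operating system (which may still serve them from its own cache). A plan dominated by hits was served from memory.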

Prevent usage of index for a particular query in Postgres

I have a slow query in a Postgres DB. Using EXPLAIN ANALYZE, I can see that Postgres performs a bitmap index scan on two different indexes, followed by a BitmapAnd of the two resulting sets.
Deleting one of the indexes makes the evaluation ten times faster (bitmap index scan is still used on the first index). However, that deleted index is useful in other queries.
Query:
select
booking_id
from
booking
where
substitute_confirmation_token is null
and date_trunc('day', from_time) >= cast('01/25/2016 14:23:00.004' as date)
and from_time >= '01/25/2016 14:23:00.004'
and type = 'LESSON_SUBSTITUTE'
and valid
order by
booking_id;
Indexes:
"idx_booking_lesson_substitute_day" btree (date_trunc('day'::text, from_time)) WHERE valid AND type::text = 'LESSON_SUBSTITUTE'::text
"booking_substitute_confirmation_token_key" UNIQUE CONSTRAINT, btree (substitute_confirmation_token)
Query plan:
Sort (cost=287.26..287.26 rows=1 width=8) (actual time=711.371..711.377 rows=44 loops=1)
Sort Key: booking_id
Sort Method: quicksort Memory: 27kB
Buffers: shared hit=8 read=7437 written=1
-> Bitmap Heap Scan on booking (cost=275.25..287.25 rows=1 width=8) (actual time=711.255..711.294 rows=44 loops=1)
Recheck Cond: ((date_trunc('day'::text, from_time) >= '2016-01-25'::date) AND valid AND ((type)::text = 'LESSON_SUBSTITUTE'::text) AND (substitute_confirmation_token IS NULL))
Filter: (from_time >= '2016-01-25 14:23:00.004'::timestamp without time zone)
Buffers: shared hit=5 read=7437 written=1
-> BitmapAnd (cost=275.25..275.25 rows=3 width=0) (actual time=711.224..711.224 rows=0 loops=1)
Buffers: shared hit=5 read=7433 written=1
-> Bitmap Index Scan on idx_booking_lesson_substitute_day (cost=0.00..20.50 rows=594 width=0) (actual time=0.080..0.080 rows=72 loops=1)
Index Cond: (date_trunc('day'::text, from_time) >= '2016-01-25'::date)
Buffers: shared hit=5 read=1
-> Bitmap Index Scan on booking_substitute_confirmation_token_key (cost=0.00..254.50 rows=13594 width=0) (actual time=711.102..711.102 rows=2718734 loops=1)
Index Cond: (substitute_confirmation_token IS NULL)
Buffers: shared read=7432 written=1
Total runtime: 711.436 ms
Can I prevent using a particular index for a particular query in Postgres?
Your clever solution
You already found a clever solution for your particular case: A partial unique index that only covers rare values, so Postgres won't (can't) use the index for the common NULL value.
CREATE UNIQUE INDEX booking_substitute_confirmation_uni
ON booking (substitute_confirmation_token)
WHERE substitute_confirmation_token IS NOT NULL;
It's a textbook use-case for a partial index. Literally! The manual has a similar example and this perfectly matching advice to go with it:
Finally, a partial index can also be used to override the system's
query plan choices. Also, data sets with peculiar distributions might
cause the system to use an index when it really should not. In that
case the index can be set up so that it is not available for the
offending query. Normally, PostgreSQL makes reasonable choices about
index usage (e.g., it avoids them when retrieving common values, so
the earlier example really only saves index size, it is not required
to avoid index usage), and grossly incorrect plan choices are cause
for a bug report.
Keep in mind that setting up a partial index indicates that you know
at least as much as the query planner knows, in particular you know
when an index might be profitable. Forming this knowledge requires
experience and understanding of how indexes in PostgreSQL work. In
most cases, the advantage of a partial index over a regular index will
be minimal.
You commented: "The table has a few million rows and just a few thousand rows with non-null values", so this is a perfect use-case. It will even speed up queries on non-null values for substitute_confirmation_token because the index is much smaller now.
Answer to question
To answer your original question: it's not possible to "disable" an existing index for a particular query. You would have to drop it, but that's way too expensive.
Fake drop index
You could drop an index inside a transaction, run your SELECT and then, instead of committing, use ROLLBACK. That's fast, but be aware that (per documentation):
A normal DROP INDEX acquires exclusive lock on the table, blocking
other accesses until the index drop can be completed.
So this is no good for multi-user environments.
BEGIN;
DROP INDEX big_user_id_created_at_idx;
SELECT ...;
ROLLBACK; -- so the index is preserved after all
More detailed statistics
Normally, though, it should be enough to raise the STATISTICS target for the column, so Postgres can more reliably identify common values and avoid the index for those. Try:
ALTER TABLE booking ALTER COLUMN substitute_confirmation_token SET STATISTICS 2000;
Then: ANALYZE booking; before you try your query again. 2000 is an example value. Related:
Keep PostgreSQL from sometimes choosing a bad query plan
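To see what the planner currently believes about the column, the collected statistics can be inspected in the pg_stats view. A small sketch, assuming the table and column names from the question:

```sql
-- null_frac is the estimated fraction of NULL entries in the column.
SELECT null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'booking'
  AND attname = 'substitute_confirmation_token';
```

The plan in the question estimated 13594 rows for the IS NULL condition while the index scan actually returned 2718734, so a null_frac far below the real share of NULLs would be consistent with stale or too-coarse statistics.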

Why does this SUM() function take so long in PostgreSQL?

This is my query:
SELECT SUM(amount) FROM bill WHERE name = 'peter'
There are 800K+ rows in the table. EXPLAIN ANALYZE says:
Aggregate (cost=288570.06..288570.07 rows=1 width=4) (actual time=537213.327..537213.328 rows=1 loops=1)
-> Seq Scan on bill (cost=0.00..288320.94 rows=498251 width=4) (actual time=48385.201..535941.041 rows=800947 loops=1)
Filter: ((name)::text = 'peter'::text)
Rows Removed by Filter: 8
Total runtime: 537213.381 ms
All rows are affected, and this is correct. But why so long? A similar query without WHERE runs way faster:
EXPLAIN ANALYZE SELECT SUM(amount) FROM bill
Aggregate (cost=137523.31..137523.31 rows=1 width=4) (actual time=2198.663..2198.664 rows=1 loops=1)
-> Index Only Scan using idx_amount on bill (cost=0.00..137274.17 rows=498268 width=4) (actual time=0.032..1223.512 rows=800955 loops=1)
Heap Fetches: 533399
Total runtime: 2198.717 ms
I have an index on amount and an index on name. Have I missed any indexes?
P.S. I managed to solve the problem just by adding a new index ON bill(name, amount). I didn't get why it helped, so let's leave the question open for some time...
Since you are searching for a specific name, you should have an index that has name as the first column, e.g. CREATE INDEX IX_bill_name ON bill( name ).
But Postgres can still opt to do a full table scan if it estimates your index to not be specific enough, i.e. if it thinks it is faster to just scan all rows and pick the matching ones instead of consulting an index and start jumping around in the table to gather the matching rows. Postgres uses a cost-based estimation technique that weights random disk reads to be more expensive than sequential reads.
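The relative weights mentioned above are ordinary configuration settings. A quick sketch for inspecting them; the values in the comments are the stock defaults, and a given installation may differ:

```sql
SHOW seq_page_cost;     -- cost of a sequentially fetched page (default 1.0)
SHOW random_page_cost;  -- cost of a non-sequentially fetched page (default 4.0)
```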
As a rough rule of thumb, for an index to actually be used in your situation, no more than about 10% of the rows should match what you are searching for; the exact threshold depends on the planner's cost settings and on how the rows are laid out on disk. Since nearly all of your rows (800947 of 800955) have name = 'peter', it is actually faster to do a full table scan.
The SUM without filtering runs faster because of the overall width of the table. With a WHERE clause, Postgres has to sequentially read all rows in the table so it can disregard those that do not match the filter. Without a WHERE clause, Postgres can instead read all the amounts from the index. Because the index on amount contains the amounts and pointers to the corresponding rows, but no other data from the table, there is simply less data to wade through. Based on the big difference in performance, I guess you have quite a lot of other fields in your table.
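The same reasoning suggests why the index on bill(name, amount) mentioned in the question helped: it contains every column the query touches, so Postgres can in principle answer the query from the index alone. A sketch of that setup; the index name is hypothetical, and the VACUUM step is an assumption based on the "Heap Fetches: 533399" line in the question, which indicates many heap visits despite the index-only scan:

```sql
-- Index containing both the filter column and the aggregated column.
CREATE INDEX idx_bill_name_amount ON bill (name, amount);  -- hypothetical name

-- Refresh the visibility map and statistics so that an index-only scan
-- can skip more heap visits.
VACUUM ANALYZE bill;

EXPLAIN ANALYZE
SELECT SUM(amount) FROM bill WHERE name = 'peter';
```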

Indexing a column used to ORDER BY with a constraint in PostgreSQL

I've got a modest table of about 10k rows that is often sorted by a column called 'name'. So, I added an index on this column. Now selects on it are fast:
EXPLAIN ANALYZE SELECT * FROM crm_venue ORDER BY name ASC LIMIT 10;
...query plan...
Limit (cost=0.00..1.22 rows=10 width=154) (actual time=0.029..0.065 rows=10 loops=1)
-> Index Scan using crm_venue_name on crm_venue (cost=0.00..1317.73 rows=10768 width=154) (actual time=0.026..0.050 rows=10 loops=1)
Total runtime: 0.130 ms
If I increase the LIMIT to 60 (which is roughly what I use in the application) the total runtime doesn't increase much further.
Since I'm using a "logical delete pattern" on this table, I only consider entries where delete_date is NULL. So this is a common select I make:
SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 10;
To make that query snappy as well I put an index on the name column with a constraint like this:
CREATE INDEX name_delete_date_null ON crm_venue (name) WHERE delete_date IS NULL;
Now it's fast to do the ordering with the logical delete constraint:
EXPLAIN ANALYZE SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 10;
Limit (cost=0.00..84.93 rows=10 width=154) (actual time=0.020..0.039 rows=10 loops=1)
-> Index Scan using name_delete_date_null on crm_venue (cost=0.00..458.62 rows=54 width=154) (actual time=0.018..0.033 rows=10 loops=1)
Total runtime: 0.076 ms
Awesome! But this is where I get myself into trouble. The application rarely calls for the first 10 rows. So, let's select some more rows:
EXPLAIN ANALYZE SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 20;
Limit (cost=135.81..135.86 rows=20 width=154) (actual time=18.171..18.189 rows=20 loops=1)
-> Sort (cost=135.81..135.94 rows=54 width=154) (actual time=18.168..18.173 rows=20 loops=1)
Sort Key: name
Sort Method: top-N heapsort Memory: 21kB
-> Bitmap Heap Scan on crm_venue (cost=4.67..134.37 rows=54 width=154) (actual time=2.355..8.126 rows=10768 loops=1)
Recheck Cond: (delete_date IS NULL)
-> Bitmap Index Scan on crm_venue_delete_date_null_idx (cost=0.00..4.66 rows=54 width=0) (actual time=2.270..2.270 rows=10768 loops=1)
Index Cond: (delete_date IS NULL)
Total runtime: 18.278 ms
As you can see it goes from 0.1 ms to 18!!
Clearly what happens is that there's a point where the ordering can no longer use the index to run the sort. I noticed that as I increase the LIMIT number from 20 to higher numbers it always takes around 20-25 ms.
Am I doing it wrong or is this a limitation of PostgreSQL? What is the best way to set up indexes for this type of query?
My guess is this: logically, the index consists of pointers to rows on a set of data pages. If Postgres fetches a page that is known to contain ONLY "deleted" records, it doesn't have to recheck the rows on that page after fetching it.
Therefore, it may be that when you use LIMIT 10 and order by name, the first 10 entries that come back from the index are all on a data page (or pages) that contain only deleted records. Since it knows that these pages are homogeneous, it doesn't have to recheck them once it has fetched them from disk. Once you increase to LIMIT 20, at least one of the first 20 is on a mixed page with non-deleted records. This would then force the executor to recheck each record, since it can't fetch data in increments smaller than one page from either the disk or the cache.
As an experiment, create an index on (delete_date, name) and issue the command CLUSTER crm_venue USING index_name, where index_name is the name of your new index. This should rebuild the table in the sort order of delete_date, then name. Just to be super-sure, you should then issue a REINDEX TABLE crm_venue. Now try the query again. Since all the NOT NULLs will be clustered together on disk, it may work faster with the larger LIMIT values.
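A sketch of that experiment. The index name is hypothetical; CLUSTER ... USING is the current spelling of the command, and it takes an exclusive lock on the table while rewriting it, so this is something to try on a test copy first:

```sql
-- Hypothetical index name; columns as suggested above.
CREATE INDEX crm_venue_delete_date_name_idx ON crm_venue (delete_date, name);

-- Physically rewrite the table in index order (blocks other access while it runs).
CLUSTER crm_venue USING crm_venue_delete_date_name_idx;

-- CLUSTER rebuilds the table's indexes; refresh planner statistics afterwards.
ANALYZE crm_venue;

EXPLAIN ANALYZE
SELECT * FROM crm_venue WHERE delete_date IS NULL ORDER BY name ASC LIMIT 20;
```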
Of course, this is all just off-the-cuff theory, so YMMV...
As you increase the number of rows, the index cardinality changes. I am not sure, but it could be that because it is using a greater number of rows from the table, it will need to read enough table blocks that those plus the index blocks are enough to make the index no longer make sense to use. This may be a miscalculation by the planner. Also your name (the field indexed) is not the field limiting the scope of the index which may be wreaking havoc with the planner math.
Things to try:
Increase the percentage of the table considered when building your statistics, your data may be skewed in such a way that the statistics are not picking up a true representative sample.
Index all rows, not just the NULL rows, and see which is better. You could even try indexing only the rows where delete_date is NOT NULL.
Cluster based on an index on that field to reduce the data blocks required and turn it into a range scan.
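The first suggestion can be tried as follows. The plans in the question estimate rows=54 for delete_date IS NULL while 10768 rows actually match, which points at stale or missing statistics. A minimal sketch; the target value 1000 is only an example:

```sql
-- Refresh statistics first; this alone may fix the row estimate.
ANALYZE crm_venue;

-- Optionally sample the column more thoroughly, then re-analyze.
ALTER TABLE crm_venue ALTER COLUMN delete_date SET STATISTICS 1000;
ANALYZE crm_venue;
```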
Nulls and indexes do not always play nice. Try another way:
alter table crm_venue add column delete_flag char(1);
update crm_venue set delete_flag = 'Y' where delete_date is not null;
update crm_venue set delete_flag = 'N' where delete_date is null;
-- partial index covering only the non-deleted rows
create index deleted_venue on crm_venue (delete_flag) where delete_flag = 'N';
-- fetch non-deleted rows, matching the partial index predicate
SELECT * FROM crm_venue WHERE delete_flag = 'N' ORDER BY name ASC LIMIT 20;