Sequential scan for column indexed with varchar_pattern_ops

I have a table Users that contains a location column, which I have indexed using varchar_pattern_ops. But when I check the query plan, it tells me it is doing a sequential scan.
EXPLAIN ANALYZE
SELECT * FROM USERS
WHERE lower(location) like '%nepa%'
ORDER BY location desc;
It gives the following result:
Sort (cost=12.41..12.42 rows=1 width=451) (actual time=0.084..0.087 rows=8 loops=1)
Sort Key: location
Sort Method: quicksort Memory: 27kB
-> Seq Scan on users (cost=0.00..12.40 rows=1 width=451) (actual time=0.029..0.051 rows=8 loops=1)
Filter: (lower((location)::text) ~~ '%nepa%'::text)
Planning time: 0.211 ms
Execution time: 0.147 ms
I have searched through Stack Overflow. Most answers say something like "Postgres performs a sequential scan on a large table when an index scan would be slower." But my table is not big either.
The index in my users table is:
"index_users_on_lower_location_varchar_pattern_ops" btree (lower(location::text) varchar_pattern_ops)
What is going on?

*_pattern_ops indexes are good for prefix matching: LIKE patterns anchored to the start, without a leading wildcard. But not for your predicate:
WHERE lower(location) like '%nepa%'
I suggest you create a trigram index instead. And you do not need lower() in the index (or query), since trigram indexes support case-insensitive ILIKE (or ~*) at practically the same cost.
Follow instructions here:
PostgreSQL LIKE query performance variations
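A minimal sketch of the trigram approach, assuming the users table from the question (the index name is made up):
CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- once per database
CREATE INDEX users_location_trgm_idx ON users USING gin (location gin_trgm_ops);
-- case-insensitive search, no lower() needed:
SELECT * FROM users WHERE location ILIKE '%nepa%' ORDER BY location DESC;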
Also:
But my table is not big either.
You seem to have that backwards. If your table is small, it may be faster for Postgres to just read it sequentially and not bother with indexes; you would not create any indexes for this at all. The tipping point depends on many factors.
Aside: your index definition does not make sense to begin with:
(lower(location::text) varchar_pattern_ops)
For a varchar column, use the varchar_pattern_ops operator class.
But if you cast to text, use text_pattern_ops. Since lower() returns text even for varchar input, text_pattern_ops is the one that fits here. Except that you probably do not need this (or any?) index at all, as advised.
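If you do keep a prefix-matching index, a corrected definition would look something like this (a sketch; it only helps left-anchored patterns such as LIKE 'nepa%'):
CREATE INDEX index_users_on_lower_location ON users (lower(location) text_pattern_ops);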

Related

Prevent usage of index for a particular query in Postgres

I have a slow query in a Postgres DB. Using explain analyze, I can see that Postgres makes bitmap index scan on two different indexes followed by bitmap AND on the two resulting sets.
Deleting one of the indexes makes the evaluation ten times faster (bitmap index scan is still used on the first index). However, that deleted index is useful in other queries.
Query:
select
booking_id
from
booking
where
substitute_confirmation_token is null
and date_trunc('day', from_time) >= cast('01/25/2016 14:23:00.004' as date)
and from_time >= '01/25/2016 14:23:00.004'
and type = 'LESSON_SUBSTITUTE'
and valid
order by
booking_id;
Indexes:
"idx_booking_lesson_substitute_day" btree (date_trunc('day'::text, from_time)) WHERE valid AND type::text = 'LESSON_SUBSTITUTE'::text
"booking_substitute_confirmation_token_key" UNIQUE CONSTRAINT, btree (substitute_confirmation_token)
Query plan:
Sort (cost=287.26..287.26 rows=1 width=8) (actual time=711.371..711.377 rows=44 loops=1)
Sort Key: booking_id
Sort Method: quicksort Memory: 27kB
Buffers: shared hit=8 read=7437 written=1
-> Bitmap Heap Scan on booking (cost=275.25..287.25 rows=1 width=8) (actual time=711.255..711.294 rows=44 loops=1)
Recheck Cond: ((date_trunc('day'::text, from_time) >= '2016-01-25'::date) AND valid AND ((type)::text = 'LESSON_SUBSTITUTE'::text) AND (substitute_confirmation_token IS NULL))
Filter: (from_time >= '2016-01-25 14:23:00.004'::timestamp without time zone)
Buffers: shared hit=5 read=7437 written=1
-> BitmapAnd (cost=275.25..275.25 rows=3 width=0) (actual time=711.224..711.224 rows=0 loops=1)
Buffers: shared hit=5 read=7433 written=1
-> Bitmap Index Scan on idx_booking_lesson_substitute_day (cost=0.00..20.50 rows=594 width=0) (actual time=0.080..0.080 rows=72 loops=1)
Index Cond: (date_trunc('day'::text, from_time) >= '2016-01-25'::date)
Buffers: shared hit=5 read=1
-> Bitmap Index Scan on booking_substitute_confirmation_token_key (cost=0.00..254.50 rows=13594 width=0) (actual time=711.102..711.102 rows=2718734 loops=1)
Index Cond: (substitute_confirmation_token IS NULL)
Buffers: shared read=7432 written=1
Total runtime: 711.436 ms
Can I prevent using a particular index for a particular query in Postgres?
Your clever solution
You already found a clever solution for your particular case: A partial unique index that only covers rare values, so Postgres won't (can't) use the index for the common NULL value.
CREATE UNIQUE INDEX booking_substitute_confirmation_uni
ON booking (substitute_confirmation_token)
WHERE substitute_confirmation_token IS NOT NULL;
It's a textbook use case for a partial index. Literally! The manual has a similar example and this perfectly matching advice to go with it:
Finally, a partial index can also be used to override the system's
query plan choices. Also, data sets with peculiar distributions might
cause the system to use an index when it really should not. In that
case the index can be set up so that it is not available for the
offending query. Normally, PostgreSQL makes reasonable choices about
index usage (e.g., it avoids them when retrieving common values, so
the earlier example really only saves index size, it is not required
to avoid index usage), and grossly incorrect plan choices are cause
for a bug report.
Keep in mind that setting up a partial index indicates that you know
at least as much as the query planner knows, in particular you know
when an index might be profitable. Forming this knowledge requires
experience and understanding of how indexes in PostgreSQL work. In
most cases, the advantage of a partial index over a regular index will
be minimal.
You commented: The table has few millions of rows and just few thousands of rows with not null values, so this is a perfect use-case. It will even speed up queries on non-null values for substitute_confirmation_token because the index is much smaller now.
Answer to question
To answer your original question: no, it's not possible to "disable" an existing index for a particular query. You would have to drop it, but that's far too expensive.
Fake drop index
You could drop an index inside a transaction, run your SELECT and then, instead of committing, use ROLLBACK. That's fast, but be aware that (per documentation):
A normal DROP INDEX acquires exclusive lock on the table, blocking
other accesses until the index drop can be completed.
So this is no good for multi-user environments.
BEGIN;
DROP INDEX big_user_id_created_at_idx;
SELECT ...;
ROLLBACK; -- so the index is preserved after all
More detailed statistics
Normally, though, it should be enough to raise the STATISTICS target for the column, so Postgres can more reliably identify common values and avoid the index for those. Try:
ALTER TABLE booking ALTER COLUMN substitute_confirmation_token SET STATISTICS 2000;
Then run ANALYZE booking; before you try your query again. 2000 is an example value. Related:
Keep PostgreSQL from sometimes choosing a bad query plan

Why does this simple query not use the index in postgres?

In my PostgreSQL database I have a table named "product". In this table I have a column named "date_touched" with type timestamp. I created a simple btree index on this column. This is the schema of my table (I omitted irrelevant column & index definitions):
Table "public.product"
Column | Type | Modifiers
---------------------------+--------------------------+-------------------
id | integer | not null default nextval('product_id_seq'::regclass)
date_touched | timestamp with time zone | not null
Indexes:
"product_pkey" PRIMARY KEY, btree (id)
"product_date_touched_59b16cfb121e9f06_uniq" btree (date_touched)
The table has ~300,000 rows and I want to get the n-th element from the table ordered by "date_touched". When I want to get the 1,000th element, it takes 0.2s, but when I want to get the 100,000th element, it takes about 6s. My question is: why does it take so much time to retrieve the 100,000th element, even though I've defined a btree index?
Here are my queries with EXPLAIN ANALYZE, showing that for the second one PostgreSQL does not use the btree index and instead sorts all rows to find the 100,000th element:
first query (1,000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Limit (cost=3035.26..3038.29 rows=1 width=12) (actual time=160.208..160.209 rows=1 loops=1)
-> Index Scan using product_date_touched_59b16cfb121e9f06_uniq on product (cost=0.42..1000880.59 rows=329797 width=12) (actual time=16.651..159.766 rows=1001 loops=1)
Total runtime: 160.395 ms
second query (100,000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit (cost=106392.87..106392.88 rows=1 width=12) (actual time=6621.947..6621.950 rows=1 loops=1)
-> Sort (cost=106142.87..106967.37 rows=329797 width=12) (actual time=6381.174..6568.802 rows=100001 loops=1)
Sort Key: date_touched
Sort Method: external merge Disk: 8376kB
-> Seq Scan on product (cost=0.00..64637.97 rows=329797 width=12) (actual time=1.357..4184.115 rows=329613 loops=1)
Total runtime: 6629.903 ms
It is actually a good thing that a SeqScan is used here. Your OFFSET 100000 is a bad fit for an IndexScan.
A bit of theory
Btree indexes contain two structures inside:
a balanced tree, and
a doubly-linked list of keys.
The first structure allows fast key lookups; the second is responsible for the ordering. For bigger tables, the linked list cannot fit into a single page, so it becomes a chain of linked pages, where each page's entries maintain the ordering specified during index creation.
It is wrong to think, though, that such pages sit together on the disk. In fact, it is more likely that they are spread across different locations. And in order to read pages in the index's order, the system has to perform random disk reads. Random disk IO is expensive compared to sequential access, therefore a good optimizer will prefer a SeqScan instead.
I highly recommend “SQL Performance Explained” book to better understand indexes. It is also available on-line.
What is going on?
Your OFFSET clause would cause the database to walk the index's linked list of keys (incurring lots of random disk reads) and then discard all those results until it hits the wanted offset. It is good, in fact, that Postgres decided to use SeqScan + Sort here: it should be faster.
You can check this assumption as sketched below:
run EXPLAIN (analyze, buffers) on your big-OFFSET query,
then do SET enable_seqscan TO 'off';
and run EXPLAIN (analyze, buffers) again, comparing the results.
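A sketch of that check, using the big-OFFSET query from the question (enable_seqscan only affects the current session, but remember to reset it):
EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id FROM product ORDER BY product.date_touched ASC LIMIT 1 OFFSET 100000;

SET enable_seqscan TO 'off';  -- discourage the planner from sequential scans

EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id FROM product ORDER BY product.date_touched ASC LIMIT 1 OFFSET 100000;

RESET enable_seqscan;  -- restore the default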
In general, it is better to avoid OFFSET, as DBMSes do not always pick the right approach here. (BTW, which version of PostgreSQL are you using?)
Here's a comparison of how it performs for different offset values.
EDIT: In order to avoid OFFSET, one has to base pagination on real data that exists in the table and is part of the index. For this particular case, the following might be possible:
show the first N (say, 20) elements
include the maximal date_touched shown on the page in all the “Next” links. You can compute this value on the application side. Do similarly for the “Previous” links, except use the minimal date_touched there.
on the server side you then have the limiting value, so for the “Next” case you can run a query like this:
SELECT id
FROM product
WHERE date_touched > $max_date_seen_on_the_page
ORDER BY date_touched ASC
LIMIT 20;
This query makes best use of the index.
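For completeness, the “Previous” direction might look like this (a sketch; $min_date_seen_on_the_page is a hypothetical application-side parameter, and the rows would need re-sorting ascending before display):
SELECT id
FROM product
WHERE date_touched < $min_date_seen_on_the_page
ORDER BY date_touched DESC
LIMIT 20;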
Of course, you can adjust this example to your needs. I used pagination as it is a typical case for the OFFSET.
One more note: querying one row many times, increasing the offset by 1 for each query, will be much more time-consuming than a single batch query that returns all those records, which are then iterated over on the application side.

Effectively query on column that includes a substring

Given a string column with a value similar to /123/12/34/56/5/, what is the optimal way of querying for all the records that include the given number (12 for example)?
The solution from top of my head is:
SELECT id FROM things WHERE things.path LIKE '%/12/%'
But AFAIK this query can't use indexes on the column due to the leading %.
There must be something better. What is it?
Using PostgreSQL, but would prefer the solution that would work across other DBs too.
If you're happy turning that column into an array of integers, like:
'/123/12/34/56/5/' becomes ARRAY[123,12,34,56,5]
so that path_arr is a column of type INTEGER[], then you can create a GIN index on that column:
CREATE INDEX ON things USING gin(path_arr);
A query for all items containing 12 then becomes:
SELECT * FROM things WHERE ARRAY[12] <@ path_arr;
This will use the index. In my test (with a million rows), I get plans like:
EXPLAIN SELECT * FROM things WHERE ARRAY[12] <@ path_arr;
QUERY PLAN
----------------------------------------------------------------------------------------
Bitmap Heap Scan on things (cost=5915.75..9216.99 rows=1000 width=92)
Recheck Cond: (path_arr @> '{12}'::integer[])
-> Bitmap Index Scan on things_path_arr_idx (cost=0.00..5915.50 rows=1000 width=0)
Index Cond: ('{12}'::integer[] <@ path_arr)
(4 rows)
In PostgreSQL 9.1 you could utilize the pg_trgm module and build a GIN index with it.
CREATE EXTENSION pg_trgm; -- once per database
CREATE INDEX things_path_trgm_gin_idx ON things USING gin (path gin_trgm_ops);
Your LIKE expression can use this index even if it is not left-anchored.
See a detailed demo by depesz here.
Normalize it if you can, though.
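A sketch of what normalizing could look like, with one row per path element (all names here are hypothetical):
CREATE TABLE thing_path_elements (
    thing_id integer NOT NULL REFERENCES things (id),
    element  integer NOT NULL
);
CREATE INDEX ON thing_path_elements (element);

-- all things whose path contains 12, via a plain btree index:
SELECT DISTINCT thing_id FROM thing_path_elements WHERE element = 12;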

How to increase query speed without using full-text search?

This is my simple query; by searching for selectnothing I'm sure I'll have no hits.
SELECT nome_t FROM myTable WHERE nome_t ILIKE '%selectnothing%';
This is the EXPLAIN ANALYZE VERBOSE output:
Seq Scan on myTable (cost=0.00..15259.04 rows=37 width=29) (actual time=2153.061..2153.061 rows=0 loops=1)
Output: nome_t
Filter: (nome_t ~~* '%selectnothing%'::text)
Total runtime: 2153.116 ms
myTable has around 350k rows and the table definition is something like:
CREATE TABLE myTable (
nome_t text NOT NULL
);
I have an index on nome_t as stated below:
CREATE INDEX idx_m_nome_t ON myTable
USING btree (nome_t);
Although this is clearly a good candidate for Fulltext search I would like to rule that option out for now.
This query is meant to be run from a web application, and currently it's taking around 2 seconds, which is obviously too much.
Is there anything I can do, like using other index methods, to improve the speed of this query?
No, ILIKE '%selectnothing%' always needs a full table scan; every index is useless. You need full text search, and it's not that hard to implement.
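A minimal sketch of that, assuming the table from the question (note that full text search matches whole lexemes, not arbitrary substrings, so it is not a drop-in replacement for '%selectnothing%'):
CREATE INDEX idx_m_nome_t_fts ON myTable USING gin (to_tsvector('simple', nome_t));

-- the query must use the same expression the index was built on:
SELECT nome_t FROM myTable
WHERE to_tsvector('simple', nome_t) @@ plainto_tsquery('simple', 'selectnothing');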
Edit: You could use Wildspeed, which I had forgotten about. The indexes will be huge, but your performance will also be much better.
Wildspeed extension provides GIN index
support for wildcard search for LIKE
operator.
http://www.sai.msu.su/~megera/wiki/wildspeed
Another thing you can do is break the nome_t column out of myTable into its own table. Searching one column out of a wide table is slow (if there are fifty other wide columns) because the other data effectively slows down the scan against that column (there are fewer records per page/extent).
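A sketch of that split (hypothetical names; it assumes myTable has an id primary key):
CREATE TABLE myTable_names (
    id     integer PRIMARY KEY REFERENCES myTable (id),
    nome_t text NOT NULL
);
-- the scan now touches far fewer pages:
SELECT nome_t FROM myTable_names WHERE nome_t ILIKE '%selectnothing%';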

SQL indexes for "not equal" searches

A SQL index lets me quickly find a string that matches my query. Now I have to search a big table for the strings which do not match. Of course, the normal index does not help and I have to do a slow sequential scan:
essais=> \d phone_idx
Index "public.phone_idx"
Column | Type
--------+------
phone | text
btree, for table "public.phonespersons"
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone = '+33 1234567';
QUERY PLAN
-------------------------------------------------------------------------------
Index Scan using phone_idx on phonespersons (cost=0.00..8.41 rows=1 width=4)
Index Cond: (phone = '+33 1234567'::text)
(2 rows)
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone != '+33 1234567';
QUERY PLAN
----------------------------------------------------------------------
Seq Scan on phonespersons (cost=0.00..18621.00 rows=999999 width=4)
Filter: (phone <> '+33 1234567'::text)
(2 rows)
I understand (see Mark Byers' very good explanations) that PostgreSQL
can decide not to use an index when it sees that a sequential scan
would be faster (for instance if almost all the tuples match). But,
here, "not equal" searches are really slower.
Any way to make these "is not equal to" searches faster?
Here is another example, to address Mark Byers' excellent remarks. The
index is used for the '=' query (which returns the vast majority of
tuples) but not for the '!=' query:
essais=> \d tld_idx
Index "public.tld_idx"
Column | Type
-----------------+------
pg_expression_1 | text
btree, for table "public.emailspersons"
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) = 'fr';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using tld_idx on emailspersons (cost=0.25..4010.79 rows=97033 width=4) (actual time=0.137..261.123 rows=97110 loops=1)
Index Cond: (tld(email) = 'fr'::text)
Total runtime: 444.800 ms
(3 rows)
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) != 'fr';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on emailspersons (cost=0.00..27129.00 rows=2967 width=4) (actual time=1.004..1031.224 rows=2890 loops=1)
Filter: (tld(email) <> 'fr'::text)
Total runtime: 1037.278 ms
(3 rows)
DBMS is PostgreSQL 8.3 (but I can upgrade to 8.4).
Possibly it would help to write:
SELECT person FROM PhonesPersons WHERE phone < '+33 1234567'
UNION ALL
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
or simply
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
OR phone < '+33 1234567'
PostgreSQL should be able to determine that the selectivity of the range operation is very high and to consider using an index for it.
I don't think it can use an index directly to satisfy a not-equals predicate, although it would be nice if it could try re-writing the not-equals as above (if it helps) during planning. If it works, suggest it to the developers ;)
Rationale: searching an index for all values not equal to a certain one requires scanning the full index. By contrast, searching for all elements less than a certain key means finding the greatest non-matching item in the tree and scanning backwards; similarly, searching for all elements greater than a certain key means scanning in the opposite direction. These operations are easy to fulfil using b-tree structures.

Also, the statistics that PostgreSQL collects should point out that "+33 1234567" is a known frequent value: by subtracting the frequency of that value and of nulls from 1, we get the proportion of rows left to select, and the histogram bounds will indicate whether those are skewed to one side or not. If excluding nulls and that frequent value pushes the proportion of remaining rows low enough (I seem to recall about 20%), an index scan should be appropriate. Check the stats for the column in pg_stats to see what proportions were actually calculated.
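A sketch of that pg_stats check, using the phone column from the first example:
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'phonespersons' AND attname = 'phone';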
Update: I tried this on a local table with a vaguely similar distribution, and both forms of the above produced something other than a plain seq scan. The latter (using "OR") was a bitmap scan that may actually devolve to just being a seq scan if the bias towards your common value is particularly extreme... although the planner can see that, I don't think it will automatically rewrite to an "Append(Index Scan,Index Scan)" internally. Turning "enable_bitmapscan" off just made it revert to a seq scan.
PS: indexing a text column and using the inequality operators can be an issue if your database locale is not C. You may need to add an extra index that uses text_pattern_ops or varchar_pattern_ops; this is similar to the problem of indexing for column LIKE 'prefix%' predicates.
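For example (a sketch against the table from the question; the index name is made up):
CREATE INDEX phone_pattern_idx ON PhonesPersons (phone text_pattern_ops);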
Alternative: you could create a partial index:
CREATE INDEX PhonesPersonsOthers ON PhonesPersons(phone) WHERE phone <> '+33 1234567'
This will make the SELECT statement that uses <> scan just that partial index: since it excludes most of the entries in the table, the index should be small.
The database is able to use the index for this query, but it chooses not to because it would be slower. Update: this is not quite right; you have to rewrite the query slightly. See Araqnid's answer.
Your WHERE clause selects almost all rows in your table (rows=999999). The database can see that a table scan would be faster in this case and therefore ignores the index. It is faster because the person column is not in your index, so the database would have to make two lookups for each row: once in the index to check the WHERE clause, and then again in the main table to fetch the person column.
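On modern PostgreSQL (9.2 or later, which added index-only scans) a composite index could in principle serve both the filter and the person column without touching the heap; a hypothetical sketch, not something from the question:
CREATE INDEX phone_person_idx ON PhonesPersons (phone, person);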
If you had a different data distribution, where most values were foo and just a few were bar, and you said WHERE col <> 'foo', then it probably would use the index.
Any way to make these "is not equal to" searches faster?
Any query that selects almost 1 million rows is going to be slow. Try adding a LIMIT clause.
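For example (the cap of 100 is arbitrary):
SELECT person FROM PhonesPersons WHERE phone <> '+33 1234567' LIMIT 100;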