PostgreSQL not using any index in regex search - sql

I have the following SQL statement to filter data with a regex search:
select * from others.table
where vintage ~* '(17|18|19|20)[0-9]{2,}'
After some research, I found that I should create a GIN/GiST index for better performance:
create index idx_vintage_gist on others.table using gist (vintage gist_trgm_ops);
create index idx_vintage_gin on others.table using gin (vintage gin_trgm_ops);
create index idx_vintage_varchar on others.table using btree (vintage varchar_pattern_ops);
Looking at the explain plan, it is not using any index but a seq scan:
Seq Scan on table t (cost=0.00..45412.25 rows=1070800 width=91) (actual time=0.038..8518.830 rows=1075980 loops=1)
Filter: (vintage ~* '(17|18|19|20)[0-9]{2,}'::text)
Rows Removed by Filter: 25400
Planning Time: 0.481 ms
Execution Time: 8767.998 ms
There are 1101380 rows in total in the table.
My question is why is it not using any index for the regex search?

(Answer was in comments; posting as community wiki.)
From the execution plan, 1070800 rows were expected to be returned, which is 1070800/1101380 ≈ 97.2% of the table. With so much of the table being in the results, using an index wouldn't be advantageous, so a sequential scan is performed.
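To see the index kick in, try a predicate that matches only a small fraction of the table. A hedged sketch, assuming the trigram indexes from the question exist and `pg_trgm` is installed (the pattern below is purely illustrative):

```sql
-- A far more selective pattern should switch the plan from a
-- Seq Scan to a Bitmap Index Scan on the trigram index.
EXPLAIN ANALYZE
SELECT * FROM others.table
WHERE vintage ~* '1753[0-9]{2,}';

-- Compare the planner's row estimate for the original predicate
-- (no need to execute it):
EXPLAIN
SELECT * FROM others.table
WHERE vintage ~* '(17|18|19|20)[0-9]{2,}';
```

If the estimated rows are a large share of the table, a sequential scan will stay cheaper no matter which indexes exist.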

Related

Sequential scan for column indexed with varchar_pattern_ops

I have a table Users and it contains a location column. I have indexed the location column using varchar_pattern_ops. But when I run the query planner it tells me it is doing a sequential scan.
EXPLAIN ANALYZE
SELECT * FROM USERS
WHERE lower(location) like '%nepa%'
ORDER BY location desc;
It gives following result:
Sort (cost=12.41..12.42 rows=1 width=451) (actual time=0.084..0.087 rows=8 loops=1)
Sort Key: location
Sort Method: quicksort Memory: 27kB
-> Seq Scan on users (cost=0.00..12.40 rows=1 width=451) (actual time=0.029..0.051 rows=8 loops=1)
Filter: (lower((location)::text) ~~ '%nepa%'::text)
Planning time: 0.211 ms
Execution time: 0.147 ms
I have searched through Stack Overflow. Most answers say something like "Postgres performs a sequential scan on a large table when an index scan would be slower". But my table is not big either.
The index in my users table is:
"index_users_on_lower_location_varchar_pattern_ops" btree (lower(location::text) varchar_pattern_ops)
What is going on?
The *_pattern_ops operator classes are good for prefix matching, i.e. LIKE patterns anchored to the start without a leading wildcard. But not for your predicate:
WHERE lower(location) like '%nepa%'
I suggest you create a trigram index instead. And you do not need lower() in the index (or the query), since trigram indexes support case-insensitive ILIKE (or ~*) at practically the same cost.
Follow instructions here:
PostgreSQL LIKE query performance variations
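For example, a sketch of the trigram setup (the index name is illustrative):

```sql
-- pg_trgm provides trigram operator classes that support
-- LIKE/ILIKE and regex matching with non-anchored patterns
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index on the raw column; no lower() wrapper needed
CREATE INDEX index_users_on_location_trgm
  ON users USING gin (location gin_trgm_ops);

-- The query can then use ILIKE for case-insensitive matching:
SELECT * FROM users
WHERE location ILIKE '%nepa%'
ORDER BY location DESC;
```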
Also:
But my table is not big either.
You seem to have that backwards. If your table is not big enough, it may be faster for Postgres to just read it sequentially and not bother with indexes. You would not create any indexes for this at all. The tipping point depends on many factors.
Aside: your index definition does not make sense to begin with:
(lower(location::text) varchar_pattern_ops)
For a varchar column, use the varchar_pattern_ops operator class. But if you cast to text, use text_pattern_ops. Since lower() returns text even for varchar input, text_pattern_ops is the right choice here. Except that you probably do not need this (or any) index at all, as advised.
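If you do want a prefix-matching index on that expression, a sketch of a consistent definition (only useful for left-anchored patterns):

```sql
-- lower() returns text, so pair the expression with text_pattern_ops
CREATE INDEX index_users_on_lower_location
  ON users (lower(location::text) text_pattern_ops);

-- This can serve left-anchored patterns such as:
-- WHERE lower(location) LIKE 'nepa%'
-- but never '%nepa%' with a leading wildcard.
```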

PostgreSQL index for jsonb #> search

I have the following query:
SELECT "survey_results".* FROM "survey_results" WHERE (raw #> '{"client":{"token":"test_token"}}');
EXPLAIN ANALYZE returns following results:
Seq Scan on survey_results (cost=0.00..352.68 rows=2 width=2039) (actual time=132.942..132.943 rows=1 loops=1)
Filter: (raw #> '{"client": {"token": "test_token"}}'::jsonb)
Rows Removed by Filter: 2133
Planning time: 0.157 ms
Execution time: 132.991 ms
(5 rows)
I want to add index on client key inside raw field so search will be faster. I don't know how to do it. When I add index for whole raw column like this:
CREATE INDEX test_index on survey_results USING GIN (raw);
then everything works as expected. I don't want to add index for whole raw because I have a lot of records in database and I do not want to increase its size.
If you are querying JSON objects as in the example, you can create an expression index on just the client key:
CREATE INDEX test_client_index ON survey_results USING GIN ((raw -> 'client'));
But since you are using #> operator in your query then in your case it might make sense to create index only for that operator like that:
CREATE INDEX test_client_index ON survey_results USING GIN (raw jsonb_path_ops);
See the PostgreSQL documentation on JSONB indexing for more details.
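Putting it together, a sketch of the narrower expression index and a query shaped to use it (the containment operator @> applied to the extracted client object):

```sql
-- Index only the "client" object inside raw, keeping the index
-- much smaller than a GIN index over the whole column
CREATE INDEX test_client_index
  ON survey_results USING gin ((raw -> 'client'));

-- The WHERE clause must match the indexed expression:
SELECT *
FROM survey_results
WHERE raw -> 'client' @> '{"token": "test_token"}';
```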

Behavior of WHERE clause on a primary key field

select * from table where username="johndoe"
In Postgres, if username is not a primary key, I know it will iterate through all the records.
But if it is a primary key field, will the above SQL statement iterate through the entire table, or terminate as soon as username is matched. In other words, does "where" act differently when it is running on a primary key column or not?
Primary keys (and all indexed columns, for that matter) take advantage of indexes when those columns are used as filter predicates, as in WHERE and JOIN ... ON clauses.
As a real world example, my application has a table called Log_Games, which is a table with millions of rows, ID as the primary key, and a number of other non-indexed columns such as ParsedAt. Compare the below:
INDEXED QUERY
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ID" = 792046
INDEXED QUERY PLAN
Index Scan using "Log_Games_pkey" on "Log_Games" (cost=0.43..8.45 rows=1 width=4190) (actual time=0.024..0.024 rows=1 loops=1)
Index Cond: ("ID" = 792046)
Planning time: 1.059 ms
Execution time: 0.066 ms
NON-INDEXED QUERY
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ParsedAt" = '2015-05-07 07:31:24+00'
NON-INDEXED QUERY PLAN
Seq Scan on "Log_Games" (cost=0.00..141377.34 rows=18 width=4190) (actual time=0.013..793.094 rows=1 loops=1)
Filter: ("ParsedAt" = '2015-05-07 07:31:24+00'::timestamp with time zone)
Rows Removed by Filter: 1924676
Planning time: 0.794 ms
Execution time: 793.135 ms
The query with the indexed clause uses the index Log_Games_pkey, resulting in a query that executes in 0.066ms. The query with the non-indexed clause reverts to a sequential scan, which means it reads the table from start to finish to see which rows match, an operation that causes the execution time to blow out to 793.135ms.
There are plenty of good resources around the web that can help you read execution plans and decide when they may need supporting indexes. A good place to start is the PostgreSQL documentation:
https://www.postgresql.org/docs/9.6/static/using-explain.html
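If filtering on ParsedAt were a frequent pattern, adding an index would turn the second plan into an index scan as well. A hedged sketch (the index name is hypothetical):

```sql
-- Hypothetical index on the previously unindexed timestamp column
CREATE INDEX "Log_Games_ParsedAt_idx" ON "Log_Games" ("ParsedAt");

-- Re-running the same EXPLAIN should now show an Index Scan
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ParsedAt" = '2015-05-07 07:31:24+00';
```

Whether the index is worth its maintenance cost depends on how often the column is queried versus written.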

Why is PostgreSQL not using *just* the covering index in this query depending on the contents of its IN() clause?

I have a table with a covering index that should respond to a query using just the index, without checking the table at all. Postgres does, in fact, do that, if the IN() clause has 1 or a few elements in it. However, if the IN clause has lots of elements, it seems like it's doing the search on the index, and then going to the table and re-checking the conditions...
I can't figure out why Postgres would do that. It can either serve the query straight from the index or it can't, why would it go to the table if it (in theory) doesn't have anything else to add?
The table:
CREATE TABLE phone_numbers
(
id serial NOT NULL,
phone_number character varying,
hashed_phone_number character varying,
user_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
ghost boolean DEFAULT false,
CONSTRAINT phone_numbers_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX index_phone_numbers_covering_hashed_ghost_and_user
ON phone_numbers
USING btree
(hashed_phone_number COLLATE pg_catalog."default", ghost, user_id);
The query I'm running is :
SELECT "phone_numbers"."user_id"
FROM "phone_numbers"
WHERE "phone_numbers"."hashed_phone_number" IN (*several numbers*)
AND "phone_numbers"."ghost" = 'f'
As you can see, the index has all the fields it needs to reply to that query.
And if I have only one or a few numbers in the IN clause, it does:
1 number:
Index Scan using index_phone_numbers_on_hashed_phone_number on phone_numbers (cost=0.41..8.43 rows=1 width=4)
Index Cond: ((hashed_phone_number)::text = 'bebd43a6eb29b2fda3bcb63dcc7ffaf5433e78660ccd1a495c1180a3eaaf6b6a'::text)
Filter: (NOT ghost)
3 numbers:
Index Only Scan using index_phone_numbers_covering_hashed_ghost_and_user on phone_numbers (cost=0.42..17.29 rows=1 width=4)
Index Cond: ((hashed_phone_number = ANY ('{8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,43ddeebdca2ea829d468d5debc84d475c8322cf4bf6edca286c918b04216387e,1578bf773eb6eb8a9b57a130922a28c9c91f1bda67202ef5936b39630ca4cfe4}'::text[])) AND (...)
Filter: (NOT ghost)
However, when I have a lot of numbers in the IN clause, Postgres is using the Index, but then hitting the table, and I don't know why:
Bitmap Heap Scan on phone_numbers (cost=926.59..1255.81 rows=106 width=4)
Recheck Cond: ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7ec9f58 (...)
Filter: (NOT ghost)
-> Bitmap Index Scan on index_phone_numbers_covering_hashed_ghost_and_user (cost=0.00..926.56 rows=106 width=0)
Index Cond: (((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7e (...)
This is currently making this query, which looks for 250 records in a table with 50k total rows, about twice as slow as a similar query on another table, which looks for 250 records in a table with 5 million rows. That doesn't make much sense.
Any ideas what could be happening, and whether I can do anything to improve this?
UPDATE: Changing the order of the columns in the covering index to have first ghost and then hashed_phone_number also doesn't solve it:
Bitmap Heap Scan on phone_numbers (cost=926.59..1255.81 rows=106 width=4)
Recheck Cond: ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7ec9f58 (...)
Filter: (NOT ghost)
-> Bitmap Index Scan on index_phone_numbers_covering_ghost_hashed_and_user (cost=0.00..926.56 rows=106 width=0)
Index Cond: ((ghost = false) AND ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef55267668 (...)
The choice of indexes is based on what the optimizer says is the best solution for the query. Postgres is trying really hard with your index, but it is not the best index for the query.
The best index has ghost first:
CREATE INDEX index_phone_numbers_covering_hashed_ghost_and_user
ON phone_numbers
USING btree
(ghost, hashed_phone_number COLLATE pg_catalog."default", user_id);
I happen to think that MySQL documentation does a good job of explaining how composite indexes are used.
Essentially, what is happening is that Postgres needs to do an index seek for every element of the in list. This may be compounded by the use of strings -- because collations/encodings affect the comparisons. Eventually, Postgres decides that other approaches are more efficient. If you put ghost first, then it will just jump to the right part of the index and find the rows it needs there.
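Note that even with the right column order, an Index Only Scan also depends on the table's visibility map: if many heap pages are not marked all-visible, Postgres must still fetch tuples from the table to check visibility. A hedged sketch of making the index-only path more likely:

```sql
-- Rebuild with ghost leading, so the equality condition on ghost
-- narrows the scan to one contiguous part of the index first
CREATE INDEX index_phone_numbers_covering_ghost_hashed_and_user
  ON phone_numbers (ghost, hashed_phone_number, user_id);

-- VACUUM updates the visibility map that index-only scans rely on
VACUUM ANALYZE phone_numbers;
```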

Why isn't Postgres using the index?

I have a table with an integer column called account_id. I have an index on that column.
But seems Postgres doesn't want to use my index:
EXPLAIN ANALYZE SELECT "invoices".* FROM "invoices" WHERE "invoices"."account_id" = 1;
Seq Scan on invoices (cost=0.00..6504.61 rows=117654 width=186) (actual time=0.021..33.943 rows=118027 loops=1)
Filter: (account_id = 1)
Rows Removed by Filter: 51462
Total runtime: 39.917 ms
(4 rows)
Any idea why that would be?
Because of:
Seq Scan on invoices (...) (actual ... rows=118027 <— this
Filter: (account_id = 1)
Rows Removed by Filter: 51462 <— vs this
Total runtime: 39.917 ms
You're selecting so many rows that it's cheaper to read the entire table.
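To confirm it is a cost decision rather than a missing or unusable index, you can temporarily discourage sequential scans in your session (a diagnostic only, not something to run in production):

```sql
-- Makes the planner treat seq scans as prohibitively expensive
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT "invoices".* FROM "invoices"
WHERE "invoices"."account_id" = 1;

RESET enable_seqscan;
-- If the forced index scan runs no faster than the seq scan,
-- the planner's original choice was the right one.
```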
Related earlier questions and answers from today for further reading:
Why doesn't Postgresql use index for IN query?
Postgres using wrong index when querying a view of indexed expressions?
(See also Craig's longer answer on the second one for additional notes on index subtleties.)