SELECT COUNT query on indexed column - sql

I have a simple table storing words for different ids.
CREATE TABLE words (
id INTEGER,
word TEXT,
);
CREATE INDEX ON words USING hash (id);
CREATE INDEX ON words USING hash (word);
Now I simply want to count the number of times a given word appears. My actual query is a bit different and involves other filters.
SELECT COUNT(1) FROM "words"
WHERE word = 'car'
My table has a billion of rows but the answer for this specific query is about 45k.
I hoped that the index on the words would make the query super fast but it still takes a 1 min 20 to be executed, which looks dereasonable. As a comparison, SELECT COUNT(1) FROM "words" takes 1 min 57.
Here is the output of EXPLAIN:
Aggregate (cost=48667.00..48667.01 rows=1 width=8)
-> Bitmap Heap Scan on words (cost=398.12..48634.05 rows=13177 width=0)
Recheck Cond: (word = 'car'::text)
-> Bitmap Index Scan on words_word_idx (cost=0.00..394.83 rows=13177 width=0)
Index Cond: (word = 'car'::text)
I don't understand why there is a need to recheck the condition and why this query is not efficient.

Hash indexes don't store the indexed value in the index, just its 32-bit hash and the ctid (pointer to the table row). That means they can't resolve hash collisions on their own, so it has to go to the table to obtain the value and then recheck it. This can involve a lot or extra IO compared to a btree index, which do store the value and can support index only scans.

You could use B-Tree index whenever an indexed column is involved in a comparison using one of these operators:
< <= = >= >
I assume you are using = for counting how many words are there. So, B-Tree index satisfy your requirements.
Reference: https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BTREE

Related

Sequential scan for column indexed with varchar_pattern_ops

I have a table Users and it contains location column. I have indexed location column using varchar_pattern_ops. But when I run query planner it tells me it is doing a sequential scan.
EXPLAIN ANALAYZE
SELECT * FROM USERS
WHERE lower(location) like '%nepa%'
ORDER BY location desc;
It gives following result:
Sort (cost=12.41..12.42 rows=1 width=451) (actual time=0.084..0.087 rows=8 loops=1)
Sort Key: location
Sort Method: quicksort Memory: 27kB
-> Seq Scan on users (cost=0.00..12.40 rows=1 width=451) (actual time=0.029..0.051 rows=8 loops=1)
Filter: (lower((location)::text) ~~ '%nepa%'::text)
Planning time: 0.211 ms
Execution time: 0.147 ms
I have searched through stackoverflow. Found most answers to be like "postgres performs sequential scan in large table in case index scan will be slower". But my table is not big either.
The index in my users table is:
"index_users_on_lower_location_varchar_pattern_ops" btree (lower(location::text) varchar_pattern_ops)
What is going on?
*_patter_ops indexes are good for prefix matching - LIKE patterns anchored to the start, without leading wildcard. But not for your predicate:
WHERE lower(location) like '%nepa%'
I suggest you create a trigram index instead. And you do not need lower() in the index (or query) since trigram indexes support case insensitive ILIKE (or ~*) at practically the same cost.
Follow instructions here:
PostgreSQL LIKE query performance variations
Also:
But my table is not big either.
You seem to have that backwards. If your table is not big enough, it may be faster for Postgres to just read it sequentially and not bother with indexes. You would not create any indexes for this at all. The tipping point depends on many factors.
Aside: your index definition does not make sense to begin with:
(lower(location::text) varchar_pattern_ops)
For a varchar columns use the varchar_pattern_ops operator class.
But if you cast to text, use text_pattern_ops. Since lower() returns text even for varchar input, use text_pattern_ops. Except that you probably do not need this (or any?) index at all, as advised.

Why is PostgreSQL not using *just* the covering index in this query depending on the contents of its IN() clause?

I have a table with a covering index that should respond to a query using just the index, without checking the table at all. Postgres does, in fact, do that, if the IN() clause has 1 or a few elements in it. However, if the IN clause has lots of elements, it seems like it's doing the search on the index, and then going to the table and re-checking the conditions...
I can't figure out why Postgres would do that. It can either serve the query straight from the index or it can't, why would it go to the table if it (in theory) doesn't have anything else to add?
The table:
CREATE TABLE phone_numbers
(
id serial NOT NULL,
phone_number character varying,
hashed_phone_number character varying,
user_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
ghost boolean DEFAULT false,
CONSTRAINT phone_numbers_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX index_phone_numbers_covering_hashed_ghost_and_user
ON phone_numbers
USING btree
(hashed_phone_number COLLATE pg_catalog."default", ghost, user_id);
The query I'm running is :
SELECT "phone_numbers"."user_id"
FROM "phone_numbers"
WHERE "phone_numbers"."hashed_phone_number" IN (*several numbers*)
AND "phone_numbers"."ghost" = 'f'
As you can see, the index has all the fields it needs to reply to that query.
And if I have only one or a few numbers in the IN clause, it does:
1 number:
Index Scan using index_phone_numbers_on_hashed_phone_number on phone_numbers (cost=0.41..8.43 rows=1 width=4)
Index Cond: ((hashed_phone_number)::text = 'bebd43a6eb29b2fda3bcb63dcc7ffaf5433e78660ccd1a495c1180a3eaaf6b6a'::text)
Filter: (NOT ghost)"
3 numbers:
Index Only Scan using index_phone_numbers_covering_hashed_ghost_and_user on phone_numbers (cost=0.42..17.29 rows=1 width=4)
Index Cond: ((hashed_phone_number = ANY ('{8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,43ddeebdca2ea829d468d5debc84d475c8322cf4bf6edca286c918b04216387e,1578bf773eb6eb8a9b57a130922a28c9c91f1bda67202ef5936b39630ca4cfe4}'::text[])) AND (...)
Filter: (NOT ghost)"
However, when I have a lot of numbers in the IN clause, Postgres is using the Index, but then hitting the table, and I don't know why:
Bitmap Heap Scan on phone_numbers (cost=926.59..1255.81 rows=106 width=4)
Recheck Cond: ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7ec9f58 (...)
Filter: (NOT ghost)
-> Bitmap Index Scan on index_phone_numbers_covering_hashed_ghost_and_user (cost=0.00..926.56 rows=106 width=0)
Index Cond: (((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7e (...)
This is currently making this query, which is looking for 250 records in a table with 50k total rows, about twice as low as a similar query on another table, which looks for 250 records in a table with 5 million rows, which doesn't make much sense.
Any ideas what could be happening, and whether I can do anything to improve this?
UPDATE: Changing the order of the columns in the covering index to have first ghost and then hashed_phone_number also doesn't solve it:
Bitmap Heap Scan on phone_numbers (cost=926.59..1255.81 rows=106 width=4)
Recheck Cond: ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7ec9f58 (...)
Filter: (NOT ghost)
-> Bitmap Index Scan on index_phone_numbers_covering_ghost_hashed_and_user (cost=0.00..926.56 rows=106 width=0)
Index Cond: ((ghost = false) AND ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef55267668 (...)
The choice of indexes is based on what the optimizer says is the best solution for the query. Postgres is trying really hard with your index, but it is not the best index for the query.
The best index has ghost first:
CREATE INDEX index_phone_numbers_covering_hashed_ghost_and_user
ON phone_numbers
USING btree
(ghost, hashed_phone_number COLLATE pg_catalog."default", user_id);
I happen to think that MySQL documentation does a good job of explaining how composite indexes are used.
Essentially, what is happening is that Postgres needs to do an index seek for every element of the in list. This may be compounded by the use of strings -- because collations/encodings affect the comparisons. Eventually, Postgres decides that other approaches are more efficient. If you put ghost first, then it will just jump to the right part of the index and find the rows it needs there.

PostgreSQL does not use a partial index

I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.
Query
explain analyze SELECT col1,col2.. colN
FROM table
WHERE text_col = 'my_value';
A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.
Planner
Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384
I added another partial index to try to filter out those values that were not null, but that did not help (with or without text_pattern_ops. I do not need text_pattern_ops considering no LIKE conditions are expressed in my queries, but they also match equality).
CREATE INDEX name_idx
ON table
USING btree
(text_col COLLATE pg_catalog."default" text_pattern_ops)
WHERE text_col IS NOT NULL;
Disabling sequence scans using set enable_seqscan = off; makes the planner still pick the seqscan over an index_scan. In summary...
The number of rows returned by this query is small.
Given that the non-null rows are fairly unique, an index scan over the text should be faster.
Vacuuming and analyzing the table did not help the optimizer pick the index.
My questions
Why does the database pick the sequence scan over the index scan?
When a table has a text column whose equality condition should be checked, are there any best practices I can adhere to?
How do I reduce the time taken for this query?
[Edit - More information]
The index scan is picked up on my local database that houses about 10% of the data that is available in production.
A partial index is a good idea to exclude half the rows of the table which you obviously do not need. Simpler:
CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;
Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)
Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:
SELECT col1,col2, .. colN
FROM table
WHERE text_col = 'my_value'
AND text_col IS NOT NULL; -- repeat condition
Voilá.
Per documentation:
However, keep in mind that the predicate must match the conditions
used in the queries that are supposed to benefit from the index. To be
precise, a partial index can be used in a query only if the system can
recognize that the WHERE condition of the query mathematically implies
the predicate of the index. PostgreSQL does not have a sophisticated
theorem prover that can recognize mathematically equivalent
expressions that are written in different forms. (Not only is such a
general theorem prover extremely difficult to create, it would
probably be too slow to be of any real use.) The system can recognize
simple inequality implications, for example "x < 1" implies "x < 2";
otherwise the predicate condition must exactly match part of the
query's WHERE condition or the index will not be recognized as usable.
Matching takes place at query planning time, not at run time. As a
result, parameterized query clauses do not work with a partial index.
As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.
An important update in Postgres 9.6 largely improves chances for index-only scans (which can make queries cheaper and the query planner will more readily chose such query plans). Related:
PostgreSQL not using index during count(*)
A partial index is only used if the WHERE conditions match. Thus an index with WHERE text_col IS NOT NULL can only be used if you use the same condition in your SELECT. Collation mismatch could also cause harm.
Try the following:
Make a simplest possible btree index CREATE INDEX foo ON table (text_col)
ANALYZE table
Query
I figured it out. Upon taking a closer look at the pg_stats view that analyze helps build, I came across this excerpt on the documentation.
Correlation
Statistical correlation between physical row ordering and logical
ordering of the column values. This ranges from -1 to +1. When the
value is near -1 or +1, an index scan on the column will be estimated
to be cheaper than when it is near zero, due to reduction of random
access to the disk. (This column is null if the column data type does
not have a < operator.)
On my local box the correlation number is 0.97 and on production it was 0.05. Thus the planner is estimating that it is easier to go through all those rows sequentially instead of looking up the index each time and diving into a random access on the disk block. This is the query I used to peek at the correlation number.
select * from pg_stats where tablename = 'table_name' and attname = 'text_col';
This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If the update has a large value for a text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering are slowing moving apart with each update. To fix that I executed the following queries.
ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;
The idea is that I could give each disk block a 20% buffer and vacuum full the table to reclaim lost space and maintain physical and logical order. After I did this the query picks up the index.
Query
explain analyze SELECT col1,col2... colN
FROM table_name
WHERE text_col is not null
AND
text_col = 'my_value';
Partial index scan - 1.5ms
Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor five NOT NULL) AND (victor = 'delta'::text))
Excluding the NULL condition picks up the other index with a bitmap heap scan.
Full index - 0.08ms
Bitmap Heap Scan on two (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
Recheck Cond: (victor = 'delta'::text)
-> Bitmap Index Scan on tango (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: (victor = 'delta'::text)
[EDIT]
While it initially looked like correlation plays a major role in choosing the index scan #Mike has observed that a correlation value that is close to 0 on his database still resulted in an index scan. Changing fill factor and vacuuming fully has helped but I'm unsure why.

Effectively query on column that includes a substring

Given a string column with a value similar to /123/12/34/56/5/, what is the optimal way of querying for all the records that include the given number (12 for example)?
The solution from top of my head is:
SELECT id FROM things WHERE things.path LIKE '%/12/%'
But AFAIK this query can't use indexes on the column due to the leading %.
There must be something better. What is it?
Using PostgreSQL, but would prefer the solution that would work across other DBs too.
If you're happy turning that column into an array of integers, like:
'/123/12/34/56/5/' becomes ARRAY[123,12,34,56,5]
So that path_arr is a column of type INTEGER[], then you can create a GIN index on that column:
CREATE INDEX ON things USING gin(path_arr);
A query for all items containing 12 then becomes:
SELECT * FROM things WHERE ARRAY[12] <# path_arr;
Which will use the index. In my test (with a million rows), I get plans like:
EXPLAIN SELECT * FROM things WHERE ARRAY[12] <# path_arr;
QUERY PLAN
----------------------------------------------------------------------------------------
Bitmap Heap Scan on things (cost=5915.75..9216.99 rows=1000 width=92)
Recheck Cond: (path_arr <# '{12}'::integer[])
-> Bitmap Index Scan on things_path_arr_idx (cost=0.00..5915.50 rows=1000 width=0)
Index Cond: ('{12}'::integer[] <# path_arr)
(4 rows)
In PostgreSQL 9.1 you could utilize the pg_trgm module and build a GIN index with it.
CREATE EXTENSION pg_trgm; -- once per database
CREATE INDEX things_path_trgm_gin_idx ON things USING gin (path gin_trgm_ops);
Your LIKE expression can use this index even if it is not left-anchored.
See a detailed demo by depesz here.
Normalize it If you can, though.

SQL indexes for "not equal" searches

The SQL index allows to find quickly a string which matches my query. Now, I have to search in a big table the strings which do not match. Of course, the normal index does not help and I have to do a slow sequential scan:
essais=> \d phone_idx
Index "public.phone_idx"
Column | Type
--------+------
phone | text
btree, for table "public.phonespersons"
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone = '+33 1234567';
QUERY PLAN
-------------------------------------------------------------------------------
Index Scan using phone_idx on phonespersons (cost=0.00..8.41 rows=1 width=4)
Index Cond: (phone = '+33 1234567'::text)
(2 rows)
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone != '+33 1234567';
QUERY PLAN
----------------------------------------------------------------------
Seq Scan on phonespersons (cost=0.00..18621.00 rows=999999 width=4)
Filter: (phone <> '+33 1234567'::text)
(2 rows)
I understand (see Mark Byers' very good explanations) that PostgreSQL
can decide not to use an index when it sees that a sequential scan
would be faster (for instance if almost all the tuples match). But,
here, "not equal" searches are really slower.
Any way to make these "is not equal to" searches faster?
Here is another example, to address Mark Byers' excellent remarks. The
index is used for the '=' query (which returns the vast majority of
tuples) but not for the '!=' query:
essais=> \d tld_idx
Index "public.tld_idx"
Column | Type
-----------------+------
pg_expression_1 | text
btree, for table "public.emailspersons"
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) = 'fr';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using tld_idx on emailspersons (cost=0.25..4010.79 rows=97033 width=4) (actual time=0.137..261.123 rows=97110 loops=1)
Index Cond: (tld(email) = 'fr'::text)
Total runtime: 444.800 ms
(3 rows)
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) != 'fr';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on emailspersons (cost=0.00..27129.00 rows=2967 width=4) (actual time=1.004..1031.224 rows=2890 loops=1)
Filter: (tld(email) <> 'fr'::text)
Total runtime: 1037.278 ms
(3 rows)
DBMS is PostgreSQL 8.3 (but I can upgrade to 8.4).
Possibly it would help to write:
SELECT person FROM PhonesPersons WHERE phone < '+33 1234567'
UNION ALL
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
or simply
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
OR phone < '+33 1234567'
PostgreSQL should be able to determine that the selectivity of the range operation is very high and to consider using an index for it.
I don't think it can use an index directly to satisfy a not-equals predicate, although it would be nice if it could try re-writing the not-equals as above (if it helps) during planning. If it works, suggest it to the developers ;)
Rationale: searching an index for all values not equal to a certain one requires scanning the full index. By contrast, searching for all elements less than a certain key means finding the greatest non-matching item in the tree and scanning backwards. Similarly, searching for all elements greater than a certain key in the opposite direction. These operations are easy to fulfill using b-tree structures. Also, the statistics that PostgreSQL collects should be able to point out that "+33 1234567" is a known frequent value: by removing the frequency of those and nulls from 1, we have the proportion of rows left to select: the histogram bounds will indicate whether those are skewed to one side or not. But if the exclusion of nulls and that frequent value pushes the proportion of rows remaining low enough (Istr about 20%), an index scan should be appropriate. Check the stats for the column in pg_stats to see what proportion it's actually calculated.
Update: I tried this on a local table with a vaguely similar distribution, and both forms of the above produced something other than a plain seq scan. The latter (using "OR") was a bitmap scan that may actually devolve to just being a seq scan if the bias towards your common value is particularly extreme... although the planner can see that, I don't think it will automatically rewrite to an "Append(Index Scan,Index Scan)" internally. Turning "enable_bitmapscan" off just made it revert to a seq scan.
PS: indexing a text column and using the inequality operators can be an issue, if your database location is not C. You may need to add an extra index that uses text_pattern_ops or varchar_pattern_ops; this is similar to the problem of indexing for column LIKE 'prefix%' predicates.
Alternative: you could create a partial index:
CREATE INDEX PhonesPersonsOthers ON PhonesPersons(phone) WHERE phone <> '+33 1234567'
this will make the <>-using select statement just scan through that partial index: since it excludes most of the entries in the table, it should be small.
The database is able use the index for this query, but it chooses not to because it would be slower. Update: This is not quite right: you have to rewrite the query slightly. See Araqnid's answer.
Your where clause selects almost all rows in your table (rows = 999999). The database can see that a table scan would be faster in this case and therefore ignores the index. It is faster because the column person is not in your index so it would have to make two lookups for each row, once in the index to check the WHERE clause, and then again in the main table to fetch the column person.
If you had a different type of data where there were most values were foo and just a few were bar and you said WHERE col <> 'foo' then it probably would use the index.
Any way to make these "is not equal to" searches faster?
Any query that selects almost 1 million rows is going to be slow. Try adding a limit clause.