How to increase query speed without using full-text search? - sql

This is my simple query; By searching selectnothing I'm sure I'll have no hits.
SELECT nome_t FROM myTable WHERE nome_t ILIKE '%selectnothing%';
This is the EXPLAIN ANALYZE VERBOSE
Seq Scan on myTable (cost=0.00..15259.04 rows=37 width=29) (actual time=2153.061..2153.061 rows=0 loops=1)
Output: nome_t
Filter: (nome_t ~~* '%selectnothing%'::text)
Total runtime: 2153.116 ms
myTable has around 350k rows and the table definition is something like:
CREATE TABLE myTable (
nome_t text NOT NULL,
)
I have an index on nome_t as stated below:
CREATE INDEX idx_m_nome_t ON myTable
USING btree (nome_t);
Although this is clearly a good candidate for Fulltext search I would like to rule that option out for now.
This query is meant to be run from a web application and currently it's taking around 2 seconds which is obviously too much;
Is there anything I can do, like using other index methods, to improve the speed of this query?

No, ILIKE '%selectnothing%' always needs a full table scan, every index is useless. You need full text search, it's not that hard to implement.
Edit: You could use a Wildspeed, I forgot about this option. The indexes will be huge, but your performance will also be much better.
Wildspeed extension provides GIN index
support for wildcard search for LIKE
operator.
http://www.sai.msu.su/~megera/wiki/wildspeed

another thing you can do-- is break this nome_t column in table myTable into it's own table. Searching one column out of a table is slow (if there are fifty other wide columns) because the other data effectively slows down the scan against that column (because there are less records per page/extent).

Related

Sequential scan for column indexed with varchar_pattern_ops

I have a table Users and it contains location column. I have indexed location column using varchar_pattern_ops. But when I run query planner it tells me it is doing a sequential scan.
EXPLAIN ANALAYZE
SELECT * FROM USERS
WHERE lower(location) like '%nepa%'
ORDER BY location desc;
It gives following result:
Sort (cost=12.41..12.42 rows=1 width=451) (actual time=0.084..0.087 rows=8 loops=1)
Sort Key: location
Sort Method: quicksort Memory: 27kB
-> Seq Scan on users (cost=0.00..12.40 rows=1 width=451) (actual time=0.029..0.051 rows=8 loops=1)
Filter: (lower((location)::text) ~~ '%nepa%'::text)
Planning time: 0.211 ms
Execution time: 0.147 ms
I have searched through stackoverflow. Found most answers to be like "postgres performs sequential scan in large table in case index scan will be slower". But my table is not big either.
The index in my users table is:
"index_users_on_lower_location_varchar_pattern_ops" btree (lower(location::text) varchar_pattern_ops)
What is going on?
*_patter_ops indexes are good for prefix matching - LIKE patterns anchored to the start, without leading wildcard. But not for your predicate:
WHERE lower(location) like '%nepa%'
I suggest you create a trigram index instead. And you do not need lower() in the index (or query) since trigram indexes support case insensitive ILIKE (or ~*) at practically the same cost.
Follow instructions here:
PostgreSQL LIKE query performance variations
Also:
But my table is not big either.
You seem to have that backwards. If your table is not big enough, it may be faster for Postgres to just read it sequentially and not bother with indexes. You would not create any indexes for this at all. The tipping point depends on many factors.
Aside: your index definition does not make sense to begin with:
(lower(location::text) varchar_pattern_ops)
For a varchar columns use the varchar_pattern_ops operator class.
But if you cast to text, use text_pattern_ops. Since lower() returns text even for varchar input, use text_pattern_ops. Except that you probably do not need this (or any?) index at all, as advised.

PostgreSQL does not use a partial index

I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.
Query
explain analyze SELECT col1,col2.. colN
FROM table
WHERE text_col = 'my_value';
A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.
Planner
Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384
I added another partial index to try to filter out those values that were not null, but that did not help (with or without text_pattern_ops. I do not need text_pattern_ops considering no LIKE conditions are expressed in my queries, but they also match equality).
CREATE INDEX name_idx
ON table
USING btree
(text_col COLLATE pg_catalog."default" text_pattern_ops)
WHERE text_col IS NOT NULL;
Disabling sequence scans using set enable_seqscan = off; makes the planner still pick the seqscan over an index_scan. In summary...
The number of rows returned by this query is small.
Given that the non-null rows are fairly unique, an index scan over the text should be faster.
Vacuuming and analyzing the table did not help the optimizer pick the index.
My questions
Why does the database pick the sequence scan over the index scan?
When a table has a text column whose equality condition should be checked, are there any best practices I can adhere to?
How do I reduce the time taken for this query?
[Edit - More information]
The index scan is picked up on my local database that houses about 10% of the data that is available in production.
A partial index is a good idea to exclude half the rows of the table which you obviously do not need. Simpler:
CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;
Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)
Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:
SELECT col1,col2, .. colN
FROM table
WHERE text_col = 'my_value'
AND text_col IS NOT NULL; -- repeat condition
Voilá.
Per documentation:
However, keep in mind that the predicate must match the conditions
used in the queries that are supposed to benefit from the index. To be
precise, a partial index can be used in a query only if the system can
recognize that the WHERE condition of the query mathematically implies
the predicate of the index. PostgreSQL does not have a sophisticated
theorem prover that can recognize mathematically equivalent
expressions that are written in different forms. (Not only is such a
general theorem prover extremely difficult to create, it would
probably be too slow to be of any real use.) The system can recognize
simple inequality implications, for example "x < 1" implies "x < 2";
otherwise the predicate condition must exactly match part of the
query's WHERE condition or the index will not be recognized as usable.
Matching takes place at query planning time, not at run time. As a
result, parameterized query clauses do not work with a partial index.
As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.
An important update in Postgres 9.6 largely improves chances for index-only scans (which can make queries cheaper and the query planner will more readily chose such query plans). Related:
PostgreSQL not using index during count(*)
A partial index is only used if the WHERE conditions match. Thus an index with WHERE text_col IS NOT NULL can only be used if you use the same condition in your SELECT. Collation mismatch could also cause harm.
Try the following:
Make a simplest possible btree index CREATE INDEX foo ON table (text_col)
ANALYZE table
Query
I figured it out. Upon taking a closer look at the pg_stats view that analyze helps build, I came across this excerpt on the documentation.
Correlation
Statistical correlation between physical row ordering and logical
ordering of the column values. This ranges from -1 to +1. When the
value is near -1 or +1, an index scan on the column will be estimated
to be cheaper than when it is near zero, due to reduction of random
access to the disk. (This column is null if the column data type does
not have a < operator.)
On my local box the correlation number is 0.97 and on production it was 0.05. Thus the planner is estimating that it is easier to go through all those rows sequentially instead of looking up the index each time and diving into a random access on the disk block. This is the query I used to peek at the correlation number.
select * from pg_stats where tablename = 'table_name' and attname = 'text_col';
This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If the update has a large value for a text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering are slowing moving apart with each update. To fix that I executed the following queries.
ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;
The idea is that I could give each disk block a 20% buffer and vacuum full the table to reclaim lost space and maintain physical and logical order. After I did this the query picks up the index.
Query
explain analyze SELECT col1,col2... colN
FROM table_name
WHERE text_col is not null
AND
text_col = 'my_value';
Partial index scan - 1.5ms
Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor five NOT NULL) AND (victor = 'delta'::text))
Excluding the NULL condition picks up the other index with a bitmap heap scan.
Full index - 0.08ms
Bitmap Heap Scan on two (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
Recheck Cond: (victor = 'delta'::text)
-> Bitmap Index Scan on tango (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: (victor = 'delta'::text)
[EDIT]
While it initially looked like correlation plays a major role in choosing the index scan #Mike has observed that a correlation value that is close to 0 on his database still resulted in an index scan. Changing fill factor and vacuuming fully has helped but I'm unsure why.

Effectively query on column that includes a substring

Given a string column with a value similar to /123/12/34/56/5/, what is the optimal way of querying for all the records that include the given number (12 for example)?
The solution from top of my head is:
SELECT id FROM things WHERE things.path LIKE '%/12/%'
But AFAIK this query can't use indexes on the column due to the leading %.
There must be something better. What is it?
Using PostgreSQL, but would prefer the solution that would work across other DBs too.
If you're happy turning that column into an array of integers, like:
'/123/12/34/56/5/' becomes ARRAY[123,12,34,56,5]
So that path_arr is a column of type INTEGER[], then you can create a GIN index on that column:
CREATE INDEX ON things USING gin(path_arr);
A query for all items containing 12 then becomes:
SELECT * FROM things WHERE ARRAY[12] <# path_arr;
Which will use the index. In my test (with a million rows), I get plans like:
EXPLAIN SELECT * FROM things WHERE ARRAY[12] <# path_arr;
QUERY PLAN
----------------------------------------------------------------------------------------
Bitmap Heap Scan on things (cost=5915.75..9216.99 rows=1000 width=92)
Recheck Cond: (path_arr <# '{12}'::integer[])
-> Bitmap Index Scan on things_path_arr_idx (cost=0.00..5915.50 rows=1000 width=0)
Index Cond: ('{12}'::integer[] <# path_arr)
(4 rows)
In PostgreSQL 9.1 you could utilize the pg_trgm module and build a GIN index with it.
CREATE EXTENSION pg_trgm; -- once per database
CREATE INDEX things_path_trgm_gin_idx ON things USING gin (path gin_trgm_ops);
Your LIKE expression can use this index even if it is not left-anchored.
See a detailed demo by depesz here.
Normalize it If you can, though.

SQL indexes for "not equal" searches

The SQL index allows to find quickly a string which matches my query. Now, I have to search in a big table the strings which do not match. Of course, the normal index does not help and I have to do a slow sequential scan:
essais=> \d phone_idx
Index "public.phone_idx"
Column | Type
--------+------
phone | text
btree, for table "public.phonespersons"
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone = '+33 1234567';
QUERY PLAN
-------------------------------------------------------------------------------
Index Scan using phone_idx on phonespersons (cost=0.00..8.41 rows=1 width=4)
Index Cond: (phone = '+33 1234567'::text)
(2 rows)
essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone != '+33 1234567';
QUERY PLAN
----------------------------------------------------------------------
Seq Scan on phonespersons (cost=0.00..18621.00 rows=999999 width=4)
Filter: (phone <> '+33 1234567'::text)
(2 rows)
I understand (see Mark Byers' very good explanations) that PostgreSQL
can decide not to use an index when it sees that a sequential scan
would be faster (for instance if almost all the tuples match). But,
here, "not equal" searches are really slower.
Any way to make these "is not equal to" searches faster?
Here is another example, to address Mark Byers' excellent remarks. The
index is used for the '=' query (which returns the vast majority of
tuples) but not for the '!=' query:
essais=> \d tld_idx
Index "public.tld_idx"
Column | Type
-----------------+------
pg_expression_1 | text
btree, for table "public.emailspersons"
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) = 'fr';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using tld_idx on emailspersons (cost=0.25..4010.79 rows=97033 width=4) (actual time=0.137..261.123 rows=97110 loops=1)
Index Cond: (tld(email) = 'fr'::text)
Total runtime: 444.800 ms
(3 rows)
essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) != 'fr';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on emailspersons (cost=0.00..27129.00 rows=2967 width=4) (actual time=1.004..1031.224 rows=2890 loops=1)
Filter: (tld(email) <> 'fr'::text)
Total runtime: 1037.278 ms
(3 rows)
DBMS is PostgreSQL 8.3 (but I can upgrade to 8.4).
Possibly it would help to write:
SELECT person FROM PhonesPersons WHERE phone < '+33 1234567'
UNION ALL
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
or simply
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
OR phone < '+33 1234567'
PostgreSQL should be able to determine that the selectivity of the range operation is very high and to consider using an index for it.
I don't think it can use an index directly to satisfy a not-equals predicate, although it would be nice if it could try re-writing the not-equals as above (if it helps) during planning. If it works, suggest it to the developers ;)
Rationale: searching an index for all values not equal to a certain one requires scanning the full index. By contrast, searching for all elements less than a certain key means finding the greatest non-matching item in the tree and scanning backwards. Similarly, searching for all elements greater than a certain key in the opposite direction. These operations are easy to fulfill using b-tree structures. Also, the statistics that PostgreSQL collects should be able to point out that "+33 1234567" is a known frequent value: by removing the frequency of those and nulls from 1, we have the proportion of rows left to select: the histogram bounds will indicate whether those are skewed to one side or not. But if the exclusion of nulls and that frequent value pushes the proportion of rows remaining low enough (Istr about 20%), an index scan should be appropriate. Check the stats for the column in pg_stats to see what proportion it's actually calculated.
Update: I tried this on a local table with a vaguely similar distribution, and both forms of the above produced something other than a plain seq scan. The latter (using "OR") was a bitmap scan that may actually devolve to just being a seq scan if the bias towards your common value is particularly extreme... although the planner can see that, I don't think it will automatically rewrite to an "Append(Index Scan,Index Scan)" internally. Turning "enable_bitmapscan" off just made it revert to a seq scan.
PS: indexing a text column and using the inequality operators can be an issue, if your database location is not C. You may need to add an extra index that uses text_pattern_ops or varchar_pattern_ops; this is similar to the problem of indexing for column LIKE 'prefix%' predicates.
Alternative: you could create a partial index:
CREATE INDEX PhonesPersonsOthers ON PhonesPersons(phone) WHERE phone <> '+33 1234567'
this will make the <>-using select statement just scan through that partial index: since it excludes most of the entries in the table, it should be small.
The database is able use the index for this query, but it chooses not to because it would be slower. Update: This is not quite right: you have to rewrite the query slightly. See Araqnid's answer.
Your where clause selects almost all rows in your table (rows = 999999). The database can see that a table scan would be faster in this case and therefore ignores the index. It is faster because the column person is not in your index so it would have to make two lookups for each row, once in the index to check the WHERE clause, and then again in the main table to fetch the column person.
If you had a different type of data where there were most values were foo and just a few were bar and you said WHERE col <> 'foo' then it probably would use the index.
Any way to make these "is not equal to" searches faster?
Any query that selects almost 1 million rows is going to be slow. Try adding a limit clause.

PostgreSQL LIKE query performance variations

I have been seeing quite a large variation in response times regarding LIKE queries to a particular table in my database. Sometimes I will get results within 200-400 ms (very acceptable) but other times it might take as much as 30 seconds to return results.
I understand that LIKE queries are very resource intensive but I just don't understand why there would be such a large difference in response times. I have built a btree index on the owner1 field but I don't think it helps with LIKE queries. Anyone have any ideas?
Sample SQL:
SELECT gid, owner1 FORM parcels
WHERE owner1 ILIKE '%someones name%' LIMIT 10
I've also tried:
SELECT gid, owner1 FROM parcels
WHERE lower(owner1) LIKE lower('%someones name%') LIMIT 10
And:
SELECT gid, owner1 FROM parcels
WHERE lower(owner1) LIKE lower('someones name%') LIMIT 10
With similar results.
Table Row Count: about 95,000.
FTS does not support LIKE
The previously accepted answer was incorrect. Full Text Search with its full text indexes is not for the LIKE operator at all, it has its own operators and doesn't work for arbitrary strings. It operates on words based on dictionaries and stemming. It does support prefix matching for words, but not with the LIKE operator:
Get partial match from GIN indexed TSVECTOR column
Trigram index for LIKE
Install the additional module pg_trgm which provides operator classes for GIN and GiST trigram indexes to support all LIKE and ILIKE patterns, not just left-anchored ones:
Example index:
CREATE INDEX tbl_col_gin_trgm_idx ON tbl USING gin (col gin_trgm_ops);
Or:
CREATE INDEX tbl_col_gist_trgm_idx ON tbl USING gist (col gist_trgm_ops);
Difference between GiST and GIN index
Example query:
SELECT * FROM tbl WHERE col LIKE 'foo%';
SELECT * FROM tbl WHERE col LIKE '%foo%'; -- works with leading wildcard, too
SELECT * FROM tbl WHERE col ILIKE '%foo%'; -- works case insensitively as well
Trigrams? What about shorter strings?
Words with less than 3 letters in indexed values still work. The manual:
Each word is considered to have two spaces prefixed and one space
suffixed when determining the set of trigrams contained in the string.
And search patterns with less than 3 letters? The manual:
For both LIKE and regular-expression searches, keep in mind that a
pattern with no extractable trigrams will degenerate to a full-index scan.
Meaning, that index / bitmap index scans still work (query plans for prepared statement won't break), it just won't buy you better performance. Typically no big loss, since 1- or 2-letter strings are hardly selective (more than a few percent of the underlying table matches) and index support would not improve performance (much) to begin with, because a full table scan is faster.
Prefix matching
Search patterns with no leading wildcard: col LIKE 'foo%'.
^# operator / starts_with() function
Quoting the release notes of Postgres 11:
Add prefix-match operator text ^# text, which is supported by SP-GiST
(Ildus Kurbangaliev)
This is similar to using var LIKE 'word%' with a btree index, but it
is more efficient.
Example query:
SELECT * FROM tbl WHERE col ^# 'foo'; -- no added wildcard
But the potential of operator and function stays limited until planner support is improved in Postgres 15 and the ^# operator is documented properly. The release notes:
Allow the ^# starts-with operator and the starts_with() function to
use btree indexes if using the C collation (Tom Lane)
Previously these could only use SP-GiST indexes.
COLLATE "C"
Since Postgres 9.1, an index with COLLATE "C" provides the same functionality as the operator class text_pattern_ops described below. See:
Is there a difference between text_pattern_ops and COLLATE "C"?
text_pattern_ops (original answer)
For just left-anchored patterns (no leading wildcard) you get the optimum with a suitable operator class for a btree index: text_pattern_ops or varchar_pattern_ops. Both built-in features of standard Postgres, no additional module needed. Similar performance, but much smaller index.
Example index:
CREATE INDEX tbl_col_text_pattern_ops_idx ON tbl(col text_pattern_ops);
Example query:
SELECT * FROM tbl WHERE col LIKE 'foo%'; -- no leading wildcard
Or, if you should be running your database with the 'C' locale (effectively no locale), then everything is sorted according to byte order anyway and a plain btree index with default operator class does the job.
Further reading
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
How is LIKE implemented?
Finding similar strings with PostgreSQL quickly
Possibly the fast ones are anchored patterns with case-sensitive like that can use indexes. i.e. there is no wild card at the beginning of the match string so the executor can use an index range scan. (the relevant comment in the docs is here) Lower and ilike will also lose your ability to use the index unless you specifically create an index for that purpose (see functional indexes).
If you want to search for string in the middle of the field, you should look into full text or trigram indexes. First of them is in Postgres core, the other is available in the contrib modules.
You could install Wildspeed, a different type of index in PostgreSQL. Wildspeed does work with %word% wildcards, no problem. The downside is the size of the index, this can be large, very large.
I recently had a similar issue with a table containing 200000 records and I need to do repeated LIKE queries. In my case, the string being search was fixed. Other fields varied. Because that, I was able to rewrite:
SELECT owner1 FROM parcels
WHERE lower(owner1) LIKE lower('%someones name%');
as
CREATE INDEX ix_parcels ON parcels(position(lower('someones name') in lower(owner1)));
SELECT owner1 FROM parcels
WHERE position(lower('someones name') in lower(owner1)) > 0;
I was delighted when the queries came back fast and verified the index is being used with EXPLAIN ANALYZE:
Bitmap Heap Scan on parcels (cost=7.66..25.59 rows=453 width=32) (actual time=0.006..0.006 rows=0 loops=1)
Recheck Cond: ("position"(lower(owner1), 'someones name'::text) > 0)
-> Bitmap Index Scan on ix_parcels (cost=0.00..7.55 rows=453 width=0) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: ("position"(lower(owner1), 'someones name'::text) > 0)
Planning time: 0.075 ms
Execution time: 0.025 ms
When ever you use a clause on a column with functions eg LIKE, ILIKE, upper, lower etc. Then postgres wont take your normal index into consideration. It will do a full scan of the table going through each row and therefore it will be slow.
The correct way would be to create a new index according to your query. For example if i want to match a column without case sensitivity and my column is a varchar. Then you can do it like this.
create index ix_tblname_col_upper on tblname (UPPER(col) varchar_pattern_ops);
Similarly if your column is a text then you do something like this
create index ix_tblname_col_upper on tblname (UPPER(col) text_pattern_ops);
Similarly you can change the function upper to any other function that you want.
Please Execute below mentioned query for improve the LIKE query performance in postgresql.
create an index like this for bigger tables:
CREATE INDEX <indexname> ON <tablename> USING btree (<fieldname> text_pattern_ops)
for what it's worth, Django ORM tends to use UPPER(text) for all LIKE queries to make it case insensitive,
Adding an index on UPPER(column::text) has greatly sped up my system, unlike any other thing.
As far as leading %, yes that will not use an index. See this blog for a great explanation:
https://use-the-index-luke.com/sql/where-clause/searching-for-ranges/like-performance-tuning
Your like queries probably cannot use the indexes you created because:
1) your LIKE criteria begins with a wildcard.
2) you've used a function with your LIKE criteria.