Poor regex performance for word matching on Postgres

I have a list of blocked phrases and I want to match for the existence of those phrases in user-entered text, but performance is very bad.
I am using this query :
SELECT value FROM blocked_items WHERE lower(unaccent( 'my input text' )) ~* ('[[:<:]]' || value || '[[:>:]]') LIMIT 1;
After investigating, I found out that the word boundaries [[:<:]] and [[:>:]] perform very badly, given that blocked_items has 24k records in it.
For instance when I try to run this one:
SELECT value FROM blocked_items WHERE lower(unaccent( 'my input text ' )) ilike ('%' || value || '%') LIMIT 1;
it's very fast compared to the first one. The problem is that I need to keep the test for word boundaries.
This check is performed frequently on a large program so the performance is very important for me.
Do you guys have any suggestions for making this faster?
EXPLAIN ANALYZE screenshot

Since you know that the LIKE (~~) query is fast and the RegEx (~) query is slow, the easiest solution is to combine both conditions (\m / \M are equivalent to [[:<:]] / [[:>:]]):
SELECT value FROM blocked_items
WHERE lower(unaccent('my input text')) ~~ ('%'||value||'%')
AND lower(unaccent('my input text')) ~ ('\m'||value||'\M')
LIMIT 1;
This way the fast condition filters out most of the rows, and the slow condition then only has to weed out the remaining false positives.
I am using the faster case-sensitive operators, assuming that value is already normalized. If that is not the case, drop the (then redundant) lower() and use the case-insensitive versions as in the original queries.
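If value is not already lower-cased and unaccented, one option is a one-time normalization of the stored phrases, roughly like this (a sketch; it assumes the unaccent extension is installed and that rewriting blocked_items is acceptable):
-- hypothetical one-time normalization so the case-sensitive operators stay valid
CREATE EXTENSION IF NOT EXISTS unaccent;
UPDATE blocked_items SET value = lower(unaccent(value));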
On my test set with 370k rows that speeds up the query from 6s (warm) to 90ms:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1651.85 rows=1 width=10) (actual time=89.702..89.702 rows=1 loops=1)
-> Seq Scan on blocked_items (cost=0.00..14866.61 rows=9 width=10) (actual time=89.701..89.701 rows=1 loops=1)
Filter: ((lower(unaccent('my input text'::text)) ~~ (('%'::text || value) || '%'::text)) AND (lower(unaccent('my input text'::text)) ~ (('\m'::text || value) || '\M'::text)))
Rows Removed by Filter: 153281
Planning Time: 0.097 ms
Execution Time: 89.717 ms
(6 rows)
However, we are still doing a full table scan, so performance will vary depending on where the first matching row sits in the table.
Ideally we can answer the query in near constant time by using an index.
Let's rewrite the query to use Text Search Functions and Operators:
SELECT value FROM blocked_items
WHERE to_tsvector('simple', unaccent('my input text'))
@@ phraseto_tsquery('simple', value)
LIMIT 1;
First we convert the input into a search vector and then check whether the blocked phrase matches that vector.
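For illustration, this is roughly what the two sides of that comparison look like (a sketch; exact output can differ between PostgreSQL versions):
SELECT to_tsvector('simple', unaccent('my input text'));
-- 'input':2 'my':1 'text':3
SELECT phraseto_tsquery('simple', 'input text');
-- 'input' <-> 'text'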
This takes about 440ms for the test query – a step back from our combined query:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..104.01 rows=1 width=10) (actual time=437.761..437.761 rows=1 loops=1)
-> Seq Scan on blocked_items (cost=0.00..192516.05 rows=1851 width=10) (actual time=437.760..437.760 rows=1 loops=1)
Filter: (to_tsvector('simple'::regconfig, unaccent('my input text'::text)) @@ phraseto_tsquery('simple'::regconfig, value))
Rows Removed by Filter: 153281
Planning Time: 0.063 ms
Execution Time: 437.772 ms
(6 rows)
Since we can't use tsvector @@ tsquery for indexing of tsquery, we can rewrite the query again to check if the blocked phrase is a sub-query of the input phrase using the tsquery @> tsquery Text Search Operator, which can then be indexed using tsquery_ops from the GiST Operator Classes:
CREATE INDEX blocked_items_search ON blocked_items
USING gist (phraseto_tsquery('simple', value));
ANALYZE blocked_items; -- update query planner stats
SELECT value FROM blocked_items
WHERE phraseto_tsquery('simple', unaccent('my input text'))
@> phraseto_tsquery('simple', value)
LIMIT 1;
The query can now use an index scan and takes 20ms with the same data.
Since GiST is a lossy index, query times can vary a bit depending on how many rechecks are required:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.54..4.23 rows=1 width=10) (actual time=19.215..19.215 rows=1 loops=1)
-> Index Scan using blocked_items_search on blocked_items (cost=0.54..1367.01 rows=370 width=10) (actual time=19.214..19.214 rows=1 loops=1)
Index Cond: (phraseto_tsquery('simple'::regconfig, value) <@ phraseto_tsquery('simple'::regconfig, unaccent('my input text'::text)))
Rows Removed by Index Recheck: 4028
Planning Time: 0.093 ms
Execution Time: 19.236 ms
(6 rows)
One great advantage of using full text search is that you can now use language-specific matching of words via search configurations (regconfig).
The above queries all use the default 'simple' regconfig to match the behavior of the original query. By switching to 'english' you can also match variations of the same word, like cat and cats (stemming), and common words without significance, like the or my, will be ignored (stop words).
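For example (a hedged illustration; the exact lexemes depend on the installed dictionaries):
SELECT phraseto_tsquery('english', 'my cats');
-- 'cat'  ("my" is dropped as a stop word, "cats" is stemmed to "cat")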

Related

How do I fetch similar posts from database where text length can be more than 2000 characters

As far as I'm aware, there is no simple, quick solution. I am trying to do a full-text keyword or semantic search, which is a very advanced topic. There are dedicated search servers created specifically for that reason, but is there still a way I can implement this with a query execution time of less than a second?
Here's what I have tried so far:
begin;
SET pg_trgm.similarity_threshold = 0.3;
select
    id, <col_name>,
    similarity(<column with gin index>, '<text to be searched>') as sml
from
    <table> p
where
    <clauses> and
    <indexed_col> % '<text to be searched>'
    and <indexed_col> <-> '<text to be searched>' < 0.5
order by
    <indexed_col> <-> '<text to be searched>'
limit 10;
end;
Index created is as follows:
CREATE INDEX trgm_idx ON posts USING gin (post_title_combined gin_trgm_ops);
The above query takes around 6-7 seconds to execute, and sometimes even 200 ms, which is weird to me because the query plan changes according to the input I pass in for the similarity match.
I tried tsvector @@ tsquery, but it turns out to be too strict due to the & operator.
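For instance, the default conversion ANDs every remaining term together (shown here only as a rough illustration):
SELECT plainto_tsquery('english', 'Test text not to be disclosed');
-- 'test' & 'text' & 'disclos'  (every remaining term must be present)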
EDIT: Here's the EXPLAIN ANALYZE of the above query
-> Sort (cost=463.82..463.84 rows=5 width=321) (actual time=3778.726..3778.728 rows=0 loops=1)
Sort Key: ((post_title_combined <-> 'Test text not to be disclosed'::text))
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on posts p (cost=404.11..463.77 rows=5 width=321) (actual time=3778.722..3778.723 rows=0 loops=1)
Recheck Cond: (post_title_combined % 'Test text not to be disclosed'::text)
Rows Removed by Index Recheck: 36258
Filter: ((content IS NOT NULL) AND (is_crawlable IS TRUE) AND (score IS NOT NULL) AND (status = 1) AND ((post_title_combined <-> 'Test text not to be disclosed'::text) < '0.5'::double precision))
Heap Blocks: exact=24043
-> Bitmap Index Scan on trgm_idx (cost=0.00..404.11 rows=15 width=0) (actual time=187.394..187.394 rows=36916 loops=1)
Index Cond: (post_title_combined % 'Test text not to be disclosed'::text)
Planning Time: 8.782 ms
Execution Time: 3778.787 ms
Your redundant/overlapping query conditions aren't helpful. Setting similarity_threshold=0.3 then doing
t % q and t <-> q < 0.5
just throws away index selectivity for no reason. Set similarity_threshold to as stringent a value as you want to use, then get rid of the unnecessary <-> condition.
You could try the GiST version of trigram indexing. It can support the ORDER BY ... <-> ... LIMIT 10 operation directly from the index. I doubt it will be very effective with 2000 char strings, but it is worth a try.
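A rough sketch of that suggestion, reusing the names from the question (gist_trgm_ops ships with the pg_trgm extension):
-- hypothetical GiST trigram index; the <-> ordering can then be served from the index
CREATE INDEX posts_title_trgm_gist ON posts USING gist (post_title_combined gist_trgm_ops);
SELECT id, post_title_combined,
       similarity(post_title_combined, 'Test text not to be disclosed') AS sml
FROM posts
WHERE post_title_combined % 'Test text not to be disclosed'
ORDER BY post_title_combined <-> 'Test text not to be disclosed'
LIMIT 10;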

Postgres / Postgis query optimization

I have a query in Postgres / Postgis which is based around finding the nearest points of a given point filtered by some other columns in the table.
The table consists of a bit over 10 million rows and the query looks like this:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1,2,3)
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
The geom column is indexed using GIST and col1 is also indexed.
When the WHERE clause finds many rows that are also near the point this is blazing fast using the geom index:
Limit (cost=0.42..10575.49 rows=1000 width=12) (actual time=0.150..6.742 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..2148612.35 rows=203177 width=12) (actual time=0.149..6.663 rows=1000 loops=1)
Order By: (geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1,2,3}'::double precision[]))
Rows Removed by Filter: 3348
Planning Time: 0.288 ms
Execution Time: 6.817 ms
The problem occurs when the WHERE clause does not find many rows that are close in distance to the given point. Example:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1) -- 1 is very rare near the given point
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
This query runs much much slower:
Limit (cost=0.42..14487.97 rows=1000 width=12) (actual time=8443.514..10629.745 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..1962368.41 rows=135452 width=12) (actual time=8443.513..10629.553 rows=1000 loops=1)
Order By: (t.geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1}'::double precision[]))
Rows Removed by Filter: 5866030
Planning Time: 0.292 ms
Execution Time: 10629.906 ms
I created an index on round(col1) to try to speed up searches on col1, but Postgres uses only the geom index, which works great when there are many nearby rows that fit the criteria but not so great when only a few rows match.
If I remove the LIMIT clause Postgres uses the index on col1, which works great when there are few resulting rows but is very slow when the result contains many rows, so I would like to keep the LIMIT clause.
Any suggestions on how I could optimize this query or create an index that handles this?
EDIT:
Thank you for all the suggestions and feedback!
I tried the tip from #JGH and restricted my query using st_dwithin, so as not to order the entire table before limiting.
...where st_dwithin(geom, searchpoint, 10000)
This greatly reduced the time of the slow query, down to a few milliseconds. Restricting the search to a constant distance works well in my application, so I will use this as the solution.
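Put together, the accepted approach looks roughly like this (a sketch; 10000 is the search radius in the units of SRID 3857):
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1)
  AND st_dwithin(t.geom, st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857), 10000)
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;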

Postgresql LIKE ANY versus LIKE

I've tried to be thorough in this question, so if you're impatient, just jump to the end to see what the actual question is...
I'm working on adjusting how some search features in one of our databases are implemented. To this end, I'm adding some wildcard capabilities to our application's API that interfaces back to PostgreSQL.
The issue I've found is that the EXPLAIN ANALYZE times do not make sense to me, and I'm trying to figure out where I could be going wrong; it doesn't seem likely that 15 queries are better than just one optimized query!
The table, words, has two relevant columns for this question: id and text. The text column has an index on it that was built with the text_pattern_ops option. Here's what I'm seeing:
First, using a LIKE ANY with a VALUES clause, which some references seem to indicate would be ideal in my case (finding multiple words):
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY (values('test%'));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6716668.40..6727372.85 rows=1070445 width=4) (actual time=103088.381..103091.468 rows=256 loops=1)
Group Key: words.id
-> Nested Loop Semi Join (cost=0.00..6713992.29 rows=1070445 width=4) (actual time=0.670..103087.904 rows=256 loops=1)
Join Filter: ((words.text)::text ~~ "*VALUES*".column1)
Rows Removed by Join Filter: 214089311
-> Seq Scan on words (cost=0.00..3502655.91 rows=214089091 width=21) (actual time=0.017..25232.135 rows=214089567 loops=1)
-> Materialize (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=214089567)
-> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=1)
Planning time: 0.226 ms
Execution time: 103106.296 ms
(10 rows)
As you can see, the execution time is horrendous.
A second attempt, using LIKE ANY(ARRAY[... yields:
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY(ARRAY['test%']);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=3770401.08..3770615.17 rows=21409 width=4) (actual time=37399.573..37399.704 rows=256 loops=1)
Group Key: id
-> Seq Scan on words (cost=0.00..3770347.56 rows=21409 width=4) (actual time=0.224..37399.054 rows=256 loops=1)
Filter: ((text)::text ~~ ANY ('{test%}'::text[]))
Rows Removed by Filter: 214093922
Planning time: 0.611 ms
Execution time: 37399.895 ms
(7 rows)
As you can see, performance is dramatically improved, but still far from ideal: 37 seconds with one word in the list. Moving that up to three words that return a total of 256 rows pushes the execution time to well over 100 seconds.
The last try, doing a LIKE for a single word:
events_prod=# explain analyze select distinct id from words where words.text LIKE 'test%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=60.14..274.23 rows=21409 width=4) (actual time=1.437..1.576 rows=256 loops=1)
Group Key: id
-> Index Scan using words_special_idx on words (cost=0.57..6.62 rows=21409 width=4) (actual time=0.048..1.258 rows=256 loops=1)
Index Cond: (((text)::text ~>=~ 'test'::text) AND ((text)::text ~<~ 'tesu'::text))
Filter: ((text)::text ~~ 'test%'::text)
Planning time: 0.826 ms
Execution time: 1.858 ms
(7 rows)
As expected, this is the fastest, but the 1.85ms makes me wonder if there is something else I'm missing with the VALUES and ARRAY approach.
The Question
Is there some more efficient way to do something like this in Postgresql that I've missed in my research?
select distinct id
from words
where words.text LIKE ANY(ARRAY['word1%', 'another%', 'third%']);
This is a bit speculative. I think the key is your pattern:
where words.text LIKE 'test%'
Note that the like pattern starts with a constant string. This means that Postgres can do a range scan on the index for the words that start with 'test'.
When you then introduce multiple comparisons, the optimizer gets confused and no longer considers multiple range scans. Instead, it decides that it needs to process all the rows.
This may be a case where this re-write gives you the performance that you want:
select id
from words
where words.text LIKE 'word1%'
union
select id
from words
where words.text LIKE 'another%'
union
select id
from words
where words.text LIKE 'third%';
Notes:
The distinct is not needed because of the union.
If the pattern starts with a wildcard, then a full scan is needed anyway.
You might want to consider an n-gram or full-text index on the table.
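For the n-gram option, a hedged sketch using pg_trgm (the index will be large on 214 million rows):
-- hypothetical trigram index; a GIN trigram index can serve LIKE/ILIKE patterns,
-- including ones that do not start with a constant prefix
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX words_text_trgm_idx ON words USING gin (text gin_trgm_ops);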

How can I tune the PostgreSQL query optimizer when using SQL logical 'OR' and operator 'LIKE'?

How can I tune the PostgreSQL query plan (or the SQL query itself) to make more optimal use of available indexes when the query contains an SQL OR condition that uses the LIKE operator instead of =?
For example consider the following query that takes just 5 milliseconds to execute:
explain analyze select *
from report_workflow.request_attribute
where domain = 'externalId'
and scoped_id_value[2] = 'G130324135454100';
"Bitmap Heap Scan on request_attribute (cost=44.94..6617.85 rows=2081 width=139) (actual time=4.619..4.619 rows=2 loops=1)"
" Recheck Cond: (((scoped_id_value[2])::text = 'G130324135454100'::text) AND ((domain)::text = 'externalId'::text))"
" -> Bitmap Index Scan on request_attribute_accession_number (cost=0.00..44.42 rows=2081 width=0) (actual time=3.777..3.777 rows=2 loops=1)"
" Index Cond: ((scoped_id_value[2])::text = 'G130324135454100'::text)"
"Total runtime: 5.059 ms"
As the query plan shows, this query takes advantage of partial index request_attribute_accession_number and index condition scoped_id_value[2] = 'G130324135454100'. Index request_attribute_accession_number has the following definition:
CREATE INDEX request_attribute_accession_number
ON report_workflow.request_attribute((scoped_id_value[2]))
WHERE domain = 'externalId';
(Note that column scoped_id_value in table request_attribute has type character varying[].)
However, when I add to the same query an extra OR condition that uses the same array column element scoped_id_value[2], but with the LIKE operator instead of =, the query, despite producing the same result as the first query, now takes 7553 ms:
explain analyze select *
from report_workflow.request_attribute
where domain = 'externalId'
and (scoped_id_value[2] = 'G130324135454100'
or scoped_id_value[2] like '%G130324135454100%');
"Bitmap Heap Scan on request_attribute (cost=7664.77..46768.27 rows=2122 width=139) (actual time=142.164..7552.650 rows=2 loops=1)"
" Recheck Cond: ((domain)::text = 'externalId'::text)"
" Rows Removed by Index Recheck: 1728712"
" Filter: (((scoped_id_value[2])::text = 'G130324135454100'::text) OR ((scoped_id_value[2])::text ~~ '%G130324135454100%'::text))"
" Rows Removed by Filter: 415884"
" -> Bitmap Index Scan on request_attribute_accession_number (cost=0.00..7664.24 rows=416143 width=0) (actual time=136.249..136.249 rows=415886 loops=1)"
"Total runtime: 7553.154 ms"
Notice that this time the query optimizer ignores index condition scoped_id_value[2] = 'G130324135454100' when it performs the inner bitmap index scan using index request_attribute_accession_number and consequently generates 415,886 rows instead of just two as did the first query.
When introducing the OR condition with the LIKE operator into this second query, why does the optimizer produce a much less optimal query plan than for the first? How can I tune the query optimizer or the query to perform more like the first query?
In the second plan, you have:
scoped_id_value[2] like '%G130324135454100%'
Neither Postgres nor any other database can use the index to solve this. Where would it look in the index? It doesn't even know where to start, so it has to do a full table scan.
You can handle this, for this one case, by building an index on an expression (see here). However, that would be very specific to the string 'G130324135454100'.
I should add, the problem is not the like. Postgres would use the index on:
scoped_id_value[2] like 'G130324135454100%'
It is not possible to short circuit this expression:
scoped_id_value[2] = 'G130324135454100'
or scoped_id_value[2] like '%G130324135454100%'
into this one:
scoped_id_value[2] = 'G130324135454100'
because it would not catch the cases where there are characters before or after that which would be matched with:
scoped_id_value[2] like '%G130324135454100%'
The only short circuit possible would be to keep just the last one, and only if Postgres realizes that the core string (between the %s) in the last one is the same as the string in the previous one.
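In other words, the only rewrite that preserves the result is to keep just the LIKE, since '%G130324135454100%' already matches the exact value:
select *
from report_workflow.request_attribute
where domain = 'externalId'
  and scoped_id_value[2] like '%G130324135454100%';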

How to optimize my PostgreSQL DB for prefix search?

I have a table called "nodes" with roughly 1.7 million rows in my PostgreSQL db
=#\d nodes
Table "public.nodes"
Column | Type | Modifiers
--------+------------------------+-----------
id | integer | not null
title | character varying(256) |
score | double precision |
Indexes:
"nodes_pkey" PRIMARY KEY, btree (id)
I want to use information from that table for autocompletion of a search field, showing the user a list of the ten titles having the highest score fitting to his input. So I used this query (here searching for all titles starting with "s")
=# explain analyze select title,score from nodes where title ilike 's%' order by score desc;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Sort (cost=64177.92..64581.38 rows=161385 width=25) (actual time=4930.334..5047.321 rows=161264 loops=1)
Sort Key: score
Sort Method: external merge Disk: 5712kB
-> Seq Scan on nodes (cost=0.00..46630.50 rows=161385 width=25) (actual time=0.611..4464.413 rows=161264 loops=1)
Filter: ((title)::text ~~* 's%'::text)
Total runtime: 5260.791 ms
(6 rows)
This was much too slow for use with autocomplete. With some information from Using PostgreSQL in Web 2.0 Applications I was able to improve that with a special index:
=# create index title_idx on nodes using btree(lower(title) text_pattern_ops);
=# explain analyze select title,score from nodes where lower(title) like lower('s%') order by score desc limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18122.41..18122.43 rows=10 width=25) (actual time=1324.703..1324.708 rows=10 loops=1)
-> Sort (cost=18122.41..18144.60 rows=8876 width=25) (actual time=1324.700..1324.702 rows=10 loops=1)
Sort Key: score
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on nodes (cost=243.53..17930.60 rows=8876 width=25) (actual time=96.124..1227.203 rows=161264 loops=1)
Filter: (lower((title)::text) ~~ 's%'::text)
-> Bitmap Index Scan on title_idx (cost=0.00..241.31 rows=8876 width=0) (actual time=90.059..90.059 rows=161264 loops=1)
Index Cond: ((lower((title)::text) ~>=~ 's'::text) AND (lower((title)::text) ~<~ 't'::text))
Total runtime: 1325.085 ms
(9 rows)
So this gave me a speedup of a factor of 4. But can this be further improved? What if I want to use '%s%' instead of 's%'? Do I have any chance of getting decent performance with PostgreSQL in that case, too? Or should I rather try a different solution (Lucene? Sphinx?) for implementing my autocomplete feature?
You will need a text_pattern_ops index if you're not in the C locale.
See: index types.
Tips for further investigation:
Partition the table on the title key. This makes the lists that Postgres needs to work with smaller.
Give PostgreSQL more memory so the cache hit rate is > 98%. This table will take about 0.5 GB; I think 2 GB should be no problem nowadays. Make sure statistics collection is enabled and read up on the pg_stats tables.
Make a second table with a reduced substring of the title, e.g. 12 characters, so the complete table fits in fewer database blocks. An index on a substring may also work, but requires careful querying.
The longer the substring, the faster the query will run. Create a separate table for small substrings, and store as the value the top ten (or however many) choices you would want to show (see the sketch after this list). There are about 20000 combinations of 1, 2, and 3 character strings.
You can use the same idea if you want to have %abc% queries, but probably switching to Lucene makes sense now.
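A hedged sketch of that substring/top-N idea (table and column names here are made up, not taken from the question):
-- precompute the ten highest-scoring titles per short lower-cased prefix
CREATE TABLE title_prefixes AS
SELECT lower(left(title, 3)) AS prefix,
       (array_agg(title ORDER BY score DESC))[1:10] AS top_titles
FROM nodes
GROUP BY lower(left(title, 3));
CREATE UNIQUE INDEX title_prefixes_prefix_idx ON title_prefixes (prefix);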
You're obviously not interested in 150000+ results, so you should limit them:
select title,score
from nodes
where title ilike 's%'
order by score desc
limit 10;
You can also consider creating a functional index and using ">=" and "<":
create index nodes_title_lower_idx on nodes (lower(title));
select title,score
from nodes
where lower(title)>='s' and lower(title)<'t'
order by score desc
limit 10;
You should also create an index on score, which will help in the ilike '%s%' case.
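For example (a sketch; a descending index matches the order by score desc ... limit 10 pattern):
create index nodes_score_idx on nodes (score desc);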