Why does my query using LIKE perform a seq scan? - sql

I have a table with 2.2 million rows.
Table "public.index"
Column | Type | Modifiers
-----------+-----------------------------+-----------------------------------------------------
fid | integer | not null default nextval('index_fid_seq'::regclass)
location | character varying |
Indexes:
"index_pkey" PRIMARY KEY, btree (fid)
"location_index" btree (location text_pattern_ops)
The location is the full path to a file, but I need to query using the name of the folder the file is located in. That folder name is unique in the table.
To avoid a % at the beginning of the pattern, I search for the full path, which I know:
select fid from index where location like '/path/to/folder/%'
Explain Analyze:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on index (cost=0.00..120223.34 rows=217 width=4) (actual time=1181.701..1181.701 rows=0 loops=1)
Filter: ((location)::text ~~ '/path/to/folder/%'::text)
Rows Removed by Filter: 2166034
Planning time: 0.954 ms
Execution time: 1181.748 ms
(5 rows)
The question is not how to find a workaround, because I have already found one for my case:
When creating a foldername_index:
create index foldername_index on index (substring(location, '(?<=/path/to/)[^\/]*'));
I can successfully use the folder name to query:
explain analyze select fid from index where substring(location, '(?<=/path/to/)[^\/]*') = 'foldername';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on index (cost=600.49..31524.74 rows=10830 width=12) (actual time=0.030..0.030 rows=1 loops=1)
Recheck Cond: ("substring"((location)::text, '(?<=/path/to/)[^\/]*'::text) = 'folder_name'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on foldername_index (cost=0.00..597.78 rows=10830 width=0) (actual time=0.023..0.023 rows=1 loops=1)
Index Cond: ("substring"((location)::text, '(?<=/path/to/)[^\/]*'::text) = 'folder_name'::text)
Planning time: 0.115 ms
Execution time: 0.059 ms
(7 rows)
I have followed the PostgreSQL FAQ:
When using wild-card operators such as LIKE or ~, indexes can only be
used in certain circumstances:
The beginning of the search string must be anchored to the start of the string, i.e.
LIKE patterns must not start with % or _.
The search string can not start with a character class, e.g. [a-e].
Neither of these is the case in my query.
C locale must be used during initdb because sorting in a non-C locale often doesn't match the behavior of LIKE. You can create a special text_pattern_ops index that will work in such cases, but note it is only helpful for LIKE indexing.
I have C Locale:
# show LC_COLLATE;
lc_collate
------------
C
(1 row)
I also followed the instructions from this great answer here on Stack Overflow, which is why I use text_pattern_ops; it did not change anything. Unfortunately, I cannot install new modules.
So: Why does my query perform a seq scan?

I have found the solution myself by thinking about it over and over again. Although it might be obvious to some people, it may help others:
/path/to/folder is actually /the_path/to/folder/ (there are underscores in the path). But _ is a wildcard in SQL: it matches any single character, just as % matches any sequence.
select fid from index where location like '/the_path/to/folder/%'
uses a seq scan because the index cannot filter out any rows: _ is a wildcard, so the usable literal prefix ends just before the underscore, and that prefix is the same for all rows.
select fid from index where location like '/the\_path/to/folder/%'
uses an index scan.
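More generally, when a literal path is interpolated into a LIKE pattern, its wildcard characters can be escaped first. A minimal sketch (the nested replace() expression is illustrative and not part of the original question; it assumes standard_conforming_strings = on):
-- escape backslash first, then % and _, so the path matches literally;
-- '\' is the default ESCAPE character for LIKE
select fid
from index
where location like replace(replace(replace('/the_path/to/folder/', '\', '\\'),
                                    '%', '\%'),
                            '_', '\_') || '%';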

Related

How to optimize large database query accessed with an item ID?

I'm making a site that stores a large amount of data (8 data points for 313 item_ids every 10 seconds over 24 hr) and I serve that data to users on demand. The request is supplied with an item ID with which I query the database with something along the lines of SELECT * FROM current_day_data WHERE item_id = <supplied ID> (assuming the id is valid).
CREATE TABLE current_day_data (
"time" bigint,
item_id text NOT NULL,
-- some data,
id integer NOT NULL
);
CREATE INDEX item_id_lookup ON public.current_day_data USING btree (item_id);
This works fine, but the request takes about a third of a second, so I'm looking into either other database options to help optimize this, or some way to optimize the query itself.
My current setup is a PostgreSQL database with an index on the item ID column, but I feel like there are options in the realm of NoSQL (an area I'm unfamiliar with) due to its similarity to a hash table.
My ideal solution would be a hash table with the item IDs as the key and the data as a JSON-like object but I don't know what options could achieve that.
tl;dr how to optimize SELECT * FROM current_day_data WHERE item_id = <supplied ID> through better querying or new database solution?
edit: here's the EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM current_day_data
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Seq Scan on current_day_data (cost=0.00..46811.09 rows=2584364 width=75) (actual time=0.013..291.667 rows=2700251 loops=1)
Buffers: shared hit=39058
Planning:
Buffers: shared hit=112
Planning Time: 0.584 ms
Execution Time: 446.622 ms
(6 rows)
EXPLAIN with a specified item id: EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM current_day_data WHERE item_id = 'SUGAR_CANE';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on current_day_data (cost=33.40..12099.27 rows=8592 width=75) (actual time=2.949..12.236 rows=8627 loops=1)
Recheck Cond: (item_id = 'SUGAR_CANE'::text)
Heap Blocks: exact=8570
Buffers: shared hit=8619
-> Bitmap Index Scan on item_id_lookup (cost=0.00..32.97 rows=8592 width=0) (actual time=1.751..1.751 rows=8665 loops=1)
Index Cond: (item_id = 'SUGAR_CANE'::text)
Buffers: shared hit=12
Planning:
Buffers: shared hit=68
Planning Time: 0.339 ms
Execution Time: 12.686 ms
(11 rows)
Now this says 12.7 ms, which makes me think the 300 ms has something to do with the library I'm using (SQLAlchemy), but that wouldn't really make sense since it's a popular library. More specifically, the line I'm using is:
results = CurrentDayData.query.filter(CurrentDayData.item_id == item_id).all()
That's a very simple query that already uses an index, so the only way to possibly speed it up would be to improve the specification of your hardware.
Moving to a different form of database, on the same hardware, is not going to make a significant difference in performance to this type of query.
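To separate server-side execution time from client-side overhead (driver, ORM object construction, transferring roughly 8,600 rows), one rough check is to time the same statement directly in psql and compare it with the application-side measurement; a sketch:
-- \timing reports the full round trip as seen by psql,
-- while EXPLAIN (ANALYZE) reports server-side execution only
\timing on
SELECT * FROM current_day_data WHERE item_id = 'SUGAR_CANE';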

Why is PostgreSQL 11 optimizer refusing best plan which would use an index w/ included column?

PostgreSQL 11 isn't smart enough to use indexes with included columns?
CREATE INDEX organization_locations__org_id_is_headquarters__inc_location_id_ix
ON organization_locations(org_id, is_headquarters) INCLUDE (location_id);
ANALYZE organization_locations;
ANALYZE organizations;
EXPLAIN VERBOSE
SELECT location_id
FROM organization_locations ol
WHERE org_id = (SELECT id FROM organizations WHERE code = 'akron')
AND is_headquarters = 1;
QUERY PLAN
Seq Scan on organization_locations ol (cost=8.44..14.61 rows=1 width=4)
Output: ol.location_id
Filter: ((ol.org_id = $0) AND (ol.is_headquarters = 1))
InitPlan 1 (returns $0)
-> Index Scan using organizations__code_ux on organizations (cost=0.42..8.44 rows=1 width=4)
Output: organizations.id
Index Cond: ((organizations.code)::text = 'akron'::text)
There are only 211 rows currently in organization_locations, average row length 91 bytes.
I get that only one data page is loaded. But the I/O would be the same to grab the index page, and the target data would be right there (no extra lookup into the data page from the index). What is PG thinking with this plan?
This just creates a TODO for me to round back and check to make sure the right plan starts getting generated once the table burgeons.
EDIT: Here is the explain with buffers:
Seq Scan on organization_locations ol (cost=8.44..14.33 rows=1 width=4) (actual time=0.018..0.032 rows=1 loops=1)
Filter: ((org_id = $0) AND (is_headquarters = 1))
Rows Removed by Filter: 210
Buffers: shared hit=7
InitPlan 1 (returns $0)
-> Index Scan using organizations__code_ux on organizations (cost=0.42..8.44 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1)
Index Cond: ((code)::text = 'akron'::text)
Buffers: shared hit=4
Planning Time: 0.402 ms
Execution Time: 0.048 ms
Reading one index page is not cheaper than reading a table page, so with tiny tables you cannot expect a gain from an index-only scan.
Besides, did you
VACUUM organization_locations;
Without that, the visibility map won't show that the table block is all-visible, so you cannot get an index-only scan no matter what.
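As a quick sanity check (not part of the original answer), the visibility map state can be inspected via pg_class; after a VACUUM, relallvisible should be close to relpages:
-- if relallvisible is 0, an index-only scan would still need heap visits
SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'organization_locations';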
In addition to the other answers, this is probably a silly index to have in the first place. INCLUDE is good when you need a unique index but you also want to tack on a column which is not part of the unique constraint, or when the included column doesn't have btree operators and so can't be in the main body of the index. In other cases, you should just put the extra column in the index itself.
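Following that advice, a sketch of the plain composite index (the index name is made up for illustration):
-- location_id in the key columns works just as well for an index-only scan here
CREATE INDEX organization_locations__org_id_hq_location_id_ix
ON organization_locations (org_id, is_headquarters, location_id);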
This just creates a TODO for me to round back and check to make sure the right plan starts getting generated once the table burgeons.
This is your workflow problem that you can't expect PostgreSQL to solve for you. Do you really think PostgreSQL should create actual plans based on imaginary scenarios?

Poor regex performance for word matching on postgres

I have a list of blocked phrases and I want to check for the existence of those phrases in user-entered text, but performance is very bad.
I am using this query :
SELECT value FROM blocked_items WHERE lower(unaccent( 'my input text' )) ~* ('[[:<:]]' || value || '[[:>:]]') LIMIT 1;
After my investigation I found out that the word boundaries [[:<:]] and [[:>:]] perform very badly, given that blocked_items has 24k records in it.
For instance when I try to run this one:
SELECT value FROM blocked_items WHERE lower(unaccent( 'my input text ' )) ilike ('%' || value || '%') LIMIT 1;
it's very fast compared to the first one. The problem is that I need to keep the test for word boundaries.
This check is performed frequently in a large program, so performance is very important to me.
Do you guys have any suggestions for making this faster?
(EXPLAIN ANALYZE output was attached as a screenshot)
Since you know that the LIKE (~~) query is fast and the RegEx (~) query is slow, the easiest solution is to combine both conditions (\m / \M are equivalent to [[:<:]] / [[:>:]]):
SELECT value FROM blocked_items
WHERE lower(unaccent('my input text')) ~~ ('%'||value||'%')
AND lower(unaccent('my input text')) ~ ('\m'||value||'\M')
LIMIT 1;
This way the fast query condition filters out most of the rows and then the slow condition discards the remaining ones.
I am using the faster case-sensitive operators, assuming that value is already normalized. If that is not the case, drop the (then redundant) lower() and use the case-insensitive versions as in the original queries.
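For completeness, a sketch of that case-insensitive variant (assuming value may contain uppercase characters but is already unaccented):
-- ~~* is ILIKE and ~* is the case-insensitive regex match; lower() is no longer needed
SELECT value FROM blocked_items
WHERE unaccent('my input text') ~~* ('%'||value||'%')
AND unaccent('my input text') ~* ('\m'||value||'\M')
LIMIT 1;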
On my test set with 370k rows that speeds up the query from 6s (warm) to 90ms:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1651.85 rows=1 width=10) (actual time=89.702..89.702 rows=1 loops=1)
-> Seq Scan on blocked_items (cost=0.00..14866.61 rows=9 width=10) (actual time=89.701..89.701 rows=1 loops=1)
Filter: ((lower(unaccent('my input text'::text)) ~~ (('%'::text || value) || '%'::text)) AND (lower(unaccent('my input text'::text)) ~ (('\m'::text || value) || '\M'::text)))
Rows Removed by Filter: 153281
Planning Time: 0.097 ms
Execution Time: 89.717 ms
(6 rows)
However, we are still doing a full table scan, and performance will vary based on the position of the match in the table.
Ideally we can answer the query in near constant time by using an index.
Let's rewrite the query to use Text Search Functions and Operators:
SELECT value FROM blocked_items
WHERE to_tsvector('simple', unaccent('my input text'))
@@ phraseto_tsquery('simple', value)
LIMIT 1;
First we split the input into search vectors and then check if the blocked phrase matches those vectors.
This takes about 440ms for the test query – a step back from our combined query:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..104.01 rows=1 width=10) (actual time=437.761..437.761 rows=1 loops=1)
-> Seq Scan on blocked_items (cost=0.00..192516.05 rows=1851 width=10) (actual time=437.760..437.760 rows=1 loops=1)
Filter: (to_tsvector('simple'::regconfig, unaccent('my input text'::text)) @@ phraseto_tsquery('simple'::regconfig, value))
Rows Removed by Filter: 153281
Planning Time: 0.063 ms
Execution Time: 437.772 ms
(6 rows)
Since we can't use tsvector @@ tsquery for indexing the tsquery side, we can rewrite the query again to check whether the blocked phrase is contained in the input phrase using the tsquery @> tsquery Text Search Operator, which can then be indexed using tsquery_ops from the GiST Operator Classes:
CREATE INDEX blocked_items_search ON blocked_items
USING gist (phraseto_tsquery('simple', value));
ANALYZE blocked_items; -- update query planner stats
SELECT value FROM blocked_items
WHERE phraseto_tsquery('simple', unaccent('my input text'))
@> phraseto_tsquery('simple', value)
LIMIT 1;
The query can now use an index scan and takes 20ms with the same data.
Since GiST is a lossy index, query times can vary a bit depending on how many rechecks are required:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.54..4.23 rows=1 width=10) (actual time=19.215..19.215 rows=1 loops=1)
-> Index Scan using blocked_items_search on blocked_items (cost=0.54..1367.01 rows=370 width=10) (actual time=19.214..19.214 rows=1 loops=1)
Index Cond: (phraseto_tsquery('simple'::regconfig, value) <@ phraseto_tsquery('simple'::regconfig, unaccent('my input text'::text)))
Rows Removed by Index Recheck: 4028
Planning Time: 0.093 ms
Execution Time: 19.236 ms
(6 rows)
One great advantage of using full text search is that you can now use language-specific matching of words by using search configurations (regconfig).
The above queries all use the default 'simple' regconfig to match the behavior of the original query. By switching to 'english' you can also match variations of the same word, like cat and cats (stemming), and common words without significance, like the or my, will be ignored (stop words).
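A sketch of what that could look like, with a second (hypothetical) index built for the 'english' configuration; the index expression and the query must use the same regconfig:
CREATE INDEX blocked_items_search_en ON blocked_items
USING gist (phraseto_tsquery('english', value));

-- 'cats' in the input would now also match a blocked phrase containing 'cat'
SELECT value FROM blocked_items
WHERE phraseto_tsquery('english', unaccent('my input text'))
@> phraseto_tsquery('english', value)
LIMIT 1;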

Postgresql LIKE ANY versus LIKE

I've tried to be thorough in this question, so if you're impatient, just jump to the end to see what the actual question is...
I'm working on adjusting how some search features in one of our databases are implemented. To this end, I'm adding some wildcard capabilities to our application's API that interfaces back to Postgresql.
The issue I've found is that the EXPLAIN ANALYZE times do not make sense to me, and I'm trying to figure out where I could be going wrong; it doesn't seem likely that 15 separate queries would be better than just one optimized query!
The table, Words, has two relevant columns for this question: id and text. The text column has an index on it that was built with the text_pattern_ops option. Here's what I'm seeing:
First, using a LIKE ANY with a VALUES clause, which some references seem to indicate would be ideal in my case (finding multiple words):
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY (values('test%'));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6716668.40..6727372.85 rows=1070445 width=4) (actual time=103088.381..103091.468 rows=256 loops=1)
Group Key: words.id
-> Nested Loop Semi Join (cost=0.00..6713992.29 rows=1070445 width=4) (actual time=0.670..103087.904 rows=256 loops=1)
Join Filter: ((words.text)::text ~~ "*VALUES*".column1)
Rows Removed by Join Filter: 214089311
-> Seq Scan on words (cost=0.00..3502655.91 rows=214089091 width=21) (actual time=0.017..25232.135 rows=214089567 loops=1)
-> Materialize (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=214089567)
-> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=1)
Planning time: 0.226 ms
Execution time: 103106.296 ms
(10 rows)
As you can see, the execution time is horrendous.
A second attempt, using LIKE ANY(ARRAY[... yields:
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY(ARRAY['test%']);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=3770401.08..3770615.17 rows=21409 width=4) (actual time=37399.573..37399.704 rows=256 loops=1)
Group Key: id
-> Seq Scan on words (cost=0.00..3770347.56 rows=21409 width=4) (actual time=0.224..37399.054 rows=256 loops=1)
Filter: ((text)::text ~~ ANY ('{test%}'::text[]))
Rows Removed by Filter: 214093922
Planning time: 0.611 ms
Execution time: 37399.895 ms
(7 rows)
As you can see, performance is dramatically improved, but still far from ideal: 37 seconds with one word in the list. Moving that up to three words that return a total of 256 rows changes the execution time to well over 100 seconds.
The last try, doing a LIKE for a single word:
events_prod=# explain analyze select distinct id from words where words.text LIKE 'test%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=60.14..274.23 rows=21409 width=4) (actual time=1.437..1.576 rows=256 loops=1)
Group Key: id
-> Index Scan using words_special_idx on words (cost=0.57..6.62 rows=21409 width=4) (actual time=0.048..1.258 rows=256 loops=1)
Index Cond: (((text)::text ~>=~ 'test'::text) AND ((text)::text ~<~ 'tesu'::text))
Filter: ((text)::text ~~ 'test%'::text)
Planning time: 0.826 ms
Execution time: 1.858 ms
(7 rows)
As expected, this is the fastest, but the 1.85ms makes me wonder if there is something else I'm missing with the VALUES and ARRAY approach.
The Question
Is there some more efficient way to do something like this in Postgresql that I've missed in my research?
select distinct id
from words
where words.text LIKE ANY(ARRAY['word1%', 'another%', 'third%']);
This is a bit speculative. I think the key is your pattern:
where words.text LIKE 'test%'
Note that the like pattern starts with a constant string. That means that Postgres can do a range scan on the index for the words that start with 'test'.
When you then introduce multiple comparisons, the optimizer gets confused and no longer considers multiple range scans. Instead, it decides that it needs to process all the rows.
This may be a case where this re-write gives you the performance that you want:
select id
from words
where words.text LIKE 'word1%'
union
select id
from words
where words.text LIKE 'another%'
union
select id
from words
where words.text LIKE 'third%';
Notes:
The distinct is not needed because of the union.
If the pattern starts with a wildcard, then a full scan is needed anyway.
You might want to consider an n-gram (trigram) or full-text index on the table; a sketch of the trigram option follows below.
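A sketch of the trigram option, assuming the pg_trgm extension can be installed (the index name is illustrative):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- a GIN trigram index supports LIKE/ILIKE even without a left-anchored pattern,
-- and each branch of the union rewrite above can use it
CREATE INDEX words_text_trgm_idx ON words USING gin (text gin_trgm_ops);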

How to optimize my PostgreSQL DB for prefix search?

I have a table called "nodes" with roughly 1.7 million rows in my PostgreSQL db
=#\d nodes
Table "public.nodes"
Column | Type | Modifiers
--------+------------------------+-----------
id | integer | not null
title | character varying(256) |
score | double precision |
Indexes:
"nodes_pkey" PRIMARY KEY, btree (id)
I want to use information from that table for autocompletion of a search field, showing the user a list of the ten titles with the highest score that match his input. So I used this query (here searching for all titles starting with "s"):
=# explain analyze select title,score from nodes where title ilike 's%' order by score desc;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Sort (cost=64177.92..64581.38 rows=161385 width=25) (actual time=4930.334..5047.321 rows=161264 loops=1)
Sort Key: score
Sort Method: external merge Disk: 5712kB
-> Seq Scan on nodes (cost=0.00..46630.50 rows=161385 width=25) (actual time=0.611..4464.413 rows=161264 loops=1)
Filter: ((title)::text ~~* 's%'::text)
Total runtime: 5260.791 ms
(6 rows)
This was much too slow to use for autocomplete. With some information from Using PostgreSQL in Web 2.0 Applications I was able to improve that with a special index:
=# create index title_idx on nodes using btree(lower(title) text_pattern_ops);
=# explain analyze select title,score from nodes where lower(title) like lower('s%') order by score desc limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18122.41..18122.43 rows=10 width=25) (actual time=1324.703..1324.708 rows=10 loops=1)
-> Sort (cost=18122.41..18144.60 rows=8876 width=25) (actual time=1324.700..1324.702 rows=10 loops=1)
Sort Key: score
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on nodes (cost=243.53..17930.60 rows=8876 width=25) (actual time=96.124..1227.203 rows=161264 loops=1)
Filter: (lower((title)::text) ~~ 's%'::text)
-> Bitmap Index Scan on title_idx (cost=0.00..241.31 rows=8876 width=0) (actual time=90.059..90.059 rows=161264 loops=1)
Index Cond: ((lower((title)::text) ~>=~ 's'::text) AND (lower((title)::text) ~<~ 't'::text))
Total runtime: 1325.085 ms
(9 rows)
So this gave me a speedup of a factor of 4. But can this be improved further? What if I want to use '%s%' instead of 's%'? Do I have any chance of getting decent performance with PostgreSQL in that case, too? Or should I rather try a different solution (Lucene? Sphinx?) for implementing my autocomplete feature?
You will need a text_pattern_ops index if you're not in the C locale.
See: index types.
Tips for further investigation:
Partition the table on the title key. This makes the lists that Postgres needs to work with smaller.
Give PostgreSQL more memory so the cache hit rate is > 98%. This table will take about 0.5 GB; I think 2 GB should be no problem nowadays. Make sure statistics collection is enabled and read up on the pg_stats view.
Make a second table with a reduced substring of the title, e.g. 12 characters, so the complete table fits in fewer database blocks. An index on a substring may also work, but requires careful querying (see the sketch after these tips).
The longer the substring, the faster the query will run. Create a separate table for small substrings, and store as the value the top ten (or however many) choices you would want to show. There are about 20,000 combinations of 1-, 2- and 3-character strings.
You can use the same idea if you want to have %abc% queries, but probably switching to Lucene makes sense at that point.
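A hypothetical sketch of the substring-index idea from the tips above (the 12-character prefix and the index name are assumptions):
-- index only the first 12 lowercased characters of the title
create index nodes_title_prefix_idx on nodes (substring(lower(title), 1, 12) text_pattern_ops);

-- the query must use the exact same expression; this is only equivalent to the
-- original filter while the search prefix is at most 12 characters long
select title, score
from nodes
where substring(lower(title), 1, 12) like 's%'
order by score desc
limit 10;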
You're obviously not interested in 150000+ results, so you should limit them:
select title,score
from nodes
where title ilike 's%'
order by score desc
limit 10;
You can also consider creating a functional index, and using ">=" and "<":
create index nodes_title_lower_idx on nodes (lower(title));
select title,score
from nodes
where lower(title)>='s' and lower(title)<'t'
order by score desc
limit 10;
You should also create an index on score, which will help in the ilike '%s%' case; a sketch follows below.
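A minimal sketch of that idea (index name assumed): with an index on score, the planner can walk scores from highest to lowest, apply the pattern filter on the fly, and stop as soon as the ten rows for the LIMIT are found.
create index nodes_score_idx on nodes (score desc);

-- the planner can scan nodes_score_idx in order and filter with ilike
-- until the 10 rows for the limit have been collected
select title, score
from nodes
where title ilike '%s%'
order by score desc
limit 10;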