Following on from this answer, I want to know the best way to use PostgreSQL's built-in full text search if I want to sort by rank and limit the results to only rows that actually match the query.
Let's assume a very simple table.
CREATE TABLE pictures (
id SERIAL PRIMARY KEY,
title varchar(300),
...
)
or whatever. Now I want to search the title field. First I create an index:
CREATE INDEX pictures_title ON pictures
USING gin(to_tsvector('english', title));
Now I want to search for 'small dog'. This works:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), 'small dog'
) AS score
FROM pictures
ORDER BY score DESC
But what I really want is this:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures
WHERE to_tsvector('english', pictures.title) @@ to_tsquery('small dog')
ORDER BY score DESC
Or alternatively this (which doesn't work - can't use score in the WHERE clause):
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures WHERE score > 0
ORDER BY score DESC
What's the best way to do this? My questions are many-fold:
If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?
Is there a way to do it without repeating the to_ts... function calls?
Is there a way to use score in the WHERE clause at all?
If there is, would it be better to filter by score > 0 or use the @@ thing?
The use of the @@ operator will utilize the full text GIN index, while the test for score > 0 would not.
I created a table as in the Question, but added a column named title_tsv:
CREATE TABLE test_pictures (
id BIGSERIAL,
title text,
title_tsv tsvector
);
CREATE INDEX ix_pictures_title_tsv ON test_pictures
USING gin(title_tsv);
I populated the table with some test data:
INSERT INTO test_pictures(title, title_tsv)
SELECT T.data, to_tsvector(T.data)
FROM some_table T;
Then I ran the previously accepted answer with explain analyze:
EXPLAIN ANALYZE
SELECT score, id, title
FROM (
SELECT ts_rank_cd(P.title_tsv, to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
) S
WHERE score > 0
ORDER BY score DESC;
And got the following. Note the execution time of 5,015 ms
QUERY PLAN |
----------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=274895.48..323298.03 rows=414850 width=60) (actual time=5010.844..5011.330 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=273895.46..274414.02 rows=207425 width=60) (actual time=4994.539..4994.555 rows=492 loops=3) |
Sort Key: (ts_rank_cd(p.title_tsv, to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 131kB |
-> Parallel Seq Scan on test_pictures p (cost=0.00..247776.02 rows=207425 width=60) (actual time=17.672..4993.997 rows=492 loops=3) |
Filter: (ts_rank_cd(title_tsv, to_tsquery('address & shipping'::text)) > '0'::double precision) |
Rows Removed by Filter: 497296 |
Planning time: 0.159 ms |
Execution time: 5015.664 ms |
Now compare that with the @@ operator:
EXPLAIN ANALYZE
SELECT ts_rank_cd(to_tsvector(P.title), to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
WHERE P.title_tsv @@ to_tsquery('address & shipping')
ORDER BY score DESC;
And the results come in with an execution time of about 29 ms:
QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=13884.42..14288.35 rows=3462 width=60) (actual time=26.472..26.942 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=12884.40..12888.73 rows=1731 width=60) (actual time=17.507..17.524 rows=492 loops=3) |
Sort Key: (ts_rank_cd(to_tsvector(title), to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 171kB |
-> Parallel Bitmap Heap Scan on test_pictures p (cost=72.45..12791.29 rows=1731 width=60) (actual time=1.781..17.268 rows=492 loops=3) |
Recheck Cond: (title_tsv @@ to_tsquery('address & shipping'::text)) |
Heap Blocks: exact=625 |
-> Bitmap Index Scan on ix_pictures_title_tsv (cost=0.00..71.41 rows=4155 width=0) (actual time=3.765..3.765 rows=1477 loops=1) |
Index Cond: (title_tsv ## to_tsquery('address & shipping'::text)) |
Planning time: 0.214 ms |
Execution time: 28.995 ms |
As you can see in the execution plan, the index ix_pictures_title_tsv was used in the second query, but not in the first one, making the query with the @@ operator a whopping 172 times faster!
select *
from (
SELECT
pictures.id,
ts_rank_cd(to_tsvector('english', pictures.title),
to_tsquery('small dog')) AS score
FROM pictures
) s
WHERE score > 0
ORDER BY score DESC
If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?
The best way to notice these things is to run a simple EXPLAIN, although its output can be hard to read.
Long story short, yes, PostgreSQL is smart enough to reuse computed results.
Is there a way to do it without repeating the to_ts... function calls?
What I usually do is add a tsv column that holds the text search vector. If you keep it up to date with a trigger, the vector is immediately available, and a selective trigger also lets you update the search index only when the relevant columns actually change.
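For example, a minimal sketch using the built-in tsvector_update_trigger function (the title_tsv column, index name, and trigger name are illustrative, not from the question):
ALTER TABLE pictures ADD COLUMN title_tsv tsvector;
UPDATE pictures SET title_tsv = to_tsvector('english', title);
CREATE INDEX pictures_title_tsv_idx ON pictures USING gin(title_tsv);
-- keep title_tsv in sync on every INSERT/UPDATE
CREATE TRIGGER pictures_tsv_update
BEFORE INSERT OR UPDATE ON pictures
FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger(title_tsv, 'pg_catalog.english', title);
On PostgreSQL 12 and later, a stored generated column (GENERATED ALWAYS AS (to_tsvector('english', title)) STORED) achieves the same without a trigger.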
Is there a way to use score in the WHERE clause at all?
Yes, but not with that name.
Alternatively you could create a sub-query, but I would personally just repeat it.
If there is, would it be better to filter by score > 0 or use the @@ thing?
The simplest version I can think of is this:
SELECT *
FROM pictures
WHERE 'small dog' @@ text_search_vector
The text_search_vector could obviously be replaced with something like to_tsvector('english', pictures.title).
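Putting the pieces together, a sketch of the combined query (using plainto_tsquery, since to_tsquery('small dog') would reject two lexemes without an operator; the expression index from the question makes the @@ filter indexable):
SELECT p.id,
       ts_rank_cd(to_tsvector('english', p.title),
                  plainto_tsquery('english', 'small dog')) AS score
FROM pictures p
WHERE to_tsvector('english', p.title) @@ plainto_tsquery('english', 'small dog')
ORDER BY score DESC
LIMIT 10;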
Related
There is a table with 10 million records, and it has a column whose type is an array. It looks like this:
id | content | contained_special_ids
----------------------------------------
1 | abc | { 1, 2 }
2 | abd | { 1, 3 }
3 | abe | { 1, 4 }
4 | abf | { 3 }
5 | abg | { 2 }
6 | abh | { 3 }
and I want to know how many records there are whose contained_special_ids includes 3, so my SQL is:
select count(*) from my_table where contained_special_ids @> array[3]
It works fine when the data set is small, but it takes a long time (30+ seconds) when the table has 10 million records.
I have added an index on this column:
"index_my_table_on_contained_special_ids" gin (contained_special_ids)
So, how can I optimize this select count query?
Thanks a lot!
UPDATE
Below is the EXPLAIN ANALYZE output:
Finalize Aggregate (cost=1049019.17..1049019.18 rows=1 width=8) (actual time=44343.230..44362.224 rows=1 loops=1)
Output: count(*)
-> Gather (cost=1049018.95..1049019.16 rows=2 width=8) (actual time=44340.332..44362.217 rows=3 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=1048018.95..1048018.96 rows=1 width=8) (actual time=44337.615..44337.615 rows=1 loops=3)
Output: PARTIAL count(*)
Worker 0: actual time=44336.442..44336.442 rows=1 loops=1
Worker 1: actual time=44336.564..44336.564 rows=1 loops=1
-> Parallel Bitmap Heap Scan on public.my_table (cost=9116.31..1046912.22 rows=442694 width=0) (actual time=330.602..44304.221 rows=391431 loops=3)
Recheck Cond: (my_table.contained_special_ids @> '{12511}'::bigint[])
Rows Removed by Index Recheck: 501077
Heap Blocks: exact=67496 lossy=109789
Worker 0: actual time=329.547..44301.513 rows=409272 loops=1
Worker 1: actual time=329.794..44304.582 rows=378538 loops=1
-> Bitmap Index Scan on index_my_table_on_contained_special_ids (cost=0.00..8850.69 rows=1062465 width=0) (actual time=278.413..278.414 rows=1176563 loops=1)
Index Cond: (my_table.contained_special_ids @> '{12511}'::bigint[])
Planning Time: 1.041 ms
Execution Time: 44362.262 ms
Increase work_mem until the lossy blocks go away. Also, make sure the table is well vacuumed to support index-only bitmap scans, and that you are using a new enough version (which you should tell us) to support those. Finally, you can try increasing effective_io_concurrency.
Also, post plans as text, not images; and turn on track_io_timing.
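A sketch of those knobs (the values are only starting points that need tuning for the actual hardware; track_io_timing normally has to be enabled by a superuser or in postgresql.conf):
-- raise until the "lossy" heap blocks disappear from the bitmap heap scan
SET work_mem = '256MB';
-- helps prefetching during bitmap heap scans, mainly on SSD storage
SET effective_io_concurrency = 200;
-- keep the visibility map current
VACUUM (ANALYZE) my_table;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM my_table WHERE contained_special_ids @> array[3];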
There is no way to optimize such a query, due to two factors:
The use of a non-atomic value, which violates first normal form
The fact that PostgreSQL cannot compute aggregates quickly
On the first problem: first normal form
Every value in a table's columns must be atomic, and an array containing multiple values is of course not atomic.
No index will be efficient on such a column, because its type violates 1NF.
This can be remedied by using a separate table instead of an array.
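A minimal sketch of that normalization (table and column names are illustrative, and it assumes my_table.id is the primary key):
CREATE TABLE my_table_special_ids (
    my_table_id bigint NOT NULL REFERENCES my_table(id),
    special_id  bigint NOT NULL,
    PRIMARY KEY (my_table_id, special_id)
);
CREATE INDEX ON my_table_special_ids (special_id);
-- the count becomes a simple range scan on the second index
SELECT count(*) FROM my_table_special_ids WHERE special_id = 3;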
On the poor performance of PG's aggregates
PG uses an MVCC model that mixes phantom (dead) records and valid records in the same data pages, so to count the valid records it needs to read all the records one by one to distinguish the ones that should be counted from the ones that should not.
Most other DBMSs do not work the way PG does: Oracle and SQL Server, for example, do not keep phantom records inside the data pages, and some others store the exact count of valid rows in the page header.
As an example, read the tests I have done comparing COUNT and other aggregate functions between PG and SQL Server; some queries run 1,500 times faster on SQL Server.
SELECT
a.geom, 'tk' category,
ROUND(avg(tk), 1) tk
FROM
tb_grid_4326_100m a left outer join
(
SELECT
tk-273.15 tk, geom
FROM
tb_points
WHERE
hour = '23'
) b ON st_contains(a.geom, b.geom)
GROUP BY
a.geom
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Finalize GroupAggregate (cost=54632324.85..54648025.25 rows=50698 width=184) (actual time=8522.042..8665.129 rows=50698 loops=1) |
Group Key: a.geom |
-> Gather Merge (cost=54632324.85..54646504.31 rows=101396 width=152) (actual time=8522.032..8598.567 rows=50698 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Partial GroupAggregate (cost=54631324.83..54633800.68 rows=50698 width=152) (actual time=8490.577..8512.725 rows=16899 loops=3) |
Group Key: a.geom |
-> Sort (cost=54631324.83..54631785.36 rows=184212 width=130) (actual time=8490.557..8495.249 rows=16996 loops=3) |
Sort Key: a.geom |
Sort Method: external merge Disk: 2296kB |
Worker 0: Sort Method: external merge Disk: 2304kB |
Worker 1: Sort Method: external merge Disk: 2296kB |
-> Nested Loop Left Join (cost=0.41..54602621.56 rows=184212 width=130) (actual time=1.729..8475.942 rows=16996 loops=3) |
-> Parallel Seq Scan on tb_grid_4326_100m a (cost=0.00..5866.24 rows=21124 width=120) (actual time=0.724..2.846 rows=16899 loops=3) |
-> Index Scan using sidx_tb_points on tb_points (cost=0.41..2584.48 rows=10 width=42) (actual time=0.351..0.501 rows=1 loops=50698)|
Index Cond: (((hour)::text = '23'::text) AND (geom @ a.geom)) |
Filter: st_contains(a.geom, geom) |
Rows Removed by Filter: 0 |
Planning Time: 1.372 ms |
Execution Time: 8667.418 ms |
I want to join a 100m grid table and a 100,000-point table using the st_contains function.
The 100m grid table has 75,769 records, and tb_points table has 2,434,536 records.
When a time condition is given, the tb_points table returns about 100,000 records.
(As a result, about 75,000 records JOIN about 100,000 records.)
(Index information)
100m grid table using gist(geom),
tb_points table using gist(hour, geom)
It took 30 seconds. How can I improve the performance?
It is hard to give a definitive answer, but here are several things you can try:
For a multicolumn gist index, it is often a good idea to put the most selectively used column first. In your case, that would have the index be on (geom, hour), not (hour, geom). On the other hand, it can also be better to put the faster column first, and testing for scalar equality should be much faster than testing for containment. You would have to do the test and see which factor is more important for you.
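A sketch of the swapped index (assuming btree_gist is available, which the existing gist(hour, geom) index would already require for the scalar hour column; the index name is illustrative):
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX tb_points_geom_hour_gist ON tb_points USING gist (geom, hour);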
You could try for an index-only scan, which doesn't need to visit the table. That could save a lot of random IO. To do that you would need the index gist (hour, geom) INCLUDE (tk, geom). The geom column in a gist index is not considered to be "returnable", so it also needs to be put in the INCLUDE part in order to get the IOS.
Finally, you could partition the table tb_points on "hour". Then you wouldn't need to put "hour" into the gist index, as it is already fulfilled by the partitioning.
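A sketch of the partitioning approach (PostgreSQL 11 or later; the table names are illustrative and the existing data would still have to be migrated into the new table):
CREATE TABLE tb_points_part (LIKE tb_points INCLUDING DEFAULTS) PARTITION BY LIST (hour);
CREATE TABLE tb_points_part_23 PARTITION OF tb_points_part FOR VALUES IN ('23');
-- one partition per hour value; a plain spatial index is then enough
CREATE INDEX ON tb_points_part USING gist (geom);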
And these can be mixed and matched, so you could also swap the column order in the INCLUDE index, or you could try to get both partitioning and the INCLUDE index working together.
I have a large table over which I want to execute some window functions by scanning over an index, and I want to stop scanning and produce the row as soon as one of a number of conditions involving these aggregates holds (so WHERE ... LIMIT 1 is out of the question, since I can't have window functions inside the WHERE).
Let me expand further on my concrete case:
Here's my events table:
=> \d events
Table "public.events"
Column | Type | Collation | Nullable | Default
------------+-------------------+-----------+----------+---------
block | character varying | | not null |
chainid | bigint | | not null |
height | bigint | | not null |
idx | bigint | | not null |
module | character varying | | not null |
modulehash | character varying | | not null |
name | character varying | | not null |
params | jsonb | | not null |
paramtext | character varying | | not null |
qualname | character varying | | not null |
requestkey | character varying | | not null |
Indexes:
"events_pkey" PRIMARY KEY, btree (block, idx, requestkey)
"events_height_chainid_idx" btree (height DESC, chainid, idx)
After much experimentation, I've arrived at a query that returns exactly the row I want and it also produces exactly the query plan that I'm envisioning:
=> EXPLAIN ANALYZE SELECT *
FROM (
SELECT *
, ROW_NUMBER() OVER (ORDER BY height DESC, block, requestkey, idx) as scan_num
, count(*) FILTER (WHERE qualname ILIKE '%transfer%') OVER
( ORDER BY height DESC, block, requestkey, idx
ROWS BETWEEN unbounded PRECEDING AND CURRENT ROW
) AS foundCnt
FROM events
ORDER BY height DESC, block, requestkey, idx
) as scanned_events
WHERE foundCnt = 3 OR scan_num = 100000
LIMIT 1
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1065.81..1400.34 rows=1 width=397) (actual time=0.095..0.096 rows=1 loops=1)
-> Subquery Scan on scanned_events (cost=1065.81..165535223.46 rows=494824 width=397) (actual time=0.095..0.095 rows=1 loops=1)
Filter: ((scanned_events.foundcnt = 3) OR (scanned_events.scan_num = 100000))
Rows Removed by Filter: 2
-> WindowAgg (cost=1065.81..164791126.56 rows=49606460 width=397) (actual time=0.089..0.094 rows=3 loops=1)
-> WindowAgg (cost=1065.81..163550965.06 rows=49606460 width=389) (actual time=0.081..0.083 rows=4 loops=1)
-> Incremental Sort (cost=1065.81..162434819.71 rows=49606460 width=381) (actual time=0.076..0.076 rows=5 loops=1)
Sort Key: events.height DESC, events.block, events.requestkey, events.idx
Presorted Key: events.height
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 56kB Peak Memory: 56kB
-> Index Scan using events_height_chainid_idx on events (cost=0.56..158424783.98 rows=49606460 width=381) (actual time=0.015..0.035 rows=53 loops=1)
Planning Time: 0.112 ms
Execution Time: 0.128 ms
(13 rows)
Here's what this query is trying to achieve: scan through the events table counting the rows whose qualname contains 'transfer', and return the row as soon as you find the 3rd match OR you end up scanning 100000 rows.
So, my high-level intention is to look for some condition (involving a moving aggregate) but I want to put an upper bound on how many rows I'm willing to fetch. But if I happen to find what I'm looking for quickly, I also don't want to go through the rest of those 100000 rows unnecessarily (similar to the query plan above, where it ends up scanning just 53 rows).
If you inspect the query plan, this query is doing exactly what I want, but it has a serious flaw: it's not guaranteed to produce the correct result. It just happens to do so because the correct result is produced by the most natural way to execute the query. The top-level SELECT has no ORDER BY clause, so Postgres could in theory execute it differently and return any one row that happens to satisfy foundCnt = 3.
In order to remedy this flaw, I've tried the following:
=> EXPLAIN ANALYZE SELECT *
FROM (
SELECT *
, ROW_NUMBER() OVER (ORDER BY height DESC, block, requestkey, idx) as scan_num
, count(*) FILTER (WHERE qualname ILIKE '%transfer%') OVER
( ORDER BY height DESC, block, requestkey, idx
ROWS BETWEEN unbounded PRECEDING AND CURRENT ROW
) AS foundCnt
FROM events
ORDER BY height DESC, block, requestkey, idx
) as scanned_events
WHERE foundCnt = 3 OR scan_num = 100000
ORDER BY height DESC, block, requestkey, idx
LIMIT 1
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=16173553.41..16173571.35 rows=1 width=397) (actual time=86703.480..88314.937 rows=1 loops=1)
-> Subquery Scan on scanned_events (cost=16173553.41..25051383.19 rows=494821 width=397) (actual time=86435.692..88047.148 rows=1 loops=1)
Filter: ((scanned_events.foundcnt = 3) OR (scanned_events.scan_num = 100000))
Rows Removed by Filter: 2
-> WindowAgg (cost=16173553.41..24307291.63 rows=49606104 width=397) (actual time=86435.682..88047.143 rows=3 loops=1)
-> WindowAgg (cost=16173553.41..23067139.03 rows=49606104 width=389) (actual time=86435.662..88047.120 rows=4 loops=1)
-> Gather Merge (cost=16173553.41..21951001.69 rows=49606104 width=381) (actual time=86435.630..88047.085 rows=5 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=16172553.39..16224226.41 rows=20669210 width=381) (actual time=86147.622..86147.642 rows=106 loops=3)
Sort Key: events.height DESC, events.block, events.requestkey, events.idx
Sort Method: external merge Disk: 6535240kB
Worker 0: Sort Method: external merge Disk: 6503568kB
Worker 1: Sort Method: external merge Disk: 6506736kB
-> Parallel Seq Scan on events (cost=0.00..2852191.10 rows=20669210 width=381) (actual time=43.151..4135.334 rows=16430767 loops=3)
Planning Time: 0.353 ms
JIT:
Functions: 16
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 3.412 ms, Inlining 105.447 ms, Optimization 204.392 ms, Emission 87.327 ms, Total 400.578 ms
Execution Time: 89345.338 ms
(21 rows)
Now it ends up scanning the entire table, even though I've just explicitly specified what it was already doing. I've tried many variations on the latter query, such as moving the subquery to a CTE, ordering the outer SELECT by scan_num, ordering both SELECTs by scan_num, and ordering only the outer SELECT by height DESC, block, requestkey, idx. I honestly lost track of the variations I've already tried, but as soon as I have an ORDER BY clause on the outer SELECT, Postgres ends up scanning the entire table.
So, my question is: is there any way to achieve what I want without relying on fragile semantics (like the query that does exactly what I want)? I.e., what would be the correct way to write a Postgres query that will scan a bounded number of rows and return as soon as a condition (involving window functions) is satisfied?
Addressing the comments
@nbk suggested trying to add an index on height DESC, block, requestkey, idx, i.e. the exact order we're looking for. Even though I want to avoid adding that index because I'm happy with the performance of my first query (so the index shouldn't be necessary), I still tried it, but it didn't change the query plan of the second query at all; it doesn't use any indexes anyway. It just made the first query slightly faster, as expected, since that one does use indexes.
We have a table that contains raw analytics (like Google Analytics and similar) numbers for views on our videos. It contains numbers like raw views, downloads, loads, etc. Each video is identified by a video_id.
Data is recorded per day, but because we need to break it out by a number of dimensions, each day can contain multiple records for a specific video_id. Example:
date | video_id | country | source | downloads | etc...
----------------------------------------------------------------
2014-01-02 | 1 | us | facebook | 10 |
2014-01-02 | 1 | dk | facebook | 13 |
2014-01-02 | 1 | dk | admin | 20 |
I have a query where I need to get aggregate data for all videos that have new data beyond a certain date. To get the video IDs I run this query: SELECT video_id FROM table WHERE date >= '2014-01-01' GROUP BY video_id (alternatively I could do a DISTINCT video_id without the GROUP BY; performance is identical).
Once I have these IDs I need the total aggregate data (for all time). Combined, this turns into the following query:
SELECT
video_id,
SUM(downloads),
SUM(loads),
<more SUMs>
FROM
table
WHERE
video_id IN (SELECT video_id FROM table WHERE date >= '2014-01-01' GROUP BY video_id)
GROUP BY
video_id
There are around 10 columns we SUM (5-10 depending on the query). The EXPLAIN ANALYZE gives the following:
GroupAggregate (cost=2370840.59..2475948.90 rows=42537 width=72) (actual time=153790.362..162668.962 rows=87661 loops=1)
-> Sort (cost=2370840.59..2378295.16 rows=2981826 width=72) (actual time=153790.329..155833.770 rows=3285001 loops=1)
Sort Key: table.video_id
Sort Method: external merge Disk: 263528kB
-> Hash Join (cost=57066.94..1683266.53 rows=2981826 width=72) (actual time=740.210..143814.921 rows=3285001 loops=1)
Hash Cond: (table.video_id = table.video_id)
-> Seq Scan on table (cost=0.00..1550549.52 rows=5963652 width=72) (actual time=1.768..47613.953 rows=5963652 loops=1)
-> Hash (cost=56924.17..56924.17 rows=11422 width=8) (actual time=734.881..734.881 rows=87661 loops=1)
Buckets: 2048 Batches: 4 (originally 1) Memory Usage: 1025kB
-> HashAggregate (cost=56695.73..56809.95 rows=11422 width=8) (actual time=693.769..715.665 rows=87661 loops=1)
-> Index Only Scan using table_recent_ids on table (cost=0.00..52692.41 rows=1601328 width=8) (actual time=1.279..314.249 rows=1614339 loops=1)
Index Cond: (date >= '2014-01-01'::date)
Heap Fetches: 0
Total runtime: 162693.367 ms
As you can see, it's using a (quite big) external disk merge sort and taking a long time. I am unsure of why the sorts are triggered in the first place, and I am looking for a way to avoid it or at least minimize it. I know increasing work_mem can alleviate external disk merges, but in this case it seems to be excessive and having a work_mem above 500MB seems like a bad idea.
The table has two (relevant) indexes: One on video_id alone and another on (date, video_id).
EDIT: Updated query after running ANALYZE table.
Edited to match the revised query plan.
You are getting a sort because Postgres needs to sort the result rows to group them.
This query looks like it could really benefit from an index on table(video_id, date), or even just an index on table(video_id). Having such an index would likely avoid the need to sort.
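A sketch of such an index ("table" is only the question's placeholder, so the index name and real table name here are illustrative):
CREATE INDEX video_stats_video_id_date_idx ON video_stats (video_id, date);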
Edited (#2) to suggest an alternative.
You could also consider testing an alternative query such as this:
SELECT
video_id,
MAX(date) as latest_date,
<SUMs>
FROM
table
GROUP BY
video_id
HAVING
MAX(date) >= '2014-01-01' -- HAVING cannot reference the latest_date output alias
That avoids any join or subquery, and given an index on table(video_id [, other columns]) it can be hoped that the sort will be avoided as well. It will compute the sums over the whole base table before filtering out the groups you don't want, but that operation is O(n), whereas sorting is O(m log m). Thus, if the date criterion is not very selective then checking it after the fact may be an improvement.
I have a table called "nodes" with roughly 1.7 million rows in my PostgreSQL db
=#\d nodes
Table "public.nodes"
Column | Type | Modifiers
--------+------------------------+-----------
id | integer | not null
title | character varying(256) |
score | double precision |
Indexes:
"nodes_pkey" PRIMARY KEY, btree (id)
I want to use information from that table for autocompletion of a search field, showing the user a list of the ten titles with the highest score that match their input. So I used this query (here searching for all titles starting with "s"):
=# explain analyze select title,score from nodes where title ilike 's%' order by score desc;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Sort (cost=64177.92..64581.38 rows=161385 width=25) (actual time=4930.334..5047.321 rows=161264 loops=1)
Sort Key: score
Sort Method: external merge Disk: 5712kB
-> Seq Scan on nodes (cost=0.00..46630.50 rows=161385 width=25) (actual time=0.611..4464.413 rows=161264 loops=1)
Filter: ((title)::text ~~* 's%'::text)
Total runtime: 5260.791 ms
(6 rows)
This was much too slow to use for autocomplete. With some information from Using PostgreSQL in Web 2.0 Applications I was able to improve it with a special index:
=# create index title_idx on nodes using btree(lower(title) text_pattern_ops);
=# explain analyze select title,score from nodes where lower(title) like lower('s%') order by score desc limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18122.41..18122.43 rows=10 width=25) (actual time=1324.703..1324.708 rows=10 loops=1)
-> Sort (cost=18122.41..18144.60 rows=8876 width=25) (actual time=1324.700..1324.702 rows=10 loops=1)
Sort Key: score
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on nodes (cost=243.53..17930.60 rows=8876 width=25) (actual time=96.124..1227.203 rows=161264 loops=1)
Filter: (lower((title)::text) ~~ 's%'::text)
-> Bitmap Index Scan on title_idx (cost=0.00..241.31 rows=8876 width=0) (actual time=90.059..90.059 rows=161264 loops=1)
Index Cond: ((lower((title)::text) ~>=~ 's'::text) AND (lower((title)::text) ~<~ 't'::text))
Total runtime: 1325.085 ms
(9 rows)
So this gave me a speedup of a factor of 4. But can this be improved further? What if I want to use '%s%' instead of 's%'? Do I have any chance of getting decent performance with PostgreSQL in that case, too? Or should I try a different solution (Lucene? Sphinx?) for implementing my autocomplete feature?
You will need a text_pattern_ops index if you're not in the C locale.
See: index types.
Tips for further investigation:
Partition the table on the title key. This makes the lists that Postgres needs to work with smaller.
Give PostgreSQL more memory so the cache hit rate is > 98%. This table will take about 0.5 GB; 2 GB should be no problem nowadays. Make sure statistics collection is enabled and read up on the pg_stats tables.
Make a second table with a reduced substring of the title, e.g. 12 characters, so the complete table fits in fewer database blocks. An index on a substring may also work, but requires careful querying (a sketch follows these tips).
The longer the substring, the faster the query will run. Create a separate table for small substrings, and store as the value the top ten or so choices you would want to show. There are about 20,000 combinations of 1, 2, and 3 character strings.
You can use the same idea if you want to have %abc% queries, but by then switching to Lucene probably makes sense.
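A sketch of the substring-index idea (the index name and the 12-character prefix are illustrative):
CREATE INDEX nodes_title_prefix_idx ON nodes (lower(left(title, 12)) text_pattern_ops);
-- the query must use the same expression for the index to be considered;
-- for search strings longer than 12 characters, re-check against the full title as well
SELECT title, score
FROM nodes
WHERE lower(left(title, 12)) LIKE lower('s%')
ORDER BY score DESC
LIMIT 10;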
You're obviously not interested in 150000+ results, so you should limit them:
select title,score
from nodes
where title ilike 's%'
order by score desc
limit 10;
You can also consider creating a functional index, and using ">=" and "<":
create index nodes_title_lower_idx on nodes (lower(title));
select title,score
from nodes
where lower(title)>='s' and lower(title)<'t'
order by score desc
limit 10;
You should also create an index on score, which will help in the ilike '%s%' case.
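For instance (index name illustrative; whether the planner actually walks this index depends on how selective the pattern is):
CREATE INDEX nodes_score_idx ON nodes (score DESC);
-- with this index Postgres can scan from the highest score downward
-- and stop after the first ten rows that match the pattern
SELECT title, score
FROM nodes
WHERE title ILIKE '%s%'
ORDER BY score DESC
LIMIT 10;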