Avoiding external disk sort for aggregate query - sql

We have a table that contains raw analytics (like Google Analytics and similar) numbers for views on our videos. It contains numbers like raw views, downloads, loads, etc. Each video is identified by a video_id.
Data is recorded per day, but because we break it out by a number of dimensions, each day can contain multiple records for a specific video_id. Example:
date | video_id | country | source | downloads | etc...
----------------------------------------------------------------
2014-01-02 | 1 | us | facebook | 10 |
2014-01-02 | 1 | dk | facebook | 13 |
2014-01-02 | 1 | dk | admin | 20 |
I have a query where I need to get aggregate data for all videos that have new data beyond a certain date. To get the video IDs I do this query: SELECT video_id FROM table WHERE date >= '2014-01-01' GROUP BY video_id (alternatively I could do a DISTINCT(video_id) without a GROUP BY; performance is identical).
Once I have these IDs I need the total aggregate data (for all time). Combined, this turns into the following query:
SELECT
video_id,
SUM(downloads),
SUM(loads),
<more SUMs>
FROM
table
WHERE
video_id IN (SELECT video_id FROM table WHERE date >= '2014-01-01' GROUP BY video_id)
GROUP BY
video_id
There are around 10 columns we SUM (5-10 depending on the query). EXPLAIN ANALYZE gives the following:
GroupAggregate (cost=2370840.59..2475948.90 rows=42537 width=72) (actual time=153790.362..162668.962 rows=87661 loops=1)
-> Sort (cost=2370840.59..2378295.16 rows=2981826 width=72) (actual time=153790.329..155833.770 rows=3285001 loops=1)
Sort Key: table.video_id
Sort Method: external merge Disk: 263528kB
-> Hash Join (cost=57066.94..1683266.53 rows=2981826 width=72) (actual time=740.210..143814.921 rows=3285001 loops=1)
Hash Cond: (table.video_id = table.video_id)
-> Seq Scan on table (cost=0.00..1550549.52 rows=5963652 width=72) (actual time=1.768..47613.953 rows=5963652 loops=1)
-> Hash (cost=56924.17..56924.17 rows=11422 width=8) (actual time=734.881..734.881 rows=87661 loops=1)
Buckets: 2048 Batches: 4 (originally 1) Memory Usage: 1025kB
-> HashAggregate (cost=56695.73..56809.95 rows=11422 width=8) (actual time=693.769..715.665 rows=87661 loops=1)
-> Index Only Scan using table_recent_ids on table (cost=0.00..52692.41 rows=1601328 width=8) (actual time=1.279..314.249 rows=1614339 loops=1)
Index Cond: (date >= '2014-01-01'::date)
Heap Fetches: 0
Total runtime: 162693.367 ms
As you can see, it's using a (quite big) external disk merge sort and taking a long time. I am unsure of why the sorts are triggered in the first place, and I am looking for a way to avoid it or at least minimize it. I know increasing work_mem can alleviate external disk merges, but in this case it seems to be excessive and having a work_mem above 500MB seems like a bad idea.
The table has two (relevant) indexes: One on video_id alone and another on (date, video_id).
EDIT: Updated query after running ANALYZE table.

Edited to match the revised query plan.
You are getting a sort because Postgres needs to sort the result rows to group them.
This query looks like it could really benefit from an index on table(video_id, date), or even just an index on table(video_id). Having such an index would likely avoid the need to sort.
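For example, something along these lines (a sketch only; "table" stands in for the real table name as in the question, and the index name is made up):
CREATE INDEX table_video_id_date_idx ON "table" (video_id, date);
With the rows already ordered by video_id in the index, the planner can feed a GroupAggregate from an index scan instead of sorting 3.3 million rows on disk.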
Edited (#2) to suggest an alternative approach.
You could also consider testing an alternative query such as this:
SELECT
video_id,
MAX(date) as latest_date,
<SUMs>
FROM
table
GROUP BY
video_id
HAVING
MAX(date) >= '2014-01-01'
That avoids any join or subquery, and given an index on table(video_id [, other columns]) it can be hoped that the sort will be avoided as well. It will compute the sums over the whole base table before filtering out the groups you don't want, but that operation is O(n), whereas sorting is O(m log m). Thus, if the date criterion is not very selective then checking it after the fact may be an improvement.
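If the sort cannot be avoided entirely, another option is to raise work_mem only for the transaction that runs this query rather than globally (a sketch; the value is a placeholder to tune):
BEGIN;
SET LOCAL work_mem = '512MB';   -- applies to this transaction only
-- run the aggregate query here
COMMIT;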

Related

How to optimize this "select count" SQL? (postgres array comparison)

There is a table with 10 million records, and it has a column whose type is array; it looks like:
id | content | contained_special_ids
----------------------------------------
1 | abc | { 1, 2 }
2 | abd | { 1, 3 }
3 | abe | { 1, 4 }
4 | abf | { 3 }
5 | abg | { 2 }
6 | abh | { 3 }
and I want to know how many records there are whose contained_special_ids includes 3, so my SQL is:
select count(*) from my_table where contained_special_ids @> array[3]
It works fine when the data is small, but it takes a long time (about 30+ seconds) when the table has 10 million records.
I have added an index on this column:
"index_my_table_on_contained_special_ids" gin (contained_special_ids)
So, how can I optimize this select count query?
Thanks a lot!
UPDATE
below is the explain:
Finalize Aggregate (cost=1049019.17..1049019.18 rows=1 width=8) (actual time=44343.230..44362.224 rows=1 loops=1)
Output: count(*)
-> Gather (cost=1049018.95..1049019.16 rows=2 width=8) (actual time=44340.332..44362.217 rows=3 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=1048018.95..1048018.96 rows=1 width=8) (actual time=44337.615..44337.615 rows=1 loops=3)
Output: PARTIAL count(*)
Worker 0: actual time=44336.442..44336.442 rows=1 loops=1
Worker 1: actual time=44336.564..44336.564 rows=1 loops=1
-> Parallel Bitmap Heap Scan on public.my_table (cost=9116.31..1046912.22 rows=442694 width=0) (actual time=330.602..44304.221 rows=391431 loops=3)
Recheck Cond: (my_table.contained_special_ids @> '{12511}'::bigint[])
Rows Removed by Index Recheck: 501077
Heap Blocks: exact=67496 lossy=109789
Worker 0: actual time=329.547..44301.513 rows=409272 loops=1
Worker 1: actual time=329.794..44304.582 rows=378538 loops=1
-> Bitmap Index Scan on index_my_table_on_contained_special_ids (cost=0.00..8850.69 rows=1062465 width=0) (actual time=278.413..278.414 rows=1176563 loops=1)
Index Cond: (my_table.contained_special_ids @> '{12511}'::bigint[])
Planning Time: 1.041 ms
Execution Time: 44362.262 ms
Increase work_mem until the lossy blocks go away. Also, make sure the table is well vacuumed to support index-only bitmap scans, and that you are using a new enough version (which you should tell us) to support those. Finally, you can try increasing effective_io_concurrency.
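For example (a sketch; the values are guesses to be tuned against your own EXPLAIN output):
SET work_mem = '256MB';              -- raise until "lossy" heap blocks disappear from the plan
SET effective_io_concurrency = 32;   -- lets the bitmap heap scan prefetch more pages
VACUUM (ANALYZE) my_table;           -- keeps the visibility map current so heap fetches can be skipped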
Also, post plans as text, not images; and turn on track_io_timing.
There is no way to optimize such a query, due to two factors:
The use of a non-atomic value, which violates first normal form
The fact that PostgreSQL is slow at computing aggregates
On the first problem... 1st NORMAL FORM
Each value in a table's columns must be atomic... and of course an array containing multiple values is not atomic.
No index will be truly efficient on such a column, because its type violates 1NF.
This can be reduced by using a table instead of an array.
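A minimal sketch of that normalization (the junction-table and column names are made up, and it assumes id is the primary key of my_table):
CREATE TABLE my_table_special_ids (
    my_table_id bigint NOT NULL REFERENCES my_table (id),
    special_id  bigint NOT NULL,
    PRIMARY KEY (special_id, my_table_id)
);
-- The count then becomes a plain btree (index-only) scan:
SELECT count(*) FROM my_table_special_ids WHERE special_id = 3;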
On the poor performance of PG's aggregates
PG uses an MVCC model that mixes, in the same table, data pages containing both phantom (dead) records and valid records, so to count the valid records it has to read all of them, one by one, to distinguish the ones that may be counted from the ones that must not be...
Most other DBMSs do not work like PG; Oracle and SQL Server, for example, do not keep phantom records inside the data pages, and some others store the exact count of valid rows in the page header...
As an example, read the tests I have done comparing COUNT and other aggregate functions between PG and SQL Server; some queries run 1500 times faster on SQL Server...

postgresql st_contains performance

SELECT
a.geom, 'tk' category,
ROUND(avg(tk), 1) tk
FROM
tb_grid_4326_100m a left outer join
(
SELECT
tk-273.15 tk, geom
FROM
tb_points
WHERE
hour = '23'
) b ON st_contains(a.geom, b.geom)
GROUP BY
a.geom
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Finalize GroupAggregate (cost=54632324.85..54648025.25 rows=50698 width=184) (actual time=8522.042..8665.129 rows=50698 loops=1) |
Group Key: a.geom |
-> Gather Merge (cost=54632324.85..54646504.31 rows=101396 width=152) (actual time=8522.032..8598.567 rows=50698 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Partial GroupAggregate (cost=54631324.83..54633800.68 rows=50698 width=152) (actual time=8490.577..8512.725 rows=16899 loops=3) |
Group Key: a.geom |
-> Sort (cost=54631324.83..54631785.36 rows=184212 width=130) (actual time=8490.557..8495.249 rows=16996 loops=3) |
Sort Key: a.geom |
Sort Method: external merge Disk: 2296kB |
Worker 0: Sort Method: external merge Disk: 2304kB |
Worker 1: Sort Method: external merge Disk: 2296kB |
-> Nested Loop Left Join (cost=0.41..54602621.56 rows=184212 width=130) (actual time=1.729..8475.942 rows=16996 loops=3) |
-> Parallel Seq Scan on tb_grid_4326_100m a (cost=0.00..5866.24 rows=21124 width=120) (actual time=0.724..2.846 rows=16899 loops=3) |
-> Index Scan using sidx_tb_points on tb_points (cost=0.41..2584.48 rows=10 width=42) (actual time=0.351..0.501 rows=1 loops=50698)|
Index Cond: (((hour)::text = '23'::text) AND (geom @ a.geom)) |
Filter: st_contains(a.geom, geom) |
Rows Removed by Filter: 0 |
Planning Time: 1.372 ms |
Execution Time: 8667.418 ms |
I want to join a 100m grid table and a 100,000-point table using the st_contains function.
The 100m grid table has 75,769 records, and tb_points table has 2,434,536 records.
When a time condition is given, the tb_points table returns about 100,000 records.
(As a result, about 75,000 records JOIN about 100,000 records.)
(Index information)
100m grid table using gist(geom),
tb_points table using gist(hour, geom)
It took 30 seconds. How can I improve the performance?
It is hard to give a definitive answer, but here are several things you can try:
For a multicolumn gist index, it is often a good idea to put the most selectively used column first. In your case, that would have the index be on (geom, hour), not (hour, geom). On the other hand, it can also be better to put the faster column first, and testing for scalar equality should be much faster than testing for containment. You would have to do the test and see which factor is more important for you.
You could try for an index-only scan, which doesn't need to visit the table. That could save a lot of random IO. To do that you would need the index gist (hour, geom) INCLUDE (tk, geom). The geom column in a gist index is not considered to be "returnable", so it also needs to be put in the INCLUDE part in order to get the IOS.
Finally, you could partition the table tb_points on "hour". Then you wouldn't need to put "hour" into the gist index, as it is already fulfilled by the partitioning.
And these can be mixed and matched, so you could also swap the column order in the INCLUDE index, or you could try to get both partitioning and the INCLUDE index working together.
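As a rough sketch of the first and last suggestions (index and partition names are made up; the existing gist(hour, geom) index implies btree_gist is already installed):
-- Swap the column order so the geometry is tested first
CREATE INDEX sidx_tb_points_geom_hour ON tb_points USING gist (geom, hour);
-- Or partition by hour, so the index only needs the geometry
CREATE TABLE tb_points_part (LIKE tb_points) PARTITION BY LIST (hour);
CREATE TABLE tb_points_h23 PARTITION OF tb_points_part FOR VALUES IN ('23');
CREATE INDEX ON tb_points_part USING gist (geom);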

Best way to use PostgreSQL full text search ranking

Following on from this answer I want to know what the best way to use PostgreSQL's built-in full text search is if I want to sort by rank, and limit to only matching queries.
Let's assume a very simple table.
CREATE TABLE pictures (
id SERIAL PRIMARY KEY,
title varchar(300),
...
)
or whatever. Now I want to search the title field. First I create an index:
CREATE INDEX pictures_title ON pictures
USING gin(to_tsvector('english', title));
Now I want to search for 'small dog'. This works:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), 'small dog'
) AS score
FROM pictures
ORDER BY score DESC
But what I really want is this:
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures
WHERE to_tsvector('english', pictures.title) @@ to_tsquery('small dog')
ORDER BY score DESC
Or alternatively this (which doesn't work - can't use score in the WHERE clause):
SELECT pictures.id,
ts_rank_cd(
to_tsvector('english', pictures.title), to_tsquery('small dog')
) AS score
FROM pictures WHERE score > 0
ORDER BY score DESC
What's the best way to do this? My questions are many-fold:
If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?
Is there a way to do it without repeating the to_ts... function calls?
Is there a way to use score in the WHERE clause at all?
If there is, would it be better to filter by score > 0 or use the @@ thing?
The use of the @@ operator will utilize the full text GIN index, while the test for score > 0 would not.
I created a table as in the Question, but added a column named title_tsv:
CREATE TABLE test_pictures (
id BIGSERIAL,
title text,
title_tsv tsvector
);
CREATE INDEX ix_pictures_title_tsv ON test_pictures
USING gin(title_tsv);
I populated the table with some test data:
INSERT INTO test_pictures(title, title_tsv)
SELECT T.data, to_tsvector(T.data)
FROM some_table T;
Then I ran the previously accepted answer with explain analyze:
EXPLAIN ANALYZE
SELECT score, id, title
FROM (
SELECT ts_rank_cd(P.title_tsv, to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
) S
WHERE score > 0
ORDER BY score DESC;
And got the following. Note the execution time of 5,015 ms
QUERY PLAN |
----------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=274895.48..323298.03 rows=414850 width=60) (actual time=5010.844..5011.330 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=273895.46..274414.02 rows=207425 width=60) (actual time=4994.539..4994.555 rows=492 loops=3) |
Sort Key: (ts_rank_cd(p.title_tsv, to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 131kB |
-> Parallel Seq Scan on test_pictures p (cost=0.00..247776.02 rows=207425 width=60) (actual time=17.672..4993.997 rows=492 loops=3) |
Filter: (ts_rank_cd(title_tsv, to_tsquery('address & shipping'::text)) > '0'::double precision) |
Rows Removed by Filter: 497296 |
Planning time: 0.159 ms |
Execution time: 5015.664 ms |
Now compare that with the @@ operator:
EXPLAIN ANALYZE
SELECT ts_rank_cd(to_tsvector(P.title), to_tsquery('address & shipping')) AS score
,P.id
,P.title
FROM test_pictures as P
WHERE P.title_tsv @@ to_tsquery('address & shipping')
ORDER BY score DESC;
And the results coming in with an execution time of about 29 ms:
QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------------------|
Gather Merge (cost=13884.42..14288.35 rows=3462 width=60) (actual time=26.472..26.942 rows=1477 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Sort (cost=12884.40..12888.73 rows=1731 width=60) (actual time=17.507..17.524 rows=492 loops=3) |
Sort Key: (ts_rank_cd(to_tsvector(title), to_tsquery('address & shipping'::text))) DESC |
Sort Method: quicksort Memory: 171kB |
-> Parallel Bitmap Heap Scan on test_pictures p (cost=72.45..12791.29 rows=1731 width=60) (actual time=1.781..17.268 rows=492 loops=3) |
Recheck Cond: (title_tsv @@ to_tsquery('address & shipping'::text)) |
Heap Blocks: exact=625 |
-> Bitmap Index Scan on ix_pictures_title_tsv (cost=0.00..71.41 rows=4155 width=0) (actual time=3.765..3.765 rows=1477 loops=1) |
Index Cond: (title_tsv @@ to_tsquery('address & shipping'::text)) |
Planning time: 0.214 ms |
Execution time: 28.995 ms |
As you can see in the execution plan, the index ix_pictures_title_tsv was used in the second query, but not in the first one, making the query with the @@ operator a whopping 172 times faster!
select *
from (
SELECT
pictures.id,
ts_rank_cd(to_tsvector('english', pictures.title),
to_tsquery('small dog')) AS score
FROM pictures
) s
WHERE score > 0
ORDER BY score DESC
If I use the version with repeated to_tsvector(...) will it call that twice, or is it smart enough to cache the results somehow?
The best way to notice these things is to do a simple explain, although those can be hard to read.
Long story short, yes, PostgreSQL is smart enough to reuse computed results.
Is there a way to do it without repeating the to_ts... function calls?
What I usually do is add a tsv column which is the text search vector. If you make this auto update using triggers it immediately gives you the vector easily accessible but it also allows you to selectively update the search index by making the trigger selective.
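For example (a sketch along the lines of the other answer's title_tsv column; tsvector_update_trigger is the built-in helper for keeping it current):
ALTER TABLE pictures ADD COLUMN title_tsv tsvector;
UPDATE pictures SET title_tsv = to_tsvector('english', title);
CREATE INDEX pictures_title_tsv_idx ON pictures USING gin (title_tsv);
CREATE TRIGGER pictures_title_tsv_update
BEFORE INSERT OR UPDATE ON pictures
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(title_tsv, 'pg_catalog.english', title);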
Is there a way to use score in the WHERE clause at all?
Yes, but not with that name.
Alternatively you could create a sub-query, but I would personally just repeat it.
If there is, would it be better to filter by score > 0 or use the @@ thing?
The simplest version I can think of is this:
SELECT *
FROM pictures
WHERE 'small dog' @@ text_search_vector
The text_search_vector could obviously be replaced with something like to_tsvector('english', pictures.title)

Two column, bulk, random access retrieval from sparse table using PostgreSQL

I'm storing a relatively reasonable (~3 million) number of very small rows (the entire DB is ~300MB) in PostgreSQL. The data is organized thus:
Table "public.tr_rating"
Column | Type | Modifiers
-----------+--------------------------+---------------------------------------------------------------
user_id | bigint | not null
place_id | bigint | not null
rating | smallint | not null
rated_at | timestamp with time zone | not null default now()
rating_id | bigint | not null default nextval('tr_rating_rating_id_seq'::regclass)
Indexes:
"tr_rating_rating_id_key" UNIQUE, btree (rating_id)
"tr_rating_user_idx" btree (user_id, place_id)
Now, I would like to retrieve the ratings deposited over a set of places by your friends (a set of users)
The natural query I wrote is:
SELECT * FROM tr_rating WHERE user_id=ANY(?) AND place_id=ANY(?)
The size of the user_id array is ~500, while the place_id array is ~10,000
This turns into:
Bitmap Heap Scan on tr_rating (cost=2453743.43..2492013.53 rows=3627 width=34) (actual time=10174.044..10174.234 rows=1111 loops=1)
Buffers: shared hit=27922214
-> Bitmap Index Scan on tr_rating_user_idx (cost=0.00..2453742.53 rows=3627 width=0) (actual time=10174.031..10174.031 rows=1111 loops=1)
Index Cond: ((user_id = ANY (...) ))
Buffers: shared hit=27922214
Total runtime: 10279.290 ms
The first suspicious thing I see here is that it estimates that scanning the index for 500 users will take 2.5M disk seeks
Everything else here looks reasonable, except that it takes ten full seconds to do this! The index (via \di) looks like:
public | tr_rating_user_idx | index | tr_rating | 67 MB |
at 67 MB, I would expect it could tear through the index in a trivial amount of time, even if it has to do it sequentially. As the buffers accounting from the EXPLAIN ANALYZE shows, everything is already in memory (as all values other than shared_hit are zero and thus suppressed).
I have tried various combinations of REINDEX, VACUUM, ANALYZE, and CLUSTER with no measurable improvement.
Any thoughts as to what I am doing wrong here, or how I could debug further? I'm mystified; 67MB of data is a puny amount to spend so much time searching through...
For reference, the hardware is an 8-way recent Xeon with 8 15K 300GB drives in RAID-10. Should be enough :-)
EDIT
Per btilly's suggestion, I tried out temporary tables:
=> explain analyze select * from tr_rating NATURAL JOIN user_ids NATURAL JOIN place_ids;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=49133.46..49299.51 rows=3524 width=34) (actual time=13.801..15.676 rows=1111 loops=1)
Hash Cond: (place_ids.place_id = tr_rating.place_id)
-> Seq Scan on place_ids (cost=0.00..59.66 rows=4066 width=8) (actual time=0.009..0.619 rows=4251 loops=1)
-> Hash (cost=48208.02..48208.02 rows=74035 width=34) (actual time=13.767..13.767 rows=7486 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 527kB
-> Nested Loop (cost=0.00..48208.02 rows=74035 width=34) (actual time=0.047..11.055 rows=7486 loops=1)
-> Seq Scan on user_ids (cost=0.00..31.40 rows=2140 width=8) (actual time=0.006..0.399 rows=2189 loops=1)
-> Index Scan using tr_rating_user_idx on tr_rating (cost=0.00..22.07 rows=35 width=34) (actual time=0.002..0.003 rows=3 loops=2189)
Index Cond: (tr_rating.user_id = user_ids.user_id)
Total runtime: 15.931 ms
Why is the query plan so much better with temporary tables than with arrays? The data is exactly the same, simply presented in a different way. Additionally, I've measured the time to create a temporary table at tens to hundreds of milliseconds, which is a pretty steep overhead to pay. Can I continue to use the array approach, yet allow Postgres to use the hash join, which is so much faster?
EDIT 2
By creating a hash index on user_id, the runtime reduces to 250ms. Adding another hash index to place_id reduces the runtime further to 50ms. This is still twice as slow as using temporary tables, but the overhead of making the table negates any gains I see. I still do not understand how doing O(500) lookups in a btree index can take ten seconds, but the hash index is unquestionably much faster.
It looks like it is taking each row in the index, and then scanning through your user_id array, then if it finds it scanning through your place_id array. That means that for 3 million rows it has to scan through 100 user_ids, and for each match it scans through 10,000 place_ids. Those matches are individually fast, but this is a poor algorithm that could potentially result in up to 30 billion operations.
You'd be better off creating two temporary tables, giving them indexes, and doing a join. If it does a hash join, then you'd potentially have 6 million hash lookups. (3 million for user_id and 3 million for place_id.)
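A sketch of that approach, matching the temporary tables used in the question's edit (the ? array parameters are placeholders for whatever the application passes in):
CREATE TEMP TABLE user_ids  (user_id  bigint PRIMARY KEY);
CREATE TEMP TABLE place_ids (place_id bigint PRIMARY KEY);
INSERT INTO user_ids  SELECT unnest(?::bigint[]);
INSERT INTO place_ids SELECT unnest(?::bigint[]);
ANALYZE user_ids;
ANALYZE place_ids;
SELECT r.*
FROM tr_rating r
JOIN user_ids  USING (user_id)
JOIN place_ids USING (place_id);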

How to optimize my PostgreSQL DB for prefix search?

I have a table called "nodes" with roughly 1.7 million rows in my PostgreSQL db
=#\d nodes
Table "public.nodes"
Column | Type | Modifiers
--------+------------------------+-----------
id | integer | not null
title | character varying(256) |
score | double precision |
Indexes:
"nodes_pkey" PRIMARY KEY, btree (id)
I want to use information from that table for autocompletion of a search field, showing the user a list of the ten titles having the highest score fitting to his input. So I used this query (here searching for all titles starting with "s")
=# explain analyze select title,score from nodes where title ilike 's%' order by score desc;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Sort (cost=64177.92..64581.38 rows=161385 width=25) (actual time=4930.334..5047.321 rows=161264 loops=1)
Sort Key: score
Sort Method: external merge Disk: 5712kB
-> Seq Scan on nodes (cost=0.00..46630.50 rows=161385 width=25) (actual time=0.611..4464.413 rows=161264 loops=1)
Filter: ((title)::text ~~* 's%'::text)
Total runtime: 5260.791 ms
(6 rows)
This was much too slow to use for autocomplete. With some information from Using PostgreSQL in Web 2.0 Applications I was able to improve that with a special index
=# create index title_idx on nodes using btree(lower(title) text_pattern_ops);
=# explain analyze select title,score from nodes where lower(title) like lower('s%') order by score desc limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=18122.41..18122.43 rows=10 width=25) (actual time=1324.703..1324.708 rows=10 loops=1)
-> Sort (cost=18122.41..18144.60 rows=8876 width=25) (actual time=1324.700..1324.702 rows=10 loops=1)
Sort Key: score
Sort Method: top-N heapsort Memory: 17kB
-> Bitmap Heap Scan on nodes (cost=243.53..17930.60 rows=8876 width=25) (actual time=96.124..1227.203 rows=161264 loops=1)
Filter: (lower((title)::text) ~~ 's%'::text)
-> Bitmap Index Scan on title_idx (cost=0.00..241.31 rows=8876 width=0) (actual time=90.059..90.059 rows=161264 loops=1)
Index Cond: ((lower((title)::text) ~>=~ 's'::text) AND (lower((title)::text) ~<~ 't'::text))
Total runtime: 1325.085 ms
(9 rows)
So this gave me a speedup of a factor of 4. But can this be further improved? What if I want to use '%s%' instead of 's%'? Do I have any chance of getting decent performance with PostgreSQL in that case, too? Or should I try a different solution (Lucene? Sphinx?) for implementing my autocomplete feature?
You will need a text_pattern_ops index if you're not in the C locale.
See: index types.
Tips for further investigation :
Partition the table on the title key. This makes the lists that Postgres needs to work with smaller.
Give PostgreSQL more memory so the cache hit rate is above 98%. This table will take about 0.5 GB; 2 GB should be no problem nowadays. Make sure statistics collection is enabled and read up on the pg_stats view.
Make a second table with a reduced substring of the title, e.g. 12 characters, so the complete table fits in fewer database blocks. An index on a substring may also work, but requires careful querying.
The longer the substring, the faster the query will run. Create a separate table for short substrings, and store as the value the top ten (or however many) choices you would want to show; there are only about 20,000 combinations of 1-, 2-, and 3-character strings (see the sketch after this list).
You can use the same idea if you want to have %abc% queries, but at that point switching to Lucene probably makes sense.
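A sketch of the short-substring idea (all names are hypothetical; it precomputes the top ten titles for every prefix of length 1 to 3):
CREATE TABLE title_prefixes (
    prefix     text PRIMARY KEY,
    top_titles text[] NOT NULL          -- the ten best titles by score
);
INSERT INTO title_prefixes (prefix, top_titles)
SELECT left(lower(n.title), len) AS prefix,
       (array_agg(n.title ORDER BY n.score DESC))[1:10]
FROM nodes n
CROSS JOIN generate_series(1, 3) AS len
WHERE length(n.title) >= len
GROUP BY left(lower(n.title), len);
-- An autocomplete lookup is then a single primary-key fetch:
SELECT top_titles FROM title_prefixes WHERE prefix = lower('s');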
You're obviously not interested in 150000+ results, so you should limit them:
select title,score
from nodes
where title ilike 's%'
order by score desc
limit 10;
You can also consider creating a functional index, and using ">=" and "<":
create index nodes_title_lower_idx on nodes (lower(title));
select title,score
from nodes
where lower(title)>='s' and lower(title)<'t'
order by score desc
limit 10;
You should also create an index on score, which will help in the ilike '%s%' case.
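For example (a sketch; the index name is made up):
CREATE INDEX nodes_score_idx ON nodes (score DESC);
With ORDER BY score DESC LIMIT 10, the planner can walk this index from the top and stop after ten matching rows, which is what makes it useful even for the unanchored '%s%' filter.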