Postgres: STABLE function called multiple times on constant

I have a PostgreSQL (version 9.4) performance puzzle. I have a function (prevd) declared as STABLE (see below). When I call this function on a constant in a WHERE clause, it is called many times instead of once.
If I understand the Postgres documentation correctly, the query should be optimized to call prevd only once:
A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement
Why doesn't Postgres optimize the calls to prevd in this case?
I'm not expecting prevd to be called once for all subsequent queries using prevd on the same argument (as if it were IMMUTABLE). I'm expecting Postgres to create a plan for my query with just one call to prevd('2015-12-12').
The code is below:
Schema
create table somedata(d date, number double precision);
create table dates(d date);
insert into dates
select generate_series::date
from generate_series('2015-01-01'::date, '2015-12-31'::date, '1 day');
insert into somedata
select '2015-01-01'::date + (random() * 365 + 1)::integer, random()
from generate_series(1, 100000);
create or replace function prevd(date_ date)
returns date
language sql
stable
as $$
    select max(d) from dates where d < date_;
$$;
Slow Query
select avg(number) from somedata where d=prevd('2015-12-12');
Poor query plan of the query above
Aggregate (cost=28092.74..28092.75 rows=1 width=8) (actual time=3532.638..3532.638 rows=1 loops=1)
Output: avg(number)
-> Seq Scan on public.somedata (cost=0.00..28091.43 rows=525 width=8) (actual time=10.210..3532.576 rows=282 loops=1)
Output: d, number
Filter: (somedata.d = prevd('2015-12-12'::date))
Rows Removed by Filter: 99718
Planning time: 1.144 ms
Execution time: 3532.688 ms
(8 rows)
Performance
On my machine, the query above runs in about 3.5 s. After changing prevd to IMMUTABLE, it runs in about 0.035 s.

I started writing this as a comment, but it got a bit long, so I'm expanding it into an answer.
As discussed in this previous answer, Postgres does not promise to always optimise based on STABLE or IMMUTABLE annotations, only that it can sometimes do so. When it does, it plans the query differently, taking advantage of the assumptions those annotations allow. This part of the previous answer is directly analogous to your case:
This particular sort of rewriting depends upon immutability or stability. With where test_multi_calls1(30) != num query re-writing will happen for immutable but not for merely stable functions.
If you change the function to IMMUTABLE and look at the query plan, you will see that the rewriting it does is really rather radical:
Seq Scan on public.somedata (cost=0.00..1791.00 rows=272 width=12) (actual time=0.036..14.549 rows=270 loops=1)
Output: d, number
Filter: (somedata.d = '2015-12-11'::date)
Buffers: shared read=541 written=14
Total runtime: 14.589 ms
It actually runs the function while planning the query, and substitutes the value before the query is even executed. With a STABLE function, this optimisation would clearly not be appropriate - the data might change between planning and executing the query.
In a comment, it was mentioned that this query results in an optimised plan:
select avg(number) from somedata where d=(select prevd(date '2015-12-12'));
This is fast, but note that the plan doesn't look anything like what the IMMUTABLE version did:
Aggregate (cost=1791.69..1791.70 rows=1 width=8) (actual time=14.670..14.670 rows=1 loops=1)
Output: avg(number)
Buffers: shared read=541 written=21
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
Output: '2015-12-11'::date
-> Seq Scan on public.somedata (cost=0.00..1791.00 rows=273 width=8) (actual time=0.026..14.589 rows=270 loops=1)
Output: d, number
Filter: (somedata.d = $0)
Buffers: shared read=541 written=21
Total runtime: 14.707 ms
By putting it into a sub-query, you are moving the function call from the WHERE clause to the SELECT clause. More importantly, the sub-query can always be executed once and used by the rest of the query; so the function is run once in a separate node of the plan.
To confirm this, we can take the SQL out of a function altogether:
select avg(number) from somedata where d=(select max(d) from dates where d < '2015-12-12');
This gives a rather longer plan with very similar performance:
Aggregate (cost=1799.12..1799.13 rows=1 width=8) (actual time=14.174..14.174 rows=1 loops=1)
Output: avg(somedata.number)
Buffers: shared read=543 written=19
InitPlan 1 (returns $0)
-> Aggregate (cost=7.43..7.44 rows=1 width=4) (actual time=0.150..0.150 rows=1 loops=1)
Output: max(dates.d)
Buffers: shared read=2
-> Seq Scan on public.dates (cost=0.00..6.56 rows=347 width=4) (actual time=0.015..0.103 rows=345 loops=1)
Output: dates.d
Filter: (dates.d < '2015-12-12'::date)
Buffers: shared read=2
-> Seq Scan on public.somedata (cost=0.00..1791.00 rows=273 width=8) (actual time=0.190..14.098 rows=270 loops=1)
Output: somedata.d, somedata.number
Filter: (somedata.d = $0)
Buffers: shared read=543 written=19
Total runtime: 14.232 ms
The important thing to note is that the inner Aggregate (the max(d)) is executed once, on a separate node from the main Seq Scan (which is checking the where clause). In this position, even a VOLATILE function can be optimised in the same way.
In short, while you know that the query you've produced can be optimised by executing the function only once, it doesn't match any of the patterns that Postgres's query planner knows how to rewrite, so it uses a naive plan which runs the function multiple times.
[Note: all tests performed on Postgres 9.1, because it's what I happened to have to hand.]
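One way to see the call counts for yourself is to trace the function. The following is only a sketch: prevd_traced is a hypothetical name (not from the question), written in PL/pgSQL so that each invocation can raise a NOTICE. The inner LIMIT 5 is there purely to keep the output short; without it you should expect roughly one NOTICE per row of somedata.

```sql
-- Hypothetical traced variant of prevd: same logic, but it announces every call.
create or replace function prevd_traced(date_ date)
returns date
language plpgsql
stable
as $$
begin
    raise notice 'prevd_traced(%) called', date_;
    return (select max(d) from dates where d < date_);
end;
$$;

-- Function called directly in the filter: it should be evaluated for each
-- of the 5 rows coming out of the LIMIT, so several NOTICEs are expected.
select avg(number)
from (select * from somedata limit 5) s
where d = prevd_traced('2015-12-12');

-- Function wrapped in a scalar sub-query: it should be evaluated once as an
-- InitPlan, so a single NOTICE is expected.
select avg(number)
from (select * from somedata limit 5) s
where d = (select prevd_traced('2015-12-12'));
```

Comparing the number of NOTICE lines between the two statements should make the difference between the per-row filter and the InitPlan visible without reading the plans.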

Related

Postgres: which index to add

I have a table mainly used by this query (only 3 columns are in use here, meter, timeStampUtc and createdOnUtc, but there are others in the table), which is starting to take too long:
select
rank() over (order by mr.meter, mr."timeStampUtc") as row_name
, max(mr."createdOnUtc") over (partition by mr.meter, mr."timeStampUtc") as "createdOnUtc"
from
"MeterReading" mr
where
"createdOnUtc" >= '2021-01-01'
order by row_name
;
(this is the minimal query to show my issue. It might not make too much sense on its own, or could be rewritten)
I am wondering which index (or other technique) to use to optimise this particular query.
A basic index on createdOnUtc helps already.
I am mostly wondering about those 2 window functions. They are very similar, so I factorised them (a named window, so with identical partition by and order by); it had no effect. Adding an index on meter, "timeStampUtc" had no effect either (query plan unchanged).
Is there no way to use an index on those 2 columns inside a window function?
Edit - explain analyze output: using the createdOnUtc index
Sort (cost=8.51..8.51 rows=1 width=40) (actual time=61.045..62.222 rows=26954 loops=1)
Sort Key: (rank() OVER (?))
Sort Method: quicksort Memory: 2874kB
-> WindowAgg (cost=8.46..8.50 rows=1 width=40) (actual time=18.373..57.892 rows=26954 loops=1)
-> WindowAgg (cost=8.46..8.48 rows=1 width=40) (actual time=18.363..32.444 rows=26954 loops=1)
-> Sort (cost=8.46..8.46 rows=1 width=32) (actual time=18.353..19.663 rows=26954 loops=1)
Sort Key: meter, "timeStampUtc"
Sort Method: quicksort Memory: 2874kB
-> Index Scan using "MeterReading_createdOnUtc_idx" on "MeterReading" mr (cost=0.43..8.45 rows=1 width=32) (actual time=0.068..8.059 rows=26954 loops=1)
Index Cond: ("createdOnUtc" >= '2021-01-01 00:00:00'::timestamp without time zone)
Planning Time: 0.082 ms
Execution Time: 63.698 ms

Is there no way to use an index on those 2 columns inside a window function?
That is correct; window functions cannot use an index, as they work only on what would otherwise be the final result: all data selection has already finished. From the documentation:
The rows considered by a window function are those of the “virtual
table” produced by the query's FROM clause as filtered by its WHERE,
GROUP BY, and HAVING clauses if any. For example, a row removed
because it does not meet the WHERE condition is not seen by any window
function. A query can contain multiple window functions that slice up
the data in different ways using different OVER clauses, but they all
act on the same collection of rows defined by this virtual table.
The purpose of an index is to speed up the creation of that "virtual table". Applying an index at that point would just slow things down: the data is already in memory, and scanning it is orders of magnitude faster than any index.

Difference between ANY(ARRAY[..]) vs ANY(VALUES (), () ..) in PostgreSQL

I am trying to work out how to optimise a query that filters on id, and I am not sure which of the two forms below I should use. Below are the query plans from EXPLAIN; cost-wise they look similar.
1. explain (analyze, buffers) SELECT * FROM table1 WHERE id = ANY (ARRAY['00e289b0-1ac8-451f-957f-e00bc289148e'::uuid,...]);
QUERY PLAN:
Index Scan using table1_pkey on table1 (cost=0.42..641.44 rows=76 width=835) (actual time=0.258..2.603 rows=76 loops=1)
Index Cond: (id = ANY ('{00e289b0-1ac8-451f-957f-e00bc289148e,...}'::uuid[]))
Buffers: shared hit=231 read=73
Planning Time: 0.487 ms
Execution Time: 2.715 ms
2. explain (analyze, buffers) SELECT * FROM table1 WHERE id = ANY (VALUES ('00e289b0-1ac8-451f-957f-e00bc289148e'::uuid),...);
QUERY PLAN:
Nested Loop (cost=1.56..644.10 rows=76 width=835) (actual time=0.058..0.297 rows=76 loops=1)
Buffers: shared hit=304
-> HashAggregate (cost=1.14..1.90 rows=76 width=16) (actual time=0.049..0.060 rows=76 loops=1)
Group Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..0.95 rows=76 width=16) (actual time=0.006..0.022 rows=76 loops=1)
-> Index Scan using table1_pkey on table1 (cost=0.42..8.44 rows=1 width=835) (actual time=0.002..0.003 rows=1 loops=76)
Index Cond: (id = "*VALUES*".column1)
Buffers: shared hit=304
Planning Time: 0.437 ms
Execution Time: 0.389 ms
It looks like VALUES () does some hashing and a join to improve performance, but I am not sure.
NOTE: In my practical use case, id is generated by uuid_generate_v4(), e.g. d31cddc0-1771-4de8-ad41-e6c568b39a5d, but the column may not be indexed as such.
Also, the table has 5-10 million records.
Which form gives better query performance?

Both options seem reasonable. I would, however, suggest that you avoid casting the column you filter on. Instead, cast the literal values to uuid:
SELECT *
FROM table1
WHERE id = ANY (ARRAY['00e289b0-1ac8-451f-957f-e00bc289148e'::uuid, ...]);
This should allow the database to take advantage of an index on column id.
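For comparison, the VALUES form from the question can also be written as an explicit join, which is roughly what the Nested Loop plan above is doing. This is a sketch only: the second uuid is the sample value quoted in the question, and the alias v(id) is made up for illustration. Note that, unlike = ANY (...), a plain join returns a row once per matching entry, so duplicates in the list would produce duplicate rows.

```sql
-- Explicit join against a VALUES list; each literal is cast to uuid,
-- the column itself is left untouched so an index on id stays usable.
SELECT t.*
FROM   table1 t
JOIN  (VALUES
         ('00e289b0-1ac8-451f-957f-e00bc289148e'::uuid),
         ('d31cddc0-1771-4de8-ad41-e6c568b39a5d'::uuid)
      ) AS v(id) USING (id);
```

Running both variants under explain (analyze, buffers) with a realistic number of ids is the only reliable way to tell which one wins on your data.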

Optimize query with multiple "between" conditions

I have a table playground with column val, column val is indexed.
I have a list of ranges [(min1, max1), (min2, max2), ... , (minN, maxN)]
and I want to select all rows with val that fit in any of those ranges.
E.g. my ranges looks like that: [(1,5), (20,25), (200,400)]
Here is the simple query that extracts corresponding rows:
select p.*
from playground p
where (val between 1 AND 5) or (val between 20 and 25) or
(val between 200 and 400);
The problem here is that this list of ranges is dynamic, my application generates it and sends it along with the query to postgres.
I tried to rewrite the query to accept dynamic list of ranges:
select p.*
from playground p,
unnest(ARRAY [(1, 5),(20, 25),(200, 400)]) as r(min_val INT, max_val INT)
where p.val between r.min_val and r.max_val;
It extracts the same rows, but I don't know whether it is an efficient query.
This is what the EXPLAIN output looks like for the first query:
Bitmap Heap Scan on playground p (cost=12.43..16.45 rows=1 width=36) (actual time=0.017..0.018 rows=4 loops=1)
Recheck Cond: (((val >= 1) AND (val <= 5)) OR ((val >= 20) AND (val <= 25)) OR ((val >= 200) AND (val <= 400)))
Heap Blocks: exact=1
-> BitmapOr (cost=12.43..12.43 rows=1 width=0) (actual time=0.012..0.012 rows=0 loops=1)
-> Bitmap Index Scan on playground_val_index (cost=0.00..4.14 rows=1 width=0) (actual time=0.010..0.010 rows=3 loops=1)
Index Cond: ((val >= 1) AND (val <= 5))
-> Bitmap Index Scan on playground_val_index (cost=0.00..4.14 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
Index Cond: ((val >= 20) AND (val <= 25))
-> Bitmap Index Scan on playground_val_index (cost=0.00..4.14 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
Index Cond: ((val >= 200) AND (val <= 400))
Planning Time: 0.071 ms
Execution Time: 0.057 ms
And here is the explain for the second:
Nested Loop (cost=0.14..12.52 rows=2 width=36) (actual time=0.033..0.065 rows=4 loops=1)
-> Function Scan on unnest r (cost=0.00..0.03 rows=3 width=8) (actual time=0.011..0.012 rows=3 loops=1)
-> Index Scan using playground_val_index on playground p (cost=0.13..4.15 rows=1 width=36) (actual time=0.008..0.015 rows=1 loops=3)
Index Cond: ((val >= r.min_val) AND (val <= r.max_val))
Planning Time: 0.148 ms
Execution Time: 0.714 ms
NOTE: In both cases I did set enable_seqscan = false; to make the index work.
I am worried about the "Nested Loop" stage. Is it okay? Or are there more effective ways to pass a dynamic list of ranges into a query?
My postgres version is 12.1

You added more information, but much more is still relevant: exact table and index definitions, cardinality, data distribution, row size stats, number of ranges in the predicate, purpose of the table, write patterns, ... Performance optimization needs all the input it can get.
Shot in the dark: with non-overlapping ranges, a UNION ALL query may deliver best performance:
SELECT * FROM playground WHERE val BETWEEN 1 AND 5
UNION ALL
SELECT * FROM playground WHERE val BETWEEN 20 AND 25
UNION ALL
SELECT * FROM playground WHERE val BETWEEN 200 AND 400;
We know that ranges don't overlap, but Postgres doesn't, so it has to do extra work in your attempts. This query should avoid both the BitmapOr of the first as well as the Nested Loop of the second plan. Just fetch each range and append to the output. Should result in a plan like:
Append (cost=0.13..24.50 rows=3 width=40)
-> Index Scan using playground_val_idx on playground (cost=0.13..8.15 rows=1 width=40)
Index Cond: ((val >= 1) AND (val <= 5))
-> Index Scan using playground_val_idx on playground playground_1 (cost=0.13..8.15 rows=1 width=40)
Index Cond: ((val >= 20) AND (val <= 25))
-> Index Scan using playground_val_idx on playground playground_2 (cost=0.13..8.15 rows=1 width=40)
Index Cond: ((val >= 200) AND (val <= 400))
Plus, each sub-SELECT will be based on actual statistics for the given range, not generic estimates, even for a longer list of ranges. See (recommended!):
How to use index for simple time range join?
You can generate the query in your client or write a server-side function to generate and execute dynamic SQL (applicable as the result type is known).
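A minimal sketch of the server-side dynamic SQL variant, under some assumptions: foo_dynamic is a hypothetical name, the input is a 2-dimensional int[] with one {min,max} pair per row, and no element is NULL (a NULL bound would produce invalid SQL here). Since the bounds are integers, interpolating them with %s does not open an injection hole.

```sql
CREATE OR REPLACE FUNCTION foo_dynamic(_ranges int[])
  RETURNS SETOF playground
  LANGUAGE plpgsql STABLE AS
$func$
DECLARE
   _sql text;
BEGIN
   -- Build one SELECT per range and glue them together with UNION ALL.
   SELECT string_agg(
             format('SELECT * FROM playground WHERE val BETWEEN %s AND %s',
                    _ranges[i][1], _ranges[i][2]),
             ' UNION ALL ')
   INTO   _sql
   FROM   generate_subscripts(_ranges, 1) AS i;

   -- An empty array yields no statement and therefore no rows.
   IF _sql IS NOT NULL THEN
      RETURN QUERY EXECUTE _sql;
   END IF;
END
$func$;

-- Usage, with the ranges from the question:
SELECT * FROM foo_dynamic('{{1,5},{20,25},{200,400}}');
```

Because the statement is planned at EXECUTE time with the literal bounds in place, each branch should be estimated from the actual range values.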
You might even test a server-side function using a LOOP (which is often less efficient, but this may be an exception):
CREATE OR REPLACE FUNCTION foo(_ranges int[])
  RETURNS SETOF playground LANGUAGE plpgsql PARALLEL SAFE STABLE AS
$func$
DECLARE
   _range int[];
BEGIN
   FOREACH _range SLICE 1 IN ARRAY _ranges
   LOOP
      RETURN QUERY
      SELECT * FROM playground WHERE val BETWEEN _range[1] AND _range[2];
   END LOOP;
END
$func$;
The overhead may not pay off for only a few ranges per call. But it is very convenient to call, if nothing else:
SELECT * FROM foo('{{1,5},{20,25},{200,400}}');
Related:
Loop over array dimension in plpgsql
Physical order of rows may help a lot. If rows are stored in sequence, (much) fewer data pages need to be processed. Depends on undisclosed details. Built-in CLUSTER or the extensions pg_repack or pg_squeeze may help with that. Related:
Optimize Postgres timestamp query range
And it's recommended to use the latest available minor release for whatever major version is in use. That would be 12.2 at the time of writing (released 2020-02-13).

Consume results from function lazily in PostgreSQL?

I have a function in my database that returns a lot of rows:
CREATE FUNCTION lots_of_rows(n integer) RETURNS SETOF integer
STABLE LANGUAGE plpgsql AS $$ BEGIN
    FOR i IN 1..10000000 LOOP
        RETURN NEXT i * n;
    END LOOP;
END $$;
Unsurprisingly, queries that use this function are not very fast:
=# EXPLAIN ANALYZE SELECT n FROM lots_of_rows(4) as n;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Function Scan on lots_of_rows n (cost=0.25..10.25 rows=1000 width=4) (actual time=1867.135..2900.167 rows=10000000 loops=1)
Planning Time: 0.026 ms
Execution Time: 3494.365 ms
(3 rows)
That is to be expected. But what frustrates me is that I pay for the whole cost of this function even if I only use a tiny subset of the resulting rows:
=# EXPLAIN ANALYZE SELECT n FROM lots_of_rows(4) as n LIMIT 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.25..0.35 rows=10 width=4) (actual time=1863.679..1863.682 rows=10 loops=1)
-> Function Scan on lots_of_rows n (cost=0.25..10.25 rows=1000 width=4) (actual time=1863.675..1863.676 rows=10 loops=1)
Planning Time: 0.044 ms
Execution Time: 1872.395 ms
(4 rows)
Clearly, that is very wasteful. For comparison, if I do the same thing with a recursive view, it takes essentially zero time:
CREATE RECURSIVE VIEW lots_of_rows (n) AS
VALUES (1)
UNION ALL
SELECT n+1 FROM lots_of_rows WHERE n < 10000000;
=# EXPLAIN ANALYZE SELECT n * 4 FROM lots_of_rows LIMIT 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2.95..3.28 rows=10 width=4) (actual time=0.005..0.027 rows=10 loops=1)
-> Subquery Scan on lots_of_rows (cost=2.95..3.96 rows=31 width=4) (actual time=0.005..0.023 rows=10 loops=1)
-> CTE Scan on lots_of_rows lots_of_rows_1 (cost=2.95..3.57 rows=31 width=4) (actual time=0.003..0.020 rows=10 loops=1)
CTE lots_of_rows
-> Recursive Union (cost=0.00..2.95 rows=31 width=4) (actual time=0.002..0.015 rows=10 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
-> WorkTable Scan on lots_of_rows lots_of_rows_2 (cost=0.00..0.23 rows=3 width=4) (actual time=0.001..0.001 rows=1 loops=9)
Filter: (n < 10000000)
Planning Time: 0.213 ms
Execution Time: 0.089 ms
(10 rows)
But of course, my function takes an argument, n, but views cannot accept arguments, so some of the implementation details have to leak out into my individual queries.
Of course, this lots_of_rows function is very silly, and I do not actually literally use it anywhere. My real function is more complex: it accepts several different arguments and uses them to construct a SELECT query, iterates over the results using FOR, and for certain rows, returns records using RETURN NEXT. It is not nearly as simple to replace that particular function with a view.
Furthermore, it is not straightforward to move the limiting logic from my enclosing query into the function, since the enclosing queries sometimes add various WHERE conditions to the result:
SELECT r.id FROM complicated_function($1, $2, $3, $4) as r
WHERE r.is_public AND r.score > 0 LIMIT 20;
I guess I could always just add a ton of different arguments to the function for all the different conditions I need, but ideally, I’d like to be able to keep my function as it is (since it encapsulates precisely the abstraction I want), just somehow stream the results to the caller on-demand so that it acts a little bit more like a view (albeit still more or less opaque to the query planner). Is that at all possible, or must a function’s result be completely materialized in memory before it returns?

I believe you may be able to achieve what you're looking for by having the function return a cursor.
A cursor should allow the caller to fetch rows in batches rather than all at once, so that results reach the caller sooner and less data is held in memory at once on both the client and the server.
Note: There is overhead on the server in terms of maintaining the cursor. The caller should close the cursor explicitly once done (otherwise it will close at the end of the transaction).
In particular, check out the section of the PL/pgSQL documentation entitled 43.7.3.5. Returning Cursors.
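A rough sketch of what that could look like, assuming the rows come from a query rather than a FOR loop. Everything named here is hypothetical: the numbers table stands in for whatever the real function selects from, and lots_of_rows_cursor and the portal name lots_cur are made up for illustration.

```sql
-- Hypothetical stand-in for the real data source.
CREATE TABLE numbers AS
SELECT i FROM generate_series(1, 100000) AS i;

CREATE FUNCTION lots_of_rows_cursor(n integer) RETURNS refcursor
LANGUAGE plpgsql AS $$
DECLARE
    cur refcursor := 'lots_cur';  -- fixed portal name so the caller can refer to it
BEGIN
    -- Opening the cursor does not fetch any rows yet.
    OPEN cur FOR SELECT i * n FROM numbers;
    RETURN cur;
END $$;

-- The cursor only lives inside a transaction.
BEGIN;
SELECT lots_of_rows_cursor(4);
FETCH 10 FROM lots_cur;   -- only these rows are pulled from the query
CLOSE lots_cur;
COMMIT;
```

The trade-off is that the caller can no longer treat the result as a table in a FROM clause, so outer WHERE conditions have to be applied by the caller while fetching or moved into the cursor's query.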

Postgres has two possible implementations of table functions:
Row-based with persistent context: the function is called repeatedly and returns one row per call, so only the rows actually required are produced. It is a little more CPU-expensive, but it can stop early. Only C functions can use this implementation.
Tuplestore: this is your case; PL/pgSQL and the other non-C languages use it. When the function is called, a special structure, the tuplestore, is filled. All rows are generated and stored. The reader of the tuplestore (the parent node) may read all of the rows or only some, but all rows are produced every time. An outer LIMIT clause is not pushed down into the function, so it has no effect on speed.
There is no other implementation, so if you need to limit the result, you have to do so explicitly (manually) inside the function (if you want to stay with a higher-level language).
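Limiting explicitly could look like the following sketch, based on the lots_of_rows function from the question. The function name lots_of_rows_limited and the max_rows parameter are made up for illustration.

```sql
-- Variant of lots_of_rows with a manual row limit (hypothetical parameter max_rows).
CREATE FUNCTION lots_of_rows_limited(n integer, max_rows integer)
RETURNS SETOF integer
STABLE LANGUAGE plpgsql AS $$
BEGIN
    FOR i IN 1..10000000 LOOP
        -- Stop producing rows once the requested number has been emitted.
        EXIT WHEN i > max_rows;
        RETURN NEXT i * n;
    END LOOP;
END $$;

-- Only 10 rows are generated, so no outer LIMIT is needed.
SELECT n FROM lots_of_rows_limited(4, 10) AS n;
```

This only helps when the limit applies to the rows the function itself emits; if the enclosing query filters further, the function cannot know how many rows it needs to produce.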

Postgresql LIKE ANY versus LIKE

I've tried to be thorough in this question, so if you're impatient, just jump to the end to see what the actual question is...
I'm working on adjusting how some search features in one of our databases is implemented. To this end, I'm adding some wildcard capabilities to our application's API that interfaces back to Postgresql.
The issue that I've found is that the EXPLAIN ANALYZE times do not make sense to me and I'm trying to figure out where I could be going wrong; it doesn't seem likely that 15 queries is better than just one optimized query!
The table, Words, has two relevant columns for this question: id and text. The text column has an index on it that was built with the text_pattern_ops option. Here's what I'm seeing:
First, using a LIKE ANY with a VALUES clause, which some references seem to indicate would be ideal in my case (finding multiple words):
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY (values('test%'));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6716668.40..6727372.85 rows=1070445 width=4) (actual time=103088.381..103091.468 rows=256 loops=1)
Group Key: words.id
-> Nested Loop Semi Join (cost=0.00..6713992.29 rows=1070445 width=4) (actual time=0.670..103087.904 rows=256 loops=1)
Join Filter: ((words.text)::text ~~ "*VALUES*".column1)
Rows Removed by Join Filter: 214089311
-> Seq Scan on words (cost=0.00..3502655.91 rows=214089091 width=21) (actual time=0.017..25232.135 rows=214089567 loops=1)
-> Materialize (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=214089567)
-> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=1)
Planning time: 0.226 ms
Execution time: 103106.296 ms
(10 rows)
As you can see, the execution time is horrendous.
A second attempt, using LIKE ANY(ARRAY[... yields:
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY(ARRAY['test%']);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=3770401.08..3770615.17 rows=21409 width=4) (actual time=37399.573..37399.704 rows=256 loops=1)
Group Key: id
-> Seq Scan on words (cost=0.00..3770347.56 rows=21409 width=4) (actual time=0.224..37399.054 rows=256 loops=1)
Filter: ((text)::text ~~ ANY ('{test%}'::text[]))
Rows Removed by Filter: 214093922
Planning time: 0.611 ms
Execution time: 37399.895 ms
(7 rows)
As you can see, performance is dramatically improved, but still far from ideal: 37 seconds with one word in the list. Moving that up to three words that return a total of 256 rows changes the execution time to well over 100 seconds.
The last try, doing a LIKE for a single word:
events_prod=# explain analyze select distinct id from words where words.text LIKE 'test%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=60.14..274.23 rows=21409 width=4) (actual time=1.437..1.576 rows=256 loops=1)
Group Key: id
-> Index Scan using words_special_idx on words (cost=0.57..6.62 rows=21409 width=4) (actual time=0.048..1.258 rows=256 loops=1)
Index Cond: (((text)::text ~>=~ 'test'::text) AND ((text)::text ~<~ 'tesu'::text))
Filter: ((text)::text ~~ 'test%'::text)
Planning time: 0.826 ms
Execution time: 1.858 ms
(7 rows)
As expected, this is the fastest, but the 1.85ms makes me wonder if there is something else I'm missing with the VALUES and ARRAY approach.
The Question
Is there some more efficient way to do something like this in Postgresql that I've missed in my research?
select distinct id
from words
where words.text LIKE ANY(ARRAY['word1%', 'another%', 'third%']);

This is a bit speculative. I think the key is your pattern:
where words.text LIKE 'test%'
Note that the LIKE pattern starts with a constant string. This means that Postgres can do a range scan on the index for the words that start with 'test'.
When you then introduce multiple comparisons, the optimizer gets confused and no longer considers multiple range scans. Instead, it decides that it needs to process all the rows.
This may be a case where this re-write gives you the performance that you want:
select id
from words
where words.text LIKE 'word1%'
union
select id
from words
where words.text LIKE 'another%'
union
select id
from words
where words.text LIKE 'third%';
Notes:
The distinct is not needed because of the union.
If the pattern starts with a wildcard, then a full scan is needed anyway.
You might want to consider an n-gram or full-text index on the table.
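For the n-gram suggestion, a trigram index from the pg_trgm extension is one option to test. This is a sketch under assumptions: the index name is made up, building a GIN index on a table of this size will take considerable time and space, and whether the planner actually picks it for LIKE ANY has to be checked with EXPLAIN on your version.

```sql
-- pg_trgm ships with PostgreSQL as a contrib extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical index name; gin_trgm_ops supports LIKE, including patterns
-- that start with a wildcard.
CREATE INDEX words_text_trgm_idx ON words USING gin (text gin_trgm_ops);

-- Re-check the plan for the original query shape.
EXPLAIN ANALYZE
SELECT DISTINCT id
FROM   words
WHERE  words.text LIKE ANY (ARRAY['word1%', 'another%', 'third%']);
```

If the plan still shows a Seq Scan, the UNION rewrite above remains the safer route for prefix-only patterns.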
