I've tried to be thorough in this question, so if you're impatient, just jump to the end to see what the actual question is...
I'm working on adjusting how some search features in one of our databases is implemented. To this end, I'm adding some wildcard capabilities to our application's API that interfaces back to Postgresql.
The issue that I've found is that the EXPLAIN ANALYZE times do not make sense to me and I'm trying to figure out where I could be going wrong; it doesn't seem likely that 15 queries is better than just one optimized query!
The table, Words, has two relevant columns for this question: id and text. The text column has an index on it that was build with the text_pattern_ops option. Here's what I'm seeing:
First, using a LIKE ANY with a VALUES clause, which some references seem to indicate would be ideal in my case (finding multiple words):
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY (values('test%'));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=6716668.40..6727372.85 rows=1070445 width=4) (actual time=103088.381..103091.468 rows=256 loops=1)
Group Key: words.id
-> Nested Loop Semi Join (cost=0.00..6713992.29 rows=1070445 width=4) (actual time=0.670..103087.904 rows=256 loops=1)
Join Filter: ((words.text)::text ~~ "*VALUES*".column1)
Rows Removed by Join Filter: 214089311
-> Seq Scan on words (cost=0.00..3502655.91 rows=214089091 width=21) (actual time=0.017..25232.135 rows=214089567 loops=1)
-> Materialize (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=214089567)
-> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=1 width=32) (actual time=0.006..0.006 rows=1 loops=1)
Planning time: 0.226 ms
Execution time: 103106.296 ms
(10 rows)
As you can see, the execution time is horrendous.
A second attempt, using LIKE ANY(ARRAY[... yields:
events_prod=# explain analyze select distinct id from words where words.text LIKE ANY(ARRAY['test%']);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=3770401.08..3770615.17 rows=21409 width=4) (actual time=37399.573..37399.704 rows=256 loops=1)
Group Key: id
-> Seq Scan on words (cost=0.00..3770347.56 rows=21409 width=4) (actual time=0.224..37399.054 rows=256 loops=1)
Filter: ((text)::text ~~ ANY ('{test%}'::text[]))
Rows Removed by Filter: 214093922
Planning time: 0.611 ms
Execution time: 37399.895 ms
(7 rows)
As you can see, performance is dramatically improved, but still far from ideal... 37 seconds. with one word in the list. Moving that up to three words that returns a total of 256 rows changes the execution time to well over 100 seconds.
The last try, doing a LIKE for a single word:
events_prod=# explain analyze select distinct id from words where words.text LIKE 'test%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=60.14..274.23 rows=21409 width=4) (actual time=1.437..1.576 rows=256 loops=1)
Group Key: id
-> Index Scan using words_special_idx on words (cost=0.57..6.62 rows=21409 width=4) (actual time=0.048..1.258 rows=256 loops=1)
Index Cond: (((text)::text ~>=~ 'test'::text) AND ((text)::text ~<~ 'tesu'::text))
Filter: ((text)::text ~~ 'test%'::text)
Planning time: 0.826 ms
Execution time: 1.858 ms
(7 rows)
As expected, this is the fastest, but the 1.85ms makes me wonder if there is something else I'm missing with the VALUES and ARRAY approach.
The Question
Is there some more efficient way to do something like this in Postgresql that I've missed in my research?
select distinct id
from words
where words.text LIKE ANY(ARRAY['word1%', 'another%', 'third%']);
This is a bit speculative. I think the key is your pattern:
where words.text LIKE 'test%'
Note that the like pattern starts with a constant string. The means that Postgres can do a range scan on the index for the words that start with 'test'.
When you then introduce multiple comparisons, the optimizer gets confused and no longer considers multiple range scans. Instead, it decides that it needs to process all the rows.
This may be a case where this re-write gives you the performance that you want:
select id
from words
where words.text LIKE 'word1%'
union
select id
from words
where words.text LIKE 'another%'
union
select id
from words
where words.text LIKE 'third%';
Notes:
The distinct is not needed because of the union.
If the pattern starts with a wildcard, then a full scan is needed anyway.
You might want to consider an n-gram or full-text index on the table.
Related
I have a table mainly used by this query (only 3 columns are in use here, meter, timeStampUtc and createdOnUtc, but there are other in the table), which starts to take too long:
select
rank() over (order by mr.meter, mr."timeStampUtc") as row_name
, max(mr."createdOnUtc") over (partition by mr.meter, mr."timeStampUtc") as "createdOnUtc"
from
"MeterReading" mr
where
"createdOnUtc" >= '2021-01-01'
order by row_name
;
(this is the minimal query to show my issue. It might not make too much sense on its own, or could be rewritten)
I am wondering which index (or other technique) to use to optimise this particular query.
A basic index on createdOnUtc helps already.
I am mostly wondering about those 2 windows functions. They are very similar, so I factorised them (named window with thus identical partition by and order by), it had no effect. Adding an index on meter, "timeStampUtc" had no effect either (query plan unchanged).
Is there no way to use an index on those 2 columns inside a window function?
Edit - explain analyze output: using the createdOnUtc index
Sort (cost=8.51..8.51 rows=1 width=40) (actual time=61.045..62.222 rows=26954 loops=1)
Sort Key: (rank() OVER (?))
Sort Method: quicksort Memory: 2874kB
-> WindowAgg (cost=8.46..8.50 rows=1 width=40) (actual time=18.373..57.892 rows=26954 loops=1)
-> WindowAgg (cost=8.46..8.48 rows=1 width=40) (actual time=18.363..32.444 rows=26954 loops=1)
-> Sort (cost=8.46..8.46 rows=1 width=32) (actual time=18.353..19.663 rows=26954 loops=1)
Sort Key: meter, "timeStampUtc"
Sort Method: quicksort Memory: 2874kB
-> Index Scan using "MeterReading_createdOnUtc_idx" on "MeterReading" mr (cost=0.43..8.45 rows=1 width=32) (actual time=0.068..8.059 rows=26954 loops=1)
Index Cond: ("createdOnUtc" >= '2021-01-01 00:00:00'::timestamp without time zone)
Planning Time: 0.082 ms
Execution Time: 63.698 ms
Is there no way to use an index on those 2 columns inside a window function?
That is correct; a window function cannot use an index, as the work only on what otherwise would be the final result, all data selection has already finished. From the documentation.
The rows considered by a window function are those of the “virtual
table” produced by the query's FROM clause as filtered by its WHERE,
GROUP BY, and HAVING clauses if any. For example, a row removed
because it does not meet the WHERE condition is not seen by any window
function. A query can contain multiple window functions that slice up
the data in different ways using different OVER clauses, but they all
act on the same collection of rows defined by this virtual table.
The purpose of an index is to speed up the creation of that "virtual table". Applying an index would just slow things down: the data is already in memory. Scanning it is orders of magnitude faster any any index.
I have two tables which links to each other like this:
Table answered_questions with the following columns and indexes:
id: primary key
taken_test_id: integer (foreign key)
question_id: integer (foreign key, links to another table called questions)
indexes: (taken_test_id, question_id)
Table taken_tests
id: primary key
user_id: (foreign key, links to table Users)
indexes: user_id column
First query (with EXPLAIN ANALYZE output):
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "answered_questions"."taken_test_id" = "taken_tests"."id"
WHERE
"taken_tests"."user_id" = 1;
Output:
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.025..2.208 rows=653 loops=1)
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
Index Cond: (user_id = 1)
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=0.00
2..0.003 rows=2 loops=371)
Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.276 ms
Execution time: 2.365 ms
(7 rows)
Another query (this is generated automatically by Rails when using joins method in ActiveRecord)
EXPLAIN ANALYZE
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
And here is the output
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=23.611..1257.807 rows=653 loops=1)
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
Index Cond: (user_id = 1)
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual time=2.07
1..3.195 rows=2 loops=371)
Index Cond: (taken_test_id = taken_tests.id)
Planning time: 0.302 ms
Execution time: 1258.035 ms
(7 rows)
The only difference is the order of columns in the INNER JOIN condition. In the first query, it is ON "answered_questions"."taken_test_id" = "taken_tests"."id" while in the second query, it is ON "taken_tests"."id" = "answered_questions"."taken_test_id". But the query time is hugely different.
Do you have any idea why this happens? I read some articles and it says that the order of columns in JOIN condition should not affect the execution time (ex: Best practices for the order of joined columns in a sql join?)
I am using Postgres 9.6. There are more than 40 million rows in answered_questions table and more than 3 million rows in taken_tests table
Update 1:
When I ran the EXPLAIN with (analyze true, verbose true, buffers true), I got a much better result for the second query (quite similar to the first query)
EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE)
SELECT
"answered_questions".*
FROM
"answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE
"taken_tests"."user_id" = 1;
Output
Nested Loop (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.030..2.192 rows=653 loops=1)
Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, a
nswered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
Buffers: shared hit=1986
-> Index Scan using index_taken_tests_on_user_id on public.taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.441 rows=371 loops=1)
Output: taken_tests.id
Index Cond: (taken_tests.user_id = 1)
Buffers: shared hit=269
-> Index Scan using index_answered_questions_on_taken_test_id_and_question_id on public.answered_questions (cost=0.56..1273.61 rows=365 width=61) (actual ti
me=0.002..0.003 rows=2 loops=371)
Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated
_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
Index Cond: (answered_questions.taken_test_id = taken_tests.id)
Buffers: shared hit=1717
Planning time: 0.238 ms
Execution time: 2.335 ms
As you can see from the initial EXPLAIN ANALYZE statement results -- the queries are resulting in the equivalent query plan and are executed exactly the same.
The difference comes from the very same unit's execution time:
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483rows=371 loops=1)
and
-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474rows=371 loops=1)
As the commenters already pointed out (see documentation links in the wuestion comments), the query plan for an inner join is expected to be the same regardless of the table order. It is ordered based on the query planner decisions. This means that you should really look at other performance-optimisation parts of the query execution. One of those would be memory used for caching (SHARED BUFFER). It looks like the query results would depend a lot on whether this data has already been loaded into memory. Just as you have noticed -- the query execution time grows after you have waited some time. This clearly indicates the cache expiry issue more than the plan problem.
Increasing the size of the shared buffers may help resolve it, but the initial execution of the query will always take longer -- this is just your disk access speed.
For more hints on memory configuration of Pg database see here: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
Note: VACUUM or ANALYZE commands will be unlikely to help here. Both queries are using the same plan already. Keep in mind, though, that due to PostgreSQL transaction isolation mechanism (MVCC) it may have to read the underlying table rows to validate that they are still visible to the current transaction after getting the results from the index. This could be improved by updating the visibility map (see https://www.postgresql.org/docs/10/storage-vm.html), which is done during vacuuming.
I have a .exists() query in an app I am writing. I want to optimize it.
The current ORM expression yields SQL that looks like this:
SELECT DISTINCT
(1) AS "a",
"the_model"."id",
... snip every single column on the_model
FROM "the_model"
WHERE (
...snip criteria...
LIMIT 1
The explain plan looks like this:
Limit (cost=176.60..176.63 rows=1 width=223)
-> Unique (cost=176.60..177.40 rows=29 width=223)
-> Sort (cost=176.60..176.67 rows=29 width=223)
Sort Key: id, ...SNIP...
-> Index Scan using ...SNIP... on ...SNIP... (cost=0.43..175.89 rows=29 width=223)
Index Cond: (user_id = 6)
Filter: ...SNIP...
If I manually modify the above SQL and remove the individual table columns so it looks like this:
SELECT DISTINCT
(1) AS "a",
FROM "the_model"
WHERE (
...snip criteria...
LIMIT 1
the explain plan shows a couple fewer steps, which is great.
Limit (cost=0.43..175.89 rows=1 width=4)
-> Unique (cost=0.43..175.89 rows=1 width=4)
-> Index Scan using ...SNIP... on ...SNIP... (cost=0.43..175.89 rows=29 width=4)
Index Cond: (user_id = 6)
Filter: ...SNIP...
I can go further by removing the DISTINCT keyword from the query, thus yielding an even shallower execution plan, although the cost saving here is minor:
Limit (cost=0.43..6.48 rows=1 width=4)
-> Index Scan using ..SNIP... on ..SNIP... (cost=0.43..175.89 rows=29 width=4)
Index Cond: (user_id = 6)
Filter: ..SNIP...
I can modify the ORM expression using .only('id') to select only one field. However, that does not result in the result I would like. It is doing an unnecessary sort on id. Ideally, I would like to do a .only(None), since none of the columns are needed here. They only add weight.
I would also like to remove the DISTINCT keyword, if possible. I don't think it adds much execution time if the columns are removed though.
It seems like this could be done across the board, as .exists() returns a boolean. None of the returned columns are used for anything. They only complicate the query and reduce performance.
I found that during my QuerySet construction, .distinct() was being called before my .exists(). It was causing the columns to be selected unnecessarily.
EXPLAIN SELECT a.name, m.name FROM Casting c JOIN Movie m ON c.m_id = m.m_id JOIN Actor a ON a.a_id = c.a_id AND c.a_id < 50;
Output
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=26.20..18354.49 rows=1090 width=27) (actual time=0.240..5.603 rows=1011 loops=1)
-> Nested Loop (cost=25.78..12465.01 rows=1090 width=15) (actual time=0.236..4.046 rows=1011 loops=1)
-> Bitmap Heap Scan on casting c (cost=25.35..3660.19 rows=1151 width=8) (actual time=0.229..1.059 rows=1011 loops=1)
Recheck Cond: (a_id < 50)
Heap Blocks: exact=989
-> Bitmap Index Scan on casting_a_id_index (cost=0.00..25.06 rows=1151 width=0) (actual time=0.114..0.114 rows=1011 loops=1)
Index Cond: (a_id < 50)
-> Index Scan using movie_pkey on movie m (cost=0.42..7.64 rows=1 width=15) (actual time=0.003..0.003 rows=1 loops=1011)
Index Cond: (m_id = c.m_id)
-> Index Scan using actor_pkey on actor a (cost=0.42..5.39 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=1011)
Index Cond: (a_id = c.a_id)
Planning time: 0.334 ms
Execution time: 5.672 ms
(13 rows)
I am trying to understand how query planner works? I am able to understand the process it choose, but I am not getting why ?
Can someone explain query optimizer choices (choice of query processing algorithms, join order) in these queries based on parameters like query selectivity and cost models or anything that effects choice?
Also why there is use of Recheck Cond, after index scan ?
There are two reasons why there has to be a Bitmap Heap Scan:
PostgreSQL has to check whether the rows found are visible for the current transaction or not. Remember that PostgreSQL keeps old row versions in the table until VACUUM removes them. This visibility information is not stored in the index.
If work_mem is not large enough to contain a bitmap with one bit per table row, PostgreSQL uses one bit per table page, which loses some information. The PostgreSQL needs to check the lossy blocks to see which of the rows in the block really satisfy the condition.
You can see this when you use EXPLAIN (ANALYZE, BUFFERS), then PostgreSQL will show if there were lossy matches, see this example on rextester:
-> Bitmap Heap Scan on t (cost=177.14..4719.43 rows=9383 width=0)
(actual time=2.130..144.729 rows=10001 loops=1)
Recheck Cond: (val = 10)
Rows Removed by Index Recheck: 738586
Heap Blocks: exact=646 lossy=3305
Buffers: shared hit=1891 read=2090
-> Bitmap Index Scan on t_val_idx (cost=0.00..174.80 rows=9383 width=0)
(actual time=1.978..1.978 rows=10001 loops=1)
Index Cond: (val = 10)
Buffers: shared read=30
I cannot explain the whole of the PostgreSQL optimizer in this answer, but what it does is to try all possible ways to compute the result, estimate how much each one will cost and choose the cheapest plan.
To estimate how big the result set will be, it uses the object definitions and the table statistics, which contain detailed data about how the column values are distributed.
It then calculates how many disk blocks it will have to read sequentially and by random access (I/O cost), and how many tables and index rows and function calls it will have to process (CPU cost) to come up with a grand total. The weights for each of these components in the total can be configured.
Usually the best plan is one that reduces the number of result rows as quickly as possible by applying the most selective condition first. In your case this seems to be casting.a_id < 50.
Nested loop joins are often preferred if the number of rows in the outer (upper in EXPLAIN output) table is small.
I'm having a Postgresql (version 9.4) performance puzzle. I have a function (prevd) declared as STABLE (see below). When I run this function on a constant in where clause, it is called multiple times - instead of once.
If I understand postgres documentation correctly, the query should be optimized to call prevd only once.
A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement
Why it doesn't optimize calls to prevd in this case?
I'm not expecting prevd to be called once for all subsequent queries using prevd on the same argument (like it was IMMUTABLE). I'm expecting postgres to create a plan for my query with just one call to prevd('2015-12-12')
Please find the code below:
Schema
create table somedata(d date, number double precision);
create table dates(d date);
insert into dates
select generate_series::date
from generate_series('2015-01-01'::date, '2015-12-31'::date, '1 day');
insert into somedata
select '2015-01-01'::date + (random() * 365 + 1)::integer, random()
from generate_series(1, 100000);
create or replace function prevd(date_ date)
returns date
language sql
stable
as $$
select max(d) from dates where d < date_;
$$
Slow Query
select avg(number) from somedata where d=prevd('2015-12-12');
Poor query plan of the query above
Aggregate (cost=28092.74..28092.75 rows=1 width=8) (actual time=3532.638..3532.638 rows=1 loops=1)
Output: avg(number)
-> Seq Scan on public.somedata (cost=0.00..28091.43 rows=525 width=8) (actual time=10.210..3532.576 rows=282 loops=1)
Output: d, number
Filter: (somedata.d = prevd('2015-12-12'::date))
Rows Removed by Filter: 99718
Planning time: 1.144 ms
Execution time: 3532.688 ms
(8 rows)
Performance
The query above, on my machine runs around 3.5s. After changing prevd to IMMUTABLE, it's changing to 0.035s.
I started writing this as a comment, but it got a bit long, so I'm expanding it into an answer.
As discussed in this previous answer, Postgres does not promise to always optimise based on STABLE or IMMUTABLE annotations, only that it can sometimes do so. It does this by planning the query differently by taking advantage of certain assumptions. This part of the previous answer is directly analogous to your case:
This particular sort of rewriting depends upon immutability or stability. With where test_multi_calls1(30) != num query re-writing will happen for immutable but not for merely stable functions.
If you change the function to IMMUTABLE and look at the query plan, you will see that the rewriting it does is really rather radical:
Seq Scan on public.somedata (cost=0.00..1791.00 rows=272 width=12) (actual time=0.036..14.549 rows=270 loops=1)
Output: d, number
Filter: (somedata.d = '2015-12-11'::date)
Buffers: shared read=541 written=14
Total runtime: 14.589 ms
It actually runs the function while planning the query, and substitutes the value before the query is even executed. With a STABLE function, this optimisation would clearly not be appropriate - the data might change between planning and executing the query.
In a comment, it was mentioned that this query results in an optimised plan:
select avg(number) from somedata where d=(select prevd(date '2015-12-12'));
This is fast, but note that the plan doesn't look anything like what the IMMUTABLE version did:
Aggregate (cost=1791.69..1791.70 rows=1 width=8) (actual time=14.670..14.670 rows=1 loops=1)
Output: avg(number)
Buffers: shared read=541 written=21
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
Output: '2015-12-11'::date
-> Seq Scan on public.somedata (cost=0.00..1791.00 rows=273 width=8) (actual time=0.026..14.589 rows=270 loops=1)
Output: d, number
Filter: (somedata.d = $0)
Buffers: shared read=541 written=21
Total runtime: 14.707 ms
By putting it into a sub-query, you are moving the function call from the WHERE clause to the SELECT clause. More importantly, the sub-query can always be executed once and used by the rest of the query; so the function is run once in a separate node of the plan.
To confirm this, we can take the SQL out of a function altogether:
select avg(number) from somedata where d=(select max(d) from dates where d < '2015-12-12');
This gives a rather longer plan with very similar performance:
Aggregate (cost=1799.12..1799.13 rows=1 width=8) (actual time=14.174..14.174 rows=1 loops=1)
Output: avg(somedata.number)
Buffers: shared read=543 written=19
InitPlan 1 (returns $0)
-> Aggregate (cost=7.43..7.44 rows=1 width=4) (actual time=0.150..0.150 rows=1 loops=1)
Output: max(dates.d)
Buffers: shared read=2
-> Seq Scan on public.dates (cost=0.00..6.56 rows=347 width=4) (actual time=0.015..0.103 rows=345 loops=1)
Output: dates.d
Filter: (dates.d < '2015-12-12'::date)
Buffers: shared read=2
-> Seq Scan on public.somedata (cost=0.00..1791.00 rows=273 width=8) (actual time=0.190..14.098 rows=270 loops=1)
Output: somedata.d, somedata.number
Filter: (somedata.d = $0)
Buffers: shared read=543 written=19
Total runtime: 14.232 ms
The important thing to note is that the inner Aggregate (the max(d)) is executed once, on a separate node from the main Seq Scan (which is checking the where clause). In this position, even a VOLATILE function can be optimised in the same way.
In short, while you know that the query you've produced can be optimised by executing the function only once, it doesn't match any of the patterns that Postgres's query planner knows how to rewrite, so it uses a naive plan which runs the function multiple times.
[Note: all tests performed on Postgres 9.1, because it's what I happened to have to hand.]