PostgreSQL text range scan

I have written a query that aims to return 10 results including the current one, padded with up to 9 entries on either side, for an alphabetical list that the receiver can sort.
This is the query I am using. My issue is not with the result, but with the fact that neither of the subqueries uses an index.
(
SELECT
uid,
title
FROM
books
WHERE
lower(title) < lower('Frankenstein')
ORDER BY title desc
LIMIT 9
)
UNION
(
SELECT
uid,
title
FROM
books
WHERE
lower(title) >= lower('Frankenstein')
ORDER BY title
LIMIT 10
)
ORDER BY title;
The index I am trying to use is a plain btree on the expression, without text_pattern_ops or similar:
CREATE INDEX books_title_idx ON books USING btree (lower(title));
If I run EXPLAIN on the first part of the union, it performs a full table scan in spite of the LIMIT and ORDER BY:
explain analyze
SELECT
uid,
title
FROM
books
WHERE
lower(title) < lower('Frankenstein')
ORDER BY title desc
LIMIT 9
Limit (cost=69.04..69.06 rows=9 width=152) (actual time=6.276..6.292 rows=9 loops=1)
-> Sort (cost=69.04..69.67 rows=251 width=152) (actual time=6.273..6.277 rows=9 loops=1)
Sort Key: ((title))
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on books (cost=0.00..63.80 rows=251 width=152) (actual time=0.056..5.227 rows=267 loops=1)
Filter: (lower((title)) < 'frankenstein'::text)
Rows Removed by Filter: 486
Total runtime: 6.359 ms
When I do an equality check in the same query, the index is used:
explain analyze
SELECT
uid,
title
FROM
books
WHERE
lower(title) = lower('Frankenstein')
ORDER BY title desc
Sort (cost=17.04..17.05 rows=4 width=152) (actual time=0.054..0.054 rows=0 loops=1)
Sort Key: ((title))
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on books (cost=4.31..17.00 rows=4 width=152) (actual time=0.041..0.041 rows=0 loops=1)
Recheck Cond: (lower((title)) = 'frankenstein'::text)
-> Bitmap Index Scan on books_title_idx (cost=0.00..4.31 rows=4 width=0) (actual time=0.036..0.036 rows=0 loops=1)
Index Cond: (lower((title)) = 'frankenstein'::text)
Total runtime: 0.129 ms
The same applies when I do a BETWEEN-style range query:
explain analyze
SELECT
uid,
title
FROM
books
WHERE
lower(title) > lower('Frankenstein') AND lower(title) < lower('Gulliver''s Travels')
ORDER BY title
Sort (cost=17.08..17.09 rows=4 width=152) (actual time=0.511..0.529 rows=25 loops=1)
Sort Key: (title)
Sort Method: quicksort Memory: 27kB
-> Bitmap Heap Scan on books (cost=4.33..17.04 rows=4 width=152) (actual time=0.118..0.213 rows=25 loops=1)
Recheck Cond: ((lower(title) > 'frankenstein'::text) AND (lower(title) < 'gulliver''s travels'::text))
-> Bitmap Index Scan on books_title_idx (cost=0.00..4.33 rows=4 width=0) (actual time=0.087..0.087 rows=25 loops=1)
Index Cond: ((lower(title) > 'frankenstein'::text) AND (lower(title) < 'gulliver''s travels'::text))
Total runtime: 0.621 ms
What I am obviously looking for here is not a between search because the beginning and end are unknown.
So is this a PostgreSQL limitation, or is there something other than setting the cost of a table scan to something silly that I can use to convince the query planner to use the index?
I am using PostgreSQL 9.3

Use:
ORDER BY lower(title) DESC
or
ORDER BY lower(title)
to match your functional index, so it can be utilized.
ORDER BY is irrelevant for the selection of rows in the other two queries. That's why the index can be used in those cases.
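Applied to the query from the question, this might look like the sketch below. Only the ORDER BY clauses inside the two branches change. The outer ORDER BY still names the output column title, because PostgreSQL accepts only result column names (not expressions) when ordering a UNION result. Whether the planner actually picks books_title_idx still depends on its cost estimates for the table.

```sql
(
    SELECT uid, title
    FROM books
    WHERE lower(title) < lower('Frankenstein')
    -- Same expression as the index, so the index can be read backwards
    -- and the scan can stop after 9 rows.
    ORDER BY lower(title) DESC
    LIMIT 9
)
UNION
(
    SELECT uid, title
    FROM books
    WHERE lower(title) >= lower('Frankenstein')
    -- Same expression as the index, read forwards.
    ORDER BY lower(title)
    LIMIT 10
)
-- Unchanged from the question: only output column names are allowed here.
ORDER BY title;
```

Running EXPLAIN ANALYZE on each branch separately, as in the question, is a quick way to check whether the Seq Scan and Sort nodes have been replaced by an index scan.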

Related

Multiple ORDER BY DESC will not use index in Postgres

I'm trying to create some queries in order to implement a cursor pagination (something like this: https://shopify.engineering/pagination-relative-cursors) on Postgres. In my implementation I'm trying to reach an efficient pagination even with ordering NON-unique columns.
I'm struggling to do that efficiently, in particular on the query that retrieves the previous page given a specific cursor.
The table that I'm using to test these queries (>3M records) is very simple and has this structure:
CREATE TABLE "placemarks" (
"id" serial NOT NULL,
"assetId" text,
"createdAt" timestamptz,
PRIMARY KEY ("id")
);
I have an index on the id field (the primary key) and also an index on the assetId column.
This is the query I'm using to retrieve the next page, given a cursor composed of the latest id and the latest assetId:
SELECT
*
FROM
"placemarks"
WHERE
"assetId" > 'CURSOR_ASSETID'
or("assetId" = 'CURSOR_ASSETID'
AND id > CURSOR_INT_ID)
ORDER BY
"assetId",
id
LIMIT 5;
This query is actually pretty fast: it uses the indexes, and it also handles duplicate values in assetId by using the unique id field as a tie-breaker, so rows that share the cursor's assetId value are not skipped.
-> Sort (cost=25709.62..25726.63 rows=6803 width=2324) (actual time=0.128..0.138 rows=5 loops=1)
Sort Key: "assetId", id
Sort Method: top-N heapsort Memory: 45kB
-> Bitmap Heap Scan on placemarks (cost=271.29..25596.63 rows=6803 width=2324) (actual time=0.039..0.088 rows=11 loops=1)
Recheck Cond: ((("assetId")::text > 'CURSOR_ASSETID'::text) OR (("assetId")::text = 'CURSOR_ASSETID'::text))
Filter: ((("assetId")::text > 'CURSOR_ASSETID'::text) OR ((("assetId")::text = 'CURSOR_ASSETID'::text) AND (id > CURSOR_INT_ID)))
Rows Removed by Filter: 1
Heap Blocks: exact=10
-> BitmapOr (cost=271.29..271.29 rows=6803 width=0) (actual time=0.030..0.034 rows=0 loops=1)
-> Bitmap Index Scan on "placemarks_assetId_key" (cost=0.00..263.45 rows=6802 width=0) (actual time=0.023..0.023 rows=11 loops=1)
Index Cond: (("assetId")::text > 'CURSOR_ASSETID'::text)
-> Bitmap Index Scan on "placemarks_assetId_key" (cost=0.00..4.44 rows=1 width=0) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: (("assetId")::text = 'CURSOR_ASSETID'::text)
Planning time: 0.201 ms
Execution time: 0.194 ms
The issue arises with the query that should return the previous page for the same cursor:
SELECT
*
FROM
placemarks
WHERE
"assetId" < 'CURSOR_ASSETID'
or("assetId" = 'CURSOR_ASSETID'
AND id < CURSOR_INT_ID)
ORDER BY
"assetId" desc,
id desc
LIMIT 5;
With this query no indexes are used, even though using them would be much faster:
Limit (cost=933644.62..933644.63 rows=5 width=2324)
-> Sort (cost=933644.62..944647.42 rows=4401120 width=2324)
Sort Key: "assetId" DESC, id DESC
-> Seq Scan on placemarks (cost=0.00..860543.60 rows=4401120 width=2324)
Filter: ((("assetId")::text < 'CURSOR_ASSETID'::text) OR ((("assetId")::text = 'CURSOR_ASSETID'::text) AND (id < CURSOR_INT_ID)))
I've noticed that when I force index usage with SET enable_seqscan = OFF; the query does use the indexes and runs much faster. The resulting query plan:
Limit (cost=12.53..12.54 rows=5 width=108) (actual time=0.532..0.555 rows=5 loops=1)
-> Sort (cost=12.53..12.55 rows=6 width=108) (actual time=0.524..0.537 rows=5 loops=1)
Sort Key: assetid DESC, id DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on "placemarks" (cost=8.33..12.45 rows=6 width=108) (actual time=0.274..0.340 rows=14 loops=1)
Recheck Cond: ((assetid < 'CURSOR_ASSETID'::text) OR (assetid = 'CURSOR_ASSETID'::text))
Filter: ((assetid < 'CURSOR_ASSETID'::text) OR ((assetid = 'CURSOR_ASSETID'::text) AND (id < 14)))
Rows Removed by Filter: 1
Heap Blocks: exact=1
-> BitmapOr (cost=8.33..8.33 rows=7 width=0) (actual time=0.152..0.159 rows=0 loops=1)
-> Bitmap Index Scan on "placemarks_assetid_idx" (cost=0.00..4.18 rows=6 width=0) (actual time=0.108..0.110 rows=12 loops=1)
Index Cond: (assetid < 'CURSOR_ASSETID'::text)
-> Bitmap Index Scan on "placemarks_assetid_idx" (cost=0.00..4.15 rows=1 width=0) (actual time=0.036..0.036 rows=3 loops=1)
Index Cond: (assetid = 'CURSOR_ASSETID'::text)
Planning time: 1.319 ms
Execution time: 0.918 ms
Is there any way to optimize the second query so that it always uses the indexes?
Postgres DB version: 10.20

The fast performance of your first query seems to be down to the luck of where your constant 'CURSOR_ASSETID' falls in the distribution of that column. Or maybe this luck is not luck but is how it will always be?
For good performance more generally, including for reverse sorting, you need to write your query with a row-value (tuple) comparison rather than an OR of conditions:
WHERE
("assetId",id) < ('something',500000)
If you are using a version before incremental sorting was introduced in v13, or if "assetId" can have a large number of ties, then you will need a multicolumn index on ("assetId",id) to get optimal performance.
And there is no reason to decorate the index with DESC, as PostgreSQL knows how to read the index backwards. Decorating the index is needed when the two columns have different ordering than each other, as then you would need to read the undecorated index "spirally" rather than either completely forward or completely backwards. (But that wouldn't work well here anyway, as tuple comparators can't have different orderings between the columns.)
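As a minimal sketch of that advice, the two pagination queries from the question could be rewritten as below. The index name is hypothetical, and 12345 merely stands in for the question's CURSOR_INT_ID placeholder.

```sql
-- Hypothetical index name; the column order matches the ORDER BY clauses.
-- No DESC decoration: the same index can be read in either direction.
CREATE INDEX placemarks_assetid_id_idx ON placemarks ("assetId", id);

-- Next page: rows strictly after the cursor, in ascending order.
SELECT *
FROM placemarks
WHERE ("assetId", id) > ('CURSOR_ASSETID', 12345)
ORDER BY "assetId", id
LIMIT 5;

-- Previous page: rows strictly before the cursor, walking the index backwards.
SELECT *
FROM placemarks
WHERE ("assetId", id) < ('CURSOR_ASSETID', 12345)
ORDER BY "assetId" DESC, id DESC
LIMIT 5;
```

The previous-page query returns its rows in descending order, so the application would need to reverse them before display if the page should read top to bottom in ascending order.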

Why is Postgres query planner affected by LIMIT?

EXPLAIN ANALYZE SELECT "alerts"."id",
"alerts"."created_at",
't1'::text AS src_table
FROM "alerts"
INNER JOIN "devices"
ON "devices"."id" = "alerts"."device_id"
INNER JOIN "sites"
ON "sites"."id" = "devices"."site_id"
WHERE "sites"."cloud_id" = 111
AND "alerts"."created_at" >= '2019-08-30'
ORDER BY "created_at" DESC limit 9;
Limit (cost=1.15..36021.60 rows=9 width=16) (actual time=30.505..29495.765 rows=9 loops=1)
-> Nested Loop (cost=1.15..232132.92 rows=58 width=16) (actual time=30.504..29495.755 rows=9 loops=1)
-> Nested Loop (cost=0.86..213766.42 rows=57231 width=24) (actual time=0.029..29086.323 rows=88858 loops=1)
-> Index Scan Backward using alerts_created_at_index on alerts (cost=0.43..85542.16 rows=57231 width=24) (actual time=0.014..88.137 rows=88858 loops=1)
Index Cond: (created_at >= '2019-08-30 00:00:00'::timestamp without time zone)
-> Index Scan using devices_pkey on devices (cost=0.43..2.23 rows=1 width=16) (actual time=0.016..0.325 rows=1 loops=88858)
Index Cond: (id = alerts.device_id)
-> Index Scan using sites_pkey on sites (cost=0.29..0.31 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=88858)
Index Cond: (id = devices.site_id)
Filter: (cloud_id = 7231)
Rows Removed by Filter: 1
Total runtime: 29495.816 ms
Now we change to LIMIT 10:
EXPLAIN ANALYZE SELECT "alerts"."id",
"alerts"."created_at",
't1'::text AS src_table
FROM "alerts"
INNER JOIN "devices"
ON "devices"."id" = "alerts"."device_id"
INNER JOIN "sites"
ON "sites"."id" = "devices"."site_id"
WHERE "sites"."cloud_id" = 111
AND "alerts"."created_at" >= '2019-08-30'
ORDER BY "created_at" DESC limit 10;
Limit (cost=39521.79..39521.81 rows=10 width=16) (actual time=1.557..1.559 rows=10 loops=1)
-> Sort (cost=39521.79..39521.93 rows=58 width=16) (actual time=1.555..1.555 rows=10 loops=1)
Sort Key: alerts.created_at
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=5.24..39520.53 rows=58 width=16) (actual time=0.150..1.543 rows=11 loops=1)
-> Nested Loop (cost=4.81..16030.12 rows=2212 width=8) (actual time=0.137..0.643 rows=31 loops=1)
-> Index Scan using sites_cloud_id_index on sites (cost=0.29..64.53 rows=31 width=8) (actual time=0.014..0.057 rows=23 loops=1)
Index Cond: (cloud_id = 7231)
-> Bitmap Heap Scan on devices (cost=4.52..512.32 rows=270 width=16) (actual time=0.020..0.025 rows=1 loops=23)
Recheck Cond: (site_id = sites.id)
-> Bitmap Index Scan on devices_site_id_index (cost=0.00..4.46 rows=270 width=0) (actual time=0.006..0.006 rows=9 loops=23)
Index Cond: (site_id = sites.id)
-> Index Scan using alerts_device_id_index on alerts (cost=0.43..10.59 rows=3 width=24) (actual time=0.024..0.028 rows=0 loops=31)
Index Cond: (device_id = devices.id)
Filter: (created_at >= '2019-08-30 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 12
Total runtime: 1.603 ms
alerts table has millions of records, other tables are counted in thousands.
I can already optimize the query by simply not using limit < 10. What I don't understand is why the LIMIT affects the performance. Perhaps there's a better way than hardcoding this magic number "10".

The number of result rows affects the PostgreSQL optimizer, because plans that return the first few rows quickly are not necessarily plans that return the whole result as fast as possible.
In your case, PostgreSQL thinks that for small values of LIMIT, it will be faster to scan the alerts table in the order of the ORDER BY clause using an index and join the other tables with a nested loop until it has found 9 rows.
The benefit of such a strategy is that it doesn't have to calculate the complete result of the join, then sort it and throw away all but the first few result rows.
The danger is that it takes longer than expected to find the 9 matching rows, and this is what hits you:
Index Scan Backward using alerts_created_at_index on alerts (cost=0.43..85542.16 rows=57231 width=24) (actual time=0.014..88.137 rows=88858 loops=1)
So PostgreSQL has to process 88858 rows and use a nested loop join (which is inefficient if it has to loop often) until it finds 9 result rows. This may be because it underestimates the selectivity of the conditions, or because the many matching rows all happen to have low created_at.
The number 10 just happens to be the cut-off point where PostgreSQL thinks that strategy will no longer be more efficient; that value will change as the data in the database change.
You can avoid using that plan altogether by using an ORDER BY clause that does not match the index:
ORDER BY (created_at + INTERVAL '0 days') DESC
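Put together with the query from the question, the workaround might look like the sketch below. The column is qualified with its table name here as a precaution, on the assumption that other joined tables could also have a created_at column.

```sql
SELECT "alerts"."id",
       "alerts"."created_at",
       't1'::text AS src_table
FROM "alerts"
INNER JOIN "devices" ON "devices"."id" = "alerts"."device_id"
INNER JOIN "sites"   ON "sites"."id" = "devices"."site_id"
WHERE "sites"."cloud_id" = 111
  AND "alerts"."created_at" >= '2019-08-30'
-- Adding a zero interval leaves the value unchanged, but the sort
-- expression no longer matches the indexed column, so an ordered scan of
-- alerts_created_at_index can no longer be used to satisfy the ORDER BY.
ORDER BY ("alerts"."created_at" + INTERVAL '0 days') DESC
LIMIT 9;
```

The index on created_at remains available for the range condition in the WHERE clause; only its use as a source of pre-sorted rows is ruled out.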

Indexing columns across multiple tables in PostgreSQL

I'm trying to optimize the following join query:
A notification is a record that says whether a user has read some activity. One notification points to one activity but many users can be notified about an activity. The activity record has some columns such as the workspace the activity is in and the type of activity.
This query gets the user's non-comment notifications that have been read in a specific workspace, ordered by time.
explain analyze
select activity.id from activity, notification
where notification.user_id = '9a51f675-e1e2-46e5-8bcd-6bc535c7e7cb'
and notification.read = true
and notification.activity_id = activity.id
and activity.space_id = '6d702c09-8795-4185-abb3-dc6b3e8907dc'
and activity.type != 'commented'
order by activity.end_time desc
limit 20;
The problem is that this query has to run through every notification the user has ever gotten.
Limit (cost=4912.35..4912.36 rows=1 width=24) (actual time=138.767..138.779 rows=20 loops=1)
-> Sort (cost=4912.35..4912.36 rows=1 width=24) (actual time=138.766..138.770 rows=20 loops=1)
Sort Key: activity.end_time DESC
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop (cost=32.57..4912.34 rows=1 width=24) (actual time=1.354..138.606 rows=447 loops=1)
-> Bitmap Heap Scan on notification (cost=32.01..3847.48 rows=124 width=16) (actual time=1.341..6.639 rows=1218 loops=1)
Recheck Cond: (user_id = '9a51f675-e1e2-46e5-8bcd-6bc535c7e7cb'::uuid)
Filter: read
Rows Removed by Filter: 4101
Heap Blocks: exact=4774
-> Bitmap Index Scan on notification_user_id_idx (cost=0.00..31.98 rows=988 width=0) (actual time=0.719..0.719 rows=5355 loops=1)
Index Cond: (user_id = '9a51f675-e1e2-46e5-8bcd-6bc535c7e7cb'::uuid)
-> Index Scan using activity_pkey on activity (cost=0.56..8.59 rows=1 width=24) (actual time=0.108..0.108 rows=0 loops=1218)
Index Cond: (id = notification.activity_id)
Filter: ((type <> 'commented'::activity_type) AND (space_id = '6d702c09-8795-4185-abb3-dc6b3e8907dc'::uuid))
Rows Removed by Filter: 1
Planning time: 0.428 ms
Execution time: 138.825 ms
Edit: Here is the performance after the cache has been warmed.
Limit (cost=4912.35..4912.36 rows=1 width=24) (actual time=13.618..13.629 rows=20 loops=1)
-> Sort (cost=4912.35..4912.36 rows=1 width=24) (actual time=13.617..13.621 rows=20 loops=1)
Sort Key: activity.end_time DESC
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop (cost=32.57..4912.34 rows=1 width=24) (actual time=1.365..13.447 rows=447 loops=1)
-> Bitmap Heap Scan on notification (cost=32.01..3847.48 rows=124 width=16) (actual time=1.352..6.606 rows=1218 loops=1)
Recheck Cond: (user_id = '9a51f675-e1e2-46e5-8bcd-6bc535c7e7cb'::uuid)
Filter: read
Rows Removed by Filter: 4101
Heap Blocks: exact=4774
-> Bitmap Index Scan on notification_user_id_idx (cost=0.00..31.98 rows=988 width=0) (actual time=0.729..0.729 rows=5355 loops=1)
Index Cond: (user_id = '9a51f675-e1e2-46e5-8bcd-6bc535c7e7cb'::uuid)
-> Index Scan using activity_pkey on activity (cost=0.56..8.59 rows=1 width=24) (actual time=0.005..0.005 rows=0 loops=1218)
Index Cond: (id = notification.activity_id)
Filter: ((type <> 'commented'::activity_type) AND (space_id = '6d702c09-8795-4185-abb3-dc6b3e8907dc'::uuid))
Rows Removed by Filter: 1
Planning time: 0.438 ms
Execution time: 13.673 ms
I could create a multi-column index on user_id and read, but that doesn't solve the issue I'm trying to solve.
I could solve this problem myself by manually denormalizing the data, adding the space_id, type, and end_time columns in the notification record, but that seems like it should be unnecessary.
I would expect Postgres to be able to create an index across the two tables, but everything I have read so far says this isn't possible.
So my question: what is the best way to optimize this query?
Edit: After creating the suggested indexes:
create index tmp_index_1 on activity using btree (
space_id,
id,
end_time
) where (
type != 'commented'
);
create index tmp_index_2 on notification using btree (
user_id,
activity_id
) where (
read = true
);
The query performance improved 3x.
explain analyse
select activity.id from activity
INNER JOIN notification ON notification.user_id = '9a51f675-e1e2-46e5-8bcd-6bc535c7e7cb'
and notification.read = true
and notification.activity_id = activity.id
and activity.space_id = '6d702c09-8795-4185-abb3-dc6b3e8907dc'
and activity.type != 'commented'
order by activity.end_time desc
limit 20;
Limit (cost=955.26..955.27 rows=1 width=24) (actual time=4.386..4.397 rows=20 loops=1)
-> Sort (cost=955.26..955.27 rows=1 width=24) (actual time=4.385..4.389 rows=20 loops=1)
Sort Key: activity.end_time DESC
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop (cost=1.12..955.25 rows=1 width=24) (actual time=0.035..4.244 rows=447 loops=1)
-> Index Only Scan using tmp_index_2 on notification (cost=0.56..326.71 rows=124 width=16) (actual time=0.017..1.039 rows=1218 loops=1)
Index Cond: (user_id = '9a51f675-e1e2-46e5-8bcd-6bc535c7e7cb'::uuid)
Heap Fetches: 689
-> Index Only Scan using tmp_index_1 on activity (cost=0.56..5.07 rows=1 width=24) (actual time=0.002..0.002 rows=0 loops=1218)
Index Cond: ((space_id = '6d702c09-8795-4185-abb3-dc6b3e8907dc'::uuid) AND (id = notification.activity_id))
Heap Fetches: 1
Planning time: 0.484 ms
Execution time: 4.428 ms
The one thing that still bothers me about this query is the rows=1218 and loops=1218. This query is looping through all of the read user notifications and querying against the activities table.
I would expect to be able to create a single index to read all of this in a manner that would mimic denormalizing this data. For example, if I add space_id, type, and end_time to the notification table, I could create the following index and read in fractions of a millisecond.
create index tmp_index_3 on notification using btree (
user_id,
space_id,
end_time desc
) where (
read = true
and type != 'commented'
);
Is this not currently possible within Postgres without denormalizing?

Add these indexes:
create index ix1_activity on activity (space_id, type, end_time, id);
create index ix2_notification on notification (activity_id, user_id, read);
These two "covering indexes" could make your query real fast.
Additionally, with a little bit of luck, it will read the activity table first (only 20 rows), and perform a Nested Loop Join (NLJ) on notification. That is, a very limited index walk.

Looking at your code, you should use composite indexes for the filter:
table notification, columns: user_id, read, activity_id
table activity, columns: space_id, type, id
To also cover the select list and the ORDER BY, you could add end_time to the composite index on activity:
table activity, columns: space_id, type, id, end_time
You should also use explicit INNER JOIN syntax:
select activity.id from activity
INNER JOIN notification ON notification.user_id = '9a51f675-e1e2-46e5-8bcd-6bc535c7e7cb'
and notification.read = true
and notification.activity_id = activity.id
and activity.space_id = '6d702c09-8795-4185-abb3-dc6b3e8907dc'
and activity.type != 'commented'
order by activity.end_time desc
limit 20;
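The composite indexes described above could be created roughly as follows. The index names are hypothetical; the column lists are the ones given in this answer, using the variant of the activity index that includes end_time.

```sql
-- Hypothetical name; columns support the filters on notification.
CREATE INDEX notification_user_read_activity_idx
    ON notification (user_id, read, activity_id);

-- Hypothetical name; end_time is appended so the query can be answered
-- from the index without visiting the activity table.
CREATE INDEX activity_space_type_id_end_time_idx
    ON activity (space_id, type, id, end_time);
```

Re-running EXPLAIN ANALYZE on the query afterwards shows whether the planner actually chooses these indexes.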

Index on nested JSONB field is not used

I created an index on a nested JSONB field:
CREATE INDEX foo_idx ON some_table(cast(content->'meta'->>'version' AS int));
but the select query still does a full table scan:
select *
from some_table
where (content->'meta'->>'version')::INT <= 9000
LIMIT 1;
I also tried to express the query like:
select *
from some_table
where cast(content->'meta'->>'version' AS INT) <= 9000
LIMIT 1;
with the same result.
Query plan:
Limit (cost=0.00..1.06 rows=10 width=52)
-> Seq Scan on some_table (cost=0.00..38429.27 rows=361441 width=52)
Filter: ((((content -> 'meta'::text) ->> 'version'::text))::integer <= 9000)
What am I missing here?
Edit: It was more of a coincidence. I added an ORDER BY random() to the query and got the following query plan:
Limit (cost=31644.83..31644.83 rows=1 width=52) (actual time=0.017..0.017 rows=0 loops=1)
-> Sort (cost=31644.83..32548.43 rows=361441 width=52) (actual time=0.016..0.016 rows=0 loops=1)
Sort Key: (random())
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on game_object_user (cost=6769.60..29837.62 rows=361441 width=52) (actual time=0.011..0.011 rows=0 loops=1)
Recheck Cond: ((((content -> 'meta'::text) ->> 'version'::text))::integer < 9000)
-> Bitmap Index Scan on foo_idx (cost=0.00..6679.23 rows=361441 width=0) (actual time=0.009..0.009 rows=0 loops=1)
Index Cond: ((((content -> 'meta'::text) ->> 'version'::text))::integer < 90000)
Planning time: 0.074 ms
Execution time: 0.040 ms
The index was used.

Why is this DISTINCT/INNER JOIN/ORDER BY postgresql query so slow?

This query takes ~4 seconds to complete:
SELECT DISTINCT "resources_resource"."id",
"resources_resource"."heading",
"resources_resource"."name",
"resources_resource"."old_name",
"resources_resource"."clean_name",
"resources_resource"."sort_name",
"resources_resource"."see_also_id",
"resources_resource"."referenced_passages",
"resources_resource"."resource_type",
"resources_resource"."ord",
"resources_resource"."content",
"resources_resource"."thumb",
"resources_resource"."resource_origin"
FROM "resources_resource"
INNER JOIN "resources_passageresource" ON ("resources_resource"."id" = "resources_passageresource"."resource_id")
WHERE "resources_passageresource"."start_ref" >= 66001001
ORDER BY "resources_resource"."ord" ASC, "resources_resource"."sort_name" ASC LIMIT 5
By popular request, EXPLAIN ANALYZE:
Limit (cost=1125.50..1125.68 rows=5 width=803) (actual time=4434.076..4434.557 rows=5 loops=1)
-> Unique (cost=1125.50..1136.91 rows=326 width=803) (actual time=4434.076..4434.557 rows=5 loops=1)
-> Sort (cost=1125.50..1126.32 rows=326 width=803) (actual time=4434.075..4434.075 rows=6 loops=1)
Sort Key: resources_resource.ord, resources_resource.sort_name, resources_resource.id, resources_resource.heading, resources_resource.name, resources_resource.old_name, resources_resource.clean_name, resources_resource.see_also_id, resources_resource.referenced_passages, resources_resource.resource_type, resources_resource.content, resources_resource.thumb, resources_resource.resource_origin
Sort Method: quicksort Memory: 424kB
-> Hash Join (cost=697.00..1111.89 rows=326 width=803) (actual time=3.453..41.429 rows=424 loops=1)
Hash Cond: (resources_passageresource.resource_id = resources_resource.id)
-> Bitmap Heap Scan on resources_passageresource (cost=10.78..190.19 rows=326 width=4) (actual time=0.107..0.401 rows=424 loops=1)
Recheck Cond: (start_ref >= 66001001)
-> Bitmap Index Scan on resources_passageresource_start_ref (cost=0.00..10.70 rows=326 width=0) (actual time=0.086..0.086 rows=424 loops=1)
Index Cond: (start_ref >= 66001001)
-> Hash (cost=431.32..431.32 rows=2232 width=803) (actual time=3.228..3.228 rows=2232 loops=1)
Buckets: 1024 Batches: 2 Memory Usage: 947kB
-> Seq Scan on resources_resource (cost=0.00..431.32 rows=2232 width=803) (actual time=0.002..1.621 rows=2232 loops=1)
Total runtime: 4435.460 ms
This is ORM-generated SQL. I can work in SQL, but I'm definitely not proficient, and the EXPLAIN output here is mystifying to me. What about this query is dragging me down?
UPDATE: @Ybakos identified that the ORDER BY was causing trouble. Removing the ORDER BY clause altogether helps a bit, but the query still takes 800ms. Here's the EXPLAIN ANALYZE, sans ORDER BY:
HashAggregate (cost=1122.49..1125.75 rows=326 width=803) (actual time=787.519..787.559 rows=104 loops=1)
-> Hash Join (cost=697.00..1111.89 rows=326 width=803) (actual time=3.381..7.312 rows=424 loops=1)
Hash Cond: (resources_passageresource.resource_id = resources_resource.id)
-> Bitmap Heap Scan on resources_passageresource (cost=10.78..190.19 rows=326 width=4) (actual time=0.095..0.686 rows=424 loops=1)
Recheck Cond: (start_ref >= 66001001)
-> Bitmap Index Scan on resources_passageresource_start_ref (cost=0.00..10.70 rows=326 width=0) (actual time=0.079..0.079 rows=424 loops=1)
Index Cond: (start_ref >= 66001001)
-> Hash (cost=431.32..431.32 rows=2232 width=803) (actual time=3.173..3.173 rows=2232 loops=1)
Buckets: 1024 Batches: 2 Memory Usage: 947kB
-> Seq Scan on resources_resource (cost=0.00..431.32 rows=2232 width=803) (actual time=0.002..1.568 rows=2232 loops=1)
Total runtime: 787.678 ms

It seems to me that DISTINCT is there to remove duplicates produced by the join. So my question is: why produce the duplicates in the first place? I'm not sure what constraints the query being ORM-generated imposes, but if rewriting it is an option, you could write it so that duplicates never appear. For instance, using IN:
SELECT "resources_resource"."id",
"resources_resource"."heading",
"resources_resource"."name",
"resources_resource"."old_name",
"resources_resource"."clean_name",
"resources_resource"."sort_name",
"resources_resource"."see_also_id",
"resources_resource"."referenced_passages",
"resources_resource"."resource_type",
"resources_resource"."ord",
"resources_resource"."content",
"resources_resource"."thumb",
"resources_resource"."resource_origin"
FROM "resources_resource"
WHERE "resources_resource"."id" IN (
SELECT "resources_passageresource"."resource_id"
FROM "resources_passageresource"
WHERE "resources_passageresource"."start_ref" >= 66001001
)
ORDER BY "resources_resource"."ord" ASC, "resources_resource"."sort_name" ASC LIMIT 5
or using EXISTS:
SELECT "resources_resource"."id",
"resources_resource"."heading",
"resources_resource"."name",
"resources_resource"."old_name",
"resources_resource"."clean_name",
"resources_resource"."sort_name",
"resources_resource"."see_also_id",
"resources_resource"."referenced_passages",
"resources_resource"."resource_type",
"resources_resource"."ord",
"resources_resource"."content",
"resources_resource"."thumb",
"resources_resource"."resource_origin"
FROM "resources_resource"
WHERE EXISTS (
SELECT *
FROM "resources_passageresource"
WHERE "resources_passageresource"."resource_id" = "resources_resource"."id"
AND "resources_passageresource"."start_ref" >= 66001001
)
ORDER BY "resources_resource"."ord" ASC, "resources_resource"."sort_name" ASC LIMIT 5
And, of course, if it's acceptable to rewrite the query completely, I would also remove the long table names in front of column names. Consider the following, for instance (the IN query rewritten):
SELECT "id",
"heading",
"name",
"old_name",
"clean_name",
"sort_name",
"see_also_id",
"referenced_passages",
"resource_type",
"ord",
"content",
"thumb",
"resource_origin"
FROM "resources_resource"
WHERE "resources_resource"."id" IN (
SELECT "resource_id"
FROM "resources_passageresource"
WHERE "start_ref" >= 66001001
)
ORDER BY "ord" ASC, "sort_name" ASC LIMIT 5

It's the combination of ORDER BY with LIMIT.
If you don't have an index on (ord, sort_name) then I bet this is the cause of the slow performance. Or perhaps an index on (start_ref, ord, sort_name) is necessary for this particular query. Lastly, due to that join, perhaps have the left/first table be the one upon which your ORDER BY criteria applies.
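A minimal sketch of the first suggestion, assuming such an index does not exist yet. The index name is hypothetical. In the query shown in the question, start_ref belongs to resources_passageresource while ord and sort_name belong to resources_resource, so only the two-column index is sketched here.

```sql
-- Hypothetical index name; the columns follow the query's ORDER BY clause.
CREATE INDEX resources_resource_ord_sort_name_idx
    ON resources_resource (ord, sort_name);

-- Refresh planner statistics so the new index is costed with current data.
ANALYZE resources_resource;
```

Whether the planner uses it for the DISTINCT form of the query is not guaranteed, so it is worth comparing EXPLAIN ANALYZE output before and after.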

That seems like a long time in the JOIN. The default memory settings in postgresql.conf are too low for any modern computer. Have you remembered to bump them up?
