Slow OR statement in postgresql - sql

I currently have a PostgreSQL query that is slow because of an OR condition; apparently the planner can't use an index because of it. My attempts to rewrite the query have failed so far.
The query:
EXPLAIN ANALYZE SELECT a0_.id AS id0
FROM advert a0_
INNER JOIN advertcategory a1_
ON a0_.advert_category_id = a1_.id
WHERE a0_.advert_category_id IN ( 1136 )
OR a1_.parent_id IN ( 1136 )
ORDER BY a0_.created_date DESC
LIMIT 15;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..27542.49 rows=15 width=12) (actual time=1.658..50.809 rows=15 loops=1)
-> Nested Loop (cost=0.00..1691109.07 rows=921 width=12) (actual time=1.657..50.790 rows=15 loops=1)
-> Index Scan Backward using advert_created_date_idx on advert a0_ (cost=0.00..670300.17 rows=353804 width=16) (actual time=0.013..16.449 rows=12405 loops=1)
-> Index Scan using advertcategory_pkey on advertcategory a1_ (cost=0.00..2.88 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=12405)
Index Cond: (id = a0_.advert_category_id)
Filter: ((a0_.advert_category_id = 1136) OR (parent_id = 1136))
Rows Removed by Filter: 1
Total runtime: 50.860 ms
Cause of slowness: Filter: ((a0_.advert_category_id = 1136) OR (parent_id = 1136))
I've tried moving the condition into the INNER JOIN instead of the WHERE clause:
EXPLAIN ANALYZE SELECT a0_.id AS id0
FROM advert a0_
INNER JOIN advertcategory a1_
ON a0_.advert_category_id = a1_.id
AND ( a0_.advert_category_id IN ( 1136 )
OR a1_.parent_id IN ( 1136 ) )
ORDER BY a0_.created_date DESC
LIMIT 15;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..27542.49 rows=15 width=12) (actual time=4.667..139.955 rows=15 loops=1)
-> Nested Loop (cost=0.00..1691109.07 rows=921 width=12) (actual time=4.666..139.932 rows=15 loops=1)
-> Index Scan Backward using advert_created_date_idx on advert a0_ (cost=0.00..670300.17 rows=353804 width=16) (actual time=0.019..100.765 rows=12405 loops=1)
-> Index Scan using advertcategory_pkey on advertcategory a1_ (cost=0.00..2.88 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=12405)
Index Cond: (id = a0_.advert_category_id)
Filter: ((a0_.advert_category_id = 1136) OR (parent_id = 1136))
Rows Removed by Filter: 1
Total runtime: 140.048 ms
The query speeds up when I remove one of the OR criteria, so I rewrote it as a UNION to compare. It's very fast! But I don't consider this a real solution:
EXPLAIN ANALYZE
(SELECT a0_.id AS id0
FROM advert a0_
INNER JOIN advertcategory a1_
ON a0_.advert_category_id = a1_.id
WHERE a0_.advert_category_id IN ( 1136 )
ORDER BY a0_.created_date DESC
LIMIT 15)
UNION
(SELECT a0_.id AS id0
FROM advert a0_
INNER JOIN advertcategory a1_
ON a0_.advert_category_id = a1_.id
WHERE a1_.parent_id IN ( 1136 )
ORDER BY a0_.created_date DESC
LIMIT 15);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=4125.70..4126.00 rows=30 width=12) (actual time=7.945..7.951 rows=15 loops=1)
-> Append (cost=1120.82..4125.63 rows=30 width=12) (actual time=6.811..7.929 rows=15 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=1120.82..1121.01 rows=15 width=12) (actual time=6.810..6.840 rows=15 loops=1)
-> Limit (cost=1120.82..1120.86 rows=15 width=12) (actual time=6.809..6.825 rows=15 loops=1)
-> Sort (cost=1120.82..1121.56 rows=295 width=12) (actual time=6.807..6.813 rows=15 loops=1)
Sort Key: a0_.created_date
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=10.59..1113.59 rows=295 width=12) (actual time=1.151..6.639 rows=220 loops=1)
-> Index Only Scan using advertcategory_pkey on advertcategory a1_ (cost=0.00..8.27 rows=1 width=4) (actual time=1.030..1.033 rows=1 loops=1)
Index Cond: (id = 1136)
Heap Fetches: 1
-> Bitmap Heap Scan on advert a0_ (cost=10.59..1102.37 rows=295 width=16) (actual time=0.099..5.287 rows=220 loops=1)
Recheck Cond: (advert_category_id = 1136)
-> Bitmap Index Scan on idx_54f1f40bd4436821 (cost=0.00..10.51 rows=295 width=0) (actual time=0.073..0.073 rows=220 loops=1)
Index Cond: (advert_category_id = 1136)
-> Subquery Scan on "*SELECT* 2" (cost=3004.43..3004.62 rows=15 width=12) (actual time=1.072..1.072 rows=0 loops=1)
-> Limit (cost=3004.43..3004.47 rows=15 width=12) (actual time=1.071..1.071 rows=0 loops=1)
-> Sort (cost=3004.43..3005.99 rows=626 width=12) (actual time=1.069..1.069 rows=0 loops=1)
Sort Key: a0_.created_date
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=22.91..2989.07 rows=626 width=12) (actual time=1.056..1.056 rows=0 loops=1)
-> Index Scan using idx_d84ab8ea727aca70 on advertcategory a1_ (cost=0.00..8.27 rows=1 width=4) (actual time=1.054..1.054 rows=0 loops=1)
Index Cond: (parent_id = 1136)
-> Bitmap Heap Scan on advert a0_ (cost=22.91..2972.27 rows=853 width=16) (never executed)
Recheck Cond: (advert_category_id = a1_.id)
-> Bitmap Index Scan on idx_54f1f40bd4436821 (cost=0.00..22.70 rows=853 width=0) (never executed)
Index Cond: (advert_category_id = a1_.id)
Total runtime: 8.940 ms
(28 rows)
Tried reversing the IN statement:
EXPLAIN ANALYZE SELECT a0_.id AS id0
FROM advert a0_
INNER JOIN advertcategory a1_
ON a0_.advert_category_id = a1_.id
WHERE 1136 IN ( a0_.advert_category_id, a1_.parent_id )
ORDER BY a0_.created_date DESC
LIMIT 15;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..27542.49 rows=15 width=12) (actual time=1.848..62.461 rows=15 loops=1)
-> Nested Loop (cost=0.00..1691109.07 rows=921 width=12) (actual time=1.847..62.441 rows=15 loops=1)
-> Index Scan Backward using advert_created_date_idx on advert a0_ (cost=0.00..670300.17 rows=353804 width=16) (actual time=0.028..27.316 rows=12405 loops=1)
-> Index Scan using advertcategory_pkey on advertcategory a1_ (cost=0.00..2.88 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=12405)
Index Cond: (id = a0_.advert_category_id)
Filter: ((1136 = a0_.advert_category_id) OR (1136 = parent_id))
Rows Removed by Filter: 1
Total runtime: 62.506 ms
(8 rows)
Tried using EXISTS:
EXPLAIN ANALYZE SELECT a0_.id AS id0
FROM advert a0_
INNER JOIN advertcategory a1_
ON a0_.advert_category_id = a1_.id
WHERE EXISTS(SELECT test.id
FROM advert test
INNER JOIN advertcategory test_cat
ON test_cat.id = test.advert_category_id
WHERE test.id = a0_.id
AND ( test.advert_category_id IN ( 1136 )
OR test_cat.parent_id IN ( 1136 ) ))
ORDER BY a0_.created_date DESC
LIMIT 15;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=45538.18..45538.22 rows=15 width=12) (actual time=524.654..524.673 rows=15 loops=1)
-> Sort (cost=45538.18..45540.48 rows=921 width=12) (actual time=524.651..524.658 rows=15 loops=1)
Sort Key: a0_.created_date
Sort Method: top-N heapsort Memory: 25kB
-> Hash Join (cost=39803.59..45515.58 rows=921 width=12) (actual time=497.362..524.436 rows=220 loops=1)
Hash Cond: (a0_.advert_category_id = a1_.id)
-> Nested Loop (cost=39786.88..45486.21 rows=921 width=16) (actual time=496.748..523.501 rows=220 loops=1)
-> HashAggregate (cost=39786.88..39796.09 rows=921 width=4) (actual time=496.705..496.872 rows=220 loops=1)
-> Hash Join (cost=16.71..39784.58 rows=921 width=4) (actual time=1.210..496.294 rows=220 loops=1)
Hash Cond: (test.advert_category_id = test_cat.id)
Join Filter: ((test.advert_category_id = 1136) OR (test_cat.parent_id = 1136))
Rows Removed by Join Filter: 353584
-> Seq Scan on advert test (cost=0.00..33134.04 rows=353804 width=8) (actual time=0.002..177.953 rows=353804 loops=1)
-> Hash (cost=9.65..9.65 rows=565 width=8) (actual time=0.622..0.622 rows=565 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 22kB
-> Seq Scan on advertcategory test_cat (cost=0.00..9.65 rows=565 width=8) (actual time=0.005..0.327 rows=565 loops=1)
-> Index Scan using advert_pkey on advert a0_ (cost=0.00..6.17 rows=1 width=16) (actual time=0.117..0.118 rows=1 loops=220)
Index Cond: (id = test.id)
-> Hash (cost=9.65..9.65 rows=565 width=4) (actual time=0.604..0.604 rows=565 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 20kB
-> Seq Scan on advertcategory a1_ (cost=0.00..9.65 rows=565 width=4) (actual time=0.010..0.285 rows=565 loops=1)
Total runtime: 524.797 ms
Advert table (stripped down):
353804 rows
Table "public.advert"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------------+--------------------------------+-----------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('advert_id_seq'::regclass) | plain | |
advert_category_id | integer | not null | plain | |
Indexes:
"idx_54f1f40bd4436821" btree (advert_category_id)
"advert_created_date_idx" btree (created_date)
Foreign-key constraints:
"fk_54f1f40bd4436821" FOREIGN KEY (advert_category_id) REFERENCES advertcategory(id) ON DELETE RESTRICT
Has OIDs: no
Category table (stripped down):
565 rows
Table "public.advertcategory"
Column | Type | Modifiers
-----------+---------+-------------------------------------------------------------
id | integer | not null default nextval('advertcategory_id_seq'::regclass)
parent_id | integer |
active | boolean | not null
system | boolean | not null
Indexes:
"advertcategory_pkey" PRIMARY KEY, btree (id)
"idx_d84ab8ea727aca70" btree (parent_id)
Foreign-key constraints:
"fk_d84ab8ea727aca70" FOREIGN KEY (parent_id) REFERENCES advertcategory(id) ON DELETE RESTRICT
Short server config:
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3), 64-bit
name | current_setting | source
----------------------------+--------------------+----------------------
shared_buffers | 1800MB | configuration file
work_mem | 4MB | configuration file
As you can see, none of the proper solutions gives a speed improvement. Only the UNION workaround that splits the OR condition improves performance, but I can't use it because this query is generated by my ORM framework along with a lot of other filter options. Besides, if I can do this by hand, why doesn't the optimizer do it? It seems like a really simple optimization.
Any hints on this? A solution to this small problem would be greatly appreciated!

Entirely new approach. Your WHERE condition references two tables, but that seems unnecessary.
The first change would be:
where a1_.id = 1136 or a1_.parent_id = 1136
I think the structure you want is a scan on the category table and then fetches from the advert table. To help, you can create an index on advert(advert_category_id, created_date).
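That suggested index could be created like this (the index name is illustrative):

```sql
-- Lets the planner find the rows for one category already ordered by date,
-- so an ORDER BY ... LIMIT can stop early instead of scanning everything.
CREATE INDEX advert_category_created_idx
    ON advert (advert_category_id, created_date);
```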
I would be tempted to write the query by moving the WHERE clause into a subquery, although I don't know whether this would affect performance:
SELECT a0_.id AS id0
FROM advert a0_ INNER JOIN
(select ac.*
from advertcategory ac
where ac.id = 1136 or ac.parent_id = 1136
) ac
ON a0_.advert_category_id = ac.id
ORDER BY a0_.created_date DESC
LIMIT 15;

"Plus if I can do this, why doesn't the optimizer do it?" -- Because there are all sorts of cases where it's not necessarily valid (due to aggregates in subqueries) or interesting (due to a better index) to do so.
The best query plan you'll possibly get is the one in Gordon's answer, using UNION ALL instead of UNION to avoid the deduplication step (I take it that a category is never its own parent, which eliminates any possibility of duplicates).
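A sketch of that variant, assuming a category is never its own parent (so the two branches cannot return the same advert); note that created_date has to be in the select list so the outer ORDER BY can pick the global top 15 out of the up-to-30 combined rows:

```sql
-- Same two branches as the UNION in the question, but UNION ALL skips
-- the dedup; the outer ORDER BY / LIMIT selects the global top 15.
(SELECT a0_.id AS id0, a0_.created_date
 FROM advert a0_
 INNER JOIN advertcategory a1_ ON a0_.advert_category_id = a1_.id
 WHERE a0_.advert_category_id IN (1136)
 ORDER BY a0_.created_date DESC
 LIMIT 15)
UNION ALL
(SELECT a0_.id AS id0, a0_.created_date
 FROM advert a0_
 INNER JOIN advertcategory a1_ ON a0_.advert_category_id = a1_.id
 WHERE a1_.parent_id IN (1136)
 ORDER BY a0_.created_date DESC
 LIMIT 15)
ORDER BY created_date DESC
LIMIT 15;
```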
Otherwise, note that your query could be rewritten like so:
SELECT a0_.id AS id0
FROM advert a0_
INNER JOIN advertcategory a1_
ON a0_.advert_category_id = a1_.id
WHERE a1_.id IN ( 1136 )
OR a1_.parent_id IN ( 1136 )
ORDER BY a0_.created_date DESC
LIMIT 15;
In other words, you're filtering based on criteria from one table and sorting/limiting based on another. The way you wrote it precludes the use of a good index, because the planner doesn't recognize that the filter criteria all come from the same table, so it nest-loops over created_date with a filter, like you're currently seeing. It's not a bad plan, mind you... It's actually the right thing to do if e.g. 1136 isn't a very selective criterion.
By making it explicit that the second table is the one of interest, you may end up with a bitmap heap scan when the category is selective enough, provided you have indexes on advertcategory (id) (which you already have, since it's the primary key) and on advertcategory (parent_id) (which your schema shows you already have). Don't count too much on it, though -- PG doesn't collect correlated-column statistics insofar as I'm aware.
Another possibility might be to maintain an array with the aggregate categories (using triggers) in advert directly, and to use a GIST index on it:
SELECT a0_.id AS id0
FROM advert a0_
WHERE ARRAY[1136, 1137] && a0_.category_ids -- 1136 OR 1137; use <# for AND
ORDER BY a0_.created_date DESC
LIMIT 15;
It's technically redundant, but it works well to optimize this sort of query (i.e. filters over a nested category tree that yield complicated join criteria)... When PG decides to use it, you'll end up top-N sorting the applicable ads. (In older PG versions, the selectivity of && was arbitrary for lack of stats; I vaguely remember reading a changelog whereby 9.1, 9.2 or 9.3 improved things, presumably by using code similar to that used by the tsvector statistics collector for generic array types. At any rate, be sure to use the latest PG version, and be sure not to rewrite that query using operators that the gin/gist index won't be able to use.)
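A minimal sketch of that denormalization, assuming a new integer-array column category_ids on advert that holds the advert's category and its parent (column, function, trigger, and index names are all illustrative; a full solution would also need triggers on advertcategory to refresh adverts when a parent_id changes):

```sql
-- Redundant array column: the advert's category plus its parent.
ALTER TABLE advert ADD COLUMN category_ids integer[];

-- Keep the array in sync when the advert's category is set or changed.
CREATE OR REPLACE FUNCTION advert_sync_category_ids() RETURNS trigger AS $$
BEGIN
    SELECT CASE WHEN c.parent_id IS NULL
                THEN ARRAY[c.id]
                ELSE ARRAY[c.id, c.parent_id]
           END
      INTO NEW.category_ids
      FROM advertcategory c
     WHERE c.id = NEW.advert_category_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER advert_category_ids_trg
    BEFORE INSERT OR UPDATE OF advert_category_id ON advert
    FOR EACH ROW EXECUTE PROCEDURE advert_sync_category_ids();

-- GIN supports && on integer arrays out of the box;
-- GIST on plain integer[] needs the intarray extension.
CREATE INDEX advert_category_ids_idx ON advert USING gin (category_ids);
```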

Related

When ORDER BY is performed on values aggregated by COUNT, it takes time to issue the query

The query retrieves videos ordered by the number of tags they share with a specific video.
The following query takes about 800 ms, but the index appears to be used.
If I remove COUNT, GROUP BY, and ORDER BY from the SQL query, it runs super fast (1-5 ms).
In a case like this, will improving the SQL query alone not speed things up? Do I need to use a MATERIALIZED VIEW?
SELECT "videos_video"."id",
"videos_video"."title",
"videos_video"."thumbnail_url",
"videos_video"."preview_url",
"videos_video"."embed_url",
"videos_video"."duration",
"videos_video"."views",
"videos_video"."is_public",
"videos_video"."published_at",
"videos_video"."created_at",
"videos_video"."updated_at",
COUNT("videos_video"."id") AS "n"
FROM "videos_video"
INNER JOIN "videos_video_tags" ON ("videos_video"."id" = "videos_video_tags"."video_id")
WHERE ("videos_video_tags"."tag_id" IN
(SELECT U0."id"
FROM "videos_tag" U0
INNER JOIN "videos_video_tags" U1 ON (U0."id" = U1."tag_id")
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'))
GROUP BY "videos_video"."id"
ORDER BY "n" DESC
LIMIT 20;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1040.69..1040.74 rows=20 width=24) (actual time=738.648..738.654 rows=20 loops=1)
-> Sort (cost=1040.69..1044.29 rows=1441 width=24) (actual time=738.646..738.650 rows=20 loops=1)
Sort Key: (count(videos_video.id)) DESC
Sort Method: top-N heapsort Memory: 27kB
-> HashAggregate (cost=987.93..1002.34 rows=1441 width=24) (actual time=671.006..714.322 rows=188818 loops=1)
Group Key: videos_video.id
Batches: 1 Memory Usage: 28689kB
-> Nested Loop (cost=35.20..980.73 rows=1441 width=16) (actual time=0.341..559.034 rows=240293 loops=1)
-> Nested Loop (cost=34.78..340.88 rows=1441 width=16) (actual time=0.278..92.806 rows=240293 loops=1)
-> HashAggregate (cost=34.35..34.41 rows=6 width=32) (actual time=0.188..0.200 rows=4 loops=1)
Group Key: u0.id
Batches: 1 Memory Usage: 24kB
-> Nested Loop (cost=0.71..34.33 rows=6 width=32) (actual time=0.161..0.185 rows=4 loops=1)
-> Index Only Scan using videos_video_tags_video_id_tag_id_f8d6ba70_uniq on videos_video_tags u1 (cost=0.43..4.53 rows=6 width=16) (actual time=0.039..0.040 rows=4 loops=1)
Index Cond: (video_id = '748b1814-f311-48da-a1f5-6bf8fe229c7f'::uuid)
Heap Fetches: 0
-> Index Only Scan using videos_tag_pkey on videos_tag u0 (cost=0.28..4.97 rows=1 width=16) (actual time=0.035..0.035 rows=1 loops=4)
Index Cond: (id = u1.tag_id)
Heap Fetches: 0
-> Index Scan using videos_video_tags_tag_id_2673cfc8 on videos_video_tags (cost=0.43..35.90 rows=1518 width=32) (actual time=0.029..16.728 rows=60073 loops=4)
Index Cond: (tag_id = u0.id)
-> Index Only Scan using videos_video_pkey on videos_video (cost=0.42..0.44 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=240293)
Index Cond: (id = videos_video_tags.video_id)
Heap Fetches: 46
Planning Time: 1.980 ms
Execution Time: 739.446 ms
(26 rows)
Time: 742.145 ms
---------- Execution plan for the query suggested in Edouard's answer ----------
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=30043.90..30212.53 rows=20 width=746) (actual time=239.142..239.219 rows=20 loops=1)
-> Limit (cost=30043.48..30043.53 rows=20 width=24) (actual time=239.089..239.093 rows=20 loops=1)
-> Sort (cost=30043.48..30607.15 rows=225467 width=24) (actual time=239.087..239.090 rows=20 loops=1)
Sort Key: (count(*)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=21789.21..24043.88 rows=225467 width=24) (actual time=185.710..219.211 rows=188818 loops=1)
Group Key: vt.video_id
Batches: 1 Memory Usage: 22545kB
-> Nested Loop (cost=20.62..20187.24 rows=320395 width=16) (actual time=4.975..106.839 rows=240293 loops=1)
-> Index Only Scan using videos_video_tags_video_id_tag_id_f8d6ba70_uniq on videos_video_tags vvt (cost=0.43..4.53 rows=6 width=16) (actual time=0.033..0.043 rows=4 loops=1)
Index Cond: (video_id = '748b1814-f311-48da-a1f5-6bf8fe229c7f'::uuid)
Heap Fetches: 0
-> Bitmap Heap Scan on videos_video_tags vt (cost=20.19..3348.60 rows=1518 width=32) (actual time=4.311..20.663 rows=60073 loops=4)
Recheck Cond: (tag_id = vvt.tag_id)
Heap Blocks: exact=34757
-> Bitmap Index Scan on videos_video_tags_tag_id_2673cfc8 (cost=0.00..19.81 rows=1518 width=0) (actual time=3.017..3.017 rows=60073 loops=4)
Index Cond: (tag_id = vvt.tag_id)
-> Index Scan using videos_video_pkey on videos_video v (cost=0.42..8.44 rows=1 width=738) (actual time=0.005..0.005 rows=1 loops=20)
Index Cond: (id = vt.video_id)
Planning Time: 0.854 ms
Execution Time: 241.392 ms
(21 rows)
Time: 242.909 ms
Here below are some ideas to simplify the query. An EXPLAIN ANALYZE will then confirm the potential impact on query performance.
Starting from the subquery:
SELECT U0."id"
FROM "videos_tag" U0
INNER JOIN "videos_video_tags" U1 ON (U0."id" = U1."tag_id")
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
According to the JOIN clause, U0."id" = U1."tag_id", so SELECT U0."id" can be replaced by SELECT U1."tag_id".
In this case the table "videos_tag" U0 is no longer used in the subquery, which can be simplified to:
SELECT U1."tag_id"
FROM "videos_video_tags" U1
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
And the WHERE clause of the main query becomes:
WHERE "videos_video_tags"."tag_id" IN
( SELECT U1."tag_id"
FROM "videos_video_tags" U1
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
)
which can be transformed into a self join on the table "videos_video_tags", to be added to the FROM clause of the main query:
FROM "videos_video" AS v
INNER JOIN "videos_video_tags" AS vt
ON v."id" = vt."video_id"
INNER JOIN "videos_video_tags" AS vvt
ON vvt."tag_id" = vt."tag_id"
WHERE vvt."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
Finally, the GROUP BY "videos_video"."id" clause can be replaced by GROUP BY "videos_video_tags"."video_id" according to the JOIN clause between the two tables. This new GROUP BY, together with the ORDER BY and LIMIT clauses, can then be applied in a subquery involving only the table "videos_video_tags", before joining with the table "videos_video":
SELECT v."id",
v."title",
v."thumbnail_url",
v."preview_url",
v."embed_url",
v."duration",
v."views",
v."is_public",
v."published_at",
v."created_at",
v."updated_at",
w."n"
FROM "videos_video" AS v
INNER JOIN
( SELECT vt."video_id"
, count(*) AS "n"
FROM "videos_video_tags" AS vt
INNER JOIN "videos_video_tags" AS vvt
ON vvt."tag_id" = vt."tag_id"
WHERE vvt."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
GROUP BY vt."video_id"
ORDER BY "n" DESC
LIMIT 20
) AS w
ON v."id" = w."video_id"

Poor performing postgres sql

Here's my SQL, followed by the EXPLAIN ANALYZE output. I need to improve the performance. Any ideas?
PostgreSQL 9.3.12 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4, 64-bit
explain analyze
SELECT DISTINCT "apts"."id", practices.name AS alias_0
FROM "apts"
LEFT OUTER JOIN "patients" ON "patients"."id" = "apts"."patient_id"
LEFT OUTER JOIN "practices" ON "practices"."id" = "apts"."practice_id"
LEFT OUTER JOIN "eligibility_messages" ON "eligibility_messages"."apt_id" = "apts"."id"
WHERE (apts.eligibility_status_id != 1)
AND (eligibility_messages.current = 't')
AND (practices.id = '104')
ORDER BY practices.name desc
LIMIT 25 OFFSET 0
Limit (cost=881321.34..881321.41 rows=25 width=20) (actual time=2928.225..2928.227 rows=25 loops=1)
-> Sort (cost=881321.34..881391.94 rows=28240 width=20) (actual time=2928.223..2928.224 rows=25 loops=1)
Sort Key: practices.name
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=880242.03..880524.43 rows=28240 width=20) (actual time=2927.213..2927.319 rows=520 loops=1)
-> Nested Loop (cost=286614.55..880100.83 rows=28240 width=20) (actual time=206.180..2926.791 rows=520 loops=1)
-> Seq Scan on practices (cost=0.00..6.36 rows=1 width=20) (actual time=0.018..0.031 rows=1 loops=1)
Filter: (id = 104)
Rows Removed by Filter: 108
-> Hash Join (cost=286614.55..879812.07 rows=28240 width=8) (actual time=206.159..2926.643 rows=520 loops=1)
Hash Cond: (eligibility_messages.apt_id = apts.id)
-> Seq Scan on eligibility_messages (cost=0.00..561275.63 rows=2029532 width=4) (actual time=0.691..2766.867 rows=67559 loops=1)
Filter: current
Rows Removed by Filter: 3924633
-> Hash (cost=284614.02..284614.02 rows=115082 width=12) (actual time=121.957..121.957 rows=91660 loops=1)
Buckets: 16384 Batches: 2 Memory Usage: 1974kB
-> Bitmap Heap Scan on apts (cost=8296.88..284614.02 rows=115082 width=12) (actual time=19.927..91.038 rows=91660 loops=1)
Recheck Cond: (practice_id = 104)
Filter: (eligibility_status_id <> 1)
Rows Removed by Filter: 80169
-> Bitmap Index Scan on index_apts_on_practice_id (cost=0.00..8268.11 rows=177540 width=0) (actual time=16.856..16.856 rows=179506 loops=1)
Index Cond: (practice_id = 104)
Total runtime: 2928.361 ms
First, rewrite the query to a more manageable form:
SELECT DISTINCT a."id", pr.name AS alias_0
FROM "apts" a JOIN
"practices" pr
ON pr."id" = a."practice_id" JOIN
"eligibility_messages" em
ON em."apt_id" = a."id"
WHERE (a.eligibility_status_id <> 1) AND
(em.current = 't') AND
(a.practice_id = 104)
ORDER BY pr.name desc ;
Notes:
The WHERE clause turns the outer joins into inner joins anyway, so you might as well express them correctly.
I doubt pr.id is actually a string
The patients table isn't used, so I just removed it.
Perhaps you don't even need the select distinct any more.
Switched the condition in the where to apts rather than practices.
If this isn't fast enough, you want indexes, probably on apts(practice_id, eligibility_status_id, id), practices(id), and eligibility_messages(apt_id, current).
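Sketched out, those index suggestions would look like this (index names are illustrative):

```sql
-- Covers the WHERE conditions on apts and feeds the join on id.
CREATE INDEX apts_practice_status_id_idx
    ON apts (practice_id, eligibility_status_id, id);
-- practices(id) is presumably covered by its primary key already.
CREATE INDEX eligibility_messages_apt_current_idx
    ON eligibility_messages (apt_id, current);
```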

Delete orphaned records in postgres. Delete using join. Performance

I have a case where I need to clean up a table of orphans regularly, so I'm looking for a high-performance solution. I tried using an IN clause, but it's not really fast. The columns have all the required indexes in both tables (id - primary key, component_id - index, component_type - index).
DELETE FROM component_apportionment
WHERE id in (
SELECT a.id
FROM component_apportionment a
LEFT JOIN component_live c
ON (c.component_id = a.component_id
AND
c.component_type = a.component_type)
WHERE c.id is null);
Basically the case is to remove records from 'component_apportionment' table which do not exist in 'component_live' table.
The query plan for the query above is horrible as well:
Delete on component_apportionment_copy1 (cost=3860927.55..3860929.09 rows=1 width=18) (actual time=183479.848..183479.848 rows=0 loops=1)
-> Nested Loop (cost=3860927.55..3860929.09 rows=1 width=18) (actual time=183479.811..183479.813 rows=1 loops=1)
-> HashAggregate (cost=3860927.12..3860927.13 rows=1 width=20) (actual time=183479.793..183479.793 rows=1 loops=1)
Group Key: a.id
-> Merge Right Join (cost=3753552.72..3860927.12 rows=1 width=20) (actual time=172941.125..183479.787 rows=1 loops=1)
Merge Cond: ((c.component_id = a.component_id) AND ((c.component_type)::text = (a.component_type)::text))
Filter: (c.id IS NULL)
Rows Removed by Filter: 5968195
-> Sort (cost=3390767.32..3413658.29 rows=9156391 width=21) (actual time=169852.438..172642.897 rows=8043013 loops=1)
Sort Key: c.component_id, c.component_type
Sort Method: external merge Disk: 310232kB
-> Seq Scan on component_live c (cost=0.00..2117393.91 rows=9156391 width=21) (actual time=0.004..155656.568 rows=9333382 loops=1)
-> Materialize (cost=362785.40..375049.75 rows=2452871 width=21) (actual time=3088.653..5343.013 rows=5968195 loops=1)
-> Sort (cost=362785.40..368917.58 rows=2452871 width=21) (actual time=3088.648..3989.163 rows=2452871 loops=1)
Sort Key: a.component_id, a.component_type
Sort Method: external merge Disk: 81504kB
-> Seq Scan on component_apportionment_copy1 a (cost=0.00..44969.71 rows=2452871 width=21) (actual time=0.920..882.040 rows=2452871 loops=1)
-> Index Scan using component_apportionment_copy1_pkey on component_apportionment_copy1 (cost=0.43..1.95 rows=1 width=14) (actual time=0.012..0.012 rows=1 loops=1)
Index Cond: (id = a.id)
Planning time: 5.573 ms
Execution time: 183554.675 ms
Would appreciate any help.
Thanks
Note
Tables have approx. 80 million records each in the worst case. Both tables have indexes on the columns used.
UPDATE
Query plan for 'not exists'
Query:
EXPLAIN (analyze, verbose, buffers) DELETE FROM component_apportionment_copy1
WHERE not exists (select 1
from component_live c
where c.component_id = component_apportionment_copy1.component_id);
Delete on vector.component_apportionment_copy1 (cost=2276557.80..2446287.39 rows=2104532 width=12) (actual time=203643.560..203643.560 rows=0 loops=1)
Buffers: shared hit=20875 read=2025400, temp read=46067 written=45813
-> Hash Anti Join (cost=2276557.80..2446287.39 rows=2104532 width=12) (actual time=202212.975..203643.486 rows=1 loops=1)
Output: component_apportionment_copy1.ctid, c.ctid
Hash Cond: (component_apportionment_copy1.component_id = c.component_id)
Buffers: shared hit=20874 read=2025400, temp read=46067 written=45813
-> Seq Scan on vector.component_apportionment_copy1 (cost=0.00..44969.71 rows=2452871 width=10) (actual time=0.003..659.668 rows=2452871 loops=1)
Output: component_apportionment_copy1.ctid, component_apportionment_copy1.component_id
Buffers: shared hit=20441
-> Hash (cost=2117393.91..2117393.91 rows=9156391 width=10) (actual time=198536.786..198536.786 rows=9333382 loops=1)
Output: c.ctid, c.component_id
Buckets: 16384 Batches: 128 Memory Usage: 3195kB
Buffers: shared hit=430 read=2025400, temp written=36115
-> Seq Scan on vector.component_live c (cost=0.00..2117393.91 rows=9156391 width=10) (actual time=0.039..194415.641 rows=9333382 loops=1)
Output: c.ctid, c.component_id
Buffers: shared hit=430 read=2025400
Planning time: 6.639 ms
Execution time: 203643.594 ms
It's doing a seq scan on both tables, and the more data there is, the slower it will be. You have way too many joins in there:
set enable_seqscan = false; -- forcing to use indexes
DELETE FROM component_apportionment
WHERE not exists (select 1
from component_live c
where c.component_id = component_apportionment.component_id);
This will do the same thing and should be much faster, especially if you have indexes on the component_id columns.
The exists way:
delete
from component_apportionment ca
where not exists
(select 1
from component_live cl
where cl.component_id = ca.component_id
);
Or the NOT IN way (beware: NOT IN behaves unexpectedly if component_live.component_id can ever be NULL):
delete
from component_apportionment
where component_id not in
(select component_id
from component_live
);
Also, create indexes on the component_id columns of both tables.
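For example (index names are illustrative; skip whichever index already exists, e.g. when component_id is the primary key of component_live):

```sql
CREATE INDEX component_live_component_id_idx
    ON component_live (component_id);
CREATE INDEX component_apportionment_component_id_idx
    ON component_apportionment (component_id);
```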
UPDATE
I made a script for testing:
-- table creating and populating (1,000,000 records each)
drop table if exists component_apportionment;
drop table if exists component_live;
create table component_live (component_id numeric primary key);
create table component_apportionment (id serial primary key, component_id numeric);
create index component_apportionment_idx on component_apportionment (component_id);
insert into component_live select g from generate_series(1,1000000) g;
insert into component_apportionment (component_id) select trunc(random()*1000000) from generate_series(1,1000000) g;
analyze verbose component_live;
analyze verbose component_apportionment;
EXPLAIN (analyze, verbose, buffers)
select component_id
from component_apportionment ca
where not exists
(select 1
from component_live cl
where cl.component_id = ca.component_id
);
Merge Anti Join (cost=0.85..61185.85 rows=1 width=6) (actual time=0.013..1060.014 rows=2 loops=1)
Output: ca.component_id
Merge Cond: (ca.component_id = cl.component_id)
Buffers: shared hit=1010548
-> Index Only Scan using component_apportionment_idx on admin.component_apportionment ca (cost=0.42..24015.42 rows=1000000 width=6) (actual time=0.006..460.318 rows=1000000 loops=1)
Output: ca.component_id
Heap Fetches: 1000000
Buffers: shared hit=1003388
-> Index Only Scan using component_live_pkey on admin.component_live cl (cost=0.42..22170.42 rows=1000000 width=6) (actual time=0.005..172.502 rows=999998 loops=1)
Output: cl.component_id
Heap Fetches: 999998
Buffers: shared hit=7160
Total runtime: 1060.035 ms

Optimise the PG query

The query is used very often in the app and is too expensive.
What are the things I can do to optimise it and bring the total time down to milliseconds (rather than hundreds of ms)?
NOTES:
removing DISTINCT improves things (down to ~460ms), but I need it to get rid of the cartesian product :( (yeah, show me a better way of avoiding it)
removing ORDER BY name improves things, but not significantly.
The query:
SELECT DISTINCT properties.*
FROM properties JOIN developments ON developments.id = properties.development_id
-- Development allocations
LEFT JOIN allocation_items AS dev_items ON dev_items.development_id = properties.development_id
LEFT JOIN allocations AS dev_allocs ON dev_items.allocation_id = dev_allocs.id
-- Group allocations
LEFT JOIN properties_property_groupings ppg ON ppg.property_id = properties.id
LEFT JOIN property_groupings pg ON pg.id = ppg.property_grouping_id
LEFT JOIN allocation_items prop_items ON prop_items.property_grouping_id = pg.id
LEFT JOIN allocations prop_allocs ON prop_allocs.id = prop_items.allocation_id
WHERE
(properties.status <> 'deleted') AND ((
properties.status <> 'inactive'
AND (
(dev_allocs.receiving_company_id = 175 OR prop_allocs.receiving_company_id = 175)
AND developments.status = 'active'
)
OR developments.company_id = 175
)
AND EXISTS (
SELECT 1 FROM development_participations dp
JOIN participations p ON p.id = dp.participation_id
WHERE dp.allowed
AND p.user_id = 387 AND p.company_id = 175
AND dp.development_id = properties.development_id
LIMIT 1
)
)
ORDER BY properties.name
EXPLAIN ANALYZE
Unique (cost=72336.86..72517.53 rows=1606 width=4336) (actual time=703.766..710.920 rows=219 loops=1)
-> Sort (cost=72336.86..72340.87 rows=1606 width=4336) (actual time=703.765..704.698 rows=5091 loops=1)
Sort Key: properties.name, properties.id, properties.status, properties.level, etc etc (all columns)
Sort Method: external sort Disk: 1000kB
-> Nested Loop Left Join (cost=0.00..69258.84 rows=1606 width=4336) (actual time=25.230..366.489 rows=5091 loops=1)
Filter: ((((properties.status)::text <> 'inactive'::text) AND ((dev_allocs.receiving_company_id = 175) OR (prop_allocs.receiving_company_id = 175)) AND ((developments.status)::text = 'active'::text)) OR (developments.company_id = 175))
-> Nested Loop Left Join (cost=0.00..57036.99 rows=41718 width=4355) (actual time=25.122..247.587 rows=99567 loops=1)
-> Nested Loop Left Join (cost=0.00..47616.39 rows=21766 width=4355) (actual time=25.111..163.827 rows=39774 loops=1)
-> Nested Loop Left Join (cost=0.00..41508.16 rows=21766 width=4355) (actual time=25.101..112.452 rows=39774 loops=1)
-> Nested Loop Left Join (cost=0.00..34725.22 rows=21766 width=4351) (actual time=25.087..68.104 rows=19887 loops=1)
-> Nested Loop Left Join (cost=0.00..28613.00 rows=21766 width=4351) (actual time=25.076..39.360 rows=19887 loops=1)
-> Nested Loop (cost=0.00..27478.54 rows=1147 width=4347) (actual time=25.059..29.966 rows=259 loops=1)
-> Index Scan using developments_pkey on developments (cost=0.00..25.17 rows=49 width=15) (actual time=0.048..0.127 rows=48 loops=1)
Filter: (((status)::text = 'active'::text) OR (company_id = 175))
-> Index Scan using index_properties_on_development_id on properties (cost=0.00..559.95 rows=26 width=4336) (actual time=0.534..0.618 rows=5 loops=48)
Index Cond: (development_id = developments.id)
Filter: (((status)::text <> 'deleted'::text) AND (SubPlan 1))
SubPlan 1
-> Limit (cost=0.00..10.00 rows=1 width=0) (actual time=0.011..0.011 rows=0 loops=2420)
-> Nested Loop (cost=0.00..10.00 rows=1 width=0) (actual time=0.011..0.011 rows=0 loops=2420)
Join Filter: (dp.participation_id = p.id)
-> Seq Scan on development_participations dp (cost=0.00..1.71 rows=1 width=4) (actual time=0.004..0.008 rows=1 loops=2420)
Filter: (allowed AND (development_id = properties.development_id))
-> Index Scan using index_participations_on_user_id on participations p (cost=0.00..8.27 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=3148)
Index Cond: (user_id = 387)
Filter: (company_id = 175)
-> Index Scan using index_allocation_items_on_development_id on allocation_items dev_items (cost=0.00..0.70 rows=23 width=8) (actual time=0.003..0.016 rows=77 loops=259)
Index Cond: (development_id = properties.development_id)
-> Index Scan using allocations_pkey on allocations dev_allocs (cost=0.00..0.27 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=19887)
Index Cond: (dev_items.allocation_id = id)
-> Index Scan using index_properties_property_groupings_on_property_id on properties_property_groupings ppg (cost=0.00..0.29 rows=2 width=8) (actual time=0.001..0.001 rows=2 loops=19887)
Index Cond: (property_id = properties.id)
-> Index Scan using property_groupings_pkey on property_groupings pg (cost=0.00..0.27 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=39774)
Index Cond: (id = ppg.property_grouping_id)
-> Index Scan using index_allocation_items_on_property_grouping_id on allocation_items prop_items (cost=0.00..0.36 rows=6 width=8) (actual time=0.001..0.001 rows=2 loops=39774)
Index Cond: (property_grouping_id = pg.id)
-> Index Scan using allocations_pkey on allocations prop_allocs (cost=0.00..0.27 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=99567)
Index Cond: (id = prop_items.allocation_id)
Total runtime: 716.692 ms
(39 rows)
Answering my own question.
This query has 2 big issues:
6 LEFT JOINs that produce a cartesian product (resulting in billions of records even on a small dataset).
DISTINCT that has to sort that billion-record dataset.
So I had to eliminate those.
The way I did it is by replacing the JOINs with 2 subqueries (I won't provide them here since they should be pretty obvious).
As a result, the actual time went from ~700-800ms down to ~45ms which is more or less acceptable.
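For completeness, a minimal sketch of what such a rewrite could look like, with each LEFT JOIN chain folded into an EXISTS subquery (table and column names are taken from the query above; treat this as an illustration of the approach, not necessarily the exact query the author ended up with):

```sql
SELECT properties.*
FROM properties
JOIN developments ON developments.id = properties.development_id
WHERE properties.status <> 'deleted'
  AND (
    developments.company_id = 175
    OR (
      properties.status <> 'inactive'
      AND developments.status = 'active'
      AND (
        -- development allocations, formerly two LEFT JOINs
        EXISTS (
          SELECT 1
          FROM allocation_items ai
          JOIN allocations a ON a.id = ai.allocation_id
          WHERE ai.development_id = properties.development_id
            AND a.receiving_company_id = 175
        )
        -- group allocations, formerly four LEFT JOINs
        -- (allocation_items is joined directly on ppg.property_grouping_id;
        --  the intermediate property_groupings join can be dropped if that
        --  column is a valid foreign key)
        OR EXISTS (
          SELECT 1
          FROM properties_property_groupings ppg
          JOIN allocation_items ai
            ON ai.property_grouping_id = ppg.property_grouping_id
          JOIN allocations a ON a.id = ai.allocation_id
          WHERE ppg.property_id = properties.id
            AND a.receiving_company_id = 175
        )
      )
    )
  )
  AND EXISTS (
    SELECT 1
    FROM development_participations dp
    JOIN participations p ON p.id = dp.participation_id
    WHERE dp.allowed
      AND p.user_id = 387 AND p.company_id = 175
      AND dp.development_id = properties.development_id
  )
ORDER BY properties.name;
```

Because each property now appears at most once, the DISTINCT (and with it the large external sort) disappears, and ORDER BY properties.name only has to sort the final result rows.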
Most of the time is spent in the on-disk sort; let it use RAM instead by raising work_mem:
SET work_mem TO '20MB';
Then check EXPLAIN ANALYZE again.

Strange: Planner picks the plan with the lower cost, but (very) long query runtime

Facts:
PGSQL 8.4.2, Linux
I make use of table inheritance
Each table contains 3 million rows
Indexes on joining columns are set
Table statistics (analyze, vacuum analyze) are up-to-date
The only table used is "node", with various partitioned sub-tables
Recursive query (pg >= 8.4)
Now here is the explained query:
WITH RECURSIVE
rows AS
(
SELECT *
FROM (
SELECT r.id, r.set, r.parent, r.masterid
FROM d_storage.node_dataset r
WHERE masterid = 3533933
) q
UNION ALL
SELECT *
FROM (
SELECT c.id, c.set, c.parent, r.masterid
FROM rows r
JOIN a_storage.node c
ON c.parent = r.id
) q
)
SELECT r.masterid, r.id AS nodeid
FROM rows r
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on rows r (cost=2742105.92..2862119.94 rows=6000701 width=16) (actual time=0.033..172111.204 rows=4 loops=1)
CTE rows
-> Recursive Union (cost=0.00..2742105.92 rows=6000701 width=28) (actual time=0.029..172111.183 rows=4 loops=1)
-> Index Scan using node_dataset_masterid on node_dataset r (cost=0.00..8.60 rows=1 width=28) (actual time=0.025..0.027 rows=1 loops=1)
Index Cond: (masterid = 3533933)
-> Hash Join (cost=0.33..262208.33 rows=600070 width=28) (actual time=40628.371..57370.361 rows=1 loops=3)
Hash Cond: (c.parent = r.id)
-> Append (cost=0.00..211202.04 rows=12001404 width=20) (actual time=0.011..46365.669 rows=12000004 loops=3)
-> Seq Scan on node c (cost=0.00..24.00 rows=1400 width=20) (actual time=0.002..0.002 rows=0 loops=3)
-> Seq Scan on node_dataset c (cost=0.00..55001.01 rows=3000001 width=20) (actual time=0.007..3426.593 rows=3000001 loops=3)
-> Seq Scan on node_stammdaten c (cost=0.00..52059.01 rows=3000001 width=20) (actual time=0.008..9049.189 rows=3000001 loops=3)
-> Seq Scan on node_stammdaten_adresse c (cost=0.00..52059.01 rows=3000001 width=20) (actual time=3.455..8381.725 rows=3000001 loops=3)
-> Seq Scan on node_testdaten c (cost=0.00..52059.01 rows=3000001 width=20) (actual time=1.810..5259.178 rows=3000001 loops=3)
-> Hash (cost=0.20..0.20 rows=10 width=16) (actual time=0.010..0.010 rows=1 loops=3)
-> WorkTable Scan on rows r (cost=0.00..0.20 rows=10 width=16) (actual time=0.002..0.004 rows=1 loops=3)
Total runtime: 172111.371 ms
(16 rows)
(END)
So far so bad: the planner decides to use hash joins (good) but no indexes (bad).
Now after doing the following:
SET enable_hashjoin TO false;
The explained query looks like that:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on rows r (cost=15198247.00..15318261.02 rows=6000701 width=16) (actual time=0.038..49.221 rows=4 loops=1)
CTE rows
-> Recursive Union (cost=0.00..15198247.00 rows=6000701 width=28) (actual time=0.032..49.201 rows=4 loops=1)
-> Index Scan using node_dataset_masterid on node_dataset r (cost=0.00..8.60 rows=1 width=28) (actual time=0.028..0.031 rows=1 loops=1)
Index Cond: (masterid = 3533933)
-> Nested Loop (cost=0.00..1507822.44 rows=600070 width=28) (actual time=10.384..16.382 rows=1 loops=3)
Join Filter: (r.id = c.parent)
-> WorkTable Scan on rows r (cost=0.00..0.20 rows=10 width=16) (actual time=0.001..0.003 rows=1 loops=3)
-> Append (cost=0.00..113264.67 rows=3001404 width=20) (actual time=8.546..12.268 rows=1 loops=4)
-> Seq Scan on node c (cost=0.00..24.00 rows=1400 width=20) (actual time=0.001..0.001 rows=0 loops=4)
-> Bitmap Heap Scan on node_dataset c (cost=58213.87..113214.88 rows=3000001 width=20) (actual time=1.906..1.906 rows=0 loops=4)
Recheck Cond: (c.parent = r.id)
-> Bitmap Index Scan on node_dataset_parent (cost=0.00..57463.87 rows=3000001 width=0) (actual time=1.903..1.903 rows=0 loops=4)
Index Cond: (c.parent = r.id)
-> Index Scan using node_stammdaten_parent on node_stammdaten c (cost=0.00..8.60 rows=1 width=20) (actual time=3.272..3.273 rows=0 loops=4)
Index Cond: (c.parent = r.id)
-> Index Scan using node_stammdaten_adresse_parent on node_stammdaten_adresse c (cost=0.00..8.60 rows=1 width=20) (actual time=4.333..4.333 rows=0 loops=4)
Index Cond: (c.parent = r.id)
-> Index Scan using node_testdaten_parent on node_testdaten c (cost=0.00..8.60 rows=1 width=20) (actual time=2.745..2.746 rows=0 loops=4)
Index Cond: (c.parent = r.id)
Total runtime: 49.349 ms
(21 rows)
(END)
-> incredibly faster, because indexes were used.
Notice: the cost of the second query is somewhat higher than that of the first query.
So the main question is: Why does the planner make the first decision, instead of the second?
Also interesting:
Via
SET enable_seqscan TO false;
I temporarily disabled seq scans. Then the planner used indexes and hash joins, and the query was still slow. So the problem seems to be the hash join.
Maybe someone can help in this confusing situation?
thx, R.
If your EXPLAIN differs significantly from reality (the cost is lower but the time is higher), it is likely that your statistics are out of date, or were gathered on a non-representative sample.
Try again with fresh statistics. If this does not help, increase the statistics sample size and build the stats again.
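For example (the statistics-target value of 1000 here is just illustrative; with table inheritance each child table holding the data needs its own setting and ANALYZE):

```sql
-- Larger sample for the join column on one of the child tables, then re-analyze:
ALTER TABLE d_storage.node_dataset ALTER COLUMN parent SET STATISTICS 1000;
ANALYZE d_storage.node_dataset;

-- Or raise the sample size for everything analyzed in this session:
SET default_statistics_target = 1000;
ANALYZE;
```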
Try this:
set seq_page_cost = '4.0';
set random_page_cost = '1.0';
EXPLAIN ANALYZE ....
Do this in your session to see if it makes a difference.
The hash join was expecting 600070 resulting rows, but only got 4 (in 3 loops, averaging 1 row per loop). If 600070 had been accurate, the hash join would presumably have been appropriate.