I have already created indexes on both tables, but the query is still pretty slow. Without the WHERE clause it is fast.
EXPLAIN ANALYZE SELECT date, a, b, c FROM t1 JOIN t2 USING (date, a) WHERE date = current_date;
Nested Loop (cost=0.71..12.75 rows=1 width=22) (actual time=0.343..50925.262 rows=87956 loops=1)
Join Filter: (t1.a = t2.a)
Rows Removed by Join Filter: 262988440
-> Index Scan using t1_date_idx1 on t1 t1 (cost=0.42..8.44 rows=1 width=15) (actual time=0.022..20.240 rows=87956 loops=1)
Index Cond: (date = CURRENT_DATE)
-> Index Scan using t2_date_idx on t2 t2 (cost=0.29..4.30 rows=1 width=15) (actual time=0.005..0.353 rows=2991 loops=87956)
Index Cond: (date = CURRENT_DATE)
Planning time: 0.151 ms
Execution time: 50930.327 ms
Without the WHERE clause:
EXPLAIN ANALYZE SELECT date, a, b, c FROM t1 JOIN t2 USING (date, a);
Hash Join (cost=349.55..11993.24 rows=182123 width=22) (actual time=4.741..61.881 rows=182123 loops=1)
Hash Cond: ((t1.date = t2.date) AND (t1.a = t2.a))
-> Seq Scan on t1 t1 (cost=0.00..8001.23 rows=182123 width=15) (actual time=2.921..17.651 rows=182123 loops=1)
-> Hash (cost=259.82..259.82 rows=5982 width=15) (actual time=1.765..1.765 rows=5982 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 350kB
-> Seq Scan on t2 t2 (cost=0.00..259.82 rows=5982 width=15) (actual time=0.115..0.908 rows=5982 loops=1)
Planning time: 0.280 ms
Execution time: 66.400 ms
Thanks rickyen. After running "ANALYZE t1;" and "ANALYZE t2;", it is much better:
Hash Join (cost=233.44..7052.01 rows=87936 width=15) (actual time=0.936..31.488 rows=87956 loops=1)
Hash Cond: (t1.a = t2.a)
-> Index Scan using t1_date_idx1 on t1 (cost=0.42..5499.81 rows=87965 width=15) (actual time=0.026..12.625 rows=87956 loops=1)
Index Cond: (date = CURRENT_DATE)
-> Hash (cost=195.63..195.63 rows=2991 width=8) (actual time=0.900..0.900 rows=2991 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 152kB
-> Index Scan using t2_date_idx on t2 (cost=0.29..195.63 rows=2991 width=8) (actual time=0.015..0.556 rows=2991 loops=1)
Index Cond: (date = CURRENT_DATE)
Planning time: 0.171 ms
Execution time: 33.799 ms
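If the plan ever regresses like that again, a quick way to see whether the statistics have gone stale is to check when each table was last analyzed, using PostgreSQL's standard pg_stat_user_tables view (an illustrative check, using the table names from this example):
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('t1', 't2');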
The query retrieves videos ordered by the number of tags they have in common with a specific video.
The following query takes about 800 ms, even though the index appears to be used.
If I remove COUNT, GROUP BY, and ORDER BY from the SQL query, it runs very fast (1-5 ms).
In such a case, is improving the SQL query alone not enough to speed up the process, and do I need to use a MATERIALIZED VIEW?
SELECT "videos_video"."id",
"videos_video"."title",
"videos_video"."thumbnail_url",
"videos_video"."preview_url",
"videos_video"."embed_url",
"videos_video"."duration",
"videos_video"."views",
"videos_video"."is_public",
"videos_video"."published_at",
"videos_video"."created_at",
"videos_video"."updated_at",
COUNT("videos_video"."id") AS "n"
FROM "videos_video"
INNER JOIN "videos_video_tags" ON ("videos_video"."id" = "videos_video_tags"."video_id")
WHERE ("videos_video_tags"."tag_id" IN
(SELECT U0."id"
FROM "videos_tag" U0
INNER JOIN "videos_video_tags" U1 ON (U0."id" = U1."tag_id")
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'))
GROUP BY "videos_video"."id"
ORDER BY "n" DESC
LIMIT 20;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1040.69..1040.74 rows=20 width=24) (actual time=738.648..738.654 rows=20 loops=1)
-> Sort (cost=1040.69..1044.29 rows=1441 width=24) (actual time=738.646..738.650 rows=20 loops=1)
Sort Key: (count(videos_video.id)) DESC
Sort Method: top-N heapsort Memory: 27kB
-> HashAggregate (cost=987.93..1002.34 rows=1441 width=24) (actual time=671.006..714.322 rows=188818 loops=1)
Group Key: videos_video.id
Batches: 1 Memory Usage: 28689kB
-> Nested Loop (cost=35.20..980.73 rows=1441 width=16) (actual time=0.341..559.034 rows=240293 loops=1)
-> Nested Loop (cost=34.78..340.88 rows=1441 width=16) (actual time=0.278..92.806 rows=240293 loops=1)
-> HashAggregate (cost=34.35..34.41 rows=6 width=32) (actual time=0.188..0.200 rows=4 loops=1)
Group Key: u0.id
Batches: 1 Memory Usage: 24kB
-> Nested Loop (cost=0.71..34.33 rows=6 width=32) (actual time=0.161..0.185 rows=4 loops=1)
-> Index Only Scan using videos_video_tags_video_id_tag_id_f8d6ba70_uniq on videos_video_tags u1 (cost=0.43..4.53 rows=6 width=16) (actual time=0.039..0.040 rows=4 loops=1)
Index Cond: (video_id = '748b1814-f311-48da-a1f5-6bf8fe229c7f'::uuid)
Heap Fetches: 0
-> Index Only Scan using videos_tag_pkey on videos_tag u0 (cost=0.28..4.97 rows=1 width=16) (actual time=0.035..0.035 rows=1 loops=4)
Index Cond: (id = u1.tag_id)
Heap Fetches: 0
-> Index Scan using videos_video_tags_tag_id_2673cfc8 on videos_video_tags (cost=0.43..35.90 rows=1518 width=32) (actual time=0.029..16.728 rows=60073 loops=4)
Index Cond: (tag_id = u0.id)
-> Index Only Scan using videos_video_pkey on videos_video (cost=0.42..0.44 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=240293)
Index Cond: (id = videos_video_tags.video_id)
Heap Fetches: 46
Planning Time: 1.980 ms
Execution Time: 739.446 ms
(26 rows)
Time: 742.145 ms
---------- Execution plan for the query from Edouard's answer ----------
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=30043.90..30212.53 rows=20 width=746) (actual time=239.142..239.219 rows=20 loops=1)
-> Limit (cost=30043.48..30043.53 rows=20 width=24) (actual time=239.089..239.093 rows=20 loops=1)
-> Sort (cost=30043.48..30607.15 rows=225467 width=24) (actual time=239.087..239.090 rows=20 loops=1)
Sort Key: (count(*)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=21789.21..24043.88 rows=225467 width=24) (actual time=185.710..219.211 rows=188818 loops=1)
Group Key: vt.video_id
Batches: 1 Memory Usage: 22545kB
-> Nested Loop (cost=20.62..20187.24 rows=320395 width=16) (actual time=4.975..106.839 rows=240293 loops=1)
-> Index Only Scan using videos_video_tags_video_id_tag_id_f8d6ba70_uniq on videos_video_tags vvt (cost=0.43..4.53 rows=6 width=16) (actual time=0.033..0.043 rows=4 loops=1)
Index Cond: (video_id = '748b1814-f311-48da-a1f5-6bf8fe229c7f'::uuid)
Heap Fetches: 0
-> Bitmap Heap Scan on videos_video_tags vt (cost=20.19..3348.60 rows=1518 width=32) (actual time=4.311..20.663 rows=60073 loops=4)
Recheck Cond: (tag_id = vvt.tag_id)
Heap Blocks: exact=34757
-> Bitmap Index Scan on videos_video_tags_tag_id_2673cfc8 (cost=0.00..19.81 rows=1518 width=0) (actual time=3.017..3.017 rows=60073 loops=4)
Index Cond: (tag_id = vvt.tag_id)
-> Index Scan using videos_video_pkey on videos_video v (cost=0.42..8.44 rows=1 width=738) (actual time=0.005..0.005 rows=1 loops=20)
Index Cond: (id = vt.video_id)
Planning Time: 0.854 ms
Execution Time: 241.392 ms
(21 rows)
Time: 242.909 ms
Below are some ideas to simplify the query; an EXPLAIN ANALYZE will then confirm their potential impact on performance.
Starting from the subquery:
SELECT U0."id"
FROM "videos_tag" U0
INNER JOIN "videos_video_tags" U1 ON (U0."id" = U1."tag_id")
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
According to the JOIN condition U0."id" = U1."tag_id", SELECT U0."id" can be replaced by SELECT U1."tag_id".
The table "videos_tag" U0 is then no longer needed in the subquery, which can be simplified to:
SELECT U1."tag_id"
FROM "videos_video_tags" U1
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
And the WHERE clause of the main query becomes:
WHERE "videos_video_tags"."tag_id" IN
( SELECT U1."tag_id"
FROM "videos_video_tags" U1
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
)
which can be transformed into a self join on the table "videos_video_tags", added to the FROM clause of the main query:
FROM "videos_video" AS v
INNER JOIN "videos_video_tags" AS vt
ON v."id" = vt."video_id"
INNER JOIN "videos_video_tags" AS vvt
ON vvt."tag_id" = vt."tag_id"
WHERE vvt."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
Finally, the GROUP BY "videos_video"."id" clause can be replaced by GROUP BY "videos_video_tags"."video_id" thanks to the join condition between the two tables. This new GROUP BY, together with the ORDER BY and LIMIT clauses, can then be applied in a subquery that involves only the table "videos_video_tags", before joining with the table "videos_video":
SELECT v."id",
v."title",
v."thumbnail_url",
v."preview_url",
v."embed_url",
v."duration",
v."views",
v."is_public",
v."published_at",
v."created_at",
v."updated_at",
w."n"
FROM "videos_video" AS v
INNER JOIN
( SELECT vt."video_id"
, count(*) AS "n"
FROM "videos_video_tags" AS vt
INNER JOIN "videos_video_tags" AS vvt
ON vvt."tag_id" = vt."tag_id"
WHERE vvt."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
GROUP BY vt."video_id"
ORDER BY "n" DESC
LIMIT 20
) AS w
ON v."id" = w."video_id"
We run a join query between two tables.
The query has an OR condition that compares a column from the left table with a column from the right table. The query performance was very poor, and we fixed it by changing the OR to a UNION.
Why is this happening? I'm looking for a detailed explanation, or a reference to the documentation, that might shed light on the issue.
Query with OR statement:
db1=# explain analyze select count(*)
from conversations
join agents on conversations.agent_id=agents.id
where conversations.id=1 or agents.id = '123';
**Query plan**
----------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=**11017.95..11017.96** rows=1 width=8) (actual time=54.088..54.088 rows=1 loops=1)
-> Gather (cost=11017.73..11017.94 rows=2 width=8) (actual time=53.945..57.181 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=10017.73..10017.74 rows=1 width=8) (actual time=48.303..48.303 rows=1 loops=3)
-> Hash Join (cost=219.26..10016.69 rows=415 width=0) (actual time=5.292..48.287 rows=130 loops=3)
Hash Cond: (conversations.agent_id = agents.id)
Join Filter: ((conversations.id = 1) OR ((agents.id)::text = '123'::text))
Rows Removed by Join Filter: 80035
-> Parallel Seq Scan on conversations (cost=0.00..9366.95 rows=163995 width=8) (actual time=0.017..14.972 rows=131196 loops=3)
-> Hash (cost=143.56..143.56 rows=6056 width=16) (actual time=2.686..2.686 rows=6057 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 353kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=16) (actual time=0.011..1.305 rows=6057 loops=3)
Planning time: 0.710 ms
Execution time: 57.276 ms
(15 rows)
Changing the OR to UNION:
db1=# explain analyze select count(*) from (
select *
from conversations
join agents on conversations.agent_id=agents.id
where conversations.installation_id=1
union
select *
from conversations
join agents on conversations.agent_id=agents.id
where agents.source_id = '123') as subquery;
**Query plan:**
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (**cost=1114.31..1114.32** rows=1 width=8) (actual time=8.038..8.038 rows=1 loops=1)
-> HashAggregate (cost=1091.90..1101.86 rows=996 width=1437) (actual time=7.783..8.009 rows=390 loops=1)
Group Key: conversations.id, conversations.created, conversations.modified, conversations.source_created, conversations.source_id, conversations.installation_id, brain_conversation.resolution_reason, conversations.solve_time, conversations.agent_id, conversations.submission_reason, conversations.is_marked_as_duplicate, conversations.num_back_and_forths, conversations.is_closed, conversations.is_solved, conversations.conversation_type, conversations.related_ticket_source_id, conversations.channel, brain_conversation.last_updated_from_platform, conversations.csat, agents.id, agents.created, agents.modified, agents.name, agents.source_id, organization_agent.installation_id, agents.settings
-> Append (cost=219.68..1027.16 rows=996 width=1437) (actual time=5.517..6.307 rows=390 loops=1)
-> Hash Join (cost=219.68..649.69 rows=931 width=224) (actual time=5.516..6.063 rows=390 loops=1)
Hash Cond: (conversations.agent_id = agents.id)
-> Index Scan using conversations_installation_id_b3ff5c00 on conversations (cost=0.42..427.98 rows=931 width=154) (actual time=0.039..0.344 rows=879 loops=1)
Index Cond: (installation_id = 1)
-> Hash (cost=143.56..143.56 rows=6056 width=70) (actual time=5.394..5.394 rows=6057 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 710kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=70) (actual time=0.014..1.938 rows=6057 loops=1)
-> Nested Loop (cost=0.70..367.52 rows=65 width=224) (actual time=0.210..0.211 rows=0 loops=1)
-> Index Scan using agents_source_id_106c8103_like on agents agents_1 (cost=0.28..8.30 rows=1 width=70) (actual time=0.210..0.210 rows=0 loops=1)
Index Cond: ((source_id)::text = '123'::text)
-> Index Scan using conversations_agent_id_de76554b on conversations conversations_1 (cost=0.42..358.12 rows=110 width=154) (never executed)
Index Cond: (agent_id = agents_1.id)
Planning time: 2.024 ms
Execution time: 9.367 ms
(18 rows)
Yes, OR has a way of killing query performance. For this query:
select count(*)
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1 or a.id = 123;
Note that I removed the quotes around 123; it looks like a number, so I assume it is one. For this query, you want an index on conversations(agent_id).
Probably the most effective way to write the query is:
select count(*)
from ((select 1
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1
) union all
(select 1
from conversations c join
agents a
on c.agent_id = a.id
where a.id = 123 and c.id <> 1
)
) ac;
Note the use of union all rather than union; the additional where condition in the second branch removes what would otherwise be duplicate rows.
This can take advantage of the following indexes (sketched below):
conversations(id, agent_id)
agents(id)
conversations(agent_id, id)
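For completeness, a minimal sketch of those index definitions (the index names are illustrative, and agents(id) is usually already covered by the primary key):
create index conversations_id_agent_id_idx on conversations (id, agent_id);
create index conversations_agent_id_id_idx on conversations (agent_id, id);
-- agents(id) is normally the primary key; create it only if it is not:
create index agents_id_idx on agents (id);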
Here's my SQL, followed by the explain output. I need to improve its performance. Any ideas?
PostgreSQL 9.3.12 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4, 64-bit
explain analyze
SELECT DISTINCT "apts"."id", practices.name AS alias_0
FROM "apts"
LEFT OUTER JOIN "patients" ON "patients"."id" = "apts"."patient_id"
LEFT OUTER JOIN "practices" ON "practices"."id" = "apts"."practice_id"
LEFT OUTER JOIN "eligibility_messages" ON "eligibility_messages"."apt_id" = "apts"."id"
WHERE (apts.eligibility_status_id != 1)
AND (eligibility_messages.current = 't')
AND (practices.id = '104')
ORDER BY practices.name desc
LIMIT 25 OFFSET 0
Limit (cost=881321.34..881321.41 rows=25 width=20) (actual time=2928.225..2928.227 rows=25 loops=1)
-> Sort (cost=881321.34..881391.94 rows=28240 width=20) (actual time=2928.223..2928.224 rows=25 loops=1)
Sort Key: practices.name
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=880242.03..880524.43 rows=28240 width=20) (actual time=2927.213..2927.319 rows=520 loops=1)
-> Nested Loop (cost=286614.55..880100.83 rows=28240 width=20) (actual time=206.180..2926.791 rows=520 loops=1)
-> Seq Scan on practices (cost=0.00..6.36 rows=1 width=20) (actual time=0.018..0.031 rows=1 loops=1)
Filter: (id = 104)
Rows Removed by Filter: 108
-> Hash Join (cost=286614.55..879812.07 rows=28240 width=8) (actual time=206.159..2926.643 rows=520 loops=1)
Hash Cond: (eligibility_messages.apt_id = apts.id)
-> Seq Scan on eligibility_messages (cost=0.00..561275.63 rows=2029532 width=4) (actual time=0.691..2766.867 rows=67559 loops=1)
Filter: current
Rows Removed by Filter: 3924633
-> Hash (cost=284614.02..284614.02 rows=115082 width=12) (actual time=121.957..121.957 rows=91660 loops=1)
Buckets: 16384 Batches: 2 Memory Usage: 1974kB
-> Bitmap Heap Scan on apts (cost=8296.88..284614.02 rows=115082 width=12) (actual time=19.927..91.038 rows=91660 loops=1)
Recheck Cond: (practice_id = 104)
Filter: (eligibility_status_id <> 1)
Rows Removed by Filter: 80169
-> Bitmap Index Scan on index_apts_on_practice_id (cost=0.00..8268.11 rows=177540 width=0) (actual time=16.856..16.856 rows=179506 loops=1)
Index Cond: (practice_id = 104)
Total runtime: 2928.361 ms
First, rewrite the query to a more manageable form:
SELECT DISTINCT a."id", pr.name AS alias_0
FROM "apts" a JOIN
"practices" pr
ON pr."id" = a."practice_id" JOIN
"eligibility_messages" em
ON em."apt_id" = a."id"
WHERE (a.eligibility_status_id <> 1) AND
(em.current = 't') AND
(a.practice_id = 104)
ORDER BY pr.name desc ;
Notes:
The WHERE clause turns the outer joins into inner joins anyway, so you might as well express them correctly.
I doubt pr.id is actually a string.
The patients table isn't used, so I just removed it.
Perhaps you don't even need the select distinct any more.
I switched the condition in the where clause to use apts rather than practices.
If this isn't fast enough, you want indexes, probably on apts(practice_id, eligibility_status_id, id), practices(id), and eligibility_messages(apt_id, current).
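A minimal sketch of those indexes (the names are illustrative; practices(id) is normally the primary key already):
CREATE INDEX apts_practice_status_id_idx ON apts (practice_id, eligibility_status_id, id);
CREATE INDEX eligibility_messages_apt_id_current_idx ON eligibility_messages (apt_id, current);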
I have a case where I need to clean up orphaned rows from a table regularly, so I'm looking for a high-performance solution. I tried using an IN clause, but it's not really fast. The columns have all the required indexes in both tables (id - primary key, component_id - index, component_type - index).
DELETE FROM component_apportionment
WHERE id in (
SELECT a.id
FROM component_apportionment a
LEFT JOIN component_live c
ON (c.component_id = a.component_id
AND
c.component_type = a.component_type)
WHERE c.id is null);
Basically, the goal is to remove records from the 'component_apportionment' table that do not exist in the 'component_live' table.
The query plan for the query above is horrible as well:
Delete on component_apportionment_copy1 (cost=3860927.55..3860929.09 rows=1 width=18) (actual time=183479.848..183479.848 rows=0 loops=1)
-> Nested Loop (cost=3860927.55..3860929.09 rows=1 width=18) (actual time=183479.811..183479.813 rows=1 loops=1)
-> HashAggregate (cost=3860927.12..3860927.13 rows=1 width=20) (actual time=183479.793..183479.793 rows=1 loops=1)
Group Key: a.id
-> Merge Right Join (cost=3753552.72..3860927.12 rows=1 width=20) (actual time=172941.125..183479.787 rows=1 loops=1)
Merge Cond: ((c.component_id = a.component_id) AND ((c.component_type)::text = (a.component_type)::text))
Filter: (c.id IS NULL)
Rows Removed by Filter: 5968195
-> Sort (cost=3390767.32..3413658.29 rows=9156391 width=21) (actual time=169852.438..172642.897 rows=8043013 loops=1)
Sort Key: c.component_id, c.component_type
Sort Method: external merge Disk: 310232kB
-> Seq Scan on component_live c (cost=0.00..2117393.91 rows=9156391 width=21) (actual time=0.004..155656.568 rows=9333382 loops=1)
-> Materialize (cost=362785.40..375049.75 rows=2452871 width=21) (actual time=3088.653..5343.013 rows=5968195 loops=1)
-> Sort (cost=362785.40..368917.58 rows=2452871 width=21) (actual time=3088.648..3989.163 rows=2452871 loops=1)
Sort Key: a.component_id, a.component_type
Sort Method: external merge Disk: 81504kB
-> Seq Scan on component_apportionment_copy1 a (cost=0.00..44969.71 rows=2452871 width=21) (actual time=0.920..882.040 rows=2452871 loops=1)
-> Index Scan using component_apportionment_copy1_pkey on component_apportionment_copy1 (cost=0.43..1.95 rows=1 width=14) (actual time=0.012..0.012 rows=1 loops=1)
Index Cond: (id = a.id)
Planning time: 5.573 ms
Execution time: 183554.675 ms
Would appreciate any help.
Thanks
Note
The tables have approximately 80 million records each in the worst case. Both tables have indexes on the columns used.
UPDATE
Query plan for 'not exists'
Query:
EXPLAIN (analyze, verbose, buffers) DELETE FROM component_apportionment_copy1
WHERE not exists (select 1
from component_live c
where c.component_id = component_apportionment_copy1.component_id);
Delete on vector.component_apportionment_copy1 (cost=2276557.80..2446287.39 rows=2104532 width=12) (actual time=203643.560..203643.560 rows=0 loops=1)
Buffers: shared hit=20875 read=2025400, temp read=46067 written=45813
-> Hash Anti Join (cost=2276557.80..2446287.39 rows=2104532 width=12) (actual time=202212.975..203643.486 rows=1 loops=1)
Output: component_apportionment_copy1.ctid, c.ctid
Hash Cond: (component_apportionment_copy1.component_id = c.component_id)
Buffers: shared hit=20874 read=2025400, temp read=46067 written=45813
-> Seq Scan on vector.component_apportionment_copy1 (cost=0.00..44969.71 rows=2452871 width=10) (actual time=0.003..659.668 rows=2452871 loops=1)
Output: component_apportionment_copy1.ctid, component_apportionment_copy1.component_id
Buffers: shared hit=20441
-> Hash (cost=2117393.91..2117393.91 rows=9156391 width=10) (actual time=198536.786..198536.786 rows=9333382 loops=1)
Output: c.ctid, c.component_id
Buckets: 16384 Batches: 128 Memory Usage: 3195kB
Buffers: shared hit=430 read=2025400, temp written=36115
-> Seq Scan on vector.component_live c (cost=0.00..2117393.91 rows=9156391 width=10) (actual time=0.039..194415.641 rows=9333382 loops=1)
Output: c.ctid, c.component_id
Buffers: shared hit=430 read=2025400
Planning time: 6.639 ms
Execution time: 203643.594 ms
It's doing a seq scan on both tables, and the more data there is, the slower it will be.
You have way too many joins in there:
set enable_seqscan = false; -- forcing to use indexes
DELETE FROM component_apportionment
WHERE not exists (select 1
from component_live c
where c.component_id = component_apportionment.component_id);
This will do the same thing and should be much faster, especially if you have indexes on the component_id columns.
The exists way:
delete
from component_apportionment ca
where not exists
(select 1
from component_live cl
where cl.component_id = ca.component_id
);
Or the in way:
delete
from component_apportionment
where component_id not in
(select component_id
from component_live
);
Also, create indexes on the component_id columns in both tables.
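A minimal sketch of those indexes (the index names are illustrative; a primary key on component_id, as in the test script below, serves the same purpose for component_live):
create index component_apportionment_component_id_idx on component_apportionment (component_id);
create index component_live_component_id_idx on component_live (component_id);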
UPDATE
I made a script for testing:
-- table creating and populating (1,000,000 records each)
drop table if exists component_apportionment;
drop table if exists component_live;
create table component_live (component_id numeric primary key);
create table component_apportionment (id serial primary key, component_id numeric);
create index component_apportionment_idx on component_apportionment (component_id);
insert into component_live select g from generate_series(1,1000000) g;
insert into component_apportionment (component_id) select trunc(random()*1000000) from generate_series(1,1000000) g;
analyze verbose component_live;
analyze verbose component_apportionment;
EXPLAIN (analyze, verbose, buffers)
select component_id
from component_apportionment ca
where not exists
(select 1
from component_live cl
where cl.component_id = ca.component_id
);
Merge Anti Join (cost=0.85..61185.85 rows=1 width=6) (actual time=0.013..1060.014 rows=2 loops=1)
Output: ca.component_id
Merge Cond: (ca.component_id = cl.component_id)
Buffers: shared hit=1010548
-> Index Only Scan using component_apportionment_idx on admin.component_apportionment ca (cost=0.42..24015.42 rows=1000000 width=6) (actual time=0.006..460.318 rows=1000000 loops=1)
Output: ca.component_id
Heap Fetches: 1000000
Buffers: shared hit=1003388
-> Index Only Scan using component_live_pkey on admin.component_live cl (cost=0.42..22170.42 rows=1000000 width=6) (actual time=0.005..172.502 rows=999998 loops=1)
Output: cl.component_id
Heap Fetches: 999998
Buffers: shared hit=7160
Total runtime: 1060.035 ms
Consider two PostgreSQL tables:
Table #1
id INT
secret_id INT
title VARCHAR
Table #2
id INT
secret_id INT
I need to select all records from Table #1, excluding those whose secret_id appears in Table #2.
The following query is very slow with 1,000,000 records in Table #1 and 500,000 in Table #2:
select * from table_1 where secret_id not in (select secret_id from table_2);
What is the best way to achieve this ?
FWIW, I tested Daniel Lyons's and Craig Ringer's suggestions made in the comments above. Here are the results for my particular case (~500k rows per table), ordered by efficiency (most efficient first).
ANTI-JOIN:
> EXPLAIN ANALYZE SELECT * FROM table1 t1 LEFT JOIN table2 t2 ON t1.secret_id=t2.secret_id WHERE t2.secret_id IS NULL;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=19720.19..56129.91 rows=21 width=28) (actual time=139.868..459.991 rows=142993 loops=1)
Hash Cond: (t1.secret_id = t2.secret_id)
-> Seq Scan ON table1 t1 (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.005..61.913 rows=622338 loops=1)
-> Hash (cost=10849.75..10849.75 rows=510275 width=14) (actual time=138.176..138.176 rows=510275 loops=1)
Buckets: 4096 Batches: 32 Memory Usage: 777kB
-> Seq Scan ON table2 t2 (cost=0.00..10849.75 rows=510275 width=14) (actual time=0.018..47.005 rows=510275 loops=1)
Total runtime: 466.748 ms
(7 rows)
NOT EXISTS:
> EXPLAIN ANALYZE SELECT * FROM table1 t1 WHERE NOT EXISTS (SELECT secret_id FROM table2 t2 WHERE t2.secret_id=t1.secret_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=19222.19..55133.91 rows=21 width=14) (actual time=181.881..517.632 rows=142993 loops=1)
Hash Cond: (t1.secret_id = t2.secret_id)
-> Seq Scan ON table1 t1 (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.005..70.478 rows=622338 loops=1)
-> Hash (cost=10849.75..10849.75 rows=510275 width=4) (actual time=179.665..179.665 rows=510275 loops=1)
Buckets: 4096 Batches: 32 Memory Usage: 592kB
-> Seq Scan ON table2 t2 (cost=0.00..10849.75 rows=510275 width=4) (actual time=0.019..78.074 rows=510275 loops=1)
Total runtime: 524.300 ms
(7 rows)
EXCEPT:
> EXPLAIN ANALYZE SELECT * FROM table1 EXCEPT (SELECT t1.* FROM table1 t1 join table2 t2 ON t1.secret_id=t2.secret_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
SetOp Except (cost=1524985.53..1619119.03 rows=62261 width=14) (actual time=16926.056..19850.915 rows=142993 loops=1)
-> Sort (cost=1524985.53..1543812.23 rows=7530680 width=14) (actual time=16925.010..18596.860 rows=6491408 loops=1)
Sort Key: "*SELECT* 1".secret_id, "*SELECT* 1".jeu, "*SELECT* 1".combinaison, "*SELECT* 1".gains
Sort Method: external merge Disk: 185232kB
-> Append (cost=0.00..278722.63 rows=7530680 width=14) (actual time=0.007..2951.920 rows=6491408 loops=1)
-> Subquery Scan ON "*SELECT* 1" (cost=0.00..19275.12 rows=622606 width=14) (actual time=0.007..176.892 rows=622338 loops=1)
-> Seq Scan ON table1 (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.005..69.842 rows=622338 loops=1)
-> Subquery Scan ON "*SELECT* 2" (cost=19222.19..259447.51 rows=6908074 width=14) (actual time=168.529..2228.335 rows=5869070 loops=1)
-> Hash Join (cost=19222.19..190366.77 rows=6908074 width=14) (actual time=168.528..1450.663 rows=5869070 loops=1)
Hash Cond: (t1.secret_id = t2.secret_id)
-> Seq Scan ON table1 t1 (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.002..64.554 rows=622338 loops=1)
-> Hash (cost=10849.75..10849.75 rows=510275 width=4) (actual time=168.329..168.329 rows=510275 loops=1)
Buckets: 4096 Batches: 32 Memory Usage: 592kB
-> Seq Scan ON table2 t2 (cost=0.00..10849.75 rows=510275 width=4) (actual time=0.017..72.702 rows=510275 loops=1)
Total runtime: 19896.445 ms
(15 rows)
NOT IN:
> EXPLAIN SELECT * FROM table1 WHERE secret_id NOT IN (SELECT secret_id FROM table2);
QUERY PLAN
-----------------------------------------------------------------------------------------
Seq Scan ON table1 (cost=0.00..5189688549.26 rows=311303 width=14)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..15395.12 rows=510275 width=4)
-> Seq Scan ON table2 (cost=0.00..10849.75 rows=510275 width=4)
(5 rows)
I did not run EXPLAIN ANALYZE on the latter because it took ages.