The queries below are performed on a large table with 11 million rows. I ran ANALYZE on the table before executing them.
Query 1:
SELECT *
FROM accounts t1
LEFT OUTER JOIN accounts t2
ON (t1.account_no = t2.account_no
AND t1.effective_date < t2.effective_date)
WHERE t2.account_no IS NULL;
Explain Analyze:
Hash Anti Join (cost=480795.57..1201111.40 rows=7369854 width=292) (actual time=29619.499..115662.111 rows=1977871 loops=1)
Hash Cond: ((t1.account_no)::text = (t2.account_no)::text)
Join Filter: ((t1.effective_date)::text < (t2.effective_date)::text)
-> Seq Scan on accounts t1 (cost=0.00..342610.81 rows=11054781 width=146) (actual time=0.025..25693.921 rows=11034070 loops=1)
-> Hash (cost=342610.81..342610.81 rows=11054781 width=146) (actual time=29612.925..29612.925 rows=11034070 loops=1)
Buckets: 2097152 Batches: 1 Memory Usage: 1834187kB
-> Seq Scan on accounts t2 (cost=0.00..342610.81 rows=11054781 width=146) (actual time=0.006..22929.635 rows=11034070 loops=1)
Total runtime: 115870.788 ms
The estimated cost is ~1.2 million and the actual time taken is ~1.9 minutes.
Query 2:
SELECT t1.*
FROM accounts t1
LEFT OUTER JOIN accounts t2
ON (t1.account_no = t2.account_no
AND t1.effective_date < t2.effective_date)
WHERE t2.account_no IS NULL;
Explain Analyze:
Hash Anti Join (cost=480795.57..1201111.40 rows=7369854 width=146) (actual time=13365.808..65519.402 rows=1977871 loops=1)
Hash Cond: ((t1.account_no)::text = (t2.account_no)::text)
Join Filter: ((t1.effective_date)::text < (t2.effective_date)::text)
-> Seq Scan on accounts t1 (cost=0.00..342610.81 rows=11054781 width=146) (actual time=0.007..5032.778 rows=11034070 loops=1)
-> Hash (cost=342610.81..342610.81 rows=11054781 width=18) (actual time=13354.219..13354.219 rows=11034070 loops=1)
Buckets: 2097152 Batches: 1 Memory Usage: 545369kB
-> Seq Scan on accounts t2 (cost=0.00..342610.81 rows=11054781 width=18) (actual time=0.011..8964.571 rows=11034070 loops=1)
Total runtime: 65705.707 ms
The estimated cost is ~1.2 million (again), but the actual time taken is ~1.1 minutes.
Query 3:
SELECT *
FROM accounts
WHERE (account_no,
effective_date) IN
(SELECT account_no,
max(effective_date)
FROM accounts
GROUP BY account_no);
Explain Analyze:
Nested Loop (cost=406416.19..502216.84 rows=2763695 width=146) (actual time=31779.457..917543.228 rows=1977871 loops=1)
-> HashAggregate (cost=406416.19..406757.45 rows=34126 width=43) (actual time=31774.877..33378.968 rows=1977425 loops=1)
-> Subquery Scan on "ANY_subquery" (cost=397884.72..404709.90 rows=341259 width=43) (actual time=27979.226..29841.217 rows=1977425 loops=1)
-> HashAggregate (cost=397884.72..401297.31 rows=341259 width=18) (actual time=27979.224..29315.346 rows=1977425 loops=1)
-> Seq Scan on accounts (cost=0.00..342610.81 rows=11054781 width=18) (actual time=0.851..16092.755 rows=11034070 loops=1)
-> Index Scan using accounts_idx2 on accounts (cost=0.00..2.78 rows=1 width=146) (actual time=0.443..0.445 rows=1 loops=1977425)
Index Cond: (((account_no)::text = ("ANY_subquery".account_no)::text) AND ((effective_date)::text = "ANY_subquery".max))
Total runtime: 918039.614 ms
The estimated cost is ~502,000 but the actual time taken is ~15.3 minutes!
How reliable is the EXPLAIN output?
Do we always have to EXPLAIN ANALYZE to see how our query is going to perform on real data, rather than trusting the query planner's cost estimate?
They are reliable, except for when they are not. You can't really generalize.
It looks like the planner is dramatically underestimating the number of distinct account_no values it will find (it thinks it will find 34,126; it actually found 1,977,425). Your default_statistics_target might not be high enough to get a good estimate for this column.
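If that is the cause, the fix is to raise the statistics target for that column and re-analyze. A minimal sketch, assuming the target value of 1000 is just an illustrative choice:

-- Sample more rows for account_no when gathering statistics,
-- then refresh the statistics.
ALTER TABLE accounts ALTER COLUMN account_no SET STATISTICS 1000;
ANALYZE accounts;

-- Check the distinct-value estimate the planner will now use.
SELECT n_distinct FROM pg_stats
WHERE tablename = 'accounts' AND attname = 'account_no';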
Related
We run a join query between two tables.
The query has an OR condition that compares one column from the left table with one column from the right table. Query performance was very poor, and we fixed it by changing the OR to a UNION.
Why is this happening? I'm looking for a detailed explanation, or a reference to the documentation that might shed light on the issue.
Query with OR statement:
db1=# explain analyze select count(*)
from conversations
join agents on conversations.agent_id=agents.id
where conversations.id=1 or agents.id = '123';
Query plan:
----------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=11017.95..11017.96 rows=1 width=8) (actual time=54.088..54.088 rows=1 loops=1)
-> Gather (cost=11017.73..11017.94 rows=2 width=8) (actual time=53.945..57.181 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=10017.73..10017.74 rows=1 width=8) (actual time=48.303..48.303 rows=1 loops=3)
-> Hash Join (cost=219.26..10016.69 rows=415 width=0) (actual time=5.292..48.287 rows=130 loops=3)
Hash Cond: (conversations.agent_id = agents.id)
Join Filter: ((conversations.id = 1) OR ((agents.id)::text = '123'::text))
Rows Removed by Join Filter: 80035
-> Parallel Seq Scan on conversations (cost=0.00..9366.95 rows=163995 width=8) (actual time=0.017..14.972 rows=131196 loops=3)
-> Hash (cost=143.56..143.56 rows=6056 width=16) (actual time=2.686..2.686 rows=6057 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 353kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=16) (actual time=0.011..1.305 rows=6057 loops=3)
Planning time: 0.710 ms
Execution time: 57.276 ms
(15 rows)
Changing the OR to a UNION:
db1=# explain analyze select count(*) from (
select *
from conversations
join agents on conversations.agent_id=agents.id
where conversations.installation_id=1
union
select *
from conversations
join agents on conversations.agent_id=agents.id
where agents.source_id = '123') as subquery;
Query plan:
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1114.31..1114.32 rows=1 width=8) (actual time=8.038..8.038 rows=1 loops=1)
-> HashAggregate (cost=1091.90..1101.86 rows=996 width=1437) (actual time=7.783..8.009 rows=390 loops=1)
        Group Key: conversations.id, conversations.created, conversations.modified, conversations.source_created, conversations.source_id, conversations.installation_id, brain_conversation.resolution_reason, conversations.solve_time, conversations.agent_id, conversations.submission_reason, conversations.is_marked_as_duplicate, conversations.num_back_and_forths, conversations.is_closed, conversations.is_solved, conversations.conversation_type, conversations.related_ticket_source_id, conversations.channel, brain_conversation.last_updated_from_platform, conversations.csat, agents.id, agents.created, agents.modified, agents.name, agents.source_id, organization_agent.installation_id, agents.settings
-> Append (cost=219.68..1027.16 rows=996 width=1437) (actual time=5.517..6.307 rows=390 loops=1)
-> Hash Join (cost=219.68..649.69 rows=931 width=224) (actual time=5.516..6.063 rows=390 loops=1)
Hash Cond: (conversations.agent_id = agents.id)
-> Index Scan using conversations_installation_id_b3ff5c00 on conversations (cost=0.42..427.98 rows=931 width=154) (actual time=0.039..0.344 rows=879 loops=1)
Index Cond: (installation_id = 1)
-> Hash (cost=143.56..143.56 rows=6056 width=70) (actual time=5.394..5.394 rows=6057 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 710kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=70) (actual time=0.014..1.938 rows=6057 loops=1)
-> Nested Loop (cost=0.70..367.52 rows=65 width=224) (actual time=0.210..0.211 rows=0 loops=1)
-> Index Scan using agents_source_id_106c8103_like on agents agents_1 (cost=0.28..8.30 rows=1 width=70) (actual time=0.210..0.210 rows=0 loops=1)
Index Cond: ((source_id)::text = '123'::text)
-> Index Scan using conversations_agent_id_de76554b on conversations conversations_1 (cost=0.42..358.12 rows=110 width=154) (never executed)
Index Cond: (agent_id = agents_1.id)
Planning time: 2.024 ms
Execution time: 9.367 ms
(18 rows)
Yes. OR has a way of killing query performance. For this query:
select count(*)
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1 or a.id = 123;
Note that I removed the quotes around 123; it looks like a number, so I assume it is. For this query, you want an index on conversations(agent_id).
Probably the most effective way to write the query is:
select count(*)
from ((select 1
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1
) union all
(select 1
from conversations c join
agents a
on c.agent_id = a.id
where a.id = 123 and c.id <> 1
)
) ac;
Note the use of union all rather than union. The additional where condition (c.id <> 1) eliminates duplicates between the two branches.
This can take advantage of the following indexes (sketched as DDL below):
conversations(id, agent_id)
agents(id)
conversations(agent_id, id)
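As a sketch, that DDL could look like this (index names are illustrative; if agents.id is the primary key, it already has an index):

CREATE INDEX conversations_id_agent_id_idx ON conversations (id, agent_id);
CREATE INDEX conversations_agent_id_id_idx ON conversations (agent_id, id);
CREATE INDEX agents_id_idx ON agents (id);  -- redundant if id is the primary key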
I want to check whether the order of JOINs in a SQL query matters for runtime and efficiency.
I'm using PostgreSQL, and for my check I used the example world database from MySQL (https://downloads.mysql.com/docs/world.sql.zip) and wrote these two statements:
Query 1:
EXPLAIN ANALYSE SELECT * FROM countrylanguage
JOIN city ON city.countrycode = countrylanguage.countrycode
JOIN country c ON city.countrycode = c.code
Query 2:
EXPLAIN ANALYSE SELECT * FROM city
JOIN country c ON c.code = city.countrycode
JOIN countrylanguage c2 on c.code = c2.countrycode
Query Plan 1:
Hash Join (cost=41.14..484.78 rows=29946 width=161) (actual time=1.472..17.602 rows=30670 loops=1)
Hash Cond: (city.countrycode = countrylanguage.countrycode)
-> Seq Scan on city (cost=0.00..72.79 rows=4079 width=31) (actual time=0.062..1.220 rows=4079 loops=1)
-> Hash (cost=28.84..28.84 rows=984 width=130) (actual time=1.378..1.378 rows=984 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 172kB
-> Hash Join (cost=10.38..28.84 rows=984 width=130) (actual time=0.267..0.823 rows=984 loops=1)
Hash Cond: (countrylanguage.countrycode = c.code)
-> Seq Scan on countrylanguage (cost=0.00..15.84 rows=984 width=17) (actual time=0.029..0.158 rows=984 loops=1)
-> Hash (cost=7.39..7.39 rows=239 width=113) (actual time=0.220..0.220 rows=239 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 44kB
-> Seq Scan on country c (cost=0.00..7.39 rows=239 width=113) (actual time=0.013..0.137 rows=239 loops=1)
Planning Time: 3.818 ms
Execution Time: 18.801 ms
Query Plan 2:
Hash Join (cost=41.14..312.47 rows=16794 width=161) (actual time=2.415..18.628 rows=30670 loops=1)
Hash Cond: (city.countrycode = c.code)
-> Seq Scan on city (cost=0.00..72.79 rows=4079 width=31) (actual time=0.032..0.574 rows=4079 loops=1)
-> Hash (cost=28.84..28.84 rows=984 width=130) (actual time=2.364..2.364 rows=984 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 171kB
-> Hash Join (cost=10.38..28.84 rows=984 width=130) (actual time=0.207..1.307 rows=984 loops=1)
Hash Cond: (c2.countrycode = c.code)
-> Seq Scan on countrylanguage c2 (cost=0.00..15.84 rows=984 width=17) (actual time=0.027..0.204 rows=984 loops=1)
-> Hash (cost=7.39..7.39 rows=239 width=113) (actual time=0.163..0.163 rows=239 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 44kB
-> Seq Scan on country c (cost=0.00..7.39 rows=239 width=113) (actual time=0.015..0.049 rows=239 loops=1)
Planning Time: 1.901 ms
Execution Time: 19.694 ms
The estimated costs and rows differ, and the last hash condition is different. Does this mean the query planner hasn't done the same thing for both queries, or am I on the wrong track?
Thanks for your help!
The issue is not the order of the joins but that the join conditions are different -- referring to different tables.
In the first query, you are joining to countrylanguage using the country code from city. In the second, you are using the country code from country.
With inner joins this should not make a difference to the final result. However, it clearly does affect how the optimizer considers different paths.
(As said) the queries are not identical.
Though not identical, the plans are comparable.
Both queries execute in ~18 ms, so comparing them is close to useless.
Queries on tables with insufficient structure (keys, indexes, statistics), but with a small enough footprint (work_mem), will always result in hash joins; a quick way to test that is sketched below.
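A hedged way to probe that last claim on the world database: shrink work_mem for the session and re-run the plan. This is only an experiment, not a tuning recommendation:

SET work_mem = '64kB';  -- the allowed minimum; makes the hash table expensive,
                        -- nudging the planner toward other join strategies
EXPLAIN ANALYSE SELECT * FROM city
JOIN country c ON c.code = city.countrycode
JOIN countrylanguage c2 ON c.code = c2.countrycode;
RESET work_mem;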
I have the following query:
SELECT
Sum(fact_individual_re.quality_hours) AS C0,
dim_gender.name AS C1,
dim_date.year AS C2
FROM
fact_individual_re
INNER JOIN dim_date ON fact_individual_re.dim_date_id = dim_date.id
INNER JOIN dim_gender ON fact_individual_re.dim_gender_id = dim_gender.id
GROUP BY dim_date.year,dim_gender.name
ORDER BY dim_date.year ASC,dim_gender.name ASC,Sum(fact_individual_re.quality_hours) ASC
When I EXPLAIN its plan, the hash join takes most of the time. Is there any way to minimize the hash join time?
Sort (cost=190370.50..190370.55 rows=20 width=18) (actual time=4005.152..4005.154 rows=20 loops=1)
Sort Key: dim_date.year, dim_gender.name, (sum(fact_individual_re.quality_hours))
Sort Method: quicksort Memory: 26kB
-> Finalize GroupAggregate (cost=190369.07..190370.07 rows=20 width=18) (actual time=4005.106..4005.135 rows=20 loops=1)
Group Key: dim_date.year, dim_gender.name
-> Sort (cost=190369.07..190369.27 rows=80 width=18) (actual time=4005.100..4005.103 rows=100 loops=1)
Sort Key: dim_date.year, dim_gender.name
Sort Method: quicksort Memory: 32kB
-> Gather (cost=190358.34..190366.54 rows=80 width=18) (actual time=4004.966..4005.020 rows=100 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Partial HashAggregate (cost=189358.34..189358.54 rows=20 width=18) (actual time=3885.254..3885.259 rows=20 loops=5)
Group Key: dim_date.year, dim_gender.name
-> Hash Join (cost=125.17..170608.34 rows=2500000 width=14) (actual time=2.279..2865.808 rows=2000000 loops=5)
Hash Cond: (fact_individual_re.dim_gender_id = dim_gender.id)
-> Hash Join (cost=124.13..150138.54 rows=2500000 width=12) (actual time=2.060..2115.234 rows=2000000 loops=5)
Hash Cond: (fact_individual_re.dim_date_id = dim_date.id)
-> Parallel Seq Scan on fact_individual_re (cost=0.00..118458.00 rows=2500000 width=12) (actual time=0.204..982.810 rows=2000000 loops=5)
-> Hash (cost=78.50..78.50 rows=3650 width=8) (actual time=1.824..1.824 rows=3650 loops=5)
Buckets: 4096 Batches: 1 Memory Usage: 175kB
-> Seq Scan on dim_date (cost=0.00..78.50 rows=3650 width=8) (actual time=0.143..1.030 rows=3650 loops=5)
-> Hash (cost=1.02..1.02 rows=2 width=10) (actual time=0.193..0.193 rows=2 loops=5)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on dim_gender (cost=0.00..1.02 rows=2 width=10) (actual time=0.181..0.182 rows=2 loops=5)
Planning time: 0.609 ms
Execution time: 4020.423 ms
(26 rows)
I am using postgresql v10.
I'd recommend partially grouping the rows before the join:
select
sum(quality_hours_sum) AS C0,
dim_gender.name AS C1,
dim_date.year AS C2
from
(
select
sum(quality_hours) as quality_hours_sum,
dim_date_id,
dim_gender_id
from fact_individual_re
group by dim_date_id, dim_gender_id
) as fact_individual_re_sum
join dim_date on dim_date_id = dim_date.id
join dim_gender on dim_gender_id = dim_gender.id
group by dim_date.year, dim_gender.name
order by dim_date.year, dim_gender.name, sum(quality_hours_sum);
This way you will be joining only 1460 rows (count(distinct dim_date_id) * count(distinct dim_gender_id)) instead of all 2M rows. It would still need to read and group all 2M rows, though; to avoid that you'd need something like a summary table maintained by a trigger, sketched below.
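A minimal sketch of that summary-table idea, assuming inserts are the only write path (all names here are made up; UPDATE and DELETE on the fact table would need matching triggers):

-- Pre-aggregated totals, one row per (date, gender) pair.
CREATE TABLE fact_individual_re_sum (
    dim_date_id       int    NOT NULL,
    dim_gender_id     int    NOT NULL,
    quality_hours_sum bigint NOT NULL,
    PRIMARY KEY (dim_date_id, dim_gender_id)
);

-- On every insert into the fact table, upsert the matching total.
CREATE FUNCTION fact_individual_re_sum_upd() RETURNS trigger AS $$
BEGIN
    INSERT INTO fact_individual_re_sum AS s (dim_date_id, dim_gender_id, quality_hours_sum)
    VALUES (NEW.dim_date_id, NEW.dim_gender_id, NEW.quality_hours)
    ON CONFLICT (dim_date_id, dim_gender_id)
    DO UPDATE SET quality_hours_sum = s.quality_hours_sum + EXCLUDED.quality_hours_sum;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fact_individual_re_sum_upd
AFTER INSERT ON fact_individual_re
FOR EACH ROW EXECUTE PROCEDURE fact_individual_re_sum_upd();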
There is no predicate shown on the fact table, so we can assume that, prior to the filtering via the joins, 100% of that table is required.
The indexes exist on the lookup tables but, from what you say, are not covering indexes. Given that 100% of the fact table is scanned, combined with the indexes not being covering, I would expect a hash join.
As an experiment, you could apply a covering index (index dim_date.id and dim_date.year in a single index) to see if it swaps off a hash join against dim_date.
Given the overall lack of predicates, though, a hash join is not necessarily the wrong query plan, even without a covering index.
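That experiment could look like this (the index name is illustrative; PostgreSQL 10 has no INCLUDE clause, so a plain composite index plays the covering role):

CREATE INDEX dim_date_id_year_idx ON dim_date (id, year);
ANALYZE dim_date;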
I have the following two queries. Query 1 is fast since it uses the indexes (a merge join over two index scans, as the plan shows), while Query 2 uses a hash join and is slower.
Query 1 orders by a column of table 1, and Query 2 orders by a column of table 2.
Query 1
learning=# explain analyze
select *
from users left join
access_logs
on users.userid = access_logs.userid
order by users.userid
limit 10 offset 90;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=14.00..15.46 rows=10 width=104) (actual time=1.330..1.504 rows=10 loops=1)
-> Merge Left Join (cost=0.85..291532.97 rows=1995958 width=104) (actual time=0.037..1.482 rows=100 loops=1)
Merge Cond: (users.userid = access_logs.userid)
-> Index Scan using users_pkey on users (cost=0.43..151132.75 rows=1995958 width=76) (actual time=0.018..1.135 rows=100 loops=1)
-> Index Scan using access_logs_userid_idx on access_logs (cost=0.43..110471.45 rows=1995958 width=28) (actual time=0.012..0.198 rows=100 loops=1)
Planning time: 0.469 ms
Execution time: 1.569 ms
Query 2
learning=# explain analyze
select *
from users left join
access_logs
on users.userid = access_logs.userid
order by access_logs.userid
limit 10 offset 90;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=293584.20..293584.23 rows=10 width=104) (actual time=3821.432..3821.439 rows=10 loops=1)
-> Sort (cost=293583.98..298573.87 rows=1995958 width=104) (actual time=3821.391..3821.415 rows=100 loops=1)
Sort Key: access_logs.userid
Sort Method: top-N heapsort Memory: 51kB
-> Hash Left Join (cost=73231.06..217299.90 rows=1995958 width=104) (actual time=539.859..3168.754 rows=1995958 loops=1)
Hash Cond: (users.userid = access_logs.userid)
-> Seq Scan on users (cost=0.00..44814.58 rows=1995958 width=76) (actual time=0.009..443.260 rows=1995958 loops=1)
-> Hash (cost=34636.58..34636.58 rows=1995958 width=28) (actual time=539.112..539.112 rows=1995958 loops=1)
Buckets: 262144 Batches: 2 Memory Usage: 58532kB
-> Seq Scan on access_logs (cost=0.00..34636.58 rows=1995958 width=28) (actual time=0.006..170.061 rows=1995958 loops=1)
Planning time: 0.480 ms
Execution time: 3832.245 ms
Questions
The second query is slow because, as the plan shows, the entire join result has to be sorted before the limit can be applied.
Why does the sort on the second table's column not use the index? A plan with just the sort is below.
Query - explain analyze select * from access_logs order by userid limit 10 offset 90;
Plan
Limit (cost=5.41..5.96 rows=10 width=28) (actual time=0.199..0.218 rows=10 loops=1)
-> Index Scan using access_logs_userid_idx on access_logs (cost=0.43..110471.45 rows=1995958 width=28) (actual time=0.029..0.201 rows=100 loops=1)
Planning time: 0.120 ms
Execution time: 0.252 ms
Edit 1:
My goal is not to compare the two queries; in fact, I want the result of query 2. I only provided query 1 for comparison, to help me understand.
The order by is not limited to the join column; the user can also order by another column of table 2. The plan is below.
learning=# explain analyze select * from users left join access_logs on users.userid=access_logs.userid order by access_logs.last_login limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=260431.83..260431.86 rows=10 width=104) (actual time=3846.625..3846.627 rows=10 loops=1)
-> Sort (cost=260431.83..265421.73 rows=1995958 width=104) (actual time=3846.623..3846.623 rows=10 loops=1)
Sort Key: access_logs.last_login
Sort Method: top-N heapsort Memory: 27kB
-> Hash Left Join (cost=73231.06..217299.90 rows=1995958 width=104) (actual time=567.104..3174.818 rows=1995958 loops=1)
Hash Cond: (users.userid = access_logs.userid)
-> Seq Scan on users (cost=0.00..44814.58 rows=1995958 width=76) (actual time=0.007..443.364 rows=1995958 loops=1)
-> Hash (cost=34636.58..34636.58 rows=1995958 width=28) (actual time=566.814..566.814 rows=1995958 loops=1)
Buckets: 262144 Batches: 2 Memory Usage: 58532kB
-> Seq Scan on access_logs (cost=0.00..34636.58 rows=1995958 width=28) (actual time=0.004..169.137 rows=1995958 loops=1)
Planning time: 0.490 ms
Execution time: 3857.171 ms
The sort in the second query would not use the index because the index is not guaranteed to contain all the values being sorted. If some records in users are not matched by access_logs, the left join generates NULL values that the query references as access_logs.userid but that are not actually present in access_logs, and thus not covered by the index.
A workaround can be to create a default initial record in access_logs for each user and use an inner join; a sketch follows.
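A minimal sketch of that workaround, assuming access_logs has no other NOT NULL columns without defaults (column names taken from the plans above; note this changes the data):

-- Give every user at least one access_logs row.
INSERT INTO access_logs (userid)
SELECT u.userid
FROM users u
WHERE NOT EXISTS (SELECT 1 FROM access_logs a WHERE a.userid = u.userid);

-- An inner join can then stream rows in index order.
SELECT *
FROM users
JOIN access_logs ON users.userid = access_logs.userid
ORDER BY access_logs.userid
LIMIT 10 OFFSET 90;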
Consider two PostgreSQL tables:
Table #1
id INT
secret_id INT
title VARCHAR
Table #2
id INT
secret_id INT
I need to select all records from Table #1, excluding those whose secret_id appears in Table #2.
The following query is very slow with 1,000,000 records in Table #1 and 500,000 in Table #2:
select * from table_1 where secret_id not in (select secret_id from table_2);
What is the best way to achieve this?
FWIW, I tested Daniel Lyons's and Craig Ringer's suggestions made in comments above. Here are the results for my particular case (~500k rows per table), ordered by efficiency (most efficient first).
ANTI-JOIN:
> EXPLAIN ANALYZE SELECT * FROM table1 t1 LEFT JOIN table2 t2 ON t1.secret_id=t2.secret_id WHERE t2.secret_id IS NULL;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=19720.19..56129.91 rows=21 width=28) (actual time=139.868..459.991 rows=142993 loops=1)
Hash Cond: (t1.secret_id = t2.secret_id)
-> Seq Scan on table1 t1 (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.005..61.913 rows=622338 loops=1)
-> Hash (cost=10849.75..10849.75 rows=510275 width=14) (actual time=138.176..138.176 rows=510275 loops=1)
Buckets: 4096 Batches: 32 Memory Usage: 777kB
-> Seq Scan on table2 t2 (cost=0.00..10849.75 rows=510275 width=14) (actual time=0.018..47.005 rows=510275 loops=1)
Total runtime: 466.748 ms
(7 rows)
NOT EXISTS:
> EXPLAIN ANALYZE SELECT * FROM table1 t1 WHERE NOT EXISTS (SELECT secret_id FROM table2 t2 WHERE t2.secret_id=t1.secret_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=19222.19..55133.91 rows=21 width=14) (actual time=181.881..517.632 rows=142993 loops=1)
Hash Cond: (t1.secret_id = t2.secret_id)
-> Seq Scan on table1 t1 (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.005..70.478 rows=622338 loops=1)
-> Hash (cost=10849.75..10849.75 rows=510275 width=4) (actual time=179.665..179.665 rows=510275 loops=1)
Buckets: 4096 Batches: 32 Memory Usage: 592kB
-> Seq Scan on table2 t2 (cost=0.00..10849.75 rows=510275 width=4) (actual time=0.019..78.074 rows=510275 loops=1)
Total runtime: 524.300 ms
(7 rows)
EXCEPT:
> EXPLAIN ANALYZE SELECT * FROM table1 EXCEPT (SELECT t1.* FROM table1 t1 join table2 t2 ON t1.secret_id=t2.secret_id);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
SetOp Except (cost=1524985.53..1619119.03 rows=62261 width=14) (actual time=16926.056..19850.915 rows=142993 loops=1)
-> Sort (cost=1524985.53..1543812.23 rows=7530680 width=14) (actual time=16925.010..18596.860 rows=6491408 loops=1)
Sort Key: "*SELECT* 1".secret_id, "*SELECT* 1".jeu, "*SELECT* 1".combinaison, "*SELECT* 1".gains
Sort Method: external merge Disk: 185232kB
-> Append (cost=0.00..278722.63 rows=7530680 width=14) (actual time=0.007..2951.920 rows=6491408 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..19275.12 rows=622606 width=14) (actual time=0.007..176.892 rows=622338 loops=1)
-> Seq Scan on table1 (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.005..69.842 rows=622338 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=19222.19..259447.51 rows=6908074 width=14) (actual time=168.529..2228.335 rows=5869070 loops=1)
-> Hash Join (cost=19222.19..190366.77 rows=6908074 width=14) (actual time=168.528..1450.663 rows=5869070 loops=1)
Hash Cond: (t1.secret_id = t2.secret_id)
-> Seq Scan on table1 t1 (cost=0.00..13049.06 rows=622606 width=14) (actual time=0.002..64.554 rows=622338 loops=1)
-> Hash (cost=10849.75..10849.75 rows=510275 width=4) (actual time=168.329..168.329 rows=510275 loops=1)
Buckets: 4096 Batches: 32 Memory Usage: 592kB
-> Seq Scan on table2 t2 (cost=0.00..10849.75 rows=510275 width=4) (actual time=0.017..72.702 rows=510275 loops=1)
Total runtime: 19896.445 ms
(15 rows)
NOT IN:
> EXPLAIN SELECT * FROM table1 WHERE secret_id NOT IN (SELECT secret_id FROM table2);
QUERY PLAN
-----------------------------------------------------------------------------------------
Seq Scan on table1 (cost=0.00..5189688549.26 rows=311303 width=14)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..15395.12 rows=510275 width=4)
-> Seq Scan on table2 (cost=0.00..10849.75 rows=510275 width=4)
(5 rows)
I did not run EXPLAIN ANALYZE on the last one because it took ages.
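The reason NOT IN fares so much worse is its NULL semantics: a single NULL in the subquery result makes every x NOT IN (...) test non-true, so the planner cannot rewrite it as the anti-join the other variants get. A one-line illustration:

SELECT 1 NOT IN (VALUES (2), (NULL));  -- yields NULL, not true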