SQL: ON vs. WHERE in sub JOIN - sql

What is the difference between using ON and WHERE in a sub join when using an outer reference?
Consider these two SQL statements as an example (looking for 10 persons with not closed tasks, using person_task with a many-to-many relationship):
select p.name
from person p
where exists (
select 1
from person_task pt
join task t on pt.task_id = t.id
and t.state <> 'closed'
and pt.person_id = p.id -- ON
)
limit 10
select p.name
from person p
where exists (
select 1
from person_task pt
join task t on pt.task_id = t.id and t.state <> 'closed'
where pt.person_id = p.id -- WHERE
)
limit 10
They produce the same result but the statement with ON is considerably faster.
Here the corresponding EXPLAIN (ANALYZE) statements:
-- USING ON
Limit (cost=0.00..270.98 rows=10 width=8) (actual time=10.412..60.876 rows=10 loops=1)
-> Seq Scan on person p (cost=0.00..28947484.16 rows=1068266 width=8) (actual time=10.411..60.869 rows=10 loops=1)
Filter: (SubPlan 1)
Rows Removed by Filter: 68
SubPlan 1
-> Nested Loop (cost=1.00..20257.91 rows=1632 width=0) (actual time=0.778..0.778 rows=0 loops=78)
-> Index Scan using person_taskx1 on person_task pt (cost=0.56..6551.27 rows=1632 width=8) (actual time=0.633..0.633 rows=0 loops=78)
Index Cond: (id = p.id)
-> Index Scan using taskxpk on task t (cost=0.44..8.40 rows=1 width=8) (actual time=1.121..1.121 rows=1 loops=10)
Index Cond: (id = pt.task_id)
Filter: (state <> 'open')
Planning Time: 0.466 ms
Execution Time: 60.920 ms
-- USING WHERE
Limit (cost=2818814.57..2841563.37 rows=10 width=8) (actual time=29.075..6884.259 rows=10 loops=1)
-> Merge Semi Join (cost=2818814.57..59308642.64 rows=24832 width=8) (actual time=29.075..6884.251 rows=10 loops=1)
Merge Cond: (p.id = pt.person_id)
-> Index Scan using personxpk on person p (cost=0.43..1440340.27 rows=2136533 width=16) (actual time=0.003..0.168 rows=18 loops=1)
-> Gather Merge (cost=1001.03..57357094.42 rows=40517669 width=8) (actual time=9.441..6881.180 rows=23747 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=1.00..52679350.05 rows=16882362 width=8) (actual time=1.862..4207.577 rows=7938 loops=3)
-> Parallel Index Scan using person_taskx1 on person_task pt (cost=0.56..25848782.35 rows=16882362 width=16) (actual time=1.344..1807.664 rows=7938 loops=3)
-> Index Scan using taskxpk on task t (cost=0.44..1.59 rows=1 width=8) (actual time=0.301..0.301 rows=1 loops=23814)
Index Cond: (id = pt.task_id)
Filter: (state <> 'open')
Planning Time: 0.430 ms
Execution Time: 6884.349 ms
Should therefore always the ON statement be used for filtering values in a sub JOIN? Or what is going on?
I have used Postgres for this example.

The condition and pt.person_id = p.id doesn't refer to any column of the joined table t. In an inner join this doesn't make much sense semantically and we can move this condition from ON to WHERE to get the query more readable.
You are right, hence, that the two queries are equivalent and should result in the same execution plan. As this is not the case, PostgreSQL seems to have a problem here with their optimizer.
In an outer join such a condition in ON can make sense and would be different from WHERE. I assume that this is the reason for the optimizer finding a different plan for ON in general. Once it detects the condition in ON it goes another route, oblivious of the join type (so my assumption). I am surprised though, that this leads to a better plan; I'd rather expect a worse plan.
This may indicate that the table's statistics are not up-to-date. Please analyze the tables to make sure. Or it may be a sore spot in the optimizer code PostgreSQL developers might want to work on.

Related

How to improve the performance of queries when joining a lot of huge tables?

This is my SQL script, I have to join 7 tables
SELECT concat_ws('-', it.item_id, it.model_id) AS product_id,
concat_ws('-', aip.partner_item_id, aip.partner_model_id) AS product_reseller_id,
i.name as item_name,
im.name AS model_name,
p.partner_code,
sum(it.quantity) AS transfer_total,
sum(isb.remaining_item) as remaining_stock,
sum(isb.sold_item) as partner_sold
FROM transfer t
INNER JOIN partner p ON p.reseller_store_id = t.reseller_store_id
INNER JOIN item_transfer it ON t.id = it.transfer_id
INNER JOIN item i ON i.id = it.item_id
INNER JOIN item_model im ON it.model_id = im.id
INNER JOIN affiliate_item_mapping aip on it.item_id = aip.seller_item_id and it.model_id = aip.seller_model_id
and t.reseller_store_id = aip.reseller_store_id
LEFT JOIN inventory_summary_branch isb on isb.inventory_summary_id = concat_ws('-', aip.partner_item_id, aip.partner_model_id)
WHERE p.store_id = 9805
GROUP BY it.item_id, it.model_id, p.partner_code, i.id, im.id, aip.id, isb.inventory_summary_id
This is the result of SQL EXPLAIN:
GroupAggregate (cost=13861.57..13861.62 rows=1 width=885) (actual time=1890.392..1890.525 rows=15 loops=1)
Group Key: it.item_id, it.model_id, p.partner_code, i.id, im.id, aip.id, isb.inventory_summary_id
Buffers: shared hit=118610
-> Sort (cost=13861.57..13861.58 rows=1 width=765) (actual time=1890.310..1890.338 rows=21 loops=1)
Sort Key: it.item_id, it.model_id, p.partner_code, aip.id, isb.inventory_summary_id
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=118610
-> Nested Loop (cost=1.27..13861.56 rows=1 width=765) (actual time=73.156..1890.057 rows=21 loops=1)
Buffers: shared hit=118610
-> Nested Loop (cost=0.85..13853.14 rows=1 width=753) (actual time=73.134..1889.495 rows=21 loops=1)
Buffers: shared hit=118526
-> Nested Loop (cost=0.43..13845.32 rows=1 width=609) (actual time=73.099..1888.733 rows=21 loops=1)
Join Filter: ((p.reseller_store_id = t.reseller_store_id) AND (it.transfer_id = t.id))
Rows Removed by Join Filter: 2142
Buffers: shared hit=118442
-> Nested Loop (cost=0.43..13840.24 rows=1 width=633) (actual time=72.793..1879.961 rows=21 loops=1)
Join Filter: ((aip.seller_item_id = it.item_id) AND (aip.seller_model_id = it.model_id))
Rows Removed by Join Filter: 6003
Buffers: shared hit=118379
-> Nested Loop Left Join (cost=0.43..13831.47 rows=1 width=601) (actual time=72.093..1861.415 rows=24 loops=1)
Buffers: shared hit=118307
-> Nested Loop (cost=0.00..11.44 rows=1 width=572) (actual time=0.042..0.696 rows=24 loops=1)
Join Filter: (p.reseller_store_id = aip.reseller_store_id)
Rows Removed by Join Filter: 150
Buffers: shared hit=7
-> Seq Scan on partner p (cost=0.00..10.38 rows=1 width=524) (actual time=0.026..0.039 rows=6 loops=1)
Filter: (store_id = 9805)
Buffers: shared hit=1
-> Seq Scan on affiliate_item_mapping aip (cost=0.00..1.03 rows=3 width=48) (actual time=0.006..0.043 rows=29 loops=6)
Buffers: shared hit=6
-> Index Scan using branch_id_inventory_summary_id_inventory_summary_branch on inventory_summary_branch isb (cost=0.43..13820.01 rows=1 width=29) (actual time=77.498..77.498 rows=0 loops=24)
Index Cond: ((inventory_summary_id)::text = concat_ws('-'::text, aip.partner_item_id, aip.partner_model_id))
Buffers: shared hit=118300
-> Seq Scan on item_transfer it (cost=0.00..5.31 rows=231 width=32) (actual time=0.024..0.391 rows=251 loops=24)
Buffers: shared hit=72
-> Seq Scan on transfer t (cost=0.00..3.83 rows=83 width=16) (actual time=0.011..0.256 rows=103 loops=21)
Buffers: shared hit=63
-> Index Scan using pk_item on item i (cost=0.42..7.81 rows=1 width=152) (actual time=0.022..0.023 rows=1 loops=21)
Index Cond: (id = it.item_id)
Buffers: shared hit=84
-> Index Scan using pk_item_model on item_model im (cost=0.43..8.41 rows=1 width=20) (actual time=0.016..0.018 rows=1 loops=21)
Index Cond: (id = it.model_id)
Buffers: shared hit=84
Planning time: 10.051 ms
Execution time: 1890.943 ms
Of course, this statement works fine, but it's slow. Is there a better way to write this code?
How can I improve the performance? Join or sub-query is better in this case? Anyone, please give me a hand
2 things can help you
do VACCUME ANALYZE for all the tables involved.
create indexe on item_transfer.item_id & model_id
Essentially all of your time (77.498*24) is spend on the index scan of branch_id_inventory_summary_id_inventory_summary_branch.
About the only explanation I can see for this is that the index isn't suited to the query, and it is being full-index scanned (in lieu of full scanning the table), rather than being efficiently scanned. This probably means the index includes the column inventory_summary_id, but it is not the leading column. (It would be nice if EXPLAIN were to make this inefficient type of usage clearer than it currently does).
You would probably benefit from an index such as on inventory_summary_branch (inventory_summary_id) which has a better chance of being used efficiently.
I don't know why it wouldn't just do a hash join of that table. Maybe your work_mem is too low.
Inner joins will always be slower, especially with so many tables.
You could change from an inner join on the whole table to just the columns you need and see if that improves it at all:
From:
INNER JOIN partner p ON p.reseller_store_id = t.reseller_store_id
To:
inner join (select id, partner_code from partner) as p ON p.reseller_store_id = t.reseller_store_id
See if that speeds things up at all.
If not I would recommend indexes on the keys

Postgres: Slow query when using OR statement in a join query

We run a join query between 2 tables.
The query has an OR statement that compares one column from the left table and one column from the right table. The query performance is very low, and we fixed it by changing the OR to UNION.
Why is this happening? I'm looking for a detailed explanation or a reference to the documentation that might shed a light on the issue.
Query with Or Statment:
db1=# explain analyze select count(*)
from conversations
join agents on conversations.agent_id=agents.id
where conversations.id=1 or agents.id = '123';
**Query plan**
----------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=**11017.95..11017.96** rows=1 width=8) (actual time=54.088..54.088 rows=1 loops=1)
-> Gather (cost=11017.73..11017.94 rows=2 width=8) (actual time=53.945..57.181 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=10017.73..10017.74 rows=1 width=8) (actual time=48.303..48.303 rows=1 loops=3)
-> Hash Join (cost=219.26..10016.69 rows=415 width=0) (actual time=5.292..48.287 rows=130 loops=3)
Hash Cond: (conversations.agent_id = agents.id)
Join Filter: ((conversations.id = 1) OR ((agents.id)::text = '123'::text))
Rows Removed by Join Filter: 80035
-> Parallel Seq Scan on conversations (cost=0.00..9366.95 rows=163995 width=8) (actual time=0.017..14.972 rows=131196 loops=3)
-> Hash (cost=143.56..143.56 rows=6056 width=16) (actual time=2.686..2.686 rows=6057 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 353kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=16) (actual time=0.011..1.305 rows=6057 loops=3)
Planning time: 0.710 ms
Execution time: 57.276 ms
(15 rows)
Changing the OR to UNION:
db1=# explain analyze select count(*) from (
select *
from conversations
join agents on conversations.agent_id=agents.id
where conversations.installation_id=1
union
select *
from conversations
join agents on conversations.agent_id=agents.id
where agents.source_id = '123') as subquery;
**Query plan:**
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (**cost=1114.31..1114.32** rows=1 width=8) (actual time=8.038..8.038 rows=1 loops=1)
-> HashAggregate (cost=1091.90..1101.86 rows=996 width=1437) (actual time=7.783..8.009 rows=390 loops=1)
Group Key: conversations.id, conversations.created, conversations.modified, conversations.source_created, conversations.source_id, conversations.installation_id, bra
in_conversation.resolution_reason, conversations.solve_time, conversations.agent_id, conversations.submission_reason, conversations.is_marked_as_duplicate, conversations.n
um_back_and_forths, conversations.is_closed, conversations.is_solved, conversations.conversation_type, conversations.related_ticket_source_id, conversations.channel, brain_convers
ation.last_updated_from_platform, conversations.csat, agents.id, agents.created, agents.modified, agents.name, agents.source_id, organizati
on_agent.installation_id, agents.settings
-> Append (cost=219.68..1027.16 rows=996 width=1437) (actual time=5.517..6.307 rows=390 loops=1)
-> Hash Join (cost=219.68..649.69 rows=931 width=224) (actual time=5.516..6.063 rows=390 loops=1)
Hash Cond: (conversations.agent_id = agents.id)
-> Index Scan using conversations_installation_id_b3ff5c00 on conversations (cost=0.42..427.98 rows=931 width=154) (actual time=0.039..0.344 rows=879 loops=1)
Index Cond: (installation_id = 1)
-> Hash (cost=143.56..143.56 rows=6056 width=70) (actual time=5.394..5.394 rows=6057 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 710kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=70) (actual time=0.014..1.938 rows=6057 loops=1)
-> Nested Loop (cost=0.70..367.52 rows=65 width=224) (actual time=0.210..0.211 rows=0 loops=1)
-> Index Scan using agents_source_id_106c8103_like on agents agents_1 (cost=0.28..8.30 rows=1 width=70) (actual time=0.210..0.210 rows=0 loops=1)
Index Cond: ((source_id)::text = '123'::text)
-> Index Scan using conversations_agent_id_de76554b on conversations conversations_1 (cost=0.42..358.12 rows=110 width=154) (never executed)
Index Cond: (agent_id = agents_1.id)
Planning time: 2.024 ms
Execution time: 9.367 ms
(18 rows)
Yes. or has a way of killing the performance of queries. For this query:
select count(*)
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1 or a.id = 123;
Note I removed the quotes around 123. It looks like a number so I assume it is. For this query, you want an index on conversations(agent_id).
Probably the most effective way to write the query is:
select count(*)
from ((select 1
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1
) union all
(select 1
from conversations c join
agents a
on c.agent_id = a.id
where a.id = 123 and c.id <> 1
)
) ac;
Note the use of union all rather than union. The additional where condition eliminates duplicates.
This can take advantage of the following indexes:
conversations(id, agent_id)
agents(id)
conversations(agent_id, id)

Why does a lateral joins with LIMIT increase execution time?

when I run a query with a lateral join and a LIMIT inside, it uses nested a loop join. But when I remove the LIMIT it uses a Hash Right Join. Why?
EXPLAIN ANALYSE
SELECT proxy.*
FROM jobs
LEFT OUTER JOIN LATERAL (
SELECT proxy.*
FROM proxy
WHERE jobs.id = proxy.job_id
) proxy ON true
Hash Right Join (cost=2075.47..3029.05 rows=34688 width=12) (actual time=9.951..24.758 rows=35212 loops=1)
Hash Cond: (proxy.job_id = jobs.id)
-> Seq Scan on proxy (cost=0.00..524.15 rows=34015 width=12) (actual time=0.011..2.502 rows=34028 loops=1)
-> Hash (cost=1641.87..1641.87 rows=34688 width=4) (actual time=9.842..9.842 rows=34689 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 1732kB
-> Index Only Scan using jobs_pkey on jobs (cost=0.29..1641.87 rows=34688 width=4) (actual time=0.010..4.904 rows=34689 loops=1)
Heap Fetches: 921
But when I add limits to the query, the actual time jumps from 24 to 150:
EXPLAIN ANALYSE
SELECT proxy.*
FROM jobs
LEFT OUTER JOIN LATERAL (
SELECT proxy.*
FROM proxy
WHERE jobs.id = proxy.job_id
limit 1
) proxy ON true
Nested Loop Left Join (cost=0.58..290506.19 rows=34688 width=12) (actual time=0.024..155.753 rows=34689 loops=1)
-> Index Only Scan using jobs_pkey on jobs (cost=0.29..1641.87 rows=34688 width=4) (actual time=0.014..3.984 rows=34689 loops=1)
Heap Fetches: 921
-> Limit (cost=0.29..8.31 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=34689)
-> Index Scan using index_job_proxy_on_job_id on loc_job_source_materials (cost=0.29..8.31 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=34689)
Index Cond: (jobs.id = job_id)
The optimizer is smart enough to rewrite your first query to
SELECT proxy.*
FROM proxy
RIGHT OUTER JOIN jobs
ON jobs.id = proxy.job_id;
But this optimization cannot be made with the LIMIT clause, so only a nested loop join is possible.
Following on from #LaurenzAlbe's answer, I think we can help more if you show the complete query, so we know why you need a LATERAL join. For the (simplified) requirements you have mentioned so far, I think an equivalent is
SELECT DISTINCT ON(proxy.id) proxy.*
FROM proxy
RIGHT OUTER JOIN jobs
ON jobs.id = proxy.job_id;
Also, since you are only outputting columns from proxy, you are effectively doing only an INNER JOIN, but with more computing effort.

Improve performance on SQL query with Nested Loop - PostgreSQL

I am using PostgreSQL and I have a weird problem with my SQL query. Depending on wich date paramter I'm using. My request doesn't do the same operation.
This is my working query :
SELECT DISTINCT app.id_application
FROM stat sj
LEFT OUTER JOIN groupe gp ON gp.id_groupe = sj.id_groupe
LEFT OUTER JOIN application app ON app.id_application = gp.id_application
WHERE date_stat >= '2016/3/01'
AND date_stat <= '2016/3/31'
AND ( date_stat = date_gen-1 or (date_gen = '2016/04/01' AND date_stat = '2016/3/31'))
AND app.id_application IS NOT NULL
This query takes around 2 secondes (which is OKAY for me because I have a lots of rows). When I run EXPLAIN ANALYSE for this query I have this:
HashAggregate (cost=375486.95..375493.62 rows=667 width=4) (actual time=2320.541..2320.656 rows=442 loops=1)
-> Hash Join (cost=254.02..375478.99 rows=3186 width=4) (actual time=6.144..2271.984 rows=263274 loops=1)
Hash Cond: (gp.id_application = app.id_application)
-> Hash Join (cost=234.01..375415.17 rows=3186 width=4) (actual time=5.926..2200.671 rows=263274 loops=1)
Hash Cond: (sj.id_groupe = gp.id_groupe)
-> Seq Scan on stat sj (cost=0.00..375109.47 rows=3186 width=8) (actual time=3.196..2068.357 rows=263274 loops=1)
Filter: ((date_stat >= '2016-03-01'::date) AND (date_stat <= '2016-03-31'::date) AND ((date_stat = (date_gen - 1)) OR ((date_gen = '2016-04-01'::date) AND (date_stat = '2016-03-31'::date))))
Rows Removed by Filter: 7199514
-> Hash (cost=133.45..133.45 rows=8045 width=12) (actual time=2.677..2.677 rows=8019 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 345kB
-> Seq Scan on groupe gp (cost=0.00..133.45 rows=8045 width=12) (actual time=0.007..1.284 rows=8019 loops=1)
-> Hash (cost=11.67..11.67 rows=667 width=4) (actual time=0.206..0.206 rows=692 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 25kB
-> Seq Scan on application app (cost=0.00..11.67 rows=667 width=4) (actual time=0.007..0.101 rows=692 loops=1)
Filter: (id_application IS NOT NULL)
Total runtime: 2320.855 ms
Now, When I'm trying the same query for the current month (we are the 6th of April, so I'm trying to get all the application_id of April) with the same query
SELECT DISTINCT app.id_application
FROM stat sj
LEFT OUTER JOIN groupe gp ON gp.id_groupe = sj.id_groupe
LEFT OUTER JOIN application app ON app.id_application = gp.id_application
WHERE date_stat >= '2016/04/01'
AND date_stat <= '2016/04/30'
AND ( date_stat = date_gen-1 or ( date_gen = '2016/05/01' AND date_job = '2016/04/30'))
AND app.id_application IS NOT NULL
This query takes now 120 seconds. So I also ran EXPLAIN ANALYZE on this query and now it doesn't have the same operations:
HashAggregate (cost=375363.50..375363.51 rows=1 width=4) (actual time=186716.468..186716.532 rows=490 loops=1)
-> Nested Loop (cost=0.00..375363.49 rows=1 width=4) (actual time=1.945..186619.404 rows=118990 loops=1)
Join Filter: (gp.id_application = app.id_application)
Rows Removed by Join Filter: 82222090
-> Nested Loop (cost=0.00..375343.49 rows=1 width=4) (actual time=1.821..171458.237 rows=118990 loops=1)
Join Filter: (sj.id_groupe = gp.id_groupe)
Rows Removed by Join Filter: 954061820
-> Seq Scan on stat sj (cost=0.00..375109.47 rows=1 width=8) (actual time=0.235..1964.423 rows=118990 loops=1)
Filter: ((date_stat >= '2016-04-01'::date) AND (date_stat <= '2016-04-30'::date) AND ((date_stat = (date_gen - 1)) OR ((date_gen = '2016-05-01'::date) AND (date_stat = '2016-04-30'::date))))
Rows Removed by Filter: 7343798
-> Seq Scan on groupe gp (cost=0.00..133.45 rows=8045 width=12) (actual time=0.002..0.736 rows=8019 loops=118990)
-> Seq Scan on application app (cost=0.00..11.67 rows=667 width=4) (actual time=0.003..0.073 rows=692 loops=118990)
Filter: (id_application IS NOT NULL)
Total runtime: 186716.635 ms
So I decided to search where the problem came from by reducing the number of conditions from my query until the performances is acceptable again.
So with only this parameter
WHERE date_stat >= '2016/04/01'
It takes only 1.9secondes (like the first working query)
and it's also working with 2 parameters :
WHERE date_stat >= '2016/04/01'
AND app.id_application IS NOT NULL
BUT when I try to add one of those line I have the Nested loop in the Explain
AND date_stat <= '2016/04/30'
AND ( date_stat = date_gen-1 or ( date_gen = '2016/05/01' AND date_stat = '2016/04/30'))
Does someone have any idea where it could come from?
Ok, it looks like there's problem with optimizer estimations. He thiks that for april there will be only 1 row so he choose NESTED LOOP which is very inefficient for big number of rows (118,990 in that case).
Perform VACUUM ANALYZE for every table. This will clean up dead tuples and refresh statistics.
consider adding index based on dates like CREATE INDEX date_stat_idx ON <table with date_stat> USING btree (date_stat);
Rerun the query,

Optimizing postgres query

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=32164.87..32164.89 rows=1 width=44) (actual time=221552.831..221552.831 rows=0 loops=1)
-> Sort (cost=32164.87..32164.87 rows=1 width=44) (actual time=221552.827..221552.827 rows=0 loops=1)
Sort Key: t.date_effective, t.acct_account_transaction_id, p.method, t.amount, c.business_name, t.amount
-> Nested Loop (cost=22871.67..32164.86 rows=1 width=44) (actual time=221552.808..221552.808 rows=0 loops=1)
-> Nested Loop (cost=22871.67..32160.37 rows=1 width=52) (actual time=221431.071..221546.619 rows=670 loops=1)
-> Nested Loop (cost=22871.67..32157.33 rows=1 width=43) (actual time=221421.218..221525.056 rows=2571 loops=1)
-> Hash Join (cost=22871.67..32152.80 rows=1 width=16) (actual time=221307.382..221491.019 rows=2593 loops=1)
Hash Cond: ("outer".acct_account_id = "inner".acct_account_fk)
-> Seq Scan on acct_account a (cost=0.00..7456.08 rows=365008 width=8) (actual time=0.032..118.369 rows=61295 loops=1)
-> Hash (cost=22871.67..22871.67 rows=1 width=16) (actual time=221286.733..221286.733 rows=2593 loops=1)
-> Nested Loop Left Join (cost=0.00..22871.67 rows=1 width=16) (actual time=1025.396..221266.357 rows=2593 loops=1)
Join Filter: ("inner".orig_acct_payment_fk = "outer".acct_account_transaction_id)
Filter: ("inner".link_type IS NULL)
-> Seq Scan on acct_account_transaction t (cost=0.00..18222.98 rows=1 width=16) (actual time=949.081..976.432 rows=2596 loops=1)
Filter: ((("type")::text = 'debit'::text) AND ((transaction_status)::text = 'active'::text) AND (date_effective >= '2012-03-01'::date) AND (date_effective < '2012-04-01 00:00:00'::timestamp without time zone))
-> Seq Scan on acct_payment_link l (cost=0.00..4648.68 rows=1 width=15) (actual time=1.073..84.610 rows=169 loops=2596)
Filter: ((link_type)::text ~~ 'return_%'::text)
-> Index Scan using contact_pk on contact c (cost=0.00..4.52 rows=1 width=27) (actual time=0.007..0.008 rows=1 loops=2593)
Index Cond: (c.contact_id = "outer".contact_fk)
-> Index Scan using acct_payment_transaction_fk on acct_payment p (cost=0.00..3.02 rows=1 width=13) (actual time=0.005..0.005 rows=0 loops=2571)
Index Cond: (p.acct_account_transaction_fk = "outer".acct_account_transaction_id)
Filter: ((method)::text <> 'trade'::text)
-> Index Scan using contact_role_pk on contact_role (cost=0.00..4.48 rows=1 width=4) (actual time=0.007..0.007 rows=0 loops=670)
Index Cond: ("outer".contact_id = contact_role.contact_fk)
Filter: (exchange_fk = 74)
Total runtime: 221553.019 ms
Your problem is here:
-> Nested Loop Left Join (cost=0.00..22871.67 rows=1 width=16) (actual time=1025.396..221266.357 rows=2593 loops=1)
Join Filter: ("inner".orig_acct_payment_fk = "outer".acct_account_transaction_id)
Filter: ("inner".link_type IS NULL)
-> Seq Scan on acct_account_transaction t (cost=0.00..18222.98 rows=1 width=16) (actual time=949.081..976.432 rows=2596 loops=1)
Filter: ((("type")::text = 'debit'::text) AND ((transaction_status)::text = 'active'::text) AND (date_effective >= '2012-03-01'::date) AND (date_effective
Seq Scan on acct_payment_link l (cost=0.00..4648.68 rows=1 width=15) (actual time=1.073..84.610 rows=169 loops=2596)
Filter: ((link_type)::text ~~ 'return_%'::text)
It expects to find 1 row in acct_account_transaction, while it finds 2596, and similarly for the other table.
You did not mention Your postgres version (could You?), but this should do the trick:
SELECT DISTINCT
t.date_effective,
t.acct_account_transaction_id,
p.method,
t.amount,
c.business_name,
t.amount
FROM
contact c inner join contact_role on (c.contact_id=contact_role.contact_fk and contact_role.exchange_fk=74),
acct_account a, acct_payment p,
acct_account_transaction t
WHERE
p.acct_account_transaction_fk=t.acct_account_transaction_id
and t.type = 'debit'
and transaction_status = 'active'
and p.method != 'trade'
and t.date_effective >= '2012-03-01'
and t.date_effective < (date '2012-03-01' + interval '1 month')
and c.contact_id=a.contact_fk and a.acct_account_id = t.acct_account_fk
and not exists(
select * from acct_payment_link l
where orig_acct_payment_fk == acct_account_transaction_id
and link_type like 'return_%'
)
ORDER BY
t.date_effective DESC
Also, try setting appropriate statistics target for relevant columns. Link to the friendly manual: http://www.postgresql.org/docs/current/static/sql-altertable.html
What are your indexes, and have you analysed lately? It's doing a table scan on acct_account_transaction even though there are several criteria on that table:
type
date_effective
If there are no indexes on those columns, then a compound one one (type, date_effective) could help (assuming there are lots of rows that don't meet the criteria on those columns).
I remove my first suggestion, as it changes the nature of the query.
I see that there's too much time spent in the LEFT JOIN.
First thing is to try to make only a single scan of the acct_payment_link table. Could you try rewriting your query to:
... LEFT JOIN (SELECT * FROM acct_payment_link
WHERE link_type LIKE 'return_%') AS l ...
You should check your statistics, as there's difference between planned and returned numbers of rows.
You haven't included tables' and indexes' definitions, it'd be good to take a look on those.
You might also want to use contrib/pg_tgrm extension to build index on the acct_payment_link.link_type, but I would make this a last option to try out.
BTW, what is the PostgreSQL version you're using?
Your statement rewritten and formatted:
SELECT DISTINCT
t.date_effective,
t.acct_account_transaction_id,
p.method,
t.amount,
c.business_name,
t.amount
FROM contact c
JOIN contact_role cr ON cr.contact_fk = c.contact_id
JOIN acct_account a ON a.contact_fk = c.contact_id
JOIN acct_account_transaction t ON t.acct_account_fk = a.acct_account_id
JOIN acct_payment p ON p.acct_account_transaction_fk
= t.acct_account_transaction_id
LEFT JOIN acct_payment_link l ON orig_acct_payment_fk
= acct_account_transaction_id
-- missing table-qualification!
AND link_type like 'return_%'
-- missing table-qualification!
WHERE transaction_status = 'active' -- missing table-qualification!
AND cr.exchange_fk = 74
AND t.type = 'debit'
AND t.date_effective >= '2012-03-01'
AND t.date_effective < (date '2012-03-01' + interval '1 month')
AND p.method != 'trade'
AND l.link_type IS NULL
ORDER BY t.date_effective DESC;
Explicit JOIN statements are preferable. I reordered your tables according to your JOIN logic.
Why (date '2012-03-01' + interval '1 month') instead of date '2012-04-01'?
Some table qualifications are missing. In a complex statement like this that's bad style. May be hiding a mistake.
The key to performance are indexes where appropriate, proper configuration of PostgreSQL and accurate statistics.
General advice on performance tuning in the PostgreSQL wiki.