Improve performance on SQL query with Nested Loop - PostgreSQL - sql

I am using PostgreSQL and I have a weird problem with my SQL query. Depending on wich date paramter I'm using. My request doesn't do the same operation.
This is my working query :
SELECT DISTINCT app.id_application
FROM stat sj
LEFT OUTER JOIN groupe gp ON gp.id_groupe = sj.id_groupe
LEFT OUTER JOIN application app ON app.id_application = gp.id_application
WHERE date_stat >= '2016/3/01'
AND date_stat <= '2016/3/31'
AND ( date_stat = date_gen-1 or (date_gen = '2016/04/01' AND date_stat = '2016/3/31'))
AND app.id_application IS NOT NULL
This query takes around 2 secondes (which is OKAY for me because I have a lots of rows). When I run EXPLAIN ANALYSE for this query I have this:
HashAggregate (cost=375486.95..375493.62 rows=667 width=4) (actual time=2320.541..2320.656 rows=442 loops=1)
-> Hash Join (cost=254.02..375478.99 rows=3186 width=4) (actual time=6.144..2271.984 rows=263274 loops=1)
Hash Cond: (gp.id_application = app.id_application)
-> Hash Join (cost=234.01..375415.17 rows=3186 width=4) (actual time=5.926..2200.671 rows=263274 loops=1)
Hash Cond: (sj.id_groupe = gp.id_groupe)
-> Seq Scan on stat sj (cost=0.00..375109.47 rows=3186 width=8) (actual time=3.196..2068.357 rows=263274 loops=1)
Filter: ((date_stat >= '2016-03-01'::date) AND (date_stat <= '2016-03-31'::date) AND ((date_stat = (date_gen - 1)) OR ((date_gen = '2016-04-01'::date) AND (date_stat = '2016-03-31'::date))))
Rows Removed by Filter: 7199514
-> Hash (cost=133.45..133.45 rows=8045 width=12) (actual time=2.677..2.677 rows=8019 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 345kB
-> Seq Scan on groupe gp (cost=0.00..133.45 rows=8045 width=12) (actual time=0.007..1.284 rows=8019 loops=1)
-> Hash (cost=11.67..11.67 rows=667 width=4) (actual time=0.206..0.206 rows=692 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 25kB
-> Seq Scan on application app (cost=0.00..11.67 rows=667 width=4) (actual time=0.007..0.101 rows=692 loops=1)
Filter: (id_application IS NOT NULL)
Total runtime: 2320.855 ms
Now, When I'm trying the same query for the current month (we are the 6th of April, so I'm trying to get all the application_id of April) with the same query
SELECT DISTINCT app.id_application
FROM stat sj
LEFT OUTER JOIN groupe gp ON gp.id_groupe = sj.id_groupe
LEFT OUTER JOIN application app ON app.id_application = gp.id_application
WHERE date_stat >= '2016/04/01'
AND date_stat <= '2016/04/30'
AND ( date_stat = date_gen-1 or ( date_gen = '2016/05/01' AND date_job = '2016/04/30'))
AND app.id_application IS NOT NULL
This query takes now 120 seconds. So I also ran EXPLAIN ANALYZE on this query and now it doesn't have the same operations:
HashAggregate (cost=375363.50..375363.51 rows=1 width=4) (actual time=186716.468..186716.532 rows=490 loops=1)
-> Nested Loop (cost=0.00..375363.49 rows=1 width=4) (actual time=1.945..186619.404 rows=118990 loops=1)
Join Filter: (gp.id_application = app.id_application)
Rows Removed by Join Filter: 82222090
-> Nested Loop (cost=0.00..375343.49 rows=1 width=4) (actual time=1.821..171458.237 rows=118990 loops=1)
Join Filter: (sj.id_groupe = gp.id_groupe)
Rows Removed by Join Filter: 954061820
-> Seq Scan on stat sj (cost=0.00..375109.47 rows=1 width=8) (actual time=0.235..1964.423 rows=118990 loops=1)
Filter: ((date_stat >= '2016-04-01'::date) AND (date_stat <= '2016-04-30'::date) AND ((date_stat = (date_gen - 1)) OR ((date_gen = '2016-05-01'::date) AND (date_stat = '2016-04-30'::date))))
Rows Removed by Filter: 7343798
-> Seq Scan on groupe gp (cost=0.00..133.45 rows=8045 width=12) (actual time=0.002..0.736 rows=8019 loops=118990)
-> Seq Scan on application app (cost=0.00..11.67 rows=667 width=4) (actual time=0.003..0.073 rows=692 loops=118990)
Filter: (id_application IS NOT NULL)
Total runtime: 186716.635 ms
So I decided to search where the problem came from by reducing the number of conditions from my query until the performances is acceptable again.
So with only this parameter
WHERE date_stat >= '2016/04/01'
It takes only 1.9secondes (like the first working query)
and it's also working with 2 parameters :
WHERE date_stat >= '2016/04/01'
AND app.id_application IS NOT NULL
BUT when I try to add one of those line I have the Nested loop in the Explain
AND date_stat <= '2016/04/30'
AND ( date_stat = date_gen-1 or ( date_gen = '2016/05/01' AND date_stat = '2016/04/30'))
Does someone have any idea where it could come from?

Ok, it looks like there's problem with optimizer estimations. He thiks that for april there will be only 1 row so he choose NESTED LOOP which is very inefficient for big number of rows (118,990 in that case).
Perform VACUUM ANALYZE for every table. This will clean up dead tuples and refresh statistics.
consider adding index based on dates like CREATE INDEX date_stat_idx ON <table with date_stat> USING btree (date_stat);
Rerun the query,

Related

Postgres: Slow query when using OR statement in a join query

We run a join query between 2 tables.
The query has an OR statement that compares one column from the left table and one column from the right table. The query performance is very low, and we fixed it by changing the OR to UNION.
Why is this happening? I'm looking for a detailed explanation or a reference to the documentation that might shed a light on the issue.
Query with Or Statment:
db1=# explain analyze select count(*)
from conversations
join agents on conversations.agent_id=agents.id
where conversations.id=1 or agents.id = '123';
**Query plan**
----------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=**11017.95..11017.96** rows=1 width=8) (actual time=54.088..54.088 rows=1 loops=1)
-> Gather (cost=11017.73..11017.94 rows=2 width=8) (actual time=53.945..57.181 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=10017.73..10017.74 rows=1 width=8) (actual time=48.303..48.303 rows=1 loops=3)
-> Hash Join (cost=219.26..10016.69 rows=415 width=0) (actual time=5.292..48.287 rows=130 loops=3)
Hash Cond: (conversations.agent_id = agents.id)
Join Filter: ((conversations.id = 1) OR ((agents.id)::text = '123'::text))
Rows Removed by Join Filter: 80035
-> Parallel Seq Scan on conversations (cost=0.00..9366.95 rows=163995 width=8) (actual time=0.017..14.972 rows=131196 loops=3)
-> Hash (cost=143.56..143.56 rows=6056 width=16) (actual time=2.686..2.686 rows=6057 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 353kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=16) (actual time=0.011..1.305 rows=6057 loops=3)
Planning time: 0.710 ms
Execution time: 57.276 ms
(15 rows)
Changing the OR to UNION:
db1=# explain analyze select count(*) from (
select *
from conversations
join agents on conversations.agent_id=agents.id
where conversations.installation_id=1
union
select *
from conversations
join agents on conversations.agent_id=agents.id
where agents.source_id = '123') as subquery;
**Query plan:**
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (**cost=1114.31..1114.32** rows=1 width=8) (actual time=8.038..8.038 rows=1 loops=1)
-> HashAggregate (cost=1091.90..1101.86 rows=996 width=1437) (actual time=7.783..8.009 rows=390 loops=1)
Group Key: conversations.id, conversations.created, conversations.modified, conversations.source_created, conversations.source_id, conversations.installation_id, bra
in_conversation.resolution_reason, conversations.solve_time, conversations.agent_id, conversations.submission_reason, conversations.is_marked_as_duplicate, conversations.n
um_back_and_forths, conversations.is_closed, conversations.is_solved, conversations.conversation_type, conversations.related_ticket_source_id, conversations.channel, brain_convers
ation.last_updated_from_platform, conversations.csat, agents.id, agents.created, agents.modified, agents.name, agents.source_id, organizati
on_agent.installation_id, agents.settings
-> Append (cost=219.68..1027.16 rows=996 width=1437) (actual time=5.517..6.307 rows=390 loops=1)
-> Hash Join (cost=219.68..649.69 rows=931 width=224) (actual time=5.516..6.063 rows=390 loops=1)
Hash Cond: (conversations.agent_id = agents.id)
-> Index Scan using conversations_installation_id_b3ff5c00 on conversations (cost=0.42..427.98 rows=931 width=154) (actual time=0.039..0.344 rows=879 loops=1)
Index Cond: (installation_id = 1)
-> Hash (cost=143.56..143.56 rows=6056 width=70) (actual time=5.394..5.394 rows=6057 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 710kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=70) (actual time=0.014..1.938 rows=6057 loops=1)
-> Nested Loop (cost=0.70..367.52 rows=65 width=224) (actual time=0.210..0.211 rows=0 loops=1)
-> Index Scan using agents_source_id_106c8103_like on agents agents_1 (cost=0.28..8.30 rows=1 width=70) (actual time=0.210..0.210 rows=0 loops=1)
Index Cond: ((source_id)::text = '123'::text)
-> Index Scan using conversations_agent_id_de76554b on conversations conversations_1 (cost=0.42..358.12 rows=110 width=154) (never executed)
Index Cond: (agent_id = agents_1.id)
Planning time: 2.024 ms
Execution time: 9.367 ms
(18 rows)
Yes. or has a way of killing the performance of queries. For this query:
select count(*)
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1 or a.id = 123;
Note I removed the quotes around 123. It looks like a number so I assume it is. For this query, you want an index on conversations(agent_id).
Probably the most effective way to write the query is:
select count(*)
from ((select 1
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1
) union all
(select 1
from conversations c join
agents a
on c.agent_id = a.id
where a.id = 123 and c.id <> 1
)
) ac;
Note the use of union all rather than union. The additional where condition eliminates duplicates.
This can take advantage of the following indexes:
conversations(id, agent_id)
agents(id)
conversations(agent_id, id)

Poor performing postgres sql

Here's my sql, followed by the explanation. I need to improve the performance. Any ideas?
PostgreSQL 9.3.12 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4, 64-bit
explain analyze
SELECT DISTINCT "apts"."id", practices.name AS alias_0
FROM "apts"
LEFT OUTER JOIN "patients" ON "patients"."id" = "apts"."patient_id"
LEFT OUTER JOIN "practices" ON "practices"."id" = "apts"."practice_id"
LEFT OUTER JOIN "eligibility_messages" ON "eligibility_messages"."apt_id" = "apts"."id"
WHERE (apts.eligibility_status_id != 1)
AND (eligibility_messages.current = 't')
AND (practices.id = '104')
ORDER BY practices.name desc
LIMIT 25 OFFSET 0
Limit (cost=881321.34..881321.41 rows=25 width=20) (actual time=2928.225..2928.227 rows=25 loops=1)
-> Sort (cost=881321.34..881391.94 rows=28240 width=20) (actual time=2928.223..2928.224 rows=25 loops=1)
Sort Key: practices.name
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=880242.03..880524.43 rows=28240 width=20) (actual time=2927.213..2927.319 rows=520 loops=1)
-> Nested Loop (cost=286614.55..880100.83 rows=28240 width=20) (actual time=206.180..2926.791 rows=520 loops=1)
-> Seq Scan on practices (cost=0.00..6.36 rows=1 width=20) (actual time=0.018..0.031 rows=1 loops=1)
Filter: (id = 104)
Rows Removed by Filter: 108
-> Hash Join (cost=286614.55..879812.07 rows=28240 width=8) (actual time=206.159..2926.643 rows=520 loops=1)
Hash Cond: (eligibility_messages.apt_id = apts.id)
-> Seq Scan on eligibility_messages (cost=0.00..561275.63 rows=2029532 width=4) (actual time=0.691..2766.867 rows=67559 loops=1)
Filter: current
Rows Removed by Filter: 3924633
-> Hash (cost=284614.02..284614.02 rows=115082 width=12) (actual time=121.957..121.957 rows=91660 loops=1)
Buckets: 16384 Batches: 2 Memory Usage: 1974kB
-> Bitmap Heap Scan on apts (cost=8296.88..284614.02 rows=115082 width=12) (actual time=19.927..91.038 rows=91660 loops=1)
Recheck Cond: (practice_id = 104)
Filter: (eligibility_status_id <> 1)
Rows Removed by Filter: 80169
-> Bitmap Index Scan on index_apts_on_practice_id (cost=0.00..8268.11 rows=177540 width=0) (actual time=16.856..16.856 rows=179506 loops=1)
Index Cond: (practice_id = 104)
Total runtime: 2928.361 ms
First, rewrite the query to a more manageable form:
SELECT DISTINCT a."id", pr.name AS alias_0
FROM "apts" a JOIN
"practices" pr
ON pr."id" = a."practice_id" JOIN
"eligibility_messages" em
ON em."apt_id" = a."id"
WHERE (a.eligibility_status_id <> 1) AND
(em.current = 't') AND
(a.practice_id = 104)
ORDER BY pr.name desc ;
Notes:
The WHERE clause turns the outer joins into inner joins anyway, so you might as well express them correctly.
I doubt pr.id is actually a string
The patients table isn't used, so I just removed it.
Perhaps you don't even need the select distinct any more.
Switched the condition in the where to apts rather than practices.
If this isn't fast enough, you want indexes, probably on apts(practice_id, eligibility_status_id, id), practices(id), and eligibility_messages(apt_id, current).

DISTINCT INNER JOIN slow

I've written the following PostgreSQL query which works as it should. However, it seems to be awfully slow, sometimes taking up to 10 seconds to return a result. I'm sure there is something in my statement that is causing this to be slow.
Can anyone help determine why this query is slow?
SELECT DISTINCT ON (school_classes.class_id,attendance_calendar.school_date)
school_classes.class_id, school_classes.class_name, school_classes.grade_id
, school_gradelevels.linked_calendar, attendance_calendars.calendar_id
, attendance_calendar.school_date, attendance_calendar.minutes
, teacher_join_classes_subjects.staff_id, staff.first_name, staff.last_name
FROM school_classes
INNER JOIN school_gradelevels ON school_gradelevels.id=school_classes.grade_id
INNER JOIN teacher_join_classes_subjects ON teacher_join_classes_subjects.class_id=school_classes.class_id
INNER JOIN staff ON staff.staff_id=teacher_join_classes_subjects.staff_id
INNER JOIN attendance_calendars ON attendance_calendars.title=school_gradelevels.linked_calendar
INNER JOIN attendance_calendar ON attendance_calendar.calendar_id=attendance_calendars.calendar_id
WHERE teacher_join_classes_subjects.syear='2013'
AND staff.syear='2013'
AND attendance_calendars.syear='2013'
AND teacher_join_classes_subjects.does_attendance='Y'
AND teacher_join_classes_subjects.subject_id IS NULL
AND attendance_calendar.school_date<CURRENT_DATE
AND attendance_calendar.school_date NOT IN (
SELECT com.school_date FROM attendance_completed com
WHERE com.class_id=school_classes.class_id
AND (com.period_id='101' AND attendance_calendar.minutes>='151' OR
com.period_id='95' AND attendance_calendar.minutes='150') )
I replaced the NOT IN with the following:
AND NOT EXISTS (
SELECT com.school_date
FROM attendance_completed com
WHERE com.class_id=school_classes.class_id
AND com.school_date=attendance_calendar.school_date
AND (com.period_id='101' AND attendance_calendar.minutes>='151' OR
com.period_id='95' AND attendance_calendar.minutes='150') )
Result of EXPLAIN ANALYZE:
Unique (cost=2998.39..2998.41 rows=3 width=85) (actual time=10751.111..10751.118 rows=1 loops=1)
-> Sort (cost=2998.39..2998.40 rows=3 width=85) (actual time=10751.110..10751.110 rows=2 loops=1)
Sort Key: school_classes.class_id, attendance_calendar.school_date
Sort Method: quicksort Memory: 25kB
-> Hash Join (cost=2.03..2998.37 rows=3 width=85) (actual time=6409.471..10751.045 rows=2 loops=1)
Hash Cond: ((teacher_join_classes_subjects.class_id = school_classes.class_id) AND (school_gradelevels.id = school_classes.grade_id))
Join Filter: (NOT (SubPlan 1))
-> Nested Loop (cost=0.00..120.69 rows=94 width=81) (actual time=2.468..1187.397 rows=26460 loops=1)
Join Filter: (attendance_calendars.calendar_id = attendance_calendar.calendar_id)
-> Nested Loop (cost=0.00..42.13 rows=1 width=70) (actual time=0.087..3.247 rows=735 loops=1)
Join Filter: ((attendance_calendars.title)::text = (school_gradelevels.linked_calendar)::text)
-> Nested Loop (cost=0.00..40.80 rows=1 width=277) (actual time=0.077..1.005 rows=245 loops=1)
-> Nested Loop (cost=0.00..39.61 rows=1 width=27) (actual time=0.064..0.572 rows=49 loops=1)
-> Seq Scan on teacher_join_classes_subjects (cost=0.00..10.48 rows=4 width=14) (actual time=0.022..0.143 rows=49 loops=1)
Filter: ((subject_id IS NULL) AND (syear = 2013::numeric) AND ((does_attendance)::text = 'Y'::text))
-> Index Scan using staff_pkey on staff (cost=0.00..7.27 rows=1 width=20) (actual time=0.006..0.007 rows=1 loops=49)
Index Cond: (staff.staff_id = teacher_join_classes_subjects.staff_id)
Filter: (staff.syear = 2013::numeric)
-> Seq Scan on attendance_calendars (cost=0.00..1.18 rows=1 width=250) (actual time=0.003..0.006 rows=5 loops=49)
Filter: (attendance_calendars.syear = 2013::numeric)
-> Seq Scan on school_gradelevels (cost=0.00..1.15 rows=15 width=11) (actual time=0.001..0.005 rows=15 loops=245)
-> Seq Scan on attendance_calendar (cost=0.00..55.26 rows=1864 width=18) (actual time=0.003..1.129 rows=1824 loops=735)
Filter: (attendance_calendar.school_date Hash (cost=1.41..1.41 rows=41 width=18) (actual time=0.040..0.040 rows=41 loops=1)
-> Seq Scan on school_classes (cost=0.00..1.41 rows=41 width=18) (actual time=0.006..0.015 rows=41 loops=1)
SubPlan 1
-> Seq Scan on attendance_completed com (cost=0.00..958.28 rows=5 width=4) (actual time=0.228..5.411 rows=17 loops=1764)
Filter: ((class_id = $0) AND (((period_id = 101::numeric) AND ($1 >= 151::numeric)) OR ((period_id = 95::numeric) AND ($1 = 150::numeric))))
NOT EXISTS is an excellent choice. Almost always better than NOT IN. More details here.
I simplified your query a bit (which looks fine, generally):
SELECT DISTINCT ON (c.class_id, a.school_date)
c.class_id, c.class_name, c.grade_id
,g.linked_calendar, aa.calendar_id
,a.school_date, a.minutes
,t.staff_id, s.first_name, s.last_name
FROM school_classes c
JOIN teacher_join_classes_subjects t USING (class_id)
JOIN staff s USING (staff_id)
JOIN school_gradelevels g ON g.id = c.grade_id
JOIN attendance_calendars aa ON aa.title = g.linked_calendar
JOIN attendance_calendar a ON a.calendar_id = aa.calendar_id
WHERE t.syear = 2013
AND s.syear = 2013
AND aa.syear = 2013
AND t.does_attendance = 'Y' -- looks like it should be boolean!
AND t.subject_id IS NULL
AND a.school_date < CURRENT_DATE
AND NOT EXISTS (
SELECT 1
FROM attendance_completed x
WHERE x.class_id = c.class_id
AND x.school_date = a.school_date
AND (x.period_id = 101 AND a.minutes >= 151 OR -- actually numbers?
x.period_id = 95 AND a.minutes = 150)
)
ORDER BY c.class_id, a.school_date, ???
What seems to be missing is ORDER BY which should accompany your DISTINCT ON. Add more ORDER BY items in place of ???. If there are duplicates to pick from, you probably want to define which to pick.
Numeric literals don't need single quotes and boolean values should be coded as such.
You may want to revisit the chapter about data types.

Optimise the PG query

The query is used very often in the app and is too expensive.
What are the things I can do to optimise it and bring the total time to milliseconds (rather than hundreds of ms)?
NOTES:
removing DISTINCT improves (down to ~460ms), but I need to to get rid of cartesian product :( (yeah, show better way of avoiding it)
removing OREDER BY name improves, but not significantly.
The query:
SELECT DISTINCT properties.*
FROM properties JOIN developments ON developments.id = properties.development_id
-- Development allocations
LEFT JOIN allocation_items AS dev_items ON dev_items.development_id = properties.development_id
LEFT JOIN allocations AS dev_allocs ON dev_items.allocation_id = dev_allocs.id
-- Group allocations
LEFT JOIN properties_property_groupings ppg ON ppg.property_id = properties.id
LEFT JOIN property_groupings pg ON pg.id = ppg.property_grouping_id
LEFT JOIN allocation_items prop_items ON prop_items.property_grouping_id = pg.id
LEFT JOIN allocations prop_allocs ON prop_allocs.id = prop_items.allocation_id
WHERE
(properties.status <> 'deleted') AND ((
properties.status <> 'inactive'
AND (
(dev_allocs.receiving_company_id = 175 OR prop_allocs.receiving_company_id = 175)
AND developments.status = 'active'
)
OR developments.company_id = 175
)
AND EXISTS (
SELECT 1 FROM development_participations dp
JOIN participations p ON p.id = dp.participation_id
WHERE dp.allowed
AND p.user_id = 387 AND p.company_id = 175
AND dp.development_id = properties.development_id
LIMIT 1
)
)
ORDER BY properties.name
EXPLAIN ANALYZE
Unique (cost=72336.86..72517.53 rows=1606 width=4336) (actual time=703.766..710.920 rows=219 loops=1)
-> Sort (cost=72336.86..72340.87 rows=1606 width=4336) (actual time=703.765..704.698 rows=5091 loops=1)
Sort Key: properties.name, properties.id, properties.status, properties.level, etc etc (all columns)
Sort Method: external sort Disk: 1000kB
-> Nested Loop Left Join (cost=0.00..69258.84 rows=1606 width=4336) (actual time=25.230..366.489 rows=5091 loops=1)
Filter: ((((properties.status)::text <> 'inactive'::text) AND ((dev_allocs.receiving_company_id = 175) OR (prop_allocs.receiving_company_id = 175)) AND ((developments.status)::text = 'active'::text)) OR (developments.company_id = 175))
-> Nested Loop Left Join (cost=0.00..57036.99 rows=41718 width=4355) (actual time=25.122..247.587 rows=99567 loops=1)
-> Nested Loop Left Join (cost=0.00..47616.39 rows=21766 width=4355) (actual time=25.111..163.827 rows=39774 loops=1)
-> Nested Loop Left Join (cost=0.00..41508.16 rows=21766 width=4355) (actual time=25.101..112.452 rows=39774 loops=1)
-> Nested Loop Left Join (cost=0.00..34725.22 rows=21766 width=4351) (actual time=25.087..68.104 rows=19887 loops=1)
-> Nested Loop Left Join (cost=0.00..28613.00 rows=21766 width=4351) (actual time=25.076..39.360 rows=19887 loops=1)
-> Nested Loop (cost=0.00..27478.54 rows=1147 width=4347) (actual time=25.059..29.966 rows=259 loops=1)
-> Index Scan using developments_pkey on developments (cost=0.00..25.17 rows=49 width=15) (actual time=0.048..0.127 rows=48 loops=1)
Filter: (((status)::text = 'active'::text) OR (company_id = 175))
-> Index Scan using index_properties_on_development_id on properties (cost=0.00..559.95 rows=26 width=4336) (actual time=0.534..0.618 rows=5 loops=48)
Index Cond: (development_id = developments.id)
Filter: (((status)::text <> 'deleted'::text) AND (SubPlan 1))
SubPlan 1
-> Limit (cost=0.00..10.00 rows=1 width=0) (actual time=0.011..0.011 rows=0 loops=2420)
-> Nested Loop (cost=0.00..10.00 rows=1 width=0) (actual time=0.011..0.011 rows=0 loops=2420)
Join Filter: (dp.participation_id = p.id)
-> Seq Scan on development_participations dp (cost=0.00..1.71 rows=1 width=4) (actual time=0.004..0.008 rows=1 loops=2420)
Filter: (allowed AND (development_id = properties.development_id))
-> Index Scan using index_participations_on_user_id on participations p (cost=0.00..8.27 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=3148)
Index Cond: (user_id = 387)
Filter: (company_id = 175)
-> Index Scan using index_allocation_items_on_development_id on allocation_items dev_items (cost=0.00..0.70 rows=23 width=8) (actual time=0.003..0.016 rows=77 loops=259)
Index Cond: (development_id = properties.development_id)
-> Index Scan using allocations_pkey on allocations dev_allocs (cost=0.00..0.27 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=19887)
Index Cond: (dev_items.allocation_id = id)
-> Index Scan using index_properties_property_groupings_on_property_id on properties_property_groupings ppg (cost=0.00..0.29 rows=2 width=8) (actual time=0.001..0.001 rows=2 loops=19887)
Index Cond: (property_id = properties.id)
-> Index Scan using property_groupings_pkey on property_groupings pg (cost=0.00..0.27 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=39774)
Index Cond: (id = ppg.property_grouping_id)
-> Index Scan using index_allocation_items_on_property_grouping_id on allocation_items prop_items (cost=0.00..0.36 rows=6 width=8) (actual time=0.001..0.001 rows=2 loops=39774)
Index Cond: (property_grouping_id = pg.id)
-> Index Scan using allocations_pkey on allocations prop_allocs (cost=0.00..0.27 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=99567)
Index Cond: (id = prop_items.allocation_id)
Total runtime: 716.692 ms
(39 rows)
Answering my own question.
This query has 2 big issues:
6 LEFT JOINs that produce cartesian product (resulting in billion-s of records even on small dataset).
DISTINCT that has to sort that billion records dataset.
So I had to eliminate those.
The way I did it is by replacing JOINs with 2 subqueries (won't provide it here since it should be pretty obvious).
As a result, the actual time went from ~700-800ms down to ~45ms which is more or less acceptable.
Most time is spend in the disk sort, you should use RAM by changing work_mem:
SET work_mem TO '20MB';
And check EXPLAIN ANALYZE again

Optimizing postgres query

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=32164.87..32164.89 rows=1 width=44) (actual time=221552.831..221552.831 rows=0 loops=1)
-> Sort (cost=32164.87..32164.87 rows=1 width=44) (actual time=221552.827..221552.827 rows=0 loops=1)
Sort Key: t.date_effective, t.acct_account_transaction_id, p.method, t.amount, c.business_name, t.amount
-> Nested Loop (cost=22871.67..32164.86 rows=1 width=44) (actual time=221552.808..221552.808 rows=0 loops=1)
-> Nested Loop (cost=22871.67..32160.37 rows=1 width=52) (actual time=221431.071..221546.619 rows=670 loops=1)
-> Nested Loop (cost=22871.67..32157.33 rows=1 width=43) (actual time=221421.218..221525.056 rows=2571 loops=1)
-> Hash Join (cost=22871.67..32152.80 rows=1 width=16) (actual time=221307.382..221491.019 rows=2593 loops=1)
Hash Cond: ("outer".acct_account_id = "inner".acct_account_fk)
-> Seq Scan on acct_account a (cost=0.00..7456.08 rows=365008 width=8) (actual time=0.032..118.369 rows=61295 loops=1)
-> Hash (cost=22871.67..22871.67 rows=1 width=16) (actual time=221286.733..221286.733 rows=2593 loops=1)
-> Nested Loop Left Join (cost=0.00..22871.67 rows=1 width=16) (actual time=1025.396..221266.357 rows=2593 loops=1)
Join Filter: ("inner".orig_acct_payment_fk = "outer".acct_account_transaction_id)
Filter: ("inner".link_type IS NULL)
-> Seq Scan on acct_account_transaction t (cost=0.00..18222.98 rows=1 width=16) (actual time=949.081..976.432 rows=2596 loops=1)
Filter: ((("type")::text = 'debit'::text) AND ((transaction_status)::text = 'active'::text) AND (date_effective >= '2012-03-01'::date) AND (date_effective < '2012-04-01 00:00:00'::timestamp without time zone))
-> Seq Scan on acct_payment_link l (cost=0.00..4648.68 rows=1 width=15) (actual time=1.073..84.610 rows=169 loops=2596)
Filter: ((link_type)::text ~~ 'return_%'::text)
-> Index Scan using contact_pk on contact c (cost=0.00..4.52 rows=1 width=27) (actual time=0.007..0.008 rows=1 loops=2593)
Index Cond: (c.contact_id = "outer".contact_fk)
-> Index Scan using acct_payment_transaction_fk on acct_payment p (cost=0.00..3.02 rows=1 width=13) (actual time=0.005..0.005 rows=0 loops=2571)
Index Cond: (p.acct_account_transaction_fk = "outer".acct_account_transaction_id)
Filter: ((method)::text <> 'trade'::text)
-> Index Scan using contact_role_pk on contact_role (cost=0.00..4.48 rows=1 width=4) (actual time=0.007..0.007 rows=0 loops=670)
Index Cond: ("outer".contact_id = contact_role.contact_fk)
Filter: (exchange_fk = 74)
Total runtime: 221553.019 ms
Your problem is here:
-> Nested Loop Left Join (cost=0.00..22871.67 rows=1 width=16) (actual time=1025.396..221266.357 rows=2593 loops=1)
Join Filter: ("inner".orig_acct_payment_fk = "outer".acct_account_transaction_id)
Filter: ("inner".link_type IS NULL)
-> Seq Scan on acct_account_transaction t (cost=0.00..18222.98 rows=1 width=16) (actual time=949.081..976.432 rows=2596 loops=1)
Filter: ((("type")::text = 'debit'::text) AND ((transaction_status)::text = 'active'::text) AND (date_effective >= '2012-03-01'::date) AND (date_effective
Seq Scan on acct_payment_link l (cost=0.00..4648.68 rows=1 width=15) (actual time=1.073..84.610 rows=169 loops=2596)
Filter: ((link_type)::text ~~ 'return_%'::text)
It expects to find 1 row in acct_account_transaction, while it finds 2596, and similarly for the other table.
You did not mention Your postgres version (could You?), but this should do the trick:
SELECT DISTINCT
t.date_effective,
t.acct_account_transaction_id,
p.method,
t.amount,
c.business_name,
t.amount
FROM
contact c inner join contact_role on (c.contact_id=contact_role.contact_fk and contact_role.exchange_fk=74),
acct_account a, acct_payment p,
acct_account_transaction t
WHERE
p.acct_account_transaction_fk=t.acct_account_transaction_id
and t.type = 'debit'
and transaction_status = 'active'
and p.method != 'trade'
and t.date_effective >= '2012-03-01'
and t.date_effective < (date '2012-03-01' + interval '1 month')
and c.contact_id=a.contact_fk and a.acct_account_id = t.acct_account_fk
and not exists(
select * from acct_payment_link l
where orig_acct_payment_fk == acct_account_transaction_id
and link_type like 'return_%'
)
ORDER BY
t.date_effective DESC
Also, try setting appropriate statistics target for relevant columns. Link to the friendly manual: http://www.postgresql.org/docs/current/static/sql-altertable.html
What are your indexes, and have you analysed lately? It's doing a table scan on acct_account_transaction even though there are several criteria on that table:
type
date_effective
If there are no indexes on those columns, then a compound one one (type, date_effective) could help (assuming there are lots of rows that don't meet the criteria on those columns).
I remove my first suggestion, as it changes the nature of the query.
I see that there's too much time spent in the LEFT JOIN.
First thing is to try to make only a single scan of the acct_payment_link table. Could you try rewriting your query to:
... LEFT JOIN (SELECT * FROM acct_payment_link
WHERE link_type LIKE 'return_%') AS l ...
You should check your statistics, as there's difference between planned and returned numbers of rows.
You haven't included tables' and indexes' definitions, it'd be good to take a look on those.
You might also want to use contrib/pg_tgrm extension to build index on the acct_payment_link.link_type, but I would make this a last option to try out.
BTW, what is the PostgreSQL version you're using?
Your statement rewritten and formatted:
SELECT DISTINCT
t.date_effective,
t.acct_account_transaction_id,
p.method,
t.amount,
c.business_name,
t.amount
FROM contact c
JOIN contact_role cr ON cr.contact_fk = c.contact_id
JOIN acct_account a ON a.contact_fk = c.contact_id
JOIN acct_account_transaction t ON t.acct_account_fk = a.acct_account_id
JOIN acct_payment p ON p.acct_account_transaction_fk
= t.acct_account_transaction_id
LEFT JOIN acct_payment_link l ON orig_acct_payment_fk
= acct_account_transaction_id
-- missing table-qualification!
AND link_type like 'return_%'
-- missing table-qualification!
WHERE transaction_status = 'active' -- missing table-qualification!
AND cr.exchange_fk = 74
AND t.type = 'debit'
AND t.date_effective >= '2012-03-01'
AND t.date_effective < (date '2012-03-01' + interval '1 month')
AND p.method != 'trade'
AND l.link_type IS NULL
ORDER BY t.date_effective DESC;
Explicit JOIN statements are preferable. I reordered your tables according to your JOIN logic.
Why (date '2012-03-01' + interval '1 month') instead of date '2012-04-01'?
Some table qualifications are missing. In a complex statement like this that's bad style. May be hiding a mistake.
The key to performance are indexes where appropriate, proper configuration of PostgreSQL and accurate statistics.
General advice on performance tuning in the PostgreSQL wiki.