I am trying to get a row with highest popularity. Ordering by descending popularity is slowing down the query significantly.
Is there a better way to optimize this query ?
Postgresql - 9.5
```explain analyse SELECT v.cosmo_id,
v.resource_id, k.gid, k.popularity,v.cropinfo_id
FROM rmsg.verifications V INNER JOIN rmip.resourceinfo R ON
(R.id=V.resource_id AND R.source_id=54) INNER JOIN rmpp.kgidinfo K ON
(K.cosmo_id=V.cosmo_id) WHERE V.status=1 AND
v.crop_Status=1 AND V.locked_time isnull ORDER BY k.popularity
desc, (v.cosmo_id,
v.resource_id, v.cropinfo_id) LIMIT 1;```
QUERY PLAN
Limit (cost=470399.99..470399.99 rows=1 width=31) (actual time=19655.552..19655.553 rows=1 loops=1)
Sort (cost=470399.99..470434.80 rows=13923 width=31) (actual time=19655.549..19655.549 rows=1 loops=1)
Sort Key: k.popularity DESC, (ROW(v.cosmo_id, v.resource_id, v.cropinfo_id))
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=19053.91..470330.37 rows=13923 width=31) (actual time=58.365..19627.405 rows=23006 loops=1)
-> Hash Join (cost=19053.48..459008.74 rows=13188 width=16) (actual time=58.275..19268.339 rows=19165 loops=1)
Hash Cond: (v.resource_id = r.id)
-> Seq Scan on verifications v (cost=0.00..409876.92 rows=7985725 width=16) (actual time=0.035..11097.163 rows=9908140 loops=1)
Filter: ((locked_time IS NULL) AND (status = 1) AND (crop_status = 1))
Rows Removed by Filter: 1126121
-> Hash (cost=18984.23..18984.23 rows=5540 width=4) (actual time=57.101..57.101 rows=5186 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 247kB
-> Bitmap Heap Scan on resourceinfo r (cost=175.37..18984.23 rows=5540 width=4) (actual time=2.827..51.318 rows=5186 loops=1)
Recheck Cond: (source_id = 54)
Heap Blocks: exact=5907
-> Bitmap Index Scan on resourceinfo_source_id_key (cost=0.00..173.98 rows=5540 width=0) (actual time=1.742..1.742 rows=6483 loops=1)
Index Cond: (source_id = 54)
Index Scan using kgidinfo_cosmo_id_idx on kgidinfo k (cost=0.43..0.85 rows=1 width=23) (actual time=0.013..0.014 rows=1 loops=19165)
Index Cond: (cosmo_id = v.cosmo_id)
Planning time: 1.083 ms
Execution time: 19655.638 ms
(21 rows)```
This is your query, simplified by removing parentheses:
SELECT v.cosmo_id, v.resource_id, k.gid, k.popularity, v.cropinfo_id
FROM rmsg.verifications V INNER JOIN
rmip.resourceinfo R
ON R.id = V.resource_id AND R.source_id = 54 INNER JOIN
rmpp.kgidinfo K
ON K.cosmo_id = V.cosmo_id
WHERE V.status = 1 AND v.crop_Status = 1 AND
V.locked_time is null
ORDER BY k.popularity desc, v.cosmo_id, v.resource_id, v.cropinfo_id
LIMIT 1;
For this query, I would think in terms of indexes on verifications(status, crop_status, locked_time, resource_id, cosmo_id, crop_info_id), resourceinfo(id, source_id), and kgidinfo(cosmo_id). I don't see an easy way to remove the ORDER BY.
In looking at the query, I wonder if you might have a Cartesian product problem between the two tables.
Related
It tries to retrieve videos in order of the number of tags that are the same as the specific video.
The following query takes about 800ms, but the index appears to be used.
If you remove COUNT, GROUP BY, and ORDER BY from the SQL query, it runs super fast.(1-5ms)
In such a case, improving the SQL query alone will not speed up the process and
Do I need to use MATERIALIZED VIEW?
SELECT "videos_video"."id",
"videos_video"."title",
"videos_video"."thumbnail_url",
"videos_video"."preview_url",
"videos_video"."embed_url",
"videos_video"."duration",
"videos_video"."views",
"videos_video"."is_public",
"videos_video"."published_at",
"videos_video"."created_at",
"videos_video"."updated_at",
COUNT("videos_video"."id") AS "n"
FROM "videos_video"
INNER JOIN "videos_video_tags" ON ("videos_video"."id" = "videos_video_tags"."video_id")
WHERE ("videos_video_tags"."tag_id" IN
(SELECT U0."id"
FROM "videos_tag" U0
INNER JOIN "videos_video_tags" U1 ON (U0."id" = U1."tag_id")
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'))
GROUP BY "videos_video"."id"
ORDER BY "n" DESC
LIMIT 20;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1040.69..1040.74 rows=20 width=24) (actual time=738.648..738.654 rows=20 loops=1)
-> Sort (cost=1040.69..1044.29 rows=1441 width=24) (actual time=738.646..738.650 rows=20 loops=1)
Sort Key: (count(videos_video.id)) DESC
Sort Method: top-N heapsort Memory: 27kB
-> HashAggregate (cost=987.93..1002.34 rows=1441 width=24) (actual time=671.006..714.322 rows=188818 loops=1)
Group Key: videos_video.id
Batches: 1 Memory Usage: 28689kB
-> Nested Loop (cost=35.20..980.73 rows=1441 width=16) (actual time=0.341..559.034 rows=240293 loops=1)
-> Nested Loop (cost=34.78..340.88 rows=1441 width=16) (actual time=0.278..92.806 rows=240293 loops=1)
-> HashAggregate (cost=34.35..34.41 rows=6 width=32) (actual time=0.188..0.200 rows=4 loops=1)
Group Key: u0.id
Batches: 1 Memory Usage: 24kB
-> Nested Loop (cost=0.71..34.33 rows=6 width=32) (actual time=0.161..0.185 rows=4 loops=1)
-> Index Only Scan using videos_video_tags_video_id_tag_id_f8d6ba70_uniq on videos_video_tags u1 (cost=0.43..4.53 rows=6 width=16) (actual time=0.039..0.040 rows=4 loops=1)
Index Cond: (video_id = '748b1814-f311-48da-a1f5-6bf8fe229c7f'::uuid)
Heap Fetches: 0
-> Index Only Scan using videos_tag_pkey on videos_tag u0 (cost=0.28..4.97 rows=1 width=16) (actual time=0.035..0.035 rows=1 loops=4)
Index Cond: (id = u1.tag_id)
Heap Fetches: 0
-> Index Scan using videos_video_tags_tag_id_2673cfc8 on videos_video_tags (cost=0.43..35.90 rows=1518 width=32) (actual time=0.029..16.728 rows=60073 loops=4)
Index Cond: (tag_id = u0.id)
-> Index Only Scan using videos_video_pkey on videos_video (cost=0.42..0.44 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=240293)
Index Cond: (id = videos_video_tags.video_id)
Heap Fetches: 46
Planning Time: 1.980 ms
Execution Time: 739.446 ms
(26 rows)
Time: 742.145 ms
---------- Results of the execution plan for the query as answered by Edouard. ----------
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=30043.90..30212.53 rows=20 width=746) (actual time=239.142..239.219 rows=20 loops=1)
-> Limit (cost=30043.48..30043.53 rows=20 width=24) (actual time=239.089..239.093 rows=20 loops=1)
-> Sort (cost=30043.48..30607.15 rows=225467 width=24) (actual time=239.087..239.090 rows=20 loops=1)
Sort Key: (count(*)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=21789.21..24043.88 rows=225467 width=24) (actual time=185.710..219.211 rows=188818 loops=1)
Group Key: vt.video_id
Batches: 1 Memory Usage: 22545kB
-> Nested Loop (cost=20.62..20187.24 rows=320395 width=16) (actual time=4.975..106.839 rows=240293 loops=1)
-> Index Only Scan using videos_video_tags_video_id_tag_id_f8d6ba70_uniq on videos_video_tags vvt (cost=0.43..4.53 rows=6 width=16) (actual time=0.033..0.043 rows=4 loops=1)
Index Cond: (video_id = '748b1814-f311-48da-a1f5-6bf8fe229c7f'::uuid)
Heap Fetches: 0
-> Bitmap Heap Scan on videos_video_tags vt (cost=20.19..3348.60 rows=1518 width=32) (actual time=4.311..20.663 rows=60073 loops=4)
Recheck Cond: (tag_id = vvt.tag_id)
Heap Blocks: exact=34757
-> Bitmap Index Scan on videos_video_tags_tag_id_2673cfc8 (cost=0.00..19.81 rows=1518 width=0) (actual time=3.017..3.017 rows=60073 loops=4)
Index Cond: (tag_id = vvt.tag_id)
-> Index Scan using videos_video_pkey on videos_video v (cost=0.42..8.44 rows=1 width=738) (actual time=0.005..0.005 rows=1 loops=20)
Index Cond: (id = vt.video_id)
Planning Time: 0.854 ms
Execution Time: 241.392 ms
(21 rows)
Time: 242.909 ms
Here below are some ideas to simplify the query. Then an EXPLAIN ANALYSE will confirm the potential impacts on the query performance.
Starting from the subquery :
SELECT U0."id"
FROM "videos_tag" U0
INNER JOIN "videos_video_tags" U1 ON (U0."id" = U1."tag_id")
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
According to the JOIN clause : U0."id" = U1."tag_id" so that SELECT U0."id" can be replaced by SELECT U1."tag_id".
In this case, the table "videos_tag" U0 is not used anymore in the subquery which can be simplified as :
SELECT U1."tag_id"
FROM "videos_video_tags" U1
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
And the WHERE clause of the main query becomes :
WHERE "videos_video_tags"."tag_id" IN
( SELECT U1."tag_id"
FROM "videos_video_tags" U1
WHERE U1."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
)
which can be transformed as a self join on the table "videos_video_tags" to be added in the FROM clause of the main query :
FROM "videos_video" AS v
INNER JOIN "videos_video_tags" AS vt
ON v."id" = vt."video_id"
INNER JOIN "videos_video_tags" AS vvt
ON vvt."tag_id" = vt."tag_id"
WHERE vvt."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
Finally, the GROUP BY "videos_video"."id" clause can be replaced by GROUP BY "videos_video_tags"."video_id" according to the JOIN clause between both tables, and this new GROUP BY clause associated to the ORDER BY clause and LIMIT clause can apply to a subquery involving the table "videos_video_tags" only, and before joining with the table "videos_video" :
SELECT v."id",
v."title",
v."thumbnail_url",
v."preview_url",
v."embed_url",
v."duration",
v."views",
v."is_public",
v."published_at",
v."created_at",
v."updated_at",
w."n"
FROM "videos_video" AS v
INNER JOIN
( SELECT vt."video_id"
, count(*) AS "n"
FROM "videos_video_tags" AS vt
INNER JOIN "videos_video_tags" AS vvt
ON vvt."tag_id" = vt."tag_id"
WHERE vvt."video_id" = '748b1814-f311-48da-a1f5-6bf8fe229c7f'
GROUP BY vt."video_id"
ORDER BY "n" DESC
LIMIT 20
) AS w
ON v."id" = w."video_id"
We run a join query between 2 tables.
The query has an OR statement that compares one column from the left table and one column from the right table. The query performance is very low, and we fixed it by changing the OR to UNION.
Why is this happening? I'm looking for a detailed explanation or a reference to the documentation that might shed a light on the issue.
Query with Or Statment:
db1=# explain analyze select count(*)
from conversations
join agents on conversations.agent_id=agents.id
where conversations.id=1 or agents.id = '123';
**Query plan**
----------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=**11017.95..11017.96** rows=1 width=8) (actual time=54.088..54.088 rows=1 loops=1)
-> Gather (cost=11017.73..11017.94 rows=2 width=8) (actual time=53.945..57.181 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=10017.73..10017.74 rows=1 width=8) (actual time=48.303..48.303 rows=1 loops=3)
-> Hash Join (cost=219.26..10016.69 rows=415 width=0) (actual time=5.292..48.287 rows=130 loops=3)
Hash Cond: (conversations.agent_id = agents.id)
Join Filter: ((conversations.id = 1) OR ((agents.id)::text = '123'::text))
Rows Removed by Join Filter: 80035
-> Parallel Seq Scan on conversations (cost=0.00..9366.95 rows=163995 width=8) (actual time=0.017..14.972 rows=131196 loops=3)
-> Hash (cost=143.56..143.56 rows=6056 width=16) (actual time=2.686..2.686 rows=6057 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 353kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=16) (actual time=0.011..1.305 rows=6057 loops=3)
Planning time: 0.710 ms
Execution time: 57.276 ms
(15 rows)
Changing the OR to UNION:
db1=# explain analyze select count(*) from (
select *
from conversations
join agents on conversations.agent_id=agents.id
where conversations.installation_id=1
union
select *
from conversations
join agents on conversations.agent_id=agents.id
where agents.source_id = '123') as subquery;
**Query plan:**
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (**cost=1114.31..1114.32** rows=1 width=8) (actual time=8.038..8.038 rows=1 loops=1)
-> HashAggregate (cost=1091.90..1101.86 rows=996 width=1437) (actual time=7.783..8.009 rows=390 loops=1)
Group Key: conversations.id, conversations.created, conversations.modified, conversations.source_created, conversations.source_id, conversations.installation_id, bra
in_conversation.resolution_reason, conversations.solve_time, conversations.agent_id, conversations.submission_reason, conversations.is_marked_as_duplicate, conversations.n
um_back_and_forths, conversations.is_closed, conversations.is_solved, conversations.conversation_type, conversations.related_ticket_source_id, conversations.channel, brain_convers
ation.last_updated_from_platform, conversations.csat, agents.id, agents.created, agents.modified, agents.name, agents.source_id, organizati
on_agent.installation_id, agents.settings
-> Append (cost=219.68..1027.16 rows=996 width=1437) (actual time=5.517..6.307 rows=390 loops=1)
-> Hash Join (cost=219.68..649.69 rows=931 width=224) (actual time=5.516..6.063 rows=390 loops=1)
Hash Cond: (conversations.agent_id = agents.id)
-> Index Scan using conversations_installation_id_b3ff5c00 on conversations (cost=0.42..427.98 rows=931 width=154) (actual time=0.039..0.344 rows=879 loops=1)
Index Cond: (installation_id = 1)
-> Hash (cost=143.56..143.56 rows=6056 width=70) (actual time=5.394..5.394 rows=6057 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 710kB
-> Seq Scan on agents (cost=0.00..143.56 rows=6056 width=70) (actual time=0.014..1.938 rows=6057 loops=1)
-> Nested Loop (cost=0.70..367.52 rows=65 width=224) (actual time=0.210..0.211 rows=0 loops=1)
-> Index Scan using agents_source_id_106c8103_like on agents agents_1 (cost=0.28..8.30 rows=1 width=70) (actual time=0.210..0.210 rows=0 loops=1)
Index Cond: ((source_id)::text = '123'::text)
-> Index Scan using conversations_agent_id_de76554b on conversations conversations_1 (cost=0.42..358.12 rows=110 width=154) (never executed)
Index Cond: (agent_id = agents_1.id)
Planning time: 2.024 ms
Execution time: 9.367 ms
(18 rows)
Yes. or has a way of killing the performance of queries. For this query:
select count(*)
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1 or a.id = 123;
Note I removed the quotes around 123. It looks like a number so I assume it is. For this query, you want an index on conversations(agent_id).
Probably the most effective way to write the query is:
select count(*)
from ((select 1
from conversations c join
agents a
on c.agent_id = a.id
where c.id = 1
) union all
(select 1
from conversations c join
agents a
on c.agent_id = a.id
where a.id = 123 and c.id <> 1
)
) ac;
Note the use of union all rather than union. The additional where condition eliminates duplicates.
This can take advantage of the following indexes:
conversations(id, agent_id)
agents(id)
conversations(agent_id, id)
Here's my sql, followed by the explanation. I need to improve the performance. Any ideas?
PostgreSQL 9.3.12 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4, 64-bit
explain analyze
SELECT DISTINCT "apts"."id", practices.name AS alias_0
FROM "apts"
LEFT OUTER JOIN "patients" ON "patients"."id" = "apts"."patient_id"
LEFT OUTER JOIN "practices" ON "practices"."id" = "apts"."practice_id"
LEFT OUTER JOIN "eligibility_messages" ON "eligibility_messages"."apt_id" = "apts"."id"
WHERE (apts.eligibility_status_id != 1)
AND (eligibility_messages.current = 't')
AND (practices.id = '104')
ORDER BY practices.name desc
LIMIT 25 OFFSET 0
Limit (cost=881321.34..881321.41 rows=25 width=20) (actual time=2928.225..2928.227 rows=25 loops=1)
-> Sort (cost=881321.34..881391.94 rows=28240 width=20) (actual time=2928.223..2928.224 rows=25 loops=1)
Sort Key: practices.name
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=880242.03..880524.43 rows=28240 width=20) (actual time=2927.213..2927.319 rows=520 loops=1)
-> Nested Loop (cost=286614.55..880100.83 rows=28240 width=20) (actual time=206.180..2926.791 rows=520 loops=1)
-> Seq Scan on practices (cost=0.00..6.36 rows=1 width=20) (actual time=0.018..0.031 rows=1 loops=1)
Filter: (id = 104)
Rows Removed by Filter: 108
-> Hash Join (cost=286614.55..879812.07 rows=28240 width=8) (actual time=206.159..2926.643 rows=520 loops=1)
Hash Cond: (eligibility_messages.apt_id = apts.id)
-> Seq Scan on eligibility_messages (cost=0.00..561275.63 rows=2029532 width=4) (actual time=0.691..2766.867 rows=67559 loops=1)
Filter: current
Rows Removed by Filter: 3924633
-> Hash (cost=284614.02..284614.02 rows=115082 width=12) (actual time=121.957..121.957 rows=91660 loops=1)
Buckets: 16384 Batches: 2 Memory Usage: 1974kB
-> Bitmap Heap Scan on apts (cost=8296.88..284614.02 rows=115082 width=12) (actual time=19.927..91.038 rows=91660 loops=1)
Recheck Cond: (practice_id = 104)
Filter: (eligibility_status_id <> 1)
Rows Removed by Filter: 80169
-> Bitmap Index Scan on index_apts_on_practice_id (cost=0.00..8268.11 rows=177540 width=0) (actual time=16.856..16.856 rows=179506 loops=1)
Index Cond: (practice_id = 104)
Total runtime: 2928.361 ms
First, rewrite the query to a more manageable form:
SELECT DISTINCT a."id", pr.name AS alias_0
FROM "apts" a JOIN
"practices" pr
ON pr."id" = a."practice_id" JOIN
"eligibility_messages" em
ON em."apt_id" = a."id"
WHERE (a.eligibility_status_id <> 1) AND
(em.current = 't') AND
(a.practice_id = 104)
ORDER BY pr.name desc ;
Notes:
The WHERE clause turns the outer joins into inner joins anyway, so you might as well express them correctly.
I doubt pr.id is actually a string
The patients table isn't used, so I just removed it.
Perhaps you don't even need the select distinct any more.
Switched the condition in the where to apts rather than practices.
If this isn't fast enough, you want indexes, probably on apts(practice_id, eligibility_status_id, id), practices(id), and eligibility_messages(apt_id, current).
I have the following two queries.Query 1 is fast since it uses indexes(uses nested loop join) and Query 2 uses hash join and it is slower.
Query 1 does order by on table 1 column and Query 2 does order by using table 2 column.
Query 1
learning=# explain analyze
select *
from users left join
access_logs
on users.userid = access_logs.userid
order by users.userid
limit 10 offset 90;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=14.00..15.46 rows=10 width=104) (actual time=1.330..1.504 rows=10 loops=1)
-> Merge Left Join (cost=0.85..291532.97 rows=1995958 width=104) (actual time=0.037..1.482 rows=100 loops=1)
Merge Cond: (users.userid = access_logs.userid)
-> Index Scan using users_pkey on users (cost=0.43..151132.75 rows=1995958 width=76) (actual time=0.018..1.135 rows=100 loops=1)
-> Index Scan using access_logs_userid_idx on access_logs (cost=0.43..110471.45 rows=1995958 width=28) (actual time=0.012..0.198 rows=100 loops=1)
Planning time: 0.469 ms
Execution time: 1.569 ms
Query 2
learning=# explain analyze
select *
from users left join
access_logs
on users.userid = access_logs.userid
order by access_logs.userid
limit 10 offset 90;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=293584.20..293584.23 rows=10 width=104) (actual time=3821.432..3821.439 rows=10 loops=1)
-> Sort (cost=293583.98..298573.87 rows=1995958 width=104) (actual time=3821.391..3821.415 rows=100 loops=1)
Sort Key: access_logs.userid
Sort Method: top-N heapsort Memory: 51kB
-> Hash Left Join (cost=73231.06..217299.90 rows=1995958 width=104) (actual time=539.859..3168.754 rows=1995958 loops=1)
Hash Cond: (users.userid = access_logs.userid)
-> Seq Scan on users (cost=0.00..44814.58 rows=1995958 width=76) (actual time=0.009..443.260 rows=1995958 loops=1)
-> Hash (cost=34636.58..34636.58 rows=1995958 width=28) (actual time=539.112..539.112 rows=1995958 loops=1)
Buckets: 262144 Batches: 2 Memory Usage: 58532kB
-> Seq Scan on access_logs (cost=0.00..34636.58 rows=1995958 width=28) (actual time=0.006..170.061 rows=1995958 loops=1)
Planning time: 0.480 ms
Execution time: 3832.245 ms
Questions
The second query is slow since the sorting is done before the join as in the plan.
Why does the sort in the second table not use the index? There is a plan below with just the sort.
Query - explain analyze select * from access_logs order by userid limit 10 offset 90;
Plan
Limit (cost=5.41..5.96 rows=10 width=28) (actual time=0.199..0.218 rows=10 loops=1)
-> Index Scan using access_logs_userid_idx on access_logs (cost=0.43..110471.45 rows=1995958 width=28) (actual time=0.029..0.201 rows=100 loops=1)
Planning time: 0.120 ms
Execution time: 0.252 ms
Edit 1:
My goal is not to compare both queries, in fact I want the result as in query 2,I just provided query 1 so that I can understand in comparison.
The order by is not limited to the join column, the user can also do order by another column in table 2, plans are below.
learning=# explain analyze select * from users left join access_logs on users.userid=access_logs.userid order by access_logs.last_login limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=260431.83..260431.86 rows=10 width=104) (actual time=3846.625..3846.627 rows=10 loops=1)
-> Sort (cost=260431.83..265421.73 rows=1995958 width=104) (actual time=3846.623..3846.623 rows=10 loops=1)
Sort Key: access_logs.last_login
Sort Method: top-N heapsort Memory: 27kB
-> Hash Left Join (cost=73231.06..217299.90 rows=1995958 width=104) (actual time=567.104..3174.818 rows=1995958 loops=1)
Hash Cond: (users.userid = access_logs.userid)
-> Seq Scan on users (cost=0.00..44814.58 rows=1995958 width=76) (actual time=0.007..443.364 rows=1995958 loops=1)
-> Hash (cost=34636.58..34636.58 rows=1995958 width=28) (actual time=566.814..566.814 rows=1995958 loops=1)
Buckets: 262144 Batches: 2 Memory Usage: 58532kB
-> Seq Scan on access_logs (cost=0.00..34636.58 rows=1995958 width=28) (actual time=0.004..169.137 rows=1995958 loops=1)
Planning time: 0.490 ms
Execution time: 3857.171 ms
Sort in the second query would not use index because the index is not guaranteed to have all the values being sorted. If there are some records in users not matched by access_logs then Left Join would generate null values referenced in query as access_logs.userid but not actually present in access_logs and thus not covered by the index.
The workaround can be to create a default initial record in access_log for each user and use Inner Join.
The query is used very often in the app and is too expensive.
What are the things I can do to optimise it and bring the total time to milliseconds (rather than hundreds of ms)?
NOTES:
removing DISTINCT improves (down to ~460ms), but I need to to get rid of cartesian product :( (yeah, show better way of avoiding it)
removing OREDER BY name improves, but not significantly.
The query:
SELECT DISTINCT properties.*
FROM properties JOIN developments ON developments.id = properties.development_id
-- Development allocations
LEFT JOIN allocation_items AS dev_items ON dev_items.development_id = properties.development_id
LEFT JOIN allocations AS dev_allocs ON dev_items.allocation_id = dev_allocs.id
-- Group allocations
LEFT JOIN properties_property_groupings ppg ON ppg.property_id = properties.id
LEFT JOIN property_groupings pg ON pg.id = ppg.property_grouping_id
LEFT JOIN allocation_items prop_items ON prop_items.property_grouping_id = pg.id
LEFT JOIN allocations prop_allocs ON prop_allocs.id = prop_items.allocation_id
WHERE
(properties.status <> 'deleted') AND ((
properties.status <> 'inactive'
AND (
(dev_allocs.receiving_company_id = 175 OR prop_allocs.receiving_company_id = 175)
AND developments.status = 'active'
)
OR developments.company_id = 175
)
AND EXISTS (
SELECT 1 FROM development_participations dp
JOIN participations p ON p.id = dp.participation_id
WHERE dp.allowed
AND p.user_id = 387 AND p.company_id = 175
AND dp.development_id = properties.development_id
LIMIT 1
)
)
ORDER BY properties.name
EXPLAIN ANALYZE
Unique (cost=72336.86..72517.53 rows=1606 width=4336) (actual time=703.766..710.920 rows=219 loops=1)
-> Sort (cost=72336.86..72340.87 rows=1606 width=4336) (actual time=703.765..704.698 rows=5091 loops=1)
Sort Key: properties.name, properties.id, properties.status, properties.level, etc etc (all columns)
Sort Method: external sort Disk: 1000kB
-> Nested Loop Left Join (cost=0.00..69258.84 rows=1606 width=4336) (actual time=25.230..366.489 rows=5091 loops=1)
Filter: ((((properties.status)::text <> 'inactive'::text) AND ((dev_allocs.receiving_company_id = 175) OR (prop_allocs.receiving_company_id = 175)) AND ((developments.status)::text = 'active'::text)) OR (developments.company_id = 175))
-> Nested Loop Left Join (cost=0.00..57036.99 rows=41718 width=4355) (actual time=25.122..247.587 rows=99567 loops=1)
-> Nested Loop Left Join (cost=0.00..47616.39 rows=21766 width=4355) (actual time=25.111..163.827 rows=39774 loops=1)
-> Nested Loop Left Join (cost=0.00..41508.16 rows=21766 width=4355) (actual time=25.101..112.452 rows=39774 loops=1)
-> Nested Loop Left Join (cost=0.00..34725.22 rows=21766 width=4351) (actual time=25.087..68.104 rows=19887 loops=1)
-> Nested Loop Left Join (cost=0.00..28613.00 rows=21766 width=4351) (actual time=25.076..39.360 rows=19887 loops=1)
-> Nested Loop (cost=0.00..27478.54 rows=1147 width=4347) (actual time=25.059..29.966 rows=259 loops=1)
-> Index Scan using developments_pkey on developments (cost=0.00..25.17 rows=49 width=15) (actual time=0.048..0.127 rows=48 loops=1)
Filter: (((status)::text = 'active'::text) OR (company_id = 175))
-> Index Scan using index_properties_on_development_id on properties (cost=0.00..559.95 rows=26 width=4336) (actual time=0.534..0.618 rows=5 loops=48)
Index Cond: (development_id = developments.id)
Filter: (((status)::text <> 'deleted'::text) AND (SubPlan 1))
SubPlan 1
-> Limit (cost=0.00..10.00 rows=1 width=0) (actual time=0.011..0.011 rows=0 loops=2420)
-> Nested Loop (cost=0.00..10.00 rows=1 width=0) (actual time=0.011..0.011 rows=0 loops=2420)
Join Filter: (dp.participation_id = p.id)
-> Seq Scan on development_participations dp (cost=0.00..1.71 rows=1 width=4) (actual time=0.004..0.008 rows=1 loops=2420)
Filter: (allowed AND (development_id = properties.development_id))
-> Index Scan using index_participations_on_user_id on participations p (cost=0.00..8.27 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=3148)
Index Cond: (user_id = 387)
Filter: (company_id = 175)
-> Index Scan using index_allocation_items_on_development_id on allocation_items dev_items (cost=0.00..0.70 rows=23 width=8) (actual time=0.003..0.016 rows=77 loops=259)
Index Cond: (development_id = properties.development_id)
-> Index Scan using allocations_pkey on allocations dev_allocs (cost=0.00..0.27 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=19887)
Index Cond: (dev_items.allocation_id = id)
-> Index Scan using index_properties_property_groupings_on_property_id on properties_property_groupings ppg (cost=0.00..0.29 rows=2 width=8) (actual time=0.001..0.001 rows=2 loops=19887)
Index Cond: (property_id = properties.id)
-> Index Scan using property_groupings_pkey on property_groupings pg (cost=0.00..0.27 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=39774)
Index Cond: (id = ppg.property_grouping_id)
-> Index Scan using index_allocation_items_on_property_grouping_id on allocation_items prop_items (cost=0.00..0.36 rows=6 width=8) (actual time=0.001..0.001 rows=2 loops=39774)
Index Cond: (property_grouping_id = pg.id)
-> Index Scan using allocations_pkey on allocations prop_allocs (cost=0.00..0.27 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=99567)
Index Cond: (id = prop_items.allocation_id)
Total runtime: 716.692 ms
(39 rows)
Answering my own question.
This query has 2 big issues:
6 LEFT JOINs that produce cartesian product (resulting in billion-s of records even on small dataset).
DISTINCT that has to sort that billion records dataset.
So I had to eliminate those.
The way I did it is by replacing JOINs with 2 subqueries (won't provide it here since it should be pretty obvious).
As a result, the actual time went from ~700-800ms down to ~45ms which is more or less acceptable.
Most time is spend in the disk sort, you should use RAM by changing work_mem:
SET work_mem TO '20MB';
And check EXPLAIN ANALYZE again