I have the following table structure:
AdPerformance: id, ad_id, impressions
Targeting: value
AdActions: app_starts
Ad: id, name, parent_id
AdTargeting: id, targeting_, ad_id
Targeting: id, name, value
AdProduct: id, ad_id, name
I need to aggregate the data by targeting, restricted to a given product name, so I wrote the following query:
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
sum(impressions) AS impressions,
sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad on ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN (
SELECT ad_id, value from targeting, ad_targeting
WHERE targeting.id = ad_targeting.id and targeting.name = 'gender'
) targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN
(SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting_value
However, EXPLAIN ANALYZE shows the query above taking about 5 s for ~1M records.
Is there a way to improve it?
I do have indexes on the foreign keys.
UPDATED
See the output of EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=5787.28..5789.87 rows=259 width=254) (actual time=3283.763..3286.015 rows=5998 loops=1)
Group Key: adobject_performance.ad_id, targeting.value
Buffers: shared hit=3400223
-> Nested Loop Left Join (cost=241.63..5603.63 rows=8162 width=254) (actual time=10.438..2774.664 rows=839720 loops=1)
Buffers: shared hit=3400223
-> Nested Loop (cost=241.21..1590.52 rows=8162 width=250) (actual time=10.412..703.818 rows=839720 loops=1)
Join Filter: (adobject.id = adobject_performance.ad_id)
Buffers: shared hit=36755
-> Hash Join (cost=240.78..323.35 rows=9 width=226) (actual time=10.380..20.332 rows=5998 loops=1)
Hash Cond: (ad_product.ad_id = ad.id)
Buffers: shared hit=190
-> HashAggregate (cost=128.98..188.96 rows=5998 width=4) (actual time=3.788..6.821 rows=5998 loops=1)
Group Key: ad_product.ad_id
Buffers: shared hit=39
-> Seq Scan on ad_product (cost=0.00..113.99 rows=5998 width=4) (actual time=0.011..1.726 rows=5998 loops=1)
Filter: ((product)::text = 'ft2_iPhone'::text)
Rows Removed by Filter: 1
Buffers: shared hit=39
-> Hash (cost=111.69..111.69 rows=9 width=222) (actual time=6.578..6.578 rows=5998 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 241kB
Buffers: shared hit=151
-> Hash Join (cost=30.26..111.69 rows=9 width=222) (actual time=0.154..4.660 rows=5998 loops=1)
Hash Cond: (adobject.parent_id = adobject_targeting.ad_id)
Buffers: shared hit=151
-> Seq Scan on adobject (cost=0.00..77.97 rows=897 width=8) (actual time=0.009..1.449 rows=6001 loops=1)
Buffers: shared hit=69
-> Hash (cost=30.24..30.24 rows=2 width=222) (actual time=0.132..0.132 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
Buffers: shared hit=82
-> Nested Loop (cost=0.15..30.24 rows=2 width=222) (actual time=0.101..0.129 rows=2 loops=1)
Buffers: shared hit=82
-> Seq Scan on targeting (cost=0.00..13.88 rows=2 width=222) (actual time=0.015..0.042 rows=79 loops=1)
Filter: (name = 'age group'::targeting_name)
Rows Removed by Filter: 82
Buffers: shared hit=1
-> Index Scan using advertising_targeting_pkey on adobject_targeting (cost=0.15..8.17 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=79)
Index Cond: (id = targeting.id)
Buffers: shared hit=81
-> Index Scan using "fki_advertising_peformance_advertising_entity_id -> advertising" on adobject_performance (cost=0.42..89.78 rows=4081 width=32) (actual time=0.007..0.046 rows=140 loops=5998)
Index Cond: (ad_id = ad_product.ad_id)
Buffers: shared hit=36565
-> Index Scan using facebook_advertising_actions_pkey on facebook_adobject_actions (cost=0.42..0.48 rows=1 width=12) (actual time=0.001..0.002 rows=1 loops=839720)
Index Cond: (ad_performance.id = ad_performance_id)
Buffers: shared hit=3363468
Planning time: 1.525 ms
Execution time: 3287.324 ms
(46 rows)
This is a shot in the dark, since the EXPLAIN output had not been posted when I wrote this, but Postgres may handle this query better if you move the targeting subquery into a CTE:
WITH targeting AS
(
SELECT ad_id, value from targeting, ad_targeting
WHERE targeting.id = ad_targeting.id and targeting.name = 'gender'
)
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
sum(impressions) AS impressions,
sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad on ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN
(SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting_value
Taken from the Documentation:
A useful property of WITH queries is that they are evaluated only once
per execution of the parent query, even if they are referred to more
than once by the parent query or sibling WITH queries. Thus, expensive
calculations that are needed in multiple places can be placed within a
WITH query to avoid redundant work. Another possible application is to
prevent unwanted multiple evaluations of functions with side-effects.
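One caveat about the quoted passage: since PostgreSQL 12, a side-effect-free CTE that is referenced only once is inlined into the parent query by default, so on those versions a plain WITH no longer acts as an optimization barrier. A minimal sketch of requesting the older behaviour explicitly, assuming PostgreSQL 12 or later. The CTE name gender_targeting is hypothetical (chosen so it does not shadow the targeting table), and the join condition is copied unchanged from the question:

```sql
-- Sketch only: MATERIALIZED requires PostgreSQL 12+.
-- gender_targeting is a hypothetical name, not from the original post.
WITH gender_targeting AS MATERIALIZED (
    SELECT ad_targeting.ad_id, targeting.value
    FROM targeting
    JOIN ad_targeting ON targeting.id = ad_targeting.id
    WHERE targeting.name = 'gender'
)
SELECT ad_performance.ad_id,
       gender_targeting.value AS targeting_value,
       sum(impressions) AS impressions,
       sum(app_starts)  AS app_starts
FROM ad_performance
LEFT JOIN ad ON ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN gender_targeting ON gender_targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN
      (SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, gender_targeting.value;
```

Whether materializing helps here is not certain; comparing both variants with EXPLAIN (ANALYZE, BUFFERS) is the only reliable check.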
The execution plan does not seem to match the query any more (maybe you can update the query).
However, the problem now is here:
-> Hash Join (cost=30.26..111.69 rows=9 width=222)
(actual time=0.154..4.660 rows=5998 loops=1)
Hash Cond: (adobject.parent_id = adobject_targeting.ad_id)
Buffers: shared hit=151
-> Seq Scan on adobject (cost=0.00..77.97 rows=897 width=8)
(actual time=0.009..1.449 rows=6001 loops=1)
Buffers: shared hit=69
-> Hash (cost=30.24..30.24 rows=2 width=222)
(actual time=0.132..0.132 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
Buffers: shared hit=82
-> Nested Loop (cost=0.15..30.24 rows=2 width=222)
(actual time=0.101..0.129 rows=2 loops=1)
Buffers: shared hit=82
-> Seq Scan on targeting (cost=0.00..13.88 rows=2 width=222)
(actual time=0.015..0.042 rows=79 loops=1)
Filter: (name = 'age group'::targeting_name)
Rows Removed by Filter: 82
Buffers: shared hit=1
-> Index Scan using advertising_targeting_pkey on adobject_targeting
(cost=0.15..8.17 rows=1 width=8)
(actual time=0.001..0.001 rows=0 loops=79)
Index Cond: (id = targeting.id)
Buffers: shared hit=81
This is a join between adobject and the result of
targeting JOIN adobject_targeting
USING (id)
WHERE targeting.name = 'age group'
The latter subquery is correctly estimated to 2 rows, but PostgreSQL fails to notice that almost all rows found in adobject will match one of those two rows, so that the result of the join will be 6000 rather than the 9 it estimates.
This causes the optimizer to wrongly choose a nested loop join later on, where more than half of the query time is spent.
Unfortunately, since PostgreSQL doesn't have cross-table statistics, there is no way for PostgreSQL to know better.
One coarse measure is to SET enable_nestloop = off, but that will also degrade the other (correctly chosen) nested loop join, so I don't know whether it will be a net win.
If it does help, consider changing the parameter only for the duration of the query (inside a transaction, with SET LOCAL).
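A minimal sketch of the SET LOCAL approach, using the query text from the question as a stand-in (the posted plan appears to come from a slightly different query). The setting reverts automatically when the transaction ends:

```sql
BEGIN;
-- Affects only the current transaction; other sessions are untouched.
SET LOCAL enable_nestloop = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
       sum(impressions) AS impressions,
       sum(app_starts)  AS app_starts
FROM ad_performance
LEFT JOIN ad ON ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN (
    SELECT ad_id, value FROM targeting, ad_targeting
    WHERE targeting.id = ad_targeting.id AND targeting.name = 'gender'
) targeting ON targeting.ad_id = ad.parent_id
WHERE ad_performance.ad_id IN
      (SELECT ad_id FROM ad_product WHERE product = 'iphone')
GROUP BY ad_performance.ad_id, targeting.value;

COMMIT;  -- enable_nestloop returns to its previous value here
```

Comparing this plan's execution time against the original shows whether disabling nested loops is a net win for this query.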
Maybe there is a way to rewrite the query so that a better plan can be found, but that is hard to say without knowing the exact query.
I don't know whether this query will solve your problem, but try it:
SELECT ad_performance.ad_id, targeting.value AS targeting_value,
sum(impressions) AS impressions,
sum(app_starts) AS app_starts
FROM ad_performance
LEFT JOIN ad on ad.id = ad_performance.ad_id
LEFT JOIN ad_actions ON ad_performance.id = ad_actions.ad_performance_id
RIGHT JOIN ad_targeting on ad_targeting.ad_id = ad.parent_id
INNER JOIN targeting on targeting.id = ad_targeting.id and targeting.name = 'gender'
INNER JOIN ad_product on ad_product.ad_id = ad_performance.ad_id
WHERE ad_product.product = 'iphone'
GROUP BY ad_performance.ad_id, targeting_value
You could also create indexes on the columns you use in ON or WHERE conditions.
Related
Given this query:
SELECT count(u.*)
FROM res_users u
WHERE active=true AND
share=false AND
NOT exists(SELECT 1 FROM res_users_log WHERE create_uid=u.id);
It currently takes 10 seconds.
I tried to make it faster with these 2 index commands, but it didn't help.
CREATE INDEX CONCURRENTLY id_active_share_index ON res_users (id,active,share);
CREATE INDEX CONCURRENTLY create_uid_index ON res_users_log (create_uid);
I guess it's because of the "NOT exists" line, but I have no idea how to include it into an index.
EXPLAIN (ANALYZE, BUFFERS) gives me this output:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2815437.14..2815437.15 rows=1 width=8) (actual time=39174.365..39174.367 rows=1 loops=1)
Buffers: shared hit=124 read=112875 dirtied=70, temp read=98788 written=99211
-> Merge Anti Join (cost=2678572.70..2815437.09 rows=20 width=1064) (actual time=39174.360..39174.361 rows=0 loops=1)
Merge Cond: (u.id = res_users_log.create_uid)
Buffers: shared hit=124 read=112875 dirtied=70, temp read=98788 written=99211
-> Sort (cost=11.92..11.97 rows=20 width=1068) (actual time=5.577..5.602 rows=16 loops=1)
Sort Key: u.id
Sort Method: quicksort Memory: 79kB
Buffers: shared hit=53 read=5
-> Seq Scan on res_users u (cost=0.00..11.49 rows=20 width=1068) (actual time=0.050..5.519 rows=16 loops=1)
Filter: (active AND (NOT share))
Rows Removed by Filter: 33
Buffers: shared hit=49 read=5
-> Sort (cost=2678560.78..2716236.90 rows=15070449 width=4) (actual time=36258.796..38013.471 rows=15069209 loops=1)
Sort Key: res_users_log.create_uid
Sort Method: external merge Disk: 206464kB
Buffers: shared hit=71 read=112870 dirtied=70, temp read=98788 written=99211
-> Seq Scan on res_users_log (cost=0.00..263645.49 rows=15070449 width=4) (actual time=1.755..29961.086 rows=15069319 loops=1)
Buffers: shared hit=71 read=112870 dirtied=70
Planning Time: 0.889 ms
Execution Time: 39202.694 ms
(21 rows)
For this query:
SELECT count(*)
FROM res_users u
WHERE active = true AND
share = false AND
NOT exists (SELECT 1 FROM res_users_log rul WHERE rul.create_uid = u.id);
You want indexes on:
res_users(active, share, id)
res_users_log(create_uid)
Note that the ordering of the columns matters.
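The two indexes written out as statements, as a sketch. The index names are hypothetical, and CONCURRENTLY is used only because the question did so. A CREATE INDEX CONCURRENTLY that fails or is interrupted leaves behind an invalid index that the planner ignores, so the catalog query at the end checks for that:

```sql
-- Hypothetical index names; column order follows the answer above.
CREATE INDEX CONCURRENTLY res_users_active_share_id_idx
    ON res_users (active, share, id);
CREATE INDEX CONCURRENTLY res_users_log_create_uid_idx
    ON res_users_log (create_uid);

-- Refresh planner statistics for the large table.
ANALYZE res_users_log;

-- indisvalid = false means the index exists but cannot be used by queries.
SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indrelid IN ('res_users'::regclass, 'res_users_log'::regclass);
```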
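This index should make the query much faster, because the anti-join can then probe res_users_log by create_uid for each of the few matching users instead of scanning and sorting all 15 million log rows: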
CREATE INDEX ON res_users_log (create_uid);
From pg_stat_statements I have this query that's taking 900 ms on average. What is the recommended way going forward in optimizing this query? I do have indexes but not sure where the bottleneck could be. Here's the EXPLAIN ANALYZE.
EXPLAIN ANALYZE
SELECT "listing_variants".*
FROM "listing_variants"
INNER JOIN "links" ON "links"."listing_variant_id" = "listing_variants"."id"
INNER JOIN "product_variants" ON "product_variants"."id" = "links"."product_variant_id"
INNER JOIN "listings" ON "listing_variants"."listing_id" = "listings"."id"
WHERE "listings"."sales_channel_id" = 31
AND "listing_variants"."is_linked" = 'f'
AND (listing_variants.available_quantity != product_variants.available_quantity);
gives
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=5283.71..6960.01 rows=524 width=232) (actual time=54.138..54.138 rows=0 loops=1)
Join Filter: (listing_variants.available_quantity <> product_variants.available_quantity)
-> Hash Join (cost=5283.42..6648.69 rows=720 width=236) (actual time=54.137..54.137 rows=0 loops=1)
Hash Cond: (links.listing_variant_id = listing_variants.id)
-> Index Only Scan using index_on_product_listing_variant_id on links (cost=0.29..1205.14 rows=30643 width=8) (actual time=0.026..6.112 rows=30863 loops=1)
Heap Fetches: 6799
-> Hash (cost=5261.53..5261.53 rows=1728 width=232) (actual time=45.407..45.407 rows=368 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 65kB
-> Hash Join (cost=1671.82..5261.53 rows=1728 width=232) (actual time=11.147..45.075 rows=368 loops=1)
Hash Cond: (listing_variants.listing_id = listings.id)
-> Seq Scan on listing_variants (cost=0.00..3412.77 rows=42577 width=232) (actual time=0.018..29.882 rows=42713 loops=1)
Filter: (NOT is_linked)
Rows Removed by Filter: 30863
-> Hash (cost=1661.68..1661.68 rows=811 width=4) (actual time=10.585..10.585 rows=811 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 29kB
-> Bitmap Heap Scan on listings (cost=30.57..1661.68 rows=811 width=4) (actual time=0.362..10.224 rows=811 loops=1)
Recheck Cond: (sales_channel_id = 31)
Heap Blocks: exact=668
-> Bitmap Index Scan on index_listings_on_sales_channel_ext_svc_updated (cost=0.00..30.37 rows=811 width=0) (actual time=0.242..0.242 rows=821 loops=1)
Index Cond: (sales_channel_id = 31)
-> Index Scan using product_variants_pkey on product_variants (cost=0.29..0.42 rows=1 width=12) (never executed)
Index Cond: (id = links.product_variant_id)
Planning time: 1.437 ms
Execution time: 54.366 ms
Thanks!
Prefer JOIN over EXISTS only when you need to select columns from multiple tables, which you are not doing here. That is the first optimization step. In your case the joins pollute the result set: each listing_variants row is returned once for every matching row in the joined tables.
SELECT "listing_variants".*
FROM "listing_variants"
WHERE "listing_variants"."is_linked" = 'f'
AND EXISTS (SELECT 1
FROM "links"
JOIN "product_variants" ON "product_variants"."id" = "links"."product_variant_id"
JOIN "listings" ON "listings"."id" = "listing_variants"."listing_id"
WHERE "links"."listing_variant_id" = "listing_variants"."id"
AND "listing_variants"."available_quantity" != "product_variants"."available_quantity"
AND "listings"."sales_channel_id" = 31);
Other than that, your query is fairly straightforward; beyond this rewrite, only good indexing and data partitioning are likely to help further.
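As one possible indexing step: in the posted plan, the sequential scan on listing_variants (filtered by NOT is_linked) accounts for roughly 30 ms of the 54 ms total. A hedged sketch of a partial index that might let the planner start from the 811 matching listings instead. The index name is hypothetical, and whether the planner actually uses it depends on your data distribution:

```sql
-- Hypothetical index name. Covers only rows that are not linked,
-- keyed by the column used to join to listings.
CREATE INDEX listing_variants_unlinked_listing_id_idx
    ON listing_variants (listing_id)
    WHERE NOT is_linked;

-- Refresh statistics so the planner can cost the new index.
ANALYZE listing_variants;
```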
SUMMARY
Adding certain criteria in a multi-table join with many rows ends up with orders of magnitude slower queries. I've tried a lot of things to make this faster, including every type of table join, reordering the joins, reordering the WHERE clause, doing subqueries, using CASE statements in the WHERE clause, etc.
The SQL specifics are below.
QUESTIONS
Why does the addition of this simple condition cause the planner to drastically change its execution plan?
Is it possible to tell the planner how to analyze a specific condition first without drastically changing the query or doing subqueries (using WITH for example)?
Note: I am attempting to write a generic SQL builder for an API, allowing callers to specify arbitrary conditions at any point in the graph. The issue is that some of these calls are blazing fast and others are not due to the way Postgres plans executions. Optimizations crafted specifically for this query will not help me satisfy the larger goal of a generic SQL builder.
DETAILS
I have a schema that stores vertices and edges (a simplistic graph database) in Postgres:
CREATE TABLE IF NOT EXISTS vertex (type text, id serial, name text, data jsonb, UNIQUE (id));
CREATE INDEX vertex_data_idx ON vertex USING gin (data jsonb_path_ops);
CREATE INDEX vertex_type_idx ON vertex (type);
CREATE INDEX vertex_name_idx ON vertex (name);
CREATE TABLE IF NOT EXISTS edge (src integer REFERENCES vertex (id), dst integer REFERENCES vertex (id));
CREATE INDEX edge_src_idx ON edge (src);
CREATE INDEX edge_dst_idx ON edge (dst);
The schema stores graphs, one of which is like this: PLANET --> CONTINENT --> COUNTRY --> REGION
There are 447554 total vertices and 3155047 total edges in my sample database, but the data that is relevant is here:
5 PLANETs (each relates to 5 CONTINENTs)
25 CONTINENTs (each relates to 2500 COUNTRYs)
62500 COUNTRYs (25% of which relate to 100 REGIONs each, the rest have no REGION relationships)
250000 REGIONs
This query, which looks for planets that have Spanish speakers in any region, is fast:
SELECT DISTINCT
v1.name as name, v1.id as id
FROM vertex v1
LEFT JOIN edge e1 ON (v1.id = e1.src)
LEFT JOIN vertex v2 ON (v2.id = e1.dst)
LEFT JOIN edge e2 ON (v2.id = e2.src)
LEFT JOIN vertex v3 ON (v3.id = e2.dst)
LEFT JOIN edge e3 ON (v3.id = e3.src)
LEFT JOIN vertex v4 ON (v4.id = e3.dst)
WHERE
v4.type = 'REGION' AND
v4.data @> '{"languages":["spanish"]}'::jsonb
Planning time: 6.289 ms
Execution time: 0.744 ms
When I add a condition on an indexed column in the first table in the graph (v1) that has no effect on the outcome, the query is roughly 120,000 times slower:
SELECT DISTINCT
v1.name as name, v1.id as id
FROM vertex v1
LEFT JOIN edge e1 ON (v1.id = e1.src)
LEFT JOIN vertex v2 ON (v2.id = e1.dst)
LEFT JOIN edge e2 ON (v2.id = e2.src)
LEFT JOIN vertex v3 ON (v3.id = e2.dst)
LEFT JOIN edge e3 ON (v3.id = e3.src)
LEFT JOIN vertex v4 ON (v4.id = e3.dst)
WHERE
v1.type = 'PLANET' AND
v4.type = 'REGION' AND
v4.data @> '{"languages":["spanish"]}'::jsonb
Planning time: 7.664 ms
Execution time: 89010.096 ms
This is the EXPLAIN (ANALYZE, BUFFERS) on the first, fast call:
Unique (cost=154592.03..155453.96 rows=114925 width=28) (actual time=0.585..0.616 rows=4 loops=1)
Buffers: shared hit=92
-> Sort (cost=154592.03..154879.34 rows=114925 width=28) (actual time=0.579..0.588 rows=4 loops=1)
Sort Key: v1.name, v1.id
Sort Method: quicksort Memory: 17kB
Buffers: shared hit=92
-> Nested Loop (cost=37.96..142377.39 rows=114925 width=28) (actual time=0.155..0.549 rows=4 loops=1)
Buffers: shared hit=92
-> Nested Loop (cost=37.53..80131.76 rows=114925 width=4) (actual time=0.141..0.468 rows=4 loops=1)
Join Filter: (v2.id = e1.dst)
Buffers: shared hit=76
-> Nested Loop (cost=37.10..49179.08 rows=14270 width=8) (actual time=0.126..0.386 rows=4 loops=1)
Buffers: shared hit=60
-> Nested Loop (cost=36.68..41450.17 rows=14270 width=4) (actual time=0.112..0.304 rows=4 loops=1)
Join Filter: (v3.id = e2.dst)
Buffers: shared hit=44
-> Nested Loop (cost=36.25..37606.57 rows=1772 width=8) (actual time=0.092..0.209 rows=4 loops=1)
Buffers: shared hit=28
-> Nested Loop (cost=35.83..36646.82 rows=1772 width=4) (actual time=0.074..0.116 rows=4 loops=1)
Buffers: shared hit=12
-> Bitmap Heap Scan on vertex v4 (cost=30.99..1514.00 rows=220 width=4) (actual time=0.039..0.042 rows=1 loops=1)
Recheck Cond: (data @> '{"languages":["spanish"]}'::jsonb)
Filter: (type = 'REGION'::text)
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on vertex_data_idx (cost=0.00..30.94 rows=392 width=0) (actual time=0.020..0.020 rows=1 loops=1)
Index Cond: (data @> '{"languages":["spanish"]}'::jsonb)
Buffers: shared hit=4
-> Bitmap Heap Scan on edge e3 (cost=4.84..159.12 rows=57 width=8) (actual time=0.023..0.037 rows=4 loops=1)
Recheck Cond: (dst = v4.id)
Heap Blocks: exact=4
Buffers: shared hit=7
-> Bitmap Index Scan on edge_dst_idx (cost=0.00..4.82 rows=57 width=0) (actual time=0.013..0.013 rows=4 loops=1)
Index Cond: (dst = v4.id)
Buffers: shared hit=3
-> Index Only Scan using vertex_id_key on vertex v3 (cost=0.42..0.53 rows=1 width=4) (actual time=0.008..0.011 rows=1 loops=4)
Index Cond: (id = e3.src)
Heap Fetches: 4
Buffers: shared hit=16
-> Index Scan using edge_dst_idx on edge e2 (cost=0.43..1.46 rows=57 width=8) (actual time=0.008..0.011 rows=1 loops=4)
Index Cond: (dst = e3.src)
Buffers: shared hit=16
-> Index Only Scan using vertex_id_key on vertex v2 (cost=0.42..0.53 rows=1 width=4) (actual time=0.006..0.009 rows=1 loops=4)
Index Cond: (id = e2.src)
Heap Fetches: 4
Buffers: shared hit=16
-> Index Scan using edge_dst_idx on edge e1 (cost=0.43..1.46 rows=57 width=8) (actual time=0.005..0.008 rows=1 loops=4)
Index Cond: (dst = e2.src)
Buffers: shared hit=16
-> Index Scan using vertex_id_key on vertex v1 (cost=0.42..0.53 rows=1 width=28) (actual time=0.006..0.009 rows=1 loops=4)
Index Cond: (id = e1.src)
Buffers: shared hit=16
Planning time: 6.940 ms
Execution time: 0.714 ms
And on the second, slow call:
HashAggregate (cost=592.23..592.24 rows=1 width=28) (actual time=89009.873..89009.885 rows=4 loops=1)
Group Key: v1.name, v1.id
Buffers: shared hit=11644657 read=1240045
-> Nested Loop (cost=2.98..592.22 rows=1 width=28) (actual time=9098.961..89009.833 rows=4 loops=1)
Buffers: shared hit=11644657 read=1240045
-> Nested Loop (cost=2.56..306.89 rows=522 width=32) (actual time=0.424..30066.007 rows=3092522 loops=1)
Buffers: shared hit=454795 read=46267
-> Nested Loop (cost=2.13..86.31 rows=65 width=36) (actual time=0.306..2120.293 rows=62500 loops=1)
Buffers: shared hit=239162 read=12162
-> Nested Loop (cost=1.70..51.10 rows=65 width=32) (actual time=0.261..574.490 rows=62500 loops=1)
Buffers: shared hit=488 read=562
-> Nested Loop (cost=1.27..23.95 rows=8 width=36) (actual time=0.205..1.206 rows=25 loops=1)
Buffers: shared hit=109 read=17
-> Nested Loop (cost=0.85..19.62 rows=8 width=32) (actual time=0.173..0.547 rows=25 loops=1)
Buffers: shared hit=12 read=14
-> Index Scan using vertex_type_idx on vertex v1 (cost=0.42..8.44 rows=1 width=28) (actual time=0.123..0.153 rows=5 loops=1)
Index Cond: (type = 'PLANET'::text)
Buffers: shared hit=2 read=4
-> Index Scan using edge_src_idx on edge e1 (cost=0.43..10.18 rows=100 width=8) (actual time=0.021..0.039 rows=5 loops=5)
Index Cond: (src = v1.id)
Buffers: shared hit=10 read=10
-> Index Only Scan using vertex_id_key on vertex v2 (cost=0.42..0.53 rows=1 width=4) (actual time=0.009..0.013 rows=1 loops=25)
Index Cond: (id = e1.dst)
Heap Fetches: 25
Buffers: shared hit=97 read=3
-> ... (cost=0.43..2.39 rows=100 width=8) (actual time=0.031..8.504 rows=2500 loops=25)
Index Cond: (src = v2.id)
Buffers: shared hit=379 read=545
-> Index Only Scan using vertex_id_key on vertex v3 (cost=0.42..0.53 rows=1 width=4) (actual time=0.010..0.013 rows=1 loops=62500)
Index Cond: (id = e2.dst)
Heap Fetches: 62500
Buffers: shared hit=238674 read=11600
-> Index Scan using edge_src_idx on edge e3 (cost=0.43..2.39 rows=100 width=8) (actual time=0.013..0.163 rows=49 loops=62500)
Index Cond: (src = v3.id)
Buffers: shared hit=215633 read=34105
-> Index Scan using vertex_id_key on vertex v4 (cost=0.42..0.54 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=3092522)
Index Cond: (id = e3.dst)
Filter: ((data @> '{"languages":["spanish"]}'::jsonb) AND (type = 'REGION'::text))
Rows Removed by Filter: 1
Buffers: shared hit=11189862 read=1193778
Planning time: 7.664 ms
Execution time: 89010.096 ms
[posted as an answer, because I need the formatting]
The edge table desperately needs a primary key (this implies NOT NULL for {src,dst}, which is good):
CREATE TABLE IF NOT EXISTS edge
( src integer NOT NULL REFERENCES vertex (id)
, dst integer NOT NULL REFERENCES vertex (id)
, PRIMARY KEY (src,dst)
);
CREATE UNIQUE INDEX edge_dst_src_idx ON edge (dst, src);
-- the estimates in the question seem to be off, statistics may be absent.
VACUUM ANALYZE edge; -- refresh the statistics
VACUUM ANALYZE vertex;
And I'd combine the {type,name} indexes, too (type appears to have a very low cardinality). Maybe even make it UNIQUE and NOT NULL, but I don't know your data.
CREATE INDEX vertex_type_name_idx ON vertex (type, name);
I think wrapping the query in a subquery will prevent PostgreSQL from using the index on type. Try the following query to see whether performance improves when that index is not used:
select * from (
SELECT DISTINCT
v1.name as name, v1.id as id, v1.type as v1_type
FROM vertex v1
LEFT JOIN edge e1 ON (v1.id = e1.src)
LEFT JOIN vertex v2 ON (v2.id = e1.dst)
LEFT JOIN edge e2 ON (v2.id = e2.src)
LEFT JOIN vertex v3 ON (v3.id = e2.dst)
LEFT JOIN edge e3 ON (v3.id = e3.src)
LEFT JOIN vertex v4 ON (v4.id = e3.dst)
WHERE
v4.type = 'REGION' AND
v4.data @> '{"languages":["spanish"]}'::jsonb
) t1
where v1_type = 'PLANET'
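Note that PostgreSQL may push the outer v1_type condition down into the DISTINCT subquery, in which case the plan would be the same as the slow one. A commonly used workaround, sketched here on the assumption that the planner does not push conditions into a subquery that carries an OFFSET clause, is to add OFFSET 0. The @> containment operator is assumed for the jsonb condition:

```sql
-- Sketch only: OFFSET 0 is used as an optimization fence, so the
-- PLANET filter is applied after the inner query has been evaluated.
SELECT name, id
FROM (
    SELECT DISTINCT v1.name AS name, v1.id AS id, v1.type AS v1_type
    FROM vertex v1
    LEFT JOIN edge e1 ON (v1.id = e1.src)
    LEFT JOIN vertex v2 ON (v2.id = e1.dst)
    LEFT JOIN edge e2 ON (v2.id = e2.src)
    LEFT JOIN vertex v3 ON (v3.id = e2.dst)
    LEFT JOIN edge e3 ON (v3.id = e3.src)
    LEFT JOIN vertex v4 ON (v4.id = e3.dst)
    WHERE v4.type = 'REGION'
      AND v4.data @> '{"languages":["spanish"]}'::jsonb
    OFFSET 0
) t1
WHERE v1_type = 'PLANET';
```

This is a query-specific trick, so it may not fit the goal of a generic SQL builder; it mainly helps confirm whether the join order is the cause of the slowdown.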
I have a Rails application with the ability to filter records by state_code. I noticed that when I pass 'CA' as the search term I get my results almost instantly. If I pass 'AZ', for example, it takes more than a minute.
I have no idea why.
Below are the EXPLAIN ANALYZE outputs from psql:
Fast one:
EXPLAIN ANALYZE SELECT
accounts.id
FROM "accounts"
LEFT OUTER JOIN "addresses"
ON "addresses"."addressable_id" = "accounts"."id"
AND "addresses"."address_type" = 'mailing'
AND "addresses"."addressable_type" = 'Account'
WHERE "accounts"."organization_id" = 16
AND (addresses.state_code IN ('CA'))
ORDER BY accounts.name DESC;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=4941.94..4941.94 rows=1 width=18) (actual time=74.810..74.969 rows=821 loops=1)
Sort Key: accounts.name
Sort Method: quicksort Memory: 75kB
-> Hash Join (cost=4.46..4941.93 rows=1 width=18) (actual time=70.044..73.148 rows=821 loops=1)
Hash Cond: (addresses.addressable_id = accounts.id)
-> Seq Scan on addresses (cost=0.00..4911.93 rows=6806 width=4) (actual time=0.027..65.547 rows=15244 loops=1)
Filter: (((address_type)::text = 'mailing'::text) AND ((addressable_type)::text = 'Account'::text) AND ((state_code)::text = 'CA'::text))
Rows Removed by Filter: 129688
-> Hash (cost=4.45..4.45 rows=1 width=18) (actual time=2.037..2.037 rows=1775 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 87kB
-> Index Scan using organization_id_index on accounts (cost=0.29..4.45 rows=1 width=18) (actual time=0.018..1.318 rows=1775 loops=1)
Index Cond: (organization_id = 16)
Planning time: 0.565 ms
Execution time: 75.224 ms
(14 rows)
Slow one:
EXPLAIN ANALYZE SELECT
accounts.id
FROM "accounts"
LEFT OUTER JOIN "addresses"
ON "addresses"."addressable_id" = "accounts"."id"
AND "addresses"."address_type" = 'mailing'
AND "addresses"."addressable_type" = 'Account'
WHERE "accounts"."organization_id" = 16
AND (addresses.state_code IN ('NV'))
ORDER BY accounts.name DESC;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=4917.27..4917.27 rows=1 width=18) (actual time=97091.270..97091.277 rows=25 loops=1)
Sort Key: accounts.name
Sort Method: quicksort Memory: 26kB
-> Nested Loop (cost=0.29..4917.26 rows=1 width=18) (actual time=844.250..97091.083 rows=25 loops=1)
Join Filter: (accounts.id = addresses.addressable_id)
Rows Removed by Join Filter: 915875
-> Index Scan using organization_id_index on accounts (cost=0.29..4.45 rows=1 width=18) (actual time=0.017..10.315 rows=1775 loops=1)
Index Cond: (organization_id = 16)
-> Seq Scan on addresses (cost=0.00..4911.93 rows=70 width=4) (actual time=0.110..54.521 rows=516 loops=1775)
Filter: (((address_type)::text = 'mailing'::text) AND ((addressable_type)::text = 'Account'::text) AND ((state_code)::text = 'NV'::text))
Rows Removed by Filter: 144416
Planning time: 0.308 ms
Execution time: 97091.325 ms
(13 rows)
The slow query returns 25 rows and the fast one returns 821 rows, which is even more confusing.
I solved it by running the VACUUM ANALYZE command from the psql command line.
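For context, a sketch of what that looks like for the two tables in this question. The slow plan estimated rows=1 for the index scan on accounts, which actually returned 1775 rows; that kind of mismatch is typical of stale statistics. The pg_stat_user_tables query is an optional check of when statistics were last gathered:

```sql
-- Reclaim dead rows and refresh planner statistics.
VACUUM ANALYZE accounts;
VACUUM ANALYZE addresses;

-- Optional: see when each table was last analyzed, manually or by autovacuum.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('accounts', 'addresses');
```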
I've got 2 tables:
event - list of events which should be processed (new files and dirs). Size: ~2M rows
dir_current - list of directories currently visible on filesystem. Size: ~1M but up to 100M in the future.
I use a stored procedure to process events and turn them into dir_current rows. The first step of processing is to find all event rows whose parent is not in the dir_current table. Unfortunately this gets a little more complicated, because the parent might be present in the event table itself, and such rows should not be included in the result. I came up with this query:
SELECT DISTINCT event.parent_path, event.depth FROM sf.event as event
LEFT OUTER JOIN sf.dir_current as dir ON
event.parent_path = dir.path
AND dir.volume_id = 1
LEFT OUTER JOIN sf.event as event2 ON
event.parent_path = event2.path
AND event2.volume_id = 1
AND event2.type = 'DIR'
AND event2.id <= MAX_ID_VARIABLE
WHERE
event.volume_id = 1
AND event.id <= MAX_ID_VARIABLE
AND dir.volume_id IS NULL
AND event2.id IS NULL
ORDER BY depth, parent_path;
MAX_ID_VARIABLE is a variable that limits the number of events processed at once.
Below is explain analyze result (explain.depesz.com):
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=395165.81..395165.82 rows=1 width=83) (actual time=32009.439..32049.675 rows=2462 loops=1)
-> Sort (cost=395165.81..395165.81 rows=1 width=83) (actual time=32009.432..32021.733 rows=184975 loops=1)
Sort Key: event.depth, event.parent_path
Sort Method: quicksort Memory: 38705kB
-> Nested Loop Anti Join (cost=133385.93..395165.80 rows=1 width=83) (actual time=235.581..30916.912 rows=184975 loops=1)
-> Hash Anti Join (cost=133385.38..395165.14 rows=1 width=83) (actual time=83.073..1703.618 rows=768278 loops=1)
Hash Cond: (event.parent_path = event2.path)
-> Seq Scan on event (cost=0.00..252872.92 rows=2375157 width=83) (actual time=0.014..756.014 rows=2000000 loops=1)
Filter: ((id <= 13000000) AND (volume_id = 1))
-> Hash (cost=132700.54..132700.54 rows=54787 width=103) (actual time=82.754..82.754 rows=48029 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 6696kB
-> Bitmap Heap Scan on event event2 (cost=6196.07..132700.54 rows=54787 width=103) (actual time=12.979..63.803 rows=48029 loops=1)
Recheck Cond: (type = '16384'::text)
Filter: ((id <= 13000000) AND (volume_id = 1))
Heap Blocks: exact=16465
-> Bitmap Index Scan on event_dir_depth_idx (cost=0.00..6182.38 rows=54792 width=0) (actual time=8.759..8.759 rows=48029 loops=1)
-> Index Only Scan using dircurrent_volumeid_path_unq on dir_current dir (cost=0.55..0.65 rows=1 width=115) (actual time=0.038..0.038 rows=1 loops=768278)
Index Cond: ((volume_id = 1) AND (path = event.parent_path))
Heap Fetches: 583027
Planning time: 2.114 ms
Execution time: 32054.498 ms
The slowest part is the Index Only Scan on the dir_current table (about 29 of the 32 seconds total).
I wonder why Postgres is using index scan instead of sequential scan which would take 2-3 seconds.
After setting:
SET enable_indexscan TO false;
SET enable_bitmapscan TO false;
I got a plan that runs in about 4 seconds (explain.depesz.com):
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=569654.93..569654.94 rows=1 width=83) (actual time=3943.487..3979.613 rows=2462 loops=1)
-> Sort (cost=569654.93..569654.93 rows=1 width=83) (actual time=3943.481..3954.169 rows=184975 loops=1)
Sort Key: event.depth, event.parent_path
Sort Method: quicksort Memory: 38705kB
-> Hash Anti Join (cost=307875.14..569654.92 rows=1 width=83) (actual time=1393.185..2970.626 rows=184975 loops=1)
Hash Cond: ((event.parent_path = dir.path) AND ((event.depth - 1) = dir.depth))
-> Hash Anti Join (cost=259496.25..521276.01 rows=1 width=83) (actual time=786.617..2111.297 rows=768278 loops=1)
Hash Cond: (event.parent_path = event2.path)
-> Seq Scan on event (cost=0.00..252872.92 rows=2375157 width=83) (actual time=0.016..616.598 rows=2000000 loops=1)
Filter: ((id <= 13000000) AND (volume_id = 1))
-> Hash (cost=258811.41..258811.41 rows=54787 width=103) (actual time=786.214..786.214 rows=48029 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 6696kB
-> Seq Scan on event event2 (cost=0.00..258811.41 rows=54787 width=103) (actual time=0.068..766.563 rows=48029 loops=1)
Filter: ((id <= 13000000) AND (volume_id = 1) AND (type = '16384'::text))
Rows Removed by Filter: 1951971
-> Hash (cost=36960.95..36960.95 rows=761196 width=119) (actual time=582.430..582.430 rows=761196 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 121605kB
-> Seq Scan on dir_current dir (cost=0.00..36960.95 rows=761196 width=119) (actual time=0.010..267.484 rows=761196 loops=1)
Filter: (volume_id = 1)
Planning time: 2.242 ms
Execution time: 3999.213 ms
Both tables were analyzed before running queries.
Any idea why Postgres is using a far from optimal query plan?
Is there a better way to improve query performance than disabling index/bitmap scans? Maybe a different query with the same result?
I am using Postgres 9.5.2
I would be grateful for any help.
You are only fetching columns from one table. I would recommend rewriting the query as:
SELECT DISTINCT e.parent_path, e.depth
FROM sf.event e
WHERE e.volume_id = 1 AND e.id <= MAX_ID_VARIABLE AND
NOT EXISTS (SELECT 1
FROM sf.dir_current dc
WHERE e.parent_path = dc.path AND dc.volume_id = 1
) AND
NOT EXISTS (SELECT 1
FROM sf.event e2
WHERE e.parent_path = e2.path AND
e2.volume_id = 1 AND
e2.type = 'DIR' AND
e2.id <= MAX_ID_VARIABLE
)
ORDER BY e.depth, e.parent_path;
Then the following indexes:
event(volume_id, id)
dir_current(path, volume_id)
event(path, volume_id, type, id)
I'm not sure why there is a comparison to MAX_ID_VARIABLE. Without this comparison, the first index can include the sort keys: event(volume_id, depth, parent_path).
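The three indexes listed above, written out as statements. This is a sketch: the index names are hypothetical, and the sf schema is taken from the question. The plans in the question show that dir_current already has a unique index named dircurrent_volumeid_path_unq, so the second index may turn out to be redundant:

```sql
-- Hypothetical index names; column order follows the answer above.
CREATE INDEX event_volume_id_id_idx
    ON sf.event (volume_id, id);
CREATE INDEX dir_current_path_volume_id_idx
    ON sf.dir_current (path, volume_id);
CREATE INDEX event_path_volume_type_id_idx
    ON sf.event (path, volume_id, type, id);

-- Refresh statistics so the planner can cost the new indexes.
ANALYZE sf.event;
ANALYZE sf.dir_current;
```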