We have a simple statement in PostgreSQL 11.9/11.10 or 12.5 where we can write the join with a WHERE clause or with an ON clause. The meaning is exactly the same, and therefore so is the number of returned rows, but we receive different explain plans. With more data in the tables, one execution plan gets really bad, and we want to understand why PostgreSQL chooses different plans for this situation. Any ideas?
Let's create some sample data:
CREATE TABLE t1 (
t1_nr int8 NOT NULL,
name varchar(60),
CONSTRAINT t1_pk PRIMARY KEY (t1_nr)
);
INSERT INTO t1 (t1_nr, name) SELECT s, left(md5(random()::text), 10) FROM generate_series(1, 1000000) s; -- 1 million records
CREATE TABLE t2 (
t2_nr int8 NOT NULL,
CONSTRAINT t2_pk PRIMARY KEY (t2_nr)
);
INSERT INTO t2 (t2_nr) SELECT s FROM generate_series(1, 10000000) s; -- 10 million records
CREATE TABLE t3 (
t1_nr int8 NOT NULL,
t2_nr int8 NOT NULL,
CONSTRAINT t3_pk PRIMARY KEY (t2_nr, t1_nr)
);
INSERT INTO t3 (t1_nr, t2_nr) SELECT (s-1)/10+1, s FROM generate_series(1, 10000000) s; -- 10 t2 records per t1 records --> 10 million records
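(The plans below assume fresh planner statistics, so presumably an ANALYZE ran after loading; a minimal sketch:)
ANALYZE t1;
ANALYZE t2;
ANALYZE t3;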
Our statement, with fully analyzed statistics:
EXPLAIN (BUFFERS, ANALYZE)
SELECT t1.*
FROM t1 t1
WHERE EXISTS (
SELECT 1
FROM t3 t3
JOIN t2 t2 ON t2.t2_nr = t3.t2_nr
--AND t3.t1_nr = t1.t1_nr /* GOOD (using ON-CLAUSE) */
WHERE t3.t1_nr = t1.t1_nr /* BAD (using WHERE-CLAUSE) */
)
LIMIT 1000
The explain plan with the "GOOD" row (ON-CLAUSE):
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------|
Limit (cost=0.00..22896.86 rows=1000 width=19) (actual time=0.028..4.801 rows=1000 loops=1) |
Buffers: shared hit=8015 |
-> Seq Scan on t1 (cost=0.00..11448428.92 rows=500000 width=19) (actual time=0.027..4.725 rows=1000 loops=1) |
Filter: (SubPlan 1) |
Buffers: shared hit=8015 |
SubPlan 1 |
-> Nested Loop (cost=0.87..180.43 rows=17 width=0) (actual time=0.004..0.004 rows=1 loops=1000) |
Buffers: shared hit=8008 |
-> Index Only Scan using t3_pk on t3 (cost=0.43..36.73 rows=17 width=8) (actual time=0.002..0.002 rows=1 loops=1000)|
Index Cond: (t1_nr = t1.t1_nr) |
Heap Fetches: 1000 |
Buffers: shared hit=4003 |
-> Index Only Scan using t2_pk on t2 (cost=0.43..8.45 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1000) |
Index Cond: (t2_nr = t3.t2_nr) |
Heap Fetches: 1000 |
Buffers: shared hit=4005 |
Planning Time: 0.267 ms |
Execution Time: 4.880 ms |
The explain plan with the "BAD" row (WHERE-CLAUSE):
QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------------------------------|
Limit (cost=1166.26..7343.42 rows=1000 width=19) (actual time=16.888..75.809 rows=1000 loops=1) |
Buffers: shared hit=51883 read=11 dirtied=2 |
-> Merge Semi Join (cost=1166.26..3690609.61 rows=597272 width=19) (actual time=16.887..75.703 rows=1000 loops=1) |
Merge Cond: (t1.t1_nr = t3.t1_nr) |
Buffers: shared hit=51883 read=11 dirtied=2 |
-> Index Scan using t1_pk on t1 (cost=0.42..32353.42 rows=1000000 width=19) (actual time=0.010..0.271 rows=1000 loops=1) |
Buffers: shared hit=12 |
-> Gather Merge (cost=1000.89..3530760.13 rows=9999860 width=8) (actual time=16.873..74.064 rows=9991 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
Buffers: shared hit=51871 read=11 dirtied=2 |
-> Nested Loop (cost=0.87..2375528.14 rows=4166608 width=8) (actual time=0.054..14.275 rows=4309 loops=3) |
Buffers: shared hit=51871 read=11 dirtied=2 |
-> Parallel Index Only Scan using t3_pk on t3 (cost=0.43..370689.69 rows=4166608 width=16) (actual time=0.028..1.495 rows=4309 loops=3)|
Heap Fetches: 12927 |
Buffers: shared hit=131 read=6 |
-> Index Only Scan using t2_pk on t2 (cost=0.43..0.48 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=12927) |
Index Cond: (t2_nr = t3.t2_nr) |
Heap Fetches: 12927 |
Buffers: shared hit=51740 read=5 dirtied=2 |
Planning Time: 0.475 ms |
Execution Time: 75.947 ms |
Thanks for your ideas. If we add an index like
CREATE INDEX t3_t1_nr ON t3(t1_nr);
the "BAD"-Statement will improve a little bit.
But the final solution for us was to increase the statistics gathered for these tables:
ALTER TABLE t1 ALTER COLUMN t1_nr SET STATISTICS 10000;
ALTER TABLE t2 ALTER COLUMN t2_nr SET STATISTICS 10000;
ALTER TABLE t3 ALTER COLUMN t1_nr SET STATISTICS 10000;
ANALYZE t1;
ANALYZE t2;
ANALYZE t3;
After this change, both SELECTs have about the same execution time.
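To see what the larger statistics target changed, you can inspect the gathered statistics; a quick sketch using the standard pg_stats view (the interesting number here is the n_distinct estimate for t3.t1_nr):
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 't3' AND attname = 't1_nr';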
More information can be found here: https://www.postgresql.org/docs/12/planner-stats.html
Using postgres 9.6.11, I have a schema like:
owner:
id: BIGINT (PK)
dog_id: BIGINT NOT NULL (FK)
cat_id: BIGINT NULL (FK)
index DOG_ID_IDX (dog_id)
index CAT_ID_IDX (cat_id)
animal:
id: BIGINT (PK)
name: VARCHAR(50) NOT NULL
index NAME_IDX (name)
Here is some example data:
owner table:
| id | dog_id | cat_id |
| -- | ------ | ------ |
| 1 | 100 | 200 |
| 2 | 101 | NULL |
animal table:
| id | name |
| --- | -------- |
| 100 | "fluffy" |
| 101 | "rex" |
| 200 | "tom" |
A common query I need to perform is to find owners by their pet's name, which I thought to accomplish with a query like:
select *
from owner o
join animal dog on o.dog_id = dog.id
left join animal cat on o.cat_id = cat.id
where dog.name = "fluffy" or cat.name = "fluffy";
But I don't understand the plan I get back from this:
Hash Join (cost=30304.51..77508.31 rows=3 width=899)
Hash Cond: (dog.id = owner.dog_id)
Join Filter: (((dog.name)::text = 'fluffy'::text) OR ((cat.name)::text = 'fluffy'::text))
-> Seq Scan on animal dog (cost=0.00..17961.23 rows=116623 width=899)
-> Hash (cost=28208.65..28208.65 rows=114149 width=19)
-> Hash Left Join (cost=20103.02..28208.65 rows=114149 width=19)
Hash Cond: (owner.cat_id = cat.id)
-> Seq Scan on owner o (cost=0.00..5849.49 rows=114149 width=16)
-> Hash (cost=17961.23..17961.23 rows=116623 width=19)
-> Seq Scan on animal cat (cost=0.00..17961.23 rows=116623 width=19)
I don't understand why the query plan is doing a sequential scan.
I thought that the optimizer would be smart enough to scan the animal table once, or even twice using the name index, and join back to the owner table based on this result, but instead I wind up with a very unexpected query plan.
I took a simpler case where we only want to look up dog names and the query behaves as I'd expect:
select *
from owner o
join animal dog on o.dog_id = dog.id
where dog.name = "fluffy";
This query produces a plan I understand, using the index on animal.name:
Nested Loop (cost=0.83..16.88 rows=1 width=1346)
-> Index Scan using NAME_IDX on animal dog (cost=0.42..8.44 rows=1 width=899)
Index Cond: ((name)::text = 'fluffy'::text)
-> Index Scan using DOG_ID_IDX on owner o (cost=0.42..8.44 rows=1 width=447)
Index Cond: (dog_id = dog.id)
Even doing the query with two inner joins produces a query plan I would expect:
select *
from owner o
join animal dog on o.dog_id = dog.id
join animal cat on o.cat_id = cat.id
where dog.name = 'fluffy' or cat.name = 'fluffy';
Merge Join (cost=35726.09..56215.53 rows=3 width=2245)
Merge Cond: (owner.cat_id = cat.id)
Join Filter: (((dog.name)::text = 'fluffy'::text) OR ((cat.name)::text = 'fluffy'::text))
-> Nested Loop (cost=0.83..132348.38 rows=114149 width=1346)
-> Index Scan using CAT_ID_IDX on owner o (cost=0.42..11616.07 rows=114149 width=447)
-> Index Scan using animal_pkey on animal dog (cost=0.42..1.05 rows=1 width=899)
Index Cond: (id = owner.dog_id)
-> Index Scan using animal_pkey on animal cat (cost=0.42..52636.91 rows=116623 width=899)
So it looks like the left join to animal is causing the optimizer to ignore the index.
Why does doing the additional left join to animal seem to cause the optimizer to ignore the index?
EDIT:
EXPLAIN (analyse, buffers) yields:
Hash Left Join (cost=32631.95..150357.57 rows=3 width=2245) (actual time=6696.935..6696.936 rows=0 loops=1)
Hash Cond: (o.cat_id = cat.id)
Filter: (((dog.name)::text = 'fluffy'::text) OR ((cat.name)::text = 'fluffy'::text))
Rows Removed by Filter: 114219
Buffers: shared hit=170464 read=18028 dirtied=28, temp read=13210 written=13148
-> Merge Join (cost=0.94..65696.37 rows=114149 width=1346) (actual time=1.821..860.643 rows=114219 loops=1)
Merge Cond: (o.dog_id = dog.id)
Buffers: shared hit=170286 read=1408 dirtied=28
-> Index Scan using DOG_ID_IDX on owner o (cost=0.42..11402.48 rows=114149 width=447) (actual time=1.806..334.431 rows=114219 loops=1)
Buffers: shared hit=84787 read=783 dirtied=13
-> Index Scan using animal_pkey on animal dog (cost=0.42..52636.91 rows=116623 width=899) (actual time=0.006..300.507 rows=116977 loops=1)
Buffers: shared hit=85499 read=625 dirtied=15
-> Hash (cost=17961.23..17961.23 rows=116623 width=899) (actual time=5626.780..5626.780 rows=116977 loops=1)
Buckets: 8192 Batches: 32 Memory Usage: 3442kB
Buffers: shared hit=175 read=16620, temp written=12701
-> Seq Scan on animal cat (cost=0.00..17961.23 rows=116623 width=899) (actual time=2.519..5242.106 rows=116977 loops=1)
Buffers: shared hit=175 read=16620
Planning time: 1.245 ms
Execution time: 6697.357 ms
The left join needs to keep all rows from the first table. Hence, it will generally scan that table in full, even when conditions on other tables would filter rows out.
The query plan produced by Postgres is not surprising.
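A common workaround, sketched here against the schema above (it is not part of the original answer): split the OR into a UNION so that each branch has a single indexable condition on animal.name:
select *
from owner o
join animal dog on o.dog_id = dog.id
left join animal cat on o.cat_id = cat.id
where dog.name = 'fluffy'
union
select *
from owner o
join animal dog on o.dog_id = dog.id
left join animal cat on o.cat_id = cat.id
where cat.name = 'fluffy';
UNION (rather than UNION ALL) removes duplicate rows for owners whose dog and cat are both named 'fluffy'.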
I have a problem with one SQL query: it has started to run too long. The problem appeared a month ago; before that, everything was OK.
EXPLAIN tells me that PostgreSQL is using a seq scan. Well, the problem seems clear... but why is it using a seq scan? The DB has indexes for all the needed keys.
I have spent a lot of time trying to fix it, but all attempts have failed :(
Please help!
Here is the SQL:
EXPLAIN ANALYZE SELECT
t.id,t.category_id,t.category_root_id,t.title,t.deadtime,
t.active,t.initial_cost,t.remote,t.created,t.work_type,
a.city_id,city.city_name_ru AS city_label, curr.short_name AS currency,
l1.val AS root_category_title, l2.val AS category_title, t.service_tasktotop,
t.service_taskcolor,t.service_tasktime, m.name AS metro,
count(tb.id) AS offers_cnt, t.contact_phone_cnt
FROM tasks AS t
LEFT JOIN tasks_address AS a ON ( a.task_id=t.id )
LEFT JOIN geo.cities AS city ON ( a.city_id=city.id_city )
LEFT JOIN catalog_categories AS r ON ( r.id=t.category_root_id )
LEFT JOIN catalog_categories AS c ON ( c.id=t.category_id)
LEFT JOIN localizations AS l1 ON (l1.lang='ru' AND l1.component='catalog' AND l1.subcomponent='categories' AND l1.var=r.name)
LEFT JOIN localizations AS l2 ON (l2.lang='ru' AND l2.component='catalog' AND l2.subcomponent='categories' AND l2.var=c.name)
LEFT JOIN tasks_bets AS tb ON ( tb.task_id=t.id )
LEFT JOIN paym.currencies AS curr ON ( t.currency_id=curr.id )
LEFT JOIN geo.metro AS m ON ( a.metro_id=m.id )
WHERE t.trust_level > 0
AND (a.region_id IN (1, 0) OR a.region_id IS NULL)
AND (a.country_id IN (1, 0) OR a.country_id IS NULL)
AND t.task_type=1
GROUP BY t.id,t.category_id,t.category_root_id,t.title,t.deadtime,t.active,t.initial_cost,t.remote,t.created,t.work_type, a.city_id, city.city_name_ru, curr.short_name, l1.val, l2.val, t.contact_phone_cnt, t.service_tasktotop,t.service_taskcolor,t.service_tasktime, m.name
ORDER BY
CASE
WHEN t.active=1 THEN
CASE
WHEN t.service_tasktotop > 1392025702 THEN 100
ELSE 150
END
WHEN t.active=2 THEN
CASE
WHEN t.service_tasktotop > 1392025702 THEN 200
ELSE 250
END
WHEN t.active=3 THEN 300
WHEN t.active=4 THEN 400
WHEN t.active=5 THEN 500
WHEN t.active=-1 THEN 600
WHEN t.active=-2 THEN 700
WHEN t.active=-3 THEN 800
WHEN t.active=-4 THEN 900
ELSE 1000
END,
CASE
WHEN t.service_tasktotop>1392025702 THEN t.service_tasktotop
ELSE t.created
END
DESC
LIMIT 30 OFFSET 0
Here is the EXPLAIN dump:
Limit (cost=17101.17..17101.24 rows=30 width=701) (actual time=248.486..248.497 rows=30 loops=1)
-> Sort (cost=17101.17..17156.12 rows=21982 width=701) (actual time=248.484..248.487 rows=30 loops=1)
Sort Key: (CASE WHEN (t.active = 1) THEN CASE WHEN (t.service_tasktotop > 1392025702) THEN 100 ELSE 150 END WHEN (t.active = 2) THEN CASE WHEN (t.service_tasktotop > 1392025702) THEN 200 ELSE 250 END WHEN (t.active = 3) THEN 300 WHEN (t.active = 4) THEN 400 WHEN (t.active = 5) THEN 500 WHEN (t.active = (-1)) THEN 600 WHEN (t.active = (-2)) THEN 700 WHEN (t.active = (-3)) THEN 800 WHEN (t.active = (-4)) THEN 900 ELSE 1000 END), (CASE WHEN (t.service_tasktotop > 1392025702) THEN t.service_tasktotop ELSE t.created END)
Sort Method: top-N heapsort Memory: 35kB
-> GroupAggregate (cost=14363.65..16451.94 rows=21982 width=701) (actual time=212.801..233.808 rows=6398 loops=1)
-> Sort (cost=14363.65..14418.61 rows=21982 width=701) (actual time=212.777..216.111 rows=18347 loops=1)
Sort Key: t.id, t.category_id, t.category_root_id, t.title, t.deadtime, t.active, t.initial_cost, t.remote, t.created, t.work_type, a.city_id, city.city_name_ru, curr.short_name, l1.val, l2.val, t.contact_phone_cnt, t.service_tasktotop, t.service_taskcolor, t.service_tasktime, m.name
Sort Method: quicksort Memory: 6388kB
-> Hash Left Join (cost=2392.33..5939.31 rows=21982 width=701) (actual time=18.986..64.598 rows=18347 loops=1)
Hash Cond: (a.metro_id = m.id)
-> Hash Left Join (cost=2384.20..5628.92 rows=21982 width=681) (actual time=18.866..57.534 rows=18347 loops=1)
Hash Cond: (t.currency_id = curr.id)
-> Hash Left Join (cost=2383.09..5325.56 rows=21982 width=678) (actual time=18.846..50.126 rows=18347 loops=1)
Hash Cond: (t.id = tb.task_id)
-> Hash Left Join (cost=809.08..2760.18 rows=5935 width=674) (actual time=10.987..32.460 rows=6398 loops=1)
Hash Cond: (a.city_id = city.id_city)
-> Hash Left Join (cost=219.25..2029.39 rows=5935 width=158) (actual time=2.883..20.952 rows=6398 loops=1)
Hash Cond: (t.category_root_id = r.id)
-> Hash Left Join (cost=203.42..1969.65 rows=5935 width=125) (actual time=2.719..18.048 rows=6398 loops=1)
Hash Cond: (t.category_id = c.id)
-> Hash Left Join (cost=187.60..1909.91 rows=5935 width=92) (actual time=2.522..15.021 rows=6398 loops=1)
Hash Cond: (t.id = a.task_id)
Filter: (((a.region_id = ANY ('{1,0}'::integer[])) OR (a.region_id IS NULL)) AND ((a.country_id = ANY ('{1,0}'::integer[])) OR (a.country_id IS NULL)))
-> Seq Scan on tasks t (cost=0.00..1548.06 rows=7337 width=84) (actual time=0.008..6.337 rows=7337 loops=1)
Filter: ((trust_level > 0) AND (task_type = 1))
-> Hash (cost=124.49..124.49 rows=5049 width=18) (actual time=2.505..2.505 rows=5040 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 256kB
-> Seq Scan on tasks_address a (cost=0.00..124.49 rows=5049 width=18) (actual time=0.002..1.201 rows=5040 loops=1)
-> Hash (cost=14.91..14.91 rows=73 width=37) (actual time=0.193..0.193 rows=74 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 5kB
-> Hash Left Join (cost=6.46..14.91 rows=73 width=37) (actual time=0.113..0.168 rows=74 loops=1)
Hash Cond: ((c.name)::text = (l2.var)::text)
-> Seq Scan on catalog_categories c (cost=0.00..7.73 rows=73 width=17) (actual time=0.001..0.017 rows=74 loops=1)
-> Hash (cost=5.42..5.42 rows=84 width=46) (actual time=0.105..0.105 rows=104 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 8kB
-> Seq Scan on localizations l2 (cost=0.00..5.42 rows=84 width=46) (actual time=0.005..0.056 rows=104 loops=1)
Filter: (((lang)::text = 'ru'::text) AND ((component)::text = 'catalog'::text) AND ((subcomponent)::text = 'categories'::text))
-> Hash (cost=14.91..14.91 rows=73 width=37) (actual time=0.155..0.155 rows=74 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 5kB
-> Hash Left Join (cost=6.46..14.91 rows=73 width=37) (actual time=0.085..0.130 rows=74 loops=1)
Hash Cond: ((r.name)::text = (l1.var)::text)
-> Seq Scan on catalog_categories r (cost=0.00..7.73 rows=73 width=17) (actual time=0.002..0.016 rows=74 loops=1)
-> Hash (cost=5.42..5.42 rows=84 width=46) (actual time=0.080..0.080 rows=104 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 8kB
-> Seq Scan on localizations l1 (cost=0.00..5.42 rows=84 width=46) (actual time=0.004..0.046 rows=104 loops=1)
Filter: (((lang)::text = 'ru'::text) AND ((component)::text = 'catalog'::text) AND ((subcomponent)::text = 'categories'::text))
-> Hash (cost=363.26..363.26 rows=18126 width=520) (actual time=8.093..8.093 rows=18126 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 882kB
-> Seq Scan on cities city (cost=0.00..363.26 rows=18126 width=520) (actual time=0.002..3.748 rows=18126 loops=1)
-> Hash (cost=1364.56..1364.56 rows=16756 width=8) (actual time=7.844..7.844 rows=16757 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 655kB
-> Seq Scan on tasks_bets tb (cost=0.00..1364.56 rows=16756 width=8) (actual time=0.005..4.180 rows=16757 loops=1)
-> Hash (cost=1.05..1.05 rows=5 width=9) (actual time=0.008..0.008 rows=5 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on currencies curr (cost=0.00..1.05 rows=5 width=9) (actual time=0.003..0.005 rows=5 loops=1)
-> Hash (cost=5.28..5.28 rows=228 width=28) (actual time=0.112..0.112 rows=228 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Seq Scan on metro m (cost=0.00..5.28 rows=228 width=28) (actual time=0.004..0.050 rows=228 loops=1)
Total runtime: 248.990 ms
Table:
id serial NOT NULL,
author_id integer DEFAULT 0,
category_id integer DEFAULT 0,
category_root_id integer DEFAULT 0,
title character varying,
description text,
deadtime integer DEFAULT 0,
helper_date integer DEFAULT 0,
active integer DEFAULT 1,
initial_cost integer DEFAULT 0,
conditional integer DEFAULT 0,
remote integer DEFAULT 0,
tariff_id integer DEFAULT 0,
created integer DEFAULT 0,
views integer DEFAULT 0,
accepted_helper_id integer DEFAULT 0,
accept_date integer DEFAULT 0,
auction_bet_id integer DEFAULT 0,
token character varying,
repute2helper text,
repute2author text,
active_dated integer DEFAULT 0,
bot integer,
repute_level integer,
seo_body text,
seo_body_active smallint DEFAULT 0,
service_tasktotop integer DEFAULT 0,
service_taskcolor integer DEFAULT 0,
service_tasktime integer DEFAULT 0,
type_id smallint DEFAULT 1,
partner_id integer NOT NULL DEFAULT 0,
trust_level smallint NOT NULL DEFAULT 1,
trust_date integer NOT NULL DEFAULT 0,
active_cause character varying(1500),
admin_notes text,
currency_id smallint NOT NULL DEFAULT 0,
work_type smallint NOT NULL DEFAULT 0,
helpers_gender smallint NOT NULL DEFAULT 0,
helpers_langs integer[],
service_notifyhelpers integer NOT NULL DEFAULT 0,
contact_phone character varying(50),
contact_email character varying(50),
fastclose smallint NOT NULL DEFAULT 0,
access_code character varying(250),
contact_phone_cnt integer NOT NULL DEFAULT 0,
author_in_task integer NOT NULL DEFAULT 0,
task_type smallint NOT NULL DEFAULT 1
Indexes:
CREATE INDEX tasks_author_in_task_idx
ON tasks
USING btree
(author_in_task );
CREATE INDEX tasks_deadtime_bot_created_active_dated_currency_id_idx
ON tasks
USING btree
(deadtime , bot , created , active_dated , currency_id );
CREATE INDEX tasks_idxs
ON tasks
USING btree
(id , active , category_id , category_root_id , remote , type_id , partner_id , trust_level );
CREATE INDEX tasks_service_tasktotop_service_taskcolor_service_tasktime_idx
ON tasks
USING btree
(service_tasktotop , service_taskcolor , service_tasktime );
CREATE INDEX tasks_task_type_idx
ON tasks
USING btree
(task_type );
I recommend using explain.depesz.com; with this service it's much easier to see where the problem is.
Here http://explain.depesz.com/s/vHT is your explain. As you can see on the stats tab, the seq scans are not a problem: they account for only 3.1% of the total runtime. On the other hand, the sorting operations take a lot of time (67%). Do you really need to sort by so many columns?
Sort Key: t.id, t.category_id, t.category_root_id, t.title, t.deadtime, t.active, t.initial_cost, t.remote, t.created, t.work_type, a.city_id, city.city_name_ru, curr.short_name, l1.val, l2.val, t.contact_phone_cnt, t.service_tasktotop, t.service_taskcolor, t.service_tasktime, m.name
One last thing: do you have an index on every column used in the joins? Look at my simple example below: I do a simple left join of a table with itself, first on an indexed column, then without the index. Look at the plans (merge join vs. hash join) and the times. Remember that the join columns of both tables should be covered by some index.
P.S. Always run ANALYZE on the table, to be sure the planner has up-to-date statistics!
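For instance, applied to the tasks table from the question (a one-line sketch):
ANALYZE tasks;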
Table "public.people"
Column | Type | Modifiers
------------+---------+-----------------------------------------------------
id | integer | not null default nextval('people_id_seq'::regclass)
username | text |
department | text |
salary | integer |
deleted | boolean | not null default false
Indexes:
"people_pkey" PRIMARY KEY, btree (id)
"people_department_idx" btree (department)
"people_department_salary_idx" btree (department, salary)
"people_salary_idx" btree (salary)
sebpa=# explain analyze select * from people a left join people b on a.id = b.id where a.salary < 30000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Merge Left Join (cost=0.57..2540.51 rows=19995 width=82) (actual time=0.022..19.710 rows=19995 loops=1)
Merge Cond: (a.id = b.id)
-> Index Scan using people_pkey on people a (cost=0.29..1145.29 rows=19995 width=41) (actual time=0.011..6.645 rows=19995 loops=1)
Filter: (salary < 30000)
Rows Removed by Filter: 10005
-> Index Scan using people_pkey on people b (cost=0.29..1070.29 rows=30000 width=41) (actual time=0.008..3.769 rows=19996 loops=1)
Total runtime: 20.969 ms
(7 rows)
sebpa=# alter table people drop constraint people_pkey;
ALTER TABLE
sebpa=# vacuum analyze people;
VACUUM
sebpa=# explain analyze select * from people a left join people b on a.id = b.id where a.salary < 30000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Hash Right Join (cost=1081.94..2829.39 rows=19995 width=82) (actual time=10.767..47.147 rows=19995 loops=1)
Hash Cond: (b.id = a.id)
-> Seq Scan on people b (cost=0.00..581.00 rows=30000 width=41) (actual time=0.001..2.989 rows=30000 loops=1)
-> Hash (cost=656.00..656.00 rows=19995 width=41) (actual time=10.753..10.753 rows=19995 loops=1)
Buckets: 2048 Batches: 2 Memory Usage: 733kB
-> Seq Scan on people a (cost=0.00..656.00 rows=19995 width=41) (actual time=0.007..5.827 rows=19995 loops=1)
Filter: (salary < 30000)
Rows Removed by Filter: 10005
Total runtime: 48.884 ms
OK - so you've run an explain on a complicated query, seen a seq scan, and jumped to conclusions.
Explain output can be tricky to read on a small screen, but there's a nice chap who's built a tool for us. Let's post it to explain.depesz.com
http://explain.depesz.com/s/DTCz
This shows you nicely coloured output. Those sequential scans? Take only milliseconds.
The big time consumer seems to be that sort (151ms by itself). It's sorting 18,000 rows by a lot of fields and uses about 6.4MB of memory to do so.
There's nothing worth concentrating on apart from this sort. There are only three plausible options:
Make sure your work_mem is > 6.4MB for this query (set work_mem=...; see the sketch after this list)
Add an index that matches the fields you want to sort by (might work/might not, but it will be a big index that's expensive to update).
Rewrite the query - use a subquery to filter + group your tasks then join to the other tables. Difficult to say if/how much it will help.
Start with #1 - that'll only take a few minutes to test and is a likely candidate if the query used to be quicker.
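A minimal sketch of option #1 (the 16MB value is an assumption; anything comfortably above the ~6.4MB the sort reported should do):
SET work_mem = '16MB';
-- re-run the EXPLAIN ANALYZE query from above, then restore the default:
RESET work_mem;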
I have tables like that in a Postgres 9.3 database:
A <1---n B n---1> C
Table A contains ~10^7 rows, table B is rather big with ~10^9 rows and C contains ~100 rows.
I use the following query to find all As (distinct) that match some criteria in B and C (the real query is more complex, joins more tables and checks more attributes within the subquery):
Query 1:
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-12' and
C.Name = '00000015')
limit 200;
That query takes about 500ms (Note that C.Name = '00000015' exists in the table):
Limit (cost=119656.37..120234.06 rows=200 width=9) (actual time=427.799..465.485 rows=200 loops=1)
-> Hash Semi Join (cost=119656.37..483518.78 rows=125971 width=9) (actual time=427.797..465.460 rows=200 loops=1)
Hash Cond: (a.id = b.aid)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..15.058 rows=133470 loops=1)
-> Hash (cost=117588.73..117588.73 rows=125971 width=4) (actual time=427.233..427.233 rows=190920 loops=1)
Buckets: 4096 Batches: 8 Memory Usage: 838kB
-> Nested Loop (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.176..400.326 rows=190920 loops=1)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.015..0.030 rows=1 loops=1)
Filter: (name = '00000015'::text)
Rows Removed by Filter: 149
-> Index Only Scan using cid_aid on b (cost=0.57..116291.64 rows=129422 width=8) (actual time=0.157..382.896 rows=190920 loops=1)
Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 476.173 ms
Query 2: Changing C.Name to something that doesn't exist (C.Name = 'foo') takes 0.1ms:
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-12' and
C.Name = 'foo')
limit 200;
Limit (cost=119656.37..120234.06 rows=200 width=9) (actual time=0.063..0.063 rows=0 loops=1)
-> Hash Semi Join (cost=119656.37..483518.78 rows=125971 width=9) (actual time=0.062..0.062 rows=0 loops=1)
Hash Cond: (a.id = b.aid)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..0.010 rows=1 loops=1)
-> Hash (cost=117588.73..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1)
Buckets: 4096 Batches: 8 Memory Usage: 0kB
-> Nested Loop (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.037..0.037 rows=0 loops=1)
Filter: (name = 'foo'::text)
Rows Removed by Filter: 150
-> Index Only Scan using cid_aid on b (cost=0.57..116291.64 rows=129422 width=8) (never executed)
Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 0.120 ms
Query 3: Resetting C.Name to something that exists (as in the first query) and increasing the timestamp range by 3 days uses a different query plan than before, but is still fast (200ms):
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-15' and
C.Name = '00000015')
limit 200;
Limit (cost=0.57..112656.93 rows=200 width=9) (actual time=4.404..227.569 rows=200 loops=1)
-> Nested Loop Semi Join (cost=0.57..90347016.34 rows=160394 width=9) (actual time=4.403..227.544 rows=200 loops=1)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..1.046 rows=12250 loops=1)
-> Nested Loop (cost=0.57..7.49 rows=1 width=4) (actual time=0.017..0.017 rows=0 loops=12250)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.005..0.015 rows=1 loops=12250)
Filter: (name = '00000015'::text)
Rows Removed by Filter: 147
-> Index Only Scan using cid_aid on b (cost=0.57..4.60 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=12250)
Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 227.632 ms
Query 4: But that new query plan utterly fails when searching for a C.Name that doesn't exist:
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-15' and
C.Name = 'foo')
limit 200;
Now it takes 170 seconds (vs. 0.1ms before!) to return the same 0 rows:
Limit (cost=0.57..112656.93 rows=200 width=9) (actual time=170184.979..170184.979 rows=0 loops=1)
-> Nested Loop Semi Join (cost=0.57..90347016.34 rows=160394 width=9) (actual time=170184.977..170184.977 rows=0 loops=1)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..794.626 rows=12020034 loops=1)
-> Nested Loop (cost=0.57..7.49 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034)
Filter: (name = 'foo'::text)
Rows Removed by Filter: 150
-> Index Only Scan using cid_aid on b (cost=0.57..4.60 rows=1 width=8) (never executed)
Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 170185.033 ms
All queries were run after "ALTER TABLE ... SET STATISTICS" with a value of 10000 on all columns, and after running ANALYZE on the whole DB.
Right now it looks like the slightest change of a parameter (not even of the SQL) can make Postgres choose a bad plan (0.1ms vs. 170s in this case!). I always try to check query plans when changing things, but it's hard to ever be sure that something will work when such small changes in parameters can make such huge differences. I have similar problems with other queries too.
What can I do to get more predictable results?
(I have tried modifying certain query planning parameters (set enable_... = on/off) and some different SQL statements - joining+distinct/group by instead of "exists" - but nothing seems to make postgres choose "stable" query plans while still providing acceptable performance).
Edit #1: Table + index definitions
test=# \d a
                      Table "public.a"
 Column |  Type   |                   Modifiers
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('a_id_seq'::regclass)
 anr    | integer |
 snr    | text    |
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
    "anr_snr_index" UNIQUE, btree (anr, snr)
    "anr_index" btree (anr)
Foreign-key constraints:
    "anr_fkey" FOREIGN KEY (anr) REFERENCES pt(id)
Referenced by:
    TABLE "b" CONSTRAINT "aid_fkey" FOREIGN KEY (aid) REFERENCES a(id)
test=# \d b
              Table "public.b"
  Column   |            Type             | Modifiers
-----------+-----------------------------+-----------
 id        | uuid                        | not null
 timestamp | timestamp without time zone |
 cid       | integer                     |
 aid       | integer                     |
 prop1     | text                        |
 propn     | integer                     |
Indexes:
    "b_pkey" PRIMARY KEY, btree (id)
    "aid_cid" btree (aid, cid)
    "cid_aid" btree (cid, aid, "timestamp")
    "timestamp_index" btree ("timestamp")
Foreign-key constraints:
    "aid_fkey" FOREIGN KEY (aid) REFERENCES a(id)
    "cid_fkey" FOREIGN KEY (cid) REFERENCES c(id)
test=# \d c
                      Table "public.c"
 Column |  Type   |                   Modifiers
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('c_id_seq'::regclass)
 name   | text    |
Indexes:
    "c_pkey" PRIMARY KEY, btree (id)
    "c_name_index" UNIQUE, btree (name)
Referenced by:
    TABLE "b" CONSTRAINT "cid_fkey" FOREIGN KEY (cid) REFERENCES c(id)
Your problem is that the query needs to evaluate the correlated subquery for the entire table a. When Postgres quickly finds 200 random rows that fit (which seems to occasionally be the case when c.name exists), it yields them accordingly, reasonably fast if there are plenty to choose from. But when no such rows exist, it evaluates the entire hogwash in the exists() statement as many times as table a has rows, hence the performance issue you're seeing.
Adding an uncorrelated where clause will most certainly fix a number of edge cases:
and exists(select 1 from c where name = ?)
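Concretely, a sketch of the full statement with that guard in place, using the parameters from Query 4:
select A.SNr from A
where exists (select 1 from C where C.Name = 'foo')   -- the uncorrelated guard
  and exists (select 1 from B, C
              where B.AId = A.Id and
                    B.CId = C.Id and
                    B.Timestamp >= '2013-01-01' and
                    B.Timestamp <= '2013-01-15' and
                    C.Name = 'foo')
limit 200;
Postgres can evaluate the uncorrelated guard once (as a one-time filter) and stop immediately when c contains no matching row.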
It might also help to join c with b and write it as a CTE:
with bc as (
select aid
from b join c on b.cid = c.id
and b.timestamp between ? and ?
and c.name = ?
)
select a.id
from a
where exists (select 1 from bc)
and exists (select 1 from bc where a.id = bc.aid)
limit 200
If not, just toss in the bc query verbatim instead of using the cte. The point here is to force Postgres to consider the bc lookup as independent, and bail early if the resulting set yields no rows at all.
I assume your query is more complex in the end, but note that the above could be rewritten as:
with bc as (...)
select aid
from bc
limit 200
Or:
with bc as (...)
select a.id
from a
where a.id in (select aid from bc)
limit 200
Both should yield better plans in edge cases.
(Side note: it's usually inadvisable to limit without ordering.)
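For example (assuming that ordering by a.id is acceptable):
with bc as (...)
select a.id
from a
where a.id in (select aid from bc)
order by a.id
limit 200;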
Maybe try rewriting the query with a CTE?
with BC as (
select distinct B.AId from B where
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-12' and
B.CId in (select C.Id from C where C.Name = '00000015')
limit 200
)
select A.SNr from A where A.Id in (select AId from BC)
If I understand correctly, the limit can easily be put inside the BC query to avoid a scan on table A.
I have these two tables in my database
Student table:
| Column     | Type     |
|------------|----------|
| student_id | integer  |
| satquan    | smallint |
| actcomp    | smallint |
| entryyear  | smallint |
Student semester table:
| Column     | Type     |
|------------|----------|
| student_id | integer  |
| semester   | integer  |
| enrolled   | boolean  |
| major      | text     |
| college    | text     |
Where student_id is a unique key in the student table, and a foreign key in the student semester table. The semester integer is just a 1 for the first semester, 2 for the second, and so on.
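(For reference, a rough DDL sketch of the above; the exact constraint spelling is my assumption based on the description:)
CREATE TABLE student (
    student_id integer PRIMARY KEY,
    satquan    smallint,
    actcomp    smallint,
    entryyear  smallint
);
CREATE TABLE student_semester (
    student_id integer NOT NULL REFERENCES student (student_id),
    semester   integer,
    enrolled   boolean,
    major      text,
    college    text
);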
I'm doing queries where I want to get the students by their entryyear (and sometimes by their sat and/or act scores), then get all of those students' associated data from the student semester table.
Currently, my queries look something like this:
SELECT * FROM student_semester
WHERE student_id IN(
SELECT student_id FROM student_semester
WHERE student_id IN(
SELECT student_id FROM student WHERE entryyear = 2006
) AND college = 'AS' AND ...
)
ORDER BY student_id, semester;
But this results in relatively long-running queries (400ms) when I am selecting ~1k students. According to the execution plan, most of the time is spent doing a hash join. To ameliorate this, I have added the satquan, actcomp, and entryyear columns to the student_semester table. This reduces the time to run the query by ~90%, but results in a lot of redundant data. Is there a better way to do this?
These are the indexes that I currently have (along with the implicit indexes on student_id):
CREATE INDEX act_sat_entryyear ON student USING btree (entryyear, actcomp, sattotal)
CREATE INDEX student_id_major_college ON student_semester USING btree (student_id, major, college)
Query Plan
QUERY PLAN
Hash Join (cost=17311.74..35895.38 rows=81896 width=65) (actual time=121.097..326.934 rows=25680 loops=1)
Hash Cond: (public.student_semester.student_id = public.student_semester.student_id)
-> Seq Scan on student_semester (cost=0.00..14307.20 rows=698820 width=65) (actual time=0.015..154.582 rows=698820 loops=1)
-> Hash (cost=17284.89..17284.89 rows=2148 width=8) (actual time=121.062..121.062 rows=1284 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 51kB
-> HashAggregate (cost=17263.41..17284.89 rows=2148 width=8) (actual time=120.708..120.871 rows=1284 loops=1)
-> Hash Semi Join (cost=1026.68..17254.10 rows=3724 width=8) (actual time=4.828..119.619 rows=6184 loops=1)
Hash Cond: (public.student_semester.student_id = student.student_id)
-> Seq Scan on student_semester (cost=0.00..16054.25 rows=42908 width=4) (actual time=0.013..109.873 rows=42331 loops=1)
Filter: ((college)::text = 'AS'::text)
-> Hash (cost=988.73..988.73 rows=3036 width=4) (actual time=4.801..4.801 rows=3026 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 107kB
-> Bitmap Heap Scan on student (cost=71.78..988.73 rows=3036 width=4) (actual time=0.406..3.223 rows=3026 loops=1)
Recheck Cond: (entryyear = 2006)
-> Bitmap Index Scan on student_act_sat_entryyear_index (cost=0.00..71.03 rows=3036 width=0) (actual time=0.377..0.377 rows=3026 loops=1)
Index Cond: (entryyear = 2006)
Total runtime: 327.708 ms
I was mistaken about there not being a Seq Scan in the query. I think the Seq Scan is being done due to the number of rows that match the college condition; when I change it to one that matches fewer students, an index is used. Source: https://stackoverflow.com/a/5203827/880928
Query with the entryyear column included in the student semester table:
SELECT * FROM student_semester
WHERE student_id IN(
SELECT student_id FROM student_semester
WHERE entryyear = 2006 AND collgs = 'AS'
) ORDER BY student_id, semester;
Query Plan
Sort (cost=18597.13..18800.49 rows=81343 width=65) (actual time=72.946..74.003 rows=25680 loops=1)
Sort Key: public.student_semester.student_id, public.student_semester.semester
Sort Method: quicksort Memory: 3546kB
-> Nested Loop (cost=9843.87..11962.91 rows=81343 width=65) (actual time=24.617..40.751 rows=25680 loops=1)
-> HashAggregate (cost=9843.87..9845.73 rows=186 width=4) (actual time=24.590..24.836 rows=1284 loops=1)
-> Bitmap Heap Scan on student_semester (cost=1612.75..9834.63 rows=3696 width=4) (actual time=10.401..23.637 rows=6184 loops=1)
Recheck Cond: (entryyear = 2006)
Filter: ((collgs)::text = 'AS'::text)
-> Bitmap Index Scan on entryyear_act_sat_semester_enrolled_cumdeg_index (cost=0.00..1611.82 rows=60192 width=0) (actual time=10.259..10.259 rows=60520 loops=1)
Index Cond: (entryyear = 2006)
-> Index Scan using student_id_index on student_semester (cost=0.00..11.13 rows=20 width=65) (actual time=0.003..0.010 rows=20 loops=1284)
Index Cond: (student_id = public.student_semester.student_id)
Total runtime: 74.938 ms
An alternative approach to doing the query is to use window functions.
select t.* -- Has the extra NumMatches column. To eliminate it, list the columns you want
from (select ss.*,
sum(case when ss.college = 'AS' and s.entryyear = 2006 then 1 else 0 end) over
(partition by student_id) as NumMatches
from student_semester ss join
student s
on ss.student_id = s.student_id
) t
where NumMatches > 0;
Window functions are usually faster than joining in an aggregation, so I suspect that this might perform well.
The clean version of your query is:
select ss.*
from
student s
inner join
student_semester ss using(student_id)
where
s.entryyear = 2006
and exists (
select 1
from student_semester
where
college = 'AS'
and student_id = s.student_id
)
order by ss.student_id, semester
You want, it appears, students who entered in 2006 and who have ever been in AS college.
Version One.
SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entryyear = 2006
AND student_id IN (SELECT student_id
FROM student_semester s2 WHERE s2.college='AS')
AND /* other criteria */
ORDER BY sem.student_id, semester;
Version Two
SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entryyear = 2006
AND EXISTS
(SELECT 1 FROM student_semester s2
WHERE s2.student_id = s.student_id AND s2.college='AS')
-- CREATE INDEX foo on student_semester(student_id, college);
AND /* other criteria */
ORDER BY sem.student_id, semester;
I expect both to be fast, but whether one performs better than the other (or they produce the exact same plan) is a PG mystery.
[EDIT] Here's a version with no semi-joins. I wouldn't expect it to work well because it will give multiple hits for each time a student was in AS.
SELECT DISTINCT ON ( /* PK of sem */ ) sem.*
FROM student s
JOIN student_semester sem USING (student_id)
JOIN student_semester s2 USING (student_id)
WHERE s.entryyear = 2006
AND s2.college='AS'
ORDER BY sem.student_id, semester;