Improving Subquery performance in Postgres - sql

I have these two tables in my database:

Student table:
| Column     | Type     |
|------------|----------|
| student_id | integer  |
| satquan    | smallint |
| actcomp    | smallint |
| entryyear  | smallint |

Student semester table:
| Column     | Type     |
|------------|----------|
| student_id | integer  |
| semester   | integer  |
| enrolled   | boolean  |
| major      | text     |
| college    | text     |
Where student_id is a unique key in the student table, and a foreign key in the student semester table. The semester integer is just a 1 for the first semester, 2 for the second, and so on.
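For reference, here is a minimal DDL sketch of the two tables as described (types are taken from the listing above; the key and NOT NULL details are my assumptions):
CREATE TABLE student (
    student_id integer PRIMARY KEY,  -- unique key, per the description above
    satquan    smallint,
    actcomp    smallint,
    entryyear  smallint
);
CREATE TABLE student_semester (
    student_id integer NOT NULL REFERENCES student (student_id),  -- assumed NOT NULL
    semester   integer,  -- 1 for the first semester, 2 for the second, and so on
    enrolled   boolean,
    major      text,
    college    text
);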
I'm doing queries where I want to get the students by their entryyear (and sometimes by their SAT and/or ACT scores), then get all of those students' associated data from the student semester table.
Currently, my queries look something like this:
SELECT * FROM student_semester
WHERE student_id IN(
SELECT student_id FROM student_semester
WHERE student_id IN(
SELECT student_id FROM student WHERE entryyear = 2006
) AND college = 'AS' AND ...
)
ORDER BY student_id, semester;
But this results in relatively long-running queries (400ms) when I am selecting ~1k students. According to the execution plan, most of the time is spent doing a hash join. To ameliorate this, I have added satquan, actcomp, and entryyear columns to the student_semester table. This reduces the time to run the query by ~90%, but results in a lot of redundant data. Is there a better way to do this?
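To be concrete, the denormalization amounts to roughly the following (a sketch, not the exact statements I ran):
ALTER TABLE student_semester
    ADD COLUMN satquan   smallint,
    ADD COLUMN actcomp   smallint,
    ADD COLUMN entryyear smallint;
UPDATE student_semester ss
SET satquan   = s.satquan,
    actcomp   = s.actcomp,
    entryyear = s.entryyear
FROM student s
WHERE s.student_id = ss.student_id;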
These are the indexes that I currently have (Along with the implicit indexes on student_id):
CREATE INDEX act_sat_entryyear ON student USING btree (entryyear, actcomp, sattotal)
CREATE INDEX student_id_major_college ON student_semester USING btree (student_id, major, college)
Query Plan
QUERY PLAN
Hash Join (cost=17311.74..35895.38 rows=81896 width=65) (actual time=121.097..326.934 rows=25680 loops=1)
Hash Cond: (public.student_semester.student_id = public.student_semester.student_id)
-> Seq Scan on student_semester (cost=0.00..14307.20 rows=698820 width=65) (actual time=0.015..154.582 rows=698820 loops=1)
-> Hash (cost=17284.89..17284.89 rows=2148 width=8) (actual time=121.062..121.062 rows=1284 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 51kB
-> HashAggregate (cost=17263.41..17284.89 rows=2148 width=8) (actual time=120.708..120.871 rows=1284 loops=1)
-> Hash Semi Join (cost=1026.68..17254.10 rows=3724 width=8) (actual time=4.828..119.619 rows=6184 loops=1)
Hash Cond: (public.student_semester.student_id = student.student_id)
-> Seq Scan on student_semester (cost=0.00..16054.25 rows=42908 width=4) (actual time=0.013..109.873 rows=42331 loops=1)
Filter: ((college)::text = 'AS'::text)
-> Hash (cost=988.73..988.73 rows=3036 width=4) (actual time=4.801..4.801 rows=3026 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 107kB
-> Bitmap Heap Scan on student (cost=71.78..988.73 rows=3036 width=4) (actual time=0.406..3.223 rows=3026 loops=1)
Recheck Cond: (entryyear = 2006)
-> Bitmap Index Scan on student_act_sat_entryyear_index (cost=0.00..71.03 rows=3036 width=0) (actual time=0.377..0.377 rows=3026 loops=1)
Index Cond: (entryyear = 2006)
Total runtime: 327.708 ms
I was mistaken about there not being a Seq Scan in the query. I think the Seq Scan is being done because of the number of rows that match the college condition; when I change it to a college that has fewer students, an index is used. Source: https://stackoverflow.com/a/5203827/880928
Query with the entryyear column included in the student semester table
SELECT * FROM student_semester
WHERE student_id IN(
SELECT student_id FROM student_semester
WHERE entryyear = 2006 AND collgs = 'AS'
) ORDER BY student_id, semester;
Query Plan
Sort (cost=18597.13..18800.49 rows=81343 width=65) (actual time=72.946..74.003 rows=25680 loops=1)
Sort Key: public.student_semester.student_id, public.student_semester.semester
Sort Method: quicksort Memory: 3546kB
-> Nested Loop (cost=9843.87..11962.91 rows=81343 width=65) (actual time=24.617..40.751 rows=25680 loops=1)
-> HashAggregate (cost=9843.87..9845.73 rows=186 width=4) (actual time=24.590..24.836 rows=1284 loops=1)
-> Bitmap Heap Scan on student_semester (cost=1612.75..9834.63 rows=3696 width=4) (actual time=10.401..23.637 rows=6184 loops=1)
Recheck Cond: (entryyear = 2006)
Filter: ((collgs)::text = 'AS'::text)
-> Bitmap Index Scan on entryyear_act_sat_semester_enrolled_cumdeg_index (cost=0.00..1611.82 rows=60192 width=0) (actual time=10.259..10.259 rows=60520 loops=1)
Index Cond: (entryyear = 2006)
-> Index Scan using student_id_index on student_semester (cost=0.00..11.13 rows=20 width=65) (actual time=0.003..0.010 rows=20 loops=1284)
Index Cond: (student_id = public.student_semester.student_id)
Total runtime: 74.938 ms

An alternative approach to doing the query is to use window functions.
select t.* -- has the extra NumMatches column; to eliminate it, list the columns you want
from (select ss.*,
             sum(case when ss.college = 'AS' and s.entryyear = 2006 then 1 else 0 end) over
                 (partition by ss.student_id) as NumMatches
      from student_semester ss join
           student s
           on ss.student_id = s.student_id
     ) t
where NumMatches > 0;
Window functions are usually faster than joining in an aggregation, so I suspect that this might perform well.

The clean version of your query is:
select ss.*
from
student s
inner join
student_semester ss using(student_id)
where
s.entryyear = 2006
and exists (
select 1
from student_semester
where
college = 'AS'
and student_id = s.student_id
)
order by ss.student_id, semester

You want, it appears, students who entered in 2006 and who have ever been in AS college.
Version One.
SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entryyear=2006
AND student_id IN (SELECT student_id
FROM student_semester s2 WHERE s2.college='AS')
AND /* other criteria */
ORDER BY sem.student_id, semester;
Version Two
SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entryyear=2006
AND EXISTS
(SELECT 1 FROM student_semester s2
WHERE s2.student_id = s.student_id AND s2.college='AS')
-- CREATE INDEX foo on student_semester(student_id, college);
AND /* other criteria */
ORDER BY sem.student_id, semester;
I expect both to be fast, but whether one performs better than the other (or they produce the exact same plan) is a PG mystery.
[EDIT] Here's a version with no semi-joins. I wouldn't expect it to work well because it will give multiple hits for each time a student was in AS.
SELECT DISTINCT ON ( /* PK of sem */ ) sem.*
FROM student s
JOIN student_semester sem USING (student_id)
JOIN student_semester s2 USING (student_id)
WHERE s.entryyear=2006
AND s2.college='AS'
ORDER BY sem.student_id, semester;

Related

Why does this GROUP BY cause a full table scan?

I have a table of projects and a table of tasks, with each task referencing a single project. I want to get a list of projects sorted by their due dates along with the number of tasks in each project. I can pretty easily write this query two ways. First, using JOIN and GROUP BY:
SELECT p.name, p.due_date, COUNT(t.id) as num_tasks
FROM projects p
LEFT OUTER JOIN tasks t ON t.project_id = p.id
GROUP BY p.id
ORDER BY p.due_date ASC LIMIT 20;
Second, using a subquery:
SELECT p.name, p.due_date, (SELECT
COUNT(*) FROM tasks t WHERE t.project_id = p.id) as num_tasks
FROM projects p
ORDER BY p.due_date ASC LIMIT 20;
I'm using PostgreSQL 10, and I've got indices on projects.id, projects.due_date and tasks.project_id. Why does the first query using the GROUP BY clause do a full table scan while the second query makes proper use of the indices? It seems like these should compile down to the same thing.
Note that if I remove the GROUP BY and the COUNT(t.id) from the first query it will run quickly, just with lots of duplicate rows. So the problem is with the GROUP BY clause, not the JOIN. This seems like it's about the simplest GROUP BY one could do, so I'd like to understand if/how to make it more efficient before moving on to more complicated queries.
Edit — here's the result of EXPLAIN ANALYZE. First query:
Limit (cost=41919.58..41919.63 rows=20 width=53) (actual time=1046.762..1046.771 rows=20 loops=1)
-> Sort (cost=41919.58..42169.58 rows=100000 width=53) (actual time=1046.760..1046.765 rows=20 loops=1)
Sort Key: p.due_date
Sort Method: top-N heapsort Memory: 29kB
-> GroupAggregate (cost=0.71..39258.62 rows=100000 width=53) (actual time=0.109..1002.890 rows=100000 loops=1)
Group Key: p.id
-> Merge Left Join (cost=0.71..35758.62 rows=500000 width=49) (actual time=0.072..807.603 rows=500702 loops=1)
Merge Cond: (p.id = t.project_id)
-> Index Scan using projects_pkey on projects p (cost=0.29..3542.29 rows=100000 width=45) (actual time=0.025..38.363 rows=100000 loops=1)
-> Index Scan using project_id_idx on tasks t (cost=0.42..25716.33 rows=500000 width=8) (actual time=0.038..531.097 rows=500000 loops=1)
Planning Time: 0.573 ms
Execution Time: 1046.934 ms
Second query:
Limit (cost=0.29..92.61 rows=20 width=49) (actual time=0.079..0.443 rows=20 loops=1)
-> Index Scan using project_date_idx on projects p (cost=0.29..461594.09 rows=100000 width=49) (actual time=0.076..0.432 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=4.54..4.55 rows=1 width=8) (actual time=0.015..0.016 rows=1 loops=20)
-> Index Only Scan using project_id_idx on tasks t (cost=0.42..4.53 rows=6 width=0) (actual time=0.009..0.011 rows=5 loops=20)
Index Cond: (project_id = p.id)
Heap Fetches: 0
Planning Time: 0.284 ms
Execution Time: 0.551 ms
And if anyone wants to try to exactly reproduce this, here's my setup:
CREATE TABLE projects (
id serial NOT NULL PRIMARY KEY,
name varchar(100) NOT NULL,
due_date timestamp NOT NULL
);
CREATE TABLE tasks (
id serial NOT NULL PRIMARY KEY,
project_id integer NOT NULL,
data real NOT NULL
);
INSERT INTO projects (name, due_date) SELECT
md5(random()::text),
timestamp '2020-01-01 00:00:00' +
random() * (timestamp '2030-01-01 20:00:00' - timestamp '2020-01-01 10:00:00')
FROM generate_series(1, 100000);
INSERT INTO tasks (project_id, data)
SELECT CAST(1 + random()*99999 AS integer), random()
FROM generate_series(1, 500000);
CREATE INDEX project_date_idx ON projects ("due_date");
CREATE INDEX project_id_idx ON tasks ("project_id");
ALTER TABLE tasks ADD CONSTRAINT task_foreignkey FOREIGN KEY ("project_id") REFERENCES "projects" ("id") DEFERRABLE INITIALLY DEFERRED;
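For comparison, the per-project count can also be written with a LATERAL join (just a sketch, not part of the original question); like the correlated subquery, it can be evaluated per project row rather than forcing an aggregate over all of tasks, though whether the planner actually produces the same plan is not guaranteed:
SELECT p.name, p.due_date, t.num_tasks
FROM projects p
CROSS JOIN LATERAL (
    SELECT COUNT(*) AS num_tasks
    FROM tasks t
    WHERE t.project_id = p.id
) t
ORDER BY p.due_date ASC
LIMIT 20;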

SQL JOIN in PostgreSQL - Different execution plan in WHERE clause than in ON clause

We have a simple statement in PostgreSQL 11.9/11.10 or 12.5 where we can write the join condition either in a WHERE clause or in an ON clause. The meaning is exactly the same, and therefore so is the number of returned rows, but we receive different explain plans. With more data inside the tables, one execution plan gets really bad, and we want to understand why PostgreSQL chooses different plans for this situation. Any ideas?
Let's create some sample data:
CREATE TABLE t1 (
t1_nr int8 NOT NULL,
name varchar(60),
CONSTRAINT t1_pk PRIMARY KEY (t1_nr)
);
INSERT INTO t1 (t1_nr, name) SELECT s, left(md5(random()::text), 10) FROM generate_series(1, 1000000) s; -- 1 million records
CREATE TABLE t2 (
t2_nr int8 NOT NULL,
CONSTRAINT t2_pk PRIMARY KEY (t2_nr)
);
INSERT INTO t2 (t2_nr) SELECT s FROM generate_series(1, 10000000) s; -- 10 million records
CREATE TABLE t3 (
t1_nr int8 NOT NULL,
t2_nr int8 NOT NULL,
CONSTRAINT t3_pk PRIMARY KEY (t2_nr, t1_nr)
);
INSERT INTO t3 (t1_nr, t2_nr) SELECT (s-1)/10+1, s FROM generate_series(1, 10000000) s; -- 10 t2 records per t1 records --> 10 million records
Our Statement with fully analyzed statistics:
EXPLAIN (BUFFERS, ANALYZE)
SELECT t1.*
FROM t1 t1
WHERE EXISTS (
SELECT 1
FROM t3 t3
JOIN t2 t2 ON t2.t2_nr = t3.t2_nr
--AND t3.t1_nr = t1.t1_nr /* GOOD (using ON-CLAUSE) */
WHERE t3.t1_nr = t1.t1_nr /* BAD (using WHERE-CLAUSE) */
)
LIMIT 1000
The explain plan with the "GOOD" row (ON-CLAUSE):
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------|
Limit (cost=0.00..22896.86 rows=1000 width=19) (actual time=0.028..4.801 rows=1000 loops=1) |
Buffers: shared hit=8015 |
-> Seq Scan on t1 (cost=0.00..11448428.92 rows=500000 width=19) (actual time=0.027..4.725 rows=1000 loops=1) |
Filter: (SubPlan 1) |
Buffers: shared hit=8015 |
SubPlan 1 |
-> Nested Loop (cost=0.87..180.43 rows=17 width=0) (actual time=0.004..0.004 rows=1 loops=1000) |
Buffers: shared hit=8008 |
-> Index Only Scan using t3_pk on t3 (cost=0.43..36.73 rows=17 width=8) (actual time=0.002..0.002 rows=1 loops=1000)|
Index Cond: (t1_nr = t1.t1_nr) |
Heap Fetches: 1000 |
Buffers: shared hit=4003 |
-> Index Only Scan using t2_pk on t2 (cost=0.43..8.45 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1000) |
Index Cond: (t2_nr = t3.t2_nr) |
Heap Fetches: 1000 |
Buffers: shared hit=4005 |
Planning Time: 0.267 ms |
Execution Time: 4.880 ms |
The explain plan with the "BAD" row (WHERE-CLAUSE):
QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------------------------------|
Limit (cost=1166.26..7343.42 rows=1000 width=19) (actual time=16.888..75.809 rows=1000 loops=1) |
Buffers: shared hit=51883 read=11 dirtied=2 |
-> Merge Semi Join (cost=1166.26..3690609.61 rows=597272 width=19) (actual time=16.887..75.703 rows=1000 loops=1) |
Merge Cond: (t1.t1_nr = t3.t1_nr) |
Buffers: shared hit=51883 read=11 dirtied=2 |
-> Index Scan using t1_pk on t1 (cost=0.42..32353.42 rows=1000000 width=19) (actual time=0.010..0.271 rows=1000 loops=1) |
Buffers: shared hit=12 |
-> Gather Merge (cost=1000.89..3530760.13 rows=9999860 width=8) (actual time=16.873..74.064 rows=9991 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
Buffers: shared hit=51871 read=11 dirtied=2 |
-> Nested Loop (cost=0.87..2375528.14 rows=4166608 width=8) (actual time=0.054..14.275 rows=4309 loops=3) |
Buffers: shared hit=51871 read=11 dirtied=2 |
-> Parallel Index Only Scan using t3_pk on t3 (cost=0.43..370689.69 rows=4166608 width=16) (actual time=0.028..1.495 rows=4309 loops=3)|
Heap Fetches: 12927 |
Buffers: shared hit=131 read=6 |
-> Index Only Scan using t2_pk on t2 (cost=0.43..0.48 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=12927) |
Index Cond: (t2_nr = t3.t2_nr) |
Heap Fetches: 12927 |
Buffers: shared hit=51740 read=5 dirtied=2 |
Planning Time: 0.475 ms |
Execution Time: 75.947 ms |
Thanks for your ideas. If we add an index like
CREATE INDEX t3_t1_nr ON t3(t1_nr);
the "BAD" statement improves a little bit.
But the final solution for us was to increase the statistics gathered for these tables:
ALTER TABLE t1 ALTER COLUMN t1_nr SET STATISTICS 10000;
ALTER TABLE t2 ALTER COLUMN t2_nr SET STATISTICS 10000;
ALTER TABLE t3 ALTER COLUMN t1_nr SET STATISTICS 10000;
ANALYZE t1;
ANALYZE t2;
ANALYZE t3;
After this change, both SELECTs have about the same execution time.
More information can be found here: https://www.postgresql.org/docs/12/planner-stats.html
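For completeness, the same knob also exists as a global default (a sketch; 10000 mirrors the per-column value above, the built-in default is 100, and higher values make ANALYZE and planning slower):
-- per session; use postgresql.conf or ALTER SYSTEM to change it instance-wide
SET default_statistics_target = 10000;
ANALYZE t1;
ANALYZE t2;
ANALYZE t3;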

Postgres multiple predicates on multiple columns

Edited:
I thought I'd explain what I am trying to do, so someone might have a better idea of how to write the query than what I was asking.
I've got a table that has about 500 million rows and another with about 50M rows.
The table definitions are like the following
CREATE TABLE NGRAM_CONTENT
(
id BIGINT NOT NULL PRIMARY KEY,
ref TEXT NOT NULL,
data TEXT
);
CREATE INDEX idx_reference_ngram_content ON NGRAM_CONTENT (ref);
CREATE INDEX idx_id_ngram_content ON NGRAM_CONTENT (id);
CREATE TABLE NGRAMS
(
id BIGINT NOT NULL,
ngram TEXT NOT NULL,
ref TEXT NOT NULL,
name_length INT NOT NULL
);
CREATE INDEX combined_index ON NGRAMS (name_length, ngram, ref, id);
CREATE INDEX namelength_idx ON NGRAMS (name_length);
CREATE INDEX id_idx ON NGRAMS (id);
CREATE INDEX ref_idx ON NGRAMS (ref);
CREATE INDEX ngram_idx ON NGRAMS (ngram);
To allow fast bulk insertion, upstream events that have been marked as deleted are inserted with null for the data column of the NGRAM_CONTENT table, and no foreign key constraints have been set up; however, both id and ref in the NGRAMS table logically reference the NGRAM_CONTENT table.
Some sample data:
Ngram_Content:
| id | ref  | data          |
|----|------|---------------|
| 1  | 'P1' | some_json     |
| 2  | 'P1' | some_new_json |  # P1 comes again as an update
| 3  | 'P2' | P3            |
| 4  | 'P1' | null          |
Ngrams:
| name_length | ngram | ref  | id |
|-------------|-------|------|----|
| 12          | CH    | 'P1' | 1  |
| 12          | AN    | 'P1' | 1  |
| 14          | NEW   | 'P1' | 2  |
| 20          | CH    | 'P2' | 3  |
| 20          | CHAI  | 'P2' | 3  |
...
For the above data: if I search for ngrams 'CH' or 'AN' with id <= 1, it will return P1 with content some_json. However, if I search with id <= 2, it won't match, as the latest version at id = 2 has been updated to NEW. And if I search for NEW with id <= 5, it would return nothing too, as the latest P1 has been deleted.
All the searches should be done within a distance of name_length from and to.
In other words, only find the latest ngram content for a given ref that has not been deleted, up to a certain id, within a limit of name_length.
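To illustrate just the versioning part (a simplified sketch with :event_id as a placeholder, ignoring the ngram/name_length matching), "the latest non-deleted content per ref as of a given id" can be expressed like this:
SELECT *
FROM (
    SELECT DISTINCT ON (ref) id, ref, data
    FROM ngram_content
    WHERE id <= :event_id  -- drop this predicate for the "latest" case
    ORDER BY ref, id DESC
) latest
WHERE data IS NOT NULL;  -- a NULL data value means the latest version is a delete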
There are 2 conditions that I need to support:
1. With an event id (for historical runs)
2. Without an event id, using the latest
So I came up with 2 variations like this:
With event_id:
select w.* From NGRAM_CONTENT w
inner join (
select max(w.id) as w_max_event_id, w.ref from NGRAMS w
inner join (
select max(id) as max_event_id, ref from NGRAMS where
name_length between a_number and b_number AND ngram in ('YU', 'CA', 'SAN', 'LT', 'TO', etc) AND id < an_event_id group by ref having count(ref) >= a_threshold) i
on w.ref = i.ref where w.id >= i.max_event_id AND w.id < an_event_id group by w.ref) wi
on w.ref = wi.ref and w.id = wi.w_max_event_id where w.data is not null;
Without event_id:
select w.* From NGRAM_CONTENT w
inner join (
select max(w.id) as w_max_event_id, w.ref from NGRAMS w
inner join (
select max(id) as max_event_id, ref from NGRAMS where
name_length between a_number and b_number AND ngram in ('YU', 'CA', 'SAN', 'LT', 'TO', etc) group by ref having count(ref) >= a_threshold) i
on w.ref = i.ref where w.id >= i.max_event_id group by w.ref) wi
on w.ref = wi.ref and w.id = wi.w_max_event_id where w.data is not null;
Both queries take a long time to run, and when running a query explanation, Postgres shows a full scan.
Node Type = Seq Scan;
Parent Relationship = Outer;
Parallel Aware = true;
Relation Name = NGRAMS;
Alias = w_1;
Startup Cost = 0.0;
Total Cost = 3358896.0;
Plan Rows = 121494200;
Plan Width = 16;
A detailed execution plan from EXPLAIN (ANALYZE, BUFFERS):
Nested Loop (cost=5032852.92..6943974.42 rows=1 width=381) (actual time=50787.356..52095.938 rows=9437 loops=1)
Buffers: shared hit=149882 read=769965, temp read=732 written=736
-> Finalize GroupAggregate (cost=5032852.35..5125447.71 rows=265783 width=16) (actual time=50785.079..50808.811 rows=9437 loops=1)
Group Key: w_1.ref
Buffers: shared hit=114072 read=758535, temp read=732 written=736
-> Gather Merge (cost=5032852.35..5120132.05 rows=531566 width=16) (actual time=50785.072..50801.624 rows=10261 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
-> Partial GroupAggregate (cost=5031852.33..5057776.12 rows=265783 width=16) (actual time=50766.172..50777.757 rows=3420 loops=3)
Group Key: w_1.ref
Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
-> Sort (cost=5031852.33..5039607.65 rows=3102128 width=16) (actual time=50766.163..50769.734 rows=41777 loops=3)
Sort Key: w_1.ref
Sort Method: quicksort Memory: 3251kB
Worker 0: Sort Method: quicksort Memory: 3326kB
Worker 1: Sort Method: quicksort Memory: 3396kB
Buffers: shared hit=343724 read=2276169, temp read=2196 written=2208
-> Hash Join (cost=787482.50..4591332.06 rows=3102128 width=16) (actual time=14787.585..50749.022 rows=41777 loops=3)
Hash Cond: (w_1.ref = i.ref)
Join Filter: (w_1.id >= i.max_event_id)
Buffers: shared hit=343708 read=2276169, temp read=2196 written=2208
-> Parallel Seq Scan on NGRAMS w_1 (cost=0.00..3662631.50 rows=53797008 width=16) (actual time=0.147..30898.313 rows=38518899 loops=3)
Filter: (id < 45000000)
Rows Removed by Filter: 58676466
Buffers: shared hit=15819 read=2128135
-> Hash (cost=786907.78..786907.78 rows=45978 width=16) (actual time=14767.179..14767.180 rows=9437 loops=3)
Buckets: 65536 Batches: 1 Memory Usage: 955kB
Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
-> Subquery Scan on i (cost=782779.42..786907.78 rows=45978 width=16) (actual time=14669.187..14764.701 rows=9437 loops=3)
Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
-> GroupAggregate (cost=782779.42..786448.00 rows=45978 width=16) (actual time=14669.186..14763.369 rows=9437 loops=3)
Group Key: NGRAMS.ref
Filter: (count(NGRAMS.ref) >= 2)
Rows Removed by Filter: 210038
Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
-> Sort (cost=782779.42..783265.52 rows=194442 width=16) (actual time=14669.164..14708.948 rows=229489 loops=3)
Sort Key: NGRAMS.ref
Sort Method: external merge Disk: 5856kB
Worker 0: Sort Method: external merge Disk: 5856kB
Worker 1: Sort Method: external merge Disk: 5856kB
Buffers: shared hit=327861 read=148034, temp read=2196 written=2208
-> Index Only Scan using combined_index on NGRAMS (cost=0.57..762373.68 rows=194442 width=16) (actual time=0.336..14507.098 rows=229489 loops=3)
Index Cond: ((indexed = ANY ('{YU,CA,SAN,LT,TO}'::text[])) AND (name_length >= 15) AND (name_length <= 20) AND (event_id < 45000000))
Heap Fetches: 688467
Buffers: shared hit=327861 read=148034
-> Index Scan using idx_id_ngram_content on NGRAM_CONTENT w (cost=0.56..6.82 rows=1 width=381) (actual time=0.135..0.136 rows=1 loops=9437)
Index Cond: (id = (max(w_1.id)))
Filter: ((data IS NOT NULL) AND (w_1.ref = ref))
Buffers: shared hit=35810 read=11430
Planning Time: 12.075 ms
Execution Time: 52100.064 ms
Is there a way to make these queries faster?
I tried to break the query into smaller chunks and analyze them, and found that the full scan comes from this join:
select max(w.id) as w_max_event_id, w.ref from NGRAMS w
inner join (
select max(id) as max_event_id, ref from NGRAMS where
name_length between a_number and b_number AND ngram in ('YU', 'CA', 'SAN', 'LT', 'TO', etc) AND id < an_event_id group by ref having count(ref) >= a_threshold) i
on w.ref = i.ref where w.id >= i.max_event_id AND w.id < an_event_id group by w.ref
but I don't know why, and I'm not sure what indexes are missing.
Preferably the answer is for Postgres, but worst case please provide an answer for Oracle too.
I know it's lengthy, but please do try to help if you can. Thanks.
With queries as varied as that, your best bet is to create three indexes:
CREATE INDEX ON ngrams (id);
CREATE INDEX ON ngrams (name_length);
CREATE INDEX ON ngrams (ngram);
and hope that PostgreSQL can combine them with a BitmapAnd when a single condition is not selective enough.

Why would a left join cause an optimizer to ignore an index?

Using postgres 9.6.11, I have a schema like:
owner:
id: BIGINT (PK)
dog_id: BIGINT NOT NULL (FK)
cat_id: BIGINT NULL (FK)
index DOG_ID_IDX (dog_id)
index CAT_ID_IDX (cat_id)
animal:
id: BIGINT (PK)
name: VARCHAR(50) NOT NULL
index NAME_IDX (name)
In some example data:
owner table:
| id | dog_id | cat_id |
| -- | ------ | ------ |
| 1 | 100 | 200 |
| 2 | 101 | NULL |
animal table:
| id | name |
| --- | -------- |
| 100 | "fluffy" |
| 101 | "rex" |
| 200 | "tom" |
A common query I need to perform is to find owners by their pets name, which I thought to accomplish with a query like:
select *
from owner o
join animal dog on o.dog_id = dog.id
left join animal cat on o.cat_id = cat.id
where dog.name = "fluffy" or cat.name = "fluffy";
But the plan I get back from this I don't understand:
Hash Join (cost=30304.51..77508.31 rows=3 width=899)
Hash Cond: (dog.id = owner.dog_id)
Join Filter: (((dog.name)::text = 'fluffy'::text) OR ((cat.name)::text = 'fluffy'::text))
-> Seq Scan on animal dog (cost=0.00..17961.23 rows=116623 width=899)
-> Hash (cost=28208.65..28208.65 rows=114149 width=19)
-> Hash Left Join (cost=20103.02..28208.65 rows=114149 width=19)
Hash Cond: (owner.cat_id = cat.id)
-> Seq Scan on owner o (cost=0.00..5849.49 rows=114149 width=16)
-> Hash (cost=17961.23..17961.23 rows=116623 width=19)
-> Seq Scan on animal cat (cost=0.00..17961.23 rows=116623 width=19)
I don't understand why the query plan is doing a sequential scan.
I thought that the optimizer would be smart enough to scan the animal table once, or even twice using the name index, and join back to the owner table based on this result, but instead I wind up with a very unexpected query plan.
I took a simpler case where we only want to look up dog names and the query behaves as I'd expect:
select *
from owner o
join animal dog on o.dog_id = dog.id
where dog.name = "fluffy";
This query produces a plan I understand, using the index on animal.name:
Nested Loop (cost=0.83..16.88 rows=1 width=1346)
-> Index Scan using DOG_ID_IDX on animal dog (cost=0.42..8.44 rows=1 width=899)
Index Cond: ((name)::text = 'fluffy'::text)
-> Index Scan using dog_id on owner o (cost=0.42..8.44 rows=1 width=447)
Index Cond: (dog_id = b.id)
Even doing the query with two inner joins produces a query plan I would expect:
select *
from owner o
join animal dog on o.dog_id = dog.id
join animal cat on o.cat_id = cat.id
where dog.name = 'fluffy' or cat.name = 'fluffy';
Merge Join (cost=35726.09..56215.53 rows=3 width=2245)
Merge Cond: (owner.cat_id = cat.id)
Join Filter: (((dog.name)::text = 'fluffy'::text) OR ((cat.name)::text = 'fluffy'::text))
-> Nested Loop (cost=0.83..132348.38 rows=114149 width=1346)
-> Index Scan using CAT_ID_IDX on owner o (cost=0.42..11616.07 rows=114149 width=447)
-> Index Scan using animal_pkey on animal dog (cost=0.42..1.05 rows=1 width=899)
Index Cond: (id = owner.dog_id)
-> Index Scan using animal_pkey on animal cat (cost=0.42..52636.91 rows=116623 width=899)
So it looks like the left join to animal is causing the optimizer to ignore the index.
Why does doing the additional left join to animal seem to cause the optimizer to ignore the index?
EDIT:
EXPLAIN (analyse, buffers) yields:
Hash Left Join (cost=32631.95..150357.57 rows=3 width=2245) (actual time=6696.935..6696.936 rows=0 loops=1)
Hash Cond: (o.cat_id = cat.id)
Filter: (((dog.name)::text = 'fluffy'::text) OR ((cat.name)::text = 'fluffy'::text))
Rows Removed by Filter: 114219
Buffers: shared hit=170464 read=18028 dirtied=28, temp read=13210 written=13148
-> Merge Join (cost=0.94..65696.37 rows=114149 width=1346) (actual time=1.821..860.643 rows=114219 loops=1)
Merge Cond: (o.dog_id = dog.id)
Buffers: shared hit=170286 read=1408 dirtied=28
-> Index Scan using DOG_ID_IDX on owner o (cost=0.42..11402.48 rows=114149 width=447) (actual time=1.806..334.431 rows=114219 loops=1)
Buffers: shared hit=84787 read=783 dirtied=13
-> Index Scan using animal_pkey on animal dog (cost=0.42..52636.91 rows=116623 width=899) (actual time=0.006..300.507 rows=116977 loops=1)
Buffers: shared hit=85499 read=625 dirtied=15
-> Hash (cost=17961.23..17961.23 rows=116623 width=899) (actual time=5626.780..5626.780 rows=116977 loops=1)
Buckets: 8192 Batches: 32 Memory Usage: 3442kB
Buffers: shared hit=175 read=16620, temp written=12701
-> Seq Scan on animal cat (cost=0.00..17961.23 rows=116623 width=899) (actual time=2.519..5242.106 rows=116977 loops=1)
Buffers: shared hit=175 read=16620
Planning time: 1.245 ms
Execution time: 6697.357 ms
The left join needs to keep all rows from the first table. Hence, it is generally going to scan that table, even when WHERE conditions filter on columns from the other tables.
The query plan produced by Postgres is not surprising.
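One common rewrite for this situation (a sketch only, not from the discussion above; note that it narrows the select list to the owner columns) is to split the OR into a UNION so that each branch can use the index on animal.name:
select o.*
from owner o
join animal dog on o.dog_id = dog.id
where dog.name = 'fluffy'
union
select o.*
from owner o
join animal cat on o.cat_id = cat.id
where cat.name = 'fluffy';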

Unpredictable query performance in Postgresql

I have tables like that in a Postgres 9.3 database:
A <1---n B n---1> C
Table A contains ~10^7 rows, table B is rather big with ~10^9 rows and C contains ~100 rows.
I use the following query to find all As (distinct) that match some criteria in B and C (the real query is more complex, joins more tables and checks more attributes within the subquery):
Query 1:
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-12' and
C.Name = '00000015')
limit 200;
That query takes about 500ms (Note that C.Name = '00000015' exists in the table):
Limit (cost=119656.37..120234.06 rows=200 width=9) (actual time=427.799..465.485 rows=200 loops=1)
-> Hash Semi Join (cost=119656.37..483518.78 rows=125971 width=9) (actual time=427.797..465.460 rows=200 loops=1)
Hash Cond: (a.id = b.aid)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..15.058 rows=133470 loops=1)
-> Hash (cost=117588.73..117588.73 rows=125971 width=4) (actual time=427.233..427.233 rows=190920 loops=1)
Buckets: 4096 Batches: 8 Memory Usage: 838kB
-> Nested Loop (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.176..400.326 rows=190920 loops=1)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.015..0.030 rows=1 loops=1)
Filter: (name = '00000015'::text)
Rows Removed by Filter: 149
-> Index Only Scan using cid_aid on b (cost=0.57..116291.64 rows=129422 width=8) (actual time=0.157..382.896 rows=190920 loops=1)
Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 476.173 ms
Query 2: Changing C.Name to something that doesn't exist (C.Name = 'foo') takes 0.1ms:
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-12' and
C.Name = 'foo')
limit 200;
Limit (cost=119656.37..120234.06 rows=200 width=9) (actual time=0.063..0.063 rows=0 loops=1)
-> Hash Semi Join (cost=119656.37..483518.78 rows=125971 width=9) (actual time=0.062..0.062 rows=0 loops=1)
Hash Cond: (a.id = b.aid)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..0.010 rows=1 loops=1)
-> Hash (cost=117588.73..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1)
Buckets: 4096 Batches: 8 Memory Usage: 0kB
-> Nested Loop (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.037..0.037 rows=0 loops=1)
Filter: (name = 'foo'::text)
Rows Removed by Filter: 150
-> Index Only Scan using cid_aid on b (cost=0.57..116291.64 rows=129422 width=8) (never executed)
Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 0.120 ms
Query 3: Resetting the C.Name to something that exists (like in the first query) and increasing the timestamp by 3 days uses another query plan than before, but is still fast (200ms):
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-15' and
C.Name = '00000015')
limit 200;
Limit (cost=0.57..112656.93 rows=200 width=9) (actual time=4.404..227.569 rows=200 loops=1)
-> Nested Loop Semi Join (cost=0.57..90347016.34 rows=160394 width=9) (actual time=4.403..227.544 rows=200 loops=1)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..1.046 rows=12250 loops=1)
-> Nested Loop (cost=0.57..7.49 rows=1 width=4) (actual time=0.017..0.017 rows=0 loops=12250)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.005..0.015 rows=1 loops=12250)
Filter: (name = '00000015'::text)
Rows Removed by Filter: 147
-> Index Only Scan using cid_aid on b (cost=0.57..4.60 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=12250)
Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 227.632 ms
Query 4: But that new query plan utterly fails when searching for a C.Name that doesn't exist:
explain analyze
select A.SNr from A
where exists (select 1 from B, C
where B.AId = A.Id and
B.CId = C.Id and
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-15' and
C.Name = 'foo')
limit 200;
Now it takes 170 seconds (vs. 0.1ms before!) to return the same 0 rows:
Limit (cost=0.57..112656.93 rows=200 width=9) (actual time=170184.979..170184.979 rows=0 loops=1)
-> Nested Loop Semi Join (cost=0.57..90347016.34 rows=160394 width=9) (actual time=170184.977..170184.977 rows=0 loops=1)
-> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..794.626 rows=12020034 loops=1)
-> Nested Loop (cost=0.57..7.49 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034)
-> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034)
Filter: (name = 'foo'::text)
Rows Removed by Filter: 150
-> Index Only Scan using cid_aid on b (cost=0.57..4.60 rows=1 width=8) (never executed)
Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone))
Heap Fetches: 0
Total runtime: 170185.033 ms
All queries were run after "alter table set statistics" with a value of 10000 on all columns and after running analyze on the whole db.
Right now it looks like the slightest change of a parameter (not even of the SQL) can make Postgres choose a bad plan (0.1ms vs. 170s in this case!). I always try to check query plans when changing things, but it's hard to ever be sure that something will work when such small changes on parameters can make such huge differences. I have similar problems with other queries too.
What can I do to get more predictable results?
(I have tried modifying certain query planning parameters (set enable_... = on/off) and some different SQL statements - joining+distinct/group by instead of "exists" - but nothing seems to make postgres choose "stable" query plans while still providing acceptable performance).
Edit #1: Table + index definitions
test=# \d a
                            Table "public.a"
 Column |  Type   |                   Modifiers
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('a_id_seq'::regclass)
 anr    | integer |
 snr    | text    |
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
    "anr_snr_index" UNIQUE, btree (anr, snr)
    "anr_index" btree (anr)
Foreign-key constraints:
    "anr_fkey" FOREIGN KEY (anr) REFERENCES pt(id)
Referenced by:
    TABLE "b" CONSTRAINT "aid_fkey" FOREIGN KEY (aid) REFERENCES a(id)
test=# \d b
                     Table "public.b"
  Column   |            Type             | Modifiers
-----------+-----------------------------+-----------
 id        | uuid                        | not null
 timestamp | timestamp without time zone |
 cid       | integer                     |
 aid       | integer                     |
 prop1     | text                        |
 propn     | integer                     |
Indexes:
    "b_pkey" PRIMARY KEY, btree (id)
    "aid_cid" btree (aid, cid)
    "cid_aid" btree (cid, aid, "timestamp")
    "timestamp_index" btree ("timestamp")
Foreign-key constraints:
    "aid_fkey" FOREIGN KEY (aid) REFERENCES a(id)
    "cid_fkey" FOREIGN KEY (cid) REFERENCES c(id)
test=# \d c
                            Table "public.c"
 Column |  Type   |                   Modifiers
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('c_id_seq'::regclass)
 name   | text    |
Indexes:
    "c_pkey" PRIMARY KEY, btree (id)
    "c_name_index" UNIQUE, btree (name)
Referenced by:
    TABLE "b" CONSTRAINT "cid_fkey" FOREIGN KEY (cid) REFERENCES c(id)
Your problem is that the query needs to evaluate the correlated subquery for the entire table a. When Postgres quickly finds 200 random rows that fit (which seems to occasionally be the case when c.name exists), it yields them accordingly, reasonably fast if there are plenty to choose from. But when no such rows exist, it evaluates the entire hogwash in the exists() statement as many times as table a has rows, hence the performance issue you're seeing.
Adding an uncorrelated where clause will most certainly fix a number of edge cases:
and exists(select 1 from c where name = ?)
It might also work if you join the latter with b and write it as a CTE:
with bc as (
select aid
from b join c on b.cid = c.id
and b.timestamp between ? and ?
and c.name = ?
)
select a.id
from a
where exists (select 1 from bc)
and exists (select 1 from bc where a.id = bc.aid)
limit 200
If not, just toss in the bc query verbatim instead of using the cte. The point here is to force Postgres to consider the bc lookup as independent, and bail early if the resulting set yields no rows at all.
I assume your query is more complex in the end, but note that the above could be rewritten as:
with bc as (...)
select aid
from bc
limit 200
Or:
with bc as (...)
select a.id
from a
where a.id in (select aid from bc)
limit 200
Both should yield better plans in edge cases.
(Side note: it's usually unadvisable to limit without ordering.)
Maybe try to rewrite the query with a CTE?
with BC as (
select distinct B.AId from B where
B.Timestamp >= '2013-01-01' and
B.Timestamp <= '2013-01-12' and
B.CId in (select C.Id from C where C.Name = '00000015')
limit 200
)
select A.SNr from A where A.Id in (select AId from BC)
If I understand correctly, the limit could easily be put inside the BC query to avoid a scan on table A.