I have two tables, one for profiles and one for the profile's employment status. The two tables have a one-to-one relationship: a profile might or might not have an employment status. The table schemas are as follows (irrelevant columns removed for clarity):
create type employment_status as enum ('claimed', 'approved', 'denied');
create table if not exists profiles
(
id bigserial not null
constraint profiles_pkey
primary key
);
create table if not exists employments
(
id bigserial not null
constraint employments_pkey
primary key,
status employment_status not null,
profile_id bigint not null
constraint fk_rails_d95865cd58
references profiles
on delete cascade
);
create unique index if not exists index_employments_on_profile_id
on employments (profile_id);
With these tables, I was asked to list all unemployed profiles. An unemployed profile is defined as a profile not having an employment record or having an employment with a status other than "approved".
My first attempt was the following query:
SELECT * FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE employments.status != 'approved'
The assumption here was that all profiles would be listed with their respective employments, and I could then filter them with the WHERE condition. Any profile without an employment record would have a NULL employment status and would therefore, I assumed, satisfy the condition. However, this query did not return profiles without an employment.
After some research I found this post, which explains why it doesn't work (a NULL compared with != yields NULL rather than TRUE, so those rows are dropped by the WHERE clause), and transformed my query:
SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';
This actually did work. But my ORM produced a slightly different query, which didn't work.
SELECT profiles.* FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'
The only difference is the SELECT clause. To understand why such a small change mattered so much, I ran EXPLAIN ANALYZE on all three queries:
EXPLAIN ANALYZE SELECT * FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE employments.status != 'approved'
Hash Join (cost=14.28..37.13 rows=846 width=452) (actual time=0.025..0.027 rows=2 loops=1)
Hash Cond: (e.profile_id = profiles.id)
-> Seq Scan on employments e (cost=0.00..20.62 rows=846 width=68) (actual time=0.008..0.009 rows=2 loops=1)
Filter: (status <> 'approved'::employment_status)
Rows Removed by Filter: 1
-> Hash (cost=11.90..11.90 rows=190 width=384) (actual time=0.007..0.007 rows=8 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.003..0.004 rows=8 loops=1)
Planning Time: 0.111 ms
Execution Time: 0.053 ms
EXPLAIN ANALYZE SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';
Hash Right Join (cost=14.28..37.13 rows=846 width=452) (actual time=0.036..0.042 rows=8 loops=1)
Hash Cond: (employments.profile_id = profiles.id)
-> Seq Scan on employments (cost=0.00..20.62 rows=846 width=68) (actual time=0.005..0.005 rows=2 loops=1)
Filter: (status <> 'approved'::employment_status)
Rows Removed by Filter: 1
-> Hash (cost=11.90..11.90 rows=190 width=384) (actual time=0.015..0.015 rows=8 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.010..0.011 rows=8 loops=1)
Planning Time: 0.106 ms
Execution Time: 0.108 ms
EXPLAIN ANALYZE SELECT profiles.* FROM "profiles"
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'
Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.006..0.007 rows=8 loops=1)
Planning Time: 0.063 ms
Execution Time: 0.016 ms
The first and second query plans are almost the same except that one uses a hash join and the other a hash right join, while the last query doesn't even perform the join or evaluate the condition.
I came up with a fourth query that did work:
EXPLAIN ANALYZE SELECT profiles.* FROM profiles
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE (employments.id IS NULL OR employments.status != 'approved')
Hash Right Join (cost=14.28..35.02 rows=846 width=384) (actual time=0.021..0.026 rows=7 loops=1)
Hash Cond: (employments.profile_id = profiles.id)
Filter: ((employments.id IS NULL) OR (employments.status <> 'approved'::employment_status))
Rows Removed by Filter: 1
-> Seq Scan on employments (cost=0.00..18.50 rows=850 width=20) (actual time=0.002..0.003 rows=3 loops=1)
-> Hash (cost=11.90..11.90 rows=190 width=384) (actual time=0.011..0.011 rows=8 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on profiles (cost=0.00..11.90 rows=190 width=384) (actual time=0.007..0.008 rows=8 loops=1)
Planning Time: 0.104 ms
Execution Time: 0.049 ms
My questions about the subject are:
Why are the query plans for the second and third queries different even though the queries have the same structure?
Why are the query plans for the first and fourth queries different even though the queries have the same structure?
Why is Postgres totally ignoring my join and where condition for the third query?
EDIT:
With the following sample data, the expected query should return 2 and 3.
insert into profiles values (1);
insert into profiles values (2);
insert into profiles values (3);
insert into employments (profile_id, status) values (1, 'approved');
insert into employments (profile_id, status) values (2, 'denied');
There must be a unique or primary key constraint on employments.profile_id (or it is a view with an appropriate DISTINCT clause) so that the optimizer knows that there can be at most one row in employments that is related to a given row in profiles.
If that is the case and you don't use any of employments' columns in the SELECT list (or anywhere else above the join), the optimizer deduces that the join is redundant and need not be computed, which makes for a simpler and faster execution plan.
See the comment for join_is_removable in src/backend/optimizer/plan/analyzejoins.c:
/*
* join_is_removable
* Check whether we need not perform this special join at all, because
* it will just duplicate its left input.
*
* This is true for a left join for which the join condition cannot match
* more than one inner-side row. (There are other possibly interesting
* cases, but we don't have the infrastructure to prove them.) We also
* have to check that the inner side doesn't generate any variables needed
* above the join.
*/
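Not part of the original answer, but given the unique index on employments.profile_id, the requirement can also be expressed as an anti-join, which sidesteps both the NULL-comparison trap and the join-removal surprise. A minimal sketch, assuming only the schema above:
-- A profile is unemployed when no 'approved' employment row exists for it.
-- Profiles with no employment row at all and profiles whose status is
-- 'claimed' or 'denied' both satisfy the NOT EXISTS.
SELECT p.*
FROM profiles p
WHERE NOT EXISTS (
    SELECT 1
    FROM employments e
    WHERE e.profile_id = p.id
      AND e.status = 'approved'
);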
Related
I have a table of projects and a table of tasks, with each task referencing a single project. I want to get a list of projects sorted by their due dates along with the number of tasks in each project. I can pretty easily write this query two ways. First, using JOIN and GROUP BY:
SELECT p.name, p.due_date, COUNT(t.id) as num_tasks
FROM projects p
LEFT OUTER JOIN tasks t ON t.project_id = p.id
GROUP BY p.id
ORDER BY p.due_date ASC LIMIT 20;
Second, using a subquery:
SELECT p.name, p.due_date, (SELECT
COUNT(*) FROM tasks t WHERE t.project_id = p.id) as num_tasks
FROM projects p
ORDER BY p.due_date ASC LIMIT 20;
I'm using PostgreSQL 10, and I've got indices on projects.id, projects.due_date and tasks.project_id. Why does the first query using the GROUP BY clause do a full table scan while the second query makes proper use of the indices? It seems like these should compile down to the same thing.
Note that if I remove the GROUP BY and the COUNT(t.id) from the first query it will run quickly, just with lots of duplicate rows. So the problem is with the GROUP BY clause, not the JOIN. This seems like it's about the simplest GROUP BY one could do, so I'd like to understand if/how to make it more efficient before moving on to more complicated queries.
Edit — here's the result of EXPLAIN ANALYZE. First query:
Limit (cost=41919.58..41919.63 rows=20 width=53) (actual time=1046.762..1046.771 rows=20 loops=1)
-> Sort (cost=41919.58..42169.58 rows=100000 width=53) (actual time=1046.760..1046.765 rows=20 loops=1)
Sort Key: p.due_date
Sort Method: top-N heapsort Memory: 29kB
-> GroupAggregate (cost=0.71..39258.62 rows=100000 width=53) (actual time=0.109..1002.890 rows=100000 loops=1)
Group Key: p.id
-> Merge Left Join (cost=0.71..35758.62 rows=500000 width=49) (actual time=0.072..807.603 rows=500702 loops=1)
Merge Cond: (p.id = t.project_id)
-> Index Scan using projects_pkey on projects p (cost=0.29..3542.29 rows=100000 width=45) (actual time=0.025..38.363 rows=100000 loops=1)
-> Index Scan using project_id_idx on tasks t (cost=0.42..25716.33 rows=500000 width=8) (actual time=0.038..531.097 rows=500000 loops=1)
Planning Time: 0.573 ms
Execution Time: 1046.934 ms
Second query:
Limit (cost=0.29..92.61 rows=20 width=49) (actual time=0.079..0.443 rows=20 loops=1)
-> Index Scan using project_date_idx on projects p (cost=0.29..461594.09 rows=100000 width=49) (actual time=0.076..0.432 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=4.54..4.55 rows=1 width=8) (actual time=0.015..0.016 rows=1 loops=20)
-> Index Only Scan using project_id_idx on tasks t (cost=0.42..4.53 rows=6 width=0) (actual time=0.009..0.011 rows=5 loops=20)
Index Cond: (project_id = p.id)
Heap Fetches: 0
Planning Time: 0.284 ms
Execution Time: 0.551 ms
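As an aside that is not from the original post: the shape of the fast second plan can also be written explicitly with LATERAL, which keeps the per-project count as a correlated step evaluated only for the rows surviving the ORDER BY ... LIMIT. A sketch, assuming the setup shown below:
-- Equivalent to the correlated-subquery form; the aggregate subquery always
-- returns exactly one row per project.
SELECT p.name, p.due_date, t.num_tasks
FROM projects p
CROSS JOIN LATERAL (
    SELECT COUNT(*) AS num_tasks
    FROM tasks
    WHERE tasks.project_id = p.id
) t
ORDER BY p.due_date ASC
LIMIT 20;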
And if anyone wants to try to exactly reproduce this, here's my setup:
CREATE TABLE projects (
id serial NOT NULL PRIMARY KEY,
name varchar(100) NOT NULL,
due_date timestamp NOT NULL
);
CREATE TABLE tasks (
id serial NOT NULL PRIMARY KEY,
project_id integer NOT NULL,
data real NOT NULL
);
INSERT INTO projects (name, due_date) SELECT
md5(random()::text),
timestamp '2020-01-01 00:00:00' +
random() * (timestamp '2030-01-01 20:00:00' - timestamp '2020-01-01 10:00:00')
FROM generate_series(1, 100000);
INSERT INTO tasks (project_id, data)
SELECT CAST(1 + random()*99999 AS integer), random()
FROM generate_series(1, 500000);
CREATE INDEX project_date_idx ON projects ("due_date");
CREATE INDEX project_id_idx ON tasks ("project_id");
ALTER TABLE tasks ADD CONSTRAINT task_foreignkey FOREIGN KEY ("project_id") REFERENCES "projects" ("id") DEFERRABLE INITIALLY DEFERRED;
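One hedged note on the setup (not in the original): after bulk-loading this many rows it is worth refreshing the planner statistics before comparing plans, otherwise the estimates may still reflect the empty tables.
-- Refresh statistics so row-count estimates match the freshly loaded data.
ANALYZE projects;
ANALYZE tasks;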
What is the difference between using ON and WHERE in a sub join when using an outer reference?
Consider these two SQL statements as an example (looking for 10 persons with not closed tasks, using person_task with a many-to-many relationship):
select p.name
from person p
where exists (
select 1
from person_task pt
join task t on pt.task_id = t.id
and t.state <> 'closed'
and pt.person_id = p.id -- ON
)
limit 10
select p.name
from person p
where exists (
select 1
from person_task pt
join task t on pt.task_id = t.id and t.state <> 'closed'
where pt.person_id = p.id -- WHERE
)
limit 10
They produce the same result but the statement with ON is considerably faster.
Here the corresponding EXPLAIN (ANALYZE) statements:
-- USING ON
Limit (cost=0.00..270.98 rows=10 width=8) (actual time=10.412..60.876 rows=10 loops=1)
-> Seq Scan on person p (cost=0.00..28947484.16 rows=1068266 width=8) (actual time=10.411..60.869 rows=10 loops=1)
Filter: (SubPlan 1)
Rows Removed by Filter: 68
SubPlan 1
-> Nested Loop (cost=1.00..20257.91 rows=1632 width=0) (actual time=0.778..0.778 rows=0 loops=78)
-> Index Scan using person_taskx1 on person_task pt (cost=0.56..6551.27 rows=1632 width=8) (actual time=0.633..0.633 rows=0 loops=78)
Index Cond: (id = p.id)
-> Index Scan using taskxpk on task t (cost=0.44..8.40 rows=1 width=8) (actual time=1.121..1.121 rows=1 loops=10)
Index Cond: (id = pt.task_id)
Filter: (state <> 'open')
Planning Time: 0.466 ms
Execution Time: 60.920 ms
-- USING WHERE
Limit (cost=2818814.57..2841563.37 rows=10 width=8) (actual time=29.075..6884.259 rows=10 loops=1)
-> Merge Semi Join (cost=2818814.57..59308642.64 rows=24832 width=8) (actual time=29.075..6884.251 rows=10 loops=1)
Merge Cond: (p.id = pt.person_id)
-> Index Scan using personxpk on person p (cost=0.43..1440340.27 rows=2136533 width=16) (actual time=0.003..0.168 rows=18 loops=1)
-> Gather Merge (cost=1001.03..57357094.42 rows=40517669 width=8) (actual time=9.441..6881.180 rows=23747 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=1.00..52679350.05 rows=16882362 width=8) (actual time=1.862..4207.577 rows=7938 loops=3)
-> Parallel Index Scan using person_taskx1 on person_task pt (cost=0.56..25848782.35 rows=16882362 width=16) (actual time=1.344..1807.664 rows=7938 loops=3)
-> Index Scan using taskxpk on task t (cost=0.44..1.59 rows=1 width=8) (actual time=0.301..0.301 rows=1 loops=23814)
Index Cond: (id = pt.task_id)
Filter: (state <> 'open')
Planning Time: 0.430 ms
Execution Time: 6884.349 ms
Should the ON clause therefore always be used for filtering values in a sub-join? Or what is going on?
I have used Postgres for this example.
The condition and pt.person_id = p.id doesn't refer to any column of the joined table t. In an inner join this doesn't make much sense semantically, and we can move the condition from ON to WHERE to make the query more readable.
You are right, therefore, that the two queries are equivalent and should result in the same execution plan. As this is not the case, PostgreSQL seems to have a problem in its optimizer here.
In an outer join such a condition in ON can make sense and would behave differently from one in WHERE. I assume that this is the reason the optimizer finds a different plan for ON in general: once it detects the condition in ON, it goes another route, oblivious of the join type (that is my assumption). I am surprised, though, that this leads to a better plan; I would rather have expected a worse one.
This may indicate that the table's statistics are not up-to-date. Please analyze the tables to make sure. Or it may be a sore spot in the optimizer code PostgreSQL developers might want to work on.
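To make the ON-versus-WHERE distinction for outer joins concrete, here is a small hedged illustration. It assumes a simplified, hypothetical schema in which task carries a person_id column directly (the original schema goes through person_task):
-- Filter in ON: persons without a matching non-closed task are still returned,
-- padded with NULLs in the task columns.
SELECT p.id, t.id AS task_id
FROM person p
LEFT JOIN task t ON t.person_id = p.id AND t.state <> 'closed';

-- Filter in WHERE: the predicate is evaluated after the join, the NULL-padded
-- rows fail it, and the LEFT JOIN effectively degenerates into an INNER JOIN.
SELECT p.id, t.id AS task_id
FROM person p
LEFT JOIN task t ON t.person_id = p.id
WHERE t.state <> 'closed';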
I need your suggestions on the query performance issue below.
CREATE TEMPORARY TABLE tableA( id bigint NOT NULL, second_id bigint );
CREATE INDEX idx_tableA_id ON tableA USING btree (id);
CREATE INDEX idx_tableA_second_id ON tableA USING btree (second_id);
Here table A has 100K records.
CREATE TABLE tableB( id bigint NOT NULL);
CREATE INDEX idx_tableB_id ON tableB USING btree (id);
But table B holds 145GB of data.
If I execute the query with a single LEFT JOIN, like either of the following,
select a.id from tableA A left join tableB B on B.id = A.id
or
select a.id from tableA A left join tableB B on B.id = A.second_id
I get the data quickly. But when I combine both LEFT JOINs, the query takes 30 minutes to return the records.
SELECT a.id
FROM tableA A LEFT JOIN tableB B on B.id = A.id
LEFT JOIN tableB B1 on B1.id = A.second_id;
I have indexes on the respective columns. Any other performance suggestions to reduce the execution time?
VERSION: "PostgreSQL 9.5.15 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit"
Execution plan
Hash Right Join (cost=18744968.20..108460708.81 rows=298384424 width=8) (actual time=469896.453..1290666.446 rows=26520 loops=1)
Hash Cond: (tableB.id = tableA.id)
-> Seq Scan on tableB ubp1 (cost=0.00..63944581.96 rows=264200740 width=8) (actual time=127.504..1182339.167 rows=268297289 loops=1)
Filter: (company_type_id = 2)
Rows Removed by Filter: 1409407086
-> Hash (cost=18722912.16..18722912.16 rows=1764483 width=8) (actual time=16564.303..16564.303 rows=26520 loops=1)
Buckets: 2097152 Batches: 1 Memory Usage: 17420kB
-> Merge Join (cost=6035.58..18722912.16 rows=1764483 width=8) (actual time=37.964..16503.057 rows=26520 loops=1)
-> Nested Loop Left Join (cost=0.86..18686031.22 rows=1752390 width=8) (actual time=0.019..16412.634 rows=26520 loops=1)
-> Index Scan using idx_tableA_id on tableA A (cost=0.29..94059.62 rows=26520 width=16) (actual time=0.013..69.016 rows=26520 loops=1)
-> Index Scan using idx_tableB_id on tableB B (cost=0.58..699.36 rows=169 width=8) (actual time=0.458..0.615 rows=0 loops=26520)
Index Cond: (tableA.id = tableB.second_id)
Filter: (company_type_id = 2)
Rows Removed by Filter: 2
-> Sort (cost=6034.21..6100.97 rows=26703 width=8) (actual time=37.941..54.444 rows=26520 loops=1)
Rows Removed by Filter: 105741
Planning time: 0.878 ms
Execution time: 1290674.154 ms
Thanks and Regards,
Thiru.M
I suspect
B1.second_id either does not have an index
B1.second_id is not unique (or a primary key)
B1.second_id is part of a multi-column index where it is not the first column in the index
If you have an index on that column, you could also try moving the indexes to a different volume so that it's not in contention with the main data retrieval.
Run EXPLAIN on your query to verify that indexes are being used instead of falling back to a sequential scan on a 145GB volume.
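For example, a hedged sketch of that check, using the table names from the question:
-- ANALYZE shows the plan actually chosen; BUFFERS shows how much is read from
-- disk versus shared buffers, which matters for a 145GB table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.id
FROM tableA a
LEFT JOIN tableB b  ON b.id  = a.id
LEFT JOIN tableB b1 ON b1.id = a.second_id;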
You also didn't mention how much RAM your database has available. If your settings combined with your query are making the system swap, you could see precipitous drops in performance as well.
I'm trying to build a report with a query that accesses a different Postgres DB via FDW.
And I'm wondering why it works the way it does.
The first query, without a WHERE clause, is fine:
SELECT s.student_id, p.surname
FROM rep_student s inner JOIN rep_person p ON p.id = s.person_id
But adding a WHERE clause makes the query a hundred times slower (40s vs 0.1s):
SELECT s.student_id, p.surname
FROM rep_student s inner JOIN rep_person p ON p.id = s.person_id
WHERE s.learning_end_date IS NULL
Result for EXPLAIN VERBOSE:
Nested Loop (cost=200.00..226.39 rows=1 width=734)
Output: s.student_id, p.surname
Join Filter: ((s.person_id)::text = (p.id)::text)
-> Foreign Scan on public.rep_student s (cost=100.00..111.80 rows=1 width=436)
Output: s.student_id, s.version, s.person_id, s.curriculum_flow_id, s.learning_start_date, s.learning_end_date, s.learning_end_reason, s.last_update_timestamp, s.aud_created_ts, s.aud_created_by, s.aud_last_updated_ts, s.aud_last_updated_by
Remote SQL: SELECT student_id, person_id FROM public.rep_student WHERE ((learning_end_date IS NULL))
-> Foreign Scan on public.rep_person p (cost=100.00..113.24 rows=108 width=734)
Output: p.id, p.version, p.surname, p.name, p.middle_name, p.birthdate, p.info, p.photo, p.last_update_timestamp, p.is_archived, p.gender, p.aud_created_ts, p.aud_created_by, p.aud_last_updated_ts, p.aud_last_updated_by, p.full_name
Remote SQL: SELECT id, surname FROM public.rep_person
Result for EXPLAIN ANALYZE:
Nested Loop (cost=200.00..226.39 rows=1 width=734) (actual time=27.138..38996.303 rows=939 loops=1)
Join Filter: ((s.person_id)::text = (p.id)::text)
Rows Removed by Join Filter: 15194898
-> Foreign Scan on rep_student s (cost=100.00..111.80 rows=1 width=436) (actual time=0.685..4.259 rows=939 loops=1)
-> Foreign Scan on rep_person p (cost=100.00..113.24 rows=108 width=734) (actual time=1.380..39.094 rows=16183 loops=939)
Planning time: 0.251 ms
Execution time: 38997.914 ms
The row counts are relatively small. Almost all rows in the student table have NULL in the learning_end_date column.
Student: ~1,000 rows. Person: ~15,000 rows.
It seems that Postgres has issues with filtering NULLs with FDW, because this query executes fast again:
SELECT s.student_id, p.surname
FROM rep_student s inner JOIN rep_person p ON p.id = s.person_id
WHERE s.learning_start_date < current_date
Result for EXPLAIN VERBOSE:
Hash Join (cost=214.59..231.83 rows=36 width=734)
Output: s.student_id, p.surname
Hash Cond: ((s.person_id)::text = (p.id)::text)
-> Foreign Scan on public.rep_student s (cost=100.00..116.65 rows=59 width=436)
Output: s.student_id, s.version, s.person_id, s.curriculum_flow_id, s.learning_start_date, s.learning_end_date, s.learning_end_reason, s.last_update_timestamp, s.aud_created_ts, s.aud_created_by, s.aud_last_updated_ts, s.aud_last_updated_by
Filter: (s.learning_start_date < ('now'::cstring)::date)
Remote SQL: SELECT student_id, person_id, learning_start_date FROM public.rep_student
-> Hash (cost=113.24..113.24 rows=108 width=734)
Output: p.surname, p.id
-> Foreign Scan on public.rep_person p (cost=100.00..113.24 rows=108 width=734)
Output: p.surname, p.id
Remote SQL: SELECT id, surname FROM public.rep_person
Result for EXPLAIN ANALYZE:
Hash Join (cost=214.59..231.83 rows=36 width=734) (actual time=41.614..46.347 rows=940 loops=1)
Hash Cond: ((s.person_id)::text = (p.id)::text)
-> Foreign Scan on rep_student s (cost=100.00..116.65 rows=59 width=436) (actual time=0.718..3.829 rows=940 loops=1)
Filter: (learning_start_date < ('now'::cstring)::date)
-> Hash (cost=113.24..113.24 rows=108 width=734) (actual time=40.812..40.812 rows=16183 loops=1)
Buckets: 16384 (originally 1024) Batches: 2 (originally 1) Memory Usage: 921kB
-> Foreign Scan on rep_person p (cost=100.00..113.24 rows=108 width=734) (actual time=2.252..35.079 rows=16183 loops=1)
Planning time: 0.208 ms
Execution time: 47.176 ms
I tried adding an index on learning_end_date, but it had no effect.
What do I need to change to make the query execute faster with the IS NULL condition in the WHERE clause? Any ideas will be appreciated!
Your problem is that you do not have good table statistics on those foreign tables, so the row count estimates of the PostgreSQL optimizer are pretty arbitrary.
That causes the optimizer to choose a nested loop join in the case you report as slow, which is an inappropriate plan.
It is just by coincidence that this happens for a certain IS NULL condition.
Collect statistics on the foreign tables with
ANALYZE rep_student;
ANALYZE rep_person;
Then the performance will be much better.
Note that while autovacuum automatically gathers statistics for local tables, it does not do that for remote tables because it does not know how many rows have changed, so you should regularly ANALYZE foreign tables whose data change.
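A hedged complement (assuming postgres_fdw; remote_server is a placeholder name for your foreign server): instead of relying only on local statistics, you can let the planner ask the remote side for estimates at planning time, at the cost of extra round trips during planning.
-- Enable remote estimates for the whole server, or per foreign table:
ALTER SERVER remote_server OPTIONS (ADD use_remote_estimate 'true');
ALTER FOREIGN TABLE rep_student OPTIONS (ADD use_remote_estimate 'true');
ALTER FOREIGN TABLE rep_person OPTIONS (ADD use_remote_estimate 'true');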
I am observing that COUNT(*) is not well optimised when it comes to deeply nested SQL.
Here's the SQL I am working with:
SELECT COUNT(*) FROM "items"
INNER JOIN (
SELECT c.* FROM companies c LEFT OUTER JOIN company_groups ON c.id = company_groups.company_id
WHERE company_groups.has_restriction IS NULL OR company_groups.has_restriction = 'f' OR company_groups.company_id = 1999 OR company_groups.group_id IN ('3','2')
GROUP BY c.id
) AS companies ON companies.id = stock_items.vendor_id
LEFT OUTER JOIN favs ON items.id = favs.item_id AND favs.user_id = 999 AND favs.is_visible = TRUE
WHERE "items"."type" IN ('Fashion') AND "items"."visibility" = 't' AND "items"."is_hidden" = 'f' AND (items.depth IS NULL OR (items.depth >= '0' AND items.depth <= '100')) AND (items.table IS NULL OR (items.table >= '0' AND items.table <= '100')) AND (items.company_id NOT IN (199,200,201))
This query takes 4084.8 ms to count roughly 0.35 million records.
I am using Rails as the framework, so the SQL I compose fires a COUNT version of the original query whenever I call results.count.
Since I am using LIMIT and OFFSET, the basic results load in less than 32.0 ms (which is very fast).
Here's the output of EXPLAIN ANALYZE:
Merge Join (cost=70743.22..184962.02 rows=7540499 width=4) (actual time=4018.351..4296.963 rows=360323 loops=1)
Merge Cond: (c.id = items.company_id)
-> Group (cost=0.56..216.21 rows=4515 width=4) (actual time=0.357..5.165 rows=4501 loops=1)
Group Key: c.id
-> Merge Left Join (cost=0.56..204.92 rows=4515 width=4) (actual time=0.303..2.590 rows=4504 loops=1)
Merge Cond: (c.id = company_groups.company_id)
Filter: ((company_groups.has_restriction IS NULL) OR (NOT company_groups.has_restriction) OR (company_groups.company_id = 1999) OR (company_groups.group_id = ANY ('{3,2}'::integer[])))
Rows Removed by Filter: 10
-> Index Only Scan using companies_pkey on companies c (cost=0.28..128.10 rows=4521 width=4) (actual time=0.155..0.941 rows=4508 loops=1)
Heap Fetches: 3
-> Index Scan using index_company_groups_on_company_id on company_groups (cost=0.28..50.14 rows=879 width=9) (actual time=0.141..0.480 rows=878 loops=1)
-> Materialize (cost=70742.66..72421.11 rows=335690 width=8) (actual time=4017.964..4216.381 rows=362180 loops=1)
-> Sort (cost=70742.66..71581.89 rows=335690 width=8) (actual time=4017.955..4140.168 rows=362180 loops=1)
Sort Key: items.company_id
Sort Method: external merge Disk: 6352kB
-> Hash Left Join (cost=1.05..35339.74 rows=335690 width=8) (actual time=0.617..3588.634 rows=362180 loops=1)
Hash Cond: (items.id = favs.item_id)
-> Seq Scan on items (cost=0.00..34079.84 rows=335690 width=8) (actual time=0.504..3447.355 rows=362180 loops=1)
Filter: (visibility AND (NOT is_hidden) AND ((type)::text = 'Fashion'::text) AND (company_id <> ALL ('{199,200,201}'::integer[])) AND ((depth IS NULL) OR ((depth >= '0'::numeric) AND (depth <= '100'::nume (...)
Rows Removed by Filter: 5814
-> Hash (cost=1.04..1.04 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 8kB
-> Seq Scan on favs (cost=0.00..1.04 rows=1 width=4) (actual time=0.008..0.008 rows=0 loops=1)
Filter: (is_visible AND (user_id = 999))
Rows Removed by Filter: 3
Planning time: 3.526 ms
Execution time: 4397.849 ms
Please advise on how I can make it faster!
P.S.: All the relevant columns (type, visibility, is_hidden, table, depth, etc.) are indexed.
Thanks in advance!
Well, you have two parts of your query that select everything (SELECT *); maybe you could limit that and see if it helps. For example:
SELECT COUNT(OneSpecificColumn)
FROM "items"
INNER JOIN
( SELECT c.(AnotherSpecificColumn)
FROM companies c
LEFT OUTER JOIN company_groups ON c.id = company_groups.company_id
WHERE company_groups.has_restriction IS NULL
OR company_groups.has_restriction = 'f'
OR company_groups.company_id = 1999
OR company_groups.group_id IN ('3',
'2')
GROUP BY c.id) AS companies ON companies.id = stock_items.vendor_id
LEFT OUTER JOIN favs ON items.id = favs.item_id
AND favs.user_id = 999
AND favs.is_visible = TRUE
WHERE "items"."type" IN ('Fashion')
AND "items"."visibility" = 't'
AND "items"."is_hidden" = 'f'
AND (items.depth IS NULL
OR (items.depth >= '0'
AND items.depth <= '100'))
AND (items.table IS NULL
OR (items.table >= '0'
AND items.table <= '100'))
AND (items.company_id NOT IN (199,
200,
201))
You could also check if those left joins are all necessary, inner joins are less costly and may speed up your search.
The lion's share of the time is spent in the sequential scan of items, and that cannot be improved, because you need almost all of the rows in the table.
So the only ways to improve the query are
see that items is cached in memory (a sketch follows below)
get faster storage
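For the caching point, a hedged sketch using the pg_prewarm contrib extension (assuming it is available on the server):
-- Pulls the items heap into shared buffers so the unavoidable sequential scan
-- reads from memory instead of disk on subsequent runs.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('items');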