We have 3 tables: 10,000 rows in one, 80,000 in the second, and 400 in the third.
The code was working well, but recently we ran into performance issues.
EXPLAIN ANALYZE SELECT "users_users"."id", "users_users"."email"
FROM "users_users" WHERE (NOT ("users_users"."email" IN
(SELECT U0."email" FROM "users_blacklist" U0))
AND NOT ("users_users"."id" IN (SELECT U0."user_id"
FROM "games_user2game" U0))) ORDER BY "users_users"."id" DESC;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan Backward using users_user_pkey on users_users (cost=9.25..12534132.45 rows=2558 width=26) (actual time=46.101..77158.318 rows=2510 loops=1)
Filter: ((NOT (hashed SubPlan 1)) AND (NOT (SubPlan 2)))
Rows Removed by Filter: 7723
SubPlan 1
-> Seq Scan on users_blacklist u0 (cost=0.00..8.20 rows=420 width=22) (actual time=0.032..0.318 rows=420 loops=1)
SubPlan 2
-> Materialize (cost=0.00..2256.20 rows=77213 width=4) (actual time=0.003..4.042 rows=35774 loops=9946)
-> Seq Scan on games_user2game u0 (cost=0.00..1568.13 rows=77213 width=4) (actual time=0.011..17.159 rows=77213 loops=1)
Total runtime: 77159.689 ms
(9 rows)
Main question: is it normal to run into performance issues when joining two tables with fewer than 100,000 rows each?
Where should we dig? Should we change the query, or dig into the database settings?
UPD: As a temporary workaround we got rid of the subqueries by prefetching their results in application code.
I don't know the Postgres dialect of SQL, but it may be worth experimenting with outer joins. In many other DBMSs they can offer better performance than subselects.
Something along the lines of
SELECT "users_users"."id", "users_users"."email"
FROM "users_users" us left join "users_blacklist" uo on uo.email = us.email
left join "games_user2game" ug on us.id = ug.user_id
where uo.email is null
AND ug.id is null
I think that is doing the same thing as your original query, but you'd have to test to make sure.
I have run across similar problems on SQL Server and rewritten the query with an EXISTS, as @Scotch suggests, to good effect.
SELECT
"users_users"."id",
"users_users"."email"
FROM "users_users"
WHERE
NOT EXISTS
(
SELECT NULL FROM "users_blacklist" WHERE "users_blacklist"."email" = "users_users"."email"
)
AND NOT EXISTS
(
SELECT NULL FROM "games_user2game" WHERE "games_user2game"."user_id" = "users_users"."user_id"
)
ORDER BY "users_users"."id" DESC;
This query will give you all users who are not blacklisted, and who are not in a game. It may be faster than the outer join option depending on how postgres plans the query.
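If you want to see which rewrite Postgres actually prefers on your data, wrap each candidate in EXPLAIN ANALYZE and compare the plans; for the NOT EXISTS version (same tables and columns as in the question) that would be:
EXPLAIN ANALYZE
SELECT u.id, u.email
FROM   users_users u
WHERE  NOT EXISTS (SELECT 1 FROM users_blacklist b WHERE b.email   = u.email)
AND    NOT EXISTS (SELECT 1 FROM games_user2game g WHERE g.user_id = u.id)
ORDER  BY u.id DESC;
Ideally both anti-joins show up as hashed (e.g. "Hash Anti Join") instead of the re-executed SubPlan 2 from the original plan.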
I'm using this public Postgres DB of NEAR protocol: https://github.com/near/near-indexer-for-explorer#shared-public-access
postgres://public_readonly:nearprotocol@mainnet.db.explorer.indexer.near.dev/mainnet_explorer
SELECT "public"."receipts"."receipt_id",
"public"."receipts"."included_in_block_hash",
"public"."receipts"."included_in_chunk_hash",
"public"."receipts"."index_in_chunk",
"public"."receipts"."included_in_block_timestamp",
"public"."receipts"."predecessor_account_id",
"public"."receipts"."receiver_account_id",
"public"."receipts"."receipt_kind",
"public"."receipts"."originated_from_transaction_hash"
FROM "public"."receipts"
WHERE ("public"."receipts"."receipt_id") IN
(SELECT "t0"."receipt_id"
FROM "public"."receipts" AS "t0"
INNER JOIN "public"."action_receipts" AS "j0" ON ("j0"."receipt_id") = ("t0"."receipt_id")
WHERE ("j0"."signer_account_id" = 'ryancwalsh.near'
AND "t0"."receipt_id" IS NOT NULL))
ORDER BY "public"."receipts"."included_in_block_timestamp" DESC
LIMIT 1
OFFSET 0
always results in:
ERROR: canceling statement due to statement timeout
SQL state: 57014
But if I change it to LIMIT 2, the query runs in less than 1 second.
How would that ever be the case? Does that mean the database isn't set up well? Or am I doing something wrong?
P.S. The query here was generated via Prisma. findFirst always times out, so I think I might need to change it to findMany as a workaround.
Your query can be simplified / optimized:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM public.receipts r
WHERE EXISTS (
SELECT FROM public.action_receipts j
WHERE j.receipt_id = r.receipt_id
AND j.signer_account_id = 'ryancwalsh.near'
)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
However, that only scratches the surface of your underlying problem.
Like Kirk already commented, Postgres chooses a different query plan for LIMIT 1, obviously ignorant of the fact that there are only 90 rows in table action_receipts with signer_account_id = 'ryancwalsh.near', while both involved tables have more than 220 million rows, obviously growing steadily.
Changing to LIMIT 2 makes a different query plan seem more favorable, hence the observed difference in performance. (So the query planner has the general idea that the filter is very selective, just not close enough for the corner case of LIMIT 1.)
You should have mentioned cardinalities to set us on the right track.
Knowing that our filter is so selective, we can force a more favorable query plan with a different query:
WITH j AS (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
)
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
This uses the same query plan for LIMIT 1, and either variant finishes in under 2 ms in my test:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=134904.89..134904.89 rows=1 width=223) (actual time=1.750..1.754 rows=1 loops=1)
CTE j
-> Bitmap Heap Scan on action_receipts (cost=319.46..41564.59 rows=10696 width=44) (actual time=0.058..0.179 rows=90 loops=1)
Recheck Cond: (signer_account_id = 'ryancwalsh.near'::text)
Heap Blocks: exact=73
-> Bitmap Index Scan on action_receipt_signer_account_id_idx (cost=0.00..316.79 rows=10696 width=0) (actual time=0.043..0.043 rows=90 loops=1)
Index Cond: (signer_account_id = 'ryancwalsh.near'::text)
-> Sort (cost=93340.30..93367.04 rows=10696 width=223) (actual time=1.749..1.750 rows=1 loops=1)
Sort Key: r.included_in_block_timestamp DESC
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=0.70..93286.82 rows=10696 width=223) (actual time=0.089..1.705 rows=90 loops=1)
-> CTE Scan on j (cost=0.00..213.92 rows=10696 width=32) (actual time=0.060..0.221 rows=90 loops=1)
-> Index Scan using receipts_pkey on receipts r (cost=0.70..8.70 rows=1 width=223) (actual time=0.016..0.016 rows=1 loops=90)
Index Cond: (receipt_id = j.receipt_id)
Planning Time: 0.281 ms
Execution Time: 1.801 ms
The point is to execute the hugely selective query in the CTE first. Then Postgres does not attempt to walk the index on (included_in_block_timestamp) under the wrong assumption that it would find a matching row soon enough. (It does not.)
The DB at hand runs Postgres 11, where CTEs are always optimization barriers. In Postgres 12 or later add AS MATERIALIZED to the CTE to guarantee the same effect.
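For reference, a minimal sketch of the Postgres 12+ variant (same query as above, only the CTE is explicitly materialized):
WITH j AS MATERIALIZED (  -- Postgres 12+: keep the CTE as an optimization barrier
   SELECT receipt_id
   FROM   public.action_receipts
   WHERE  signer_account_id = 'ryancwalsh.near'
   )
SELECT r.*  -- same column list as above in practice
FROM   j
JOIN   public.receipts r USING (receipt_id)
ORDER  BY r.included_in_block_timestamp DESC
LIMIT  1;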
Or you could use the "OFFSET 0 hack" in any version like this:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
OFFSET 0 -- !
) j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
Prevents "inlining" of the subquery to the same effect. Finishes in < 2ms.
See:
How to prevent PostgreSQL from rewriting OUTER JOIN queries?
"Fix" the database?
The proper fix depends on the complete picture. The underlying problem is that Postgres overestimates the number of qualifying rows in table action_receipts. The MCV list (most common values) cannot keep up with 220 million rows (and growing). It's most probably not just ANALYZE lagging behind. (Though it could be: autovacuum not properly configured? Rookie mistake?)
Depending on the actual cardinalities (data distribution) in action_receipts.signer_account_id and on access patterns, you could do various things to "fix" it. Two options:
1. Increase n_distinct and STATISTICS
If most values in action_receipts.signer_account_id are equally rare (high cardinality), consider setting a very large n_distinct value for the column. And combine that with a moderately increased STATISTICS target for the same column to counter errors in the other direction (underestimating the number of qualifying rows for common values). A rough sketch of the commands follows below the links. Read both answers here:
Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N
And:
Very bad query plan in PostgreSQL 9.6
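Such manual tuning could look roughly like the following; the numbers are placeholders (assumptions, not recommendations) and would have to be derived from the actual data distribution:
-- Placeholder: tell the planner roughly how many distinct signers exist
ALTER TABLE public.action_receipts
  ALTER COLUMN signer_account_id SET (n_distinct = 2000000);
-- More MCV / histogram buckets for the same column (default target is 100)
ALTER TABLE public.action_receipts
  ALTER COLUMN signer_account_id SET STATISTICS 1000;
ANALYZE public.action_receipts;  -- make the new settings take effect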
2. Local fix with partial index
If action_receipts.signer_account_id = 'ryancwalsh.near' is special in that it's queried more regularly than others, consider a small partial index for it, to fix just that case. Like:
CREATE INDEX ON action_receipts (receipt_id)
WHERE signer_account_id = 'ryancwalsh.near';
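Whichever route you take, it helps to first check what the planner currently assumes about the column. A hedged diagnostic sketch using the standard catalog views:
-- What does the planner currently "know" about signer_account_id?
SELECT n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'action_receipts'
AND    attname    = 'signer_account_id';
-- When were the statistics last refreshed?
SELECT last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'action_receipts';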
I'm a little confused here.
Here is my (simplified) query:
SELECT *
from (SELECT documents.*,
(SELECT max(date)
FROM registrations
WHERE registrations.document_id = documents.id) AS register_date
FROM documents) AS dcmnts
ORDER BY register_date
LIMIT 20;
And here is my EXPLAIN ANALYSE results:
Limit (cost=46697025.51..46697025.56 rows=20 width=193) (actual time=80329.201..80329.206 rows=20 loops=1)
-> Sort (cost=46697025.51..46724804.61 rows=11111641 width=193) (actual time=80329.199..80329.202 rows=20 loops=1)
Sort Key: ((SubPlan 1))
Sort Method: top-N heapsort Memory: 29kB
-> Seq Scan on documents (cost=0.00..46401348.74 rows=11111641 width=193) (actual time=0.061..73275.304 rows=11114254 loops=1)
SubPlan 1
-> Aggregate (cost=3.95..4.05 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=11114254)
-> Index Scan using registrations_document_id_index on registrations (cost=0.43..3.95 rows=2 width=4) (actual time=0.004..0.004 rows=1 loops=11114254)
Index Cond: (document_id = documents.id)
Planning Time: 0.334 ms
Execution Time: 80329.287 ms
The query takes 1m 20s to execute; is there any way to optimize it? There are lots of rows in these tables (documents: 11,114,642; registrations: 13,176,070).
In the actual full query I also have some more filters, and it then takes up to 4 seconds to execute, which is still too slow. This subquery ORDER BY seems to be the bottleneck here, and I can't figure out a way to optimize it.
I have tried setting indexes on the date and document_id columns.
Don't use a scalar subquery:
SELECT documents.*,
reg.register_date
FROM documents
JOIN (
SELECT document_id, max(date) as register_date
FROM registrations
GROUP BY document_id
) reg on reg.document_id = documents.id
ORDER BY register_date
LIMIT 20;
Try to unnest the query
SELECT documents.id,
documents.other_attr,
max(registrations.date) register_date
FROM documents
JOIN registrations ON registrations.document_id = documents.id
GROUP BY documents.id, documents.other_attr
ORDER BY register_date
LIMIT 20;
The query should be supported at least by an index on registrations(document_id, date):
create index idx_registrations_did_date
on registrations(document_id, date)
In the actual full query I also have some more filters, and it then takes up to 4 seconds to execute, which is still too slow.
Then ask about that query. What can we say about a query we can't see? Clearly this other query isn't just like this query except for filtering stuff out after all the work is done, as then it couldn't be faster (other than due to cache hotness) than the one you showed us. It is doing something different, so it has to be optimized differently.
This subquery ORDER BY seems to be the bottleneck here, and I can't figure out a way to optimize it.
The timing for the sort node includes the time of all the work that preceded it, so the time spent in the actual sort is 80329.206 - 73275.304 ≈ 7 seconds: a long time, perhaps, but a minority of the total time. (This interpretation is not very obvious from the output itself; it comes from experience.)
For the query you did show us, you can get it to be quite fast, but only probabilistically correct, by using a rather convoluted construction.
with t as (select date, document_id from registrations
order by date desc, document_id desc limit 200),
t2 as (select distinct on (document_id) document_id, date from t
order by document_id, date desc),
t3 as ( select document_id, date from t2 order by date desc limit 20)
SELECT documents.*,
t3.date as register_date
FROM documents join t3 on t3.document_id = documents.id
order by register_date;
It will be efficiently supported by:
create index on registrations (date, document_id);
create index on documents(id);
The idea here is that the 200 most recent registrations will have at least 20 distinct document_id among them. Of course there is no way to know for sure that that will be true, so you might have to increase 200 to 20000 (which should still be pretty fast, compared to what you are currently doing) or even more to be sure you get the right answer. This also assumes that every distinct document_id matches exactly one document.id.
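If you go that route, a quick sanity check along these lines (a sketch; same table and column names as in the question) tells you whether 200 recent rows are enough before you rely on it:
-- How many distinct documents appear among the 200 newest registrations?
SELECT count(DISTINCT document_id) AS distinct_docs
FROM  (SELECT document_id
       FROM   registrations
       ORDER  BY date DESC, document_id DESC
       LIMIT  200) t;
If distinct_docs comes back well above 20, the construction above should return the correct top 20.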
I have a simple query (Postgres 9.4):
EXPLAIN ANALYZE
SELECT
COUNT(*)
FROM
bo_labels L
LEFT JOIN bo_party party ON (party.id = L.bo_party_fkey)
LEFT JOIN bo_document_base D ON (D.id = L.bo_doc_base_fkey)
LEFT JOIN bo_contract_hardwood_deal C ON (C.bo_document_fkey = D.id)
WHERE
party.inn = '?'
The EXPLAIN output looks like this:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2385.30..2385.30 rows=1 width=0) (actual time=31762.367..31762.367 rows=1 loops=1)
-> Nested Loop Left Join (cost=1.28..2385.30 rows=1 width=0) (actual time=7.621..31760.776 rows=1694 loops=1)
Join Filter: ((c.bo_document_fkey)::text = (d.id)::text)
Rows Removed by Join Filter: 101658634
-> Nested Loop Left Join (cost=1.28..106.33 rows=1 width=10) (actual time=0.110..54.635 rows=1694 loops=1)
-> Nested Loop (cost=0.85..105.69 rows=1 width=9) (actual time=0.081..4.404 rows=1694 loops=1)
-> Index Scan using bo_party_inn_idx on bo_party party (cost=0.43..12.43 rows=3 width=10) (actual time=0.031..0.037 rows=3 loops=1)
Index Cond: (inn = '2534005760'::text)
-> Index Only Scan using bo_labels__party_fkey__docbase_fkey__tnved_fkey__idx on bo_labels l (cost=0.42..29.80 rows=1289 width=17) (actual time=0.013..1.041 rows=565 loops=3)
Index Cond: (bo_party_fkey = (party.id)::text)
Heap Fetches: 0
-> Index Only Scan using bo_document_pkey on bo_document_base d (cost=0.43..0.64 rows=1 width=10) (actual time=0.022..0.025 rows=1 loops=1694)
Index Cond: (id = (l.bo_doc_base_fkey)::text)
Heap Fetches: 1134
-> Seq Scan on bo_contract_hardwood_deal c (cost=0.00..2069.77 rows=59770 width=9) (actual time=0.003..11.829 rows=60012 loops=1694)
Planning time: 13.484 ms
Execution time: 31762.885 ms
http://explain.depesz.com/s/V2wn
What is very annoying is the incorrect row estimate:
Nested Loop (cost=0.85..105.69 rows=1 width=9) (actual time=0.081..4.404 rows=1694 loops=1)
Because of that, Postgres chooses nested loops and the query runs for about 30 seconds.
With SET LOCAL enable_nestloop = OFF; it completes in just a second.
What is also interesting: I have default_statistics_target = 10000 (the maximum value) and ran VACUUM VERBOSE ANALYZE on all 4 tables just before.
Since Postgres does not gather statistics across tables, such a case is likely to happen for other joins too.
Without the external extension pg_hint_plan it is not possible to change enable_nestloop for just that one query.
Is there some other way to make Postgres use a faster plan for this query?
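For context, SET LOCAL only lasts until the end of the current transaction, so the workaround is scoped roughly like this (same query as above):
BEGIN;
SET LOCAL enable_nestloop = OFF;  -- only affects the current transaction
SELECT COUNT(*)
FROM   bo_labels L
LEFT   JOIN bo_party party ON party.id = L.bo_party_fkey
LEFT   JOIN bo_document_base D ON D.id = L.bo_doc_base_fkey
LEFT   JOIN bo_contract_hardwood_deal C ON C.bo_document_fkey = D.id
WHERE  party.inn = '?';
COMMIT;  -- enable_nestloop reverts to its previous value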
Update by comments
I can't eliminate the join in the general case. My main question: is there any possibility to change the statistics (for example) so that they include the specific values that break the normal statistical distribution? Or maybe some other way to make Postgres weight nested loops so that it does not choose them so frequently?
Could someone also explain, or point to documentation on, how the Postgres planner, for a nested loop over two inputs estimated at 3 rows (exactly correct) and 1289 rows (really 565, but that estimation error is a separate question), arrives at the assumption that the result will contain only 1 row? I'm talking about this part of the plan:
-> Nested Loop (cost=0.85..105.69 rows=1 width=9) (actual time=0.081..4.404 rows=1694 loops=1)
-> Index Scan using bo_party_inn_idx on bo_party party (cost=0.43..12.43 rows=3 width=10) (actual time=0.031..0.037 rows=3 loops=1)
Index Cond: (inn = '2534005760'::text)
-> Index Only Scan using bo_labels__party_fkey__docbase_fkey__tnved_fkey__idx on bo_labels l (cost=0.42..29.80 rows=1289 width=17) (actual time=0.013..1.041 rows=565 loops=3)
Index Cond: (bo_party_fkey = (party.id)::text)
At first glance it looks wrong. What statistics are used there, and how?
Does Postgres also maintain statistics for indexes?
Actually, I don't have good sample data to test my answer, but I think it might help.
Based on your join columns I'm assuming the following relationship cardinalities:
1) bo_party (id 1:N bo_party_fkey) bo_labels
2) bo_labels (bo_doc_base_fkey N:1 id) bo_document_base
3) bo_document_base (id 1:N bo_document_fkey) bo_contract_hardwood_deal
You want to count how many rows were selected. So, based on the cardinalities in 1) and 2), the table "bo_labels" sits in a many-to-many relationship. This means that joining it with "bo_party" and "bo_document_base" will produce no more rows than already exist in that table.
But after joining "bo_document_base", another join is made to "bo_contract_hardwood_deal", whose cardinality, described in 3), is one-to-many, possibly generating more rows in the final result.
This way, to get the right row count you can simplify the join structure to "bo_labels" and "bo_contract_hardwood_deal" through:
4) bo_labels (bo_doc_base_fkey 1:N bo_document_fkey) bo_contract_hardwood_deal
A sample query could be one of the following:
SELECT COUNT(*)
FROM bo_labels L
LEFT JOIN bo_contract_hardwood_deal C ON (C.bo_document_fkey = L.bo_doc_base_fkey)
WHERE 1=1
and exists
(
select 1
from bo_party party
where 1=1
and party.id = L.bo_party_fkey
and party.inn = '?'
)
;
or
SELECT sum((select COUNT(*) from bo_contract_hardwood_deal C where C.bo_document_fkey = L.bo_doc_base_fkey))
FROM bo_labels L
WHERE 1=1
and exists
(
select 1
from bo_party party
where 1=1
and party.id = L.bo_party_fkey
and party.inn = '?'
)
;
I could not test with large tables, so I don't know exactly if it will improve performance against your original query, but I think it might help.
I have a database with a few tables, each with a few million rows (the tables do have indexes). I need to count rows in one table, but only those whose foreign key field points to a subset of rows in another table.
Here is the query:
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE ( top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )) ))
SELECT Count(*)
FROM cast_info
WHERE EXISTS(SELECT 1
FROM filtered_title
WHERE cast_info.movie_id = filtered_title.id)
AND cast_info.role_id IN( 3, 8 )
I use a CTE because there are more COUNT queries further down for other tables, which use the same subqueries. But I have tried getting rid of the CTE and the results were the same: the first time I execute the query it runs... and runs... and runs for more than ten minutes. The second time I execute it, it's down to 4 seconds, which is acceptable for me.
The result of EXPLAIN ANALYZE:
Aggregate (cost=46194894.49..46194894.50 rows=1 width=0) (actual time=127728.452..127728.452 rows=1 loops=1)
CTE filtered_title
-> Seq Scan on title top (cost=0.00..46123542.41 rows=1430406 width=4) (actual time=732.509..1596.345 rows=16250 loops=1)
Filter: (((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[]))) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
Rows Removed by Filter: 2832906
SubPlan 1
-> Index Scan using title_idx_epof on title sub (cost=0.43..16.16 rows=1 width=0) (never executed)
Index Cond: (episode_of_id = top.id)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
SubPlan 2
-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
-> Nested Loop (cost=32184.70..63158.16 rows=3277568 width=0) (actual time=1620.382..127719.030 rows=29679 loops=1)
-> HashAggregate (cost=32184.13..32186.13 rows=200 width=4) (actual time=1620.058..1631.697 rows=16250 loops=1)
-> CTE Scan on filtered_title (cost=0.00..28608.12 rows=1430406 width=4) (actual time=732.513..1607.093 rows=16250 loops=1)
-> Index Scan using cast_info_idx_mid on cast_info (cost=0.56..154.80 rows=6 width=4) (actual time=5.977..7.758 rows=2 loops=16250)
Index Cond: (movie_id = filtered_title.id)
Filter: (role_id = ANY ('{3,8}'::integer[]))
Rows Removed by Filter: 15
Total runtime: 127729.100 ms
Now to my question. What am I doing wrong and how can I fix it?
I tried a few variants of the same query: exclusive joins, joins/EXISTS. On one hand, the variant below seems to require the least time to do the job (about 10x faster), but it's still 60 seconds on average. On the other hand, unlike my first query, which needs 4-6 seconds on the second run, it always takes 60 seconds.
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )))
SELECT Count(*)
FROM cast_info
join filtered_title
ON cast_info.movie_id = filtered_title.id
WHERE cast_info.role_id IN( 3, 8 )
Disclaimer: There are way too many factors in play to give a conclusive answer. The information "a few tables, each has a few millions rows (tables do have indexes)" just doesn't cut it. It depends on cardinalities, table definitions, data types, usage patterns and (probably most importantly) indexes. And on a proper basic configuration of your db server, of course. All of this goes beyond the scope of a single question on SO. Start with the links in the postgresql-performance tag. Or hire a professional.
I am going to address the most prominent detail (for me) in your query plan:
Sequential scan on title?
-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
Sequentially scanning almost 3 million rows to retrieve just 16,250 is not very efficient. The sequential scan is also the likely reason why the first run takes so much longer: subsequent calls can read data from the cache. Since the table is big, the data probably won't stay in the cache for long unless you have huge amounts of cache.
An index scan is typically substantially faster to gather 0.5 % of the rows from a big table. Possible causes:
Statistics are off.
Cost settings are off.
No matching index.
My money is on the index. You didn't supply your version of Postgres, so assuming current 9.3. The perfect index for this query would be:
CREATE INDEX title_foo_idx ON title (kind_id, production_year, id, episode_of_id)
Data types matter. Sequence of columns in the index matters.
kind_id first, because the rule of thumb is: index for equality first — then for ranges.
The last two columns (id, episode_of_id) are only useful for a potential index-only scan. If not applicable, drop those. More details here:
PostgreSQL composite primary key
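If the index-only scan does not materialize on your setup, the trimmed-down version described above would simply be (the index name is illustrative):
CREATE INDEX title_kind_year_idx ON title (kind_id, production_year);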
The way you built your query, you end up with two sequential scans on the big table. So here's an educated guess for a ...
Better query
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
, t_all AS (
SELECT id FROM t_base
UNION -- not UNION ALL (!)
SELECT id
FROM (SELECT DISTINCT episode_of_id AS id FROM t_base) x
JOIN title t USING (id)
)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_all t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
This should give you one index scan on the new title_foo_idx, plus another index scan on the pk index of title. The rest should be relatively cheap. With any luck, much faster than before.
kind_id BETWEEN 1 AND 2 .. as long as you have a continuous range of values, that is faster than listing individual values because this way Postgres can fetch a continuous range from the index. Not very important for just two values.
Test this alternative for the second leg of t_all. Not sure which is faster:
SELECT id
FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)
Temporary table instead of CTE
You write:
I use CTE because there are more COUNT queries down there for other
tables, which use the same subqueries.
A CTE poses as an optimization barrier, and the resulting internal work table is not indexed. When reusing a result (with more than a trivial number of rows) multiple times, it pays to use an indexed temp table instead. Index creation for a simple int column is fast.
CREATE TEMP TABLE t_tmp AS
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
SELECT id FROM t_base
UNION
SELECT id FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id);
ANALYZE t_tmp; -- !
CREATE UNIQUE INDEX ON t_tmp (id); -- ! (unique is optional)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_tmp t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
-- More queries using t_tmp
About temp tables:
How to tell if record has changed in Postgres
I have a star schema here and I am querying the fact table and would like to join one very small dimension table. I can't really explain the following:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id
FROM bi.impressions imp
GROUP BY imp.os_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=868719.08..868719.24 rows=16 width=10) (actual time=12559.462..12559.466 rows=26 loops=1)
-> Seq Scan on impressions imp (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.009..3030.093 rows=35682474 loops=1)
Total runtime: 12559.523 ms
(3 rows)
This takes ~12600ms, but of course there is no joined data, so I can't "resolve" the imp.os_id to something meaningful, so I add a join:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp, bi.os_desc os
WHERE imp.os_id=os.os_id
GROUP BY imp.os_id, os.os_desc;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1448560.83..1448564.99 rows=416 width=22) (actual time=25565.124..25565.127 rows=26 loops=1)
-> Hash Join (cost=1.58..1180942.29 rows=35682472 width=22) (actual time=0.046..15157.684 rows=35682474 loops=1)
Hash Cond: (imp.os_id = os.os_id)
-> Seq Scan on impressions imp (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.007..3705.647 rows=35682474 loops=1)
-> Hash (cost=1.26..1.26 rows=26 width=14) (actual time=0.028..0.028 rows=26 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 2kB
-> Seq Scan on os_desc os (cost=0.00..1.26 rows=26 width=14) (actual time=0.003..0.010 rows=26 loops=1)
Total runtime: 25565.199 ms
(8 rows)
This effectively doubles the execution time of my query. My question is: what am I leaving out of the picture? I would not expect such a small lookup table to cause a huge difference in query execution time.
Rewritten with (recommended) explicit ANSI JOIN syntax:
SELECT COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os ON os.os_id = imp.os_id
GROUP BY imp.os_id, os.os_desc;
First of all, your second query might be wrong if more or fewer than exactly one match is found in os_desc for some row in impressions.
This can be ruled out if you have a foreign key constraint on os_id in place that guarantees referential integrity, plus a NOT NULL constraint on bi.impressions.os_id. If so, as a first step, simplify to:
SELECT COUNT(*) AS ct, imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os USING (os_id)
GROUP BY imp.os_id, os.os_desc;
count(*) is faster than count(column) and equivalent here if the column is NOT NULL. And add a column alias for the count.
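For reference, the guarantees mentioned above would look roughly like this in DDL, assuming os_id is the primary key of bi.os_desc (a sketch; the constraint name is made up):
ALTER TABLE bi.impressions
  ALTER COLUMN os_id SET NOT NULL,               -- every impression has an OS
  ADD CONSTRAINT impressions_os_id_fkey          -- hypothetical name
      FOREIGN KEY (os_id) REFERENCES bi.os_desc (os_id);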
Faster, yet:
SELECT os_id, os.os_desc, sub.ct
FROM (
SELECT os_id, COUNT(*) AS ct
FROM bi.impressions
GROUP BY 1
) sub
JOIN bi.os_desc os USING (os_id)
Aggregate first, join later. More here:
Aggregate a single column in query with many columns
PostgreSQL - order by an array
HashAggregate (cost=868719.08..868719.24 rows=16 width=10)
HashAggregate (cost=1448560.83..1448564.99 rows=416 width=22)
Hmm, the width goes from 10 to 22, which is more than double. Perhaps you should join after grouping instead of before?
The following query solves the problem without increasing the query execution time. The question still stands as to why the execution time increases significantly when adding a very simple join, but that might be a Postgres-specific question that somebody with extensive experience in the area can answer eventually.
WITH OSES AS (SELECT os_id, os_desc FROM bi.os_desc)
SELECT COUNT(impression_id) AS imp_count,
       os_desc
FROM   bi.impressions imp,
       OSES os
WHERE  os.os_id = imp.os_id
GROUP  BY os_desc
ORDER  BY imp_count;