I have a table, let's call it "foos", with almost 6 million records in it. I am running the following query:
SELECT "foos".*
FROM "foos"
INNER JOIN "bars" ON "foos".bar_id = "bars".id
WHERE (("bars".baz_id = 13266))
ORDER BY "foos"."id" DESC
LIMIT 5 OFFSET 0;
This query takes a very long time to run (Rails times out while running it). There are indexes on all of the IDs in question. The curious part is that if I remove either the ORDER BY clause or the LIMIT clause, it runs almost instantaneously.
I'm assuming that the presence of both ORDER BY and LIMIT is making PostgreSQL make some bad choices in query planning. Does anyone have any ideas on how to fix this?
In case it helps, here is the EXPLAIN for all 3 cases:
//////// Both ORDER and LIMIT
SELECT "foos".*
FROM "foos"
INNER JOIN "bars" ON "foos".bar_id = "bars".id
WHERE (("bars".baz_id = 13266))
ORDER BY "foos"."id" DESC
LIMIT 5 OFFSET 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..16663.44 rows=5 width=663)
-> Nested Loop (cost=0.00..25355084.05 rows=7608 width=663)
Join Filter: (foos.bar_id = bars.id)
-> Index Scan Backward using foos_pkey on foos (cost=0.00..11804133.33 rows=4963477 width=663)
Filter: (((NOT privacy_protected) OR (user_id = 67962)) AND ((status)::text = 'DONE'::text))
-> Materialize (cost=0.00..658.96 rows=182 width=4)
-> Index Scan using index_bars_on_baz_id on bars (cost=0.00..658.05 rows=182 width=4)
Index Cond: (baz_id = 13266)
(8 rows)
//////// Just LIMIT
SELECT "foos".*
FROM "foos"
INNER JOIN "bars" ON "foos".bar_id = "bars".id
WHERE (("bars".baz_id = 13266))
LIMIT 5 OFFSET 0;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..22.21 rows=5 width=663)
-> Nested Loop (cost=0.00..33788.21 rows=7608 width=663)
-> Index Scan using index_bars_on_baz_id on bars (cost=0.00..658.05 rows=182 width=4)
Index Cond: (baz_id = 13266)
-> Index Scan using index_foos_on_bar_id on foos (cost=0.00..181.51 rows=42 width=663)
Index Cond: (foos.bar_id = bars.id)
Filter: (((NOT foos.privacy_protected) OR (foos.user_id = 67962)) AND ((foos.status)::text = 'DONE'::text))
(7 rows)
//////// Just ORDER
SELECT "foos".*
FROM "foos"
INNER JOIN "bars" ON "foos".bar_id = "bars".id
WHERE (("bars".baz_id = 13266))
ORDER BY "foos"."id" DESC;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=36515.17..36534.19 rows=7608 width=663)
Sort Key: foos.id
-> Nested Loop (cost=0.00..33788.21 rows=7608 width=663)
-> Index Scan using index_bars_on_baz_id on bars (cost=0.00..658.05 rows=182 width=4)
Index Cond: (baz_id = 13266)
-> Index Scan using index_foos_on_bar_id on foos (cost=0.00..181.51 rows=42 width=663)
Index Cond: (foos.bar_id = bars.id)
Filter: (((NOT foos.privacy_protected) OR (foos.user_id = 67962)) AND ((foos.status)::text = 'DONE'::text))
(8 rows)
When you have both the LIMIT and the ORDER BY, the optimizer has decided it is faster to limp through the unfiltered records in foos by key descending until it gets five matches for the rest of the criteria. In the other cases, it simply runs the query as a nested loop and returns all the records.
Offhand, I'd say the problem is that PG doesn't grok the joint distribution of the various ids and that's why the plan is so sub-optimal.
For possible solutions: I'll assume that you have run ANALYZE recently. If not, do so. That may explain why your estimated times are high even on the version that returns fast. If the problem persists, perhaps run the ORDER BY as a subselect and slap the LIMIT on in an outer query.
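That rewrite could look something like this (just a sketch; depending on your Postgres version the planner may still flatten the subquery and choose the same plan):
SELECT *
FROM (
    SELECT "foos".*
    FROM "foos"
    INNER JOIN "bars" ON "foos".bar_id = "bars".id
    WHERE "bars".baz_id = 13266
    ORDER BY "foos".id DESC
) sub
LIMIT 5;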
It probably happens because Postgres orders the whole result first and only then applies the limit. Why not sort the result in an outer select instead? Something like:
SELECT * FROM (SELECT ... INNER JOIN ETC ...) AS sub ORDER BY ... DESC
Your query plan indicates a filter on
(((NOT privacy_protected) OR (user_id = 67962)) AND ((status)::text = 'DONE'::text))
which doesn't appear in the SELECT - where is it coming from?
Also, note that the expression is listed as a "Filter" and not an "Index Cond", which would seem to indicate that no index is applied to it.
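If that condition is always attached to this query (a Rails default scope, perhaps), a partial index might let the planner use an index condition for at least the status part. Something like this, assuming status = 'DONE' is the common case (the index name is just an example):
CREATE INDEX index_foos_on_bar_id_done ON foos (bar_id)
WHERE status = 'DONE';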
It may be running a full-table scan on "foos". Did you try changing the order of the tables and using a LEFT JOIN instead of an INNER JOIN to see if it returns results faster?
say...
SELECT "bars"."id", "foos".*
FROM "bars"
LEFT JOIN "foos" ON "bars"."id" = "foos"."bar_id"
WHERE "bars"."baz_id" = 13266
ORDER BY "foos"."id" DESC
LIMIT 5 OFFSET 0;
Related
I'm using this public Postgres DB of NEAR protocol: https://github.com/near/near-indexer-for-explorer#shared-public-access
postgres://public_readonly:nearprotocol@mainnet.db.explorer.indexer.near.dev/mainnet_explorer
SELECT "public"."receipts"."receipt_id",
"public"."receipts"."included_in_block_hash",
"public"."receipts"."included_in_chunk_hash",
"public"."receipts"."index_in_chunk",
"public"."receipts"."included_in_block_timestamp",
"public"."receipts"."predecessor_account_id",
"public"."receipts"."receiver_account_id",
"public"."receipts"."receipt_kind",
"public"."receipts"."originated_from_transaction_hash"
FROM "public"."receipts"
WHERE ("public"."receipts"."receipt_id") IN
(SELECT "t0"."receipt_id"
FROM "public"."receipts" AS "t0"
INNER JOIN "public"."action_receipts" AS "j0" ON ("j0"."receipt_id") = ("t0"."receipt_id")
WHERE ("j0"."signer_account_id" = 'ryancwalsh.near'
AND "t0"."receipt_id" IS NOT NULL))
ORDER BY "public"."receipts"."included_in_block_timestamp" DESC
LIMIT 1
OFFSET 0
always results in:
ERROR: canceling statement due to statement timeout
SQL state: 57014
But if I change it to LIMIT 2, the query runs in less than 1 second.
How would that ever be the case? Does that mean the database isn't set up well? Or am I doing something wrong?
P.S. The query here was generated via Prisma. findFirst always times out, so I think I might need to change it to findMany as a workaround.
Your query can be simplified / optimized:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM public.receipts r
WHERE EXISTS (
SELECT FROM public.action_receipts j
WHERE j.receipt_id = r.receipt_id
AND j.signer_account_id = 'ryancwalsh.near'
)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
However, that only scratches the surface of your underlying problem.
Like Kirk already commented, Postgres chooses a different query plan for LIMIT 1, obviously ignorant of the fact that there are only 90 rows in table action_receipts with signer_account_id = 'ryancwalsh.near', while both involved tables have more than 220 million rows, obviously growing steadily.
Changing to LIMIT 2 makes a different query plan seem more favorable, hence the observed difference in performance. (So the query planner has the general idea that the filter is very selective, just not close enough for the corner case of LIMIT 1.)
You should have mentioned cardinalities to set us on the right track.
Knowing that our filter is so selective, we can force a more favorable query plan with a different query:
WITH j AS (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
)
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
This uses the same query plan for LIMIT 1, and finishes in under 2 ms in my test:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=134904.89..134904.89 rows=1 width=223) (actual time=1.750..1.754 rows=1 loops=1)
CTE j
-> Bitmap Heap Scan on action_receipts (cost=319.46..41564.59 rows=10696 width=44) (actual time=0.058..0.179 rows=90 loops=1)
Recheck Cond: (signer_account_id = 'ryancwalsh.near'::text)
Heap Blocks: exact=73
-> Bitmap Index Scan on action_receipt_signer_account_id_idx (cost=0.00..316.79 rows=10696 width=0) (actual time=0.043..0.043 rows=90 loops=1)
Index Cond: (signer_account_id = 'ryancwalsh.near'::text)
-> Sort (cost=93340.30..93367.04 rows=10696 width=223) (actual time=1.749..1.750 rows=1 loops=1)
Sort Key: r.included_in_block_timestamp DESC
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=0.70..93286.82 rows=10696 width=223) (actual time=0.089..1.705 rows=90 loops=1)
-> CTE Scan on j (cost=0.00..213.92 rows=10696 width=32) (actual time=0.060..0.221 rows=90 loops=1)
-> Index Scan using receipts_pkey on receipts r (cost=0.70..8.70 rows=1 width=223) (actual time=0.016..0.016 rows=1 loops=90)
Index Cond: (receipt_id = j.receipt_id)
Planning Time: 0.281 ms
Execution Time: 1.801 ms
The point is to execute the hugely selective query in the CTE first. Then Postgres does not attempt to walk the index on (included_in_block_timestamp) under the wrong assumption that it would find a matching row soon enough. (It does not.)
The DB at hand runs Postgres 11, where CTEs are always optimization barriers. In Postgres 12 or later add AS MATERIALIZED to the CTE to guarantee the same effect.
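On Postgres 12+ that would look like this; only the CTE head changes, the rest of the query stays the same:
WITH j AS MATERIALIZED (   -- MATERIALIZED preserves the optimization barrier
   SELECT receipt_id
   FROM   public.action_receipts
   WHERE  signer_account_id = 'ryancwalsh.near'
   )
-- ... rest of the query unchanged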
Or you could use the "OFFSET 0 hack" in any version like this:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
OFFSET 0 -- !
) j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
Prevents "inlining" of the subquery to the same effect. Finishes in < 2ms.
See:
How to prevent PostgreSQL from rewriting OUTER JOIN queries?
"Fix" the database?
The proper fix depends on the complete picture. The underlying problem is that Postgres overestimates the number of qualifying rows in table action_receipts. The MCV list (most common values) cannot keep up with 220 million rows (and growing). It's most probably not just ANALYZE lagging behind. (Though it could be: autovacuum not properly configured? Rookie mistake?) Depending on the actual cardinalities (data distribution) in action_receipts.signer_account_id and access patterns you could do various things to "fix" it. Two options:
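A quick sanity check for the ANALYZE / autovacuum angle (assuming you can read the statistics views of the database at hand):
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_mod_since_analyze
FROM   pg_stat_user_tables
WHERE  relname = 'action_receipts';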
1. Increase n_distinct and STATISTICS
If most values in action_receipts.signer_account_id are equally rare (high cardinality), consider setting a very large n_distinct value for the column. And combine that with a moderately increased STATISTICS target for the same column to counter errors in the other direction (underestimating the number of qualifying rows for common values). Read both answers here:
Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N
And:
Very bad query plan in PostgreSQL 9.6
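For option 1, the statements could look something like this; the numbers are placeholders to be derived from your actual data, not recommendations:
ALTER TABLE action_receipts ALTER COLUMN signer_account_id SET (n_distinct = 50000000);  -- or a negative ratio of the row count, e.g. -0.05
ALTER TABLE action_receipts ALTER COLUMN signer_account_id SET STATISTICS 1000;          -- up from the default of 100
ANALYZE action_receipts;                                                                 -- overrides take effect at the next ANALYZE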
2. Local fix with partial index
If action_receipts.signer_account_id = 'ryancwalsh.near' is special in that it's queried more regularly than others, consider a small partial index for it, to fix just that case. Like:
CREATE INDEX ON action_receipts (receipt_id)
WHERE signer_account_id = 'ryancwalsh.near';
I have a transactions table in PostgreSQL with block_height and index as BIGINT values. Those two values are used for determining the order of the transactions in this table.
So if I want to query transactions from this table that come after a given block_height and index, I'd have to put this in the condition:
If two transactions are in the same block_height, then check the ordering of their index
Otherwise compare their block_height
For example, if I want to get 10 transactions that came after block_height 10000 and index 5:
SELECT * FROM transactions
WHERE (
(block_height = 10000 AND index > 5)
OR (block_height > 10000)
)
ORDER BY block_height, index ASC
LIMIT 10
However, I find this query to be extremely slow; it took up to 60 seconds for a table with 50 million rows.
However, if I split up the condition and run the two parts individually, like so:
SELECT * FROM transactions
WHERE block_height = 10000 AND index > 5
ORDER BY block_height, index ASC
LIMIT 10
and
SELECT * FROM transactions
WHERE block_height > 10000
ORDER BY block_height, index ASC
LIMIT 10
Both queries took at most 200ms on the same table! It is much faster to do both queries and then UNION the final result instead of putting an OR in the condition.
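The workaround I mean looks roughly like this (UNION ALL, since the two branches cannot overlap):
(SELECT * FROM transactions
 WHERE block_height = 10000 AND index > 5
 ORDER BY block_height, index
 LIMIT 10)
UNION ALL
(SELECT * FROM transactions
 WHERE block_height > 10000
 ORDER BY block_height, index
 LIMIT 10)
ORDER BY block_height, index
LIMIT 10;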
This is the part of the query plan for the slow query (OR-ed condition):
-> Nested Loop (cost=0.98..11689726.68 rows=68631 width=73) (actual time=10230.480..10234.289 rows=10 loops=1)
-> Index Scan using src_transactions_block_height_index on src_transactions (cost=0.56..3592792.96 rows=16855334 width=73) (actual time=10215.698..10219.004 rows=1364 loops=1)
Filter: (((block_height = $1) AND (index > $2)) OR (block_height > $3))
Rows Removed by Filter: 2728151
And this is the query plan for the fast query:
-> Nested Loop (cost=0.85..52.62 rows=1 width=73) (actual time=0.014..0.014 rows=0 loops=1)
-> Index Scan using src_transactions_block_height_index on src_transactions (cost=0.43..22.22 rows=5 width=73) (actual time=0.014..0.014 rows=0 loops=1)
Index Cond: ((block_height = $1) AND (index > $2))
The main difference I see between the query plans is the use of Filter instead of Index Cond.
Is there any way to do this query in a performant way without resorting to the UNION workaround?
The fact that block_height is compared to two different parameters which you know just happen to be equal might be a problem. What if you use $1 twice, rather than $1 and $3?
But better yet, try a tuple comparison
WHERE (block_height, index) > (10000, 5)
This can become fast with a two-column index on (block_height, index).
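A sketch of the rewritten query together with such an index (the index name is just an example):
CREATE INDEX transactions_block_height_index_idx ON transactions (block_height, index);

SELECT * FROM transactions
WHERE (block_height, index) > (10000, 5)
ORDER BY block_height, index
LIMIT 10;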
I'm a little confused here.
Here is my (simplified) query:
SELECT *
from (SELECT documents.*,
(SELECT max(date)
FROM registrations
WHERE registrations.document_id = documents.id) AS register_date
FROM documents) AS dcmnts
ORDER BY register_date
LIMIT 20;
And here are my EXPLAIN ANALYZE results:
Limit (cost=46697025.51..46697025.56 rows=20 width=193) (actual time=80329.201..80329.206 rows=20 loops=1)
-> Sort (cost=46697025.51..46724804.61 rows=11111641 width=193) (actual time=80329.199..80329.202 rows=20 loops=1)
Sort Key: ((SubPlan 1))
Sort Method: top-N heapsort Memory: 29kB
-> Seq Scan on documents (cost=0.00..46401348.74 rows=11111641 width=193) (actual time=0.061..73275.304 rows=11114254 loops=1)
SubPlan 1
-> Aggregate (cost=3.95..4.05 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=11114254)
-> Index Scan using registrations_document_id_index on registrations (cost=0.43..3.95 rows=2 width=4) (actual time=0.004..0.004 rows=1 loops=11114254)
Index Cond: (document_id = documents.id)
Planning Time: 0.334 ms
Execution Time: 80329.287 ms
The query takes 1m 20s to execute; is there any way to optimize it? There are lots of rows in these tables (documents: 11114642; registrations: 13176070).
In the actual full query I also have some more filters and it takes up to 4 seconds to execute, and it's still too slow. This subquery ORDER BY seems to be the bottleneck here and I can't figure out a way to optimize it.
I tried setting indexes on the date/document_id columns.
Don't use a scalar subquery:
SELECT documents.*,
reg.register_date
FROM documents
JOIN (
SELECT document_id, max(date) as register_date
FROM registrations
GROUP BY document_id
) reg on reg.document_id = documents.id
ORDER BY register_date
LIMIT 20;
Try to unnest the query
SELECT documents.id,
documents.other_attr,
max(registrations.date) register_date
FROM documents
JOIN registrations ON registrations.document_id = documents.id
GROUP BY documents.id, documents.other_attr
ORDER BY register_date
LIMIT 20
The query should be supported at least by an index on registrations(document_id, date):
create index idx_registrations_did_date
on registrations(document_id, date)
In the actual full query I also have some more filters and it takes up to 4 seconds to execute, and it's still too slow.
Then ask about that query. What can we say about a query we can't see? Clearly this other query isn't just like this query, except for filtering stuff out after all the work is done, as then it couldn't be faster (other than due to cache hotness) than the one you showed us. It is doing something different, so it has to be optimized differently.
This subquery ORDER BY seems to be the bottleneck here and I can't figure out a way to optimize it.
The timing for the sort node includes the time of all the work that preceded it, so the time of the actual sort is 80329.206 - 73275.304 ≈ 7 seconds, a long time perhaps, but a minority of the total time. (This interpretation is not very obvious from the output itself; it comes from experience.)
For the query you did show us, you can get it to be quite fast, but only probabilistically correct, by using a rather convoluted construction.
with t as (select date, document_id from registrations
order by date desc, document_id desc limit 200),
t2 as (select distinct on (document_id) document_id, date from t
order by document_id, date desc),
t3 as ( select document_id, date from t2 order by date desc limit 20)
SELECT documents.*,
t3.date as register_date
FROM documents join t3 on t3.document_id = documents.id
order by register_date;
It will be efficiently supported by:
create index on registrations (date, document_id);
create index on documents(id);
The idea here is that the 200 most recent registrations will have at least 20 distinct document_id among them. Of course there is no way to know for sure that that will be true, so you might have to increase 200 to 20000 (which should still be pretty fast, compared to what you are currently doing) or even more to be sure you get the right answer. This also assumes that every distinct document_id matches exactly one document.id.
I have a database with a few tables, each with a few million rows (tables do have indexes). I need to count rows in a table, but only those whose foreign key field points to a subset of rows from another table.
Here is the query:
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE ( top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )) ))
SELECT Count(*)
FROM cast_info
WHERE EXISTS(SELECT 1
FROM filtered_title
WHERE cast_info.movie_id = filtered_title.id)
AND cast_info.role_id IN( 3, 8 )
I use CTE because there are more COUNT queries down there for other tables, which use the same subqueries. But I have tried to get rid of the CTE and the results were the same: the first time I execute the query it runs... runs... and runs for more than ten minutes. The second time I execute the query, it's down to 4 seconds, which is acceptable for me.
The result of EXPLAIN ANALYZE:
Aggregate (cost=46194894.49..46194894.50 rows=1 width=0) (actual time=127728.452..127728.452 rows=1 loops=1)
CTE filtered_title
-> Seq Scan on title top (cost=0.00..46123542.41 rows=1430406 width=4) (actual time=732.509..1596.345 rows=16250 loops=1)
Filter: (((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[]))) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
Rows Removed by Filter: 2832906
SubPlan 1
-> Index Scan using title_idx_epof on title sub (cost=0.43..16.16 rows=1 width=0) (never executed)
Index Cond: (episode_of_id = top.id)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
SubPlan 2
-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
-> Nested Loop (cost=32184.70..63158.16 rows=3277568 width=0) (actual time=1620.382..127719.030 rows=29679 loops=1)
-> HashAggregate (cost=32184.13..32186.13 rows=200 width=4) (actual time=1620.058..1631.697 rows=16250 loops=1)
-> CTE Scan on filtered_title (cost=0.00..28608.12 rows=1430406 width=4) (actual time=732.513..1607.093 rows=16250 loops=1)
-> Index Scan using cast_info_idx_mid on cast_info (cost=0.56..154.80 rows=6 width=4) (actual time=5.977..7.758 rows=2 loops=16250)
Index Cond: (movie_id = filtered_title.id)
Filter: (role_id = ANY ('{3,8}'::integer[]))
Rows Removed by Filter: 15
Total runtime: 127729.100 ms
Now to my question. What am I doing wrong and how can I fix it?
I tried a few variants of the same query: exclusive joins, joins/exists. On one hand this one seems to require the least time to do the job (10x faster), but it's still 60 seconds on average. On the other hand, unlike my first query which needs 4-6 seconds on the second run, it always requires 60 seconds.
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )))
SELECT Count(*)
FROM cast_info
join filtered_title
ON cast_info.movie_id = filtered_title.id
WHERE cast_info.role_id IN( 3, 8 )
Disclaimer: There are way too many factors in play to give a conclusive answer. The information "a few tables, each with a few million rows (tables do have indexes)" just doesn't cut it. It depends on cardinalities, table definitions, data types, usage patterns and (probably most important) indexes. And a proper basic configuration of your db server, of course. All of this goes beyond the scope of a single question on SO. Start with the links in the postgresql-performance tag. Or hire a professional.
I am going to address the most prominent detail (for me) in your query plan:
Sequential scan on title?
-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
Sequentially scanning almost 3 million rows to retrieve just 16250 is not very efficient. The sequential scan is also the likely reason why the first run takes so much longer; subsequent calls can read data from the cache. Since the table is big, the data probably won't stay in the cache for long unless you have huge amounts of cache.
An index scan is typically substantially faster to gather 0.5 % of the rows from a big table. Possible causes:
Statistics are off.
Cost settings are off.
No matching index.
My money is on the index. You didn't supply your version of Postgres, so assuming current 9.3. The perfect index for this query would be:
CREATE INDEX title_foo_idx ON title (kind_id, production_year, id, episode_of_id)
Data types matter. Sequence of columns in the index matters.
kind_id first, because the rule of thumb is: index for equality first — then for ranges.
The last two columns (id, episode_of_id) are only useful for a potential index-only scan. If not applicable, drop those. More details here:
PostgreSQL composite primary key
The way you built your query, you end up with two sequential scans on the big table. So here's an educated guess for a ...
Better query
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
, t_all AS (
SELECT id FROM t_base
UNION -- not UNION ALL (!)
SELECT id
FROM (SELECT DISTINCT episode_of_id AS id FROM t_base) x
JOIN title t USING (id)
)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_all t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
This should give you one index scan on the new title_foo_idx, plus another index scan on the pk index of title. The rest should be relatively cheap. With any luck, much faster than before.
kind_id BETWEEN 1 AND 2 .. as long as you have a continuous range of values, that is faster than listing individual values because this way Postgres can fetch a continuous range from the index. Not very important for just two values.
Test this alternative for the second leg of t_all. Not sure which is faster:
SELECT id
FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)
Temporary table instead of CTE
You write:
I use CTE because there are more COUNT queries down there for other
tables, which use the same subqueries.
A CTE poses as an optimization barrier, and the resulting internal work table is not indexed. When reusing a result (with more than a trivial number of rows) multiple times, it pays to use an indexed temp table instead. Index creation for a simple int column is fast.
CREATE TEMP TABLE t_tmp AS
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
SELECT id FROM t_base
UNION
SELECT id FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id);
ANALYZE t_tmp; -- !
CREATE UNIQUE INDEX ON t_tmp (id); -- ! (unique is optional)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_tmp t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
-- More queries using t_tmp
About temp tables:
How to tell if record has changed in Postgres
We have 3 tables:
10,000 rows in one, 80,000 rows in the second, and 400 rows in the third.
The code was working well, but recently we ran into performance issues.
EXPLAIN ANALYZE SELECT "users_users"."id", "users_users"."email"
FROM "users_users" WHERE (NOT ("users_users"."email" IN
(SELECT U0."email" FROM "users_blacklist" U0))
AND NOT ("users_users"."id" IN (SELECT U0."user_id"
FROM "games_user2game" U0))) ORDER BY "users_users"."id" DESC;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan Backward using users_user_pkey on users_users (cost=9.25..12534132.45 rows=2558 width=26) (actual time=46.101..77158.318 rows=2510 loops=1)
Filter: ((NOT (hashed SubPlan 1)) AND (NOT (SubPlan 2)))
Rows Removed by Filter: 7723
SubPlan 1
-> Seq Scan on users_blacklist u0 (cost=0.00..8.20 rows=420 width=22) (actual time=0.032..0.318 rows=420 loops=1)
SubPlan 2
-> Materialize (cost=0.00..2256.20 rows=77213 width=4) (actual time=0.003..4.042 rows=35774 loops=9946)
-> Seq Scan on games_user2game u0 (cost=0.00..1568.13 rows=77213 width=4) (actual time=0.011..17.159 rows=77213 loops=1)
Total runtime: 77159.689 ms
(9 rows)
Main question: is it OK that we hit performance issues when joining two tables with fewer than 100,000 rows?
Where should we dig? Should we change the query or dig into the DB settings?
UPD: A temporary solution is to get rid of the subqueries by prefetching their results in code.
I don't know the Postgres dialect of SQL, but it may be worth experimenting with outer joins. In many other DBMSs they can offer better performance than subselects.
Something along the lines of
SELECT "users_users"."id", "users_users"."email"
FROM "users_users" us left join "users_blacklist" uo on uo.email = us.email
left join "games_user2game" ug on us.id = ug.user_id
where uo.email is null
AND ug.id is null
I think that is doing the same thing as your original query, but you'd have to test to make sure.
I have run across similar problems on SQL Server and rewritten the query with an EXISTS, as @Scotch suggests, to good effect.
SELECT
"users_users"."id",
"users_users"."email"
FROM "users_users"
WHERE
NOT EXISTS
(
SELECT NULL FROM "users_blacklist" WHERE "users_blacklist"."email" = "users_users"."email"
)
AND NOT EXISTS
(
SELECT NULL FROM "games_user2game" WHERE "games_user2game"."user_id" = "users_users"."user_id"
)
ORDER BY "users_users"."id" DESC;
This query will give you all users who are not blacklisted, and who are not in a game. It may be faster than the outer join option depending on how postgres plans the query.