I'm a little confused here.
Here is my (simplified) query:
SELECT *
FROM (SELECT documents.*,
             (SELECT max(date)
              FROM registrations
              WHERE registrations.document_id = documents.id) AS register_date
      FROM documents) AS dcmnts
ORDER BY register_date
LIMIT 20;
And here are my EXPLAIN ANALYSE results:
Limit (cost=46697025.51..46697025.56 rows=20 width=193) (actual time=80329.201..80329.206 rows=20 loops=1)
-> Sort (cost=46697025.51..46724804.61 rows=11111641 width=193) (actual time=80329.199..80329.202 rows=20 loops=1)
Sort Key: ((SubPlan 1))
Sort Method: top-N heapsort Memory: 29kB
-> Seq Scan on documents (cost=0.00..46401348.74 rows=11111641 width=193) (actual time=0.061..73275.304 rows=11114254 loops=1)
SubPlan 1
-> Aggregate (cost=3.95..4.05 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=11114254)
-> Index Scan using registrations_document_id_index on registrations (cost=0.43..3.95 rows=2 width=4) (actual time=0.004..0.004 rows=1 loops=11114254)
Index Cond: (document_id = documents.id)
Planning Time: 0.334 ms
Execution Time: 80329.287 ms
The query takes 1m 20s to execute; is there any way to optimize it? There are lots of rows in these tables (documents: 11114642; registrations: 13176070).
In the actual full query I also have some more filters, and it takes up to 4 seconds to execute, which is still too slow. This subquery ORDER BY seems to be the bottleneck here, and I can't figure out a way to optimize it.
I tried to set indexes on the date/document_id columns.
Don't use a scalar subquery:
SELECT documents.*,
       reg.register_date
FROM documents
JOIN (
    SELECT document_id, max(date) AS register_date
    FROM registrations
    GROUP BY document_id
) reg ON reg.document_id = documents.id
ORDER BY register_date
LIMIT 20;
Try to unnest the query:
SELECT documents.id,
       documents.other_attr,
       max(registrations.date) AS register_date
FROM documents
JOIN registrations ON registrations.document_id = documents.id
GROUP BY documents.id, documents.other_attr
ORDER BY register_date
LIMIT 20;
The query should be supported at least by an index on registrations(document_id, date):
create index idx_registrations_did_date
on registrations(document_id, date)
In the actual full query I also have some more filters, and it takes up to 4 seconds to execute, which is still too slow.
Then ask about that query. What can we say about a query we can't see? Clearly this other query isn't just like this query, except for filtering stuff out after all the work is done, as then it couldn't be faster (other than due to cache hotness) than the one you showed us. It is doing something different, so it has to be optimized differently.
This subquery ORDER BY seems to be the bottleneck here, and I can't figure out a way to optimize it.
The timing for the sort node includes the time of all the work that preceded it, so the time of the actual sort is about 80329.206 - 73275.304 ≈ 7054 ms, i.e. roughly 7 seconds: a long time perhaps, but a minority of the total time. (This interpretation is not very obvious from the output itself; it comes from experience.)
For the query you did show us, you can get it to be quite fast, but only probabilistically correct, by using a rather convoluted construction.
WITH t AS (
    SELECT date, document_id FROM registrations
    ORDER BY date DESC, document_id DESC LIMIT 200
), t2 AS (
    SELECT DISTINCT ON (document_id) document_id, date FROM t
    ORDER BY document_id, date DESC
), t3 AS (
    SELECT document_id, date FROM t2 ORDER BY date DESC LIMIT 20
)
SELECT documents.*,
       t3.date AS register_date
FROM documents
JOIN t3 ON t3.document_id = documents.id
ORDER BY register_date;
It will be efficiently supported by:
create index on registrations (date, document_id);
create index on documents(id);
The idea here is that the 200 most recent registrations will have at least 20 distinct document_id among them. Of course there is no way to know for sure that that will be true, so you might have to increase 200 to 20000 (which should still be pretty fast, compared to what you are currently doing) or even more to be sure you get the right answer. This also assumes that every distinct document_id matches exactly one document.id.
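As a hedged sanity check (not part of the original answer; 200 is the same assumed inner limit as above), you can count how many distinct document_id values the most recent registrations actually contain before settling on that limit:
SELECT count(DISTINCT document_id) AS distinct_docs
FROM (
    SELECT document_id
    FROM registrations
    ORDER BY date DESC, document_id DESC
    LIMIT 200   -- raise this if distinct_docs comes back below 20
) newest;
If distinct_docs is comfortably above 20 for typical data, the probabilistic query above returns the correct result.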
I'm using this public Postgres DB of NEAR protocol: https://github.com/near/near-indexer-for-explorer#shared-public-access
postgres://public_readonly:nearprotocol@mainnet.db.explorer.indexer.near.dev/mainnet_explorer
SELECT "public"."receipts"."receipt_id",
"public"."receipts"."included_in_block_hash",
"public"."receipts"."included_in_chunk_hash",
"public"."receipts"."index_in_chunk",
"public"."receipts"."included_in_block_timestamp",
"public"."receipts"."predecessor_account_id",
"public"."receipts"."receiver_account_id",
"public"."receipts"."receipt_kind",
"public"."receipts"."originated_from_transaction_hash"
FROM "public"."receipts"
WHERE ("public"."receipts"."receipt_id") IN
(SELECT "t0"."receipt_id"
FROM "public"."receipts" AS "t0"
INNER JOIN "public"."action_receipts" AS "j0" ON ("j0"."receipt_id") = ("t0"."receipt_id")
WHERE ("j0"."signer_account_id" = 'ryancwalsh.near'
AND "t0"."receipt_id" IS NOT NULL))
ORDER BY "public"."receipts"."included_in_block_timestamp" DESC
LIMIT 1
OFFSET 0
always results in:
ERROR: canceling statement due to statement timeout
SQL state: 57014
But if I change it to LIMIT 2, the query runs in less than 1 second.
How would that ever be the case? Does that mean the database isn't set up well? Or am I doing something wrong?
P.S. The query here was generated via Prisma. findFirst always times out, so I think I might need to change it to findMany as a workaround.
Your query can be simplified/optimized:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM public.receipts r
WHERE EXISTS (
SELECT FROM public.action_receipts j
WHERE j.receipt_id = r.receipt_id
AND j.signer_account_id = 'ryancwalsh.near'
)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
However, that only scratches the surface of your underlying problem.
Like Kirk already commented, Postgres chooses a different query plan for LIMIT 1, evidently ignorant of the fact that there are only 90 rows in table action_receipts with signer_account_id = 'ryancwalsh.near', while both involved tables have more than 220 million rows and are growing steadily.
Changing to LIMIT 2 makes a different query plan seem more favorable, hence the observed difference in performance. (So the query planner has the general idea that the filter is very selective, just not close enough for the corner case of LIMIT 1.)
You should have mentioned cardinalities to set us on the right track.
Knowing that our filter is so selective, we can force a more favorable query plan with a different query:
WITH j AS (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
)
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
This uses the same query plan for LIMIT 1, and either variant finishes in under 2 ms in my test:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=134904.89..134904.89 rows=1 width=223) (actual time=1.750..1.754 rows=1 loops=1)
CTE j
-> Bitmap Heap Scan on action_receipts (cost=319.46..41564.59 rows=10696 width=44) (actual time=0.058..0.179 rows=90 loops=1)
Recheck Cond: (signer_account_id = 'ryancwalsh.near'::text)
Heap Blocks: exact=73
-> Bitmap Index Scan on action_receipt_signer_account_id_idx (cost=0.00..316.79 rows=10696 width=0) (actual time=0.043..0.043 rows=90 loops=1)
Index Cond: (signer_account_id = 'ryancwalsh.near'::text)
-> Sort (cost=93340.30..93367.04 rows=10696 width=223) (actual time=1.749..1.750 rows=1 loops=1)
Sort Key: r.included_in_block_timestamp DESC
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=0.70..93286.82 rows=10696 width=223) (actual time=0.089..1.705 rows=90 loops=1)
-> CTE Scan on j (cost=0.00..213.92 rows=10696 width=32) (actual time=0.060..0.221 rows=90 loops=1)
-> Index Scan using receipts_pkey on receipts r (cost=0.70..8.70 rows=1 width=223) (actual time=0.016..0.016 rows=1 loops=90)
Index Cond: (receipt_id = j.receipt_id)
Planning Time: 0.281 ms
Execution Time: 1.801 ms
The point is to execute the hugely selective query in the CTE first. Then Postgres does not attempt to walk the index on (included_in_block_timestamp) under the wrong assumption that it would find a matching row soon enough. (It does not.)
The DB at hand runs Postgres 11, where CTEs are always optimization barriers. In Postgres 12 or later add AS MATERIALIZED to the CTE to guarantee the same effect.
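A minimal sketch of that Postgres 12+ variant, assuming the same tables as above (the column list is abbreviated to r.* here; only the CTE header changes):
WITH j AS MATERIALIZED (   -- Postgres 12+: keep the CTE as an optimization barrier
   SELECT receipt_id
   FROM   public.action_receipts
   WHERE  signer_account_id = 'ryancwalsh.near'
   )
SELECT r.*
FROM   j
JOIN   public.receipts r USING (receipt_id)
ORDER  BY r.included_in_block_timestamp DESC
LIMIT  1;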
Or you could use the "OFFSET 0 hack" in any version like this:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
OFFSET 0 -- !
) j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
Prevents "inlining" of the subquery to the same effect. Finishes in < 2ms.
See:
How to prevent PostgreSQL from rewriting OUTER JOIN queries?
"Fix" the database?
The proper fix depends on the complete picture. The underlying problem is that Postgres overestimates the number of qualifying rows in table action_receipts. The MCV list (most common values) cannot keep up with 220 million rows (and growing). It's most probably not just ANALYZE lagging behind. (Though it could be: autovacuum not properly configured? Rookie mistake?) Depending on the actual cardinalities (data distribution) in action_receipts.signer_account_id and access patterns you could do various things to "fix" it. Two options:
1. Increase n_distinct and STATISTICS
If most values in action_receipts.signer_account_id are equally rare (high cardinality), consider setting a very large n_distinct value for the column, and combine that with a moderately increased STATISTICS target for the same column to counter errors in the other direction (underestimating the number of qualifying rows for common values); a sketch of both settings follows after the links. Read both answers here:
Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N
And:
Very bad query plan in PostgreSQL 9.6
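For illustration only, a sketch of what those two settings could look like; the concrete numbers are placeholders of mine, not recommendations, and need tuning against the real data distribution:
ALTER TABLE public.action_receipts
    ALTER COLUMN signer_account_id SET (n_distinct = 50000000);  -- placeholder: tell the planner the column is high-cardinality
ALTER TABLE public.action_receipts
    ALTER COLUMN signer_account_id SET STATISTICS 1000;          -- placeholder: bigger MCV list / histogram
ANALYZE public.action_receipts;                                  -- rebuild statistics so the new settings take effect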
2. Local fix with partial index
If action_receipts.signer_account_id = 'ryancwalsh.near' is special in that it's queried more regularly than others, consider a small partial index for it, to fix just that case. Like:
CREATE INDEX ON action_receipts (receipt_id)
WHERE signer_account_id = 'ryancwalsh.near';
I am trying to debug a query in PostgreSQL that I've built to bucket market data into time buckets of arbitrary intervals. Here is my table definition:
CREATE TABLE historical_ohlcv (
exchange_symbol TEXT NOT NULL,
symbol_id TEXT NOT NULL,
kafka_key TEXT NOT NULL,
open NUMERIC,
high NUMERIC,
low NUMERIC,
close NUMERIC,
volume NUMERIC,
time_open TIMESTAMP WITH TIME ZONE NOT NULL,
time_close TIMESTAMP WITH TIME ZONE,
CONSTRAINT historical_ohlcv_pkey
PRIMARY KEY (exchange_symbol, symbol_id, time_open)
);
CREATE INDEX symbol_id_idx
ON historical_ohlcv (symbol_id);
CREATE INDEX open_close_symbol_id
ON historical_ohlcv (time_open, time_close, exchange_symbol, symbol_id);
CREATE INDEX time_open_idx
ON historical_ohlcv (time_open);
CREATE INDEX time_close_idx
ON historical_ohlcv (time_close);
The table currently has ~25m rows. My example query uses 1 hour, but the interval could be 5 mins, 10 mins, 2 days, etc.
EXPLAIN ANALYZE WITH vals AS (
SELECT
NOW() - '5 months' :: INTERVAL AS frame_start,
NOW() AS frame_end,
INTERVAL '1 hour' AS t_interval
)
, grid AS (
SELECT
start_time,
lead(start_time, 1)
OVER (
ORDER BY start_time ) AS end_time
FROM (
SELECT
generate_series(frame_start, frame_end,
t_interval) AS start_time,
frame_end
FROM vals
) AS x
)
SELECT max(high)
FROM grid g
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                AND ohlcv.time_close < g.end_time
WHERE exchange_symbol = 'BINANCE'
AND symbol_id = 'ETHBTC'
GROUP BY start_time;
The values in the WHERE clause could be any valid values in the table.
This technique was inspired by:
Best way to count records by arbitrary time intervals in Rails+Postgres.
The idea is to make a common table of buckets and LEFT JOIN your data with it to indicate which bucket each row falls in. This query is really slow! It's currently taking ~15 s. Based on the query plan, we have a really expensive nested loop:
QUERY PLAN
HashAggregate (cost=2758432.05..2758434.05 rows=200 width=40) (actual time=16023.713..16023.817 rows=542 loops=1)
Group Key: g.start_time
CTE vals
-> Result (cost=0.00..0.02 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
CTE grid
-> WindowAgg (cost=64.86..82.36 rows=1000 width=16) (actual time=2.986..9.594 rows=3625 loops=1)
-> Sort (cost=64.86..67.36 rows=1000 width=8) (actual time=2.981..4.014 rows=3625 loops=1)
Sort Key: x.start_time
Sort Method: quicksort Memory: 266kB
-> Subquery Scan on x (cost=0.00..15.03 rows=1000 width=8) (actual time=0.014..1.991 rows=3625 loops=1)
-> ProjectSet (cost=0.00..5.03 rows=1000 width=16) (actual time=0.013..1.048 rows=3625 loops=1)
-> CTE Scan on vals (cost=0.00..0.02 rows=1 width=32) (actual time=0.008..0.009 rows=1 loops=1)
-> Nested Loop (cost=0.56..2694021.34 rows=12865667 width=14) (actual time=7051.730..16015.873 rows=31978 loops=1)
-> CTE Scan on grid g (cost=0.00..20.00 rows=1000 width=16) (actual time=2.988..11.635 rows=3625 loops=1)
-> Index Scan using historical_ohlcv_pkey on historical_ohlcv ohlcv (cost=0.56..2565.34 rows=12866 width=22) (actual time=3.712..4.413 rows=9 loops=3625)
Index Cond: ((exchange_symbol = 'BINANCE'::text) AND (symbol_id = 'ETHBTC'::text) AND (time_open >= g.start_time))
Filter: (time_close < g.end_time)
Rows Removed by Filter: 15502
Planning time: 0.568 ms
Execution time: 16023.979 ms
My guess is this line is doing a lot:
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
AND ohlcv.time_close < g.end_time
But I'm not sure how to accomplish this in another way.
P.S. apologies if this belongs to dba.SE. I read the FAQ and this seemed too basic for that site, so I posted here.
Edits as requested:
SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1); returns 107.632
For exchange_symbol, there are 3 unique values, for symbol_id there are ~400
PostgreSQL version: PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit.
The table will be growing by ~1m records a day, so it's not exactly read-only. All of this is done locally, and I will try to move to RDS or similar to help manage hardware issues.
Related: if I wanted to add other aggregates, specifically 'first in the bucket', 'last in the bucket', min, sum, would my indexing strategy change?
Correctness first: I suspect a bug in your query:
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
AND ohlcv.time_close < g.end_time
Unlike my referenced answer, you join on a time interval: (time_open, time_close]. The way you do it excludes rows in the table where the interval crosses bucket borders. Only intervals fully contained in a single bucket count. I don't think that's intended?
A simple fix would be to decide bucket membership based on time_open (or time_close) alone. If you want to keep working with both, you have to define exactly how to deal with intervals overlapping with multiple buckets.
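For instance (just a sketch of one possible definition, not what the referenced answer does), if a row should count toward every bucket its interval overlaps, you could join on range overlap while keeping the grid CTE from the question; this assumes time_close and g.end_time are not null:
SELECT g.start_time, max(o.high) AS max_high
FROM   grid g
LEFT   JOIN historical_ohlcv o
       ON  tstzrange(o.time_open, o.time_close) && tstzrange(g.start_time, g.end_time)  -- overlap instead of containment
       AND o.exchange_symbol = 'BINANCE'
       AND o.symbol_id = 'ETHBTC'
GROUP  BY g.start_time
ORDER  BY g.start_time;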
Also, you are looking for max(high) per bucket, which is different in nature from count(*) in my referenced answer.
And your buckets are simple intervals per hour?
Then we can radically simplify. Working with just time_open:
SELECT date_trunc('hour', time_open) AS hour, max(high) AS max_high
FROM historical_ohlcv
WHERE exchange_symbol = 'BINANCE'
AND symbol_id = 'ETHBTC'
AND time_open >= now() - interval '5 months' -- frame_start
AND time_open < now() -- frame_end
GROUP BY 1
ORDER BY 1;
Related:
Resample on time series data
It's hard to talk about further performance optimization while basics are unclear. And we'd need more information.
Are WHERE conditions variable?
How many distinct values in exchange_symbol and symbol_id?
Avg. row size? What do you get for:
SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1);
Is the table read-only?
Assuming you always filter on exchange_symbol and symbol_id, that the values are variable, and that your table is read-only or autovacuum can keep up with the write load (so we can hope for index-only scans), you would best have a multicolumn index on (exchange_symbol, symbol_id, time_open, high DESC) to support this query, with the index columns in this order (a sketch follows after the link). Related:
Multicolumn index and performance
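A sketch of that index (the name is arbitrary):
CREATE INDEX historical_ohlcv_sym_open_high_idx
    ON historical_ohlcv (exchange_symbol, symbol_id, time_open, high DESC);  -- equality columns first, then the range column, then the aggregated column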
Depending on data distribution and other details, a LEFT JOIN LATERAL solution might be another option (a sketch follows after the links). Related:
How to find an average of values for time intervals in postgres
Optimize GROUP BY query to retrieve latest record per user
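A minimal sketch of such a LEFT JOIN LATERAL variant, under the same assumption of bucketing on time_open alone; generating the buckets with generate_series is my addition, not something taken from the referenced answers:
SELECT g.start_time,
       o.max_high
FROM   generate_series(now() - interval '5 months', now(), interval '1 hour') AS g(start_time)
LEFT   JOIN LATERAL (
   SELECT max(high) AS max_high
   FROM   historical_ohlcv
   WHERE  exchange_symbol = 'BINANCE'
   AND    symbol_id = 'ETHBTC'
   AND    time_open >= g.start_time
   AND    time_open <  g.start_time + interval '1 hour'   -- one row per 1-hour bucket
   ) o ON true
ORDER  BY g.start_time;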
Aside from all that, your EXPLAIN plan exhibits some very bad estimates:
https://explain.depesz.com/s/E5yI
Are you using a current version of Postgres? You may have to work on your server configuration - or at least set higher statistics targets on relevant columns and more aggressive autovacuum settings for the big table. Related:
Keep PostgreSQL from sometimes choosing a bad query plan
Aggressive Autovacuum on PostgreSQL
I have a database with a few tables, each with a few million rows (the tables do have indexes). I need to count rows in a table, but only those whose foreign key field points to a subset of rows from another table.
Here is the query:
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE ( top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )) ))
SELECT Count(*)
FROM cast_info
WHERE EXISTS(SELECT 1
FROM filtered_title
WHERE cast_info.movie_id = filtered_title.id)
AND cast_info.role_id IN( 3, 8 )
I use a CTE because there are more COUNT queries down there for other tables, which use the same subqueries. But I have tried to get rid of the CTE and the results were the same: the first time I execute the query, it runs... and runs... and runs for more than ten minutes. The second time I execute the query, it's down to 4 seconds, which is acceptable for me.
The result of EXPLAIN ANALYZE:
Aggregate (cost=46194894.49..46194894.50 rows=1 width=0) (actual time=127728.452..127728.452 rows=1 loops=1)
CTE filtered_title
-> Seq Scan on title top (cost=0.00..46123542.41 rows=1430406 width=4) (actual time=732.509..1596.345 rows=16250 loops=1)
Filter: (((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[]))) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
Rows Removed by Filter: 2832906
SubPlan 1
-> Index Scan using title_idx_epof on title sub (cost=0.43..16.16 rows=1 width=0) (never executed)
Index Cond: (episode_of_id = top.id)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
SubPlan 2
-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
-> Nested Loop (cost=32184.70..63158.16 rows=3277568 width=0) (actual time=1620.382..127719.030 rows=29679 loops=1)
-> HashAggregate (cost=32184.13..32186.13 rows=200 width=4) (actual time=1620.058..1631.697 rows=16250 loops=1)
-> CTE Scan on filtered_title (cost=0.00..28608.12 rows=1430406 width=4) (actual time=732.513..1607.093 rows=16250 loops=1)
-> Index Scan using cast_info_idx_mid on cast_info (cost=0.56..154.80 rows=6 width=4) (actual time=5.977..7.758 rows=2 loops=16250)
Index Cond: (movie_id = filtered_title.id)
Filter: (role_id = ANY ('{3,8}'::integer[]))
Rows Removed by Filter: 15
Total runtime: 127729.100 ms
Now to my question. What am I doing wrong and how can I fix it?
I tried a few variants of the same query: exclusive joins, joins/exists. On the one hand, this variant (below) seems to require the least time to do the job (10x faster), but it's still 60 seconds on average. On the other hand, unlike my first query, which needs 4-6 seconds on the second run, it always requires 60 seconds.
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )))
SELECT Count(*)
FROM cast_info
join filtered_title
ON cast_info.movie_id = filtered_title.id
WHERE cast_info.role_id IN( 3, 8 )
Disclaimer: There are way too many factors in play to give a conclusive answer. The information "a few tables, each with a few million rows (the tables do have indexes)" just doesn't cut it. It depends on cardinalities, table definitions, data types, usage patterns and (probably most important) indexes. And a proper basic configuration of your db server, of course. All of this goes beyond the scope of a single question on SO. Start with the links in the postgresql-performance tag. Or hire a professional.
I am going to address the most prominent detail (for me) in your query plan:
Sequential scan on title?
-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
Sequentially scanning almost 3 million rows to retrieve just 16250 is not very efficient. The sequential scan is also the likely reason why the first run takes so much longer. Subsequent calls can read data from the cache. Since the table is big, the data probably won't stay in the cache for long unless you have huge amounts of cache.
An index scan is typically substantially faster to gather 0.5 % of the rows from a big table. Possible causes:
Statistics are off.
Cost settings are off.
No matching index.
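For the first two causes, a quick hedged check could look like this (nothing here changes the data; the cost settings are only inspected):
ANALYZE title;               -- refresh planner statistics for the table
SHOW random_page_cost;       -- cost setting that steers index scan vs. sequential scan
SHOW effective_cache_size;   -- another setting the planner uses for that decision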
My money is on the index. You didn't supply your version of Postgres, so I am assuming the current version, 9.3. The perfect index for this query would be:
CREATE INDEX title_foo_idx ON title (kind_id, production_year, id, episode_of_id)
Data types matter. Sequence of columns in the index matters.
kind_id first, because the rule of thumb is: index for equality first — then for ranges.
The last two columns (id, episode_of_id) are only useful for a potential index-only scan. If not applicable, drop those. More details here:
PostgreSQL composite primary key
The way you built your query, you end up with two sequential scans on the big table. So here's an educated guess for a ...
Better query
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
, t_all AS (
SELECT id FROM t_base
UNION -- not UNION ALL (!)
SELECT id
FROM (SELECT DISTINCT episode_of_id AS id FROM t_base) x
JOIN title t USING (id)
)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_all t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
This should give you one index scan on the new title_foo_idx, plus another index scan on the pk index of title. The rest should be relatively cheap. With any luck, much faster than before.
kind_id BETWEEN 1 AND 2: as long as you have a continuous range of values, that is faster than listing individual values, because this way Postgres can fetch a continuous range from the index. Not very important for just two values.
Test this alternative for the second leg of t_all. Not sure which is faster:
SELECT id
FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)
Temporary table instead of CTE
You write:
I use a CTE because there are more COUNT queries down there for other tables, which use the same subqueries.
A CTE poses as an optimization barrier; the resulting internal work table is not indexed. When reusing a result (with more than a trivial number of rows) multiple times, it pays to use an indexed temp table instead. Index creation for a simple int column is fast.
CREATE TEMP TABLE t_tmp AS
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
SELECT id FROM t_base
UNION
SELECT id FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id);
ANALYZE t_tmp; -- !
CREATE UNIQUE INDEX ON t_tmp (id); -- ! (unique is optional)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_tmp t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
-- More queries using t_tmp
About temp tables:
How to tell if record has changed in Postgres
I have a table (of snmp traps, but that's neither here nor there).
I have a query which pulls some records, which goes something like this:
SELECT *
FROM traps_trap
WHERE summary_id = 1600
ORDER BY traps_trap."trapTime";
This responds instantaneously, with 6 records.
When I add LIMIT 50 (since not all results will have only 6 records), it is very, VERY slow (to the point of not returning at all).
There is an index on the summary_id column; I can only assume that it isn't being used for the second query.
I understand that the tool to solve this is EXPLAIN, but I'm not familiar enough with it to understand the results.
The EXPLAIN ANALYSE VERBOSE output for the first (quick) query is as follows:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=14491.51..14502.48 rows=4387 width=263) (actual time=0.128..0.130 rows=6 loops=1)
Output: id, summary_id, "trapTime", packet
Sort Key: traps_trap."trapTime"
Sort Method: quicksort Memory: 28kB
-> Index Scan using traps_trap_summary_id on public.traps_trap (cost=0.00..13683.62 rows=4387 width=263) (actual time=0.060..0.108 rows=6 loops=1)
Output: id, summary_id, "trapTime", packet
Index Cond: (traps_trap.summary_id = 1600)
Total runtime: 0.205 ms
(8 rows)
The EXPLAIN output for the second is:
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Limit (cost=0.00..2538.69 rows=10 width=263)
-> Index Scan using "traps_trap_trapTime" on traps_trap (cost=0.00..1113975.68 rows=4388 width=263)
Filter: (summary_id = 1600)
(3 rows)
I run VACUUM and ANALYZE daily; I understand that is supposed to improve the plan. Any other pointers?
Index scan using trapTime is much slower than one using summary_id.
I would try to nest the query (to use plan #1):
select * from (
SELECT *
FROM traps_trap
WHERE summary_id = 1600
ORDER BY traps_trap."trapTime"
) t
limit 50;
Edit:
After doing some tests, I learned that simple query nesting (as above) has no effect on the planner.
To force the planner to use the traps_trap_summary_id index, you can use a CTE (my tests confirm this method):
with t as (
SELECT *
FROM traps_trap
WHERE summary_id = 1600
ORDER BY traps_trap."trapTime"
)
select * from t
limit 50;
I have a table, let's call it "foos", with almost 6 million records in it. I am running the following query:
SELECT "foos".*
FROM "foos"
INNER JOIN "bars" ON "foos".bar_id = "bars".id
WHERE (("bars".baz_id = 13266))
ORDER BY "foos"."id" DESC
LIMIT 5 OFFSET 0;
This query takes a very long time to run (Rails times out while running it). There are indexes on all of the IDs in question. The curious part is, if I remove either the ORDER BY clause or the LIMIT clause, it runs almost instantaneously.
I'm assuming that the presence of both ORDER BY and LIMIT are making PostgreSQL make some bad choices in query planning. Anyone have any ideas on how to fix this?
In case it helps, here is the EXPLAIN for all 3 cases:
//////// Both ORDER and LIMIT
SELECT "foos".*
FROM "foos"
INNER JOIN "bars" ON "foos".bar_id = "bars".id
WHERE (("bars".baz_id = 13266))
ORDER BY "foos"."id" DESC
LIMIT 5 OFFSET 0;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..16663.44 rows=5 width=663)
-> Nested Loop (cost=0.00..25355084.05 rows=7608 width=663)
Join Filter: (foos.bar_id = bars.id)
-> Index Scan Backward using foos_pkey on foos (cost=0.00..11804133.33 rows=4963477 width=663)
Filter: (((NOT privacy_protected) OR (user_id = 67962)) AND ((status)::text = 'DONE'::text))
-> Materialize (cost=0.00..658.96 rows=182 width=4)
-> Index Scan using index_bars_on_baz_id on bars (cost=0.00..658.05 rows=182 width=4)
Index Cond: (baz_id = 13266)
(8 rows)
//////// Just LIMIT
SELECT "foos".*
FROM "foos"
INNER JOIN "bars" ON "foos".bar_id = "bars".id
WHERE (("bars".baz_id = 13266))
LIMIT 5 OFFSET 0;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..22.21 rows=5 width=663)
-> Nested Loop (cost=0.00..33788.21 rows=7608 width=663)
-> Index Scan using index_bars_on_baz_id on bars (cost=0.00..658.05 rows=182 width=4)
Index Cond: (baz_id = 13266)
-> Index Scan using index_foos_on_bar_id on foos (cost=0.00..181.51 rows=42 width=663)
Index Cond: (foos.bar_id = bars.id)
Filter: (((NOT foos.privacy_protected) OR (foos.user_id = 67962)) AND ((foos.status)::text = 'DONE'::text))
(7 rows)
//////// Just ORDER
SELECT "foos".*
FROM "foos"
INNER JOIN "bars" ON "foos".bar_id = "bars".id
WHERE (("bars".baz_id = 13266))
ORDER BY "foos"."id" DESC;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=36515.17..36534.19 rows=7608 width=663)
Sort Key: foos.id
-> Nested Loop (cost=0.00..33788.21 rows=7608 width=663)
-> Index Scan using index_bars_on_baz_id on bars (cost=0.00..658.05 rows=182 width=4)
Index Cond: (baz_id = 13266)
-> Index Scan using index_foos_on_bar_id on foos (cost=0.00..181.51 rows=42 width=663)
Index Cond: (foos.bar_id = bars.id)
Filter: (((NOT foos.privacy_protected) OR (foos.user_id = 67962)) AND ((foos.status)::text = 'DONE'::text))
(8 rows)
When you have both the LIMIT and the ORDER BY, the optimizer has decided it is faster to limp through the unfiltered records on foos by key descending until it gets five matches for the rest of the criteria. In the other cases, it simply runs the query as a nested loop and returns all the records.
Offhand, I'd say the problem is that PG doesn't grok the joint distribution of the various ids and that's why the plan is so sub-optimal.
For possible solutions: I'll assume that you have run ANALYZE recently. If not, do so. That may explain why your estimated times are high even on the version that returns fast. If the problem persists, perhaps run the ORDER BY as a subselect and slap the LIMIT on in an outer query.
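A hedged sketch of that last suggestion (Postgres may still flatten the subquery and push the LIMIT down, so treat this as an experiment rather than a guaranteed fix):
SELECT *
FROM (
    SELECT "foos".*
    FROM "foos"
    INNER JOIN "bars" ON "foos".bar_id = "bars".id
    WHERE "bars".baz_id = 13266
    ORDER BY "foos".id DESC    -- sort inside the subquery
) sub
LIMIT 5 OFFSET 0;              -- apply the LIMIT outside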
Probably it happens because it tries to order first and then select. Why not try to sort the result in an outer SELECT instead? Something like:
SELECT * FROM (SELECT ... INNER JOIN ETC...) AS sub ORDER BY ... DESC
Your query plan indicates a filter on
(((NOT privacy_protected) OR (user_id = 67962)) AND ((status)::text = 'DONE'::text))
which doesn't appear in the SELECT - where is it coming from?
Also, note that the expression is listed as a "Filter" and not an "Index Cond", which would seem to indicate there's no index applied to it.
It may be running a full-table scan on "foos". Did you try changing the order of the tables and using a LEFT JOIN instead of an INNER JOIN to see if it returns results faster?
say...
SELECT "bars"."id", "foos".*
FROM "bars"
LEFT JOIN "foos" ON "bars"."id" = "foos"."bar_id"
WHERE "bars"."baz_id" = 13266
ORDER BY "foos"."id" DESC
LIMIT 5 OFFSET 0;