Let's say I have the following PostgreSQL database schema:
Group
id: int
Task:
id: int
created_at: datetime
group: FK Group
I have the following Materialized View to calculate the number of tasks and the most recent Task.created_at value per group:
CREATE MATERIALIZED VIEW group_statistics AS (
SELECT
group.id as group_id,
MAX(task.created_at) AS latest_task_created_at,
COUNT(task.id) AS task_count
FROM group
LEFT OUTER JOIN task ON (group.id = task.group_id)
GROUP BY group.id
);
The Task table currently has 20 million records, so refreshing this materialized view takes a long time (20-30 seconds). We've also been experiencing some short but major DB performance issues ever since we started refreshing the materialized view every 10 min, even with CONCURRENTLY:
REFRESH MATERIALIZED VIEW CONCURRENTLY group_statistics;
Is there a more performant way to calculate these values? Note, they do NOT need to be exact. Approximate values are totally fine, e.g. latest_task_created_at can be 10-20 min delayed.
I'm thinking of caching these values on every write to the Task table. Either in Redis or in PostgreSQL itself.
Update
People are requesting the execution plan. EXPLAIN doesn't work on REFRESH but I ran EXPLAIN on the actual query. Note, it's different than my theoretical data model above. In this case, Database is Group and Record is Task. Also note, I'm on PostgreSQL 12.10.
EXPLAIN (analyze, buffers, verbose)
SELECT
store_database.id as database_id,
MAX(store_record.updated_at) AS latest_record_updated_at,
COUNT(store_record.id) AS record_count
FROM store_database
LEFT JOIN store_record ON (store_database.id = store_record.database_id)
GROUP BY store_database.id;
Output:
HashAggregate (cost=1903868.71..1903869.22 rows=169 width=32) (actual time=18227.016..18227.042 rows=169 loops=1)
" Output: store_database.id, max(store_record.updated_at), count(store_record.id)"
Group Key: store_database.id
Buffers: shared hit=609211 read=1190704
I/O Timings: read=3385.027
-> Hash Right Join (cost=41.28..1872948.10 rows=20613744 width=40) (actual time=169.766..14572.558 rows=20928339 loops=1)
" Output: store_database.id, store_record.updated_at, store_record.id"
Inner Unique: true
Hash Cond: (store_record.database_id = store_database.id)
Buffers: shared hit=609211 read=1190704
I/O Timings: read=3385.027
-> Seq Scan on public.store_record (cost=0.00..1861691.23 rows=20613744 width=40) (actual time=0.007..8607.425 rows=20928316 loops=1)
" Output: store_record.id, store_record.key, store_record.data, store_record.created_at, store_record.updated_at, store_record.database_id, store_record.organization_id, store_record.user_id"
Buffers: shared hit=609146 read=1190704
I/O Timings: read=3385.027
-> Hash (cost=40.69..40.69 rows=169 width=16) (actual time=169.748..169.748 rows=169 loops=1)
Output: store_database.id
Buckets: 1024 Batches: 1 Memory Usage: 16kB
Buffers: shared hit=65
-> Index Only Scan using store_database_pkey on public.store_database (cost=0.05..40.69 rows=169 width=16) (actual time=0.012..0.124 rows=169 loops=1)
Output: store_database.id
Heap Fetches: 78
Buffers: shared hit=65
Planning Time: 0.418 ms
JIT:
Functions: 14
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 2.465 ms, Inlining 15.728 ms, Optimization 92.852 ms, Emission 60.694 ms, Total 171.738 ms"
Execution Time: 18229.600 ms
Note the large execution time. It sometimes takes 5-10 minutes to run. I would love to bring this down to consistently a few seconds max.
Update #2
People are requesting the execution plan when the query takes minutes. Here it is:
HashAggregate (cost=1905790.10..1905790.61 rows=169 width=32) (actual time=128442.799..128442.825 rows=169 loops=1)
" Output: store_database.id, max(store_record.updated_at), count(store_record.id)"
Group Key: store_database.id
Buffers: shared hit=114011 read=1685876 dirtied=367
I/O Timings: read=112953.619
-> Hash Right Join (cost=15.32..1874290.39 rows=20999810 width=40) (actual time=323.497..124809.521 rows=21448762 loops=1)
" Output: store_database.id, store_record.updated_at, store_record.id"
Inner Unique: true
Hash Cond: (store_record.database_id = store_database.id)
Buffers: shared hit=114011 read=1685876 dirtied=367
I/O Timings: read=112953.619
-> Seq Scan on public.store_record (cost=0.00..1862849.43 rows=20999810 width=40) (actual time=0.649..119522.406 rows=21448739 loops=1)
" Output: store_record.id, store_record.key, store_record.data, store_record.created_at, store_record.updated_at, store_record.database_id, store_record.organization_id, store_record.user_id"
Buffers: shared hit=113974 read=1685876 dirtied=367
I/O Timings: read=112953.619
-> Hash (cost=14.73..14.73 rows=169 width=16) (actual time=322.823..322.824 rows=169 loops=1)
Output: store_database.id
Buckets: 1024 Batches: 1 Memory Usage: 16kB
Buffers: shared hit=37
-> Index Only Scan using store_database_pkey on public.store_database (cost=0.05..14.73 rows=169 width=16) (actual time=0.032..0.220 rows=169 loops=1)
Output: store_database.id
Heap Fetches: 41
Buffers: shared hit=37
Planning Time: 5.390 ms
JIT:
Functions: 14
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 1.306 ms, Inlining 82.966 ms, Optimization 176.787 ms, Emission 62.561 ms, Total 323.620 ms"
Execution Time: 128474.490 ms
Your MV currently has 169 rows, so write costs are negligible (unless you have locking issues). It's all about the expensive sequential scan over the big table.
Full counts are slow
Getting exact counts per group ("database") is expensive. There is no magic bullet for that in Postgres. Postgres has to count all rows. If the table is all-visible (visibility map is up to date), Postgres can shorten the procedure somewhat by only traversing a covering index. (You did not provide indexes ...)
There are possible shortcuts with an estimate for the total row count in the whole table. But the same is not easily available per group. See:
Fast way to discover the row count of a table in PostgreSQL
But not that slow
That said, your query can still be substantially faster. Aggregate before the join:
SELECT id AS database_id
, r.latest_record_updated_at
, COALESCE(r.record_count, 0) AS record_count
FROM store_database d
LEFT JOIN (
SELECT r.database_id AS id
, max(r.updated_at) AS latest_record_updated_at
, count(*) AS record_count
FROM store_record r
GROUP BY 1
) r USING (id);
See:
Query with LEFT JOIN not returning rows for count of 0
And use the slightly faster (and equivalent in this case) count(*). Related:
PostgreSQL: running count of rows for a query 'by minute'
Also - visibility provided - count(*) can use any non-partial index, preferably the smallest, while count(store_record.id) is limited to an index on that column (and has to inspect values, too).
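To check that aggregating before the join returns the same rows as the original join-then-aggregate form, including groups with zero records, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made only so the example is self-contained), and the three databases and three records are made-up sample data.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here; only the query shapes matter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_database (id INTEGER PRIMARY KEY);
    CREATE TABLE store_record (
        id INTEGER PRIMARY KEY,
        database_id INTEGER REFERENCES store_database(id),
        updated_at TEXT
    );
    INSERT INTO store_database (id) VALUES (1), (2), (3);  -- 3 has no records
    INSERT INTO store_record (database_id, updated_at) VALUES
        (1, '2022-01-01'), (1, '2022-03-01'), (2, '2022-02-01');
""")

# Original shape: join all records first, then aggregate.
join_then_aggregate = """
    SELECT d.id, max(r.updated_at), count(r.id)
    FROM store_database d
    LEFT JOIN store_record r ON d.id = r.database_id
    GROUP BY d.id
    ORDER BY d.id
"""

# Suggested shape: aggregate the big table first, then join the small result.
aggregate_then_join = """
    SELECT d.id, r.latest_record_updated_at, COALESCE(r.record_count, 0)
    FROM store_database d
    LEFT JOIN (
        SELECT database_id AS id,
               max(updated_at) AS latest_record_updated_at,
               count(*) AS record_count
        FROM store_record
        GROUP BY database_id
    ) r USING (id)
    ORDER BY d.id
"""

rows_a = conn.execute(join_then_aggregate).fetchall()
rows_b = conn.execute(aggregate_then_join).fetchall()
print(rows_a == rows_b)
print(rows_b)
```

The COALESCE is what keeps the count at 0 rather than NULL for a database without records. The sketch says nothing about speed; that has to be measured on the real table.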
I/O is your bottleneck
You added the EXPLAIN plan for an expensive execution, and the skyrocketing I/O cost stands out. It dominates the cost of your query.
Fast plan:
Buffers: shared hit=609146 read=1190704
I/O Timings: read=3385.027
Slow plan:
Buffers: shared hit=113974 read=1685876 dirtied=367
I/O Timings: read=112953.619
Your Seq Scan on public.store_record spent 112953.619 ms on reading data file blocks. 367 dirtied buffers represent under 3MB and are only a tiny fraction of total I/O. Either way, I/O dominates the cost.
Either your storage system is abysmally slow or, more likely since I/O of the fast query costs 30x less, there is too much contention for I/O from concurrent work load (on an inappropriately configured system). One or more of these can help:
faster storage
better (more appropriate) server configuration
more RAM (and server config that allows more cache memory)
less concurrent workload
more efficient table design with smaller disk footprint
smarter query that needs to read fewer data blocks
upgrade to a current version of Postgres
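The figures quoted from the two plans can be checked with a few lines of arithmetic. This is only a sketch: it assumes the default 8 kB block size, and every other number is copied from the EXPLAIN output in the question.

```python
BLOCK_SIZE = 8192  # default Postgres block size in bytes (assumed unchanged)

fast_read_ms = 3385.027      # I/O Timings of the fast plan
slow_read_ms = 112953.619    # I/O Timings of the slow plan
slow_total_ms = 128474.490   # Execution Time of the slow plan
dirtied_buffers = 367        # dirtied buffers in the slow plan

dirtied_mib = dirtied_buffers * BLOCK_SIZE / 1024**2
io_slowdown = slow_read_ms / fast_read_ms
io_share = slow_read_ms / slow_total_ms

print(f"dirtied: {dirtied_mib:.2f} MiB")
print(f"read I/O slowdown: {io_slowdown:.1f}x")
print(f"share of runtime spent reading: {io_share:.0%}")
```

So the dirtied buffers stay under 3 MB, reading is a bit over 30 times slower in the slow run, and reading accounts for most of that run's execution time.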
Hugely faster without count
If there was no count, just latest_record_updated_at, this query would deliver that in close to no time:
SELECT d.id
, (SELECT r.updated_at
FROM store_record r
WHERE r.database_id = d.id
ORDER BY r.updated_at DESC NULLS LAST
LIMIT 1) AS latest_record_updated_at
FROM store_database d;
In combination with a matching index! Ideally:
CREATE INDEX store_record_database_id_idx ON store_record (database_id, updated_at DESC NULLS LAST);
See:
Optimize GROUP BY query to retrieve latest row per user
The same index can also help the complete query above, even if not as dramatically. If the table is vacuumed enough (visibility map up to date) Postgres can do a sequential scan on the smaller index without involving the bigger table. Obviously matters more for wider table rows - especially easing your I/O problem.
(Of course, index maintenance adds costs, too ...)
Upgrade to use parallelism
Upgrade to the latest version of Postgres if at all possible. Postgres 14 or 15 have received various performance improvements over Postgres 12. Most importantly, quoting the release notes for Postgres 14:
Allow REFRESH MATERIALIZED VIEW to use parallelism (Bharath Rupireddy)
Could be massive for your use case. Related:
Materialized view refresh in parallel
Estimates?
Warning: experimental stuff.
You stated:
Approximate values are totally fine
I see only 169 groups ("databases") in the query plan. Postgres maintains column statistics. While the distinct count of groups is tiny and stays below the "statistics target" for the column store_record.database_id (which you have to make sure of!), we can work with this. See:
How to check statistics targets used by ANALYZE?
Unless you have very aggressive autovacuum settings, run ANALYZE on database_id to update column statistics before running the query below; this gives better estimates. (It also updates reltuples and relpages in pg_class.):
ANALYZE public.store_record(database_id);
Or even (to also update the visibility map for above query):
VACUUM ANALYZE public.store_record(database_id);
This was the most expensive part (with collateral benefits). And it's optional.
WITH ct(total_est) AS (
SELECT reltuples / relpages * (pg_relation_size(oid) / 8192)
FROM pg_class
WHERE oid = 'public.store_record'::regclass -- your table here
)
SELECT v.database_id, (ct.total_est * v.freq)::bigint AS estimate
FROM pg_stats s
, ct
, unnest(most_common_vals::text::int[], most_common_freqs) v(database_id, freq)
WHERE s.schemaname = 'public'
AND s.tablename = 'store_record'
AND s.attname = 'database_id';
The query relies on various Postgres internals and may break in future major versions (though that is unlikely). Tested with Postgres 14, but works with Postgres 12, too. It's basically black magic. You need to know what you are doing. You have been warned.
But the query costs close to nothing.
Take exact values for latest_record_updated_at from the fast query above, and join to these estimates for the count.
Basic explanation: Postgres maintains column statistics in the system catalog pg_statistic. pg_stats is a view on it, easier to access. Among other things, "most common values" and their relative frequency are gathered. Represented in most_common_vals and most_common_freqs. Multiplied with the current (estimated) total count, we get estimates per group. You could do all of it manually, but Postgres is probably much faster and better at this.
For the computation of the total estimate ct.total_est see:
Fast way to discover the row count of a table in PostgreSQL
(Note the "Safe and explicit" form for this query.)
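The arithmetic behind the estimate query is simply "estimated total rows times sampled frequency" per group. Here is a small sketch of that step outside the database. The function name and all statistics values are hypothetical; real values would come from pg_stats and pg_class as in the query above.

```python
def estimate_counts(total_est, most_common_vals, most_common_freqs):
    """Scale each value's sampled frequency by the estimated total row count."""
    return {
        val: round(total_est * freq)
        for val, freq in zip(most_common_vals, most_common_freqs)
    }

# Hypothetical statistics for three groups; real values come from pg_stats.
total_est = 21_000_000
vals = [7, 42, 99]
freqs = [0.5, 0.3, 0.2]

estimates = estimate_counts(total_est, vals, freqs)
print(estimates)
```

A group that is missing from most_common_vals gets no estimate at all, which is why the number of distinct groups has to stay below the statistics target.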
Given the explain plan, the sequential scan seems to be causing the slowness. An index can definitely help there.
You can also utilize index-only scans as there are few columns in the query. So you can use something like this for store_record table.
CREATE INDEX idx_store_record_db_id ON store_record USING btree (database_id) INCLUDE (id, updated_at);
An index on the id column of the store_database table is also needed, unless the primary key already provides one (the plans above use store_database_pkey):
CREATE INDEX idx_db_id ON store_database USING btree (id);
Sometimes in such cases it is necessary to think of completely different business logic solutions.
For example, the count operation is a very slow query. This cannot be accelerated by any means in DB. What can be done in such cases? Since I do not know your business logic in full detail, I will tell you several options. However, these options also have disadvantages. For example:
group_id id
---------------
1 12
1 145
1 100
3 652
3 102
We group it once and insert the numbers into a table.
group_id count_id
--------------------
1 3
3 2
After that, whenever a record is inserted into the main table, we update the group table with a trigger. Like this:
update group_table set count_id = count_id + 1 where group_id = new.group_id
Or like this:
update group_table set count_id = (select count(id) from main_table where group_id = new.group_id) where group_id = new.group_id
I am not going into small details here. To update the row safely, we can use the FOR UPDATE clause, which locks the row against other transactions.
So the main idea is: functions like count should be executed separately on grouped data, not on the entire table at once. Similar solutions can be applied. I explained it this way for general understanding.
The disadvantage of this solution: if you have many insert operations on the main table, insert performance will decrease.
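The counter-table idea can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module rather than Postgres, so the trigger syntax is SQLite's and not PL/pgSQL (an assumption made to keep the example self-contained). The trigger names main_table_ins and main_table_del are hypothetical; the table names and sample rows follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, group_id INTEGER NOT NULL);
    CREATE TABLE group_table (group_id INTEGER PRIMARY KEY, count_id INTEGER NOT NULL);

    -- Keep the per-group counter in step with inserts ...
    CREATE TRIGGER main_table_ins AFTER INSERT ON main_table
    BEGIN
        INSERT OR IGNORE INTO group_table (group_id, count_id) VALUES (NEW.group_id, 0);
        UPDATE group_table SET count_id = count_id + 1 WHERE group_id = NEW.group_id;
    END;

    -- ... and with deletes.
    CREATE TRIGGER main_table_del AFTER DELETE ON main_table
    BEGIN
        UPDATE group_table SET count_id = count_id - 1 WHERE group_id = OLD.group_id;
    END;
""")

conn.executemany(
    "INSERT INTO main_table (id, group_id) VALUES (?, ?)",
    [(12, 1), (145, 1), (100, 1), (652, 3), (102, 3)],
)
conn.execute("DELETE FROM main_table WHERE id = 100")

# The cached counters should match a full recount of the main table.
cached = conn.execute(
    "SELECT group_id, count_id FROM group_table ORDER BY group_id"
).fetchall()
recounted = conn.execute(
    "SELECT group_id, count(*) FROM main_table GROUP BY group_id ORDER BY group_id"
).fetchall()
print(cached, recounted)
```

Reading the counters touches only the small table, while every write to the main table pays for one extra update. That is the trade-off described above.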
Parallel plan
If you first collect the store_record statistics and then join that with the store_database, you'll get a better, parallelisable plan.
EXPLAIN (analyze, buffers, verbose)
SELECT
store_database.id as database_id,
s.latest_record_updated_at as latest_record_updated_at,
coalesce(s.record_count,0) as record_count
FROM store_database
LEFT JOIN
( SELECT
store_record.database_id as database_id,
MAX(store_record.updated_at) as latest_record_updated_at,
COUNT(store_record.id) as record_count
FROM store_record
GROUP BY store_record.database_id)
AS s ON (store_database.id = s.database_id);
Here's a demo - at the end you can see both queries return the exact same results, but the one I suggest runs faster and has a more flexible plan. The number of workers dispatched depends on your max_worker_processes, max_parallel_workers, max_parallel_workers_per_gather settings as well as some additional logic inside the planner.
With more rows in store_record the difference will be more pronounced. On my system with 40 million test rows it went down from 14 seconds to 3 seconds with one worker, 1.4 seconds when it caps out dispatching six workers out of 16 available.
Caching
I'm thinking of caching these values on every write to the Task table. Either in Redis or in PostgreSQL itself.
If it's an option, it's worth a try - you can maintain proper accuracy and instantly available statistics at the cost of some (deferrable) table throughput overhead. You can replace your materialized view with a regular table or add the statistics columns to store_database:
create table store_record_statistics(
database_id smallint unique references store_database(id)
on update cascade,
latest_record_updated_at timestamptz,
record_count integer default 0);
insert into store_record_statistics --initializes table with view definition
SELECT g.id, MAX(s.updated_at), COUNT(*)
FROM store_database g LEFT JOIN store_record s ON g.id = s.database_id
GROUP BY g.id;
create index store_record_statistics_idx
on store_record_statistics (database_id)
include (latest_record_updated_at,record_count);
cluster verbose store_record_statistics using store_record_statistics_idx;
And leave keeping the table up to date to a trigger that fires each time store_record changes.
CREATE FUNCTION maintain_store_record_statistics_trigger()
RETURNS TRIGGER LANGUAGE plpgsql AS
$$ BEGIN
IF TG_OP IN ('UPDATE', 'DELETE') THEN --decrement and find second most recent updated_at
UPDATE store_record_statistics srs
SET (record_count,
latest_record_updated_at)
= (record_count - 1,
(SELECT s.updated_at
FROM store_record s
WHERE s.database_id = srs.database_id
ORDER BY s.updated_at DESC NULLS LAST
LIMIT 1))
WHERE database_id = old.database_id;
END IF;
IF TG_OP in ('INSERT','UPDATE') THEN --increment and pick most recent updated_at
UPDATE store_record_statistics
SET (record_count,
latest_record_updated_at)
= (record_count + 1,
greatest(
latest_record_updated_at,
new.updated_at))
WHERE database_id=new.database_id;
END IF;
RETURN NULL;
END $$;
Making the trigger deferrable decouples its execution time from the main operation, but it will still incur its costs at the end of the transaction.
CREATE CONSTRAINT TRIGGER maintain_store_record_statistics
AFTER INSERT OR UPDATE OF database_id, updated_at OR DELETE ON store_record -- include updated_at so latest_record_updated_at also follows in-place updates
INITIALLY DEFERRED FOR EACH ROW
EXECUTE PROCEDURE maintain_store_record_statistics_trigger();
TRUNCATE trigger cannot be declared FOR EACH ROW with the rest of events, so it has to be defined separately
CREATE FUNCTION maintain_store_record_statistics_truncate_trigger()
RETURNS TRIGGER LANGUAGE plpgsql AS
$$ BEGIN
update store_record_statistics
set (record_count, latest_record_updated_at)
= (0 , null);--wipes/resets all stats
RETURN NULL;
END $$;
CREATE TRIGGER maintain_store_record_statistics_truncate
AFTER TRUNCATE ON store_record
EXECUTE PROCEDURE maintain_store_record_statistics_truncate_trigger();
In my test, an update or delete of 10000 random rows in a 100-million-row table ran in seconds. A single insert of 1000 new, randomly generated rows took 25ms without and 200ms with the trigger. A million rows took 30s and 3 minutes, respectively.
A demo.
Partitioning-backed parallel plan
store_record might be a good fit for partitioning:
create table store_record(
id serial not null,
updated_at timestamptz default now(),
database_id smallint references store_database(id)
) partition by range (database_id);
DO $$
declare
vi_database_max_id smallint:=0;
vi_database_id smallint:=0;
vi_database_id_per_partition smallint:=40;--tweak for lower/higher granularity
begin
select max(id) from store_database into vi_database_max_id;
for vi_database_id in 1 .. vi_database_max_id by vi_database_id_per_partition loop
execute format ('
drop table if exists store_record_p_%1$s;
create table store_record_p_%1$s
partition of store_record
for VALUES from (%1$s) to (%1$s + %2$s)
with (parallel_workers=16);
', vi_database_id, vi_database_id_per_partition);
end loop;
end $$ ;
Splitting objects in this manner lets the planner split their scans accordingly, which works best with parallel workers, but doesn't require them. Even your initial, unaltered query behind the view is able to take advantage of this structure:
HashAggregate (cost=60014.27..60041.47 rows=2720 width=18) (actual time=910.657..910.698 rows=169 loops=1)
Output: store_database.id, max(store_record_p_1.updated_at), count(store_record_p_1.id)
Group Key: store_database.id
Buffers: shared hit=827 read=9367 dirtied=5099 written=4145
-> Hash Right Join (cost=71.20..45168.91 rows=1979382 width=14) (actual time=0.064..663.603 rows=1600020 loops=1)
Output: store_database.id, store_record_p_1.updated_at, store_record_p_1.id
Inner Unique: true
Hash Cond: (store_record_p_1.database_id = store_database.id)
Buffers: shared hit=827 read=9367 dirtied=5099 written=4145
-> Append (cost=0.00..39893.73 rows=1979382 width=14) (actual time=0.014..390.152 rows=1600000 loops=1)
Buffers: shared hit=826 read=9367 dirtied=5099 written=4145
-> Seq Scan on public.store_record_p_1 (cost=0.00..8035.02 rows=530202 width=14) (actual time=0.014..77.130 rows=429068 loops=1)
Output: store_record_p_1.updated_at, store_record_p_1.id, store_record_p_1.database_id
Buffers: shared read=2733 dirtied=1367 written=1335
-> Seq Scan on public.store_record_p_41 (cost=0.00..8067.36 rows=532336 width=14) (actual time=0.017..75.193 rows=430684 loops=1)
Output: store_record_p_41.updated_at, store_record_p_41.id, store_record_p_41.database_id
Buffers: shared read=2744 dirtied=1373 written=1341
-> Seq Scan on public.store_record_p_81 (cost=0.00..8029.14 rows=529814 width=14) (actual time=0.017..74.583 rows=428682 loops=1)
Output: store_record_p_81.updated_at, store_record_p_81.id, store_record_p_81.database_id
Buffers: shared read=2731 dirtied=1366 written=1334
-> Seq Scan on public.store_record_p_121 (cost=0.00..5835.90 rows=385090 width=14) (actual time=0.016..45.407 rows=311566 loops=1)
Output: store_record_p_121.updated_at, store_record_p_121.id, store_record_p_121.database_id
Buffers: shared hit=826 read=1159 dirtied=993 written=135
-> Seq Scan on public.store_record_p_161 (cost=0.00..29.40 rows=1940 width=14) (actual time=0.008..0.008 rows=0 loops=1)
Output: store_record_p_161.updated_at, store_record_p_161.id, store_record_p_161.database_id
-> Hash (cost=37.20..37.20 rows=2720 width=2) (actual time=0.041..0.042 rows=169 loops=1)
Output: store_database.id
Buckets: 4096 Batches: 1 Memory Usage: 38kB
Buffers: shared hit=1
-> Seq Scan on public.store_database (cost=0.00..37.20 rows=2720 width=2) (actual time=0.012..0.021 rows=169 loops=1)
Output: store_database.id
Buffers: shared hit=1
Planning Time: 0.292 ms
Execution Time: 910.811 ms
Demo. It's best to test what granularity gives the best performance on your setup. You can even test sub-partitioning, giving each store_record.database_id a partition that is then sub-partitioned into date ranges, simplifying access to the most recent entries.
MATERIALIZED VIEW is not a good idea for that ...
If you just want to "calculate the number of tasks and the most recent Task.created_at value per group", then I suggest you simply:
Add two columns in the group table :
ALTER TABLE IF EXISTS "group" ADD COLUMN IF NOT EXISTS task_count integer DEFAULT 0 ;
ALTER TABLE IF EXISTS "group" ADD COLUMN IF NOT EXISTS last_created_date timestamp ; -- instead of datetime which does not really exist in postgres ...
Update these 2 columns from trigger functions defined on table task :
CREATE OR REPLACE FUNCTION task_insert() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
UPDATE "group" AS g
SET task_count = g.task_count + 1
, last_created_date = GREATEST(g.last_created_date, NEW.created_at) -- GREATEST ignores NULL, so this also covers the first task of a group
WHERE g.id = NEW.group_id ; -- FK column, named group_id as in the view definition
RETURN NEW ;
END ; $$ ;
DROP TRIGGER IF EXISTS task_insert ON task ; -- CREATE OR REPLACE TRIGGER requires Postgres 14 or later
CREATE TRIGGER task_insert AFTER INSERT ON task
FOR EACH ROW EXECUTE FUNCTION task_insert () ;
CREATE OR REPLACE FUNCTION task_delete () RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
UPDATE "group" AS g
SET task_count = g.task_count - 1
, last_created_date = u.last_created_at
FROM
( SELECT max(t.created_at) AS last_created_at -- recompute after the delete
FROM task AS t
WHERE t.group_id = OLD.group_id
) AS u
WHERE g.id = OLD.group_id ;
RETURN OLD ;
END ; $$ ;
DROP TRIGGER IF EXISTS task_delete ON task ;
CREATE TRIGGER task_delete AFTER DELETE ON task
FOR EACH ROW EXECUTE FUNCTION task_delete () ;
You will need to perform a setup action at the beginning ...
UPDATE "group" AS g
SET task_count = ref.count
, last_created_date = ref.last_created_at
FROM
( SELECT group_id -- FK column; a bare "group" is a reserved word
, max(created_at) AS last_created_at
, count(*) AS count
FROM task
GROUP BY group_id
) AS ref
WHERE g.id = ref.group_id ;
... but then you will have no more performance issue with the queries !!!
SELECT * FROM "group"
and you will optimize the size of your database ...
I'm using this public Postgres DB of NEAR protocol: https://github.com/near/near-indexer-for-explorer#shared-public-access
postgres://public_readonly:nearprotocol@mainnet.db.explorer.indexer.near.dev/mainnet_explorer
SELECT "public"."receipts"."receipt_id",
"public"."receipts"."included_in_block_hash",
"public"."receipts"."included_in_chunk_hash",
"public"."receipts"."index_in_chunk",
"public"."receipts"."included_in_block_timestamp",
"public"."receipts"."predecessor_account_id",
"public"."receipts"."receiver_account_id",
"public"."receipts"."receipt_kind",
"public"."receipts"."originated_from_transaction_hash"
FROM "public"."receipts"
WHERE ("public"."receipts"."receipt_id") IN
(SELECT "t0"."receipt_id"
FROM "public"."receipts" AS "t0"
INNER JOIN "public"."action_receipts" AS "j0" ON ("j0"."receipt_id") = ("t0"."receipt_id")
WHERE ("j0"."signer_account_id" = 'ryancwalsh.near'
AND "t0"."receipt_id" IS NOT NULL))
ORDER BY "public"."receipts"."included_in_block_timestamp" DESC
LIMIT 1
OFFSET 0
always results in:
ERROR: canceling statement due to statement timeout
SQL state: 57014
But if I change it to LIMIT 2, the query runs in less than 1 second.
How would that ever be the case? Does that mean the database isn't set up well? Or am I doing something wrong?
P.S. The query here was generated via Prisma. findFirst always times out, so I think I might need to change it to findMany as a workaround.
Your query can be simplified / optimized:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM public.receipts r
WHERE EXISTS (
SELECT FROM public.action_receipts j
WHERE j.receipt_id = r.receipt_id
AND j.signer_account_id = 'ryancwalsh.near'
)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
However, that only scratches the surface of your underlying problem.
Like Kirk already commented, Postgres chooses a different query plan for LIMIT 1, obviously ignorant of the fact that there are only 90 rows in table action_receipts with signer_account_id = 'ryancwalsh.near', while both involved tables have more than 220 million rows, obviously growing steadily.
Changing to LIMIT 2 makes a different query plan seem more favorable, hence the observed difference in performance. (So the query planner has the general idea that the filter is very selective, just not close enough for the corner case of LIMIT 1.)
You should have mentioned cardinalities to set us on the right track.
Knowing that our filter is so selective, we can force a more favorable query plan with a different query:
WITH j AS (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
)
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
This uses the same query plan for LIMIT 1 and LIMIT 2, and either finishes in under 2 ms in my test:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=134904.89..134904.89 rows=1 width=223) (actual time=1.750..1.754 rows=1 loops=1)
CTE j
-> Bitmap Heap Scan on action_receipts (cost=319.46..41564.59 rows=10696 width=44) (actual time=0.058..0.179 rows=90 loops=1)
Recheck Cond: (signer_account_id = 'ryancwalsh.near'::text)
Heap Blocks: exact=73
-> Bitmap Index Scan on action_receipt_signer_account_id_idx (cost=0.00..316.79 rows=10696 width=0) (actual time=0.043..0.043 rows=90 loops=1)
Index Cond: (signer_account_id = 'ryancwalsh.near'::text)
-> Sort (cost=93340.30..93367.04 rows=10696 width=223) (actual time=1.749..1.750 rows=1 loops=1)
Sort Key: r.included_in_block_timestamp DESC
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=0.70..93286.82 rows=10696 width=223) (actual time=0.089..1.705 rows=90 loops=1)
-> CTE Scan on j (cost=0.00..213.92 rows=10696 width=32) (actual time=0.060..0.221 rows=90 loops=1)
-> Index Scan using receipts_pkey on receipts r (cost=0.70..8.70 rows=1 width=223) (actual time=0.016..0.016 rows=1 loops=90)
Index Cond: (receipt_id = j.receipt_id)
Planning Time: 0.281 ms
Execution Time: 1.801 ms
The point is to execute the hugely selective query in the CTE first. Then Postgres does not attempt to walk the index on (included_in_block_timestamp) under the wrong assumption that it would find a matching row soon enough. (It does not.)
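A rough back-of-the-envelope model shows why the index walk looks cheap to the planner for LIMIT 1. This is a simplified sketch, not the planner's actual cost formula: it assumes matching rows are spread evenly through the index on included_in_block_timestamp, and the function name is hypothetical. The row counts are taken from the plan and the text above.

```python
TOTAL_ROWS = 220_000_000    # approximate size of the tables involved
ESTIMATED_MATCHES = 10_696  # planner's row estimate in the plan above
ACTUAL_MATCHES = 90         # rows that really match the filter

def expected_rows_walked(total_rows, matching_rows, limit):
    """Rows an ordered index walk visits before finding `limit` matches,
    assuming matches are spread evenly through the index (a simplification)."""
    return total_rows * limit / matching_rows

planned = expected_rows_walked(TOTAL_ROWS, ESTIMATED_MATCHES, 1)
actual = expected_rows_walked(TOTAL_ROWS, ACTUAL_MATCHES, 1)

print(f"planner expects to walk ~{planned:,.0f} rows")
print(f"an even spread would need ~{actual:,.0f} rows")
print(f"underestimate factor: {actual / planned:.0f}x")
```

Under these assumptions the walk is more than a hundred times longer than planned. If the matching rows are not recent, a descending walk over the timestamp index can take longer still.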
The DB at hand runs Postgres 11, where CTEs are always optimization barriers. In Postgres 12 or later add AS MATERIALIZED to the CTE to guarantee the same effect.
Or you could use the "OFFSET 0 hack" in any version like this:
SELECT r.receipt_id
, r.included_in_block_hash
, r.included_in_chunk_hash
, r.index_in_chunk
, r.included_in_block_timestamp
, r.predecessor_account_id
, r.receiver_account_id
, r.receipt_kind
, r.originated_from_transaction_hash
FROM (
SELECT receipt_id -- is PK!
FROM public.action_receipts
WHERE signer_account_id = 'ryancwalsh.near'
OFFSET 0 -- !
) j
JOIN public.receipts r USING (receipt_id)
ORDER BY r.included_in_block_timestamp DESC
LIMIT 1;
Prevents "inlining" of the subquery to the same effect. Finishes in < 2ms.
See:
How to prevent PostgreSQL from rewriting OUTER JOIN queries?
"Fix" the database?
The proper fix depends on the complete picture. The underlying problem is that Postgres overestimates the number of qualifying rows in table action_receipts. The MCV list (most common values) cannot keep up with 220 million rows (and growing). It's most probably not just ANALYZE lagging behind. (Though it could be: autovacuum not properly configured? Rookie mistake?) Depending on the actual cardinalities (data distribution) in action_receipts.signer_account_id and access patterns you could do various things to "fix" it. Two options:
1. Increase n_distinct and STATISTICS
If most values in action_receipts.signer_account_id are equally rare (high cardinality), consider setting a very large n_distinct value for the column. And combine that with a moderately increased STATISTICS target for the same column to counter errors in the other direction (underestimating the number of qualifying rows for common values). Read both answers here:
Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N
And:
Very bad query plan in PostgreSQL 9.6
2. Local fix with partial index
If action_receipts.signer_account_id = 'ryancwalsh.near' is special in that it's queried more regularly than others, consider a small partial index for it, to fix just that case. Like:
CREATE INDEX ON action_receipts (receipt_id)
WHERE signer_account_id = 'ryancwalsh.near';
I have a query in Postgres / Postgis which is based around finding the nearest points of a given point filtered by some other columns in the table.
The table consists of a bit over 10 million rows and the query looks like this:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1,2,3)
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
The geom column is indexed using GIST and col1 is also indexed.
When the WHERE clause finds many rows that are also near the point this is blazing fast using the geom index:
Limit (cost=0.42..10575.49 rows=1000 width=12) (actual time=0.150..6.742 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..2148612.35 rows=203177 width=12) (actual time=0.149..6.663 rows=1000 loops=1)
Order By: (geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1,2,3}'::double precision[]))
Rows Removed by Filter: 3348
Planning Time: 0.288 ms
Execution Time: 6.817 ms
The problem occurs when the WHERE clause does not find many rows that are close in distance to the given point. Example:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1) -- 1 is very rare near the given point
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
This query runs much much slower:
Limit (cost=0.42..14487.97 rows=1000 width=12) (actual time=8443.514..10629.745 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..1962368.41 rows=135452 width=12) (actual time=8443.513..10629.553 rows=1000 loops=1)
Order By: (t.geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1}'::double precision[]))
Rows Removed by Filter: 5866030
Planning Time: 0.292 ms
Execution Time: 10629.906 ms
I created an index on round(col1) trying to speed up searches on col1 but postgres uses the geom index only which works great when there are many rows nearby that fit the criteria but not so great if there are few rows that match.
If I remove the LIMIT clause Postgres uses the index on col1, which works great when there are few resulting rows but is very slow when the result contains many rows, so I would like to keep the LIMIT clause.
Any suggestions on how I could optimize this query or create an index that handles this?
EDIT:
Thank you for all the suggestions and feedback!
I tried the tip from @JGH and restricted my query using st_dwithin so as not to order the entire table before limiting.
...where st_dwithin(geom, searchpoint, 10000)
This greatly reduced the time of the slow query, down to a few milliseconds. Restricting the search to a constant distance works well in my application, so I will use this as the solution.
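The effect of bounding the search by distance before ordering can be sketched in plain Python. This is only an illustration of the logic, not of how PostGIS executes it: the function name, the tuple layout (id, x, y, col1), and the use of planar coordinates in metres are assumptions made for the example.

```python
import heapq
import math


def nearest_matching(points, origin, wanted, radius, limit):
    """Return ids of the closest matching points within `radius` of `origin`.

    points: iterable of (id, x, y, col1) tuples (hypothetical layout).
    wanted: set of rounded col1 values to keep.
    """
    ox, oy = origin
    candidates = []
    for pid, x, y, col1 in points:
        distance = math.hypot(x - ox, y - oy)
        # Discard far-away rows first, so only a bounded neighbourhood
        # is ever filtered on col1 and ordered by distance.
        if distance <= radius and round(col1) in wanted:
            candidates.append((distance, pid))
    # Keep only the `limit` closest candidates, nearest first.
    return [pid for _, pid in heapq.nsmallest(limit, candidates)]
```

One consequence of this design is that the result can contain fewer than `limit` rows when few matches lie inside the radius, whereas the unbounded query keeps scanning outward until it has found enough.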
I have a situation where the same select query either finishes in 3 seconds or runs for more than 1 hour without finishing (I could not wait that long and killed it). I believe it may have something to do with the automatic statistics collection behavior of the Postgres server. The query is a 3-table join, and one of the tables has over 70 million rows.
-- tmp_variant_filtered has about 4000 rows
-- variant_quick > 70 million rows
-- filtered_variant_quick has about 70 k rows
select count(*)
from "tmp_variant_filtered" t join "variant_quick" v on getchrnum(t.seqname)=v.chrom
and t.pos_start=v.pos and t.ref=v.ref
and t.alt=v.alt
join "filtered_variant_quick" f on f.variantid=v.id
where v.samplerun=165
;
-- running the query immediately after tmp_variant_filtered was loaded
-- Query plan that will take > 1 hour and not finish
Aggregate (cost=332.05..332.06 rows=1 width=8)
-> Nested Loop (cost=0.86..332.05 rows=1 width=0)
-> Nested Loop (cost=0.57..323.74 rows=1 width=8)
Join Filter: ((t.pos_start = v.pos) AND ((t.ref)::text = (v.ref)::text) AND ((t.alt)::text = (v.alt)::text) AND (getchrnum(t.seqname) = v.chrom))
-> Seq Scan on tmp_variant_filtered t (cost=0.00..315.00 rows=1 width=1126)
-> Index Scan using variant_quick_samplerun_chrom_pos_ref_alt_key on variant_quick v (cost=0.57..8.47 rows=1 width=20)
Index Cond: (samplerun = 165)
-> Index Only Scan using filtered_variant_quick_pkey on filtered_variant_quick f (cost=0.29..8.31 rows=1 width=8)
Index Cond: (variantid = v.id)
-- running the query a few minutes after tmp_variant_filtered was loaded with copy command
-- query plan that will take less than 5 seconds to finish
Aggregate (cost=425.69..425.70 rows=1 width=8)
-> Nested Loop (cost=8.78..425.68 rows=1 width=0)
-> Hash Join (cost=8.48..417.37 rows=1 width=8)
Hash Cond: ((t.pos_start = v.pos) AND ((t.ref)::text = (v.ref)::text) AND ((t.alt)::text = (v.alt)::text))
Join Filter: (getchrnum(t.seqname) = v.chrom)
-> Seq Scan on tmp_variant_filtered t (cost=0.00..359.06 rows=4406 width=13)
-> Hash (cost=8.47..8.47 rows=1 width=20)
-> Index Scan using variant_quick_samplerun_chrom_pos_ref_alt_key on variant_quick v (cost=0.57..8.47 rows=1 width=20)
Index Cond: (samplerun = 165)
-> Index Only Scan using filtered_variant_quick_pkey on filtered_variant_quick f (cost=0.29..8.31 rows=1 width=8)
Index Cond: (variantid = v.id)
If you run the query immediately after the tmp table is populated, you get the upper plan, and the query takes a very long time. If you wait a few minutes, you get the lower plan with the hash join. The estimated cost of the upper plan is lower than that of the lower plan.
Since the query was embedded in a script, the upper plan was used, and it usually finished in a couple of hours. If I ran the query in a terminal after terminating the script, the lower plan was used, and it usually took a couple of seconds to finish.
I even did an experiment by copying the tmp_variant_filtered table into another table, say 'test'. If I ran the query immediately after the copy (manually, so with a couple of seconds of delay), it got stuck. After killing the job and waiting a few minutes, the same query became blazing fast.
It was a long time ago that I last did query tuning, and I am just starting to pick it up again. I am reading and trying to understand why Postgres behaves this way. I would appreciate a hint from the experts.
Immediately after inserting the rows into the table, there are no statistics available for column values and their distribution. Thus the optimizer assumes the table is empty. The only sensible strategy to retrieve all rows from a (supposedly) empty table is to do a Seq Scan. You can see this assumption in the execution plan:
Seq Scan on tmp_variant_filtered t (cost=0.00..315.00 rows=1 width=1126)
The rows=1 means that the optimizer expects the Seq Scan to return only one row. With a single expected row, a nested loop looks cheapest. But a nested loop runs its inner side (here the Index Scan on variant_quick) once for every row the outer Seq Scan actually returns, and the table really holds about 4,000 rows. (You could see that more clearly if you use explain (analyze, verbose) to generate the execution plan.)
The statistics are updated in the background by the "autovacuum daemon" if you don't do it manually. That's why you see a better plan after waiting a while: the optimizer now knows the table isn't empty.
Once the optimizer has better knowledge of the size of the table, it chooses the much more efficient Hash Join to bring the two tables together. The Index Scan on variant_quick then runs only once to build the hash table, and each row from the Seq Scan merely probes it.
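The difference between the two join strategies can be sketched in plain Python. This is a toy model rather than Postgres's implementation; the function names and the rows_read counter are made up for illustration. It counts how many rows are read from the input that a nested loop rescans.

```python
def nested_loop_join(outer, inner, key):
    """Rescan `inner` once per outer row; return (matches, rows read from inner)."""
    matches, rows_read = [], 0
    for o in outer:
        for i in inner:  # the whole inner input is read again for every outer row
            rows_read += 1
            if key(o) == key(i):
                matches.append((o, i))
    return matches, rows_read


def hash_join(outer, inner, key):
    """Read `inner` once into a hash table, then probe it per outer row."""
    table, rows_read = {}, 0
    for i in inner:  # single pass over the inner input
        rows_read += 1
        table.setdefault(key(i), []).append(i)
    matches = []
    for o in outer:
        for i in table.get(key(o), []):  # lookup by key, no rescan
            matches.append((o, i))
    return matches, rows_read
```

With one outer row the two approaches read the inner input equally often, which is why the nested loop looks fine to a planner that believes the table is nearly empty. With thousands of outer rows, the nested loop multiplies that work by the outer row count.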
It is always recommended to run analyze (or vacuum analyze) on tables where you changed the number of rows significantly if you need a good execution plan immediately after populating the table.
Quote from the manual
Whenever you have significantly altered the distribution of data within a table, running ANALYZE is strongly recommended. This includes bulk loading large amounts of data into the table. Running ANALYZE (or VACUUM ANALYZE) ensures that the planner has up-to-date statistics about the table. With no statistics or obsolete statistics, the planner might make poor decisions during query planning, leading to poor performance on any tables with inaccurate or nonexistent statistics
Regardless of the mechanism behind this time-dependent behavior, I found a solution: VACUUM ANALYZE my_table. I am not sure whether it is the cure or whether it just adds a small time delay. I was using psycopg2 to execute the query and had to avoid the 'cannot vacuum inside a transaction' exception. Here is the code block you need:
from psycopg2 import sql

self.conn.commit()  # VACUUM cannot run inside a transaction block
self.conn.set_session(autocommit=True)
# sql.Identifier quotes the table name safely (assumes an unqualified name)
self.cursor.execute(
    sql.SQL("VACUUM ANALYZE {}").format(sql.Identifier(one_of_my_tables))
)
self.conn.set_session(autocommit=False)
I applied this to two of the three tables involved in the join in my question. Applying vacuum analyze to just one of them might be sufficient. As mentioned by Basil, I should have asked the question in the dba group.
I've got a pretty large table with nearly 1 million rows and some of the queries are taking a long time (over a minute).
Here is one that's giving me a particularly hard time...
EXPLAIN ANALYZE SELECT "apps".* FROM "apps" WHERE "apps"."kind" = 'software' ORDER BY itunes_release_date DESC, rating_count DESC LIMIT 12;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Limit (cost=153823.03..153823.03 rows=12 width=2091) (actual time=162681.166..162681.194 rows=12 loops=1)
-> Sort (cost=153823.03..154234.66 rows=823260 width=2091) (actual time=162681.159..162681.169 rows=12 loops=1)
Sort Key: itunes_release_date, rating_count
Sort Method: top-N heapsort Memory: 48kB
-> Seq Scan on apps (cost=0.00..150048.41 rows=823260 width=2091) (actual time=0.718..161561.149 rows=808554 loops=1)
Filter: (kind = 'software'::text)
Total runtime: 162682.143 ms
(7 rows)
So, how would I optimize that? PG version is 9.2.4, FWIW.
There are already indexes on (kind) and on (kind, itunes_release_date).
Looks like you're missing an index, e.g. on (kind, itunes_release_date desc, rating_count desc).
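A minimal sketch of such an index, using Python's built-in sqlite3 module as a stand-in for Postgres so the snippet runs without a server. The reduced table layout, the sample rows, and the index name are made up for illustration. In Postgres the corresponding statement would be CREATE INDEX ... ON apps (kind, itunes_release_date DESC, rating_count DESC).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory apps table (hypothetical, reduced columns) with the index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE apps (id INTEGER PRIMARY KEY, kind TEXT, "
        "itunes_release_date TEXT, rating_count INTEGER)"
    )
    rows = [
        (1, "software", "2013-01-01", 10),
        (2, "software", "2013-03-01", 5),
        (3, "ebook", "2013-04-01", 99),
        (4, "software", "2013-03-01", 50),
    ]
    conn.executemany("INSERT INTO apps VALUES (?, ?, ?, ?)", rows)
    # Equality column first, then the ORDER BY columns in the query's sort order.
    conn.execute(
        "CREATE INDEX apps_kind_release_rating_idx ON apps "
        "(kind, itunes_release_date DESC, rating_count DESC)"
    )
    return conn


def top_apps(conn, kind, limit=12):
    """Return ids of the newest, then most-rated, apps of one kind."""
    return conn.execute(
        "SELECT id FROM apps WHERE kind = ? "
        "ORDER BY itunes_release_date DESC, rating_count DESC LIMIT ?",
        (kind, limit),
    ).fetchall()
```

Because the index order matches both the WHERE clause and the ORDER BY, the database should be able to read the first 12 matching entries straight from the index instead of sorting every software row.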
How big is the apps table? Do you have at least this much memory allocated to postgres? If it's having to read from disk every time, query speed will be much slower.
Another thing that may help is to cluster the table on an index that leads with the kind column. This may speed up disk access, since all the software rows will then be stored contiguously on disk.
The only way to speed up this query is to create a composite index on (itunes_release_date, rating_count). It will allow Postgres to pick first N rows from the index directly.