How to manually update the statistics data of tables in PostgreSQL

The ANALYZE statement can be used in PostgreSQL to collect statistics about tables. However, I do not actually want to insert data into the tables; I just need to evaluate the cost of some queries. Is there any way to manually specify the statistics of a table in PostgreSQL without actually putting data into it?

I think you are confusing ANALYZE with EXPLAIN ANALYZE. They are different things.
If you want query costs and timing without applying the changes, the only real option you have is to begin a transaction, execute the query under EXPLAIN ANALYZE, and then ROLLBACK.
This still executes the query, meaning that:
CPU time and I/O are consumed
Locks are still taken and held for the duration
New rows are actually written to the tables and indexes, but are never marked visible. They are cleaned up in the next VACUUM.

You can already EXPLAIN ANALYSE a query even with no data inserted; it will give you a feel for the execution plan.
But nothing beats real data :)
What you can do, as a workaround, is BEGIN a transaction, INSERT some data, EXPLAIN ANALYSE your query, then ROLLBACK the transaction.
Example:
mydatabase=# BEGIN;
BEGIN
mydatabase=# INSERT INTO auth_message (user_id, message) VALUES (1, 'foobar');
INSERT 0 1
mydatabase=# EXPLAIN ANALYSE SELECT count(*) FROM auth_message;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=24.50..24.51 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
   ->  Seq Scan on auth_message  (cost=0.00..21.60 rows=1160 width=0) (actual time=0.007..0.008 rows=1 loops=1)
 Total runtime: 0.042 ms
(3 rows)
mydatabase=# ROLLBACK;
ROLLBACK
mydatabase=# EXPLAIN ANALYSE SELECT count(*) FROM auth_message;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=24.50..24.51 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
   ->  Seq Scan on auth_message  (cost=0.00..21.60 rows=1160 width=0) (actual time=0.009..0.009 rows=0 loops=1)
 Total runtime: 0.043 ms
(3 rows)
The first EXPLAIN ANALYSE shows the "temporary" data (rows=1); after the ROLLBACK, the second one sees none (rows=0).
This is not strictly a "mock", but at least PostgreSQL's plan (and whatever optimizations it applies) should, IMHO, be better than with no data at all (disclaimer: purely intuitive).
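To get plans closer to production, you can bulk-insert synthetic rows in the same rolled-back transaction and refresh statistics before explaining. A sketch using generate_series against the example table above (the volume and values are made up):
BEGIN;
INSERT INTO auth_message (user_id, message)
SELECT g, 'message ' || g
FROM generate_series(1, 100000) g;
ANALYZE auth_message;  -- allowed inside a transaction; the planner now sees the new volume
EXPLAIN ANALYSE SELECT count(*) FROM auth_message;
ROLLBACK;  -- the rows and the updated statistics are both discarded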

Related

How to speed up SUM query in postgres on large table

The problem
I'm trying to run the following query on a SQL view in a postgres database:
SELECT sum(value) FROM invoices_view;
The invoices_view has approximately 45 million rows, the data size of the entire database is 40.5 GB, and the server has 61 GB of RAM.
Currently this query is taking 4.5 seconds, and I'd like it to be ideally under 1 second.
Things I've tried
I cannot add indexes directly to the SQL view, of course, but I do have an index on the underlying table:
CREATE INDEX invoices_on_value_idx ON invoices (value);
I have also run a VACUUM ANALYZE on the invoices table.
EXPLAIN ANALYZE
The output of EXPLAIN ANALYZE is as follows:
EXPLAIN (ANALYZE, BUFFERS) SELECT sum(value) FROM invoices_view;
Finalize Aggregate  (cost=1514195.47..1514195.47 rows=1 width=32) (actual time=5102.805..5102.806 rows=1 loops=1)
  Buffers: shared hit=14996 read=1446679
  I/O Timings: read=3235.147
  ->  Gather  (cost=1514195.16..1514195.47 rows=3 width=32) (actual time=5102.716..5109.229 rows=4 loops=1)
        Workers Planned: 3
        Workers Launched: 3
        Buffers: shared hit=14996 read=1446679
        I/O Timings: read=3235.147
        ->  Partial Aggregate  (cost=1513195.16..1513195.17 rows=1 width=32) (actual time=5097.626..5097.626 rows=1 loops=4)
              Buffers: shared hit=14996 read=1446679
              I/O Timings: read=3235.147
              ->  Parallel Seq Scan on invoices  (cost=0.00..1505835.14 rows=14720046 width=6) (actual time=0.049..3734.495 rows=11408036 loops=4)
                    Buffers: shared hit=14996 read=1446679
                    I/O Timings: read=3235.147
Planning Time: 2.503 ms
Execution Time: 5109.327 ms
Does anyone have any thought on how I might be able to speed this up? Or should I be looking at alternatives to postgres at this point?
More detail
This is the simplest version of the queries I'll need to run over the dataset.
For example, I need to be able to SUM based on user inputs i.e. additional WHERE clauses and GROUP BYs.
Keeping a running total would solve only this simplest case.
You should consider using a trigger to keep track of a rolling sum:
CREATE OR REPLACE FUNCTION func_sum_invoice()
RETURNS trigger AS
$BODY$
BEGIN
    -- add the newly inserted value to the running total
    UPDATE invoices_sum
    SET total = total + NEW.value;
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql;
Then create the trigger using this function:
CREATE TRIGGER sum_invoice
AFTER INSERT ON invoices
FOR EACH ROW
EXECUTE PROCEDURE func_sum_invoice();
Now each insert into the invoices table will fire a trigger that updates the rolling sum. To obtain that sum, you now need only a single-row SELECT, which should be very fast:
SELECT total
FROM invoices_sum;
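This assumes a one-row summary table that exists and is seeded before the trigger fires. A minimal sketch (the layout of invoices_sum is an assumption; it is not given above):
-- hypothetical one-row summary table, seeded with the current total
CREATE TABLE invoices_sum (total numeric NOT NULL);
INSERT INTO invoices_sum (total)
SELECT COALESCE(sum(value), 0) FROM invoices;
Note that UPDATEs and DELETEs on invoices would need corresponding triggers (applying the delta, or subtracting OLD.value) to keep the total accurate.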

Prevent usage of index for a particular query in Postgres

I have a slow query in a Postgres DB. Using explain analyze, I can see that Postgres makes bitmap index scan on two different indexes followed by bitmap AND on the two resulting sets.
Deleting one of the indexes makes the evaluation ten times faster (bitmap index scan is still used on the first index). However, that deleted index is useful in other queries.
Query:
select booking_id
from booking
where substitute_confirmation_token is null
  and date_trunc('day', from_time) >= cast('01/25/2016 14:23:00.004' as date)
  and from_time >= '01/25/2016 14:23:00.004'
  and type = 'LESSON_SUBSTITUTE'
  and valid
order by booking_id;
Indexes:
"idx_booking_lesson_substitute_day" btree (date_trunc('day'::text, from_time)) WHERE valid AND type::text = 'LESSON_SUBSTITUTE'::text
"booking_substitute_confirmation_token_key" UNIQUE CONSTRAINT, btree (substitute_confirmation_token)
Query plan:
Sort  (cost=287.26..287.26 rows=1 width=8) (actual time=711.371..711.377 rows=44 loops=1)
  Sort Key: booking_id
  Sort Method: quicksort  Memory: 27kB
  Buffers: shared hit=8 read=7437 written=1
  ->  Bitmap Heap Scan on booking  (cost=275.25..287.25 rows=1 width=8) (actual time=711.255..711.294 rows=44 loops=1)
        Recheck Cond: ((date_trunc('day'::text, from_time) >= '2016-01-25'::date) AND valid AND ((type)::text = 'LESSON_SUBSTITUTE'::text) AND (substitute_confirmation_token IS NULL))
        Filter: (from_time >= '2016-01-25 14:23:00.004'::timestamp without time zone)
        Buffers: shared hit=5 read=7437 written=1
        ->  BitmapAnd  (cost=275.25..275.25 rows=3 width=0) (actual time=711.224..711.224 rows=0 loops=1)
              Buffers: shared hit=5 read=7433 written=1
              ->  Bitmap Index Scan on idx_booking_lesson_substitute_day  (cost=0.00..20.50 rows=594 width=0) (actual time=0.080..0.080 rows=72 loops=1)
                    Index Cond: (date_trunc('day'::text, from_time) >= '2016-01-25'::date)
                    Buffers: shared hit=5 read=1
              ->  Bitmap Index Scan on booking_substitute_confirmation_token_key  (cost=0.00..254.50 rows=13594 width=0) (actual time=711.102..711.102 rows=2718734 loops=1)
                    Index Cond: (substitute_confirmation_token IS NULL)
                    Buffers: shared read=7432 written=1
Total runtime: 711.436 ms
Can I prevent using a particular index for a particular query in Postgres?
Your clever solution
You already found a clever solution for your particular case: A partial unique index that only covers rare values, so Postgres won't (can't) use the index for the common NULL value.
CREATE UNIQUE INDEX booking_substitute_confirmation_uni
ON booking (substitute_confirmation_token)
WHERE substitute_confirmation_token IS NOT NULL;
It's a textbook use case for a partial index. Literally! The manual has a similar example and this perfectly matching advice to go with it:
Finally, a partial index can also be used to override the system's
query plan choices. Also, data sets with peculiar distributions might
cause the system to use an index when it really should not. In that
case the index can be set up so that it is not available for the
offending query. Normally, PostgreSQL makes reasonable choices about
index usage (e.g., it avoids them when retrieving common values, so
the earlier example really only saves index size, it is not required
to avoid index usage), and grossly incorrect plan choices are cause
for a bug report.
Keep in mind that setting up a partial index indicates that you know
at least as much as the query planner knows, in particular you know
when an index might be profitable. Forming this knowledge requires
experience and understanding of how indexes in PostgreSQL work. In
most cases, the advantage of a partial index over a regular index will
be minimal.
You commented: the table has a few million rows and just a few thousand rows with non-null values, so this is a perfect use case. It will even speed up queries on non-null values of substitute_confirmation_token, because the index is much smaller now.
Answer to question
To answer your original question: it's not possible to "disable" an existing index for a particular query. You would have to drop it, but that's far too expensive.
Fake drop index
You could drop an index inside a transaction, run your SELECT and then, instead of committing, use ROLLBACK. That's fast, but be aware that (per documentation):
A normal DROP INDEX acquires exclusive lock on the table, blocking
other accesses until the index drop can be completed.
So this is no good for multi-user environments.
BEGIN;
DROP INDEX big_user_id_created_at_idx;
SELECT ...;
ROLLBACK; -- so the index is preserved after all
More detailed statistics
Normally, though, it should be enough to raise the STATISTICS target for the column, so Postgres can more reliably identify common values and avoid the index for those. Try:
ALTER TABLE booking ALTER COLUMN substitute_confirmation_token SET STATISTICS 2000;
Then: ANALYZE booking; before you try your query again. 2000 is an example value. Related:
Keep PostgreSQL from sometimes choosing a bad query plan

Postgres not using index when index scan is much better option

I have a simple query joining two tables that's really slow. I found out that the query plan does a seq scan on the large table email_activities (~10m rows), while I think using indexes in nested loops would actually be faster.
I rewrote the query using a subquery in an attempt to force the use of the index, then noticed something interesting. If you look at the two query plans below, you will see that when I limit the result set of the subquery to 43k, the plan does use the index on email_activities, while setting the limit to even 44k causes the plan to use a seq scan on email_activities. One is clearly more efficient than the other, but Postgres doesn't seem to care.
What could cause this? Is there a config somewhere that forces the use of a hash join if one of the sets is larger than a certain size?
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 43000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
  ->  Nested Loop  (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
        ->  HashAggregate  (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
              ->  Limit  (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
                    ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
                          Index Cond: (email_campaign_id = 1607)
        ->  Index Only Scan using index_email_activities_on_email_recipient_id on email_activities  (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
              Index Cond: (email_recipient_id = email_recipients.id)
              Heap Fetches: 40789
Total runtime: 224.675 ms
And:
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 50000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
  ->  Hash Semi Join  (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
        Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
        ->  Seq Scan on email_activities  (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
        ->  Hash  (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
              Buckets: 8192  Batches: 1  Memory Usage: 1758kB
              ->  Limit  (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
                    ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
                          Index Cond: (email_campaign_id = 1607)
Total runtime: 3050.660 ms
Version: PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
email_activities: ~10m rows
email_recipients: ~11m rows
Index (Only) Scan --> Bitmap Index Scan --> Sequential Scan
For few rows it pays to run an index scan. If enough data pages are visible to all (= vacuumed enough, and not too much concurrent write load) and the index can provide all column values needed, then a faster index only scan is used. With more rows expected to be returned (a higher percentage of the table, and depending on data distribution, value frequencies and row width) it becomes more likely to find several rows on one data page. Then it pays to switch to a bitmap index scan. (Or to combine multiple distinct indexes.) Once a large percentage of data pages has to be visited anyway, it's cheaper to run a sequential scan, filter surplus rows and skip the overhead for indexes altogether.
Index usage becomes (much) cheaper and more likely when accessing data pages in random order is not (much) more expensive than accessing them in sequential order. That's the case when using SSDs instead of spinning disks, and even more so the more data is cached in RAM - provided the respective configuration parameters random_page_cost and effective_cache_size are set accordingly.
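For illustration, session settings along these lines are typical for SSD-backed, well-cached servers (the exact values here are assumptions to tune for your own hardware, not part of the answer above):
SET random_page_cost = 1.1;         -- default 4.0 assumes spinning disks
SET effective_cache_size = '48GB';  -- roughly the RAM available for caching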
In your case, Postgres switches to a sequential scan, expecting to find rows=263962, that's already 3 % of the whole table. (While only rows=47935 are actually found, see below.)
More in this related answer:
Efficient PostgreSQL query on timestamp using index or bitmap index scan?
Beware of forcing query plans
You cannot force a certain planner method directly in Postgres, but you can make other methods seem extremely expensive for debugging purposes. See Planner Method Configuration in the manual.
SET enable_seqscan = off (as suggested in another answer) does that for sequential scans. But it is intended for debugging purposes in your session only. Do not use this as a general setting in production unless you know exactly what you are doing. It can force ridiculous query plans. The manual:
These configuration parameters provide a crude method of influencing
the query plans chosen by the query optimizer. If the default plan
chosen by the optimizer for a particular query is not optimal, a
temporary solution is to use one of these configuration parameters to force the optimizer to choose a different plan. Better ways to
improve the quality of the plans chosen by the optimizer include
adjusting the planner cost constants (see Section 19.7.2),
running ANALYZE manually, increasing the value of the
default_statistics_target configuration parameter, and
increasing the amount of statistics collected for specific columns
using ALTER TABLE SET STATISTICS.
That's already most of the advice you need.
Keep PostgreSQL from sometimes choosing a bad query plan
In this particular case, Postgres expects 5-6 times more hits on email_activities.email_recipient_id than are actually found:
estimated rows=227007 vs. actual ... rows=40789
estimated rows=263962 vs. actual ... rows=47935
If you run this query often, it will pay to have ANALYZE look at a bigger sample for more accurate statistics on the particular column. Your table is big (~ 10M rows), so make that:
ALTER TABLE email_activities ALTER COLUMN email_recipient_id
SET STATISTICS 3000; -- max 10000, default 100
Then ANALYZE email_activities;
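To verify that the bigger sample actually changed anything, the gathered statistics can be inspected in the pg_stats view, for example:
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'email_activities'
AND attname = 'email_recipient_id';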
Measure of last resort
In very rare cases you might resort to forcing an index with SET LOCAL enable_seqscan = off in a separate transaction or in a function with its own environment. Like:
CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int, _limit int)
  RETURNS bigint AS
$func$
   SELECT COUNT(DISTINCT a.email_recipient_id)
   FROM   email_activities a
   WHERE  a.email_recipient_id IN (
      SELECT id
      FROM   email_recipients
      WHERE  email_campaign_id = $1
      LIMIT  $2)  -- or consider query below
$func$ LANGUAGE sql VOLATILE COST 100000 SET enable_seqscan = off;
The setting only applies to the local scope of the function.
Warning: This is just a proof of concept. Even this much less radical manual intervention might bite you in the long run. Cardinalities, value frequencies, your schema, global Postgres settings, everything changes over time. You are going to upgrade to a new Postgres version. The query plan you force now, may become a very bad idea later.
And typically this is just a workaround for a problem with your setup. Better find and fix it.
Alternative query
Essential information is missing in the question, but this equivalent query is probably faster and more likely to use an index on (email_recipient_id) - increasingly so for a bigger LIMIT.
SELECT COUNT(*) AS ct
FROM  (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  43000
   ) r
WHERE EXISTS (
   SELECT 1 FROM email_activities
   WHERE  email_recipient_id = r.id);
A sequential scan can be more efficient, even when an index exists. In this case, Postgres seems to estimate quite wrongly.
An ANALYZE <table> on all related tables can help in such cases. If it doesn't, you can set the variable enable_seqscan to OFF to force Postgres to use an index whenever technically possible, at the expense that sometimes an index scan will be used when a sequential scan would perform better.
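A minimal, session-scoped sketch of that experiment, using SET LOCAL so the setting cannot leak out of the transaction (the test query is just an example; substitute your own):
BEGIN;
SET LOCAL enable_seqscan = off;  -- applies to this transaction only
EXPLAIN ANALYZE SELECT count(*) FROM email_activities;  -- your slow query here
ROLLBACK;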

Postgres inconsistent use of Index vs Seq Scan

I'm having difficulty understanding what I perceive as an inconsistency in how Postgres chooses to use indexes. We have a query based on NOT IN against an indexed column that Postgres executes sequentially, but when we perform the same query as IN, it uses the index.
I've created a simplistic example that I believe demonstrates the issue; notice that this first query is sequential:
CREATE TABLE node (
  id  SERIAL PRIMARY KEY,
  vid INTEGER
);
CREATE INDEX x ON node(vid);
INSERT INTO node(vid) VALUES (1), (2);
EXPLAIN ANALYZE
SELECT *
FROM node
WHERE NOT vid IN (1);
Seq Scan on node  (cost=0.00..36.75 rows=2129 width=8) (actual time=0.009..0.010 rows=1 loops=1)
  Filter: (vid <> 1)
  Rows Removed by Filter: 1
Total runtime: 0.025 ms
But if we invert the query to IN, you'll notice that it now decided to use the index
EXPLAIN ANALYZE
SELECT *
FROM node
WHERE vid IN (2);
Bitmap Heap Scan on node  (cost=4.34..15.01 rows=11 width=8) (actual time=0.017..0.017 rows=1 loops=1)
  Recheck Cond: (vid = 1)
  ->  Bitmap Index Scan on x  (cost=0.00..4.33 rows=11 width=0) (actual time=0.012..0.012 rows=1 loops=1)
        Index Cond: (vid = 1)
Total runtime: 0.039 ms
Can anyone shed any light on this? Specifically, is there a way to rewrite our NOT IN to work with the index (when obviously the result set is not as simplistic as just 1 or 2)?
We are using Postgres 9.2 on CentOS 6.6
PostgreSQL will use an index when it makes sense. It is likely that the statistics indicate that your NOT IN matches too many tuples for an index to be effective.
You can test this by doing the following:
set enable_seqscan to false;
explain analyze .... NOT IN
set enable_seqscan to true;
explain analyze .... NOT IN
The results will tell you whether PostgreSQL is making the correct decision. If it isn't, you can adjust the statistics of the column and/or the costs (random_page_cost) to get the desired behavior.
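Spelled out against the node example from the question, that test looks like this:
set enable_seqscan to false;
EXPLAIN ANALYZE SELECT * FROM node WHERE NOT vid IN (1);
set enable_seqscan to true;
EXPLAIN ANALYZE SELECT * FROM node WHERE NOT vid IN (1);
If the plan forced with enable_seqscan off is not actually faster, the planner's sequential scan was the right choice all along.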

Why does this SUM() function take so long in PostgreSQL?

This is my query:
SELECT SUM(amount) FROM bill WHERE name = 'peter'
There are 800K+ rows in the table. EXPLAIN ANALYZE says:
Aggregate  (cost=288570.06..288570.07 rows=1 width=4) (actual time=537213.327..537213.328 rows=1 loops=1)
  ->  Seq Scan on bill  (cost=0.00..288320.94 rows=498251 width=4) (actual time=48385.201..535941.041 rows=800947 loops=1)
        Filter: ((name)::text = 'peter'::text)
        Rows Removed by Filter: 8
Total runtime: 537213.381 ms
All rows are affected, and this is correct. But why so long? A similar query without WHERE runs way faster:
EXPLAIN ANALYZE SELECT SUM(amount) FROM bill
Aggregate  (cost=137523.31..137523.31 rows=1 width=4) (actual time=2198.663..2198.664 rows=1 loops=1)
  ->  Index Only Scan using idx_amount on bill  (cost=0.00..137274.17 rows=498268 width=4) (actual time=0.032..1223.512 rows=800955 loops=1)
        Heap Fetches: 533399
Total runtime: 2198.717 ms
I have an index on amount and an index on name. Have I missed any indexes?
ps. I managed to solve the problem just by adding a new index ON bill(name, amount). I didn't get why it helped, so let's leave the question open for some time...
Since you are searching for a specific name, you should have an index that has name as the first column, e.g. CREATE INDEX IX_bill_name ON bill( name ).
But Postgres can still opt to do a full table scan if it estimates your index to not be specific enough, i.e. if it thinks it is faster to just scan all rows and pick the matching ones instead of consulting an index and jumping around in the table to gather the matching rows. Postgres uses a cost-based estimation technique that weighs random disk reads as more expensive than sequential reads.
For an index to actually be used in your situation, there should be no more than 10% of the rows matching what you are searching for. Since most of your rows have name=peter it is actually faster to do a full table scan.
As to why the SUM without filtering runs faster: it has to do with the overall width of the table. With a WHERE clause, Postgres has to sequentially read all rows in the table so it can disregard those that do not match the filter. Without a WHERE clause, Postgres can instead read all the amounts from the index. Because the index on amount contains the amounts and pointers to the corresponding rows, but no other data from the table, it is simply less data to wade through. Based on the big difference in performance, I guess you have quite a lot of other fields in your table.
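That also explains why the asker's composite index fixed it: with both columns in one index, the filtered SUM can be answered by an index(-only) scan that never has to visit the wide heap rows. A sketch (the index name is ours; the asker did not give one):
-- 'name' locates the matching entries; 'amount' is available in the index itself
CREATE INDEX idx_bill_name_amount ON bill (name, amount);
EXPLAIN ANALYZE SELECT SUM(amount) FROM bill WHERE name = 'peter';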