How to measure time in postgres - sql

In Postgres I want to know the time taken to generate a plan. I know that \timing gives me the time taken to execute the plan after the optimal plan has been found, but I want to find out the time Postgres spends finding that plan. Is it possible to determine this time in Postgres? If yes, then how?
Also, query planners at times do not find the optimal plan. Can I force Postgres to use the optimal plan for plan generation? If yes, then how can I do so?

For the time taken to prepare a plan and the time taken to execute it, you can use explain (which merely finds a plan) vs explain analyze (which actually runs it) with \timing turned on:
test=# explain select * from test where val = 1 order by id limit 10;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.00..4.35 rows=10 width=8)
   ->  Index Scan using test_pkey on test  (cost=0.00..343.25 rows=789 width=8)
         Filter: (val = 1)
(3 rows)
Time: 0.759 ms
test=# explain analyze select * from test where val = 1 order by id limit 10;
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..4.35 rows=10 width=8) (actual time=0.122..0.170 rows=10 loops=1)
   ->  Index Scan using test_pkey on test  (cost=0.00..343.25 rows=789 width=8) (actual time=0.121..0.165 rows=10 loops=1)
         Filter: (val = 1)
         Rows Removed by Filter: 67
 Total runtime: 0.204 ms
(5 rows)
Time: 1.019 ms
Note that there is a tiny overhead in both commands to actually output the plan.
It sometimes happens, e.g. in DB2, that because finding the optimal plan would take a lot of time, the database engine decides to use a suboptimal plan. I think this must be the case with Postgres also.
In Postgres, this only occurs if your query is complex enough that the planner cannot reasonably do an exhaustive search. When you reach the relevant thresholds (which are high, if your use cases are typical), the planner uses the genetic query optimizer:
http://www.postgresql.org/docs/current/static/geqo-pg-intro.html
If that is the case, then how can I fiddle with Postgres so that it chooses the optimal plan?
In more general use cases, there are many things that you can fiddle with, but be very wary of messing around with them (apart, perhaps, from collecting a bit more statistics on a select few columns using ALTER TABLE ... SET STATISTICS):
http://www.postgresql.org/docs/current/static/runtime-config-query.html
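For example, a minimal sketch of that statistics tweak, using the test table and val column from the example above (the target of 500 is just an illustrative value):
ALTER TABLE test ALTER COLUMN val SET STATISTICS 500;
ANALYZE test;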

If you use the \i command in the psql client, then add the following line to the SQL file:
\timing true
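A minimal sketch, assuming a hypothetical file name queries.sql:
-- queries.sql
\timing true
explain analyze select * from test where val = 1 order by id limit 10;
Then, inside psql, run it with:
\i queries.sql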

Related

Postgres select query making sequential scan instead of index scan on table with 18 Million rows

I have a Postgres table which has almost 18 million rows, and I am trying to run this query:
select * from answer where submission_id = 5 and deleted_at is NULL;
There is a partial index on the table on the submission_id column. This is the command used to create the index:
CREATE INDEX answer_submission_id ON answer USING btree (submission_id) WHERE (deleted_at IS NULL)
This is the explain analyze output of the above select query:
 Gather  (cost=1000.00..3130124.70 rows=834 width=377) (actual time=7607.568..7610.130 rows=2 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=2144966 read=3
   I/O Timings: read=6.169
   ->  Parallel Seq Scan on answer  (cost=0.00..3129041.30 rows=348 width=377) (actual time=6501.951..7604.623 rows=1 loops=3)
         Filter: ((deleted_at IS NULL) AND (submission_id = 5))
         Rows Removed by Filter: 62213625
         Buffers: shared hit=2144966 read=3
         I/O Timings: read=6.169
 Planning Time: 0.117 ms
 Execution Time: 7610.154 ms
Ideally it should pick the answer_submission_id index, but Postgres is going for a sequential scan.
Any help would be really appreciated.
The execution plan shows a large deviation between the estimated rows and the rows actually read.
The PostgreSQL optimizer is a cost-based optimizer (CBO): queries are executed using the plan with the smallest estimated cost,
so wrong statistics can make it choose a bad execution plan.
Here is a link that illustrates wrong statistics causing a slow query:
Why are bad row estimates slow in Postgres?
First, I would use this query to look up the last_analyze and last_vacuum times:
SELECT
    schemaname, relname,
    last_vacuum, last_autovacuum,
    vacuum_count, autovacuum_count,
    last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'tablename';
If your statistics are wrong, you can use ANALYZE "tablename" to collect new statistics from the table; how long the ANALYZE scan takes depends on the table size.
For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time. Note, however, that the statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This might result in small changes in the planner's estimated costs shown by EXPLAIN. In rare situations, this non-determinism will cause the planner's choices of query plans to change after ANALYZE is run. To avoid this, raise the amount of statistics collected by ANALYZE, as described below.
When we UPDATE and DELETE data, that creates dead tuples which might still exist in the heap or indexes but can no longer be queried; VACUUM helps us reclaim the storage occupied by those dead tuples.
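A minimal sketch of both maintenance steps, using the answer table from the question (run whichever is needed):
ANALYZE answer;
VACUUM (VERBOSE) answer;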

Why is this SQL statement so fast?

The table in question has 3.8M records. The data column is indexed on a different field: "idx2_table_on_data_id" btree ((data ->> 'id'::text)). I assumed the sequential scan would be very slow but it is completing in just over 1 second. data->'array' does not exist in many of the records, fyi. Why is this running so quickly? Postgres v10
db=> explain analyze select * from table where jsonb_array_length(data->'array') != 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on table (cost=0.00..264605.21 rows=3797785 width=681) (actual time=0.090..1189.997 rows=1762 loops=1)
Filter: (jsonb_array_length((data -> 'array'::text)) <> 0)
Rows Removed by Filter: 3818154
Planning time: 0.561 ms
Execution time: 1190.492 ms
(5 rows)
We could tell for sure if you had run EXPLAIN (ANALYZE, BUFFERS), but odds are that most of the data were cached in RAM.
Also jsonb_array_length(data->'array') is not terribly expensive if the JSON is short.
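For reference, a minimal sketch of the suggested command, keeping the placeholder table name from the question:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM table WHERE jsonb_array_length(data->'array') != 0;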

Postgresql select count query takes long time

I have a table named events in my PostgreSQL 9.5 database, and this table has about 6 million records.
I am running a select count(event_id) from events query, but it takes 40 seconds. That is a very long time for a database. The event_id field of the table is the primary key and is indexed. Why does this take so long? (The server is an Ubuntu VM on VMware with 4 CPUs.)
Explain:
"Aggregate (cost=826305.19..826305.20 rows=1 width=0) (actual time=24739.306..24739.306 rows=1 loops=1)"
" Buffers: shared hit=13 read=757739 dirtied=53 written=48"
" -> Seq Scan on event_source (cost=0.00..812594.55 rows=5484255 width=0) (actual time=0.014..24087.050 rows=6320689 loops=1)"
" Buffers: shared hit=13 read=757739 dirtied=53 written=48"
"Planning time: 0.369 ms"
"Execution time: 24739.364 ms"
I know that this is an old question and the existing answer covers the vast majority of info around this, but I just ran into a situation where a table of 1.3 million rows was taking about 35 seconds to perform a simple SELECT COUNT(*). None of the other solutions helped. The issue ended up being that the table was just bloated and hadn't been vacuumed, so Postgres couldn't figure out the most optimal way to query the data. After I ran this, the query time dropped down to about 25ms!
VACUUM (ANALYZE, VERBOSE, FULL) my_table_name;
Hope this helps someone else!
There are multiple factors that play a big role in how PostgreSQL decides to execute the count(), but first of all: the column you use inside the count function does not matter. In fact, if you don't need a DISTINCT count, stick with count(*).
You can try the following to force an index-only scan:
SELECT count(*) FROM (SELECT event_id FROM events) t;
...if that still results in a sequential scan, then most likely the index is not much smaller than the table itself. To still see how an index-only scan would perform, you can force it with:
SELECT count(*) FROM (SELECT event_id FROM events ORDER BY 1) t;
If that is not much faster, you should also consider an upgrade of PostgreSQL to at least version 9.6, which introduces parallel sequential scans to speed up these things.
In addition, you can achieve dramatic speedups choosing from a variety of techniques to provide counts which largely depend on your use-case and your requirements:
Faster PostgreSQL Counting
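One of the techniques covered there is reading the planner's estimate instead of counting exact rows; a minimal sketch (the figure is approximate and only as fresh as the table's last ANALYZE or VACUUM):
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE relname = 'events';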
Last but not least, please always provide the output of an extended explain, as @a_horse_with_no_name already recommended, e.g.:
EXPLAIN (ANALYZE, BUFFERS) SELECT count(event_id) FROM events;

Why does this simple query not use the index in postgres?

In my postgreSQL database I have a table named "product". In this table I have a column named "date_touched" with type timestamp. I created a simple btree index on this column. This is the schema of my table (I omitted irrelevant column & index definitions):
Table "public.product"
Column | Type | Modifiers
---------------------------+--------------------------+-------------------
id | integer | not null default nextval('product_id_seq'::regclass)
date_touched | timestamp with time zone | not null
Indexes:
"product_pkey" PRIMARY KEY, btree (id)
"product_date_touched_59b16cfb121e9f06_uniq" btree (date_touched)
The table has ~300,000 rows and I want to get the n-th element from the table ordered by "date_touched". When I want to get the 1000th element, it takes 0.2 s, but when I want to get the 100,000th element, it takes about 6 s. My question is: why does it take so much time to retrieve the 100,000th element, even though I've defined a btree index?
Here is my query with explain analyze that shows postgreSQL does not use the btree index and instead sorts all rows to find the 100,000th element:
first query (1000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 1000;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Limit  (cost=3035.26..3038.29 rows=1 width=12) (actual time=160.208..160.209 rows=1 loops=1)
   ->  Index Scan using product_date_touched_59b16cfb121e9f06_uniq on product  (cost=0.42..1000880.59 rows=329797 width=12) (actual time=16.651..159.766 rows=1001 loops=1)
 Total runtime: 160.395 ms
second query (100,000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;
                                             QUERY PLAN
------------------------------------------------------------------------------------------------------
 Limit  (cost=106392.87..106392.88 rows=1 width=12) (actual time=6621.947..6621.950 rows=1 loops=1)
   ->  Sort  (cost=106142.87..106967.37 rows=329797 width=12) (actual time=6381.174..6568.802 rows=100001 loops=1)
         Sort Key: date_touched
         Sort Method: external merge  Disk: 8376kB
         ->  Seq Scan on product  (cost=0.00..64637.97 rows=329797 width=12) (actual time=1.357..4184.115 rows=329613 loops=1)
 Total runtime: 6629.903 ms
It is a very good thing that a SeqScan is used here. Your OFFSET 100000 is not a good fit for an IndexScan.
A bit of theory
Btree indexes contain 2 structures inside:
balanced tree and
double-linked list of keys.
The first structure allows fast key lookups; the second is responsible for the ordering. For bigger tables, the linked list cannot fit into a single page, and therefore it is a list of linked pages, where each page's entries maintain the ordering specified during index creation.
It is wrong to think, though, that such pages sit together on disk. In fact, it is more probable that they are spread across different locations. And in order to read pages in the index's order, the system has to perform random disk reads. Random disk I/O is expensive compared to sequential access. Therefore a good optimizer will prefer a SeqScan instead.
I highly recommend “SQL Performance Explained” book to better understand indexes. It is also available on-line.
What is going on?
Your OFFSET clause causes the database to read the index's linked list of keys (causing lots of random disk reads) and then discard all those results until it hits the wanted offset. And it is good, in fact, that Postgres decided to use SeqScan + Sort here: this should be faster.
You can check this assumption by:
running EXPLAIN (ANALYZE, BUFFERS) of your big-OFFSET query,
then doing SET enable_seqscan TO 'off';
and running EXPLAIN (ANALYZE, BUFFERS) again, comparing the results; a sketch of this follows below.
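A minimal sketch of that comparison, using the big-OFFSET query from the question (remember to reset the setting afterwards):
EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id FROM product ORDER BY product.date_touched ASC LIMIT 1 OFFSET 100000;
SET enable_seqscan TO 'off';
EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id FROM product ORDER BY product.date_touched ASC LIMIT 1 OFFSET 100000;
RESET enable_seqscan;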
In general, it is better to avoid OFFSET, as DBMSes do not always pick the right approach here. (BTW, which version of PostgreSQL are you using?)
Here's a comparison of how it performs for different offset values.
EDIT: In order to avoid OFFSET, one has to base pagination on the real data that exists in the table and is part of the index. For this particular case, the following might be possible:
show the first N (say, 20) elements;
include the maximal date_touched shown on the page in all the “Next” links. You can compute this value on the application side. Do the same for the “Previous” links, except include the minimal date_touched for those;
on the server side you will receive that limiting value. Therefore, say for the “Next” case, you can do a query like this:
SELECT id
FROM product
WHERE date_touched > $max_date_seen_on_the_page
ORDER BY date_touched ASC
LIMIT 20;
This query makes best use of the index.
Of course, you can adjust this example to your needs. I used pagination as it is a typical case for the OFFSET.
One more note: querying 1 row many times, increasing the offset by 1 for each query, will be much more time consuming than doing a single batch query that returns all those records, which are then iterated over on the application side.

Postgres not using index when index scan is much better option

I have a simple query to join two tables that's being really slow. I found out that the query plan does a seq scan on the large table email_activities (~10m rows), while I think using indexes with nested loops would actually be faster.
I rewrote the query using a subquery in an attempt to force the use of the index, and then noticed something interesting. If you look at the two query plans below, you will see that when I limit the result set of the subquery to 43k, the query plan does use the index on email_activities, while setting the limit in the subquery to even 44k causes the query plan to use a seq scan on email_activities. One is clearly more efficient than the other, but Postgres doesn't seem to care.
What could cause this? Is there a config somewhere that forces the use of a hash join if one of the sets is larger than a certain size?
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 43000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
   ->  Nested Loop  (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
         ->  HashAggregate  (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
               ->  Limit  (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
         ->  Index Only Scan using index_email_activities_on_email_recipient_id on email_activities  (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
               Index Cond: (email_recipient_id = email_recipients.id)
               Heap Fetches: 40789
 Total runtime: 224.675 ms
And:
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 50000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
   ->  Hash Semi Join  (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
         Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
         ->  Seq Scan on email_activities  (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
         ->  Hash  (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 1758kB
               ->  Limit  (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
 Total runtime: 3050.660 ms
Version: PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
email_activities: ~10m rows
email_recipients: ~11m rows
Index (Only) Scan --> Bitmap Index Scan --> Sequential Scan
For few rows it pays to run an index scan. If enough data pages are visible to all (= vacuumed enough, and not too much concurrent write load) and the index can provide all column values needed, then a faster index only scan is used. With more rows expected to be returned (higher percentage of the table and depending on data distribution, value frequencies and row width) it becomes more likely to find several rows on one data page. Then it pays to switch to a bitmap index scan. (Or to combine multiple distinct indexes.) Once a large percentage of data pages has to be visited anyway, it's cheaper to run a sequential scan, filter surplus rows and skip the overhead for indexes altogether.
Index usage becomes (much) cheaper and more likely when accessing data pages in random order is not (much) more expensive than accessing them in sequential order. That's the case when using SSD instead of spinning disks, or even more so the more is cached in RAM - and the respective configuration parameters random_page_cost and effective_cache_size are set accordingly.
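For illustration, a sketch of inspecting and adjusting these two parameters for the current session only; the values shown are common suggestions for SSD-backed or well-cached setups, not recommendations for your hardware:
SHOW random_page_cost;
SHOW effective_cache_size;
SET random_page_cost = 1.1;        -- illustrative value for fast random I/O
SET effective_cache_size = '8GB';  -- illustrative value, roughly the RAM available for caching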
In your case, Postgres switches to a sequential scan, expecting to find rows=263962, that's already 3 % of the whole table. (While only rows=47935 are actually found, see below.)
More in this related answer:
Efficient PostgreSQL query on timestamp using index or bitmap index scan?
Beware of forcing query plans
You cannot force a certain planner method directly in Postgres, but you can make other methods seem extremely expensive for debugging purposes. See Planner Method Configuration in the manual.
SET enable_seqscan = off (as suggested in another answer) does that to sequential scans. But that's intended for debugging purposes in your session only. Do not use this as a general setting in production unless you know exactly what you are doing. It can force ridiculous query plans. The manual:
These configuration parameters provide a crude method of influencing the query plans chosen by the query optimizer. If the default plan chosen by the optimizer for a particular query is not optimal, a temporary solution is to use one of these configuration parameters to force the optimizer to choose a different plan. Better ways to improve the quality of the plans chosen by the optimizer include adjusting the planner cost constants (see Section 19.7.2), running ANALYZE manually, increasing the value of the default_statistics_target configuration parameter, and increasing the amount of statistics collected for specific columns using ALTER TABLE SET STATISTICS.
That's already most of the advice you need.
Keep PostgreSQL from sometimes choosing a bad query plan
In this particular case, Postgres expects 5-6 times more hits on email_activities.email_recipient_id than are actually found:
estimated rows=227007 vs. actual ... rows=40789
estimated rows=263962 vs. actual ... rows=47935
If you run this query often it will pay to have ANALYZE look at a bigger sample for more accurate statistics on the particular column. Your table is big (~ 10M rows), so make that:
ALTER TABLE email_activities ALTER COLUMN email_recipient_id
SET STATISTICS 3000; -- max 10000, default 100
Then ANALYZE email_activities;
Measure of last resort
In very rare cases you might resort to forcing an index with SET LOCAL enable_seqscan = off in a separate transaction or in a function with its own environment. Like:
CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int, _limit int)
RETURNS bigint AS
$func$
SELECT COUNT(DISTINCT a.email_recipient_id)
FROM email_activities a
WHERE a.email_recipient_id IN (
SELECT id
FROM email_recipients
WHERE email_campaign_id = $1
LIMIT $2) -- or consider query below
$func$ LANGUAGE sql VOLATILE COST 100000 SET enable_seqscan = off;
The setting only applies to the local scope of the function.
Warning: This is just a proof of concept. Even this much less radical manual intervention might bite you in the long run. Cardinalities, value frequencies, your schema, global Postgres settings, everything changes over time. You are going to upgrade to a new Postgres version. The query plan you force now may become a very bad idea later.
And typically this is just a workaround for a problem with your setup. Better find and fix it.
Alternative query
Essential information is missing in the question, but this equivalent query is probably faster and more likely to use an index on (email_recipient_id) - increasingly so for a bigger LIMIT.
SELECT COUNT(*) AS ct
FROM (
SELECT id
FROM email_recipients
WHERE email_campaign_id = 1607
LIMIT 43000
) r
WHERE EXISTS (
SELECT FROM email_activities
WHERE email_recipient_id = r.id);
A sequential scan can be more efficient, even when an index exists. In this case, Postgres seems to estimate things rather badly.
An ANALYZE <table> on all related tables can help in such cases. If it doesn't, you can set the variable enable_seqscan to OFF to force Postgres to use an index whenever technically possible, at the expense that sometimes an index scan will be used when a sequential scan would perform better.
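A minimal sketch of that advice, using the two tables from this question (keep the enable_seqscan change to your session and reset it afterwards):
ANALYZE email_activities;
ANALYZE email_recipients;
SET enable_seqscan = off;
-- run and inspect the query here
RESET enable_seqscan;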