PostgreSQL select count query takes a long time - sql

I have a table named events in my PostgreSQL 9.5 database, and this table has about 6 million records.
I am running a select count(event_id) from events query, but it takes about 40 seconds, which is a very long time for a database. The event_id column is the table's primary key and is indexed. Why does this take so long? (The server is an Ubuntu VM on VMware with 4 CPUs.)
Explain:
"Aggregate (cost=826305.19..826305.20 rows=1 width=0) (actual time=24739.306..24739.306 rows=1 loops=1)"
" Buffers: shared hit=13 read=757739 dirtied=53 written=48"
" -> Seq Scan on event_source (cost=0.00..812594.55 rows=5484255 width=0) (actual time=0.014..24087.050 rows=6320689 loops=1)"
" Buffers: shared hit=13 read=757739 dirtied=53 written=48"
"Planning time: 0.369 ms"
"Execution time: 24739.364 ms"

I know that this is an old question and the existing answer covers the vast majority of info around this, but I just ran into a situation where a table of 1.3 million rows was taking about 35 seconds to perform a simple SELECT COUNT(*). None of the other solutions helped. The issue ended up being that the table was just bloated and hadn't been vacuumed, so Postgres couldn't figure out the optimal way to query the data. After I ran the following, the query time dropped to about 25 ms!
VACUUM (ANALYZE, VERBOSE, FULL) my_table_name;
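If you want to check whether bloat is the likely culprit before resorting to VACUUM FULL (which takes an exclusive lock and rewrites the table), a quick look at the dead-tuple counters can help. A minimal sketch, using the statistics views and the same table name as above:
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'my_table_name';
A large n_dead_tup relative to n_live_tup is a good hint that a plain VACUUM (or tuning autovacuum) is overdue.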
Hope this helps someone else!

Multiple factors play a big role in PostgreSQL's decision on how to execute the count(), but first of all: the column you use inside the count function does not matter. In fact, if you don't need a DISTINCT count, stick with count(*).
You can try the following to force an index-only scan:
SELECT count(*) FROM (SELECT event_id FROM events) t;
...if that still results in a sequential scan, then most likely the index is not much smaller than the table itself. To still see how an index-only scan would perform, you can enforce it with:
SELECT count(*) FROM (SELECT event_id FROM events ORDER BY 1) t;
If that is not much faster, you should also consider upgrading PostgreSQL to at least version 9.6, which introduces parallel sequential scans to speed up exactly these kinds of queries.
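If you do upgrade, a quick sketch of how to check whether a parallel sequential scan kicks in for the count (max_parallel_workers_per_gather limits the number of workers per Gather node; the value here is just an example):
SET max_parallel_workers_per_gather = 4;  -- allow up to 4 workers in this session
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM events;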
In addition, you can achieve dramatic speedups by choosing from a variety of counting techniques, which largely depend on your use case and your requirements:
Faster PostgreSQL Counting
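One technique along those lines, if an approximate number is good enough: read the planner's own estimate straight from the catalog. A sketch; the figure is only as fresh as the last ANALYZE or autovacuum run:
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'events';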
Last but not least, please always provide the output of an extended explain, as @a_horse_with_no_name already recommended, e.g.:
EXPLAIN (ANALYZE, BUFFERS) SELECT count(event_id) FROM events;

Related

Postgres select query making sequential scan instead of index scan on table with 18 Million rows

I have a Postgres table which has almost 18 million rows and I am trying to run this query:
select * from answer where submission_id = 5 and deleted_at is NULL;
There is a partial index on the table on the submission_id column. This is the command used to create the index:
CREATE INDEX answer_submission_id ON answer USING btree (submission_id) WHERE (deleted_at IS NULL)
This is the explain analyze output of the above select query:
Gather  (cost=1000.00..3130124.70 rows=834 width=377) (actual time=7607.568..7610.130 rows=2 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=2144966 read=3
  I/O Timings: read=6.169
  ->  Parallel Seq Scan on answer  (cost=0.00..3129041.30 rows=348 width=377) (actual time=6501.951..7604.623 rows=1 loops=3)
        Filter: ((deleted_at IS NULL) AND (submission_id = 5))
        Rows Removed by Filter: 62213625
        Buffers: shared hit=2144966 read=3
        I/O Timings: read=6.169
Planning Time: 0.117 ms
Execution Time: 7610.154 ms
Ideally it should pick the answer_submission_id index, but Postgres is going for a sequential scan instead.
Any help would be greatly appreciated.
The execution plan shows a deviation between the estimated row count and the actual row count.
The PostgreSQL optimizer is a cost-based optimizer (CBO): queries are executed using the cheapest of the candidate execution plans, so wrong statistics can lead to a bad execution plan being chosen.
Here is a link that shows how wrong statistics can cause a slow query:
Why are bad row estimates slow in Postgres?
First, I would use this query to check when the table was last analyzed and vacuumed:
SELECT
    schemaname, relname,
    last_vacuum, last_autovacuum,
    vacuum_count, autovacuum_count,
    last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'tablename';
If your statistics are wrong, you can use ANALYZE "tablename" to collect fresh statistics for the table; the speed of the ANALYZE scan depends on the table size.
For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time. Note, however, that the statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This might result in small changes in the planner's estimated costs shown by EXPLAIN. In rare situations, this non-determinism will cause the planner's choices of query plans to change after ANALYZE is run. To avoid this, raise the amount of statistics collected by ANALYZE, as described below.
When we UPDATE or DELETE data, dead tuples are created; they may still exist in the heap or indexes, but we can't query them. VACUUM helps us reclaim the storage occupied by dead tuples.
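Concretely, for the table in this question that would be something like:
ANALYZE answer;           -- refresh the planner's statistics
VACUUM (VERBOSE) answer;  -- reclaim storage from dead tuples and report what was done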

Why does this simple query not use the index in postgres?

In my PostgreSQL database I have a table named "product". In this table I have a column named "date_touched" with type timestamp. I created a simple btree index on this column. This is the schema of my table (I omitted irrelevant column & index definitions):
Table "public.product"
Column | Type | Modifiers
---------------------------+--------------------------+-------------------
id | integer | not null default nextval('product_id_seq'::regclass)
date_touched | timestamp with time zone | not null
Indexes:
"product_pkey" PRIMARY KEY, btree (id)
"product_date_touched_59b16cfb121e9f06_uniq" btree (date_touched)
The table has ~300,000 rows and I want to get the n-th element from the table, ordered by "date_touched". When I want to get the 1000th element, it takes 0.2 s, but when I want to get the 100,000th element, it takes about 6 s. My question is: why does it take so much time to retrieve the 100,000th element, even though I've defined a btree index?
Here is my query with explain analyze, showing that PostgreSQL does not use the btree index and instead sorts all rows to find the 100,000th element:
first query (1000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Limit  (cost=3035.26..3038.29 rows=1 width=12) (actual time=160.208..160.209 rows=1 loops=1)
  ->  Index Scan using product_date_touched_59b16cfb121e9f06_uniq on product  (cost=0.42..1000880.59 rows=329797 width=12) (actual time=16.651..159.766 rows=1001 loops=1)
Total runtime: 160.395 ms
second query (100,000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit  (cost=106392.87..106392.88 rows=1 width=12) (actual time=6621.947..6621.950 rows=1 loops=1)
  ->  Sort  (cost=106142.87..106967.37 rows=329797 width=12) (actual time=6381.174..6568.802 rows=100001 loops=1)
        Sort Key: date_touched
        Sort Method: external merge  Disk: 8376kB
        ->  Seq Scan on product  (cost=0.00..64637.97 rows=329797 width=12) (actual time=1.357..4184.115 rows=329613 loops=1)
Total runtime: 6629.903 ms
It is actually a good thing that a SeqScan is used here. Your OFFSET 100000 is not a good fit for an IndexScan.
A bit of theory
B-tree indexes contain two structures inside:
a balanced tree, and
a doubly linked list of keys.
The first structure allows fast key lookups; the second is responsible for the ordering. For bigger tables, the linked list cannot fit into a single page, so it becomes a list of linked pages, where each page's entries maintain the ordering specified during index creation.
It is wrong to think, though, that such pages sit next to each other on disk; in fact, it is more likely they are spread across different locations. To read pages in the index's order, the system has to perform random disk reads, and random disk I/O is expensive compared to sequential access. That is why a good optimizer will prefer a SeqScan instead.
I highly recommend the “SQL Performance Explained” book to better understand indexes. It is also available online.
What is going on?
Your OFFSET clause would cause the database to read the index's linked list of keys (causing lots of random disk reads) and then discard all those results until it hits the wanted offset. So it is good, in fact, that Postgres decided to use SeqScan + Sort here; this should be faster.
You can check this assumption by:
running EXPLAIN (analyze, buffers) of your big-OFFSET query,
then doing SET enable_seqscan TO 'off';
and running EXPLAIN (analyze, buffers) again, comparing the results (a concrete sketch follows below).
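Concretely, with the big-OFFSET query from above, the comparison could look like this (a sketch; remember to reset the setting afterwards):
EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id FROM product ORDER BY product.date_touched ASC LIMIT 1 OFFSET 100000;

SET enable_seqscan TO 'off';

EXPLAIN (ANALYZE, BUFFERS)
SELECT product.id FROM product ORDER BY product.date_touched ASC LIMIT 1 OFFSET 100000;

RESET enable_seqscan;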
In general, it is better to avoid OFFSET, as DBMSes do not always pick the right approach here. (BTW, which version of PostgreSQL are you using?)
Here's a comparison of how it performs for different offset values.
EDIT: In order to avoid OFFSET, one would have to base pagination on real data that exists in the table and is part of the index. For this particular case, the following might be possible:
show the first N (say, 20) elements;
include the maximal date_touched shown on the page in all the “Next” links. You can compute this value on the application side. Do the same for the “Previous” links, except include the minimal date_touched for these (a mirrored query is sketched after the example below);
on the server side you will get the limiting value. So, say for the “Next” case, you can do a query like this:
SELECT id
FROM product
WHERE date_touched > $max_date_seen_on_the_page
ORDER BY date_touched ASC
LIMIT 20;
This query makes best use of the index.
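For the “Previous” direction, a mirrored query can be used (a sketch; the rows come back newest-first and need to be reversed on the application side):
SELECT id
FROM product
WHERE date_touched < $min_date_seen_on_the_page
ORDER BY date_touched DESC
LIMIT 20;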
Of course, you can adjust this example to your needs. I used pagination as it is a typical case for the OFFSET.
One more note: querying 1 row many times, increasing the offset by 1 for each query, will be much more time consuming than doing a single batch query that returns all those records, which are then iterated over on the application side.

Index doesn't improve performance

I have a simple table structure in my postgres database:
CREATE TABLE device
(
id bigint NOT NULL,
version bigint NOT NULL,
device_id character varying(255),
date_created timestamp without time zone,
last_updated timestamp without time zone,
CONSTRAINT device_pkey PRIMARY KEY (id )
)
I'm often querying data based on the device_id column. The table has 3.5 million rows, which leads to performance issues:
"Seq Scan on device (cost=0.00..71792.70 rows=109 width=8) (actual time=352.725..353.445 rows=2 loops=1)"
" Filter: ((device_id)::text = '352184052470420'::text)"
"Total runtime: 353.463 ms"
Hence I've created an index on the device_id column:
CREATE INDEX device_device_id_idx
ON device
USING btree
(device_id );
However, my problem is that the database still uses a sequential scan, not an index scan. The query plan after creating the index is the same:
"Seq Scan on device (cost=0.00..71786.33 rows=109 width=8) (actual time=347.133..347.508 rows=2 loops=1)"
" Filter: ((device_id)::text = '352184052470420'::text)"
"Total runtime: 347.538 ms"
The result of the query is 2 rows, so I'm not selecting a big portion of the table. I don't really understand why the index is disregarded. What can I do to improve the performance?
edit:
My query:
select id from device where device_id ='357560051102491A';
I've run analyze on the device table, which didn't help.
device_id also contains non-numeric characters.
You may need to look at the queries. To use an index, the queries need to be sargable. That means certain ways of constructing queries are better than others. I am not familiar with Postgres, but in SQL Server this would include things like the following (a very small sample of the bad constructs):
Not doing data transformations in the join - instead, store the data properly
Not using correlated subqueries - use derived tables or temp tables instead
Not using OR conditions - use UNION ALL instead (a generic sketch follows below)
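As a generic sketch of the last point only, in plain SQL, reusing the device_id values from this question purely as placeholders (note that Postgres may well handle the OR fine on its own):
-- instead of:
SELECT id FROM device WHERE device_id = '352184052470420' OR device_id = '357560051102491A';

-- some engines do better with:
SELECT id FROM device WHERE device_id = '352184052470420'
UNION ALL
SELECT id FROM device WHERE device_id = '357560051102491A';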
Your first step should be to get a good book on performance tuning for your specific database. It will tell you which constructs to avoid for your particular database engine.
Indexes are not used when you cast a column to a different type:
((device_id)::text = '352184052470420'::text)
Instead, you can do it this way:
(device_id = ('352184052470420'::character varying))
(or maybe you can change device_id to TEXT in the original table, if you wish.)
Also, remember to run analyze device after the index has been created, or it will not be used anyway.
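Putting both suggestions together, a sketch:
ANALYZE device;

SELECT id
FROM device
WHERE device_id = '352184052470420'::character varying;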
It seems like time resolves everything. I'm not sure what happened, but currently it's working fine.
Since posting this question I haven't changed anything, and now I get this query plan:
"Bitmap Heap Scan on device (cost=5.49..426.77 rows=110 width=166)"
" Recheck Cond: ((device_id)::text = '357560051102491'::text)"
" -> Bitmap Index Scan on device_device_id_idx (cost=0.00..5.46 rows=110 width=0)"
" Index Cond: ((device_id)::text = '357560051102491'::text)"
Time breakdown (timezone GMT+2):
~15:50 I've created the index
~16:00 I've dropped and recreated the index several times, since it was not working
16:05 I've run analyze device (didn't help)
16:44:49 from the app server request_log, I can see that requests executing the query are still taking around 500 ms
16:56:59 I can see the first request that takes 23 ms (the index started to work!)
The question remains: why did it take around an hour and 10 minutes for the index to start being used? When I was creating indexes in the same database a few days ago, the changes were immediate.

How to measure time in postgres

In Postgres I want to know the time taken to generate a plan. I know that \timing gives me the time taken to execute the plan after the optimal plan has been found, but I want to find out how long Postgres takes to find the optimal plan. Is it possible to determine this time in Postgres? If yes, then how?
Also, query planners at times do not find the optimal plan. Can I force Postgres to use the optimal plan? If yes, then how can I do so?
For the time taken to prepare a plan and the time taken to execute it, you can use explain (which merely finds a plan) vs explain analyze (which actually runs it) with \timing turned on:
test=# explain select * from test where val = 1 order by id limit 10;
QUERY PLAN
--------------------------------------------------------------------------------
Limit (cost=0.00..4.35 rows=10 width=8)
-> Index Scan using test_pkey on test (cost=0.00..343.25 rows=789 width=8)
Filter: (val = 1)
(3 rows)
Time: 0.759 ms
test=# explain analyze select * from test where val = 1 order by id limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..4.35 rows=10 width=8) (actual time=0.122..0.170 rows=10 loops=1)
-> Index Scan using test_pkey on test (cost=0.00..343.25 rows=789 width=8) (actual time=0.121..0.165 rows=10 loops=1)
Filter: (val = 1)
Rows Removed by Filter: 67
Total runtime: 0.204 ms
(5 rows)
Time: 1.019 ms
Note that there is a tiny overhead in both commands to actually output the plan.
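On more recent versions (9.4 and later, if I remember correctly), EXPLAIN ANALYZE also prints the planning time itself, so \timing is not strictly required for this:
EXPLAIN ANALYZE SELECT * FROM test WHERE val = 1 ORDER BY id LIMIT 10;
-- the tail of the output then contains lines like:
--   Planning time: ... ms
--   Execution time: ... ms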
Usually it happens, e.g. in DB2, that because finding the optimal plan takes a lot of time, the database engine decides to use a suboptimal plan. I think it must be the case with Postgres also.
In Postgres, this only occurs if your query is gory enough that it cannot reasonably do an exhaustive search. When you reach the relevant thresholds (which are high, if your use-cases are typical), the planner uses the genetic query optimizer:
http://www.postgresql.org/docs/current/static/geqo-pg-intro.html
If it is, then how can I fiddle with Postgres so that it chooses the optimal plan?
In more general use cases, there are many things that you can fiddle with, but be very wary of messing around with them (apart, perhaps, from collecting a bit more statistics on a select few columns using ALTER TABLE ... SET STATISTICS):
http://www.postgresql.org/docs/current/static/runtime-config-query.html
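For example, a sketch of raising the per-column statistics target mentioned above, using the test table from the earlier example (the target value is arbitrary):
ALTER TABLE test ALTER COLUMN val SET STATISTICS 500;
ANALYZE test;  -- re-collect statistics with the larger sample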
If you use the \i command in the psql client, then add the following row to the SQL file:
\timing true

How to efficiently delete rows from a Postgresql 8.1 table?

I'm working on a PostgreSQL 8.1 SQL script which needs to delete a large number of rows from a table.
Let's say the table I need to delete from is Employees (~260K rows).
It has primary key named id.
The rows I need to delete from this table are stored in a separate temporary table called EmployeesToDelete (~10K records) with a foreign key reference to Employees.id called employee_id.
Is there an efficient way to do this?
At first, I thought of the following:
DELETE
FROM Employees
WHERE id IN
(
SELECT employee_id
FROM EmployeesToDelete
)
But I heard that using the "IN" clause and subqueries can be inefficient, especially with larger tables.
I've looked at the PostgreSQL 8.1 documentation, and there's mention of DELETE FROM ... USING, but it doesn't have examples, so I'm not sure how to use it.
I'm wondering if the following works and is more efficient?
DELETE
FROM Employees
USING Employees e
INNER JOIN
EmployeesToDelete ed
ON e.id = ed.employee_id
Your comments are greatly appreciated.
Edit:
I ran EXPLAIN ANALYZE and the weird thing is that the first DELETE ran pretty quickly (within seconds), while the second DELETE took so long (over 20 min) I eventually cancelled it.
Adding an index to the temp table helped the performance quite a bit.
Here's a query plan of the first DELETE for anyone interested:
Hash Join  (cost=184.64..7854.69 rows=256482 width=6) (actual time=54.089..660.788 rows=27295 loops=1)
  Hash Cond: ("outer".id = "inner".employee_id)
  ->  Seq Scan on Employees  (cost=0.00..3822.82 rows=256482 width=10) (actual time=15.218..351.978 rows=256482 loops=1)
  ->  Hash  (cost=184.14..184.14 rows=200 width=4) (actual time=38.807..38.807 rows=10731 loops=1)
        ->  HashAggregate  (cost=182.14..184.14 rows=200 width=4) (actual time=19.801..28.773 rows=10731 loops=1)
              ->  Seq Scan on EmployeesToDelete  (cost=0.00..155.31 rows=10731 width=4) (actual time=0.005..9.062 rows=10731 loops=1)
Total runtime: 935.316 ms
(7 rows)
At this point, I'll stick with the first DELETE unless I can find a better way of writing it.
Don't guess, measure. Try the various methods and see which one is the shortest to execute. Also, use EXPLAIN to know what PostgreSQL will do and see where you can optimize. Very few PostgreSQL users are able to guess correctly the fastest query...
I'm wondering if the following works and is more efficient?
DELETE
FROM Employees e
USING EmployeesToDelete ed
WHERE id = ed.employee_id;
This totally depends on your index selectivity.
PostgreSQL tends to employ MERGE IN JOIN for IN predicates, which has stable execution time.
It's not affected by how many rows satisfy this condition, provided that you already have an ordered resultset.
An ordered resultset requires either a sort operation or an index. Full index traversal is very inefficient in PostgreSQL compared to SEQ SCAN.
The JOIN predicate, on the other hand, may benefit from using NESTED LOOPS if your index is very selective, and from using a HASH JOIN if it's unselective.
PostgreSQL should select the right one by estimating the row count.
Since you have 30k rows against 260K rows, I expect HASH JOIN to be more efficient, and you should try to build a plan on a DELETE ... USING query.
To make sure, please post execution plan for both queries.
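One caveat when doing that: EXPLAIN ANALYZE actually executes the statement, so for a DELETE you may want to wrap it in a transaction and roll it back (a sketch):
BEGIN;
EXPLAIN ANALYZE
DELETE FROM Employees
USING EmployeesToDelete
WHERE Employees.id = EmployeesToDelete.employee_id;
ROLLBACK;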
I'm not sure about the DELETE FROM ... USING syntax, but generally, a subquery should logically be the same thing as an INNER JOIN anyway. The database query optimizer should be capable (and this is just a guess) of executing the same query plan for both.
Why can't you delete the rows in the first place instead of adding them to the EmployeesToDelete table?
Or, if you need to undo, just add a "deleted" flag to Employees, so you can reverse the deletion or make it permanent, all in one table.
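A sketch of that flag-based approach (the column name and the id value are only examples):
ALTER TABLE Employees ADD COLUMN deleted boolean NOT NULL DEFAULT false;

-- reversible "delete":
UPDATE Employees SET deleted = true WHERE id = 42;

-- undo:
UPDATE Employees SET deleted = false WHERE id = 42;

-- make the deletions permanent later:
DELETE FROM Employees WHERE deleted;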