We have two identical double precision columns on the same table, with two identical indices, running two identical queries. Yet one runs nearly 10× quicker than the other. What's causing this?
1) SELECT MIN("reports"."longitude") AS min_id FROM "reports" WHERE (area2 = 18)
2) SELECT MIN("reports"."latitude") AS min_id FROM "reports" WHERE (area2 = 18)
Query 1 runs in 28 ms and query 2 in over 300 ms.
Here are the EXPLAIN outputs:
1)
Result (cost=6.07..6.08 rows=1 width=0)
  InitPlan 1 (returns $0)
    -> Limit (cost=0.00..6.07 rows=1 width=8)
         -> Index Scan using longitude on reports (cost=0.00..139617.49 rows=22983 width=8)
              Index Cond: (longitude IS NOT NULL)
              Filter: (area2 = 18)
2)
Result (cost=5.95..5.96 rows=1 width=0)
  InitPlan 1 (returns $0)
    -> Limit (cost=0.00..5.95 rows=1 width=8)
         -> Index Scan using latitude on reports (cost=0.00..136754.07 rows=22983 width=8)
              Index Cond: (latitude IS NOT NULL)
              Filter: (area2 = 18)
As requested, here is the EXPLAIN ANALYZE output:
1)
Result (cost=6.07..6.08 rows=1 width=0) (actual time=10.992..10.993 rows=1 loops=1)
  InitPlan 1 (returns $0)
    -> Limit (cost=0.00..6.07 rows=1 width=8) (actual time=10.985..10.986 rows=1 loops=1)
         -> Index Scan using longitude on reports (cost=0.00..139617.49 rows=22983 width=8) (actual time=10.983..10.983 rows=1 loops=1)
              Index Cond: (longitude IS NOT NULL)
              Filter: (area2 = 18)
Total runtime: 11.033 ms
2)
Result (cost=5.95..5.96 rows=1 width=0) (actual time=259.749..259.749 rows=1 loops=1)
  InitPlan 1 (returns $0)
    -> Limit (cost=0.00..5.95 rows=1 width=8) (actual time=259.740..259.740 rows=1 loops=1)
         -> Index Scan using latitude on reports (cost=0.00..136754.07 rows=22983 width=8) (actual time=259.739..259.739 rows=1 loops=1)
              Index Cond: (latitude IS NOT NULL)
              Filter: (area2 = 18)
Total runtime: 259.789 ms
---------------------
What is going on? How can I get the second query to behave properly and run quickly? Both setups are identical as far as I can tell.
First, there is no guarantee that indexes speed up queries. Second, when comparing performance, you need to run each query multiple times: there is overhead for loading the index and loading pages into the cache that can affect the timings.
I am not a specialist in Postgres, but on thinking about this, I'm not that surprised.
The query plan walks the index in order, checking each corresponding row against area2 = 18, and stops at the first match (because it scans the index starting from the lowest value and moving upwards, the first match is the minimum). This is speculation about how it works; I don't know that Postgres does exactly this.
In any case, what is happening is that the matching area is much closer to the beginning of the longitude index than to the beginning of the latitude index, so the longitude scan hits its first matching record much sooner. If this explanation is correct, it would suggest that the area is relatively west (lower longitude) and relatively north (higher latitude) compared to other things in the database.
By the way, assuming that there are lots of areas, you might get better results with an index on area2.
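For example (the index name is my own):
CREATE INDEX reports_area2_idx ON reports (area2);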
You are getting an index scan, but the number of records examined depends on how far up the list you have to go to match the area2 condition.
Unless your area2 distribution is strange, to optimize this query you should put composite indices on (area2, latitude) and (area2, longitude). I suspect you will get <10 ms. PG may also be able to combine a separate index on area2 with the existing indices, in lieu of composite indices, using its Bitmap Heap Scan capabilities.
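A sketch of those composite indices (the names are illustrative):
CREATE INDEX reports_area2_longitude_idx ON reports (area2, longitude);
CREATE INDEX reports_area2_latitude_idx ON reports (area2, latitude);
With these in place, MIN(longitude) or MIN(latitude) for a given area2 can be answered by descending straight to the first matching index entry, instead of walking a single-column index past non-matching rows.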
Related
I have two identical queries, but with different WHERE condition values:
explain analyse select survey_contact_id, relation_id, count(survey_contact_id), count(relation_id) from nomination where survey_id = 1565 and account_id = 225 and deleted_at is NULL group by survey_contact_id, relation_id;
explain analyse select survey_contact_id, relation_id, count(survey_contact_id), count(relation_id) from nomination where survey_id = 888 and account_id = 12 and deleted_at is NULL group by survey_contact_id, relation_id;
When I run these two queries, they produce different plans.
For the first query, the result is:
GroupAggregate (cost=0.28..8.32 rows=1 width=24) (actual time=0.016..0.021 rows=4 loops=1)
Group Key: survey_contact_id, relation_id
-> Index Only Scan using test on nomination (cost=0.28..8.30 rows=1 width=8) (actual time=0.010..0.012 rows=5 loops=1)
Index Cond: ((account_id = 225) AND (survey_id = 1565))
Heap Fetches: 5
Planning time: 0.148 ms
Execution time: 0.058 ms
And for the second one:
GroupAggregate (cost=11.08..11.12 rows=2 width=24) (actual time=0.015..0.015 rows=0 loops=1)
Group Key: survey_contact_id, relation_id
-> Sort (cost=11.08..11.08 rows=2 width=8) (actual time=0.013..0.013 rows=0 loops=1)
Sort Key: survey_contact_id, relation_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on nomination (cost=4.30..11.07 rows=2 width=8) (actual time=0.008..0.008 rows=0 loops=1)
Recheck Cond: ((account_id = 12) AND (survey_id = 888) AND (deleted_at IS NULL))
-> Bitmap Index Scan on test (cost=0.00..4.30 rows=2 width=0) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: ((account_id = 12) AND (survey_id = 888))
Planning time: 0.149 ms
Execution time: 0.052 ms
Can anyone explain why Postgres does a Bitmap Scan instead of an Index Only Scan for the second query?
The short version is that Postgres has a cost-based approach, so it has estimated that the cost of doing so is less in the second case, based on the statistics it has.
In your case, the total cost (estimated) of each of these queries is 8.32 and 11.12 respectively. You may be able to see the cost of the index-only scan for the second query by running set enable_bitmapscan = off.
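For example, in the same session (so the setting doesn't persist):
SET enable_bitmapscan = off;
EXPLAIN ANALYZE
SELECT survey_contact_id, relation_id, count(survey_contact_id), count(relation_id)
FROM nomination
WHERE survey_id = 888 AND account_id = 12 AND deleted_at IS NULL
GROUP BY survey_contact_id, relation_id;
RESET enable_bitmapscan;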
Note that, based on its statistics, Postgres estimated that the first query would return 1 row (actually 4), and that the second would return 2 rows (actually 0).
There are several ways to get better statistics, but if ANALYZE (or autovacuum) hasn't been run on that table for a while, that is a common cause of bad estimates. Another tell-tale sign that VACUUM may not have been run recently (at least on this table) is the Heap Fetches: 5 you can see in the first query plan.
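If in doubt, you can refresh both manually:
VACUUM ANALYZE nomination;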
I'm confused by the "when data set increases" part of your question, please do add more context on that front if relevant.
Finally, if you're not already planning a PostgreSQL upgrade, I highly recommend doing so soon. 9.6 is nearly out of support, and versions 10, 11, 12, and 13 each contained a host of performance-focused features.
I have the following table:
CREATE TABLE foo
(
c1 integer,
c2 text
)
indexed, populated, and analyzed as follows:
CREATE INDEX ON foo(c2);
INSERT INTO foo
SELECT i, md5(random()::text)
FROM generate_series(1, 1000000) AS i;
ANALYZE foo;
Now, I tried the query:
EXPLAIN (ANALYZE, BUFFERS) SELECT MAX(c2) FROM foo;
and got the following plan:
Result (cost=0.08..0.09 rows=1 width=0) (actual time=0.574..0.574 rows=1 loops=1)
Buffers: shared read=5
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.08 rows=1 width=33) (actual time=0.570..0.571 rows=1 loops=1)
Buffers: shared read=5
-> Index Only Scan Backward using foo_c2_idx on foo (cost=0.00..79676.27 rows=1000000 width=33) (actual time=0.569..0.569 rows=1 loops=1)
Index Cond: (c2 IS NOT NULL)
Heap Fetches: 1
Buffers: shared read=5
What really confused me is that the resulting cost is just 0.08..0.09. Why?
I thought that to find the max, given an index on the column, we would have to perform an Index Only Scan and read at least one of the index leaf pages. Reading a leaf page requires one random access, which costs 4. So the cost should have been more than 4.
What did I miss here?
The cost of the index scan is pro-rated by the LIMIT. The proration logic does not try to round page accesses up to whole integers, because by the point the proration is done, everything has already been collapsed into a single floating-point number.
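You can see this in the plan above: the full backward index-only scan is estimated at 79676.27 for 1,000,000 rows, and LIMIT 1 prorates that to roughly 79676.27 / 1,000,000 ≈ 0.08, which matches the 0.08 shown on the Limit node.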
In our database we have a table menus with 515,502 rows. It has a column status of type smallint.
Currently, a simple count query takes about 700 ms for the set of rows having status = 2:
explain analyze select count(id) from menus where status = 2;
Aggregate (cost=72973.71..72973.72 rows=1 width=4) (actual time=692.564..692.565 rows=1 loops=1)
-> Bitmap Heap Scan on menus (cost=2510.63..72638.80 rows=133962 width=4) (actual time=28.179..623.077 rows=135429 loops=1)
Recheck Cond: (status = 2)
Rows Removed by Index Recheck: 199654
-> Bitmap Index Scan on menus_status (cost=0.00..2477.14 rows=133962 width=0) (actual time=26.211..26.211 rows=135429 loops=1)
Index Cond: (status = 2)
Total runtime: 692.705 ms
(7 rows)
For status values matching far fewer rows, such as 4, the query runs very fast:
explain analyze select count(id) from menus where status = 4;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=7198.73..7198.74 rows=1 width=4) (actual time=24.926..24.926 rows=1 loops=1)
-> Bitmap Heap Scan on menus (cost=40.53..7193.53 rows=2079 width=4) (actual time=1.461..23.418 rows=2220 loops=1)
Recheck Cond: (status = 4)
-> Bitmap Index Scan on menus_status (cost=0.00..40.02 rows=2079 width=0) (actual time=0.858..0.858 rows=2220 loops=1)
Index Cond: (status = 4)
Total runtime: 25.089 ms
(6 rows)
I observed that a plain btree index is the best indexing strategy for simple equality-based queries; both GIN and hash were slower than btree.
Any tips for making count queries faster for any filter that is using an index?
I understand that this is a beginner level question, so apologies in advance for any kind of mistakes I might have made.
Your table probably has many more rows with status = 2 than with status = 4, so the total table access time is higher in that case.
For status = 2 there are too many rows to consider, so the bitmap for the Bitmap Heap Scan goes into "lossy" mode, and a recheck is needed after the operation (note the Rows Removed by Index Recheck line). There are two things to consider: either your result set is simply too big (and you can't do much about that without reorganizing your tables, say, with partitioning), or your work_mem parameter is too small to hold the intermediate result. Try increasing its value if you can.
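For example (the 64MB value is only an illustration; pick one that fits your server's memory):
SET work_mem = '64MB';
EXPLAIN ANALYZE SELECT count(id) FROM menus WHERE status = 2;
If the plan no longer reports Rows Removed by Index Recheck, the bitmap fit in memory and stayed exact.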
I encountered a strange behaviour of the Postgres optimizer on the following query:
select count(product0_.id) as col_0_0_ from Product product0_
where product0_.active=true
and (product0_.aggregatorId is null
or product0_.aggregatorId in ($1 , $2 , $3))
Product has about 54 columns, active is a boolean with a btree index, and aggregatorId is varchar(15) and also has a btree index.
For the query above, the index on aggregatorId is not used:
Aggregate (cost=169995.75..169995.76 rows=1 width=32) (actual time=3904.726..3904.727 rows=1 loops=1)
-> Seq Scan on product product0_ (cost=0.00..165510.39 rows=1794146 width=32) (actual time=0.055..2407.195 rows=1851827 loops=1)
Filter: (active AND ((aggregatorid IS NULL) OR ((aggregatorid)::text = ANY ('{5109037,5001015,70601}'::text[]))))
Rows Removed by Filter: 542146
Total runtime: 3904.925 ms
But if we reduce the query by leaving out the null check for this column, the index gets used:
Aggregate (cost=17600.93..17600.94 rows=1 width=32) (actual time=614.933..614.935 rows=1 loops=1)
-> Index Scan using idx_prod_aggr on product product0_ (cost=0.43..17487.56 rows=45347 width=32) (actual time=19.284..594.509 rows=12099 loops=1)
Index Cond: ((aggregatorid)::text = ANY ('{5109037,5001015,70601}'::text[]))
Filter: active
Rows Removed by Filter: 49130
Total runtime: 150.255 ms
As far as I know, a btree index can handle NULL checks, so I don't understand why the index is not used for the complete query. The product table contains about 2.3 million entries, so a sequential scan is not very fast.
EDIT:
The index is very standard:
CREATE INDEX idx_prod_aggr
ON product
USING btree
(aggregatorid COLLATE pg_catalog."default");
Since so many rows match the condition in your WHERE clause (78% of all the table rows, according to your numbers), the database concludes that a full table scan is cheaper than spending additional time reading the index.
A common rule of thumb across database vendors is that an index will probably not be used unless it narrows the search down to about 5% of the table's records.
Your problem looked interesting, so I reproduced your scenario: Postgres 9.1, a table with 1M rows, one boolean column and one varchar column, both indexed, and half of the table with NULL names.
I got the same EXPLAIN ANALYZE output as you while the varchar column was not indexed. With the index, however, Postgres uses a bitmap index scan for the NULL condition and another for the IN condition, then merges them with a BitmapOr, applying the boolean condition as a filter (because the indexes are separate).
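Roughly, the setup looked like this (all of the names and value distributions here are my assumptions, chosen to match the plan below):
CREATE TABLE a (active boolean, name varchar(15));
INSERT INTO a
SELECT random() < 0.5,
       CASE WHEN i % 2 = 0 THEN NULL ELSE i::text END
FROM generate_series(1, 1000000) AS i;
CREATE INDEX idx_a_active ON a (active);
CREATE INDEX idx_prod_aggr ON a (name);
ANALYZE a;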
explain analyze
select * from A where active is true and ((name is null) OR (name in ('1','2','3') ));
See output:
"Bitmap Heap Scan on a (cost=17.34..21.35 rows=1 width=18) (actual time=0.048..0.048 rows=0 loops=1)"
" Recheck Cond: ((name IS NULL) OR ((name)::text = ANY ('{1,2,3}'::text[])))"
" Filter: (active IS TRUE)"
" -> BitmapOr (cost=17.34..17.34 rows=1 width=0) (actual time=0.047..0.047 rows=0 loops=1)"
" -> Bitmap Index Scan on idx_prod_aggr (cost=0.00..4.41 rows=1 width=0) (actual time=0.010..0.010 rows=0 loops=1)"
" Index Cond: (name IS NULL)"
" -> Bitmap Index Scan on idx_prod_aggr (cost=0.00..12.93 rows=1 width=0) (actual time=0.036..0.036 rows=0 loops=1)"
" Index Cond: ((name)::text = ANY ('{1,2,3}'::text[]))"
"Total runtime: 0.077 ms"
This makes me think that you missed some details; if so, please add them to your question.
I have a table with over a billion records. In order to improve performance, I partitioned it to 30 partitions. The most frequent queries have (id = ...) in their where clause, so I decided to partition the table on the id column.
Basically, the partitions were created in this way:
CREATE TABLE foo_0 (CHECK (id % 30 = 0)) INHERITS (foo);
CREATE TABLE foo_1 (CHECK (id % 30 = 1)) INHERITS (foo);
CREATE TABLE foo_2 (CHECK (id % 30 = 2)) INHERITS (foo);
CREATE TABLE foo_3 (CHECK (id % 30 = 3)) INHERITS (foo);
.
.
.
I ran ANALYZE for the entire database and in particular, I made it collect extra statistics for this table's id column by running:
ALTER TABLE foo ALTER COLUMN id SET STATISTICS 10000;
However, when I run queries that filter on the id column, the planner shows that it's still scanning all the partitions. constraint_exclusion is set to partition, so that's not the problem.
EXPLAIN ANALYZE SELECT * FROM foo WHERE (id = 2);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.00..8106617.40 rows=3620981 width=54) (actual time=30.544..215.540 rows=171477 loops=1)
-> Append (cost=0.00..8106617.40 rows=3620981 width=54) (actual time=30.539..106.446 rows=171477 loops=1)
-> Seq Scan on foo (cost=0.00..0.00 rows=1 width=203) (actual time=0.002..0.002 rows=0 loops=1)
Filter: (id = 2)
-> Bitmap Heap Scan on foo_0 foo (cost=3293.44..281055.75 rows=122479 width=52) (actual time=0.020..0.020 rows=0 loops=1)
Recheck Cond: (id = 2)
-> Bitmap Index Scan on foo_0_idx_1 (cost=0.00..3262.82 rows=122479 width=0) (actual time=0.018..0.018 rows=0 loops=1)
Index Cond: (id = 2)
-> Bitmap Heap Scan on foo_1 foo (cost=3312.59..274769.09 rows=122968 width=56) (actual time=0.012..0.012 rows=0 loops=1)
Recheck Cond: (id = 2)
-> Bitmap Index Scan on foo_1_idx_1 (cost=0.00..3281.85 rows=122968 width=0) (actual time=0.010..0.010 rows=0 loops=1)
Index Cond: (id = 2)
-> Bitmap Heap Scan on foo_2 foo (cost=3280.30..272541.10 rows=121903 width=56) (actual time=30.504..77.033 rows=171477 loops=1)
Recheck Cond: (id = 2)
-> Bitmap Index Scan on foo_2_idx_1 (cost=0.00..3249.82 rows=121903 width=0) (actual time=29.825..29.825 rows=171477 loops=1)
Index Cond: (id = 2)
.
.
.
What can I do to make the planner choose a better plan? Do I need to run ALTER TABLE foo ALTER COLUMN id SET STATISTICS 10000; for all the partitions as well?
EDIT
After using Erwin's suggested change to the query, the planner only scans the correct partition; however, the execution time is actually worse than the full scan (at least of the index). The original query:
EXPLAIN ANALYZE SELECT * FROM foo WHERE (id = 2);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.00..8106617.40 rows=3620981 width=54) (actual time=32.611..224.934 rows=171477 loops=1)
-> Append (cost=0.00..8106617.40 rows=3620981 width=54) (actual time=32.606..116.565 rows=171477 loops=1)
-> Seq Scan on foo (cost=0.00..0.00 rows=1 width=203) (actual time=0.002..0.002 rows=0 loops=1)
Filter: (id = 2)
-> Bitmap Heap Scan on foo_0 foo (cost=3293.44..281055.75 rows=122479 width=52) (actual time=0.046..0.046 rows=0 loops=1)
Recheck Cond: (id = 2)
-> Bitmap Index Scan on foo_0_idx_1 (cost=0.00..3262.82 rows=122479 width=0) (actual time=0.044..0.044 rows=0 loops=1)
Index Cond: (id = 2)
-> Bitmap Heap Scan on foo_1 foo (cost=3312.59..274769.09 rows=122968 width=56) (actual time=0.021..0.021 rows=0 loops=1)
Recheck Cond: (id = 2)
-> Bitmap Index Scan on foo_1_idx_1 (cost=0.00..3281.85 rows=122968 width=0) (actual time=0.020..0.020 rows=0 loops=1)
Index Cond: (id = 2)
-> Bitmap Heap Scan on foo_2 foo (cost=3280.30..272541.10 rows=121903 width=56) (actual time=32.536..86.730 rows=171477 loops=1)
Recheck Cond: (id = 2)
-> Bitmap Index Scan on foo_2_idx_1 (cost=0.00..3249.82 rows=121903 width=0) (actual time=31.842..31.842 rows=171477 loops=1)
Index Cond: (id = 2)
-> Bitmap Heap Scan on foo_3 foo (cost=3475.87..285574.05 rows=129032 width=52) (actual time=0.035..0.035 rows=0 loops=1)
Recheck Cond: (id = 2)
-> Bitmap Index Scan on foo_3_idx_1 (cost=0.00..3443.61 rows=129032 width=0) (actual time=0.031..0.031 rows=0 loops=1)
.
.
.
-> Bitmap Heap Scan on foo_29 foo (cost=3401.84..276569.90 rows=126245 width=56) (actual time=0.019..0.019 rows=0 loops=1)
Recheck Cond: (id = 2)
-> Bitmap Index Scan on foo_29_idx_1 (cost=0.00..3370.28 rows=126245 width=0) (actual time=0.018..0.018 rows=0 loops=1)
Index Cond: (id = 2)
Total runtime: 238.790 ms
Versus:
EXPLAIN ANALYZE select * from foo where (id % 30 = 2) and (id = 2);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.00..273120.30 rows=611 width=56) (actual time=31.519..257.051 rows=171477 loops=1)
-> Append (cost=0.00..273120.30 rows=611 width=56) (actual time=31.516..153.356 rows=171477 loops=1)
-> Seq Scan on foo (cost=0.00..0.00 rows=1 width=203) (actual time=0.002..0.002 rows=0 loops=1)
Filter: ((id = 2) AND ((id % 30) = 2))
-> Bitmap Heap Scan on foo_2 foo (cost=3249.97..273120.30 rows=610 width=56) (actual time=31.512..124.177 rows=171477 loops=1)
Recheck Cond: (id = 2)
Filter: ((id % 30) = 2)
-> Bitmap Index Scan on foo_2_idx_1 (cost=0.00..3249.82 rows=121903 width=0) (actual time=30.816..30.816 rows=171477 loops=1)
Index Cond: (id = 2)
Total runtime: 270.384 ms
For non-trivial expressions, you have to repeat the condition more or less verbatim in your queries to make the Postgres query planner understand that it can rely on the CHECK constraint. Even if it seems redundant!
Per documentation:
With constraint exclusion enabled, the planner will examine the
constraints of each partition and try to prove that the partition need
not be scanned because it could not contain any rows meeting the
query's WHERE clause. When the planner can prove this, it excludes
the partition from the query plan.
Emphasis mine. The planner does not understand complex expressions.
Of course, this has to be met, too:
Ensure that the constraint_exclusion configuration parameter is not
disabled in postgresql.conf. If it is, queries will not be optimized as desired.
Instead of
SELECT * FROM foo WHERE (id = 2);
Try:
SELECT * FROM foo WHERE id % 30 = 2 AND id = 2;
And:
The default (and recommended) setting of constraint_exclusion is
actually neither on nor off, but an intermediate setting called
partition, which causes the technique to be applied only to queries
that are likely to be working on partitioned tables. The on setting
causes the planner to examine CHECK constraints in all queries, even
simple ones that are unlikely to benefit.
You can experiment with constraint_exclusion = on to see if the planner catches on without the redundant verbatim condition, but you have to weigh the cost and benefit of this setting.
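For example, per session:
SET constraint_exclusion = on;
EXPLAIN SELECT * FROM foo WHERE id = 2;  -- does the plan now touch only foo_2?
RESET constraint_exclusion;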
The alternative would be simpler conditions for your partitions, as already outlined by @harmic.
And no, increasing the number for STATISTICS will not help in this case. Only the CHECK constraints and your WHERE conditions in the query matter.
Unfortunately, partitioning in PostgreSQL is fairly primitive. It only works for range- and list-based constraints. Your partition constraints are too complex for the query planner to use when deciding whether to exclude partitions.
In the manual it says:
Keep the partitioning constraints simple, else the planner may not be
able to prove that partitions don't need to be visited. Use simple
equality conditions for list partitioning, or simple range tests for
range partitioning, as illustrated in the preceding examples. A good
rule of thumb is that partitioning constraints should contain only
comparisons of the partitioning column(s) to constants using
B-tree-indexable operators.
You might get away with changing your WHERE clause so that the modulus expression is explicitly mentioned, as Erwin suggested. I haven't had much luck with that in the past, although I have not tried recently and as he says, there have been improvements in the planner. That is probably the first thing to try.
Otherwise, you will have to rearrange your partitions to use ranges of id values instead of the modulus method you are using now. Not a great solution, I know.
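A sketch of the range-based layout (the boundaries are illustrative):
CREATE TABLE foo_0 (CHECK (id >= 0 AND id < 1000000)) INHERITS (foo);
CREATE TABLE foo_1 (CHECK (id >= 1000000 AND id < 2000000)) INHERITS (foo);
Simple range tests like these are exactly what constraint exclusion can prove from a plain WHERE id = 2.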
One other solution is to store the modulus of the id in a separate column, which you can then use in a simple equality check for the partition constraint. That wastes a bit of disk space, though, and you would also need to add a term to the WHERE clauses to boot.
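A sketch of that approach (the id_mod column and all names are my own):
ALTER TABLE foo ADD COLUMN id_mod smallint;  -- maintained as id % 30, e.g. by a trigger or by the loading code
CREATE TABLE foo_2 (CHECK (id_mod = 2)) INHERITS (foo);
SELECT * FROM foo WHERE id = 2 AND id_mod = 2;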
In addition to Erwin's answer about the details of the how the planner works with partitions, there is a larger issue here.
Partitioning is not a magic bullet. There are a handful of very specific things for which partitioning is very useful. If none of those very specific things apply to you, then you cannot expect a performance improvement from partitioning, and most likely will get a decrease instead.
To do partitioning correctly, you need a thorough understanding of your usage patterns, or your data loading and unloading patterns.