Same query on 2 identical DBs - different execution plans

I'm moving a Postgres DB to another server. After importing the data (dumped with pg_dump) I checked the performance and found that the same query produces different query plans on the two DBs (even though the DBMS versions, the DB structure and the data itself are the same).
The query is:
explain analyse select * from common.composite where id = 0176200005519000087
Query plan on the production DB:
Index Scan using composite_id_idx on composite (cost=0.43..8.45 rows=1 width=222) (actual time=0.070..0.071 rows=1 loops=1)
Index Cond: (id = '176200005519000087'::bigint)
Planning time: 0.502 ms
Execution time: 0.102 ms
And on the new one:
Bitmap Heap Scan on composite (cost=581.08..54325.66 rows=53916 width=76) (actual time=0.209..0.210 rows=1 loops=1)
Recheck Cond: (id = '176200005519000087'::bigint)
Heap Blocks: exact=1
-> Bitmap Index Scan on composite_id_idx (cost=0.00..567.61 rows=53916 width=0) (actual time=0.187..0.187 rows=1 loops=1)
Index Cond: (id = '176200005519000087'::bigint)
Planning time: 0.428 ms
Execution time: 0.305 ms
Obviously, there is a btree index on id in both DBs.
As far as I can tell, the new DB runs a bitmap scan over the btree index for some reason, even though the index was imported from the dump. This results in a huge slowdown (up to 30x) in complex queries.
Is there something wrong with how the indexes/dependencies were imported, or is there a way to tell the planner which indexes to use?
Thank you.
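For what it's worth, the row estimate in the second plan (rows=53916 against an actual 1) points to stale planner statistics: pg_dump does not carry statistics over, so a freshly restored database needs an ANALYZE. A minimal check, reusing the table from the question:
-- pg_dump does not transfer planner statistics; collect them
-- on the restored database before comparing plans.
ANALYZE common.composite;
-- With accurate statistics the planner should again estimate
-- rows=1 and prefer the plain index scan.
explain analyse select * from common.composite where id = 0176200005519000087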

Related

IN clause is slow when sorting

I have a simple SQL query with an IN clause and ordering:
SELECT * FROM my_table
WHERE fieldA IN ('value1', 'value2')
ORDER BY fieldB
I've also created the following indexes:
CREATE INDEX idx_my_table_fieldA ON my_table (fieldA)
CREATE INDEX idx_my_table_fieldA_fieldB ON my_table (fieldA, fieldB)
There are millions of rows in this table.
The query is very slow when I run it with multiple values in the IN clause:
Sort (cost=800.35..802.21 rows=744 width=601) (actual time=36.409..36.423 rows=5 loops=1)
Sort Key: "fieldB"
Sort Method: quicksort Memory: 30kB
Buffers: shared hit=18 read=13
I/O Timings: read=29.326
-> Index Scan using idx_my_table_fieldA on my_table (cost=0.57..764.86 rows=744 width=601) (actual time=5.597..35.819 rows=5 loops=1)
Index Cond: (fieldA = ANY ('{value1,value2}'::text[]))
Buffers: shared hit=15 read=13
I/O Timings: read=29.326
Planning:
Buffers: shared hit=238 read=54
I/O Timings: read=81.777
Planning Time: 94.622 ms
Execution Time: 36.528 ms
(14 rows)
The idx_my_table_fieldA index is used.
But it's very fast if I run it with a single value in the IN clause:
Fast SQL example:
SELECT * FROM my_table
WHERE fieldA IN ('value1')
ORDER BY fieldB
This is the query plan:
Index Scan using idx_my_table_fieldA_fieldB on my_table (cost=0.57..153.17 rows=149 width=601) (actual time=1.435..1.440 rows=1 loops=1)
Index Cond: (fieldA = 'value1'::text)
Buffers: shared hit=3 read=2
I/O Timings: read=1.313
Planning Time: 0.194 ms
Execution Time: 1.472 ms
(6 rows)
The multicolumn index is used in this case.
Could you recommend how to improve this query? I use an ORM (Hibernate + Spring Data), so it's better not to use native SQL. It would be great to solve this problem using other appropriate indexes (if that's possible).
This query is slow because the planner (called the optimizer in other RDBMSs) is quite limited here.
In such cases (a list of values in an IN operator), Oracle or SQL Server transform the query into several single-value queries and then concatenate the results, so each one benefits from the indexes.
But it seems that PostgreSQL does not rewrite your query this way.
You can manually test whether such a rewrite yields a better plan...
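A minimal sketch of such a rewrite, reusing the table and index names from the question (whether the planner merges the pre-sorted branches instead of re-sorting depends on your version and statistics):
-- Each branch filters on a single fieldA value, so the planner can
-- read idx_my_table_fieldA_fieldB and return rows already ordered
-- by fieldB; the outer ORDER BY combines the sorted branches.
(SELECT * FROM my_table WHERE fieldA = 'value1' ORDER BY fieldB)
UNION ALL
(SELECT * FROM my_table WHERE fieldA = 'value2' ORDER BY fieldB)
ORDER BY fieldB;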

PostgreSQL. Improve indexes

I have the following structure:
create table bitmex
(
    timestamp timestamp with time zone not null,
    symbol    varchar(255) not null,
    side      varchar(255) not null,
    tid       varchar(255) not null,
    size      numeric not null,
    price     numeric not null,
    constraint bitmex_tid_symbol_pk
        primary key (tid, symbol)
);
create index bitmex_timestamp_symbol_index on bitmex (timestamp, symbol);
create index bitmex_symbol_index on bitmex (symbol);
I need to know the exact row count every time, so the reltuples estimate is not usable.
The table has more than 45,000,000 rows.
Running
explain analyze select count(*) from bitmex where symbol = 'XBTUSD';
gives
Finalize Aggregate (cost=1038428.56..1038428.57 rows=1 width=8)
-> Gather (cost=1038428.35..1038428.56 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=1037428.35..1037428.36 rows=1 width=8)
-> Parallel Seq Scan on bitmex (cost=0.00..996439.12 rows=16395690 width=0)
Filter: ((symbol)::text = 'XBTUSD'::text)
Running
explain analyze select count(*) from bitmex;
gives
Finalize Aggregate (cost=997439.34..997439.35 rows=1 width=8) (actual time=6105.463..6105.463 rows=1 loops=1)
-> Gather (cost=997439.12..997439.33 rows=2 width=8) (actual time=6105.444..6105.457 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=996439.12..996439.14 rows=1 width=8) (actual time=6085.960..6085.960 rows=1 loops=3)
-> Parallel Seq Scan on bitmex (cost=0.00..954473.50 rows=16786250 width=0) (actual time=0.364..4342.460 rows=13819096 loops=3)
Planning time: 0.080 ms
Execution time: 6108.277 ms
Why did it not use the indexes?
Thanks
If all rows have to be visited, an index scan is only cheaper if the table does not have to be consulted for most of the values found in the index.
Due to the way PostgreSQL is organized, the table has to be visited to determine if the entry found in the index is visible or not. This step can be skipped if the whole page is marked as “visible” in the visibility map of the table.
To update the visibility map, run VACUUM on the table. Maybe then an index-only scan will be used.
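As a quick sketch, using the table from the question:
-- Rebuild the visibility map so an index-only scan becomes possible.
VACUUM bitmex;
explain analyze select count(*) from bitmex where symbol = 'XBTUSD';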
But counting the number of rows in a table is never cheap, even with an index scan. If you need to do that often, it may be a good idea to have a separate table that only contains a counter for the number of rows. Then you can write triggers that update the counter whenever rows are inserted or deleted.
That will slow down the performance during INSERT and DELETE, but you can count the rows with lightning speed.
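A minimal sketch of that counter approach, keeping one counter per symbol (the table, function and trigger names are made up for illustration; ON CONFLICT requires PostgreSQL 9.5+):
create table bitmex_rowcount
(
    symbol varchar(255) primary key,
    n      bigint not null
);
create function bitmex_rowcount_trg() returns trigger as $$
begin
    if tg_op = 'INSERT' then
        insert into bitmex_rowcount (symbol, n) values (new.symbol, 1)
        on conflict (symbol) do update set n = bitmex_rowcount.n + 1;
        return new;
    else  -- DELETE
        update bitmex_rowcount set n = n - 1 where symbol = old.symbol;
        return old;
    end if;
end;
$$ language plpgsql;
create trigger bitmex_count_trigger
after insert or delete on bitmex
for each row execute procedure bitmex_rowcount_trg();
-- Exact count, read with lightning speed:
select n from bitmex_rowcount where symbol = 'XBTUSD';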

Why cost is increased by adding indexes?

I'm using postgresql 9.4.6.
There are the following entities:
-- "user" and "group" are reserved words in Postgres and must be quoted.
CREATE TABLE "user" (id CHARACTER VARYING NOT NULL PRIMARY KEY);
CREATE TABLE "group" (id CHARACTER VARYING NOT NULL PRIMARY KEY);
CREATE TABLE group_member (
    id  CHARACTER VARYING NOT NULL PRIMARY KEY,
    gid CHARACTER VARYING REFERENCES "group"(id),
    uid CHARACTER VARYING REFERENCES "user"(id));
I analyze that query:
explain analyze select x2."gid" from "group_member" x2 where x2."uid" = 'a1';
I have several results. Before each result I flushed OS-caches and restarted postgres:
# /etc/init.d/postgresql stop
# sync
# echo 3 > /proc/sys/vm/drop_caches
# /etc/init.d/postgresql start
The results of analyzing are:
1) cost=4.17..11.28 with indexes:
create index "group_member_gid_idx" on "group_member" ("gid");
create index "group_member_uid_idx" on "group_member" ("uid");
Bitmap Heap Scan on group_member x2 (cost=4.17..11.28 rows=3 width=32) (actual time=0.021..0.021 rows=0 loops=1)
Recheck Cond: ((uid)::text = 'a1'::text)
-> Bitmap Index Scan on group_member_uid_idx (cost=0.00..4.17 rows=3 width=0) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: ((uid)::text = 'a1'::text)
Planning time: 28.641 ms
Execution time: 0.359 ms
2) cost=7.97..15.08 with the index:
create unique index "group_member_gid_uid_idx" on "group_member" ("gid","uid");
Bitmap Heap Scan on group_member x2 (cost=7.97..15.08 rows=3 width=32) (actual time=0.013..0.013 rows=0 loops=1)
Recheck Cond: ((uid)::text = 'a1'::text)
-> Bitmap Index Scan on group_member_gid_uid_idx (cost=0.00..7.97 rows=3 width=0) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: ((uid)::text = 'a1'::text)
Planning time: 0.132 ms
Execution time: 0.047 ms
3) cost=0.00..16.38 without any indexes:
Seq Scan on group_member x2 (cost=0.00..16.38 rows=3 width=32) (actual time=0.002..0.002 rows=0 loops=1)
Filter: ((uid)::text = 'a1'::text)
Planning time: 42.599 ms
Execution time: 0.402 ms
Is result #3 more effective? And why?
EDIT
In practice there will be many rows in the tables (group, user, group_member). More than 1 million.
When analyzing queries, the costs and query plans on small data sets are generally not a reliable guide to performance on larger data sets. And SQL is more concerned with large data sets than with trivially small ones.
The reading of data from disk is often the driving factor in query performance. The main purpose of using an index is to reduce the number of data pages being read. If all the data in the table fits on a single data page, then there isn't much opportunity for reducing the number of page reads: It takes the same amount of time to read one page, whether the page has one record or 100 records. (Reading through a page to find the right record also incurs overhead, whereas an index would identify the specific record on the page.)
Indexes incur overhead, but typically much, much less than reading a data page. The index itself needs to be read into memory -- so that means that two pages are being read into memory rather than one. One could argue that for tables that fit on one or two pages, the use of an index is probably not a big advantage.
Although using the index (in this case) does take longer, differences in performance measured in fractions of a millisecond are generally not germane to most database tasks. If you want to see the index do its work, put 100,000 rows in the table and run the same tests. You'll see that the version without the index scales roughly in proportion to the amount of data in the table; the version with the index is relatively constant (well, actually scaling more like the log of the number of records in the table).
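A quick way to run that test, as a sketch (the generated values are arbitrary; the parent tables are filled first to satisfy the foreign keys):
-- Populate the parent tables, then 100,000 memberships.
INSERT INTO "user" (id) SELECT 'u' || g FROM generate_series(0, 9999) AS g;
INSERT INTO "group" (id) SELECT 'g' || g FROM generate_series(0, 99) AS g;
INSERT INTO group_member (id, gid, uid)
SELECT 'm' || g, 'g' || (g % 100), 'u' || (g % 10000)
FROM generate_series(1, 100000) AS g;
ANALYZE group_member;
EXPLAIN ANALYZE SELECT x2."gid" FROM "group_member" x2 WHERE x2."uid" = 'a1';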

Prevent usage of index for a particular query in Postgres

I have a slow query in a Postgres DB. Using explain analyze, I can see that Postgres makes bitmap index scan on two different indexes followed by bitmap AND on the two resulting sets.
Deleting one of the indexes makes the evaluation ten times faster (bitmap index scan is still used on the first index). However, that deleted index is useful in other queries.
Query:
select
booking_id
from
booking
where
substitute_confirmation_token is null
and date_trunc('day', from_time) >= cast('01/25/2016 14:23:00.004' as date)
and from_time >= '01/25/2016 14:23:00.004'
and type = 'LESSON_SUBSTITUTE'
and valid
order by
booking_id;
Indexes:
"idx_booking_lesson_substitute_day" btree (date_trunc('day'::text, from_time)) WHERE valid AND type::text = 'LESSON_SUBSTITUTE'::text
"booking_substitute_confirmation_token_key" UNIQUE CONSTRAINT, btree (substitute_confirmation_token)
Query plan:
Sort (cost=287.26..287.26 rows=1 width=8) (actual time=711.371..711.377 rows=44 loops=1)
Sort Key: booking_id
Sort Method: quicksort Memory: 27kB
Buffers: shared hit=8 read=7437 written=1
-> Bitmap Heap Scan on booking (cost=275.25..287.25 rows=1 width=8) (actual time=711.255..711.294 rows=44 loops=1)
Recheck Cond: ((date_trunc('day'::text, from_time) >= '2016-01-25'::date) AND valid AND ((type)::text = 'LESSON_SUBSTITUTE'::text) AND (substitute_confirmation_token IS NULL))
Filter: (from_time >= '2016-01-25 14:23:00.004'::timestamp without time zone)
Buffers: shared hit=5 read=7437 written=1
-> BitmapAnd (cost=275.25..275.25 rows=3 width=0) (actual time=711.224..711.224 rows=0 loops=1)
Buffers: shared hit=5 read=7433 written=1
-> Bitmap Index Scan on idx_booking_lesson_substitute_day (cost=0.00..20.50 rows=594 width=0) (actual time=0.080..0.080 rows=72 loops=1)
Index Cond: (date_trunc('day'::text, from_time) >= '2016-01-25'::date)
Buffers: shared hit=5 read=1
-> Bitmap Index Scan on booking_substitute_confirmation_token_key (cost=0.00..254.50 rows=13594 width=0) (actual time=711.102..711.102 rows=2718734 loops=1)
Index Cond: (substitute_confirmation_token IS NULL)
Buffers: shared read=7432 written=1
Total runtime: 711.436 ms
Can I prevent using a particular index for a particular query in Postgres?
Your clever solution
You already found a clever solution for your particular case: A partial unique index that only covers rare values, so Postgres won't (can't) use the index for the common NULL value.
CREATE UNIQUE INDEX booking_substitute_confirmation_uni
ON booking (substitute_confirmation_token)
WHERE substitute_confirmation_token IS NOT NULL;
It's a textbook use-case for a partial index. Literally! The manual has a similar example and this perfectly matching advice to go with it:
Finally, a partial index can also be used to override the system's
query plan choices. Also, data sets with peculiar distributions might
cause the system to use an index when it really should not. In that
case the index can be set up so that it is not available for the
offending query. Normally, PostgreSQL makes reasonable choices about
index usage (e.g., it avoids them when retrieving common values, so
the earlier example really only saves index size, it is not required
to avoid index usage), and grossly incorrect plan choices are cause
for a bug report.
Keep in mind that setting up a partial index indicates that you know
at least as much as the query planner knows, in particular you know
when an index might be profitable. Forming this knowledge requires
experience and understanding of how indexes in PostgreSQL work. In
most cases, the advantage of a partial index over a regular index will
be minimal.
You commented: The table has a few million rows and just a few thousand rows with non-null values, so this is a perfect use-case. It will even speed up queries on non-null values for substitute_confirmation_token because the index is much smaller now.
Answer to question
To answer your original question: it's not possible to "disable" an existing index for a particular query. You would have to drop it, but that's way too expensive.
Fake drop index
You could drop an index inside a transaction, run your SELECT and then, instead of committing, use ROLLBACK. That's fast, but be aware that (per documentation):
A normal DROP INDEX acquires exclusive lock on the table, blocking
other accesses until the index drop can be completed.
So this is no good for multi-user environments.
BEGIN;
DROP INDEX big_user_id_created_at_idx;
SELECT ...;
ROLLBACK; -- so the index is preserved after all
More detailed statistics
Normally, though, it should be enough to raise the STATISTICS target for the column, so Postgres can more reliably identify common values and avoid the index for those. Try:
ALTER TABLE booking ALTER COLUMN substitute_confirmation_token SET STATISTICS 2000;
Then run ANALYZE booking; before you try your query again. 2000 is an example value. Related:
Keep PostgreSQL from sometimes choosing a bad query plan

Postgres inconsistent use of Index vs Seq Scan

I'm having difficulty understanding what I perceive as an inconsistency in how Postgres chooses to use indexes. We have a query based on NOT IN against an indexed column that Postgres executes sequentially, but when we perform the same query as IN, it uses the index.
I've created a simplistic example that I believe demonstrates the issue. Notice that this first query is executed sequentially:
CREATE TABLE node
(
    id  SERIAL PRIMARY KEY,
    vid INTEGER
);
CREATE INDEX x ON node(vid);
INSERT INTO node(vid) VALUES (1),(2);
EXPLAIN ANALYZE
SELECT *
FROM node
WHERE NOT vid IN (1);
Seq Scan on node (cost=0.00..36.75 rows=2129 width=8) (actual time=0.009..0.010 rows=1 loops=1)
Filter: (vid <> 1)
Rows Removed by Filter: 1
Total runtime: 0.025 ms
But if we invert the query to IN, you'll notice that it now decides to use the index:
EXPLAIN ANALYZE
SELECT *
FROM node
WHERE vid IN (2);
Bitmap Heap Scan on node (cost=4.34..15.01 rows=11 width=8) (actual time=0.017..0.017 rows=1 loops=1)
Recheck Cond: (vid = 1)
-> Bitmap Index Scan on x (cost=0.00..4.33 rows=11 width=0) (actual time=0.012..0.012 rows=1 loops=1)
Index Cond: (vid = 1)
Total runtime: 0.039 ms
Can anyone shed any light on this? Specifically, is there a way to re-write our NOT IN to work with the index (when obviously the real result set is not as simplistic as just 1 or 2)?
We are using Postgres 9.2 on CentOS 6.6
PostgreSQL is going to use an index when it makes sense. It is likely that the statistics indicate that your NOT IN would return too many tuples for an index to be effective.
You can test this by doing the following:
set enable_seqscan to false;
explain analyze .... NOT IN
set enable_seqscan to true;
explain analyze .... NOT IN
The results will tell you if PostgreSQL is making the correct decision. If it isn't, you can adjust the statistics of the column and/or the costs (random_page_cost) to get the desired behavior.
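A sketch of those adjustments, with example values only (the right numbers depend on your data and hardware):
-- Gather finer-grained statistics for the column, so common and
-- rare values are estimated more accurately.
ALTER TABLE node ALTER COLUMN vid SET STATISTICS 1000;
ANALYZE node;
-- Lower the random-access cost estimate (e.g. for SSD-backed storage)
-- so index scans look cheaper to the planner; set per session to test.
SET random_page_cost = 1.5;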