IN clause is slow when sorting

I have simple SQL with IN clause and ordering:
SELECT * FROM my_table
WHERE fieldA IN ('value1', 'value2')
ORDER BY fieldB
I've also created the following indexes:
CREATE INDEX idx_my_table_fieldA ON my_table (fieldA)
CREATE INDEX idx_my_table_fieldA_fieldB ON my_table (fieldA, fieldB)
There are millions of rows in this table.
The query is very slow when I run it with multiple values in the IN clause:
Sort (cost=800.35..802.21 rows=744 width=601) (actual time=36.409..36.423 rows=5 loops=1)
Sort Key: "fieldB"
Sort Method: quicksort Memory: 30kB
Buffers: shared hit=18 read=13
I/O Timings: read=29.326
-> Index Scan using idx_my_table_fieldA on my_table (cost=0.57..764.86 rows=744 width=601) (actual time=5.597..35.819 rows=5 loops=1)
Index Cond: (fieldA = ANY ('{value1,value2}'::text[]))
Buffers: shared hit=15 read=13
I/O Timings: read=29.326
Planning:
Buffers: shared hit=238 read=54
I/O Timings: read=81.777
Planning Time: 94.622 ms
Execution Time: 36.528 ms
(14 rows)
The idx_my_table_fieldA index is used.
But it's very fast if I run it with a single value in the IN clause.
Fast SQL example:
SELECT * FROM my_table
WHERE fieldA IN ('value1')
ORDER BY fieldB
This is the query plan:
Index Scan using idx_my_table_fieldA_fieldB on my_table (cost=0.57..153.17 rows=149 width=601) (actual time=1.435..1.440 rows=1 loops=1)
Index Cond: (fieldA = 'value1'::text)
Buffers: shared hit=3 read=2
I/O Timings: read=1.313
Planning Time: 0.194 ms
Execution Time: 1.472 ms
(6 rows)
The multicolumn index is used in this case.
Could you recommend how to improve this query? I use an ORM (Hibernate + Spring Data), so it's better not to resort to native SQL. It would be great to solve this problem with other appropriate indexes, if possible.

This query is slow because the planner (called the optimizer in other RDBMSs) is limited here.
In such cases (a list of values in an IN operator), Oracle or SQL Server transform the query into several single-value queries and concatenate the results, so each one can benefit from the index.
But it seems that PostgreSQL does not rewrite your query this way.
You can test manually whether such a rewrite produces a better plan...
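For example, a rewrite along these lines (a sketch using the table and values from the question) lets each branch hit the (fieldA, fieldB) index and return its rows already ordered by fieldB; depending on the Postgres version the planner may still add a final sort, but it has far less work to do per branch:

(SELECT * FROM my_table WHERE fieldA = 'value1' ORDER BY fieldB)
UNION ALL
(SELECT * FROM my_table WHERE fieldA = 'value2' ORDER BY fieldB)
ORDER BY fieldB; -- required for correctness across the branches

Comparing EXPLAIN (ANALYZE, BUFFERS) output for this form against the original IN query will show whether the rewrite is worth pushing through the ORM.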

How to speed up SUM query in postgres on large table

The problem
I'm trying to run the following query on a SQL view in a postgres database:
SELECT sum(value) FROM invoices_view;
The invoices_view has approximately 45 million rows, the entire database is 40.5 GB on disk, and the server has 61 GB of RAM.
Currently this query is taking 4.5 seconds, and I'd like it to be ideally under 1 second.
Things I've tried
I cannot add indexes to the SQL view itself, of course, but I have an index on the underlying table:
CREATE INDEX invoices_on_value_idx ON invoices (value);
I have also run a VACUUM ANALYZE on the invoices table.
EXPLAIN ANALYZE
The output of EXPLAIN ANALYZE is as follows:
EXPLAIN (ANALYZE, BUFFERS) SELECT sum(value) FROM invoices_view;
Finalize Aggregate (cost=1514195.47..1514195.47 rows=1 width=32) (actual time=5102.805..5102.806 rows=1 loops=1)
Buffers: shared hit=14996 read=1446679
I/O Timings: read=3235.147
-> Gather (cost=1514195.16..1514195.47 rows=3 width=32) (actual time=5102.716..5109.229 rows=4 loops=1)
Workers Planned: 3
Workers Launched: 3
Buffers: shared hit=14996 read=1446679
I/O Timings: read=3235.147
-> Partial Aggregate (cost=1513195.16..1513195.17 rows=1 width=32) (actual time=5097.626..5097.626 rows=1 loops=4)
Buffers: shared hit=14996 read=1446679
I/O Timings: read=3235.147
-> Parallel Seq Scan on invoices (cost=0.00..1505835.14 rows=14720046 width=6) (actual time=0.049..3734.495 rows=11408036 loops=4)
Buffers: shared hit=14996 read=1446679
I/O Timings: read=3235.147
Planning Time: 2.503 ms
Execution Time: 5109.327 ms
Does anyone have any thoughts on how I might be able to speed this up? Or should I be looking at alternatives to Postgres at this point?
More detail
This is the simplest version of the queries I'll need to run over the dataset.
For example, I need to be able to SUM based on user inputs i.e. additional WHERE clauses and GROUP BYs.
Keeping a running total would solve for this simplest case only.
You should consider using a trigger to keep track of a rolling sum:
CREATE OR REPLACE FUNCTION func_sum_invoice()
RETURNS trigger AS
$BODY$
BEGIN
    -- add the newly inserted value to the running total
    UPDATE invoices_sum
    SET total = total + NEW.value;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;
Then create the trigger using this function:
CREATE TRIGGER sum_invoice
AFTER INSERT ON invoices
FOR EACH ROW
EXECUTE PROCEDURE func_sum_invoice();
Now each insert into the invoices table will fire a trigger which updates the rolling sum. To obtain that sum, you need only a single-row SELECT, which should be very fast:
SELECT total
FROM invoices_sum;
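This assumes a one-row summary table seeded with the current total, e.g. (a minimal sketch; the table and column names are just the ones the trigger refers to):

-- one-row table holding the running total
CREATE TABLE invoices_sum (total numeric NOT NULL);
INSERT INTO invoices_sum (total)
SELECT COALESCE(sum(value), 0) FROM invoices;

Note that the trigger above covers INSERTs only; if rows in invoices can be updated or deleted, matching UPDATE and DELETE triggers are needed to keep the total correct.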

Same query on 2 identical DBs - different execution plans

I'm moving a Postgres DB to another server. After importing the data (dumped with pg_dump) I checked the performance and found that the same query produces different query plans on the two DBs (even though the DBMS versions, the DB structure and the data itself are the same):
the query is:
explain analyse select * from common.composite where id = 0176200005519000087
query plan of the production DB:
Index Scan using composite_id_idx on composite (cost=0.43..8.45 rows=1 width=222) (actual time=0.070..0.071 rows=1 loops=1)
Index Cond: (id = '176200005519000087'::bigint)
Planning time: 0.502 ms
Execution time: 0.102 ms
for the new one:
Bitmap Heap Scan on composite (cost=581.08..54325.66 rows=53916 width=76) (actual time=0.209..0.210 rows=1 loops=1)
Recheck Cond: (id = '176200005519000087'::bigint)
Heap Blocks: exact=1
-> Bitmap Index Scan on composite_id_idx (cost=0.00..567.61 rows=53916 width=0) (actual time=0.187..0.187 rows=1 loops=1)
Index Cond: (id = '176200005519000087'::bigint)
Planning time: 0.428 ms
Execution time: 0.305 ms
Obviously, there is a btree index for id in both DBs.
As far as I can tell, the new DB uses bitmap index scans for some reason, even though the btree index was imported from the dump. This results in a huge slowdown (up to 30x) in complex queries.
Is there something wrong with how the indexes/dependencies were imported, or is there a way to tell the planner which indexes to use?
Thank you.
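One thing worth ruling out first: pg_dump copies schema and data but not the planner's statistics, so a freshly restored database runs on default estimates until ANALYZE (or autovacuum) has processed the tables. The wildly off estimate on the new DB (rows=53916 estimated vs. rows=1 actual) points in exactly that direction. A minimal post-import step:

-- after restoring the dump on the new server
ANALYZE common.composite;
-- or rebuild statistics for every table in the database:
-- ANALYZE;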

Why is the cost increased by adding indexes?

I'm using postgresql 9.4.6.
There are the following entities:
-- "user" and "group" are reserved words in Postgres, so they must be quoted
CREATE TABLE "user" (id CHARACTER VARYING NOT NULL PRIMARY KEY);
CREATE TABLE "group" (id CHARACTER VARYING NOT NULL PRIMARY KEY);
CREATE TABLE group_member (
id CHARACTER VARYING NOT NULL PRIMARY KEY,
gid CHARACTER VARYING REFERENCES "group"(id),
uid CHARACTER VARYING REFERENCES "user"(id));
I'm analyzing this query:
explain analyze select x2."gid" from "group_member" x2 where x2."uid" = 'a1';
I got several results. Before each run I flushed the OS caches and restarted Postgres:
# /etc/init.d/postgresql stop
# sync
# echo 3 > /proc/sys/vm/drop_caches
# /etc/init.d/postgresql start
The results of the analysis are:
1) cost=4.17..11.28 with indexes:
create index "group_member_gid_idx" on "group_member" ("gid");
create index "group_member_uid_idx" on "group_member" ("uid");
Bitmap Heap Scan on group_member x2 (cost=4.17..11.28 rows=3 width=32) (actual time=0.021..0.021 rows=0 loops=1)
Recheck Cond: ((uid)::text = 'a1'::text)
-> Bitmap Index Scan on group_member_uid_idx (cost=0.00..4.17 rows=3 width=0) (actual time=0.005..0.005 rows=0 loops=1)
Index Cond: ((uid)::text = 'a1'::text)
Planning time: 28.641 ms
Execution time: 0.359 ms
2) cost=7.97..15.08 with indexes:
create unique index "group_member_gid_uid_idx" on "group_member" ("gid","uid");
Bitmap Heap Scan on group_member x2 (cost=7.97..15.08 rows=3 width=32) (actual time=0.013..0.013 rows=0 loops=1)
Recheck Cond: ((uid)::text = 'a1'::text)
-> Bitmap Index Scan on group_member_gid_uid_idx (cost=0.00..7.97 rows=3 width=0) (actual time=0.006..0.006 rows=0 loops=1)
Index Cond: ((uid)::text = 'a1'::text)
Planning time: 0.132 ms
Execution time: 0.047 ms
3) cost=0.00..16.38 without any indexes:
Seq Scan on group_member x2 (cost=0.00..16.38 rows=3 width=32) (actual time=0.002..0.002 rows=0 loops=1)
Filter: ((uid)::text = 'a1'::text)
Planning time: 42.599 ms
Execution time: 0.402 ms
Is result #3 the most effective? And why?
EDIT
In practice there will be many rows (more than 1 million) in the tables (group, user, group_member).
When analyzing queries, the costs and query plans on small data sets are generally not a reliable guide to performance on larger data sets. And SQL is more concerned with large data sets than with trivially small ones.
The reading of data from disk is often the driving factor in query performance. The main purpose of using an index is to reduce the number of data pages being read. If all the data in the table fits on a single data page, then there isn't much opportunity for reducing the number of page reads: It takes the same amount of time to read one page, whether the page has one record or 100 records. (Reading through a page to find the right record also incurs overhead, whereas an index would identify the specific record on the page.)
Indexes incur overhead, but typically much, much less than reading a data page. The index itself needs to be read into memory -- so that means that two pages are being read into memory rather than one. One could argue that for tables that fit on one or two pages, the use of an index is probably not a big advantage.
Although using the index (in this case) does take longer, differences in performance measured in fractions of a millisecond are generally not germane to most database tasks. If you want to see the index do its work, put 100,000 rows in the table and run the same tests. You'll see that the version without the index scales roughly in proportion to the amount of data in the table; the version with the index is relatively constant (well, actually scaling more like the log of the number of records in the table).
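To run that experiment, something like the following could populate the tables with synthetic data (a sketch; the id values and the modulo factors are made up, chosen only to spread members across groups and users):

-- parents first, to satisfy the foreign keys
INSERT INTO "user" (id) SELECT 'u' || g FROM generate_series(0, 9999) AS g;
INSERT INTO "group" (id) SELECT 'g' || g FROM generate_series(0, 999) AS g;
-- 100,000 memberships
INSERT INTO group_member (id, gid, uid)
SELECT 'm' || g, 'g' || (g % 1000), 'u' || (g % 10000)
FROM generate_series(1, 100000) AS g;
ANALYZE group_member;
explain analyze select x2."gid" from "group_member" x2 where x2."uid" = 'u1';

At this volume, the plan with group_member_uid_idx in place should beat the sequential scan clearly, and the gap grows with the row count.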

Prevent usage of index for a particular query in Postgres

I have a slow query in a Postgres DB. Using explain analyze, I can see that Postgres makes bitmap index scan on two different indexes followed by bitmap AND on the two resulting sets.
Deleting one of the indexes makes the evaluation ten times faster (bitmap index scan is still used on the first index). However, that deleted index is useful in other queries.
Query:
select
booking_id
from
booking
where
substitute_confirmation_token is null
and date_trunc('day', from_time) >= cast('01/25/2016 14:23:00.004' as date)
and from_time >= '01/25/2016 14:23:00.004'
and type = 'LESSON_SUBSTITUTE'
and valid
order by
booking_id;
Indexes:
"idx_booking_lesson_substitute_day" btree (date_trunc('day'::text, from_time)) WHERE valid AND type::text = 'LESSON_SUBSTITUTE'::text
"booking_substitute_confirmation_token_key" UNIQUE CONSTRAINT, btree (substitute_confirmation_token)
Query plan:
Sort (cost=287.26..287.26 rows=1 width=8) (actual time=711.371..711.377 rows=44 loops=1)
Sort Key: booking_id
Sort Method: quicksort Memory: 27kB
Buffers: shared hit=8 read=7437 written=1
-> Bitmap Heap Scan on booking (cost=275.25..287.25 rows=1 width=8) (actual time=711.255..711.294 rows=44 loops=1)
Recheck Cond: ((date_trunc('day'::text, from_time) >= '2016-01-25'::date) AND valid AND ((type)::text = 'LESSON_SUBSTITUTE'::text) AND (substitute_confirmation_token IS NULL))
Filter: (from_time >= '2016-01-25 14:23:00.004'::timestamp without time zone)
Buffers: shared hit=5 read=7437 written=1
-> BitmapAnd (cost=275.25..275.25 rows=3 width=0) (actual time=711.224..711.224 rows=0 loops=1)
Buffers: shared hit=5 read=7433 written=1
-> Bitmap Index Scan on idx_booking_lesson_substitute_day (cost=0.00..20.50 rows=594 width=0) (actual time=0.080..0.080 rows=72 loops=1)
Index Cond: (date_trunc('day'::text, from_time) >= '2016-01-25'::date)
Buffers: shared hit=5 read=1
-> Bitmap Index Scan on booking_substitute_confirmation_token_key (cost=0.00..254.50 rows=13594 width=0) (actual time=711.102..711.102 rows=2718734 loops=1)
Index Cond: (substitute_confirmation_token IS NULL)
Buffers: shared read=7432 written=1
Total runtime: 711.436 ms
Can I prevent using a particular index for a particular query in Postgres?
Your clever solution
You already found a clever solution for your particular case: A partial unique index that only covers rare values, so Postgres won't (can't) use the index for the common NULL value.
CREATE UNIQUE INDEX booking_substitute_confirmation_uni
ON booking (substitute_confirmation_token)
WHERE substitute_confirmation_token IS NOT NULL;
It's a textbook use-case for a partial index. Literally! The manual has a similar example and this perfectly matching advice to go with it:
Finally, a partial index can also be used to override the system's
query plan choices. Also, data sets with peculiar distributions might
cause the system to use an index when it really should not. In that
case the index can be set up so that it is not available for the
offending query. Normally, PostgreSQL makes reasonable choices about
index usage (e.g., it avoids them when retrieving common values, so
the earlier example really only saves index size, it is not required
to avoid index usage), and grossly incorrect plan choices are cause
for a bug report.
Keep in mind that setting up a partial index indicates that you know
at least as much as the query planner knows, in particular you know
when an index might be profitable. Forming this knowledge requires
experience and understanding of how indexes in PostgreSQL work. In
most cases, the advantage of a partial index over a regular index will
be minimal.
You commented that the table has a few million rows and just a few thousand rows with non-null values, so this is a perfect use-case. It will even speed up queries on non-null values of substitute_confirmation_token, because the index is much smaller now.
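For instance, a lookup on a concrete token (hypothetical value) can now use the small partial index, because an equality match against a non-null value implies the index condition substitute_confirmation_token IS NOT NULL:

SELECT booking_id
FROM booking
WHERE substitute_confirmation_token = 'abc123'; -- planner can use booking_substitute_confirmation_uni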
Answer to question
To answer your original question: it's not possible to "disable" an existing index for a particular query. You would have to drop it, but that's way too expensive.
Fake drop index
You could drop an index inside a transaction, run your SELECT and then, instead of committing, use ROLLBACK. That's fast, but be aware that (per documentation):
A normal DROP INDEX acquires exclusive lock on the table, blocking
other accesses until the index drop can be completed.
So this is no good for multi-user environments.
BEGIN;
DROP INDEX big_user_id_created_at_idx;
SELECT ...;
ROLLBACK; -- so the index is preserved after all
More detailed statistics
Normally, though, it should be enough to raise the STATISTICS target for the column, so Postgres can more reliably identify common values and avoid the index for those. Try:
ALTER TABLE booking ALTER COLUMN substitute_confirmation_token SET STATISTICS 2000;
Then run ANALYZE booking; before you try your query again. 2000 is an example value. Related:
Keep PostgreSQL from sometimes choosing a bad query plan

Postgres inconsistent use of Index vs Seq Scan

I'm having difficulty understanding what I perceive as an inconsistency in how Postgres chooses to use indexes. We have a query based on NOT IN against an indexed column that Postgres executes sequentially, but when we perform the same query as IN, it uses the index.
I've created a simplistic example that I believe demonstrates the issue; notice that this first query uses a sequential scan:
CREATE TABLE node
(
id SERIAL PRIMARY KEY,
vid INTEGER
);
CREATE INDEX x ON node(vid);
INSERT INTO node(vid) VALUES (1),(2);
EXPLAIN ANALYZE
SELECT *
FROM node
WHERE NOT vid IN (1);
Seq Scan on node (cost=0.00..36.75 rows=2129 width=8) (actual time=0.009..0.010 rows=1 loops=1)
Filter: (vid <> 1)
Rows Removed by Filter: 1
Total runtime: 0.025 ms
But if we invert the query to IN, you'll notice that it now decides to use the index:
EXPLAIN ANALYZE
SELECT *
FROM node
WHERE vid IN (2);
Bitmap Heap Scan on node (cost=4.34..15.01 rows=11 width=8) (actual time=0.017..0.017 rows=1 loops=1)
Recheck Cond: (vid = 1)
-> Bitmap Index Scan on x (cost=0.00..4.33 rows=11 width=0) (actual time=0.012..0.012 rows=1 loops=1)
Index Cond: (vid = 1)
Total runtime: 0.039 ms
Can anyone shed any light on this? Specifically, is there a way to rewrite our NOT IN to work with the index (given that in reality the excluded set is not as trivial as just 1 or 2 values)?
We are using Postgres 9.2 on CentOS 6.6
PostgreSQL will use an index when it makes sense. It is likely that the statistics indicate your NOT IN returns too many tuples for an index to be effective.
You can test this by doing the following:
set enable_seqscan to false;
explain analyze .... NOT IN
set enable_seqscan to true;
explain analyze .... NOT IN
The results will tell you if PostgreSQL is making the correct decision. If it isn't, you can adjust the column's statistics and/or the planner cost settings (e.g. random_page_cost) to get the desired behavior.
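Applied to the example table above, the test could look like this (a sketch; enable_seqscan affects only the current session, so it is safe to toggle for diagnosis):

SET enable_seqscan = off; -- heavily penalizes, rather than forbids, seq scans
EXPLAIN ANALYZE SELECT * FROM node WHERE NOT vid IN (1);
SET enable_seqscan = on;  -- restore the default
EXPLAIN ANALYZE SELECT * FROM node WHERE NOT vid IN (1);

If the forced index plan is no faster, the planner's choice of a sequential scan was sound: a NOT IN on a low-cardinality column simply matches too many rows for the index to help.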