I have a big report table. The Bitmap Heap Scan step takes more than 5 seconds.
Is there something I can do? I added columns to the table; will reindexing the index it uses help?
I do a union and sum on the data, so I don't return 500K records to the client.
I use Postgres 9.1.
Here is the explain:
Bitmap Heap Scan on foo_table (cost=24747.45..1339408.81 rows=473986 width=116) (actual time=422.210..5918.037 rows=495747 loops=1)
Recheck Cond: ((foo_id = 72) AND (date >= '2013-04-04 00:00:00'::timestamp without time zone) AND (date <= '2013-05-05 00:00:00'::timestamp without time zone))
Filter: ((foo)::text = 'foooooo'::text)
-> Bitmap Index Scan on foo_table_idx (cost=0.00..24628.96 rows=573023 width=0) (actual time=341.269..341.269 rows=723918 loops=1)
Query:
explain analyze
SELECT CAST(date as date) AS date, foo_id, ....
from foo_table
where foo_id = 72
and date >= '2013-04-04'
and date <= '2013-05-05'
and foo = 'foooooo'
Index def:
Index "public.foo_table_idx"
Column | Type
-------------+-----------------------------
foo_id | bigint
date | timestamp without time zone
btree, for table "public.external_channel_report"
Table:
foo is a text field with 4 distinct values.
foo_id is a bigint with currently 10K distinct values.
Create a composite index on (foo_id, foo, date) (in this order).
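For example, a minimal sketch (the index name is made up):
CREATE INDEX foo_table_foo_id_foo_date_idx ON foo_table (foo_id, foo, date);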
Note that if you select 500k records (and return them all to the client), this may take long.
Are you sure you need all 500k records on the client (rather than some kind of an aggregate or a LIMIT)?
Answer to comment
Do I need the WHERE columns in the same order as the index?
The order of expressions in the WHERE clause is completely irrelevant; SQL is not a procedural language.
Fix mistakes
The timestamp column should not be named "date", for several reasons. Obviously, it's a timestamp, not a date. But more importantly, date is a reserved word in all SQL standards and a type and function name in Postgres, and shouldn't be used as an identifier.
You should provide proper information with your question, including a complete table definition and conclusive information about existing indexes. It might be a good idea to start by reading the chapter about indexes in the manual.
The WHERE conditions on the timestamp are most probably incorrect:
and date >= '2013-04-04'
and date <= '2013-05-05'
The upper bound for a timestamp column should probably be excluded:
and date >= '2013-04-04'
and date < '2013-05-05'
Index
With the multicolumn index @Quassnoi provided, your query will be much faster, since all qualifying rows can be read from one continuous data block of the index. No row is read in vain (and later disqualified), like you have it now.
But 500k rows will still take some time. Normally you have to verify visibility and fetch additional columns from the table. An index-only scan might be an option in Postgres 9.2+.
The order of columns is best this way, because the rule of thumb is: columns for equality first — then for ranges. More explanation and links in this related answer on dba.SE.
CLUSTER / pg_repack
You could further speed things up by streamlining the table according to this index, so that a minimum of blocks have to be read from the table - if you don't have other requirements that stand against it!
If you want it faster yet, streamline the physical order of rows in your table. If you can afford to lock your table exclusively for a few seconds (at off hours, for instance), rewrite it and order rows according to the index:
CLUSTER foo_table USING idx_myindex_idx;
If concurrent use is a problem, consider pg_repack, which can do the same without exclusive lock.
The effect: fewer blocks need to be read from the table and everything is pre-sorted. It's a one-time effect deteriorating over time, if you have writes on the table. So you would rerun it from time to time.
I copied and adapted the last chapter from this related answer on dba.SE.
Related
I have the following table:
create table if not exists inventory
(
expired_at timestamp(0),
-- ...
);
create index if not exists inventory_expired_at_index
on inventory (expired_at);
However, when I run the following query:
EXPLAIN UPDATE "inventory" SET "status" = 'expired' WHERE "expired_at" < '2020-12-08 12:05:00';
I get this execution plan:
Update on inventory (cost=0.00..4.09 rows=2 width=126)
-> Seq Scan on inventory (cost=0.00..4.09 rows=2 width=126)
Filter: (expired_at < '2020-12-08 12:05:00'::timestamp without time zone)
The same happens for a big dataset:
EXPLAIN SELECT * FROM "inventory" WHERE "expired_at" < '2020-12-08 12:05:00';
-[ RECORD 1 ]---------------------------------------------------------------------------
QUERY PLAN | Seq Scan on inventory (cost=0.00..58616.63 rows=1281058 width=71)
-[ RECORD 2 ]---------------------------------------------------------------------------
QUERY PLAN | Filter: (expired_at < '2020-12-08 12:05:00'::timestamp without time zone)
The question is: why not Index Scan but Seq Scan?
This is a bit long for a comment.
The short answer is that you have two rows in the table, so it doesn't make a difference.
The longer answer is that you are using an UPDATE, so the data rows have to be retrieved anyway. Using an index requires loading both the index and the data rows and then following the index pointers to the data rows, which is a little more complicated. And with two rows, it is not worth the effort at all.
The power of indexes is to handle large amounts of data, not small amounts of data.
To respond to the larger question: database optimizers are not required to use an index. They use some kind of measure (usually cost-based optimization) to determine whether or not an index is appropriate. In your larger example, the optimizer has determined that the index is not appropriate. This could happen if the statistics are out of sync with the underlying data.
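If stale statistics are the suspect, refreshing them is cheap; a minimal sketch using the table from this question:
ANALYZE inventory;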
Here is the query:
explain analyze
SELECT first_name, last_name, date_of_birth
FROM employees
WHERE phone_number < '989898'
AND date_of_birth < '2020-01-01'
I have these indexes:
Indexes:
"employees_pk" PRIMARY KEY, btree (employee_id)
"dob_pn_on_employess" btree (date_of_birth, phone_number)
"ln_dob_employees" btree (upper(last_name::text), date_of_birth)
and here is the EXPLAIN ANALYZE output:
"Seq Scan on employees (cost=0.00..301.00 rows=1000 width=14) (actual time=0.110..8.644 rows=1000 loops=1)"
" Filter: (((phone_number)::text < 'we'::text) AND (date_of_birth < '2020-01-01'::date))"
"Planning Time: 0.127 ms"
"Execution Time: 15.740 ms"
Why is Postgres not using the compound index?
There is not enough info in the question to know for sure, but here are some tips:
The filters you have in the query are very inclusive:
date_of_birth < '2020-01-01' will most likely match all the rows, as there will be only a few 5-month-old babies that own a phone.
phone_number < '989898' will also match most of the rows.
Postgres knows that you are asking for (almost) the full table, and in that case a seq scan is faster. An index is helpful for picking which pages to read from disk, but there is a cost to using it, so there is no point in an index if you already know you are going to read everything anyway.
And indeed, here Postgres knows you are reading the full table: (cost=0.00..301.00 rows=1000 width=14), and that is why it chooses a seq scan, as it will be faster. If you create a more selective filter like phone_number < '11' (depending on your data distribution, of course!) you should see an index scan.
Postgres keeps internal statistics about each column; when creating an execution plan it estimates the number of rows the query will return. The statistics are not perfect, and Postgres assumes that columns are independent. This is by design, to provide the best mix of planning time vs. power. So if it estimates that filter1 matches a fraction 0.1 of the rows and filter2 a fraction 0.01, it will assume that the number of rows returned is 0.1 * 0.01 * number_of_rows. There are also a number of other statistics available and used. Based on this, Postgres decides whether it is more beneficial to do a seq scan or to use an index (and which index).
In this case Postgres needs to do a seq scan anyway, as it has to go to the disk to fetch the first_name and last_name columns, which are not included in the index(es).
A way to get a faster query (depending on your usage pattern!) is to create a covering index. You have 4 columns involved in the query:
first_name, last_name, date_of_birth, phone_number. If you create an index like
btree (date_of_birth, phone_number, first_name, last_name), Postgres will be able to always run an index-only scan for this query and never touch the table heap. But mind that this index can get large, and it will only help if you can fit it in memory. So be careful with that.
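A sketch of that covering index (the index name is made up):
CREATE INDEX employees_covering_idx
ON employees (date_of_birth, phone_number, first_name, last_name);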
You did not say which Postgres version you are using, but starting with Postgres 11 you are able to INCLUDE columns in indexes. This is a very cool feature. If you always filter only on phone number and date of birth you could do, for example:
btree (date_of_birth, phone_number) INCLUDE (first_name, last_name) and get index-only scans here with a smaller index.
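As a sketch (Postgres 11+; the index name is made up):
CREATE INDEX employees_dob_phone_incl_idx
ON employees (date_of_birth, phone_number)
INCLUDE (first_name, last_name);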
If this filter on phone_number and date_of_birth is a very common one, you can consider creating extended statistics on both columns. That should allow Postgres to create better query plans. It will not change anything in this case, since the plan with the seq scan is already optimal, but it may help with different filter values.
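A sketch of such extended statistics (Postgres 10+; the statistics name is made up):
CREATE STATISTICS employees_dob_phone_stats
ON date_of_birth, phone_number FROM employees;
ANALYZE employees;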
These two tips depend on the types of the columns, which were not added to the question:
If you have a column like date_of_birth, it may be beneficial to look into a BRIN index (see the sketch below).
Also mind that with time columns, asking date_of_birth < '2020-01-01' means you are asking for all people born from 2020 back to the beginning of time :) Depending on the column type, it MAY be beneficial to provide a lower bound, e.g. date_of_birth < '2020-01-01' AND date_of_birth > '1900-01-01'. But you will need to test this on a large dataset to see if it makes a difference.
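A sketch of both tips (the index name is made up; BRIN pays off mainly when the column values correlate with the physical row order):
CREATE INDEX employees_dob_brin ON employees USING brin (date_of_birth);
SELECT first_name, last_name, date_of_birth
FROM employees
WHERE date_of_birth < '2020-01-01'
AND date_of_birth > '1900-01-01';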
A DBMS uses an index when it is probably faster than reading the full table. This is the case when you only read, say, 1% of the table's rows. Once the DBMS thinks that a query might access many more rows - and this can be as little as, say, 5% of the table's rows - it may rather just read the table sequentially.
Both your conditions are <. Getting the rows with a phone number smaller than a given number and a birth date before a given date may match anywhere from 0% to 100% of the table's rows, depending on the values. I suppose the DBMS is playing it safe by reading the full table, because muddling its way through an index only to end up accessing most or all of the rows in the table would result in a huge runtime.
I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.
Query
explain analyze SELECT col1,col2.. colN
FROM table
WHERE text_col = 'my_value';
A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.
Planner
Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384
I added another partial index to try to filter out the values that were null, but that did not help (with or without text_pattern_ops; I do not need text_pattern_ops considering no LIKE conditions are expressed in my queries, but they also match equality).
CREATE INDEX name_idx
ON table
USING btree
(text_col COLLATE pg_catalog."default" text_pattern_ops)
WHERE text_col IS NOT NULL;
Disabling sequence scans using set enable_seqscan = off; makes the planner still pick the seqscan over an index_scan. In summary...
The number of rows returned by this query is small.
Given that the non-null rows are fairly unique, an index scan over the text should be faster.
Vacuuming and analyzing the table did not help the optimizer pick the index.
My questions
Why does the database pick the sequence scan over the index scan?
When a table has a text column whose equality condition should be checked, are there any best practices I can adhere to?
How do I reduce the time taken for this query?
[Edit - More information]
The index scan is picked up on my local database that houses about 10% of the data that is available in production.
A partial index is a good idea to exclude half the rows of the table which you obviously do not need. Simpler:
CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;
Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)
Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:
SELECT col1,col2, .. colN
FROM table
WHERE text_col = 'my_value'
AND text_col IS NOT NULL; -- repeat condition
Voilà.
Per documentation:
However, keep in mind that the predicate must match the conditions
used in the queries that are supposed to benefit from the index. To be
precise, a partial index can be used in a query only if the system can
recognize that the WHERE condition of the query mathematically implies
the predicate of the index. PostgreSQL does not have a sophisticated
theorem prover that can recognize mathematically equivalent
expressions that are written in different forms. (Not only is such a
general theorem prover extremely difficult to create, it would
probably be too slow to be of any real use.) The system can recognize
simple inequality implications, for example "x < 1" implies "x < 2";
otherwise the predicate condition must exactly match part of the
query's WHERE condition or the index will not be recognized as usable.
Matching takes place at query planning time, not at run time. As a
result, parameterized query clauses do not work with a partial index.
As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.
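A sketch with a prepared statement, using the placeholder names from the question:
PREPARE find_row(text) AS
SELECT col1, col2, .. colN
FROM table
WHERE text_col = $1
AND text_col IS NOT NULL; -- constant predicate matching the partial index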
An important update in Postgres 9.6 largely improves chances for index-only scans (which can make queries cheaper and the query planner will more readily choose such query plans). Related:
PostgreSQL not using index during count(*)
A partial index is only used if the WHERE conditions match. Thus an index with WHERE text_col IS NOT NULL can only be used if you use the same condition in your SELECT. A collation mismatch could also cause harm.
Try the following:
Make the simplest possible btree index: CREATE INDEX foo ON table (text_col)
ANALYZE table
Query
I figured it out. Upon taking a closer look at the pg_stats view that analyze helps build, I came across this excerpt on the documentation.
Correlation
Statistical correlation between physical row ordering and logical
ordering of the column values. This ranges from -1 to +1. When the
value is near -1 or +1, an index scan on the column will be estimated
to be cheaper than when it is near zero, due to reduction of random
access to the disk. (This column is null if the column data type does
not have a < operator.)
On my local box the correlation number is 0.97 and on production it was 0.05. Thus the planner is estimating that it is easier to go through all those rows sequentially instead of looking up the index each time and diving into a random access on the disk block. This is the query I used to peek at the correlation number.
select * from pg_stats where tablename = 'table_name' and attname = 'text_col';
This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If an update has a large value for a text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering were slowly moving apart with each update. To fix that I executed the following queries.
ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;
The idea is that I could give each disk block a 20% buffer and vacuum full the table to reclaim lost space and maintain physical and logical order. After I did this the query picks up the index.
Query
explain analyze SELECT col1,col2... colN
FROM table_name
WHERE text_col is not null
AND
text_col = 'my_value';
Partial index scan - 1.5ms
Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor IS NOT NULL) AND (victor = 'delta'::text))
Excluding the NULL condition picks up the other index with a bitmap heap scan.
Full index - 0.08ms
Bitmap Heap Scan on two (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
Recheck Cond: (victor = 'delta'::text)
-> Bitmap Index Scan on tango (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: (victor = 'delta'::text)
[EDIT]
While it initially looked like correlation plays a major role in choosing the index scan, @Mike has observed that a correlation value close to 0 on his database still resulted in an index scan. Changing the fill factor and vacuuming fully helped, but I'm unsure why.
I've got a table pings with about 15 million rows in it. I'm on Postgres 9.2.4. The relevant columns are a foreign key monitor_id, a created_at timestamp, and a response_time integer that represents milliseconds. Here is the exact structure:
Column | Type | Modifiers
-----------------+-----------------------------+----------------------------------------------------
id | integer | not null default nextval('pings_id_seq'::regclass)
url | character varying(255) |
monitor_id | integer |
response_status | integer |
response_time | integer |
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
response_body | text |
Indexes:
"pings_pkey" PRIMARY KEY, btree (id)
"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
"index_pings_on_monitor_id" btree (monitor_id)
I want to query for all the response times that are not NULL (90% won't be NULL, about 10% will be NULL), that have a specific monitor_id, and that were created in the last month. I'm doing the query with ActiveRecord, but the end result looks something like this:
SELECT "pings"."response_time"
FROM "pings"
WHERE "pings"."monitor_id" = 3
AND (created_at > '2014-03-03 20:23:07.254281'
AND response_time IS NOT NULL)
It's a pretty basic query, but it takes about 2000ms to run, which seems rather slow. I'm assuming an index would make it faster, but all the indexes I've tried aren't working, which I'm assuming means I'm not indexing properly.
When I run EXPLAIN ANALYZE, this is what I get:
Bitmap Heap Scan on pings (cost=6643.25..183652.31 rows=83343 width=4) (actual time=58.997..1736.179 rows=42063 loops=1)
Recheck Cond: (monitor_id = 3)
Rows Removed by Index Recheck: 11643313
Filter: ((response_time IS NOT NULL) AND (created_at > '2014-03-03 20:23:07.254281'::timestamp without time zone))
Rows Removed by Filter: 324834
-> Bitmap Index Scan on index_pings_on_monitor_id (cost=0.00..6622.41 rows=358471 width=0) (actual time=57.935..57.935 rows=366897 loops=1)
Index Cond: (monitor_id = 3)
So there is an index on monitor_id that is being used towards the end, but nothing else. I've tried various permutations and orders of compound indexes using monitor_id, created_at, and response_time. I've tried ordering the index by created_at in descending order. I've tried a partial index with response_time IS NOT NULL.
Nothing I've tried makes the query any faster. How would you optimize and/or index it?
Sequence of columns
Create a partial multicolumn index with the right sequence of columns. You have one:
"index_pings_on_created_at_and_monitor_id" btree (created_at DESC, monitor_id)
But the sequence of columns is not serving you well. Reverse it:
CREATE INDEX idx_pings_monitor_created ON pings (monitor_id, created_at DESC)
WHERE response_time IS NOT NULL;
The rule of thumb here is: equality first, ranges later. More about that:
Multicolumn index and performance
As discussed, the condition WHERE response_time IS NOT NULL does not buy you much. If you have other queries that could utilize this index including NULL values in response_time, drop it. Else, keep it.
You can probably also drop both other existing indexes. More about the sequence of columns in btree indexes:
Working of indexes in PostgreSQL
Covering index
If all you need from the table is response_time, this can be much faster yet - if you don't have lots of write operations on the rows of your table. Include the column in the index at the last position to allow index-only scans (making it a "covering index"):
CREATE INDEX idx_pings_monitor_created
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL; -- maybe
Or you could even try this:
More radical partial index
Create a tiny helper function. Effectively a "global constant" in your db:
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-03-03 0:0'::timestamp$$; -- One month in the past
Use it as condition in your index:
CREATE INDEX idx_pings_monitor_created_response_time
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL -- maybe
AND created_at > f_ping_event_horizon();
And your query looks like this now:
SELECT response_time
FROM pings
WHERE monitor_id = 3
AND response_time IS NOT NULL
AND created_at > '2014-03-03 20:23:07.254281'
AND created_at > f_ping_event_horizon();
Aside: I trimmed some noise.
The last condition seems logically redundant. Only include it if Postgres does not understand that it can use the index without it; it might be necessary. The actual timestamp in the condition must be bigger than the one in the function. But that's obviously the case, according to your comments.
This way we cut off all the irrelevant rows and make the index much smaller. The effect degrades slowly over time. Refit the event horizon and recreate the indexes from time to time to get rid of added weight. You could do this with a weekly cron job, for example.
When updating (recreating) the function, you need to recreate all indexes that use the function in any way. Best in the same transaction. Because the IMMUTABLE declaration for the helper function is a bit of a false promise. But Postgres only accepts immutable functions in index definitions. So we have to lie about it. More about that:
Does PostgreSQL support "accent insensitive" collations?
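A sketch of such a refit, all in one transaction (the new horizon date is just an example):
BEGIN;
CREATE OR REPLACE FUNCTION f_ping_event_horizon()
  RETURNS timestamp LANGUAGE sql IMMUTABLE COST 1 AS
$$SELECT '2014-04-01 0:0'::timestamp$$;
DROP INDEX IF EXISTS idx_pings_monitor_created_response_time;
CREATE INDEX idx_pings_monitor_created_response_time
ON pings (monitor_id, created_at DESC, response_time)
WHERE response_time IS NOT NULL
AND created_at > f_ping_event_horizon();
COMMIT;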
Why the function at all? This way, all the queries using the index can remain unchanged.
With all of these changes the query should be faster by orders of magnitude now. A single, continuous index-only scan is all that's needed. Can you confirm that?
I've got a table with around 20 million rows. For arguments sake, lets say there are two columns in the table - an id and a timestamp. I'm trying to get a count of the number of items per day. Here's what I have at the moment.
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;
Without any indexes, this takes about 30s to run on my machine. Here's the explain analyze output:
GroupAggregate (cost=675462.78..676813.42 rows=46532 width=8) (actual time=24467.404..32417.643 rows=346 loops=1)
-> Sort (cost=675462.78..675680.34 rows=87021 width=8) (actual time=24466.730..29071.438 rows=17321121 loops=1)
Sort Key: (date("timestamp"))
Sort Method: external merge Disk: 372496kB
-> Seq Scan on actions (cost=0.00..667133.11 rows=87021 width=8) (actual time=1.981..12368.186 rows=17321121 loops=1)
Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
Total runtime: 32447.762 ms
Since I'm seeing a sequential scan, I tried to index on the date expression:
CREATE INDEX ON actions (DATE(timestamp));
Which cuts the runtime by about 50%.
HashAggregate (cost=796710.64..796716.19 rows=370 width=8) (actual time=17038.503..17038.590 rows=346 loops=1)
-> Seq Scan on actions (cost=0.00..710202.27 rows=17301674 width=8) (actual time=1.745..12080.877 rows=17321121 loops=1)
Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
Total runtime: 17038.663 ms
I'm new to this whole query-optimization business, and I have no idea what to do next. Any clues how I could get this query running faster?
--edit--
It looks like I'm hitting the limits of indices. This is pretty much the only query that gets run on this table (though the values of the dates change). Is there a way to partition up the table? Or create a cache table with all the count values? Or any other options?
Is there a way to partition up the table?
Yes:
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
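A minimal sketch of range partitioning by year (this uses declarative partitioning, available from Postgres 10; on the 9.x version in this question you would use table inheritance and a trigger instead):
CREATE TABLE actions_partitioned (
    id bigserial,
    "timestamp" timestamp NOT NULL
) PARTITION BY RANGE ("timestamp");
CREATE TABLE actions_2010 PARTITION OF actions_partitioned
    FOR VALUES FROM ('2010-01-01') TO ('2011-01-01');
CREATE TABLE actions_2011 PARTITION OF actions_partitioned
    FOR VALUES FROM ('2011-01-01') TO ('2012-01-01');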
Or create a cache table with all the count values? Or any other options?
Creating a "cache" table certainly is possible. But it depends on how often you need that result and how accurate it needs to be.
CREATE TABLE action_report
AS
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;
Then a SELECT * FROM action_report will give you what you want in a timely manner. You would then schedule a cron job to recreate that table on a regular basis.
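A sketch of what such a cron job could run (on Postgres 9.3+ a materialized view refreshed with REFRESH MATERIALIZED VIEW would be an alternative):
BEGIN;
DROP TABLE IF EXISTS action_report;
CREATE TABLE action_report AS
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;
COMMIT;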
This approach of course won't help if the time range changes with every query or if that query is only run once a day.
In general, most databases will ignore indexes if the expected number of rows returned is going to be high. This is because for each index hit, it then needs to fetch the row as well, so it's faster to just do a full table scan. The cutoff is somewhere between 10,000 and 100,000 rows. You can experiment with this by shrinking the date range and seeing where Postgres flips to using the index. In this case, Postgres is planning to scan 17,301,674 rows, so your table is pretty large. If you make the range really small and you still feel like Postgres is making the wrong choice, then try running ANALYZE on the table so that Postgres gets its approximations right.
It looks like the range just about covers all the data available.
This could be a design issue. If you will be running this often, you are better off creating an additional column timestamp_date that contains only the date. Then create an index on that column and change the query accordingly. The column should be maintained by insert and update triggers (a sketch follows the query below).
SELECT timestamp_date AS day, COUNT(*)
FROM actions
WHERE timestamp_date >= '20100101'
AND timestamp_date < '20110101'
GROUP BY day;
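For the maintenance trigger mentioned above, a minimal sketch (function and trigger names are made up):
CREATE OR REPLACE FUNCTION actions_set_timestamp_date()
RETURNS trigger LANGUAGE plpgsql AS
$$
BEGIN
    NEW.timestamp_date := NEW."timestamp"::date;  -- keep the derived column in sync
    RETURN NEW;
END;
$$;
CREATE TRIGGER trg_actions_set_timestamp_date
BEFORE INSERT OR UPDATE ON actions
FOR EACH ROW EXECUTE PROCEDURE actions_set_timestamp_date();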
If I am wrong about the number of rows the date range will find (and it is only a small subset), then you can try an index on just the timestamp column itself, applying the WHERE clause to just that column (which, given the range, works just as well):
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE timestamp >= '20100101'
AND timestamp < '20110101'
GROUP BY day;
Try running explain analyze verbose ... to see if the aggregate is using a temp file. Perhaps increase work_mem to allow more to be done in memory.
Set work_mem to, say, 2GB and see if that changes the plan. If it doesn't, you might be out of options.
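For example, per session (the value mirrors the suggestion above; size it to your available RAM):
SET work_mem = '2GB';
EXPLAIN ANALYZE VERBOSE
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;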
What you really want for such DSS-type queries is a date table that describes days. In database design lingo it's called a date dimension. To populate such a table you can use the code I posted in this article: http://www.mockbites.com/articles/tech/data_mart_temporal
Then in each row in your actions table put the appropriate date_key.
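A minimal sketch of such a date dimension, populated with generate_series (column and table names are assumptions, not necessarily the article's exact schema):
CREATE TABLE date_dimension (
    date_key  int PRIMARY KEY,  -- e.g. 20100101
    full_date date NOT NULL UNIQUE
);
INSERT INTO date_dimension (date_key, full_date)
SELECT to_char(d, 'YYYYMMDD')::int, d::date
FROM generate_series(timestamp '2000-01-01',
                     timestamp '2030-12-31',
                     interval '1 day') AS g(d);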
Your query then becomes:
SELECT
d.full_date, COUNT(*)
FROM actions a
JOIN date_dimension d
ON a.date_key = d.date_key
WHERE d.full_date = '2010/01/01'
GROUP BY d.full_date
Assuming indices on the keys and full_date, this will be super fast because it operates on INT4 keys!
Another benefit is that you can now slice and dice by any other date_dimension column(s).