Here's my explain plan:
SELECT STATEMENT, GOAL = ALL_ROWS  244492  4525870  235345240
  SORT ORDER BY  244492  4525870  235345240
    PARTITION RANGE ALL  207633  4525870  235345240
      INDEX FAST FULL SCAN MCT MCT_PLANNED_CT_PK  207633  4525870  235345240
Just wondering if this is the optimal plan for querying huge partitioned tables.
Using Oracle 10g.
PARTITION RANGE ALL just means that the predicates could not be used to perform any partition pruning, or that the alternative (scanning the table blocks instead of using a fast full scan on the index) was estimated to be more expensive overall.
If you can change the predicate to limit the affected rows to a small subset of the partitions, the database will be able to skip whole partitions when querying the table.
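For illustration, a minimal sketch (the partition key is an assumption here, since the question does not show it; suppose the table is range-partitioned on a date column PART_DATE):

SELECT *
FROM mct.mct_planned_ct
WHERE part_date >= DATE '2010-01-01'   -- predicate on the partition key
  AND part_date <  DATE '2010-02-01';

With a predicate like this, the plan should show PARTITION RANGE SINGLE or PARTITION RANGE ITERATOR instead of PARTITION RANGE ALL, and only the matching partitions are read.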
I am using a Postgres database, and I am trying to see the difference between an index scan and a sequential scan on a table of 1,000,000 rows.
Describe the table:
\d grades
Then EXPLAIN ANALYZE with pid between 10 and 500000:
explain analyze select name from grades where pid between 10 and 500000;
Then EXPLAIN ANALYZE with pid between 10 and 600000:
explain analyze select name from grades where pid between 10 and 600000;
What is strange to me is why it used an index scan for the first query but a sequential scan for the second, although both filter on the same column, which is contained in the index.
If you need only a single table row, an index scan is much faster than a sequential scan. If you need the whole table, a sequential scan is faster than an index scan.
Somewhere between that is the turning point where PostgreSQL switches between these two access methods.
You can tune random_page_cost to influence the point where a sequential scan is chosen. If you have SSD storage, you should set the parameter to 1.0 or 1.1 to tell PostgreSQL that index scans are cheaper on your hardware.
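For example, a minimal sketch (1.1 here is just the SSD value suggested above; pick what matches your hardware):

-- for the current session:
SET random_page_cost = 1.1;
-- or cluster-wide and persistent (PostgreSQL 9.4+):
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();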
PostgreSQL uses a cost-based optimizer, not a rule-based optimizer. If you take the estimated cost of the index scan, 18693, and scale it up linearly by the ratio of the expected rows between the two plans (which is not exactly what the planner does, but should be a good enough first approximation), you get 22330. That is higher than the expected cost of the seq scan, 21372, so it chooses the seq scan.
If you scale the index-scan actual time up the same way, you get 89ms, which is slightly faster than the seq scan actually was. So maybe the planner made a very slight error here, but it is certainly nothing to worry about in practice.
If the difference in run times were a factor of 10, rather than 10%, that might be worth investigating further.
It's because if the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan, and your second query hit that threshold because it fetches more rows.
Here is the scenario:
Use PG to execute the query as follows:
Select count(*) from t where DATETIME >'2018-07-27 10.12.12.000000' and DATETIME < '2018-07-28 10.12.12.000000'
It returns 22 records and executes quickly.
The same query, but with "=" in the condition:
Select count(*) from t where DATETIME >='2018-07-27 10.12.12.000000' and DATETIME <= '2018-07-28 10.12.12.000000'
It returns the same 22 records, but it takes 20 s.
I find that the query without "=" chooses an index scan, while the query with "=" partly chooses a table scan.
According to your question:
The current indexing mechanism is that the optimizer matches the first available index: the query uses the first index that was created, so the choice of index depends on the order in which the indexes were created. When a matching index exists, the query will prefer an index scan.
Make sure that the nodes in each data group contain the index; otherwise the unindexed data nodes will fall back to a table scan.
Run analyze to optimize the query. Analyze is a new feature of SequoiaDB v3.0. It analyzes collections and index data, collects statistics, and provides an optimal query algorithm to decide between an index scan and a table scan. For specific usage of analyze, see: http://doc.sequoiadb.com/cn/index-cat_id-1496923440-edition_id-300
View the access plan with find.explain() to see the query cost.
Performing VACUUM on my DB significantly improves query performance. While trying to determine why this is, I found that sqlite3 isn't using the index on the DB in its original state, just a plain SCAN TABLE.
QUERY PLAN
|--SCAN TABLE data <--- no Index
|--USE TEMP B-TREE FOR GROUP BY
`--USE TEMP B-TREE FOR ORDER BY
After performing VACUUM, the QUERY PLAN shows a SEARCH USING INDEX as it should
QUERY PLAN
|--SEARCH TABLE data USING INDEX index_name (name=?)
|--USE TEMP B-TREE FOR GROUP BY
`--USE TEMP B-TREE FOR ORDER BY
How can I determine why the index isn't being used before the vacuum operation?
I have the EXPLAIN results as well, but I'm not sure they'd be useful. They are clearly different (the original, non-vacuumed DB performs a Rewind/Loop where the vacuumed DB OpenReads the index).
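For reference, the plans above come from EXPLAIN QUERY PLAN. The original query isn't shown, but a hypothetical shape that would match them (a filter on the indexed name column plus GROUP BY and ORDER BY on something else; the category column and the aggregate are assumptions) is:

EXPLAIN QUERY PLAN
SELECT category, COUNT(*)
FROM data
WHERE name = 'x'          -- the indexed column from the plan
GROUP BY category         -- not indexed, hence USE TEMP B-TREE FOR GROUP BY
ORDER BY COUNT(*) DESC;   -- hence USE TEMP B-TREE FOR ORDER BY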
I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.
Query
explain analyze SELECT col1,col2.. colN
FROM table
WHERE text_col = 'my_value';
A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.
Planner
Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384
I added another partial index to try to exclude the null values, but that did not help (with or without text_pattern_ops; I do not need text_pattern_ops considering no LIKE conditions are expressed in my queries, but it also matches equality).
CREATE INDEX name_idx
ON table
USING btree
(text_col COLLATE pg_catalog."default" text_pattern_ops)
WHERE text_col IS NOT NULL;
Even after disabling sequential scans with set enable_seqscan = off;, the planner still picks the seq scan over an index scan. In summary...
The number of rows returned by this query is small.
Given that the non-null rows are fairly unique, an index scan over the text should be faster.
Vacuuming and analyzing the table did not help the optimizer pick the index.
My questions
Why does the database pick the sequential scan over the index scan?
When a table has a text column that is checked for equality, are there any best practices I can adhere to?
How do I reduce the time taken for this query?
[Edit - More information]
The index scan is picked on my local database, which holds about 10% of the data available in production.
A partial index is a good idea to exclude half the rows of the table which you obviously do not need. Simpler:
CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;
Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)
Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:
SELECT col1,col2, .. colN
FROM table
WHERE text_col = 'my_value'
AND text_col IS NOT NULL; -- repeat condition
Voilà.
Per documentation:
However, keep in mind that the predicate must match the conditions
used in the queries that are supposed to benefit from the index. To be
precise, a partial index can be used in a query only if the system can
recognize that the WHERE condition of the query mathematically implies
the predicate of the index. PostgreSQL does not have a sophisticated
theorem prover that can recognize mathematically equivalent
expressions that are written in different forms. (Not only is such a
general theorem prover extremely difficult to create, it would
probably be too slow to be of any real use.) The system can recognize
simple inequality implications, for example "x < 1" implies "x < 2";
otherwise the predicate condition must exactly match part of the
query's WHERE condition or the index will not be recognized as usable.
Matching takes place at query planning time, not at run time. As a
result, parameterized query clauses do not work with a partial index.
As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.
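A minimal sketch (the $1 placeholder is just an example; use your client's bind-parameter syntax):

SELECT col1, col2
FROM table
WHERE text_col = $1
  AND text_col IS NOT NULL;  -- constant predicate matching the partial index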
An important update in Postgres 9.6 largely improved the chances for index-only scans (which can make queries cheaper, and the query planner will more readily choose such query plans). Related:
PostgreSQL not using index during count(*)
A partial index is only used if the WHERE conditions match. Thus, an index with WHERE text_col IS NOT NULL can only be used if you use the same condition in your SELECT. A collation mismatch could also cause harm.
Try the following:
Create the simplest possible btree index: CREATE INDEX foo ON table (text_col)
Run ANALYZE table
Run the query again
I figured it out. Upon taking a closer look at the pg_stats view that analyze helps build, I came across this excerpt in the documentation.
Correlation
Statistical correlation between physical row ordering and logical
ordering of the column values. This ranges from -1 to +1. When the
value is near -1 or +1, an index scan on the column will be estimated
to be cheaper than when it is near zero, due to reduction of random
access to the disk. (This column is null if the column data type does
not have a < operator.)
On my local box the correlation number is 0.97 and on production it was 0.05. Thus the planner is estimating that it is easier to go through all those rows sequentially instead of looking up the index each time and diving into a random access on the disk block. This is the query I used to peek at the correlation number.
select * from pg_stats where tablename = 'table_name' and attname = 'text_col';
This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If an update writes a large value to the text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering were slowly drifting apart with each update. To fix that, I executed the following queries.
ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;
The idea is that I could give each disk block a 20% buffer and vacuum full the table to reclaim lost space and maintain physical and logical order. After I did this the query picks up the index.
Query
explain analyze SELECT col1,col2... colN
FROM table_name
WHERE text_col IS NOT NULL
AND text_col = 'my_value';
Partial index scan - 1.5ms
Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor IS NOT NULL) AND (victor = 'delta'::text))
Excluding the NULL condition picks up the other index with a bitmap heap scan.
Full index - 0.08ms
Bitmap Heap Scan on two (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
Recheck Cond: (victor = 'delta'::text)
-> Bitmap Index Scan on tango (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: (victor = 'delta'::text)
[EDIT]
While it initially looked like correlation plays a major role in choosing the index scan, @Mike has observed that a correlation value close to 0 on his database still resulted in an index scan. Changing the fill factor and vacuuming fully helped, but I'm unsure why.
My query:
DROP TABLE IF EXISTS tmp;
CREATE TEMP TABLE tmp AS SELECT *, ST_BUFFER(the_geom::GEOGRAPHY, 3000)::GEOMETRY AS buffer FROM af_modis_master LIMIT 20000;
CREATE INDEX idx_tmp_the_geom ON tmp USING gist(buffer);
EXPLAIN SELECT (ST_Dump(ST_Union(buffer))).path[1], (ST_Dump(ST_Union(buffer))).geom FROM tmp;
Output from EXPLAIN:
Aggregate (cost=1705.52..1705.54 rows=1 width=32)
-> Seq Scan on tmp (cost=0.00..1625.01 rows=16101 width=32)
Seq Scan means it is not using the index, right? Why not?
(This question was first posted here: https://gis.stackexchange.com/questions/51877/postgis-query-not-using-gist-index-when-doing-a-st-dumpst-union . Apologies for reposting, but the community here is much more active, so perhaps it will provide an answer quicker.)
UPDATE: Even adding a where clause that filters based on the buffer causes a Seq Scan:
ANALYZE tmp;
EXPLAIN SELECT (ST_Dump(ST_Union(buffer))).path[1], (ST_Dump(ST_Union(buffer))).geom FROM tmp WHERE ST_XMin(buffer) = 0.0;
A query like yours will never use an index. Doing so would substitute significant random disk I/O (possibly even in addition to the normal disk I/O) for the scan of the table.
In essence, you are not selecting on any criteria, so an index will be slower than just pulling the data from disk in physical order and processing it.
Now, if you pull only a single row with a WHERE condition the index can help with, then it may use the index, or not, depending on how big the table is. Very small tables will never use indexes because the extra random disk I/O is never a win. Remember, no query plan beats a sequential scan through a single page....
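For illustration, a sketch of a predicate the GiST index could actually serve: a plain function call like ST_XMin() cannot use the geometry index, whereas a bounding-box operator like && can (the envelope coordinates and the SRID below are made up):

EXPLAIN SELECT *
FROM tmp
WHERE buffer && ST_MakeEnvelope(0, 0, 1, 1, 4326);  -- && is the indexable bounding-box overlap operator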