I am using a Postgres database, and I am trying to see the difference between an Index Scan and a Sequential Scan on a table of 1,000,000 rows.
Describe table
\d grades
Then explain analyze for rows between 10 and 500000
explain analyze select name from grades where pid between 10 and 500000 ;
Then explain analyze for rows between 10 and 600000
explain analyze select name from grades where pid between 10 and 600000 ;
What is strange to me is why it did an index scan for the first query and a sequential scan for the second, although both queries filter on the same column, which is contained in the index.
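For reference, here is a minimal sketch of how a comparable table could be set up to reproduce the experiment (the real schema from \d grades is not shown above, so the column types are assumptions):
CREATE TABLE grades (
    pid  integer PRIMARY KEY,   -- indexed via the primary key
    name text
);

-- fill with one million rows
INSERT INTO grades (pid, name)
SELECT i, 'student_' || i
FROM generate_series(1, 1000000) AS i;

ANALYZE grades;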
If you need only a single table row, an index scan is much faster than a sequential scan. If you need the whole table, a sequential scan is faster than an index scan.
Somewhere between that is the turning point where PostgreSQL switches between these two access methods.
You can tune random_page_cost to influence the point where a sequential scan is chosen. If you have SSD storage, you should set the parameter to 1.0 or 1.1 to tell PostgreSQL that index scans are cheaper on your hardware.
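For example, to try the setting for the current session only (persist it via postgresql.conf or ALTER SYSTEM once you are happy with it):
SET random_page_cost = 1.1;   -- tell the planner random I/O is almost as cheap as sequential I/O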
PostgreSQL uses a cost-based optimizer, not a rule-based optimizer. If you take the estimated cost of the index scan, 18693, and scale it up linearly by the ratio of the expected rows between the two plans (which is not exactly what the planner does, but should be a good enough first approximation), you get 22330. That is higher than the expected cost of the seq scan, 21372, so it chooses the seq scan.
If you scale the index-scan actual time up the same way, you get 89ms, which is slightly faster than the seq scan actually was. So maybe the planner made a very slight error here, but it is certainly nothing to worry about in practice.
If the difference in run times were a factor of 10, rather than 10%, that might be worth investigating further.
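If you want to see what the index scan would actually have cost for the larger range, a quick session-local experiment is to discourage the sequential scan and re-run the query:
SET enable_seqscan = off;   -- heavily penalize seq scans for this session only
EXPLAIN ANALYZE SELECT name FROM grades WHERE pid BETWEEN 10 AND 600000;
RESET enable_seqscan;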
It's because if the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan, and your second query hit that threshold because it fetches more rows.
Related
I have a SELECT statement with a lot of values in the IN(...) list. The column used in the IN statement has a non-unique index and is VARCHAR(50). When the number of elements in the IN statement goes over a certain threshold, the index is not used.
My select is structured like this
SELECT T.*, RANK() OVER
(PARTITION BY KEY_ID ORDER BY OBS_DATE ASC) AS XRANK
FROM "MY_TABLE" T WHERE KEY_ID IN('A','B','C')
But in reality there are a few hundred more elements in the IN statement and they are not called A, B, C.
If I reduce the number of items in my IN statement to 50, the index is used and the query takes 0.003s, returning 7k rows.
If I double the items in my IN statement to 100, the index is not used and a full table scan is performed, taking 0.4s to return 14k rows.
I'm not sure why the index is not used, but I want to see what would happen if it were, so I tried to experiment with a hint:
SELECT /*+ index(MY_TABLE,MY_INDEX) */ O.*, RANK() OVER ...blah blah
But the hint is ignored. When I run the explain plan it is not used and the query is the same speed.
Any advice would be appreciated, especially
Why is the index not being used when there is a higher number of elements in the IN statement?
Why is the hint being ignored?
Thanks.
It's because of selectivity. If the query is not selective enough, then a full table scan is usually better than multiple index range scans (on a non-unique index).
If the number of values is low, the cost-based optimizer will consider that the combined cost of multiple index range scans is still lower than the cost of a full table scan. If you add too many values, the cost of the index range scans will surpass the cost of the full table scan.
Now, the costs are relative; they depend on how the optimizer is configured (hints), but also on the histograms and table stats. Are they up to date?
By the way, you say 7k and 14k rows are selected. What percentage of the table does that represent? If it's too high the engine will go for the heap ignoring the index.
Having said all this, I think this is a really bad design. Instead of sending 100 parameters over the wire, can you produce those values from another table/select instead?
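One way to do that, sketched below under the assumption that a staging table is acceptable (the table name key_filter is made up), is to bulk-insert the values once and join against them instead of inlining them:
CREATE GLOBAL TEMPORARY TABLE key_filter (key_id VARCHAR2(50))
    ON COMMIT DELETE ROWS;

-- bulk-insert the few hundred key values from the application, then:
SELECT T.*,
       RANK() OVER (PARTITION BY T.KEY_ID ORDER BY T.OBS_DATE ASC) AS XRANK
FROM   "MY_TABLE" T
JOIN   key_filter f ON f.key_id = T.KEY_ID;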
I have this query on the order lines table. It's a fairly large table. I am trying to get the quantity shipped by item in the last 365 days. The query works, but is very slow to return results. Should I use a function-based index for this? I read a bit about them, but haven't worked with them much at all.
How can I make this query faster?
select OOL.INVENTORY_ITEM_ID
,SUM(nvl(OOL.shipped_QUANTITY,0)) shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date>=trunc(sysdate)-365
and cancelled_flag='N'
and fulfilled_flag='Y'
group by ool.inventory_item_id;
Explain plan:
Stats are up to date, we regather once a week.
Query taking 30+ minutes to finish.
UPDATE
After adding this index:
The explain plan shows the query is using index now:
The query runs faster but not 'fast', completing in about 6 minutes.
UPDATE2
I created a covering index as suggested by Matthew and Gordon:
The query now completes in less than 1 second.
Explain Plan:
I still wonder why, or whether a function-based index would also have been a viable solution, but I don't have time to play with it right now.
As a rule, using an index that accesses a "significant" percentage of the rows in your table is slower than a full table scan. Depending on your system, "significant" could be as low as 5% or 10%.
So, think about your data for a minute...
How many rows in OE_ORDER_LINES_ALL are cancelled? (Hopefully not many...)
How many rows are fulfilled? (Hopefully almost all of them...)
How many rows were shipped in the last year? (Unless you have more than 10 years of history in your table, more than 10% of them...)
Put that all together and your query is probably going to have to read at least 10% of the rows in your table. This is very near the threshold where an index is going to be worse than a full table scan (or, at least not much better than one).
Now, if you need to run this query a lot, you have a few options.
A materialized view, possibly for the prior 11 months, together with a live query against OE_ORDER_LINES_ALL for the current month-to-date (a rough sketch follows this list).
A covering index (see below).
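A rough sketch of the materialized-view option (the view name and the build/refresh options are assumptions, not a drop-in solution):
CREATE MATERIALIZED VIEW mv_shipped_by_item_day
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
AS
SELECT inventory_item_id,
       TRUNC(actual_shipment_date) AS ship_day,
       SUM(NVL(shipped_quantity, 0)) AS shipped_qty
FROM   oe_order_lines_all
WHERE  cancelled_flag = 'N'
AND    fulfilled_flag = 'Y'
GROUP BY inventory_item_id, TRUNC(actual_shipment_date);
-- the report then sums the MV for closed days and unions in a live query
-- against OE_ORDER_LINES_ALL for the current, still-changing period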
You can improve the performance of an index, even one accessing a significant percentage of the table rows, by making it include all the information required by the query -- allowing Oracle to avoid accessing the table at all.
CREATE INDEX idx1 ON OE_ORDER_LINES_ALL
( actual_shipment_date,
cancelled_flag,
fulfilled_flag,
inventory_item_id,
shipped_quantity ) ONLINE;
With an index like that, Oracle can satisfy the query by just reading the index (which is faster because it's much smaller than the table).
For this query:
select OOL.INVENTORY_ITEM_ID,
SUM(OOL.shipped_QUANTITY) as shipped_QUANTITY_Last_365
from oe_order_lines_all OOL
where ool.actual_shipment_date >= trunc(sysdate) - 365 and
cancelled_flag = 'N' and
fulfilled_flag = 'Y'
group by ool.inventory_item_id;
I would recommend starting with an index on oe_order_lines_all(cancelled_flag, fulfilled_flag, actual_shipment_date). That should do a good job in identifying the rows.
You can add the additional columns inventory_item_id and shipped_quantity to the index as well, making it a covering index.
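In Oracle syntax, that would look something like this (the index name is arbitrary):
-- the first three columns identify the matching rows; the last two make the
-- index covering, so Oracle never has to visit the table itself
CREATE INDEX oe_order_lines_all_n99
    ON oe_order_lines_all (cancelled_flag, fulfilled_flag, actual_shipment_date,
                           inventory_item_id, shipped_quantity);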
Let's recapitulate the facts:
a) You access about 300K rows from your table (see cardinality in the 3rd line of the execution plan)
b) you use the FULL TABLE SCAN the get the data
c) the query is very slow
The first thing is to check why the FULL TABLE SCAN is so slow - if the table is extremely large (check the BYTES in user_segments), you need to optimize the access to your data.
But remember, no index will help you get 300K rows from, say, 30M total rows.
Index access to 300K rows can take a quarter of an hour or even more if the index is not much used and a large part of it is on disk.
What you need is partitioning - in your case, range partitioning on actual_shipment_date - on a monthly or yearly basis, given your data size.
This will eliminate the need to scan the old data (partition pruning) and make the query much more efficient.
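A hedged illustration of what that could look like (the column list is abbreviated and the partition boundaries are made up):
CREATE TABLE oe_order_lines_part (
    inventory_item_id    NUMBER,
    actual_shipment_date DATE,
    cancelled_flag       VARCHAR2(1),
    fulfilled_flag       VARCHAR2(1),
    shipped_quantity     NUMBER
)
PARTITION BY RANGE (actual_shipment_date) (
    PARTITION p2016 VALUES LESS THAN (DATE '2017-01-01'),
    PARTITION p2017 VALUES LESS THAN (DATE '2018-01-01'),
    PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);
-- a predicate like actual_shipment_date >= TRUNC(SYSDATE) - 365 then only
-- touches the partitions that can contain matching rows (partition pruning)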
Another possibility - if the number of rows is small but the table size is very large - is to reorganize the table to get a better full scan time.
I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.
Query
explain analyze SELECT col1,col2.. colN
FROM table
WHERE text_col = 'my_value';
A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.
Planner
Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384
I added another partial index to try to filter out those values that were not null, but that did not help (with or without text_pattern_ops; I do not need text_pattern_ops since no LIKE conditions are expressed in my queries, but they also match equality).
CREATE INDEX name_idx
ON table
USING btree
(text_col COLLATE pg_catalog."default" text_pattern_ops)
WHERE text_col IS NOT NULL;
Disabling sequential scans using set enable_seqscan = off; still makes the planner pick the seq scan over an index scan. In summary...
The number of rows returned by this query is small.
Given that the non-null rows are fairly unique, an index scan over the text should be faster.
Vacuuming and analyzing the table did not help the optimizer pick the index.
My questions
Why does the database pick the sequence scan over the index scan?
When a table has a text column whose equality condition should be checked, are there any best practices I can adhere to?
How do I reduce the time taken for this query?
[Edit - More information]
The index scan is picked up on my local database that houses about 10% of the data that is available in production.
A partial index is a good idea to exclude half the rows of the table which you obviously do not need. Simpler:
CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;
Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)
Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:
SELECT col1,col2, .. colN
FROM table
WHERE text_col = 'my_value'
AND text_col IS NOT NULL; -- repeat condition
Voilà.
Per documentation:
However, keep in mind that the predicate must match the conditions
used in the queries that are supposed to benefit from the index. To be
precise, a partial index can be used in a query only if the system can
recognize that the WHERE condition of the query mathematically implies
the predicate of the index. PostgreSQL does not have a sophisticated
theorem prover that can recognize mathematically equivalent
expressions that are written in different forms. (Not only is such a
general theorem prover extremely difficult to create, it would
probably be too slow to be of any real use.) The system can recognize
simple inequality implications, for example "x < 1" implies "x < 2";
otherwise the predicate condition must exactly match part of the
query's WHERE condition or the index will not be recognized as usable.
Matching takes place at query planning time, not at run time. As a
result, parameterized query clauses do not work with a partial index.
As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.
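A minimal sketch with a prepared statement (my_table and its columns are placeholders):
-- the parameterized equality alone cannot prove the index predicate at plan
-- time, so the constant IS NOT NULL condition is repeated explicitly
PREPARE find_row(text) AS
    SELECT col1, col2
    FROM   my_table
    WHERE  text_col = $1
    AND    text_col IS NOT NULL;

EXECUTE find_row('my_value');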
An important update in Postgres 9.6 largely improves chances for index-only scans (which can make queries cheaper and the query planner will more readily chose such query plans). Related:
PostgreSQL not using index during count(*)
A partial index is only used if the WHERE conditions match. Thus an index with WHERE text_col IS NOT NULL can only be used if you use the same condition in your SELECT. Collation mismatch could also cause harm.
Try the following:
Make the simplest possible btree index: CREATE INDEX foo ON table (text_col)
ANALYZE table
Run the query again
I figured it out. Upon taking a closer look at the pg_stats view that analyze helps build, I came across this excerpt on the documentation.
Correlation
Statistical correlation between physical row ordering and logical
ordering of the column values. This ranges from -1 to +1. When the
value is near -1 or +1, an index scan on the column will be estimated
to be cheaper than when it is near zero, due to reduction of random
access to the disk. (This column is null if the column data type does
not have a < operator.)
On my local box the correlation number is 0.97 and on production it was 0.05. Thus the planner is estimating that it is easier to go through all those rows sequentially instead of looking up the index each time and diving into a random access on the disk block. This is the query I used to peek at the correlation number.
select * from pg_stats where tablename = 'table_name' and attname = 'text_col';
This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If an update has a large value for a text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering are slowly moving apart with each update. To fix that I executed the following queries.
ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;
The idea is that I could give each disk block a 20% buffer and vacuum full the table to reclaim lost space and maintain physical and logical order. After I did this the query picks up the index.
Query
explain analyze SELECT col1,col2... colN
FROM table_name
WHERE text_col is not null
AND
text_col = 'my_value';
Partial index scan - 1.5ms
Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor IS NOT NULL) AND (victor = 'delta'::text))
Excluding the NULL condition picks up the other index with a bitmap heap scan.
Full index - 0.08ms
Bitmap Heap Scan on two (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
Recheck Cond: (victor = 'delta'::text)
-> Bitmap Index Scan on tango (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: (victor = 'delta'::text)
[EDIT]
While it initially looked like correlation plays a major role in choosing the index scan, @Mike has observed that a correlation value close to 0 on his database still resulted in an index scan. Changing the fill factor and vacuuming fully has helped, but I'm unsure why.
Is it a bad symptom that non-clustered index scan cost is 53%?
That depends on your query. The total query always costs 100%. So if you have a query like
SELECT Name from Customers WHERE ID = 3
then the index scan or seek may even cost 100%. That doesn't mean it's a bad thing. If you want a clear answer about your query, then you should at least post the query itself.
SQL Server will not use a non-clustered index with a key/bookmark lookup if it expects the iterator to return more than a few percent of the total rows from the table.
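If the lookup cost is what pushes the optimizer away from the index, one common workaround is a covering non-clustered index; a sketch based on the example query above (the index name is made up):
-- Name is carried in the index leaf level via INCLUDE, so the query can be
-- answered from the index alone, with no key/bookmark lookup into the table
CREATE NONCLUSTERED INDEX IX_Customers_ID_incl_Name
    ON Customers (ID)
    INCLUDE (Name);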
Here's my explain plan:
SELECT STATEMENT, GOAL = ALL_ROWS 244492 4525870 235345240
SORT ORDER BY 244492 4525870 235345240
**PARTITION RANGE ALL** 207633 4525870 235345240
INDEX FAST FULL SCAN MCT MCT_PLANNED_CT_PK 207633 4525870 235345240
Just wondering if this is the best optimized plan for querying huge partitioned tables.
Using Oracle10g
PARTITION RANGE ALL just means that the predicates could not be used to perform any partition pruning. More info. Or, that the alternative (scanning the table blocks instead of using a fast full scan on the index) was estimated to be more expensive overall.
If you can change the predicate to limit the affected rows to a small subset of the partitions, the database will be able to skip whole partitions when querying the table.
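For example, if the table were range-partitioned on a date column (the table and column names below are assumptions for illustration), a predicate on the partition key lets the plan show PARTITION RANGE SINGLE or ITERATOR instead of PARTITION RANGE ALL:
SELECT *
FROM   mct_planned_ct
WHERE  planned_date >= DATE '2010-01-01'
AND    planned_date <  DATE '2010-02-01';   -- touches only the partitions covering January 2010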