Improve join query in Oracle - sql

I have a query which takes 17 seconds to execute. I have applied indexes on FIPS, STR_DT, END_DT but still it's taking time. Any suggestions on how I can improve the performance?
My query:
SELECT /*+ALL_ROWS*/ K_LF_SVA_VA.NEXTVAL VAL_REC_ID, a.REC_ID,
b.VID,
1 VA_SEQ,
51 VA_VALUE_DATATYPE,
b.VALUE VAL_NUM,
SYSDATE CREATED_DATE,
SYSDATE UPDATED_DATE
FROM CTY_REC a JOIN FIPS_CONS b
ON a.FIPS=b.FIPS AND a.STR_DT=b.STR_DT AND a.END_DT=b.END_DT;
DESC CTY_REC;
Name Null Type
------------------- ---- -------------
REC_ID NUMBER(38)
DATA_SOURCE_DATE DATE
STR_DT DATE
END_DT DATE
VID_RECSET_ID NUMBER
VID_VALSET_ID NUMBER
FIPS VARCHAR2(255)
DESC FIPS_CONS;
Name Null Type
------------- -------- -------------
STR_DT DATE
END_DT DATE
FIPS VARCHAR2(255)
VARIABLE VARCHAR2(515)
VALUE NUMBER
VID NOT NULL NUMBER
Explain Plan:
Plan hash value: 919279614
--------------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SEQUENCE | K_VAL |
| 2 | HASH JOIN | |
| 3 | TABLE ACCESS FULL| CTY_REC |
| 4 | TABLE ACCESS FULL| FIPS_CONS |
--------------------------------------------------------------
I have added description of tables and explain plan for my query.

On the face of it, and without information on the configuration of the sequence you're using, the number of rows in each table, and the total number of rows projected from the query, it's possible that the execution plan you have is the most efficient one for returning all rows.
The optimiser clearly thinks that the indexes will not benefit performance, and this is often more likely when you optimise for all rows, not first rows. Index-based access is single block and one row at a time, so can be inherently slower than multiblock full scans on a per-block basis.
The hash join that Oracle is using is an extremely efficient way of joining data sets. Unless the hashed table is so large that it spills to disk, the total cost is only slightly more than full scans of the two tables. We need more detailed statistics on the execution to be able to tell if the hashed table is spilling to disk, and if it is the solution may just be modified memory management, not indexes.
What might also hold up your SQL execution is calling that sequence, if the sequence's cache value is very low and the number of records is high. More info required on that -- if you need to generate a sequential identifier for each row then you could use ROWNUM.

This is basically your query:
SELECT . . .
FROM CTY_REC a JOIN
FIPS_CONS b
ON a.FIPS = b.FIPS AND a.STR_DT = b.STR_DT AND a.END_DT = b.END_DT;
You want a composite index on (FIPS, STR_DT, END_DT), perhaps on both tables:
create index idx_cty_rec_3 on cty_rec(FIPS, STR_DT, END_DT);
create index idx_fipx_con_3 on cty_rec(FIPS, STR_DT, END_DT);
Actually, only one is probably necessary but having both gives the optimizer more choices for improving the query.

You should have at least these two indexes on the table:
CTY_REC(FIPS, STR_DT, END_DT)
FIPS_CONS(FIPS, STR_DT, END_DT)
which can still be sped up with covering indexes instead:
CTY_REC(FIPS, STR_DT, END_DT, REC_ID)
FIPS_CONS(FIPS, STR_DT, END_DT, VALUE, VID)

If you wish to drive the optimizer to use the indexes,
replace /*+ all_rows */ with /*+ first_rows */

Related

SQL Query with part of the key possibly being NULL

I've been working on a SQL query which needs to pull a value with a two-column key, where one of the columns may be null.And if it's null, I want to pick that value only if there is no row with the specific key
So.
CUSTOM_____PLAN_____COST
VENDCO_____LMNK_____50
VENDCO_____null_____25
BALLCO_____null_____10
I'm trying to run a query that will pull this into one field, i.e., the value of VENDCO at 50, and the value of BUYCO at 10, ignoring the VENDCO row with 25. This would be as part of a joined subquery, so I can't use the actual keys of VENDCO/BUYCO etc. Essentially, pick the cost value with the plan if it exists, but the one where it's null if the plan is not there.
It might also be worthwhile to point out that if I "select * from table where PLAN is null" I don't get results -- I have to select where PLAN=''. I'm not sure if that indicates anything weird about the data.
Hope I'm making myself clear.
I think that not exists should do what you want:
select t.*
from mytable t
where
plan is not null
or not exists (
select 1 from mytable t1 where t1.custom = t.custom and t1.plan is not null
)
Basically this gives priority to rows where plan is not null in groups sharing the same custom.
Demo on DB Fiddle:
CUSTOM | PLAN | COST
:----- | :--- | ---:
VENDCO | LMNK | 50
BALLCO | null | 10

Why is Oracle SQL Optimizer ignoring index predicate for this view?

I'm trying to optimize a set of stored procs which are going against many tables including this view. The view is as such:
We have TBL_A (id, hist_date, hist_type, other_columns) with two types of rows: hist_type 'O' vs. hist_type 'N'. The view self joins table A to itself and transposes the N rows against the corresponding O rows. If no N row exists for the O row, the O row values are repeated. Like so:
CREATE OR REPLACE FORCE VIEW V_A (id, hist_date, hist_type, other_columns_o, other_columns_n)
select
o.id, o.hist_date, o.hist_type,
o.other_columns as other_columns_o,
case when n.id is not null then n.other_columns else o.other_columns end as other_columns_n
from
TBL_A o left outer join TBL_A n
on o.id=n.id and o.hist_date=n.hist_date and n.hist_type = 'N'
where o.hist_type = 'O';
TBL_A has a unique index on: (id, hist_date, hist_type). It also has a unique index on: (hist_date, id, hist_type) and this is the primary key.
The following query is at issue (in a stored proc, with x declared as TYPE_TABLE_OF_NUMBER):
select b.id BULK COLLECT into x from TBL_B b where b.parent_id = input_id;
select v.id from v_a v
where v.id in (select column_value from table(x))
and v.hist_date = input_date
and v.status_new = 'CLOSED';
This query ignores the index on id column when accessing TBL_A and instead does a range scan using the date to pick up all the rows for the date. Then it filters that set using the values from the array. However if I simply give the list of ids as a list of numbers the optimizer uses the index just fine:
select v.id from v_a v
where v.id in (123, 234, 345, 456, 567, 678, 789)
and v.hist_date = input_date
and v.status_new = 'CLOSED';
The problem also doesn't exist when going against TBL_A directly (and I have a workaround that does that, but it's not ideal.).Is there a way to get the optimizer to first retrieve the array values and use them as predicates when accessing the table? Or a good way to restructure the view to achieve this?
Oracle does not use the index because it assumes select column_value from table(x) returns 8168 rows.
Indexes are faster for retrieving small amounts of data. At some point it's faster to scan the whole table than repeatedly walk the index tree.
Estimating the cardinality of a regular SQL statement is difficult enough. Creating an accurate estimate for procedural code is almost impossible. But I don't know where they came up with 8168. Table functions are normally used with pipelined functions in data warehouses, a sorta-large number makes sense.
Dynamic sampling can generate a more accurate estimate and likely generate a plan that will use the index.
Here's an example of a bad cardinality estimate:
create or replace type type_table_of_number as table of number;
explain plan for
select * from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8168 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 8168 | 00:00:01 |
-------------------------------------------------------------------------
Here's how to fix it:
explain plan for select /*+ dynamic_sampling(2) */ *
from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 7 | 00:00:01 |
-------------------------------------------------------------------------
Note
-----
- dynamic statistics used: dynamic sampling (level=2)

Which field should I use with Oracle Partition By clause to improve performance

I have an update statement that works fine but takes a very long time to complete.
I'm updating roughly 150 rows in one table with some tens of thousands of rows exposed through a view. It's been suggested that I use the Partition By clause to speed up the process.
I'm not too familiar with Partition By statement but I've been looking around and I think maybe I need to use a field that has a numeric value that can be compared against.
Is this correct? Or can I partition the larger table with something else?
if that is the case I'm struggling with what in the larger table can be used. The table is composed as follows.
ID has a type of NUMBER and creates the unique id for a particular item.
Start_Date has a date type and indicates the start when the ID is valid.
End date has a date type and indicates the end time when the ID cease to be valid.
ID_Type is NVARCHAR2(30) and indicates what type of Identifier we are using.
ID_Type2 is NVARCHAR2(30) and indicates what sub_type of Identifier we are using.
Identifier is NVARCHAR2(30) and any one ID can be mapped to one or more Identifiers.
So for example - View_ID
ID | Start_Date | End_Date | ID_Type1| ID_Type2 | Identifier
1 | 2012-01-01 | NULL | Primary | Tertiary | xyz1
1 | 2012-01-01 | NULL | Second | Alpha | abc2
2 | 2012-01-01 | 2012-01-31 | Primary | Tertiary | ghv2
2 | 2012-02-01 | NULL | Second | Alpha | mno4
Would it be possible to Partition By the ID field of this view as long as there is a clause that the id is valid by date?
The update statement is quite basic although it selects against one of several possible identifiers and and ID_Type1's.
UPDATE Temp_Table t set ID =
(SELECT DISTINCT ID FROM View_ID v
WHERE inDate BETWEEN Start_Date and End_Date
AND v.Identifier = (NVL(t.ID1, NVL(t.ID2, t.ID3)))
AND v.ID_Type1 in ('Primary','Secondary'));
Thanks in advance for any advice on any aspect of my question.
Additional Info ***
After investigating and following Gordon's advice I changed the update to three updates. This reduced the overall update process 75% going from just over a minute to just over 20 seconds. Thats a big improvement but I'd like to reduce the process even more if possible.
Does anyone think that Partition By clause would help even further? If so what would be the correct method for putting this clause into an update statement. I'm honestly not sure if I understand how this clause operates.
If the UPDATE using a SELECT statement only allows for 1 value to be selected does this exclude something like the following from working?
UPDATE Temp_Table t SET t.ID =
(SELECT DISTINCT ID,
Row_Number () (OVER PARTITION BY ID_Type1) AS PT1
FROM View_ID v
WHERE inDate BETWEEN v.Start_Date and v.End_Date
AND v.Identifier = t.ID1
AND PT1.Row_Number = 1 )
*Solution************
I combined advice from both Responders below to dramatically improve performance. From Gordon I removed the NVL from my UPDATE and changed it to three separate updates. (I'd prefer to combine them into a case but my trials were still slow.)
From Eggi, I looked working with some kind of Materialized view that I can actually index myself and settled on a WITH Clause.
UPDATE Temp_Table t set ID =
(WITH IDs AS (SELECT /*+ materialize */ DISTINCT ID, Identifier FROM View_ID v
WHERE inDate BETWEEN Start_Date and End_Date
AND v.Identifier = ID1)
SELECT g.ID FROM IDs g
WHERE g.Identifier = t.ID1;
Thanks again.
It is very hard to imagine how windows/analytic functions would help with this update. I do highly recommend that you learn them, but not for this purpose.
Perhaps the suggestion was for partitioning the table space, used for the table. Note that this is very different from the "partition by" statement, which usually refers to window/analytic functions. Tablespace partitioning might help performance. However, here is something else you can try.
I think your problem is the join between the temp table and the view. Presumably, you are creating the temporary table. You should add in a new column, say UsedID, with the definition:
coalesce(t.ID1, t.ID2, t.ID3) as UsedId
The "WHERE" clause in the update would then be:
WHERE inDate BETWEEN Start_Date and End_Date AND
v.Identifier = t.UsedId AND
v.ID_Type1 in ('Primary', 'Secondary')
I suspect that the performance problem is the use of NVL in the join, which interferes with optimization strategies.
In response to your comment . . . your original query would have the same problem as this version. Perhaps the logic you want is:
WHERE inDate BETWEEN Start_Date and End_Date AND
v.Identifier in (t.ID1, t.ID2, t.ID3) AND
v.ID_Type1 in ('Primary', 'Secondary')
The best option for partitioning seems to be the start date, because it seems to always have a value and you also get it as input parameter in your query.
If you have not already done that I would add a bitmap index on ID_Type1.

optimize mysql count query

Is there a way to optimize this further or should I just be satisfied that it takes 9 seconds to count 11M rows ?
devuser#xcmst > mysql --user=user --password=pass -D marctoxctransformation -e "desc record_updates"
+--------------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------+------+-----+---------+-------+
| record_id | int(11) | YES | MUL | NULL | |
| date_updated | datetime | YES | MUL | NULL | |
+--------------+----------+------+-----+---------+-------+
devuser#xcmst > date; mysql --user=user --password=pass -D marctoxctransformation -e "select count(*) from record_updates where date_updated > '2009-10-11 15:33:22' "; date
Thu Dec 9 11:13:17 EST 2010
+----------+
| count(*) |
+----------+
| 11772117 |
+----------+
Thu Dec 9 11:13:26 EST 2010
devuser#xcmst > mysql --user=user --password=pass -D marctoxctransformation -e "explain select count(*) from record_updates where date_updated > '2009-10-11 15:33:22' "
+----+-------------+----------------+-------+--------------------------------------------------------+--------------------------------------------------------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+-------+--------------------------------------------------------+--------------------------------------------------------+---------+------+----------+--------------------------+
| 1 | SIMPLE | record_updates | index | idx_marctoxctransformation_record_updates_date_updated | idx_marctoxctransformation_record_updates_date_updated | 9 | NULL | 11772117 | Using where; Using index |
+----+-------------+----------------+-------+--------------------------------------------------------+--------------------------------------------------------+---------+------+----------+--------------------------+
devuser#xcmst > mysql --user=user --password=pass -D marctoxctransformation -e "show keys from record_updates"
+----------------+------------+--------------------------------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------+------------+--------------------------------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| record_updates | 1 | idx_marctoxctransformation_record_updates_date_updated | 1 | date_updated | A | 2416 | NULL | NULL | YES | BTREE | |
| record_updates | 1 | idx_marctoxctransformation_record_updates_record_id | 1 | record_id | A | 11772117 | NULL | NULL | YES | BTREE | |
+----------------+------------+--------------------------------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
If mysql has to count 11M rows, there really isn't much of a way to speed up a simple count. At least not to get it to a sub 1 second speed. You should rethink how you do your count. A few ideas:
Add an auto increment field to the table. It looks you wouldn't delete from the table, so you can use simple math to find the record count. Select the min auto increment number for the initial earlier date and the max for the latter date and subtract one from the other to get the record count. For example:
SELECT min(incr_id) min_id FROM record_updates WHERE date_updated BETWEEN '2009-10-11 15:33:22' AND '2009-10-12 23:59:59';
SELECT max(incr_id) max_id FROM record_updates WHERE date_updated > DATE_SUB(NOW(), INTERVAL 2 DAY);`
Create another table summarizing the record count for each day. Then you can query that table for the total records. There would only be 365 records for each year. If you need to get down to more fine grained times, query the summary table for full days and the current table for just the record count for the start and end days. Then add them all together.
If the data isn't changing, which it doesn't seem like it is, then summary tables will be easy to maintain and update. They will significantly speed things up.
Since >'2009-10-11 15:33:22' contains most of the records,
I would suggest to do a reverse matching like <'2009-10-11 15:33:22' (mysql work less harder and less rows involved)
select
TABLE_ROWS -
(select count(*) from record_updates where add_date<"2009-10-11 15:33:22")
from information_schema.tables
where table_schema = "marctoxctransformation" and table_name="record_updates"
You can combine with programming language (like bash shell)
to make this calculation a bit smarter...
such as do execution plan first to calculate which comparison will use lesser row
From my testing (around 10M records), the normal comparison takes around 3s,
and now cut-down to around 0.25s
MySQL doesn't "optimize" count(*) queries in InnoDB because of versioning. Every item in the index has to be iterated over and checked to make sure that the version is correct for display (e.g., not an open commit). Since any of your data can be modified across the database, ranged selects and caching won't work. However, you possibly can get by using triggers. There are two methods to this madness.
This first method risks slowing down your transactions since none of them can truly run in parallel: use after insert and after delete triggers to increment / decrement a counter table. Second trick: use those insert / delete triggers to call a stored procedure which feeds into an external program which similarly adjusts values up and down, or acts upon a non-transactional table. Beware that in the event of a rollback, this will result in inaccurate numbers.
If you don't need an exact numbers, check out this query:
select table_rows from information_schema.tables
where table_name = 'foo';
Example difference: count(*): 1876668, table_rows: 1899004. The table_rows value is an estimation, and you'll get a different number every time even if you database doesn't change.
For my own curiosity: do you need exact numbers that are updated every second? IF so, why?
If the historical data is not volatile, create a summary table. There are various approaches, the one to choose will depend on how your table is updated, and how often.
For example, assuming old data is rarely/never changed, but recent data is, create a monthly summary table, populated for the previous month at the end of each month (eg insert January's count at the end of February). Once you have your summary table, you can add up the full months and the part months at the beginning and end of the range:
select count(*)
from record_updates
where date_updated >= '2009-10-11 15:33:22' and date_updated < '2009-11-01';
select count(*)
from record_updates
where date_updated >= '2010-12-00';
select sum(row_count)
from record_updates_summary
where date_updated >= '2009-11-01' and date_updated < '2010-12-00';
I've left it split out above for clarity but you can do this in one query:
select ( select count(*)
from record_updates
where date_updated >= '2010-12-00'
or ( date_updated>='2009-10-11 15:33:22'
and date_updated < '2009-11-01' ) ) +
( select count(*)
from record_updates
where date_updated >= '2010-12-00' );
You can adapt this approach for make the summary table based on whole weeks or whole days.
You should add an index on the 'date_updated' field.
Another thing you can do if you don't mind changing the structure of the table, is to use the timestamp of the date in 'int' instead of 'datetime' format, and it might be even faster.
If you decide to do so, the query will be
select count(date_updated) from record_updates where date_updated > 1291911807
There is no primary key in your table. It's possible that in this case it always scans the whole table. Having a primary key is never a bad idea.
If you need to return the total table's row count, then there is an alternative to the
SELECT COUNT(*) statement which you can use. SELECT COUNT(*) makes a full table scan to return the total table's row count, so it can take a long time. You can use the sysindexes system table instead in this case. There is a ROWS column in the sysindexes table. This column contains the total row count for each table in your database. So, you can use the following select statement instead of SELECT COUNT(*):
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2
This can improve the speed of your query.
EDIT: I have discovered that my answer would be correct if you were using a SQL Server database. MySQL databases do not have a sysindexes table.
It depends on a few things but something like this may work for you
im assuming this count never changes as it is in the past so the result can be cached somehow
count1 = "select count(*) from record_updates where date_updated <= '2009-10-11 15:33:22'"
gives you the total count of records in the table,
this is an approximate value in innodb table so BEWARE, depends on engine
count2 = "select table_rows from information_schema.`TABLES` where table_schema = 'marctoxctransformation' and TABLE_NAME = 'record_updates'"
your answer
result = count2 - count1
There are a few details I'd like you to clarify (would put into comments on the q, but it is actually easier to remove from here when you update your question).
What is the intended usage of data, insert once and get the counts many times, or your inserts and selects are approx on par?
Do you care about insert/update performance?
What is the engine used for the table? (heck you can do SHOW CREATE TABLE ...)
Do you need the counts to be exact or approximately exact (like 0.1% correct)
Can you use triggers, summary tables, change schema, change RDBMS, etc.. or just add/remove indexes?
Maybe you should explain also what is this table supposed to be? You have record_id with cardinality that matches the number of rows, so is it PK or FK or what is it? Also the cardinality of the date_updated suggests (though not necessarily correct) that it has same values for ~5,000 records on average), so what is that? - it is ok to ask a SQL tuning question with not context, but it is also nice to have some context - especially if redesigning is an option.
In the meantime, I'll suggest you to get this tuning script and check the recommendations it will give you (it's just a general tuning script - but it will inspect your data and stats).
Instead of doing count(*), try doing count(1), like this:-
select count(1) from record_updates where date_updated > '2009-10-11 15:33:22'
I took a DB2 class before, and I remember the instructor mentioned about doing a count(1) when we just want to count number of rows in the table regardless the data because it is technically faster than count(*). Let me know if it makes a difference.
NOTE: Here's a link you might be interested to read: http://www.mysqlperformanceblog.com/2007/04/10/count-vs-countcol/

How to use a function-based index on a column that contains NULLs in Oracle 10+?

Lets just say you have a table in Oracle:
CREATE TABLE person (
id NUMBER PRIMARY KEY,
given_names VARCHAR2(50),
surname VARCHAR2(50)
);
with these function-based indices:
CREATE INDEX idx_person_upper_given_names ON person (UPPER(given_names));
CREATE INDEX idx_person_upper_last_name ON person (UPPER(last_name));
Now, given_names has no NULL values but for argument's sake last_name does. If I do this:
SELECT * FROM person WHERE UPPER(given_names) LIKE 'P%'
the explain plan tells me its using the index but change it to:
SELECT * FROM person WHERE UPPER(last_name) LIKE 'P%'
it doesn't. The Oracle docs say that to use the function-based index will only be used when several conditions are met, one of which is ensuring there are no NULL values since they aren't indexed.
I've tried these queries:
SELECT * FROM person WHERE UPPER(last_name) LIKE 'P%' AND UPPER(last_name) IS NOT NULL
and
SELECT * FROM person WHERE UPPER(last_name) LIKE 'P%' AND last_name IS NOT NULL
In the latter case I even added an index on last_name but no matter what I try it uses a full table scan. Assuming I can't get rid of the NULL values, how do I get this query to use the index on UPPER(last_name)?
The index can be used, though the optimiser may have chosen not to use it for your particular example:
SQL> create table my_objects
2 as select object_id, object_name
3 from all_objects;
Table created.
SQL> select count(*) from my_objects;
2 /
COUNT(*)
----------
83783
SQL> alter table my_objects modify object_name null;
Table altered.
SQL> update my_objects
2 set object_name=null
3 where object_name like 'T%';
1305 rows updated.
SQL> create index my_objects_name on my_objects (lower(object_name));
Index created.
SQL> set autotrace traceonly
SQL> select * from my_objects
2 where lower(object_name) like 'emp%';
29 rows selected.
Execution Plan
----------------------------------------------------------
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17 | 510 | 355 (1)|
| 1 | TABLE ACCESS BY INDEX ROWID| MY_OBJECTS | 17 | 510 | 355 (1)|
|* 2 | INDEX RANGE SCAN | MY_OBJECTS_NAME | 671 | | 6 (0)|
------------------------------------------------------------------------------------
The documentation you read was presumably pointing out that, just like any other index, all-null keys are not stored in the index.
In your example you've created the same index twice - this would give an error so I'm assuming that was a mistake in pasting, not the actual code you tried.
I tried it with
CREATE INDEX idx_person_upper_surname ON person (UPPER(surname));
SELECT * FROM person WHERE UPPER(surname) LIKE 'P%';
and it produced the expected query plan:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1 Card=1 Bytes=67)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'PERSON' (TABLE) (Cost=1
Card=1 Bytes=67)
2 1 INDEX (RANGE SCAN) OF 'IDX_PERSON_UPPER_SURNAME' (INDEX)
(Cost=1 Card=1)
To answer your question, yes it should work. Try double checking that you do have the second index created correctly.
Also try an explicit hint:
SELECT /*+INDEX(PERSON IDX_PERSON_UPPER_SURNAME)*/ *
FROM person
WHERE UPPER(surname) LIKE 'P%';
If that works, but only with the hint, then it is likely related to CBO statistics gone wrong, or CBO related init parameters.
Are you sure you want the index to be used? Full table scans are not bad. Depending on the size of the table, it might be more efficient to do a table scan than use an index. It also depends on the density and distribution of the data, which is why statistics are gathered. The cost based optimizer can usually be trusted to make the right choice. Unless you have a specific performance problem, I wouldn't worry too much about it.
Oracle will still use a function-based indexes with columns that contain null - I think you misinterpreted the documentation.
You need to put a nvl in the function index if you want to check for this though.
Something like...
create index idx_person_upper_surname on person (nvl(upper(surname),'N/A'));
You can then query using the index with
select * from person where nvl(upper(surname),'N/A') = 'PIERPOINT'
Although, all a bit ugly. Since most people have surnames, perhaps a "not null" is appropriate :-).
You can circumvent the problem of null values being unindexed in this or other situations by also indexing based on a literal value:
CREATE INDEX idx_person_upper_surname ON person (UPPER(surname),0);
This allows you to use the index for such queries as:
Select *
From person
Where UPPER(surname) is null;
This query would normally not usa an index, except for bitmap indexes or indexes including a nonnullable real column other than surname.