Non-index column in join - SQL

SQL> desc emp_1;
Name Type Nullable Default Comments
-------- ------------ -------- ------- --------
EMP_ID NUMBER
EMP_NAME VARCHAR2(20) Y
DEPTNO NUMBER(10) Y
SQL> desc dept
Name Type Nullable Default Comments
--------- ------------ -------- ------- --------
DEPT_ID NUMBER Y
DEPT_NAME VARCHAR2(20) Y
SQL> CREATE INDEX abc_idex ON emp_1 (deptno);
Index created
select /*+ index(emp_1 abc_idex) */ emp_name from emp_1
INNER JOIN dept ON emp_1.deptno = dept.dept_id
Explain Plan:
SELECT STATEMENT, GOAL = ALL_ROWS 271 100000 800000
MERGE JOIN 271 100000 800000
TABLE ACCESS BY INDEX ROWID EXAMINBI EMP_1 267 100000 500000
INDEX FULL SCAN EXAMINBI ABC_IDEX 131 100000
SORT JOIN 4 4 12
TABLE ACCESS FULL EXAMINBI DEPT 3 4 12
select /*+ index(emp_1 abc_idex) */ emp_name from emp_1
INNER JOIN dept ON emp_1.deptno = dept.dept_id
and emp_1.emp_name=dept.dept_name
Explain Plan:
SELECT STATEMENT, GOAL = ALL_ROWS 272 1 11
HASH JOIN 272 1 11
TABLE ACCESS FULL EXAMINBI DEPT 3 4 24
TABLE ACCESS BY INDEX ROWID EXAMINBI EMP_1 267 100000 500000
INDEX FULL SCAN EXAMINBI ABC_IDEX 131 100000
I am trying to clear up my understanding of indexes with your help. My understanding was that Oracle would skip my index hint, since the join also needs another column (emp_name) which is not indexed, but the emp_1 table was still scanned via the index in the second case. My question: will the index help in a case like this, where the join also uses a column the index does not cover (in our example, emp_name)? Should we use an index hint in such a case?
*Note: I know joining emp_name to dept_name is not a logical join; I created it purely for testing purposes.*

I want to know whether it is recommended to use an index hint when the join also uses non-indexed columns from the same table. Will it help?
Under most circumstances, no.
Under normal circumstances you simply do not use hints. As you can see here, you've used a hint, Oracle has followed it, and it has done something dumb. You only use hints in very limited circumstances, usually only when you know something about the nature of the data that Oracle cannot work out itself. Generally the only hint I use is the cardinality hint, as Oracle can sometimes genuinely not work out the cardinality correctly.
Do not assume that you need to regularly use hints. You don't. Even if a hint works now, it might stop working when the nature of the data changes.
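For illustration, a hedged sketch of that cardinality hint against the tables above (the alias and the row estimate are invented; the hint tells the optimizer to assume emp_1 returns roughly ten rows):
select /*+ cardinality(e 10) */ e.emp_name
from emp_1 e
INNER JOIN dept d ON e.deptno = d.dept_id;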

In your case, using the index probably slows down the whole statement, because you are querying the whole of both DEPT and EMP_1. Because of the hint, Oracle has to read both full tables AND the index. Do you really want that?
In simple cases like this I prefer not to use hints. The optimizer does its job quite well.
If you restrict the statement to a specific department, the result would be better:
select emp_name
from emp_1 INNER JOIN dept ON emp_1.deptno = dept.dept_id
where dept.dept_name = 'any department'

and so:
select /*+ cardinality(0)*/ emp_name
from emp_1
INNER JOIN dept ON emp_1.deptno = dept.dept_id

Related

How to improve an Update query in Oracle

I'm trying to update two columns in an archaic Oracle database, but the query simply doesn't finish and nothing is updated. Any ideas to improve the query or something else that can be done? I don't have DBA skills/knowledge and unsure if indexing would help, so would appreciate comments in that area, too.
PERSON table: This table has 200 million distinct person_ids; there are no duplicates. person_id is numeric, and I am trying to update the favorite_color and color_confidence columns, which are varchar2 and currently NULLed out.
person table
person_id favorite_color color_confidence many_other_columns
222
333
444
TEMP_COLOR_CONFIDENCE table: I'm trying to get favorite_color and color_confidence from this table and apply them to the PERSON table. This table has 150 million distinct persons, again nothing duplicated.
temp_color_confidence
person_id favorite_color color_confidence
222 R H
333 Y L
444 G M
This is my update query, which I realize only updates those found in both tables. Eventually I'll need to update the remaining 50 million with "U" -- unknown. Solving that in one shot would be ideal too, but currently just concerned that I'm not able to get this query to complete.
UPDATE person p
SET (favorite_color, color_confidence) =
(SELECT t.favorite_color, t.color_confidence
FROM temp_color_confidence t
WHERE p.person_id = t.person_id)
WHERE EXISTS (
SELECT 1
FROM temp_color_confidence t
WHERE p.person_id = t.person_id );
Here's where my ignorance shines... would indexing on person_id help, considering the values are all distinct anyway? Would indexing on favorite_color help? There are fewer than 10 colors and only 3 confidence values.
For every person, it has to find the corresponding row in temp_color_confidence. The way to do that with the least I/O is to scan each table once and crunch them together in a single hash join, ideally all in memory. Indexes are unlikely to help with that, unless maybe temp_color_confidence is very wide and verbose and has an index on (person_id, favorite_color, color_confidence) which the optimiser can treat as a skinny table.
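A hedged sketch of that covering index (the index name is invented; it is only worth trying if temp_color_confidence is much wider than these three columns):
create index tcc_covering_ix on temp_color_confidence (person_id, favorite_color, color_confidence);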
Using merge might be more efficient as it can avoid the second scan of temp_color_confidence:
merge into person p
using temp_color_confidence t
on (p.person_id = t.person_id)
when matched then update
set p.favorite_color = t.favorite_color, p.color_confidence = t.color_confidence;
If you are going to update every row in the table, though, you might consider instead creating a new table containing all the values you need:
create table person2
( person_id, favorite_color, color_confidence )
pctfree 0 compress
as
select p.person_id, nvl(t.favorite_color,'U'), nvl(t.color_confidence,'U')
from person p
left join temp_color_confidence t
on t.person_id = p.person_id;
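If you take the new-table route, a hedged sketch of the subsequent swap, assuming you can take an outage and recreate PERSON's constraints, indexes and grants on the new table:
alter table person rename to person_old;
alter table person2 rename to person;
-- recreate the primary key, indexes and grants on the renamed table, then:
drop table person_old;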

Improve join query in Oracle

I have a query which takes 17 seconds to execute. I have applied indexes on FIPS, STR_DT, and END_DT, but it is still taking that long. Any suggestions on how I can improve the performance?
My query:
SELECT /*+ALL_ROWS*/ K_LF_SVA_VA.NEXTVAL VAL_REC_ID, a.REC_ID,
b.VID,
1 VA_SEQ,
51 VA_VALUE_DATATYPE,
b.VALUE VAL_NUM,
SYSDATE CREATED_DATE,
SYSDATE UPDATED_DATE
FROM CTY_REC a JOIN FIPS_CONS b
ON a.FIPS=b.FIPS AND a.STR_DT=b.STR_DT AND a.END_DT=b.END_DT;
DESC CTY_REC;
Name Null Type
------------------- ---- -------------
REC_ID NUMBER(38)
DATA_SOURCE_DATE DATE
STR_DT DATE
END_DT DATE
VID_RECSET_ID NUMBER
VID_VALSET_ID NUMBER
FIPS VARCHAR2(255)
DESC FIPS_CONS;
Name Null Type
------------- -------- -------------
STR_DT DATE
END_DT DATE
FIPS VARCHAR2(255)
VARIABLE VARCHAR2(515)
VALUE NUMBER
VID NOT NULL NUMBER
Explain Plan:
Plan hash value: 919279614
--------------------------------------------------------------
| Id | Operation | Name |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | SEQUENCE | K_VAL |
| 2 | HASH JOIN | |
| 3 | TABLE ACCESS FULL| CTY_REC |
| 4 | TABLE ACCESS FULL| FIPS_CONS |
--------------------------------------------------------------
I have added description of tables and explain plan for my query.
On the face of it, and without information on the configuration of the sequence you're using, the number of rows in each table, and the total number of rows projected from the query, it's possible that the execution plan you have is the most efficient one for returning all rows.
The optimiser clearly thinks that the indexes will not benefit performance, and this is often more likely when you optimise for all rows, not first rows. Index-based access is single block and one row at a time, so can be inherently slower than multiblock full scans on a per-block basis.
The hash join that Oracle is using is an extremely efficient way of joining data sets. Unless the hashed table is so large that it spills to disk, the total cost is only slightly more than full scans of the two tables. We need more detailed statistics on the execution to be able to tell if the hashed table is spilling to disk, and if it is the solution may just be modified memory management, not indexes.
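One hedged way to get that execution detail, assuming a version with DBMS_XPLAN.DISPLAY_CURSOR (10g onwards) and that you can re-run the statement, is to re-run the query with a /*+ gather_plan_statistics */ hint and then, in the same session, report the actual statistics; the 'ALLSTATS LAST' format includes memory and temp-space columns that show whether the hash join spilled:
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));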
What might also hold up your SQL execution is calling that sequence, if the sequence's cache value is very low and the number of records is high. More info required on that -- if you need to generate a sequential identifier for each row then you could use ROWNUM.
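If the sequence does turn out to be the bottleneck, raising its cache is a one-line change; a hedged example, with the sequence name taken from the query above and an arbitrary cache size:
alter sequence K_LF_SVA_VA cache 1000;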
This is basically your query:
SELECT . . .
FROM CTY_REC a JOIN
FIPS_CONS b
ON a.FIPS = b.FIPS AND a.STR_DT = b.STR_DT AND a.END_DT = b.END_DT;
You want a composite index on (FIPS, STR_DT, END_DT), perhaps on both tables:
create index idx_cty_rec_3 on cty_rec(FIPS, STR_DT, END_DT);
create index idx_fips_cons_3 on fips_cons(FIPS, STR_DT, END_DT);
Actually, only one is probably necessary but having both gives the optimizer more choices for improving the query.
You should have at least these two indexes on the table:
CTY_REC(FIPS, STR_DT, END_DT)
FIPS_CONS(FIPS, STR_DT, END_DT)
which can still be sped up with covering indexes instead:
CTY_REC(FIPS, STR_DT, END_DT, REC_ID)
FIPS_CONS(FIPS, STR_DT, END_DT, VALUE, VID)
If you wish to drive the optimizer to use the indexes,
replace /*+ all_rows */ with /*+ first_rows */
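As a sketch, the rewritten statement would look like this (everything except the hint is unchanged from the original query):
SELECT /*+ FIRST_ROWS */ K_LF_SVA_VA.NEXTVAL VAL_REC_ID, a.REC_ID,
b.VID,
1 VA_SEQ,
51 VA_VALUE_DATATYPE,
b.VALUE VAL_NUM,
SYSDATE CREATED_DATE,
SYSDATE UPDATED_DATE
FROM CTY_REC a JOIN FIPS_CONS b
ON a.FIPS=b.FIPS AND a.STR_DT=b.STR_DT AND a.END_DT=b.END_DT;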

Optimize subquery

Suppose there are three columns: ename, city, salary. There are millions of rows in this table, named emp.
ename city salary
ak newyork $5000
bk abcd $4000
ck Delhi $4000
....................
...................
Maverick newyork $8000
I want to retrieve all employees having the same city name as Maverick.
select * from emp where
city = (select city from emp where ename= 'maverick' )
I know it will work, but for performance reasons, this query is not good because there are two where clauses present in this query.
I need a query having better performance than above query.
Oracle is probably going to do a good job getting the optimal execution plan for this query:
select *
from emp
where city = (select city from emp where ename= 'maverick' ) ;
What would help the query are two indexes:
create index idx_emp_ename_city on emp(ename, city);
create index idx_emp_city on emp(city);
The first would be used for the subquery. The second to look up all the matching rows. Without indexes, Oracle is going to have to read the table at least once (I think at least twice) and that is going to affect performance on such a large table.
This would give you the same output but I doubt it will perform any better.
You could compare the plans though.
select x.*
from emp x
join (select city from emp where ename = 'maverick') y
on x.city = y.city
You can also add 2 indexes, one on the ENAME column, and a separate one on the CITY column.
create index emp_idx_ename on emp(ename);
create index emp_idx_city on emp(city);
The first index will speed up the inline view whose results are being joined to, because it is searching the table on employee.
The second index will speed up the parent query, because it is searching the table for a given city.
You could create a composite index on emp(ename, city) as others have suggested, since you're selecting only the city column where the ename is X, allowing the query in the inline view to use only the index and not the table, which I didn't initially think of. It may provide an additional boost, more or less, depending on the size of the table, although the index will also be larger.
To make sure the optimizer immediately has up-to-date statistics for that table and its new indexes, I would also run the following after you create the above indexes, so that your query will immediately start using them:
analyze table emp compute statistics;
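On later Oracle versions, ANALYZE ... COMPUTE STATISTICS is deprecated for gathering optimizer statistics; a hedged equivalent using the supported DBMS_STATS package (assuming the table is in your own schema) is:
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'EMP', cascade => true);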
You could use a WITH statement... other users have suggested many solutions:
WITH new_city_tab AS (
SELECT city AS ncity
FROM emp WHERE ename='Maverick'
GROUP BY city)
SELECT *
FROM emp e,
new_city_tab c
WHERE E.city = c.ncity;
Sometimes complexity wins out over the desire to narrow the query down further; it just isn't possible to optimize this query much more. You could opt to add indexes to get better performance. The indexes should go on city and ename.
Try this to create these indexes:
create index emp_city -- for the outer where clause
on emp
( city
);
create index emp_ename_city -- for the sub query
on emp
( ename
, city
);

Efficient SQL 2000 Query for Selecting Preferred Candy

(I wish I could have come up with a more descriptive title... suggest one or edit this post if you can name the type of query I'm asking about)
Database: SQL Server 2000
Sample Data (assume 500,000 rows):
Name Candy PreferenceFactor
Jim Chocolate 1.0
Brad Lemon Drop .9
Brad Chocolate .1
Chris Chocolate .5
Chris Candy Cane .5
499,995 more rows...
Note that the number of rows with a given 'Name' is unbounded.
Desired Query Results:
Jim Chocolate 1.0
Brad Lemon Drop .9
Chris Chocolate .5
~250,000 more rows...
(Since Chris has equal preference for Candy Cane and Chocolate, a consistent result is adequate).
Question:
How do I select Name, Candy from the data such that each resulting row contains a unique Name and the Candy selected has the highest PreferenceFactor for that Name? (Speedy, efficient answers preferred.)
What indexes are required on the table? Does it make a difference if Name and Candy are integer keys into other tables (aside from requiring some joins)?
You will find that the following query outperforms every other answer given, as it works with a single scan. This simulates MS Access's First and Last aggregate functions, which is basically what you are doing.
Of course, you'll probably have foreign keys instead of names in your CandyPreference table. To answer your question, it is in fact very much best if Candy and Name are foreign keys into another table.
If there are other columns in the CandyPreferences table, then having a covering index that includes the involved columns will yield even better performance. Making the columns as small as possible will increase the rows per page and again increase performance. If you are most often doing the query with a WHERE condition to restrict rows, then an index that covers the WHERE conditions becomes important.
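A hedged sketch of such a covering index (the index name is invented; SQL Server 2000 has no INCLUDE clause, so all three columns go into the key):
CREATE INDEX IX_CandyPreference_Covering
ON CandyPreference ([Name], PreferenceFactor, Candy)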
Peter was on the right track for this, but had some unneeded complexity.
CREATE TABLE #CandyPreference (
[Name] varchar(20),
Candy varchar(30),
PreferenceFactor decimal(11, 10)
)
INSERT #CandyPreference VALUES ('Jim', 'Chocolate', 1.0)
INSERT #CandyPreference VALUES ('Brad', 'Lemon Drop', .9)
INSERT #CandyPreference VALUES ('Brad', 'Chocolate', .1)
INSERT #CandyPreference VALUES ('Chris', 'Chocolate', .5)
INSERT #CandyPreference VALUES ('Chris', 'Candy Cane', .5)
SELECT
[Name],
Candy = Substring(PackedData, 13, 30),
PreferenceFactor = Convert(decimal(11,10), Left(PackedData, 12))
FROM (
SELECT
[Name],
PackedData = Max(Convert(char(12), PreferenceFactor) + Candy)
FROM #CandyPreference
GROUP BY [Name]
) X
DROP TABLE #CandyPreference
I actually don't recommend this method unless performance is critical. The "canonical" way to do it is OrbMan's standard Max/GROUP BY derived table and then a join to it to get the selected row. Though, that method starts to become difficult when there are several columns that participate in the selection of the Max, and the final combination of selectors can be duplicated, that is, when there is no column to provide arbitrary uniqueness as in the case here where we use the name if the PreferenceFactor is the same.
Edit: It's probably best to give some more usage notes to help improve clarity and to help people avoid problems.
As a general rule of thumb, when trying to improve query performance, you can do a LOT of extra math if it will save you I/O. Saving an entire table seek or scan speeds up the query substantially, even with all the converts and substrings and so on.
Due to precision and sorting issues, use of a floating point data type is probably a bad idea with this method. Though unless you are dealing with extremely large or small numbers, you shouldn't be using float in your database anyway.
The best data types are those that are not packed and sort in the same order after conversion to binary or char. Datetime, smalldatetime, bigint, int, smallint, and tinyint all convert directly to binary and sort correctly because they are not packed. With binary, avoid left() and right(), use substring() to get the values reliably returned to their originals.
I took advantage of Preference having only one digit in front of the decimal point in this query, allowing conversion straight to char since there is always at least a 0 before the decimal point. If more digits are possible, you would have to decimal-align the converted number so things sort correctly. Easiest might be to multiply your Preference rating so there is no decimal portion, convert to bigint, and then convert to binary(8). In general, conversion between numbers is faster than conversion between char and another data type, especially with date math.
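As a hedged illustration of that scaling approach (the 10^10 scale factor matches decimal(11,10); the variable names are invented):
-- pack: scale away the decimal portion, go through bigint to binary(8),
-- which sorts the same as the original number
DECLARE @pf decimal(11,10), @packed binary(8)
SET @pf = 0.9876543210
SET @packed = CONVERT(binary(8), CONVERT(bigint, @pf * 10000000000))
-- unpack: reverse the conversion and the scaling
SELECT CONVERT(decimal(11,10), CONVERT(bigint, @packed) / 10000000000.)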
Watch out for nulls. If there are any, you must convert them to something and then back.
select c.Name, max(c.Candy) as Candy, max(c.PreferenceFactor) as PreferenceFactor
from Candy c
inner join (
select Name, max(PreferenceFactor) as MaxPreferenceFactor
from Candy
group by Name
) cm on c.Name = cm.Name and c.PreferenceFactor = cm.MaxPreferenceFactor
group by c.Name
order by PreferenceFactor desc, Name
I tried:
SELECT X.PersonName,
(
SELECT TOP 1 Candy
FROM CandyPreferences
WHERE PersonName=X.PersonName AND PreferenceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonName, MAX(PreferenceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonName
) AS X
This seems to work, though I can't speak to efficiency without real data and a realistic load.
I did create a primary key over PersonName and Candy, though. Using SQL Server 2008 and no additional indexes shows it using two clustered index scans though, so it could be worse.
I played with this a bit more because I needed an excuse to play with the Data Generation Plan capability of "datadude". First, I refactored the one table to have separate tables for candy names and person names. I did this mostly because it allowed me to use the test data generation without having to read the documentation. The schema became:
CREATE TABLE [Candies](
[CandyID] [int] IDENTITY(1,1) NOT NULL,
[Candy] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_Candies] PRIMARY KEY CLUSTERED
(
[CandyID] ASC
),
CONSTRAINT [UC_Candies] UNIQUE NONCLUSTERED
(
[Candy] ASC
)
)
GO
CREATE TABLE [Persons](
[PersonID] [int] IDENTITY(1,1) NOT NULL,
[PersonName] [nvarchar](100) NOT NULL,
CONSTRAINT [PK_Preferences.Persons] PRIMARY KEY CLUSTERED
(
[PersonID] ASC
)
)
GO
CREATE TABLE [CandyPreferences](
[PersonID] [int] NOT NULL,
[CandyID] [int] NOT NULL,
[PrefernceFactor] [real] NOT NULL,
CONSTRAINT [PK_CandyPreferences] PRIMARY KEY CLUSTERED
(
[PersonID] ASC,
[CandyID] ASC
)
)
GO
ALTER TABLE [CandyPreferences]
WITH CHECK ADD CONSTRAINT [FK_CandyPreferences_Candies] FOREIGN KEY([CandyID])
REFERENCES [Candies] ([CandyID])
GO
ALTER TABLE [CandyPreferences]
CHECK CONSTRAINT [FK_CandyPreferences_Candies]
GO
ALTER TABLE [CandyPreferences]
WITH CHECK ADD CONSTRAINT [FK_CandyPreferences_Persons] FOREIGN KEY([PersonID])
REFERENCES [Persons] ([PersonID])
GO
ALTER TABLE [CandyPreferences]
CHECK CONSTRAINT [FK_CandyPreferences_Persons]
GO
The query became:
SELECT P.PersonName, C.Candy
FROM (
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonID
) AS X
) AS Y
INNER JOIN Persons P ON Y.PersonID = P.PersonID
INNER JOIN Candies C ON Y.TopCandy = C.CandyID
With 150,000 candies, 200,000 persons, and 500,000 CandyPreferences, the query took about 12 seconds and produced 200,000 rows.
The following result surprised me. I changed the query to remove the final "pretty" joins:
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonID
) AS X
This now takes two or three seconds for 200,000 rows.
Now, to be clear, nothing I've done here has been meant to improve the performance of this query: I considered 12 seconds to be a success. It now says it spends 90% of its time in a clustered index seek.
Comment on Emtucifor's solution (as I can't make regular comments)
I like this solution, but have some comments on how it could be improved (in this specific case).
Not much can be done if you have everything in one table, but having a few tables as in John Saunders' solution makes things a bit different.
As we are dealing with numbers in the [CandyPreferences] table, we can use a math operation instead of concatenation to get the max value.
I suggest making PreferenceFactor decimal instead of real, as I believe we don't need the range of the real data type here; further, I would suggest decimal(n,n) with n<10, so that only the decimal part is stored, in 5 bytes. Assuming decimal(3,3) is enough (1,000 levels of preference factor), we can simply do:
PackedData = Max(PreferenceFactor + CandyID)
Further, if we know we have fewer than 1,000,000 CandyIDs, we can add a cast:
PackedData = Max(Cast(PreferenceFactor + CandyID as decimal(9,3)))
allowing SQL Server to use 5 bytes in the temporary table.
Unpacking is easy and fast using the FLOOR function.
Niikola
-- ADDED LATER ---
I tested both solutions, John's and Emtucifor's (modified to use John's structure and my suggestions). I also tested with and without the joins.
Emtucifor's solution clearly wins, but the margins are not huge. It could be different if SQL Server had to perform some physical reads, but they were 0 in all cases.
Here are the queries:
SELECT
[PersonID],
CandyID = Floor(PackedData),
PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
FROM (
SELECT
[PersonID],
PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
FROM [z5CandyPreferences] With (NoLock)
GROUP BY [PersonID]
) X
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM z5CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy,
HighestPreference as PreferenceFactor
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM z5CandyPreferences
GROUP BY PersonID
) AS X
Select p.PersonName,
c.Candy,
y.PreferenceFactor
From z5Persons p
Inner Join (SELECT [PersonID],
CandyID = Floor(PackedData),
PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
FROM ( SELECT [PersonID],
PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
FROM [z5CandyPreferences] With (NoLock)
GROUP BY [PersonID]
) X
) Y on p.PersonId = Y.PersonId
Inner Join z5Candies c on c.CandyId=Y.CandyId
Select p.PersonName,
c.Candy,
y.PreferenceFactor
From z5Persons p
Inner Join (SELECT X.PersonID,
( SELECT TOP 1 cp.CandyId
FROM z5CandyPreferences cp
WHERE PersonID=X.PersonID AND cp.[PrefernceFactor]=X.HighestPreference
) CandyId,
HighestPreference as PreferenceFactor
FROM ( SELECT PersonID,
MAX(PrefernceFactor) AS HighestPreference
FROM z5CandyPreferences
GROUP BY PersonID
) AS X
) AS Y on p.PersonId = Y.PersonId
Inner Join z5Candies as c on c.CandyID=Y.CandyId
And the results:
TableName nRows
------------------ -------
z5Persons 200,000
z5Candies 150,000
z5CandyPreferences 497,445
Query Rows Affected CPU time Elapsed time
--------------------------- ------------- -------- ------------
Emtucifor (no joins) 183,289 531 ms 3,122 ms
John Saunders (no joins) 183,289 1,266 ms 2,918 ms
Emtucifor (with joins) 183,289 1,031 ms 3,990 ms
John Saunders (with joins) 183,289 2,406 ms 4,343 ms
Emtucifor (no joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 1 2,022
John Saunders (no joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 183,290 587,677
Emtucifor (with joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
Worktable 0 0
z5Candies 1 526
z5CandyPreferences 1 2,022
z5Persons 1 733
John Saunders (with joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 183292 587,912
z5Persons 3 802
Worktable 0 0
z5Candies 3 559
Worktable 0 0
You could use the following SELECT statements:
select Name,Candy,PreferenceFactor
from candyTable ct
where PreferenceFactor =
(select max(PreferenceFactor)
from candyTable where ct.Name = Name)
but with this select you will get "Chris" twice in your result set.
If you want to get the most preferred candy per user, then use:
select top 1 Name,Candy,PreferenceFactor
from candyTable ct
where name = #name
and PreferenceFactor=
(select max([PreferenceFactor])
from candyTable where name = #name )
I think changing name and candy to integer types might help you improve performance. You should also add indexes on both columns.
[Edit] changed ! to #
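A hedged sketch of those two indexes (index names invented; table and column names from the queries above):
CREATE INDEX IX_candyTable_name ON candyTable ([name])
CREATE INDEX IX_candyTable_candy ON candyTable (candy)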
SELECT Name, Candy, PreferenceFactor
FROM table AS a
WHERE NOT EXISTS(SELECT * FROM table AS b
WHERE b.Name = a.Name
AND (b.PreferenceFactor > a.PreferenceFactor OR (b.PreferenceFactor = a.PreferenceFactor AND b.Candy > a.Candy)))
select name, candy, max(preference)
from tablename
where candy=#candy
group by name, candy
order by name, candy
Usually, indexing is required on columns which are frequently included in a WHERE clause. In this case I would say indexes on the name and candy columns would be the highest priority.
Whether to have lookup tables for columns usually depends on the number of repeating values within the columns. Out of 250,000 rows, if there are only 50 values that repeat, you really should have an integer reference (foreign key) there. In this case a candy reference should be used, and a name reference really depends on the number of distinct people within the database.
I changed your column Name to PersonName to avoid any common reserved word conflicts.
SELECT PersonName, MAX(Candy) AS PreferredCandy, MAX(PreferenceFactor) AS Factor
FROM CandyPreference
GROUP BY PersonName
ORDER BY Factor DESC
SELECT d.Name, a.Candy, d.MaxPref
FROM myTable a, (SELECT Name, MAX(PreferenceFactor) AS MaxPref FROM myTable GROUP BY Name) as D
WHERE a.Name = d.Name AND a.PreferenceFactor = d.MaxPref
This should give you rows with matching PrefFactor for a given Name.
(e.g. if John has a MaxPref of 1 for both Lemon Drop and Chocolate).
Pardon my answer as I am writing it without SQL Query Analyzer.
Something like this would work:
select name
, candy = substring(preference,7,len(preference))
-- convert back to float/numeric
, factor = convert(float,substring(preference,1,5))/10
from (
select name,
preference = (
select top 1
-- convert from float/numeric to zero-padded fixed-width string
right('00000'+convert(varchar,convert(decimal(5,0),preferencefactor*10)),5)
+ ';' + candy
from candyTable b
where a.name = b.name
order by
preferencefactor desc
, candy
)
from (select distinct name from candyTable) a
) a
Performance should be decent with this method. Check your query plan.
TOP 1 ... ORDER BY in a correlated subquery allows us to specify arbitrary rules for which row we want returned per row in the outer query. In this case, we want the highest preference factor per name, with candy for tie-breaks.
Subqueries can only return one value, so we must combine candy and preference factor into one field. The semicolon is just for readability here, but in other cases, you might use it to parse the combined field with CHARINDEX in the outer query.
If you wanted full precision in the output, you could use this instead (assuming preferencefactor is a float):
convert(varchar,preferencefactor) + ';' + candy
And then parse it back with:
factor = convert(float,substring(preference,1,charindex(';',preference)-1))
candy = substring(preference,charindex(';',preference)+1,len(preference))
I also tested a ROW_NUMBER() version, and added an additional index:
Create index IX_z5CandyPreferences On z5CandyPreferences(PersonId,PrefernceFactor,CandyID)
Response times between Emtucifor's version and the ROW_NUMBER() version (with the index in place) are marginal, if different at all; the test should be repeated a number of times and the results averaged, but I do not expect any significant difference.
Here is query:
Select p.PersonName,
c.Candy,
y.PrefernceFactor
From z5Persons p
Inner Join (Select * from (Select cp.PersonId,
cp.CandyId,
cp.PrefernceFactor,
ROW_NUMBER() over (Partition by cp.PersonId Order by cp.PrefernceFactor desc, cp.CandyId desc) as hp
From z5CandyPreferences cp) X
Where hp=1) Y on p.PersonId = Y.PersonId
Inner Join z5Candies c on c.CandyId=Y.CandyId
and results with and without new index:
| Without index | With Index
----------------------------------------------
Query (Aff.Rows 183,290) |CPU time Elapsed time | CPU time Elapsed time
-------------------------- |-------- ------------ | -------- ------------
Emtucifor (with joins) |1,031 ms 3,990 ms | 890 ms 3,758 ms
John Saunders (with joins) |2,406 ms 4,343 ms | 1,735 ms 3,414 ms
ROW_NUMBER() (with joins) |2,094 ms 4,888 ms | 953 ms 3,900 ms
Emtucifor (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
Worktable | 0 0 | 0 0
z5Candies | 1 526 | 1 526
z5CandyPreferences | 1 2,022 | 1 990
z5Persons | 1 733 | 1 733
John Saunders (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences | 183292 587,912 | 183,290 585,570
z5Persons | 3 802 | 1 733
Worktable | 0 0 | 0 0
z5Candies | 3 559 | 1 526
Worktable | 0 0 | - -
ROW_NUMBER() (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences | 3 2,233 | 1 990
z5Persons | 3 802 | 1 733
z5Candies | 3 559 | 1 526
Worktable | 0 0 | 0 0

How to use a function-based index on a column that contains NULLs in Oracle 10+?

Let's just say you have a table in Oracle:
CREATE TABLE person (
id NUMBER PRIMARY KEY,
given_names VARCHAR2(50),
surname VARCHAR2(50)
);
with these function-based indices:
CREATE INDEX idx_person_upper_given_names ON person (UPPER(given_names));
CREATE INDEX idx_person_upper_surname ON person (UPPER(given_names));
Now, given_names has no NULL values but for argument's sake surname does. If I do this:
SELECT * FROM person WHERE UPPER(given_names) LIKE 'P%'
the explain plan tells me its using the index but change it to:
SELECT * FROM person WHERE UPPER(surname) LIKE 'P%'
it doesn't. The Oracle docs say that a function-based index will only be used when several conditions are met, one of which is ensuring there are no NULL values, since they aren't indexed.
I've tried these queries:
SELECT * FROM person WHERE UPPER(surname) LIKE 'P%' AND UPPER(surname) IS NOT NULL
and
SELECT * FROM person WHERE UPPER(surname) LIKE 'P%' AND surname IS NOT NULL
In the latter case I even added an index on surname, but no matter what I try it uses a full table scan. Assuming I can't get rid of the NULL values, how do I get this query to use the index on UPPER(surname)?
The index can be used, though the optimiser may have chosen not to use it for your particular example:
SQL> create table my_objects
2 as select object_id, object_name
3 from all_objects;
Table created.
SQL> select count(*) from my_objects
2 /
COUNT(*)
----------
83783
SQL> alter table my_objects modify object_name null;
Table altered.
SQL> update my_objects
2 set object_name=null
3 where object_name like 'T%';
1305 rows updated.
SQL> create index my_objects_name on my_objects (lower(object_name));
Index created.
SQL> set autotrace traceonly
SQL> select * from my_objects
2 where lower(object_name) like 'emp%';
29 rows selected.
Execution Plan
----------------------------------------------------------
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17 | 510 | 355 (1)|
| 1 | TABLE ACCESS BY INDEX ROWID| MY_OBJECTS | 17 | 510 | 355 (1)|
|* 2 | INDEX RANGE SCAN | MY_OBJECTS_NAME | 671 | | 6 (0)|
------------------------------------------------------------------------------------
The documentation you read was presumably pointing out that, just like any other index, all-null keys are not stored in the index.
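As a hedged illustration using the my_objects table from above: the rows that were updated to NULL have all-null keys and so are absent from my_objects_name, which means a predicate matching only those rows cannot be answered from that index alone:
-- the all-null keys are not stored in my_objects_name, so this
-- predicate falls back to a full table scan:
select count(*) from my_objects where lower(object_name) is null;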
In your example you've created the same index twice - this would give an error so I'm assuming that was a mistake in pasting, not the actual code you tried.
I tried it with
CREATE INDEX idx_person_upper_surname ON person (UPPER(surname));
SELECT * FROM person WHERE UPPER(surname) LIKE 'P%';
and it produced the expected query plan:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1 Card=1 Bytes=67)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'PERSON' (TABLE) (Cost=1
Card=1 Bytes=67)
2 1 INDEX (RANGE SCAN) OF 'IDX_PERSON_UPPER_SURNAME' (INDEX)
(Cost=1 Card=1)
To answer your question, yes it should work. Try double checking that you do have the second index created correctly.
Also try an explicit hint:
SELECT /*+INDEX(PERSON IDX_PERSON_UPPER_SURNAME)*/ *
FROM person
WHERE UPPER(surname) LIKE 'P%';
If that works, but only with the hint, then it is likely related to CBO statistics gone wrong, or CBO related init parameters.
Are you sure you want the index to be used? Full table scans are not bad. Depending on the size of the table, it might be more efficient to do a table scan than use an index. It also depends on the density and distribution of the data, which is why statistics are gathered. The cost based optimizer can usually be trusted to make the right choice. Unless you have a specific performance problem, I wouldn't worry too much about it.
Oracle will still use a function-based index on a column that contains NULLs - I think you misinterpreted the documentation.
You need to put an NVL in the function-based index if you want to check for this, though.
Something like...
create index idx_person_upper_surname on person (nvl(upper(surname),'N/A'));
You can then query using the index with
select * from person where nvl(upper(surname),'N/A') = 'PIERPOINT'
Although, all a bit ugly. Since most people have surnames, perhaps a "not null" is appropriate :-).
You can circumvent the problem of null values being unindexed in this or other situations by also indexing based on a literal value:
CREATE INDEX idx_person_upper_surname ON person (UPPER(surname),0);
This allows you to use the index for such queries as:
Select *
From person
Where UPPER(surname) is null;
This query would normally not use an index, except a bitmap index or an index including a non-nullable real column other than surname.