Checking for the presence of text in a text column efficiently - sql

I have a table with about 2,000,000 rows. I need to query one of the columns to retrieve the rows where a string exists as part of the value.
When I run the query I will know the position of the string, but not beforehand, so a view that takes a substring is not an option.
As far as I can see I have three options:
using like '% %'
using instr
using substr
I do have the option of creating a function-based index, if I am nice to the DBA.
At the moment all queries are taking about two seconds. Does anyone have experience of which of these options will work best, or if there is another option? The select will be used for deletes every few seconds; it will typically select 10 rows.
edit with some more info
The problem comes about because we are using a table to store objects with arbitrary keys and values. The objects come from outside our system, so we have limited scope to control them, and the text column ends up as something like 'key1=abc,key2=def,keyn=ghi'. I know this is horribly denormalised, but as we don't know (to some extent) what the keys will be, it is a reliable way to store and retrieve values. Retrieving a row is fairly fast when we match on the whole of the column, which is indexed, but performance is not good when we want to retrieve the rows with key2=def.
We may be able to create a table with columns for the most common keys, but I was wondering if there was a way to improve performance with the existing set up.

In Oracle 10:
CREATE TABLE test (tst_test VARCHAR2(200));
CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))
SELECT *
FROM TEST
WHERE REGEXP_REPLACE(REGEXP_SUBSTR(TST_TEST, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1') = 'TEST'
This will use the newly created index.
You will need as many indexes as there are KEYs in your data.
The presence of an INDEX, of course, impacts insert performance, but it depends very little on the REGEXP being there:
SQL> CREATE INDEX ix_test ON test (tst_test)
2 /
Index created
Executed in 0,016 seconds
SQL> INSERT
2 INTO test (tst_test)
3 SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
4 FROM dual
5 CONNECT BY
6 LEVEL <= 1000000
7 /
1000000 rows inserted
Executed in 47,781 seconds
SQL> TRUNCATE TABLE test
2 /
Table truncated
Executed in 2,546 seconds
SQL> DROP INDEX ix_test
2 /
Index dropped
Executed in 0 seconds
SQL> CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))
2 /
Index created
Executed in 0,015 seconds
SQL> INSERT
2 INTO test (tst_test)
3 SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
4 FROM dual
5 CONNECT BY
6 LEVEL <= 1000000
7 /
1000000 rows inserted
Executed in 53,375 seconds
As you can see, on my not very fast machine (Core2 4300, 1 GB RAM) you can insert about 20,000 records per second into an indexed field, and this rate depends very little on the type of INDEX being used: plain or function-based.

You can use Tom Kyte's runstats package to compare the performance of different implementations - running each say 1000 times in a loop. For example, I just compared LIKE with SUBSTR and it said that LIKE was faster, taking about 80% of the time of SUBSTR.
Note that "col LIKE '%xxx%'" is different from "SUBSTR(col,5,3) = 'xxx'". The equivalent LIKE would be:
col LIKE '____xxx%'
using one '_' for each leading character to be ignored.
I think whichever way you do it, the results will be similar - it always involves a full table (or perhaps full index) scan. A function-based index would only help if you knew the offset of the substring at the time of creating the index.
I am rather concerned when you say that "The select will be used for deletes every few seconds". This does rather suggest a design flaw somewhere, but without knowing the requirements it's hard to say.
UPDATE:
If your column values are like 'key1=abc,key2=def,keyn=ghi' then perhaps you could consider adding another table like this:
create table key_values
( main_table_id number references main_table -- datatype assumed to match main_table's key
, key_value varchar2(50)
, primary key (main_table_id, key_value)
);
create index key_values_idx on key_values (key_value);
Split the key values up and store them in this table like this:
main_table_id  key_value
123            key1=abc
123            key2=def
123            key3=ghi
(This could be done in an AFTER INSERT trigger on main_table for example)
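A rough sketch of such a trigger (a sketch only, not the original answer's code; the :new.id and obj_text column names and the simple comma split are assumptions):
create or replace trigger main_table_ai
after insert on main_table
for each row
begin
  -- split 'key1=abc,key2=def,...' on commas and store one key_values row per pair
  for r in (
    select regexp_substr(:new.obj_text, '[^,]+', 1, level) as kv
    from   dual
    connect by level <= length(:new.obj_text)
                        - length(replace(:new.obj_text, ',')) + 1
  ) loop
    insert into key_values (main_table_id, key_value)
    values (:new.id, r.kv);
  end loop;
end;
/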
Then your delete could be:
delete main_table
where id in (select main_table_id from key_values
where key_value = 'key2=def');

Can you provide a bit more information?
Are you querying for an arbitrary substring of a string column, or is there some syntax on the strings stored in the columns that would allow for some preprocessing to minimize repeated work?
Have you already done any timing tests on your three options to determine their relative performance on the data you're querying?

I suggest reconsidering your logic.
Instead of looking for where a string exists, it may be faster to check whether it has a length greater than 0 and is not a string.
You can use the TRANSLATE function in Oracle to convert all non-string characters to nulls, then check whether the result is null.
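One possible reading of this suggestion, as a rough sketch only (the table name, column name and character set are assumptions, not from the original answer):
-- strip the digits 0-9 from the value; the leading 'x' keeps the third argument non-empty
-- a NULL result means the value consisted only of the stripped characters
SELECT *
FROM   my_table
WHERE  TRANSLATE(text_col, 'x0123456789', 'x') IS NULL;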

Separate answer to comment on the table design.
Can't you at least have a KEY/VALUE structure, so that instead of storing 'key1=abc,key2=def,keyn=ghi' in a single column you would have a child table like:
KEY   VALUE
key1  abc
key2  def
key3  ghi
Then you can create a single index on key and value and your queries are much simpler (since I take it you are actually looking for an exact match on a given key's value).
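A minimal sketch of that structure (all names and sizes here are assumptions):
create table object_keys
( main_table_id number not null references main_table
, key_name      varchar2(50) not null
, key_value     varchar2(200)
, constraint object_keys_pk primary key (main_table_id, key_name)
);

create index object_keys_kv_idx on object_keys (key_name, key_value);

-- the lookup for key2=def then becomes a simple indexed equality search
select main_table_id
from   object_keys
where  key_name = 'key2'
and    key_value = 'def';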
Some people will probably comment that this is a horrible design, but I think it's better than what you have now.

If you're always going to be looking for the same substring, then using INSTR and a function-based index makes sense to me. You could also do this if you have a small set of constant substrings you will be looking for, creating one FBI for each one.
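For example, something along these lines (a sketch only; main_table, text_col and the fixed substring are assumed names, not from the question):
CREATE INDEX ix_key2_def ON main_table (INSTR(text_col, 'key2=def'));

SELECT *
FROM   main_table
WHERE  INSTR(text_col, 'key2=def') > 0;  -- matches the indexed expression, so the FBI can be used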
Quassnoi's REGEXP idea looks promising too. I haven't used regular expressions inside Oracle yet.
I think that Oracle Text would be another way to go. Info on that here

Not sure about improving the existing setup, but Lucene (a full-text search library, ported to many platforms) can really help. There's the extra burden of keeping the index synchronized with the DB, but if you have anything that resembles a service layer in some programming language this becomes an easy task.

Similar to Anton Gogolev's response, Oracle does incorporate a text search engine documented here
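As a hedged sketch of what that looks like (names are assumptions; this requires the Oracle Text option owned by CTXSYS):
CREATE INDEX main_table_txt_idx ON main_table (text_col)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- note: a CONTEXT index is not transactional by default and must be synchronized
-- after DML, e.g. with SYNC (ON COMMIT) in the PARAMETERS clause or a scheduled sync
SELECT *
FROM   main_table
WHERE  CONTAINS(text_col, 'def') > 0;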
There's also extensible indexing, so you can build your own index structures, documented here
As you've agreed, this is a very poor data structure, and I think you will struggle to achieve the aim of deleting stuff every few seconds. Depending on how this data gets input, I'd look at properly structuring the data on load, at least to the extent of having rows of "parent_id", "key_name", "key_value".


How can I improve the speed of a SQL query searching for a collection of strings

I have a table called T_TICKET with a column CallId varchar(30).
Here is an example of my data:
CallId | RelatedData
===========================================
MXZ_SQzfGMCPzUA | 0000
MXyQq6wQ7gVhzUA | 0001
MXwZN_d5krgjzUA | 0002
MXw1YXo7JOeRzUA | 0000
...
I am attempting to find records that match a collection of CallId's. Something like this:
SELECT * FROM T_TICKET WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
And I have anywhere from 200 - 300 CallId's that I am looking up at a time using this query. The query takes around 35 seconds to run. Is there anything I can do to either the table structure, the column type, the index, or the query itself to improve the performance of this query?
There are around 300,000 rows in T_TICKET currently. CallId is not unique. And RelatedData is not unique. I also have an index (non-clustered) on CallId.
I know the basics of SQL, but I'm not a pro. Some things I've thought of doing are:
Change the type of CallId from varchar to char.
Shorten the length of CallId (its length is 30, but in reality, right now, I am using only 15 bytes).
I have not tried any of these yet because it requires changes to live production data. And, I am not sure they would make a significant improvement.
Would either of these options make a significant improvement? Or, are there other things I could do to make this perform faster?
First, be sure that the types are the same -- either VARCHAR() or NVARCHAR(). Then, add an index:
create index idx_t_ticket_callid on t_ticket(callid);
If the types are compatible, SQL Server should make use of the index.
Your table is what we call a heap (a table without a clustered index). This kind of table is only good for data loading and/or as a staging table. I would recommend you convert your table to have a clustering key. A good clustering key should be unique, static, narrow, non-nullable, and ever-increasing (e.g. an int/bigint identity column).
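A minimal sketch of adding such a clustering key (the column and index names are assumptions, and adding an identity column plus a clustered index rebuilds the table, so do it in a maintenance window):
ALTER TABLE T_TICKET ADD TicketId INT IDENTITY(1,1) NOT NULL;

CREATE UNIQUE CLUSTERED INDEX CIX_T_TICKET_TicketId ON T_TICKET (TicketId);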
Another downside of heap is when you have lots of UPDATE/DELETE on your table, it will slow down your SELECT because of forwarded records. Quoting from Paul Randal about forwarded records:
If a forwarding record occurs in a heap, when the record locator points to that location, the Storage Engine gets there and says Oh, the record isn't really here – it's over there! And then it has to do another (potentially physical) I/O to get to the page with the forwarded record on. This can result in a heap being less efficient than an equivalent clustered index.
Lastly, make sure you list the columns you need in your SELECT and avoid SELECT *. I'm guessing you are getting a table scan when you execute the query. What you can do is INCLUDE the columns your SELECT needs in your index, like this:
CREATE INDEX [IX_T_TICKET_CallId_INCLUDE] ON [T_TICKET] ([CallId]) INCLUDE ([RelatedData]) WITH (DROP_EXISTING=ON)
It turns out there is in fact a way to significantly optimize my query without changing any data types.
This query:
SELECT * FROM T_TICKET
WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
is using NVARCHAR types as the input params (N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA'...). As I specified in my question, CallId is VARCHAR. SQL Server was converting CallId in every row of the table to an NVARCHAR type to do the comparison, which was taking a long time (even though I have an index on CallId).
I was able to optimize it by simply NOT changing the parameter types to NVARCHAR:
SELECT * FROM T_TICKET
WHERE CALLID IN('MXZInrBl1DCnzUA', 'MXZ0TWkUhHprzUA', 'MXZ_SQzfGMCPzUA', ... ,'MXyQq6wQ7gVhzUA')
Now, instead of taking over 30 seconds to run, it only takes around .03 seconds. Thanks for all the input.

difference b/w where column='' and column like '' in sql [duplicate]

This question skirts around what I'm wondering, but the answers don't exactly address it.
It would seem that in general '=' is faster than 'like' when using wildcards. This appears to be the conventional wisdom. However, let's suppose I have a column containing a limited number of different fixed, hardcoded, varchar identifiers, and I want to select all rows matching one of them:
select * from table where value like 'abc%'
and
select * from table where value = 'abcdefghijklmn'
'Like' should only need to test the first three chars to find a match, whereas '=' must compare the entire string. In this case it would seem to me that 'like' would have an advantage, all other things being equal.
This is intended as a general, academic question, and so should not matter which DB, but it arose using SQL Server 2005.
See https://web.archive.org/web/20150209022016/http://myitforum.com/cs2/blogs/jnelson/archive/2007/11/16/108354.aspx
Quote from there:
the rules for index usage with LIKE are loosely like this:

1. If your filter criteria uses equals = and the field is indexed, then most likely it will use an INDEX/CLUSTERED INDEX SEEK.

2. If your filter criteria uses LIKE, with no wildcards (like if you had a parameter in a web report that COULD have a % but you instead use the full string), it is about as likely as #1 to use the index. The increased cost is almost nothing.

3. If your filter criteria uses LIKE, but with a wildcard at the beginning (as in Name0 LIKE '%UTER') it's much less likely to use the index, but it still may at least perform an INDEX SCAN on a full or partial range of the index.

4. HOWEVER, if your filter criteria uses LIKE, but starts with a STRING FIRST and has wildcards somewhere AFTER that (as in Name0 LIKE 'COMP%ER'), then SQL may just use an INDEX SEEK to quickly find rows that have the same first starting characters, and then look through those rows for an exact match.

(Also keep in mind, the SQL engine still might not use an index the way you're expecting, depending on what else is going on in your query and what tables you're joining to. The SQL engine reserves the right to rewrite your query a little to get the data in a way that it thinks is most efficient and that may include an INDEX SCAN instead of an INDEX SEEK.)
It's a measurable difference.
Run the following:
Create Table #TempTester (id int, col1 varchar(20), value varchar(20))
go
INSERT INTO #TempTester (id, col1, value)
VALUES
(1, 'this is #1', 'abcdefghij')
GO
INSERT INTO #TempTester (id, col1, value)
VALUES
(2, 'this is #2', 'foob'),
(3, 'this is #3', 'abdefghic'),
(4, 'this is #4', 'other'),
(5, 'this is #5', 'zyx'),
(6, 'this is #6', 'zyx'),
(7, 'this is #7', 'zyx'),
(8, 'this is #8', 'klm'),
(9, 'this is #9', 'klm'),
(10, 'this is #10', 'zyx')
GO 10000
CREATE CLUSTERED INDEX ixId ON #TempTester(id)
CREATE NONCLUSTERED INDEX ixTesting ON #TempTester(value)
Then:
SET SHOWPLAN_XML ON
Then:
SELECT * FROM #TempTester WHERE value LIKE 'abc%'
SELECT * FROM #TempTester WHERE value = 'abcdefghij'
The resulting execution plan shows you that the cost of the first operation, the LIKE comparison, is about 10 times more expensive than the = comparison.
If you can use an = comparison, please do so.
You should also keep in mind that when using LIKE, some SQL flavors will ignore indexes, and that will kill performance. This is especially true if you don't use the "starts with" pattern as in your example.
You should really look at the execution plan for the query and see what it's doing; guess as little as possible.
This being said, the "starts with" pattern can be and is optimized in SQL Server. It will use the table's index. EF 4.0 switched to LIKE for StartsWith for this very reason.
If value is unindexed, both result in a table-scan. The performance difference in this scenario will be negligible.
If value is indexed, as Daniel points out in his comment, the = will result in an index lookup which is O(log N) performance. The LIKE will (most likely - depending on how selective it is) result in a partial scan of the index >= 'abc' and < 'abd' which will require more effort than the =.
Note that I'm talking SQL Server here - not all DBMSs will be nice with LIKE.
You are asking the wrong question. In databases it is not the operator performance that matters; it is always the SARGability of the expression and the coverability of the overall query. Performance of the operator itself is largely irrelevant.
So, how do LIKE and = compare in terms of SARGability? LIKE, when used with an expression that does not start with a constant (e.g. when used as LIKE '%something'), is by definition non-SARGable. But does that make = or LIKE 'something%' SARGable? No. As with any question about SQL performance, the answer does not lie in the text of the query, but in the schema deployed. These expressions may be SARGable if an index exists to satisfy them.
So, truth be told, there are small differences between = and LIKE. But asking whether one operator or the other is 'faster' in SQL is like asking 'What goes faster, a red car or a blue car?'. You should be asking questions about the engine size and vehicle weight, not about the color... To approach questions about optimizing relational tables, the place to look is your indexes and your expressions in the WHERE clause (and other clauses, but it usually starts with the WHERE).
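A hedged illustration of the distinction (table and column names are assumptions; assume an index exists on col):
-- the first two predicates are SARGable and can be satisfied with an index seek;
-- the third is not, because the leading wildcard defeats a range seek
SELECT * FROM t WHERE col = 'something';
SELECT * FROM t WHERE col LIKE 'something%';
SELECT * FROM t WHERE col LIKE '%something';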
A personal example using MySQL 5.5: I had an inner join between two tables, one of 3 million rows and one of 10 thousand rows.
When using a LIKE on an indexed column as below (no wildcards), it took about 30 seconds:
where login like '12345678'
Using EXPLAIN, I could see that the index was not being used.
When using an '=' on the same query, it took about 0.1 seconds:
where login ='12345678'
Using EXPLAIN, the index was used as expected.
As you can see, the LIKE completely cancelled the index seek, so the query took 300 times longer.
= is much faster than LIKE, even without a wildcard. I tested on MySQL with 11 GB of data and more than 100 million records; the f_time column is indexed.
SELECT * FROM XXXXX WHERE f_time = '1621442261'
#took 0.00 sec and returned 330 records
SELECT * FROM XXXXX WHERE f_time LIKE '1621442261'
#took 44.71 sec and returned 330 records
Besides all the answers, there is this to consider:
'like' is case insensitive, so every character needs to be compared twice, whereas the '=' only compares once for identical characters.
This issue arises with or without indexes.
Maybe you are looking for Full Text Search.
In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.
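A hedged sketch of what setting up full-text search looks like in SQL Server (all names here are assumptions; the table needs an existing unique, non-nullable key index):
CREATE FULLTEXT CATALOG ft_catalog AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.MyTable (TextCol)
    KEY INDEX PK_MyTable;  -- must name an existing unique index on the table

SELECT *
FROM   dbo.MyTable
WHERE  CONTAINS(TextCol, 'searchterm');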
I was working with a huge database that has more than 400M records, and I put LIKE in the search query. Here are the final results.
There were three tables, tb1, tb2 and tb3. When I used an equality comparison on all tables, the query response time was 193 ms. When I put a LIKE on one of the tables, the response time was 19.22 seconds, and with LIKE on all tables the response time was 112 seconds.

Index on VARCHAR column

I have a table of 32,589 rows, and one of the columns is called 'Location' and is a Varchar(40) column type. The column holds a location, which is actually a suburb, all uppercase text.
A function that uses this table does a:
IF EXISTS(SELECT * FROM MyTable WHERE Location = 'A Suburb')
...
Would it be beneficial to add an index to this column, for efficiency? This is more of a read-only table, so there aren't many edits or inserts except for maintenance.
Without an index SQL Server will have to perform a table scan to find the first instance of the location you're looking for. You might get lucky and have the value be in one of the first few rows, but it could be at row 32,000, which would be a waste of time. Adding an index only takes a few seconds and you'll probably see a big performance gain.
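Something along these lines (a sketch; the index name is an assumption):
CREATE NONCLUSTERED INDEX IX_MyTable_Location ON MyTable (Location);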
I concur with @Brian Shamblen's answer.
Also, try using TOP 1 in the inner select
IF EXISTS(SELECT TOP 1 * FROM MyTable WHERE Location = 'A Suburb')
You don't have to select all the records matching your criteria for EXISTS, one is enough.
An opportunistic approach to performance tuning is usually a bad idea.
To answer the specific question - if your function is using location in a where clause, and the table has more than a few hundred rows, and the values in the location column are not all identical, creating an index will speed up your function.
Whether you notice any difference is hard to say - there may be much bigger performance problems lurking in the database, and you might be fixing the wrong problem.

SQL Select is very slow with CHARINDEX

I am using SQL Server and have a table with 2 columns:
myId varchar(80)
cchunk varchar(max)
Basically it stores large chunks of text, so that's why I need varchar(max).
My problem is when I do a query like this:
select *
from tbchunks
where
(CHARINDEX('mystring1',tbchunks.cchunk)< CHARINDEX('mystring2',tbchunks.cchunk))
AND
CHARINDEX('mystring2',tbchunks.cchunk) - CHARINDEX('mystring1',tbchunks.cchunk) <=10
It takes about 3 seconds to complete, the table has only about 500,000 records, and the data returned from the above query is anywhere between 0 and 800 rows at most.
I have a nonclustered index on the myId column; it helped make SELECT COUNT(*) fast but didn't help with the above query.
I tried using Fulltext but it was slow. I tried splitting the text in cchunk into smaller parts and adding an id column to connect all those split chunks, but I ended up with a table of 2 million records of split chunks of text (I did that so I could add an index), and the query was even slower.
EDIT:
modified the table to include a primary key (int)
created a fulltext catalog with "Accent Sensitive=true"
created a fulltext index on my table on column "cchunk"
ran the same query above and it ended up taking 22 seconds, which is much slower
UPDATE
Thanks everyone for suggesting FullText (@Aaron Bertrand thanks!), I converted my query to this:
SELECT * FROM tbchunks AS FT_TBL INNER JOIN
CONTAINSTABLE(tbchunks, cchunk, '(mystring1 NEAR mystring2)') AS KEY_TBL
ON FT_TBL.cID = KEY_TBL.[KEY]
By the way, cID is the primary key I added later.
Anyway, I am getting broad results and I notice that the higher the RANK column that was returned, the better the results. My question is: when does RANK start to become accurate?
An index isn't going to help with CHARINDEX at all. An index on a particular column can only quickly find rows where the value in the indexed field matches an indexed value. I'm actually quite surprised that query only takes 3 seconds, given that it has to evaluate CHARINDEX four times (or at the very least, twice) for every single row.
Well, as good as the ideas presented here were, nobody managed to really solve my problem, but they provided helpful tips that led me to the solution, which I would love to share.
Using full text really was the answer, as many mentioned, but I managed to use CONTAINS in combination with NEAR so it can totally replace my current SQL query and provide awesome speed.
CONTAINS(cchunk, 'NEAR((mystring1, mystring2), 3, TRUE)')
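Put together as a complete query, that pattern looks roughly like this (a sketch based on the snippets above; the generic NEAR syntax requires SQL Server 2012 or later):
SELECT *
FROM   tbchunks
WHERE  CONTAINS(cchunk, 'NEAR((mystring1, mystring2), 3, TRUE)');  -- TRUE = terms must appear in this order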

Optimize Oracle Between Date Statement

I have an Oracle SQL query that selects entries of the current day, like so:
SELECT [fields]
FROM MY_TABLE T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE)
AND TRUNC(SYSDATE) + 86399/86400
AND T.TYPE = 123
The EVT_END field is of type DATE and T.TYPE is a NUMBER(15,0).
I'm sure that with the increasing size of the table data (and ongoing time), the date constraint will reduce the result set by a much larger factor than the type constraint, since there are a very limited number of types.
So the basic question arising is: what is the best index to choose to make the selection on the current date faster? I especially wonder what the advantages and disadvantages of a functional index on TRUNC(T.EVT_END) compared to a normal index on T.EVT_END would be. When using a functional index the query would look something like this:
SELECT [fields]
FROM MY_TABLE T
WHERE TRUNC(T.EVT_END) = TRUNC(SYSDATE)
AND T.TYPE = 123
Because other queries use the mentioned date constraints without the additional type selection (or maybe with some other fields), multicolumn indexes wouldn't help me a lot.
Thanks, I'd appreciate your hints.
Your index should be TYPE, EVT_END.
CREATE INDEX PIndex
ON MY_TABLE (TYPE, EVT_END)
The optimizer will first go through this index to find the TYPE=123 section. Then, within TYPE=123, it will have the EVT_END timestamps sorted, so it can search the b-tree for the first date in the range and go through the dates sequentially until one is out of the range.
Based on the query above, the functional index will provide no value. For the functional index to be used, the predicate in the query would need to be written as follows:
SELECT [fields]
FROM MY_TABLE T
WHERE TRUNC(T.EVT_END) BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE) + 86399/86400
AND T.TYPE = 123
The functional index on the column EVT_END is being ignored. It would be better to have a normal index on the EVT_END date. For a functional index to be used, the left-hand side of the condition must match the declaration of the functional index. I would probably write the query as:
SELECT [fields]
FROM MY_TABLE T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE+1)
AND T.TYPE = 123
And I would create the following index:
CREATE INDEX bla on MY_TABLE( EVT_END )
This is assuming you are trying to find the events that ended within a day.
Results
If your index is cached, a function-based index performs best. If your index is not cached, a compressed function-based index performs best.
Below are the relative times generated by my test code. Lower is better. You cannot compare the numbers between cached and non-cached, they are totally different tests.
                 In cache   Not in cache
Regular              120            139
FBI                  100            138
Compressed FBI       126            100
I'm not sure why the FBI performs better than the regular index. (Although it's probably related to what you said about equality predicates versus range. You can see that the regular index has an extra "FILTER" step in its explain plan.) The compressed FBI has some additional overhead to uncompress the blocks. This small amount of extra CPU time is relevant when everything is already in memory, and CPU waits are most important. But when nothing is cached, and IO is more important, the reduced space of the compressed FBI helps a lot.
Assumptions
There seems to be a lot of confusion about this question. The way I read it, you only care about this one specific query, and you want to know whether a function-based index or a regular index will be faster.
I assume you do not care about other queries that may benefit from this index, additional time spent to maintain the index, if the developers remember to use it, or whether or not the optimizer chooses the index. (If the optimizer doesn't choose the index, which I think is unlikely, you can add a hint.) Let me know if any of these assumptions are wrong.
Code
--Create tables. 1 = regular, 2 = FBI, 3 = Compressed FBI
create table my_table1(evt_end date, type number) nologging;
create table my_table2(evt_end date, type number) nologging;
create table my_table3(evt_end date, type number) nologging;
--Create 1K days, each with 100K values
begin
    for i in 1 .. 1000 loop
        insert /*+ append */ into my_table1
        select sysdate + i - 500 + (level * interval '1' second), 1
        from dual connect by level <= 100000;
        commit;
    end loop;
end;
/
insert /*+ append */ into my_table2 select * from my_table1;
insert /*+ append */ into my_table3 select * from my_table1;
--Create indexes
create index my_table1_idx on my_table1(evt_end);
create index my_table2_idx on my_table2(trunc(evt_end));
create index my_table3_idx on my_table3(trunc(evt_end)) compress;
--Gather statistics
begin
dbms_stats.gather_table_stats(user, 'MY_TABLE1');
dbms_stats.gather_table_stats(user, 'MY_TABLE2');
dbms_stats.gather_table_stats(user, 'MY_TABLE3');
end;
/
--Get the segment size.
--This shows the main advantage of a compressed FBI, the lower space.
select segment_name, bytes/1024/1024/1024 GB
from dba_segments
where segment_name like 'MY_TABLE__IDX'
order by segment_name;
SEGMENT_NAME     GB
MY_TABLE1_IDX    2.0595703125
MY_TABLE2_IDX    2.0478515625
MY_TABLE3_IDX    1.1923828125
--Test block.
--Uncomment different lines to generate 6 different test cases.
--Regular, Function-based, and Function-based compressed. Both cached and not-cached.
declare
    v_count      number;
    v_start_time number;
    v_total_time number := 0;
begin
    --Uncomment two lines to test the server when it's "cold", and nothing is cached.
    for i in 1 .. 10 loop
        execute immediate 'alter system flush buffer_cache';
    --Uncomment one line to test the server when it's "hot", and everything is cached.
    --for i in 1 .. 1000 loop
        v_start_time := dbms_utility.get_time;

        SELECT COUNT(*)
        INTO   V_COUNT
        --#1: Regular
        FROM   MY_TABLE1 T
        WHERE  T.EVT_END BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE) + 86399/86400;
        --#2: Function-based
        --FROM   MY_TABLE2 T
        --WHERE  TRUNC(T.EVT_END) = TRUNC(SYSDATE);
        --#3: Compressed function-based
        --FROM   MY_TABLE3 T
        --WHERE  TRUNC(T.EVT_END) = TRUNC(SYSDATE);

        v_total_time := v_total_time + (dbms_utility.get_time - v_start_time);
    end loop;

    dbms_output.put_line('Seconds: '||v_total_time/100);
end;
/
Test Methodology
I ran each block at least 5 times, alternated between run types (in case something was running on my machine only part of the time), threw out the high and the low run times, and averaged them. The code above does not include all that logic, since it would take up 90% of this answer.
Other Things to Consider
There are still many other things to consider. My code assumes the data is inserted in a very index-friendly order. Things will be totally different if this is not true, as compression may not help at all.
Probably the best solution to this problem is to avoid it completely with partitioning. For reading the same amount of data, a full table scan is much faster than an index read because it uses multi-block IO. But there are some downsides to partitioning, like the large amount of money
required to buy the option, and extra maintenance tasks. For example, creating partitions ahead of time, or using interval partitioning (which has some other weird issues), gathering stats, deferred segment creation, etc.
Ultimately, you will need to test this yourself. But remember that testing even such a simple choice is difficult. You need realistic data, realistic tests, and a realistic environment. Realistic data is much harder than it sounds. With indexes, you cannot simply copy the data and build the indexes at once. A create table my_table1 as select * from ... followed by create index ... will create a different index than if you create the table and perform a bunch of inserts and deletes in a specific order.
@S1lence:
I believe considerable thought went into asking this question, and I took a lot of time to post my answer here, as I don't like posting guesses as answers.
I would like to share my web-search experience on this choice of a normal index on a date column versus an FBI.
Based on my understanding of the link below, if you are definitely going to use the TRUNC function, then you can strike out the option of a normal index, as this consulting site says:
Even though the column may have an index, the trunc built-in function will invalidate the index, causing sub-optimal execution with unnecessary I/O.
I suppose that clears it up: you have to go with an FBI if you are going to use TRUNC for sure. Please let me know if my reply makes sense.
Oracle SQL Tuning with function-based indexes
Cheers,
Lakshmanan C.
The decision over whether or not to use a function-based index should be driven by how you plan to write your queries. If all your queries against the date column will be in the form TRUNC(EVT_END), then you should use the FBI. However, in general it will be better to create an index on just EVT_END for the following reasons:
It will be more reusable. If you ever have queries checking particular times of the day then you can't use TRUNC.
There will be more distinct keys in the index using just the date. If you have 1,000 different times inserted during a day, EVT_END will have 1,000 distinct keys, whereas TRUNC(EVT_END) will only have 1 (this assumes that you're storing the time component and not just midnight for all the dates - in the second case both will have 1 distinct key for a day). This matters because the more distinct values an index has, the higher the selectivity of the index and the more likely it is to be used by the optimizer (see this); the dictionary query after this list shows how to check this, along with the clustering factor discussed next.
The clustering factor is likely to be different, but in the case of using trunc it's more likely to go up, not down as stated in some other comments. This is because the clustering factor represents how closely the order of the values in the index match the physical storage of the data. If all your data is inserted in date order then a plain index will have the same order as the physical data. However, with TRUNC all times on a single day will map to the same value, so the order of rows in the index may be completely different to the physical data. Again, this means the trunc index is less likely to be used. This will entirely depend on your database's insertion/deletion patterns however.
Developers are more likely to write queries where trunc isn't applied to the column (in my experience). Whether this holds true for you will depend upon your developers and the quality controls you have around deployed SQL.
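A hedged way to check the distinct keys and clustering factor mentioned above from the data dictionary, after gathering statistics (the table name is a placeholder):
SELECT index_name, distinct_keys, clustering_factor
FROM   user_indexes
WHERE  table_name = 'MY_TABLE';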
Personally, I would go with Marlin's answer of TYPE, EVT_END as a first pass. You need to test this in your environment, however, and see how it affects this query and all others using the TYPE and EVT_END columns.