I've narrowed a performance issue to a particular SQLite query that looks like this:
select *
from test
where (?1 is null or ident = ?1)
and (?2 is null or name = ?2)
and (?3 is null or region = ?3);
This allows any subset of the input parameters (there are more than three) with a single query. Unfortunately, using explain query plan on this yields:
1|0|0|SCAN TABLE test
So SQLite is reading through the entire table no matter what's passed in.
Changing the query to from table indexed by test_idx causes it to fail: Error: no query solution.
Removing the ?1 is null or yields a much more favorable query:
1|0|0|SEARCH TABLE test USING INDEX idx (ident=?)
However, note that only one index can be used. All matches for ident will be scanned looking for matches to other fields. Using a single index that contains all the match fields avoids this:
0|0|0|SEARCH TABLE test USING INDEX test_idx_3 (ident=? AND region=? AND name=?)
It seems reasonable to think that SQLite's query planner would be able to either eliminate or simplify each condition to a simple indexed column check, but apparently that is not the case, as query optimization happens before parameter binding, and no further simplification occurs.
The obvious solution, is to have 2^N separate queries and select the appropriate one at runtime based on which combination of inputs are to be checked. For N=2 or 3 that might be acceptable, but it's absolutely out of the question in this case.
There are, of course, a number of ways to re-organize the database that would make this type of query more reasonable, but assume that's also not practical.
So, how can I search any subset of columns in a table without losing the performance benefit of indexes on those columns?
The only progress I've been able to make is to use a query like this:
select ident, name, region
from test
where (case when ?1 is null then 1 when ident = ?1 then 1 else 0 end)
and (case when ?2 is null then 1 when name = ?2 then 1 else 0 end)
and (case when ?3 is null then 1 when region = ?3 then 1 else 0 end)
This reduces the query to an index scan, rather than a table scan:
0|0|0|SCAN TABLE test USING COVERING INDEX test_idx_3
However, it only works if there's one index containing all the columns of interest, and if the only columns being selected are those in the index. If the index isn't a "covering index" (one containing all the needed values) then SQLite doesn't use the index at all.
The way to get around the second restriction is to structure the query like this:
select ident, name, region, location
from test
where rowid in (
select rowid
from test
where (case when ?1 is null then 1 when ident = ?1 then 1 else 0 end)
and (case when ?2 is null then 1 when name = ?2 then 1 else 0 end)
and (case when ?3 is null then 1 when region = ?3 then 1 else 0 end)
)
yielding:
0|0|0|SEARCH TABLE test USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE LIST SUBQUERY 1
1|0|0|SCAN TABLE test USING COVERING INDEX test_idx_3
This is generally faster than a full table scan, but how much faster depends on several factors:
How much data is in each row that's not in the index? If it's small, then the index scan is almost a table scan.
How many results are there? Each result is a separate primary key search, so for some large number of results in a large table, the N searches will actually be slower than a single pass through the whole table. For M results in a table of N rows, you want O[M log N] << O[N], so m < (N / log N). Call it 3% as a rule of thumb, minus the cost of an index scan:
Don't try to be clever. SQLite's prepared statements do not need much memory, so you actually could keep all 2^N of them. But preparing a query does not need much time, either, so it would be a better idea to construct each query dynamically whenever you need it.
As for the index: the documentation shows that the leftmost columns in the index must be used in the query. This means that you only need a few combinations of columns in your indexes (even for queries that do not use all index columns). In any case, you should prioritize indexes on columns with high selectivity.
Related
My first time working with indexes in database and so far I've learn that if you have a multi-column index such as index('col1', 'col2', 'col3'), and if you do a query that uses where col2='col2' and col3='col3', that index would not be use.
I also learn that if a column is very low selectivity column. Indexing is useless.
However, from my test, it seems none of the above is true at all. Can someone explain more on this?
I have a table with more than 16 million records. Let's say claimID is the primary key, then there're a historynumber column that only have 3 distinct values (1,2,3), and a last column with storeNumber that has about 1 million distinct values.
I have an index for claimID alone, another index(historynumber, claimID), and other index with index(historynumber, storeNumber), and finally index(storeNumber, historynumber).
My guess was that if I do:
select * from my_table where claimId='123456' and historynumber = 1
would be much faster than
select * from my_table where historynumber = 1 and claimId = '123456'
However, the 2 have exactly the same performance (instant). So I thought the primary key index can work on any column order. Therefore, I tried the same thing but on historynumber and storeNumber instead. The result is exactly the same. Then I start trying out on columns that has no indexes and of course the result is the same also.
Finally, I do a
select * from my_table where historynumber = 1
and the query takes so long I had to cancel it.
So my conclusion is that the column order in where clause is completely useless, and so is the column order in the index definition since it seems like the database is smart enough to tell which column is the highest selectivity column.
Could someone give me an example that could prove otherwise?
Index explanation is a huge topic.
Don't worry about the sequence of different attributes in the SQL - it has no effect whether you specify
...where claimId='123456' and historynumber = 1
or the other way round. Each SQL is checked and optimized by the optimizer. To proove how the data gets accessed you could do a EXPLAIN. Check the documentation for more details.
For your other problem
select * from my_table where historynumber = 1
with an index of (storeNumber, historynumber).
Have you ever tried to lookup the name of a caller (having the telephone number) in a telephone book?
Well it is pretty much the same for an index - so the column order when creatin the index matters!
There are techniques which could help - i.e. index jump scan - but there is no guarantee.
Check out following sites to learn a little bit more about DB2 indexes:
http://db2commerce.com/2013/09/19/db2-luw-basics-indexes/
http://use-the-index-luke.com/sql/where-clause/the-equals-operator/concatenated-keys
I have this query which takes too much time (since last 1 hour is still running) to execute:
select RL.[LINK_ID] as LINK_ID, RPA.[POSTAL_AREA_ID] as POSTAL_AREA_ID, RRN.[STREET_NAME] as STREET_NAME
from RDF_LINK as RL, RDF_POSTAL_AREA as RPA, RDF_ROAD_LINK as RRL, RDF_ROAD_NAME as RRN
where RRL.[ROAD_NAME_ID] = RRN.[ROAD_NAME_ID]
AND RPA.[POSTAL_AREA_ID] IN (RL.[LEFT_POSTAL_AREA_ID], RL.[RIGHT_POSTAL_AREA_ID])
AND RL.[LINK_ID] = RRL.[LINK_ID]
All the columns which are part of the query are indexed.
The ANALYZE command has already been. performed on database.
The database has approx. 73 millions records in the RDF_ROAD_LINK table and same number of records in other tables.
Is there any other way around to write this query?
EXPLAIN QUERY PLAN
select RL.[LINK_ID] as LINK_ID, RPA.[POSTAL_AREA_ID] as POSTAL_AREA_ID, RRN.[STREET_NAME] as STREET_NAME
from RDF_LINK as RL, RDF_POSTAL_AREA as RPA, RDF_ROAD_LINK as RRL, RDF_ROAD_NAME as RRN
where RRL.[ROAD_NAME_ID] = RRN.[ROAD_NAME_ID]
AND RPA.[POSTAL_AREA_ID] IN (RL.[LEFT_POSTAL_AREA_ID], RL.[RIGHT_POSTAL_AREA_ID])
AND RL.[LINK_ID] = RRL.[LINK_ID]
Output ::
0 0 3 SCAN TABLE RDF_ROAD_NAME AS RRN
0 1 2 SEARCH TABLE RDF_ROAD_LINK AS RRL USING INDEX IND_ROAD_NAME_ID (ROAD_NAME_ID=?)
0 2 0 SEARCH TABLE RDF_LINK AS RL USING INDEX sqlite_autoindex_RDF_LINK_1 (LINK_ID=?)
0 3 1 SEARCH TABLE RDF_POSTAL_AREA AS RPA USING COVERING INDEX sqlite_autoindex_RDF_POSTAL_AREA_1 (POSTAL_AREA_ID=?)
0 0 0 EXECUTE LIST SUBQUERY 1
This query returns all 73 million records, and has to look up the corresponding records from the other tables.
This cannot be fast because there is too much data to be cached (and with this size, it's likely that not even the indexes fit into the cache).
In a join between two tables, the database goes through all rows of the first table, and looks up the corresponding row(s) of the second table.
This means that the first table always ends up with a SCAN, because it would not make sense to use an index (going through an index would not be any faster when you need to load all rows anyway).
In this case, using an index for RDF_ROAD_NAME would be possible only if there were an additional filter on an indexed column (WHERE STREET_NAME = 'My Street'), or if the result must be sorted by an indexed column (ORDER BY ROAD_NAME_ID).
If the tables have many columns that are not used in this query, you might be able to speed it up a little bit by using covering indexes (if all data you need is already in the index, the database does not need to look up the corresponding table row):
CREATE INDEX ... ON RDF_ROAD_LINK(ROAD_NAME_ID, LINK_ID);
CREATE INDEX ... ON RDF_LINK(LINK_ID, LEFT_POSTAL_AREA_ID, RIGHT_POSTAL_AREA_ID);
Sorry just clearing my questions. Extending this question Optimizing sqlite query
I have a table:
CREATE TABLE IF NOT EXISTS [app_status](
[id] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL ,
[status] TEXT DEFAULT NULL,
[type] INTEGER
)
I have two indexes. One on status and another on type. Which query will run faster and why?
SELECT COALESCE(min(type), 0)
FROM app_status
WHERE status IS NOT NULL
AND type IN (1,2) limit 1
QUERY PLAN o/p
0|0|0|SEARCH TABLE app_status USING INDEX idx_type (mailbox_type=?) (~10 rows)
0|0|0|EXECUTE LIST SUBQUERY 1
Or...
SELECT type FROM
app_status WHERE
status IS NOT NULL
ORDER BY type limit 1
QUERY PLAN o/p
0|0|0|SCAN TABLE app_status USING INDEX idx_type (~500000 rows)
The first query returns zero or one row matching the criteria in the WHERE clause (where status is not null and type in (1,2) , in unspecified order.
The second query finds all the rows matching the criteria in the WHERE clause (where status is not null), sorts them by type and then returns zero or 1 row.
You should note that the two queries, while they may return the identical results, are not guaranteed to. In particular, the row returned by the second query will return the first row of the result set as ordered by type, regardless of what value that type is. If the lowest value for type where `status is not null is, say 157, that is the row you are going to get. The first query, in that case, will return 0 rows.
But assuming type and status are indexed and the query can use one or more of the indexes, then my suspicion is the first query would be faster as it can seek directly to the desired row(s).
But much depends on the shape of the data (how much data is there? How is it distributed? etc.), whether or not the index is 'covering' (if the index doesn't cover all the columns in the query, then it must do additional I/O to get the data page(s) required to cover all the columns.
edited to note Looking at the execution plans you posted (not knowing Sqllite), the first plan says it should return about 10 rows; the second about 50,000 rows. Which do you think might be faster?
You should:
CREATE INDEX idx_app_staus ON app_status (status, type)
In this way the database angine will not have to look up all of the rows it can find the exact rows what it needs by the where clause. I' don't know which query is faster becouse they aren't returning the same result set but with index from above all of these kind of queries will be fast. The other two index could be dropped.
I've created an Oracle Text index like the following:
create index my_idx on my_table (text) indextype is ctxsys.context;
And I can then do the following:
select * from my_table where contains(text, '%blah%') > 0;
But lets say we have a have another column in this table, say group_id, and I wanted to do the following query instead:
select * from my_table where contains(text, '%blah%') > 0 and group_id = 43;
With the above index, Oracle will have to search for all items that contain 'blah', and then check all of their group_ids.
Ideally, I'd prefer to only search the items with group_id = 43, so I'd want an index like this:
create index my_idx on my_table (group_id, text) indextype is ctxsys.context;
Kind of like a normal index, so a separate text search can be done for each group_id.
Is there a way to do something like this in Oracle (I'm using 10g if that is important)?
Edit (clarification)
Consider a table with one million rows and the following two columns among others, A and B, both numeric. Lets say there are 500 different values of A and 2000 different values of B, and each row is unique.
Now lets consider select ... where A = x and B = y
An index on A and B separately as far as I can tell do an index search on B, which will return 500 different rows, and then do a join/scan on these rows. In any case, at least 500 rows have to be looked at (aside from the database being lucky and finding the required row early.
Whereas an index on (A,B) is much more effective, it finds the one row in one index search.
Putting separate indexes on group_id and the text I feel only leaves the query generator with two options.
(1) Use the group_id index, and scan all the resulting rows for the text.
(2) Use the text index, and scan all the resulting rows for the group_id.
(3) Use both indexes, and do a join.
Whereas I want:
(4) Use the (group_id, "text") index to find the text index under the particular group_id and scan that text index for the particular row/rows I need. No scanning and checking or joining required, much like when using an index on (A,B).
Oracle Text
1 - You can improve performance by creating the CONTEXT index with FILTER BY:
create index my_idx on my_table(text) indextype is ctxsys.context filter by group_id;
In my tests the filter by definitely improved the performance, but it was still slightly faster to just use a btree index on group_id.
2 - CTXCAT indexes use "sub-indexes", and seem to work similar to a multi-column index. This seems to be the option (4) you're looking for:
begin
ctx_ddl.create_index_set('my_table_index_set');
ctx_ddl.add_index('my_table_index_set', 'group_id');
end;
/
create index my_idx2 on my_table(text) indextype is ctxsys.ctxcat
parameters('index set my_table_index_set');
select * from my_table where catsearch(text, 'blah', 'group_id = 43') > 0
This is likely the fastest approach. Using the above query against 120MB of random text similar to your A and B scenario required only 18 consistent gets. But on the downside, creating the CTXCAT index took almost 11 minutes and used 1.8GB of space.
(Note: Oracle Text seems to work correctly here, but I'm not familiar with Text and I can't gaurentee this isn't an inappropriate use of these indexes like #NullUserException said.)
Multi-column indexes vs. index joins
For the situation you describe in your edit, normally there would not be a significant difference between using an index on (A,B) and joining separate indexes on A and B. I built some tests with data similar to what you described and an index join required only 7 consistent gets versus 2 consistent gets for the multi-column index.
The reason for this is because Oracle retrieves data in blocks. A block is usually 8K, and an index block is already sorted, so you can probably fit the 500 to 2000 values in a few blocks. If you're worried about performance, usually the IO to read and write blocks is the only thing that matters. Whether or not Oracle has to join together a few thousand rows is an inconsequential amount of CPU time.
However, this doesn't apply to Oracle Text indexes. You can join a CONTEXT index with a btree index (a "bitmap and"?), but the performance is poor.
I'd put an index on group_id and see if that's good enough. You don't say how many rows we're talking about or what performance you need.
Remember, the order in which the predicates are handled is not necessarily the order in which you wrote them in the query. Don't try to outsmart the optimizer unless you have a real reason to.
Short version: There's no need to do that. The query optimizer is smart enough to decide what's the best way to select your data. Just create a btree index on group_id, ie:
CREATE INDEX my_group_idx ON my_table (group_id);
Long version: I created a script (testperf.sql) that inserts 136 rows of dummy data.
DESC my_table;
Name Null Type
-------- -------- ---------
ID NOT NULL NUMBER(4)
GROUP_ID NUMBER(4)
TEXT CLOB
There is a btree index on group_id. To ensure the index will actually be used, run this as a dba user:
EXEC DBMS_STATS.GATHER_TABLE_STATS('<YOUR USER HERE>', 'MY_TABLE', cascade=>TRUE);
Here's how many rows each group_id has and the corresponding percentage:
GROUP_ID COUNT PCT
---------------------- ---------------------- ----------------------
1 1 1
2 2 1
3 4 3
4 8 6
5 16 12
6 32 24
7 64 47
8 9 7
Note that the query optimizer will use an index only if it thinks it's a good idea - that is, you are retrieving up to a certain percentage of rows. So, if you ask it for a query plan on:
SELECT * FROM my_table WHERE group_id = 1;
SELECT * FROM my_table WHERE group_id = 7;
You will see that for the first query, it will use the index, whereas for the second query, it will perform a full table scan, since there are too many rows for the index to be effective when group_id = 7.
Now, consider a different condition - WHERE group_id = Y AND text LIKE '%blah%' (since I am not very familiar with ctxsys.context).
SELECT * FROM my_table WHERE group_id = 1 AND text LIKE '%ipsum%';
Looking at the query plan, you will see that it will use the index on group_id. Note that the order of your conditions is not important:
SELECT * FROM my_table WHERE text LIKE '%ipsum%' AND group_id = 1;
Generates the same query plan. And if you try to run the same query on group_id = 7, you will see that it goes back to the full table scan:
SELECT * FROM my_table WHERE group_id = 7 AND text LIKE '%ipsum%';
Note that stats are gathered automatically by Oracle every day (it's scheduled to run every night and on weekends), to continually improve the effectiveness of the query optimizer. In short, Oracle does its best to optimize the optimizer, so you don't have to.
I do not have an Oracle instance at hand to test, and have not used the full-text indexing in Oracle, but I have generally had good performance with inline views, which might be an alternative to the sort of index you had in mind. Is the following syntax legit when contains() is involved?
This inline view gets you the PK values of the rows in group 43:
(
select T.pkcol
from T
where group = 43
)
If group has a normal index, and doesn't have low cardinality, fetching this set should be quick. Then you would inner join that set with T again:
select * from T
inner join
(
select T.pkcol
from T
where group = 43
) as MyGroup
on T.pkcol = MyGroup.pkcol
where contains(text, '%blah%') > 0
Hopefully the optimizer would be able to use the PK index to optimize the join and then appy the contains predicate only to the group 43 rows.
I'm reading about indexes in my database book and I was wondering if I was correct in my assumption that a WHERE clause with a non-constant expression in it will not use the index.
So if i have
SELECT * FROM statuses WHERE app_user_id % 10 = 0;
This would not use an index created on app_user_id. But
SELECT * FROM statuses WHERE app_user_id = 5;
would use the index on app_user_id.
Usually (there are other options) a database index is a B-Tree, which means that you can do range scans on it (including equality scans).
The condition app_user_id % 10 = 0 cannot be evaluated with a single range scan, which is why a database will probably not use an index.
It could still decide to use the index in another way, namely for a full scan: Reading the whole table takes more time than just reading the whole index. On the other hand, after reading the index you may still get back to the table, so the overall cost may end up being higher.
This is up to the database query optimizer to decide.
A few examples:
select app_user_id from t where app_user_id % 10 = 0
Here, you do not need the table at all, all necessary data is in the index. The database will most likely do a full index scan.
select count(*) from t where app_user_id % 10 = 0
Same. Full index scan.
select count(*) from t
Only if app_user_id is NOT NULL can this be done with the index (because NULL data is not in the index, at least on Oracle, at least on single column indexes, your database may handle this differently).
Some databases do not need to do access table or index for this, they maintain row counts in the metadata.
select * from t where app_user_id = 5
This is the classic scenario for an index. The database can look at the small section of the index tree, retrieve a small (just one if this was a unique or primary index) number of rowids and fetch those selectively from the table.
select * from t where app_user_id between 5 and 10
Another classic index case. Range scan in the tree returns a small number of rowids to fetch from the table.
select * from t where app_user_id between 5 and 10 order by app_user_id
Since index scans return ordered data, you even get the sorting for free.
select * from t where app_user_id between 5 and 1000000000
Maybe here you should not be using an index. It seems to match too many records. This is a case where having bind variables hide the range from the database could actually be detrimental.
select * from t where app_user_id between 5 and 1000000000
order by app_user_id
But here, since sorting would be very expensive (even taking up temporary swap disk space), maybe iterating in index order is good. Maybe.
select * from t where app_user_id % 10 = 0
This is difficult to decide. We need all columns, so ultimately the query needs to touch the table. The question is whether to go through an index first. The query returns approximately 10% of the whole table. That is probably too much for an index access path to be efficient. If the optimizer has reason to believe that the query returns much less than 10% of the table, an index scan followed by accessing the table might be good. Same if the table is very fragmented (lots of deleted rows eating up space).