Performance of SELECT queries in Oracle SQL

With millions of records in a table, if I want to select a few records based on several WHERE conditions, are there any factors to consider regarding the sequence of those WHERE conditions?
For example, in a table of students in a state, the WHERE conditions might include the following:
InstituteId,
DeptId (unique for each institute, i.e. a DeptId in one institute cannot be present in another institute).
Now, if I have to select a list of students in a particular department, DeptId alone is enough for the WHERE condition, because students with that DeptId will be present in only one particular institute.
But will it improve performance if I also include InstituteId before DeptId in the WHERE conditions, so that records are filtered by institute first and then by DeptId?
Will the order of WHERE conditions have an impact on query performance? Thanks.

The order of WHERE conditions won't have an impact on query performance, because your RDBMS will reorganise them based on the best indexed columns.
In any case, if you have indexes on both columns you should use only DeptId. Otherwise the RDBMS will perform two filters instead of one; in theory, using more conditions could be slower (depending on the indexes).
That said, many things can affect performance (especially with huge volumes of data), so try both ways and compare execution times.
You can also create one index over the two columns at the same time, in which case it could be worthwhile to use both conditions (depending on the RDBMS).
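A minimal sketch of such a composite index (table and column names here are hypothetical, assuming Oracle syntax):
-- Hypothetical names. One composite index can serve filters on the
-- leading column alone, or on both columns together.
CREATE INDEX students_inst_dept_idx ON students (institute_id, dept_id);

-- Both of these queries can seek the index above:
SELECT * FROM students WHERE institute_id = 10;
SELECT * FROM students WHERE institute_id = 10 AND dept_id = 42;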

Not per se, but it depends on the index: if you have an index on (InstituteId, DeptId), then using only DeptId in the WHERE condition will not use the index and will perform a table scan (there is a lot more to this, but that's the basic idea). Always try to write WHERE conditions that are covered by the primary key or by another index on the table, using every column in that index; if you have an index on a, b and c, a WHERE on only a and c will not use the full index.
The order of the columns in the WHERE condition will be re-organized by the DB to fit the index definition.
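A minimal sketch of that rule (hypothetical table and column names):
-- Index on three columns; only predicates covering a leading prefix can seek it.
CREATE INDEX t_abc_idx ON t (a, b, c);

SELECT * FROM t WHERE a = 1;            -- can seek the index (leading column)
SELECT * FROM t WHERE b = 2 AND a = 1;  -- can seek; the DB reorders to fit (a, b)
SELECT * FROM t WHERE a = 1 AND c = 3;  -- seeks only on a; c is filtered afterwards
SELECT * FROM t WHERE b = 2 AND c = 3;  -- cannot seek; the leading column a is missing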

It all depends on how your indexes are set up. If you have no indexes on your table, then it doesn't make a bit of difference because every query is going to have to scan the entire table anyway.
http://use-the-index-luke.com is an excellent introduction to indexes and how they work and how they should be set up.

Related

Indexes, EXPLAIN PLAN, and record access in Oracle SQL

I have been learning about indexes in Oracle SQL, and I wanted to conduct a small experiment with a test table to see how indexes really worked. As I discovered from an earlier post made here, the best way to do this is with EXPLAIN PLAN. However, I am running into something which confuses me.
My sample table contains attributes (EmpID, Fname, Lname, Occupation, .... etc). I populated it with 500,000 records using a java program I wrote (random names, occupations, etc). Now, here are some sample queries with and without indexes:
NO INDEX:
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE ANALYZED 1169
Now I create index:
CREATE INDEX occupation_idx
ON EMPLOYEE (Occupation);
WITH INDEX "occupation_idx":
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE ANALYZED 1169
So... the cost is STILL the same, 1169? Now I try this:
WITH INDEX "occupation_idx":
SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
INDEX(RANGE SCAN) TEST.OCCUPATION_IDX ANALYZED 67
So, it appears that the index is only utilized when that column is the only one I'm pulling values from. But I thought that the point of an index was to unlock the entire record using the indexed column as the key? The search above is a pretty pointless one... it searches for values which you already know. The only worthwhile query I can think of which ONLY involves an indexed column's value (and not the rest of the record) would be an aggregate such as COUNT or something.
What am I missing?
Even with your index, Oracle decided to do a full scan for the second query.
Why did it do this? Oracle would have created two plans and come up with a cost for each:-
1) Full scan
2) Index access
Oracle selected the plan with the lower cost. Obviously it came up with the full scan as the lower cost.
If you want to see the cost of the index plan, you can do an explain plan with a hint like this to force the index usage:
SELECT /*+ INDEX(EMPLOYEE occupation_idx) */ Fname
FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
If you do an explain plan on the above, you will see that the cost is greater than the full scan cost. This is why Oracle did not choose to use the index.
A simple way to consider the cost of the index plan is:-
The blevel of the index (how many blocks must be read from top to bottom)
The number of table blocks that must be subsequently read for records matching in the index. This relies on Oracle's estimate of the number of employees that have an occupation of 'DOCTOR'. In your simple example, this would be:
number of rows / number of distinct values
More complicated considerations include the clustering factor and index cost adjustments, which both reflect the likelihood that a block being read is already in memory and hence does not need to be read from disk.
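As a purely illustrative worked example (the distinct-value count is assumed): with 500,000 rows and, say, 50 distinct occupations, Oracle's estimate would be 500,000 / 50 = 10,000 matching rows, and each of those may cost a table-block read on top of the blevel of the index.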
Perhaps you could update your question with the results from your query with the index hint and also the results of this query:-
SELECT COUNT(*), COUNT(DISTINCT( Occupation ))
FROM EMPLOYEE;
This will allow people to comment on the cost of the index plan.
I think I see what's happening here.
When you have the index in place, and you do:
SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
The execution plan will use the index. This is a no-brainer, because all the data that's required to satisfy the query is right there in the index, and Oracle never even has to reference the table at all.
However, when you do:
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
then, if Oracle uses the index, it will do an INDEX RANGE SCAN followed by a TABLE ACCESS BY ROWID to look up the Fname that corresponds to that Occupation. Now, depending on how many rows have DOCTOR for Occupation, Oracle will have to make one or more trips to the table, to look up the Fname. If, for example, you have a table, and all the employees have Occupation set to 'DOCTOR', the index isn't of much use, and Oracle will simply do a FULL TABLE SCAN of the table. If there are 10,000 employees, and only one is a DOCTOR, then again, it's a no-brainer, and Oracle will use the index.
But there are some subtleties when you're somewhere between those two extremes. People like to talk about 'selectivity', i.e., how many rows are identified by the index vs. the size of the table, when discussing whether the index will be used. But that's not really true. What Oracle really cares about is block selectivity. That is, how many blocks does it have to visit to satisfy the query? So, first, how "wide" is the RANGE SCAN? The more limited the range of values specified by the predicate values, the better. Second, when your query needs to do table lookups, how many different blocks will it have to visit to find all the data it needs? That is, how "random" is the data in the table relative to the index order? This is called the CLUSTERING_FACTOR. If you analyze the index to collect statistics, and then look at USER_INDEXES, you'll see that the CLUSTERING_FACTOR is now populated.
So, what's CLUSTERING_FACTOR? CLUSTERING_FACTOR is the "orderedness" of the table, with respect to the index's key column(s). The value of CLUSTERING_FACTOR will always be between the number of blocks in a table and the number of rows in a table. A low CLUSTERING_FACTOR, that is, one that is very near to the number of blocks in the table, indicates a table that's very ordered, relative to the index. A high CLUSTERING_FACTOR, that is, one that is very near to the number of rows in the table, is very unordered, relative to the index.
It's an important concept to understand that the CLUSTERING_FACTOR describes the order of data in the table relative to the index. So, rebuilding an index, for example, will not change the CLUSTERING_FACTOR. It's also important to understand that the same table could have two indexes, and one could have an excellent CLUSTERING_FACTOR, and the other could have an extremely poor CLUSTERING_FACTOR. The table itself can only be ordered in one way.
So, why have I spent so much time describing CLUSTERING_FACTOR? Because when you have an execution plan that does an INDEX RANGE SCAN followed by TABLE ACCESS BY ROWID, you can be sure that the CLUSTERING_FACTOR has been considered by Oracle's optimizer, to come up with the execution plan. For example, suppose you have a 10,000 row table, and suppose 100 of the rows have Occupation = 'DOCTOR'. You write the query above, asking for the Fname of the employees whose occupation is DOCTOR. Well, Oracle can very easily and efficiently determine the rowids of the rows where occupation is DOCTOR. But, how many table blocks will Oracle need to visit, to do the Fname lookup? It could be only 1 or 2 table blocks, if the data is clustered (ordered) by Occupation in the table. But, it could be as many as 100, if the data is very unordered in the table! So, again, 10,000 row table, and, let's assume, (for the purposes of illustration and simple math) that the table has 100 rows/block, and so, 100 blocks. Depending on table order (i.e. CLUSTERING_FACTOR), the number of table block visits could be as few as 1, or as many as 100.
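If you want to see the CLUSTERING_FACTOR for yourself, a quick data-dictionary query will show it (a sketch, assuming the EMPLOYEE table from the question and statistics already gathered):
-- Compare the clustering factor to the table's block count and row count:
-- near the block count means well-ordered, near the row count means scattered.
SELECT index_name, clustering_factor, num_rows
FROM user_indexes
WHERE table_name = 'EMPLOYEE';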
So, I hope this helps you understand why the optimizer may be reluctant to use an index in some cases.
An index is a copy of the table that stores only the following data:
Indexed field(s)
A pointer to the original row (rowid).
Say you have a table like this:
rowid id name occupation
[1] 1 John clerk
[2] 2 Jim manager
[3] 3 Jane boss
Then an index on occupation would look like this:
occupation rowid
boss [3]
manager [2]
clerk [1]
The records are sorted on occupation in a B-Tree.
As you can see, if you only select the indexed fields, you only need the index (the second table).
If you select anything other than occupation:
SELECT *
FROM mytable
WHERE occupation = 'clerk'
then the engine must do two things: first, find the relevant records in the index; second, find the records in the original table by rowid. It's as if you joined the two tables on rowid.
Since the rowids in the index are not in order, the reads to the original table are not sequential and can be slow. It may be faster to read the original table in sequential order and just filter the records with occupation = 'clerk'.
The engine does not "unlock" the records: it just finds the rowid in the index, and if there are not enough data in the index itself, it looks up data in the original table by the rowid found.
As a WAG (wild guess): analyze the table and the index, then see if the plan changes.
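A minimal sketch of that, using the DBMS_STATS API (table name assumed from the question; cascade => TRUE also gathers the index statistics):
-- Gather optimizer statistics for the table and its indexes:
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'EMPLOYEE', cascade => TRUE);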
When you are selecting just the occupation, the entire query can be satisfied from the index. The index literally has a copy of the occupation. The moment you add an additional column to the select, Oracle has to go to the data record, to get it. The optimizer chooses to read all of the data rows instead of all of the index rows, and the data rows. It's cheaper.

SQL Server index included columns

I need help understanding how to create indexes. I have a table that looks like this
Id
Name
Age
Location
Education
PhoneNumber
My query looks like this:
SELECT *
FROM table1
WHERE name = 'sam'
What's the correct way to create an index for this with included columns?
What if the query has an ORDER BY statement?
SELECT *
FROM table1
WHERE name = 'sam'
ORDER BY id DESC
What if I have 2 parameters in my where statement?
SELECT *
FROM table1
WHERE name = 'sam'
AND age > 12
The correct way to create an index with included columns? Either via Management Studio/Toad/etc, or SQL (documentation):
CREATE INDEX idx_table_1 ON db.table_1 (name) INCLUDE (id)
What if the Query has an ORDER BY
The ORDER BY can use indexes, if the optimizer sees fit to (determined by table statistics & query). It's up to you to test if a composite index or an index with INCLUDE columns works best by reviewing the query cost.
If id is the clustered key (not always the primary key though), I probably wouldn't INCLUDE the column...
What if I have 2 parameters in my where statement?
Same as above - you need to test what works best for your query. Might be composite, or include, or separate indexes.
But keep in mind that:
tweaking for one query won't necessarily benefit every other query
indexes do slow down INSERT/UPDATE/DELETE statements, and require maintenance
You can use the Database Tuning Advisor (DTA) for index recommendations, including when some are redundant
Recommended reading
I highly recommend reading Kimberly Tripp's "The Tipping Point" for a better understanding of index decisions and impacts.
Since I do not know exactly which tasks your DB is going to handle and how many records are in it, I would suggest that you take a look at the Index Basics MSDN article. It will allow you to decide for yourself which indexes to create.
If ID is your primary and/or clustered index key, just create an index on Name, Age. This will cover all three queries.
Included fields are best used to retrieve row-level values for columns that are not in the filter list, or to retrieve aggregate values where the sorted field is in the GROUP BY clause.
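A minimal sketch of the (Name, Age) suggestion above (assuming Id really is the clustered primary key, so it travels with the index entries automatically):
-- One composite index serves all three queries' WHERE clauses:
-- name alone, and name plus the age range.
CREATE INDEX IX_table1_Name_Age ON table1 (Name, Age);
-- Note: because the queries SELECT *, the remaining columns are still
-- fetched via key lookups against the clustered index.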
If inserts are rare, create as many indexes as you want.
For the first query, create an index on the Name column.
The Id column, I assume, is already the primary key...
For the third query, create an index on (Name, Age). You could also keep only the one index on (Name, Age); it will not be much slower for the first query.

Does the order of fields in a WHERE clause affect performance in MySQL?

I have two indexed fields in a table - type and userid (individual indexes, not a composite).
The type field's values are very limited (let's say it is only 0 or 1), so 50% of the table's records have the same type. userid values, on the other hand, come from a much larger set, so the number of records with the same userid is small.
Will any of these queries run faster than the other:
select * from table where type=1 and userid=5
select * from table where userid=5 and type=1
Also if both fields were not indexed, would it change the behavior?
SQL was designed to be a declarative language, not a procedural one. So the query optimizer should not consider the order of the where clause predicates in determining how to apply them.
I'm probably going to waaaay over-simplify the following discussion of an SQL query optimizer. I wrote one years ago, along these lines (it was tons of fun!). If you really want to dig into modern query optimization, see Dan Tow's SQL Tuning, from O'Reilly.
In a simple SQL query optimizer, the SQL statement first gets compiled into a tree of relational algebra operations. These operations each take one or more tables as input and produce another table as output. Scan is a sequential scan that reads a table in from the database. Sort produces a sorted table. Select produces a table whose rows are selected from another table according to some selection condition. Project produces a table with only certain columns of another table. Cross Product takes two tables and produces an output table composed of every conceivable pairing of their rows.
Confusingly, the SQL SELECT clause is compiled into a relational algebra Project, while the WHERE clause turns into a relational algebra Select. The FROM clause turns into one or more Joins, each taking two tables in and producing one table out. There are other relational algebra operations involving set union, intersection, difference, and membership, but let's keep this simple.
This tree really needs to be optimized. For example, if you have:
select E.name, D.name
from Employee E, Department D
where E.id = 123456 and E.dept_id = D.dept_id
with 5,000 employees in 500 departments, executing an unoptimized tree will blindly produce all possible combinations of one Employee and one Department (a Cross Product) and then Select out just the one combination that was needed. The Scan of Employee will produce a 5,000 record table, the Scan of Department will produce a 500 record table, the Cross Product of those two tables will produce a 2,500,000 record table, and the Select on E.id will take that 2,500,000 record table and discard all but one, the record that was wanted.
[Real query processors will try not to materialize all of these intermediate tables in memory of course.]
So the query optimizer walks the tree and applies various optimizations. One is to break up each Select into a chain of Selects, one for each of the original Select's top level conditions, the ones and-ed together. (This is called "conjunctive normal form".) Then the individual smaller Selects are moved around in the tree and merged with other relational algebra operations to form more efficient ones.
In the above example, the optimizer first pushes the Select on E.id = 123456 down below the expensive Cross Product operation. This means the Cross Product just produces 500 rows (one for each combination of that employee and one department). Then the top level Select for E.dept_id = D.dept_id filters out the 499 unwanted rows. Not bad.
If there's an index on Employee's id field, then the optimizer can combine the Scan of Employee with the Select on E.id = 123456 to form a fast index Lookup. This means that only one Employee row is read into memory from disk instead of 5,000. Things are looking up.
The final major optimization is to take the Select on E.dept_id = D.dept_id and combine it with the Cross Product. This turns it into a relational algebra Equijoin operation. This doesn't do much by itself. But if there's an index on Department.dept_id, then the lower level sequential Scan of Department feeding the Equijoin can be turned into a very fast index Lookup of our one employee's Department record.
Lesser optimizations involve pushing Project operations down. If the top level of your query just needs E.name and D.name, and the conditions need E.id, E.dept_id, and D.dept_id, then the Scan operations don't have to build intermediate tables with all the other columns, saving space during the query execution. We've turned a horribly slow query into two index lookups and not much else.
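In SQL terms, the two indexes that make those lookups possible might be created like this (a sketch; the Employee/Department schema is the hypothetical one from the example, and in practice these columns would often be primary keys and indexed already):
-- Enables the fast Lookup of the single employee by id:
CREATE INDEX employee_id_idx ON Employee (id);
-- Enables the Equijoin's index Lookup of the matching department row:
CREATE INDEX department_dept_id_idx ON Department (dept_id);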
Getting more towards the original question, let's say you've got:
select E.name
from Employee E
where E.age > 21 and E.state = 'Delaware'
The unoptimized relational algebra tree, when executed, would Scan in the 5,000 employees and produce, say, the 126 ones in Delaware who are older than 21. The query optimizer also has some rough idea of the values in the database. It might know that the E.state column has the 14 states that the company has locations in, and something about the E.age distributions. So first it sees if either field is indexed. If E.state is, it makes sense to use that index to just pick out the small number of employees the query processor suspects are in Delaware based on its last computed statistics. If only E.age is, the query processor likely decides that it's not worth it, since 96% of all employees are 22 and older. So if E.state is indexed, our query processor breaks the Select and merges the E.state = 'Delaware' with the Scan to turn it into a much more efficient Index Scan.
Let's say in this example that there are no indexes on E.state and E.age. The combined Select operation takes place after the sequential "Scan" of Employee. Does it make a difference which condition in the Select is done first? Probably not a great deal. The query processor might leave them in the original order in the SQL statement, or it might be a bit more sophisticated and look at the expected expense. From the statistics, it would again find that the E.state = 'Delaware' condition should be more highly selective, so it would reverse the conditions and do that first, so that there are only 126 E.age > 21 comparisons instead of 5,000. Or it might realize that string equality comparisons are much more expensive than integer compares and leave the order alone.
At any rate, all this is very complex and your syntactic condition order is very unlikely to make a difference. I wouldn't worry about it unless you have a real performance problem and your database vendor uses the condition order as a hint.
Most query optimizers use the order in which conditions appear as a hint. If everything else is equal, they will follow that order.
However, many things can override that:
the second field has an index and the first has not
there are statistics to suggest that field 2 is more selective
the second field is easier to search (varchar(max) vs int)
So (and this is true for all SQL optimization questions) unless you observe a performance issue, it's better to optimize for clarity, not for (imagined) performance.
It shouldn't in your small example. The query optimizer should do the right thing. You can check for sure by adding explain to the front of the query. MySQL will tell you how it's joining things together and how many rows it needs to search in order to do the join. For example:
explain select * from table where type=1 and userid=5
If they were not indexed it would probably change behavior.
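If you do want a single index to serve this query, a composite index leading with the more selective column is a common choice (a sketch; the table name is hypothetical, and you should verify with EXPLAIN as above):
-- userid first, since it is far more selective than type;
-- either ordering of the WHERE conditions can use this index.
CREATE INDEX idx_userid_type ON mytable (userid, type);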

Unique index on two columns plus separate index on each one?

I don't know much about database optimization, but I'm trying to understand this case.
Say I have the following table:
cities
===========
state_id integer
name varchar(32)
slug varchar(32)
Now, say I want to perform queries like this:
SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city'
SELECT * FROM cities WHERE state_id = 123
If I want the "slug" for a city to be unique within its particular state, I'd add a unique index on state_id and slug.
Is that index enough? Or should I also add another on state_id so the second query is optimized? Or does the second query automatically use the unique index?
I'm working on PostgreSQL, but I feel this case is so simple that most DBMSs work similarly.
Also, I know this surely doesn't make a difference on small tables, but my example is a simple one. Think of 200k+ rows tables.
Thanks!
A single unique index on (state_id, slug) should be sufficient. To be sure, of course, you'll need to run EXPLAIN and/or ANALYZE (perhaps with the help of something like http://explain.depesz.com/), but ultimately what indexes are appropriate depends very closely on what kind of queries you will be running. Remember, indexes make SELECTs faster and INSERTs, UPDATEs, and DELETEs slower, so you ideally want only as many indexes as are actually necessary.
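For reference, a sketch of that single index (standard PostgreSQL syntax; the index name is arbitrary):
-- Enforces slug uniqueness per state, and serves both example queries:
-- the (state_id, slug) lookup and the state_id-only lookup (leading column).
CREATE UNIQUE INDEX cities_state_id_slug_idx ON cities (state_id, slug);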
Also, PostgreSQL has a smart query optimizer: it will use radically different search plans for queries on small tables and huge tables. If the table is small, it will just do a sequential scan and not even bother with any indexes, since the overhead of working with them is higher than just brute-force sifting through the table. This changes to a different plan once the table size passes a threshold, and may change again if the table gets larger again, or if you change your SELECT, or....
Summary: you can't trust the results of EXPLAIN and ANALYZE on datasets much smaller or different than your actual data. Make it work, then make it fast later (if you need to).
[EDIT: Misread the question... Hopefully my answer is more relevant now!]
In your case, I'd suggest 1 index on (state_id, slug). If you ever need to search just by slug, add an index on just that column. If you have those, then adding another index on state_id is unnecessary as the first index already covers it.
An index can be used whenever an initial segment of its columns are used in a WHERE clause. So e.g. an index on columns A, B and C will optimise queries containing WHERE clauses involving A, B and C, WHERE clauses with just A and B, or WHERE clauses with just A. Note that the order that columns appear in the index definition is very important -- this example index cannot be used for WHERE clauses involving just B and/or C.
(Of course it's up to the query optimiser whether or not a particular index actually gets used, but in your case with 200k rows, you can guarantee that a simple search by state_id or slug or both will use one of the indices.)
Any decent optimizer will see an index on three columns - say:
CREATE INDEX idx_1 ON SomeTable(Col1, Col2, Col3);
and will use that index for any of the following conditions:
WHERE Col1 = ...something...
WHERE Col1 = ...something... AND Col2 = ...otherthing...
WHERE Col3 = ....whatnot....
AND Col1 = ...something....
AND Col2 = ...otherthing...
That is, it will use the index if there are conditions applied to any contiguous leading subset of the columns of the index. Although I used equality, it can also apply to ranges (open - just greater than, for example) or closed (between two values).
To do optimization, use EXPLAIN (http://www.postgresql.org/docs/7.4/static/sql-explain.html) and see for yourself.
But optimization is not the most important reason to create that index; first and foremost, the unique index is a constraint that keeps the database logically consistent.

What is a Covered Index?

I've just heard the term covered index in some database discussion - what does it mean?
A covering index is an index that contains all of the columns you need for your query, and possibly more.
For instance, this:
SELECT *
FROM tablename
WHERE criteria
will typically use indexes to speed up the resolution of which rows to retrieve using criteria, but then it will go to the full table to retrieve the rows.
However, if the index contained the columns column1, column2 and column3, then this sql:
SELECT column1, column2
FROM tablename
WHERE criteria
then, provided that particular index can be used to speed up the resolution of which rows to retrieve, the index already contains the values of the columns you're interested in, so it won't have to go to the table to retrieve the rows; it can produce the results directly from the index.
This can also be exploited when you see that a typical query uses 1-2 columns to resolve which rows and then typically fetches another 1-2 columns: it could be beneficial to append those extra columns (if they're the same all over) to the index, so that the query processor can get everything from the index itself.
Here's an article: Index Covering Boosts SQL Server Query Performance on the subject.
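As a sketch of that idea (all names here are hypothetical): if queries filter on column3 but also select column1 and column2, appending the selected columns makes the index covering:
-- Hypothetical names. The leading column serves the WHERE clause;
-- the appended columns let the query be answered from the index alone.
CREATE INDEX ix_tablename_covering ON tablename (column3, column1, column2);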
A covering index is just an ordinary index. It's called "covering" if it can satisfy the query without any need to access the table data.
example:
CREATE TABLE MyTable
(
ID INT IDENTITY PRIMARY KEY,
Foo INT
)
CREATE NONCLUSTERED INDEX index1 ON MyTable(ID, Foo)
SELECT ID, Foo FROM MyTable -- All requested data are covered by index
This is one of the fastest methods to retrieve data from SQL server.
Covering indexes are indexes which "cover" all columns needed from a specific table, removing the need to access the physical table at all for a given query/ operation.
Since the index contains the desired columns (or a superset of them), table access can be replaced with an index lookup or scan -- which is generally much faster.
Columns to cover:
parameterized or static conditions: columns restricted by a parameterized or constant condition
join columns: columns dynamically used for joining
selected columns: to answer the selected values.
While covering indexes can often provide good benefit for retrieval, they do add somewhat to insert/update overhead, due to the need to write extra or larger index rows on every update.
Covering indexes for Joined Queries
Covering indexes are probably most valuable as a performance technique for joined queries. This is because joined queries are more costly and more likely than single-table retrievals to suffer high-cost performance problems.
In a joined query, covering indexes should be considered per-table.
Each "covering index" removes a physical table access from the plan and replaces it with index-only access.
Investigate the plan costs and experiment with which tables are most worthwhile to replace with a covering index.
By this means, the multiplicative cost of large join plans can be significantly reduced.
For example:
select poi.title, c.name, c.address
from porderitem poi
join porder po on po.id = poi.fk_order
join customer c on c.id = po.fk_customer
where po.orderdate > ? and po.status = 'SHIPPING';
create index porder_custitem on porder (orderdate, id, status, fk_customer);
See:
http://literatejava.com/sql/covering-indexes-query-optimization/
Let's say you have a simple table with the columns below; you have only indexed Id here:
Id (Int), Telephone_Number (Int), Name (VARCHAR), Address (VARCHAR)
Imagine you have to run the query below and check whether it uses the index and performs efficiently, without extra I/O calls. Remember, you have only created an index on Id.
SELECT Id FROM mytable WHERE Telephone_Number = 55442233;
When you check this query's performance you will be disappointed: since Telephone_Number is not indexed, rows have to be fetched from the table using I/O calls. So this is not a covering index, because a column in the query is not in the index, which leads to frequent I/O calls.
To make it a covered index, you need to create a composite index on (Telephone_Number, Id), so that the filtered column leads the index.
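A sketch of that fix (column order matters: Telephone_Number must lead so the WHERE clause can seek the index):
-- Covering index for the query above; the Id values are carried
-- in the index entries, so no table I/O is needed.
CREATE INDEX idx_phone_id ON mytable (Telephone_Number, Id);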
For more details, please refer to this blog:
https://www.percona.com/blog/2006/11/23/covering-index-and-prefix-indexes/