Same query uses different indexes? - sql

Can a select query use different indexes if I change the value of a where condition?
The following two queries use different indexes, and the only difference between them is the value of the
condition: and typeenvoi='EXPORT' versus and typeenvoi='MAIL'
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='EXPORT'
and nbessais<1
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='MAIL'
and nbessais<1
Can anyone give me an explanation?

Details on indexes are stored as statistics in a histogram-type dataset in SQL Server.
Each index is chunked into ranges, and each range contains a summary of the key values within that range, things like:
the range's high value
the number of values in the range
the number of distinct values in the range (cardinality)
the number of values equal to the high value
...and so on.
You can view the statistics on a given index with:
DBCC SHOW_STATISTICS(<tablename>, <indexname>)
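For example, run against the objects in the question it might look like the following (both names here are placeholders: the first must be a table or indexed view, the second an index or statistics object that actually exists on it):
DBCC SHOW_STATISTICS ('dbo.envoiautomate', 'IX_typeenvoi')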
Each index has a couple of characteristics like density, and ultimately selectivity, that tell the query optimiser how unique each value in an index is likely to be, and how efficient this index is at quickly locating records.
As your query has three columns in the where clause, it's likely that any of these columns might have an index that could be useful to the optimiser. It's also likely that the primary key index will be considered, in the event of the selectivity of other indexes not being high enough.
Ultimately, it boils down to the optimiser making a quick judgement call on how many page reads will be necessary to read each of your non-clustered indexes plus the associated bookmark lookups, comparing those estimates against each other and against doing a table scan.
The statistics that these judgements are based on can vary wildly too; SQL Server, by default, only samples a small percentage of any significant table's rows, so the selectivity of that index might not be representative of the whole. This is particularly problematic where you have highly non-unique keys in the index.
In this specific case, I'm guessing your typeenvoi index is highly non-unique. This being so, the statistics gathered probably indicate to the optimiser that one of the values is rarer than the other, and the likelihood of that index being chosen is increased.
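If sampling is the suspect, one hedged check is to refresh the statistics with a full scan and see whether the two queries then settle on the same plan. Assuming the view reads from an underlying table, hypothetically named envoiautomate here:
UPDATE STATISTICS envoiautomate WITH FULLSCAN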

The query optimiser in SQL Server (as in most modern DBMS platforms) uses a methodology known as 'cost based optimisation.' In order to do this it uses statistics about the tables in the database to estimate the amount of I/O needed. The optimiser will consider a number of semantically equivalent query plans that it generates by transforming a basic query plan generated by parsing the statement.
Each plan is evaluated for cost by a heuristic based on the statistics maintained about the tables. The statistics come in various flavours:
Table and index row counts
Distribution histograms of the values in individual columns.
If the occurrence of 'MAIL' vs. 'EXPORT' in the distribution histograms is significantly different, the query optimiser can come up with different optimal plans. This is probably what happened.
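A simple way to check whether this is the case here is to compare the two values' actual frequencies under the same filters; a large skew between the counts would support this explanation:
select typeenvoi, count(*) as nb
from v_envoiautomate
where etat=0 and nbessais<1
group by typeenvoi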

Probably has to do with the "cardinality", I believe the word is, of the values in the table. If there are a lot more rows that match that clause, SQL Server may decide that one query will be more efficient using an index for a different column. This is an extreme case, but if there was one row that matched 'MAIL', it would likely use that index. If every other row in the table was 'EXPORT', but only half of those 'EXPORT' rows had an etat of 0, then it would probably use the index on that column.

Related

Performance of SQL query with condition vs. without where clause

Which SQL query will execute in less time, one with a WHERE clause or one without, when:
WHERE-clause deals with indexed field (e.g. primary key field)
WHERE-clause deals with non-indexed field
I suppose that when we're working with indexed fields, the query with WHERE will be faster. Am I right?
As has been mentioned, there is no fixed answer to this. It all depends on the particular context. But just for the sake of an answer, take this simple query:
SELECT first_name FROM people WHERE last_name = 'Smith';
To process this query without an index, the last_name column must be checked for every row in the table (a full table scan).
With an index, you could just follow a B-tree data structure until 'Smith' was found.
Without an index the worst case is linear, O(n), whereas with a B-tree it is O(log n), hence computationally less expensive.
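For completeness, the index assumed in this example could be created like so (the index name is just illustrative):
CREATE INDEX idx_people_last_name ON people (last_name);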
Not sure what you mean by 'query with WHERE-clause or without', but you're correct that most of the time a query with a WHERE clause on an indexed field will outperform a query whose WHERE clause is on a non-indexed field.
One instance where the performance will be the same (i.e. indexing doesn't matter) is when you run a range-based query in your WHERE clause (i.e. WHERE col1 > x). This forces a scan of the table, and thus will be the same speed as a range query on a non-indexed column.
Really, it depends on the columns you reference in the where clause, the types of data in the columns, the types of queries your running, etc.
It may depend on the type of where clause you are writing. In a simple where clause, it is generally better to have an index on the field you are using (and indexes can and should be built on more than the PK). However, you have to write a sargable where clause for the index to make any difference. See this question for some guidelines on sargability:
What makes a SQL statement sargable?
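As a rough sketch of what sargability means in practice (the table and column names are hypothetical): wrapping the indexed column in a function makes the predicate non-sargable, while an equivalent range condition on the bare column can use an index.
-- Non-sargable: the function hides the column from an index seek
SELECT * FROM orders WHERE YEAR(order_date) = 2023;
-- Sargable equivalent: a range on the bare column can seek an index on order_date
SELECT * FROM orders WHERE order_date >= '20230101' AND order_date < '20240101';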
There are cases where a where clause on the primary key will be slower.
The simplest is a table with one row. Using the index requires loading both the index page and the data page: two reads. Having no index cuts the work in half.
That is a degenerate case, but it points to the issue -- the proportion of the rows selected. Or, more accurately, the proportion of pages needed to resolve the query.
When the desired data is on all pages, then using an index slows things down. For a non-primary key, this can be disastrous when the table is bigger than the page cache and the accesses are random.
Since pages are ordered by a primary key, the worst case is an additional index scan -- not too bad.
Some databases use statistics on tables to decide when to use an index and when to do a full table scan. Some don't.
In short, for low-selectivity queries (those that return a small proportion of the rows), an index will improve performance. For high-selectivity queries (those that return a large proportion), using an index can result in marginally worse performance or dire performance, depending on various factors.
Some of my queries are quite complex, and applying a where clause degraded the performance. As a workaround, I used temp tables and then applied the where clause to them. This significantly improved the performance. It also helped where I had joins, especially LEFT OUTER JOINs.
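A minimal sketch of that workaround, with made-up object names: materialise the expensive join into a temp table first, then apply the where clause to the temp table.
SELECT o.order_id, o.customer_id, o.total, n.note
INTO #with_notes
FROM orders AS o
LEFT OUTER JOIN order_notes AS n ON n.order_id = o.order_id;

SELECT order_id, customer_id, total, note
FROM #with_notes
WHERE total > 100;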

Are SQL Execution Plans based on Schema or Data or both?

I hope this question is not too obvious...I have already found lots of good information on interpreting execution plans but there is one question I haven't found the answer to.
Is the plan (and more specifically the relative CPU cost) based on the schema only, or also the actual data currently in the database?
I am trying to do some analysis of where indexes are needed in my product's database, but am working with my own test system which does not have close to the amount of data a product in the field would have. I am seeing some odd things, like the estimated CPU cost actually going slightly UP after adding an index, and am wondering if this is because my data set is so small.
I am using SQL Server 2005 and Management Studio to generate the plans.
It will be based on both Schema and Data. The Schema tells it what indexes are available, the Data tells it which is better.
The answer can change in small degrees depending on the DBMS you are using (you have not stated which), but they all maintain statistics against indexes to know whether an index will help. If an index breaks 1000 rows into 900 distinct values, it is a good index to use. If an index only results in 3 different values for 1000 rows, it is not really selective, so it is not very useful.
SQL Server uses a 100% cost-based optimizer. Other RDBMS optimizers are usually a mix of cost-based and rules-based, but SQL Server, for better or worse, is entirely cost driven. A rules-based optimizer would be one that can say, for example, that the order of the tables in the FROM clause determines the driving table in a join. There are no such rules in SQL Server. See SQL Statement Processing:
The SQL Server query optimizer is a cost-based optimizer. Each possible execution plan has an associated cost in terms of the amount of computing resources used. The query optimizer must analyze the possible plans and choose the one with the lowest estimated cost. Some complex SELECT statements have thousands of possible execution plans. In these cases, the query optimizer does not analyze all possible combinations. Instead, it uses complex algorithms to find an execution plan that has a cost reasonably close to the minimum possible cost.
The SQL Server query optimizer does not choose only the execution plan with the lowest resource cost; it chooses the plan that returns results to the user with a reasonable cost in resources and that returns the results the fastest. For example, processing a query in parallel typically uses more resources than processing it serially, but completes the query faster. The SQL Server optimizer will use a parallel execution plan to return results if the load on the server will not be adversely affected.
The query optimizer relies on distribution statistics when it estimates the resource costs of different methods for extracting information from a table or index. Distribution statistics are kept for columns and indexes. They indicate the selectivity of the values in a particular index or column. For example, in a table representing cars, many cars have the same manufacturer, but each car has a unique vehicle identification number (VIN). An index on the VIN is more selective than an index on the manufacturer. If the index statistics are not current, the query optimizer may not make the best choice for the current state of the table. For more information about keeping index statistics current, see Using Statistics to Improve Query Performance.
Both schema and data.
It takes the statistics into account when building a query plan, using them to approximate the number of rows returned by each step in the query (as this can have an effect on the performance of different types of joins, etc).
A good example of this is the fact that it doesn't bother to use indexes on very small tables, as performing a table scan is faster in this situation.
I can't speak for all RDBMS systems, but Postgres specifically uses estimated table sizes as part of its efforts to construct query plans. As an example, if a table has two rows, it may choose a sequential table scan for the portion of the JOIN that uses that table, whereas if it has 10000+ rows, it may choose to use an index or hash scan (if either of those are available.) Incidentally, it used to be possible to trigger poor query plans in Postgres by joining VIEWs instead of actual tables, since there were no estimated sizes for VIEWs.
Part of how Postgres constructs its query plans depends on tunable parameters in its configuration file. More information on how Postgres constructs its query plans can be found on the Postgres website.
For SQL Server, there are many factors that contribute to the final execution plan. On a basic level, statistics play a very large role; they are based on the data, but not always all of the data. Statistics are also not always up to date. When creating or rebuilding an index, the statistics should be based on a FULL / 100% sample of the data. However, the sample rate for automatic statistics refreshing is much lower than 100%, so it is possible to sample a range that is in fact not representative of much of the data. The estimated number of rows for the operation also plays a role, which can be based on the number of rows in the table or on the statistics of a filtered operation. So out-of-date (or incomplete) statistics can lead the optimizer to choose a less-than-optimal plan, just as a few rows in a table can cause it to ignore indexes entirely (a scan can be more efficient in that case).
As mentioned in another answer, the more unique (i.e. selective) the data is, the more useful the index will be. But keep in mind that the only column guaranteed to have statistics is the leading (or "left-most" or "first") column of the index. SQL Server can, and does, collect statistics for other columns, even some not in any indexes, but only if the AutoCreateStatistics DB option is set (and it is by default).
Also, the existence of Foreign Keys can help the optimizer when those fields are in a query.
But one area not considered in the question is that of the Query itself. A query, slightly changed but still returning the same results, can have a radically different Execution Plan. It is also possible to invalidate the use of an Index by using:
LIKE '%' + field
or wrapping the field in a function, such as:
WHERE DATEADD(DAY, -1, field) < GETDATE()
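Where possible, such a predicate can be rearranged so that the bare column is compared, which keeps an index on field usable; this is just an algebraic rewrite of the example above:
WHERE field < DATEADD(DAY, 1, GETDATE())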
Now, keep in mind that read operations are (ideally) faster with Indexes but DML operations (INSERT, UPDATE, and DELETE) are slower (taking more CPU and Disk I/O) as the Indexes need to be maintained.
Lastly, the "estimated" CPU, etc. values for cost are not always to be relied upon. A better test is to do:
SET STATISTICS IO ON
run query
SET STATISTICS IO OFF
and focus on "logical reads". If you reduce Logical Reads then you should be improving performance.
You will, in the end, need a set of data that comes somewhat close to what you have in Production in order to performance tune with regards to both Indexes and the Queries themselves.
Oracle specifics:
The stated cost is actually an estimated execution time, but it is given in a somewhat arcane unit of measure that has to do with estimated time for block reads. It's important to realize that the calculated cost doesn't say much about the runtime anyway, unless each and every estimate made by the optimizer was 100% perfect (which is never the case).
The optimizer uses the schema for a lot of things when deciding what transformations/heuristics can be applied to the query. Some examples of schema things that matter a lot when evaluating xplans:
Foreign key constraints (can be used for table elimination)
Partitioning (exclude entire ranges of data)
Unique constraints (index unique vs range scans for example)
Not null constraints (anti-joins are not available with NOT IN() on nullable columns)
Data types (type conversions, specialized date arithmetics)
Materialized views (for rewriting a query against an aggregate)
Dimension Hierarchies (to determine functional dependencies)
Check constraints (the constraint is injected if it lowers cost)
Index types (b-tree(?), bitmap, joined, function based)
Column order in index (a = 1 on {a,b} = range scan, {b,a} = skip scan or FFS)
The core of the estimates comes from using the statistics gathered on actual data (or cooked). Statistics are gathered for tables, columns, indexes, partitions and probably something else too.
The following information is gathered:
Nr of rows in table/partition
Average row/col length (important for costing full scans, hash joins, sorts, temp tables)
Number of nulls in col (is_president = 'Y' is pretty much unique)
Distinct values in col (last_name is not very unique)
Min/max value in col (helps unbounded range conditions like date > x)
...to help estimate the nr of expected rows/bytes returned when filtering data. This information is used to determine what access paths and join mechanisms are available and suitable given the actual values from the SQL query compared to the statistics.
On top of all that, there is also the physical row order, which affects how "good" or attractive an index becomes vs. a full table scan. For indexes this is called the "clustering factor" and is a measure of how closely the row order matches the order of the index entries.
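As a hedged illustration, in Oracle these statistics are typically gathered with DBMS_STATS; the schema and table names below are placeholders:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'APP_OWNER',   -- placeholder schema
    tabname => 'ORDERS',      -- placeholder table
    cascade => TRUE           -- also gather statistics on the table's indexes
  );
END;
/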

Do queries make use of more than one index at a time?

If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query? Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
If the cost-based query optimizer determines that it's more efficient to use more than one index, yes, it will. If it's more efficient to do a scan (and often it is), then it may not use an index, even if you think it should.
Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Again, if the optimizer thinks it's efficient to do so, yes, it'll use that other index for the same query. If it determines the cost is higher with the index...it'll ignore it. It all depends on how selective the index is (or rather, how selective the optimizer thinks it is, based on the latest statistics) as to whether it'll use it. If it's not selective (won't narrow down the results much), it'll likely ignore it.
It depends on the optimizer and the query, but optimizers relatively seldom use two separate indexes on a single table in a single query. It is perfectly feasible to construct examples where they could, possibly even should - and some may actually do so. Consider:
A UNION query where the separate terms have filters on different columns (but a table scan may be as effective)
A self-join where the separate sides of the self-join have the different filters.
However, be wary of accusing the optimizer of not being efficient - there may still be advantages to resolving the query by other methods.
To answer your 'index on 4 columns' question: it is rather unlikely. In this scenario, it is likely that the 4-column index provides good selectivity and the query is most easily resolved by applying the extra filter condition to the rows retrieved by the index scan. (Note that the answer might be different depending on whether the extra condition is connected to the others by AND (as I assumed) or OR (where using the second index might be useful).)
It depends upon the queries emitted against those tables, the size of the tables and the selectivity of the data in the columns indexed.
The optimizer uses statistics to determine whether using an index will be beneficial.
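As a hedged sketch (all names invented), this is the kind of shape where an optimizer may choose to intersect two single-column indexes, although scanning just one index, or the whole table, can still win on cost:
CREATE INDEX ix_orders_customer_id ON orders (customer_id);
CREATE INDEX ix_orders_status ON orders (status);

SELECT order_id
FROM orders
WHERE customer_id = 42
  AND status = 'SHIPPED';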
1. If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
It certainly can. For example, suppose you have the table (sketched here as runnable DDL; the column types are only illustrative)
CREATE TABLE EMPLOYEE (
    id      INT PRIMARY KEY,   -- index1 (the primary key's index)
    name    VARCHAR(100),
    address VARCHAR(200),
    date    DATE               -- index2
);
CREATE INDEX index2 ON EMPLOYEE (date);
and the table
CREATE TABLE TASKS (
    id          INT,
    employee_id INT,            -- index3
    date        DATE,           -- index4
    category    VARCHAR(50),
    description VARCHAR(500)
);
CREATE INDEX index3 ON TASKS (employee_id);
CREATE INDEX index4 ON TASKS (date);
If you do the query:
SELECT t.employee_id, t.date, t.category, t.description
FROM EMPLOYEE AS e
JOIN TASKS AS t
  ON e.id = t.employee_id
 AND e.date = t.date
this will list all the tasks of each employee on each day and will use index1 and index2 along with index3 and index4, which would take much more time if either index1 or index2 were missing.
2. If I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Of course it can be done, but the query would need to include conditions on both the 4-column index's columns and on the single-column index's column.
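For example (hypothetical table and index names), with a 4-column index and a separate single-column index, a query filtering on all five columns gives the optimizer the option, but not the obligation, to combine the two:
CREATE INDEX ix_t_abcd ON t (a, b, c, d);
CREATE INDEX ix_t_e ON t (e);

SELECT a, b, c, d, e
FROM t
WHERE a = 1 AND b = 2 AND c = 3 AND d = 4 AND e = 5;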

TableCardinality

What does TableCardinality mean in an execution plan?
I am looking at database performance tuning.
Thanks
Cardinality is the number of UNIQUE values for a given table / column (so it's not surprising that it is equal to the number of entries in the primary key index, as that index is most likely clustered). The cardinality of the index or table is useful to SQL Server as it allows the query optimiser to make educated guesses about the possible best plans to use when referencing that table.
When the optimiser has to take your sql code and work out what to do with it, it will consider alternative plans before settling on the one that it will use to actually retrieve the data. In most real-world cases the number of possible plans is too large for SQL Server to calculate the absolute best plan by sampling all possibilities, so the optimiser will use statistical analysis to determine a "good enough" plan.
Cardinality is one of the metrics it uses to work out such a plan, as it may use this to determine which physical joins to use (hash joins or nested loops, for example).
From this article
SQL Server keeps track of table cardinality when the query plan is compiled. It does so because it will trigger an automatic recompile if the actual cardinality is substantially different from the compile time cardinality.
It would seem a reasonable guess that that is what the TableCardinality in the plan XML is (and shown in the properties window) but I haven't found anything to confirm that.
Table Cardinality, in the context of the Execution Plan, is normally the same as the number of rows in the table per the latest statistics on that table.
Source: see http://support.microsoft.com/kb/195565 and search for "table cardinality" - you will see this note: "NOTE: In this strictest sense, SQL Server counts cardinality as the number of rows in the table."
Also, I checked a number of tables in various queries, and in each case, TableCardinality was equal to the number of rows in the table, or in some cases, the rowcnt value in sys.sysindexes for the primary key for that table.
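To run the same comparison yourself, one sketch (using catalog views available in SQL Server 2005 and later; the table name is a placeholder) is to take the row count from sys.partitions and compare it with the TableCardinality shown in the plan:
SELECT SUM(p.rows) AS row_count
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.YourTable')
  AND p.index_id IN (0, 1);   -- heap or clustered index only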
Isn't it the same as Estimated Number of Rows?

Do indexes suck in SQL?

Say I have a table with a large number of rows and one of the columns which I want to index can have one of 20 values.
If I were to put an index on the column would it be large?
If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same.
It's not the indexes that will suck. It's putting indexes on the wrong columns that will suck.
Seriously though, why would you need a table with a single column? What would the meaning of that data be? What purpose would it serve?
And 20 tables? I suggest you read up on database design first, or otherwise explain to us the context of your question.
Indexes (or indices) don't suck. A lot of very smart people have spent a truly remarkable amount of time of the last several decades ensuring that this is so.
Your schema, however, lacking the same amount of expertise and effort, may suck very badly indeed.
Partitioning, in the case described, is equivalent to applying a clustered index. If the table is sorted otherwise (or is in arbitrary order) then the index necessarily has to occupy much more space. Depending on the platform, a non-clustered index may reduce in size as the sortedness of the rows with respect to the indexed value increases.
YMMV.
The short answer:
Do indexes suck: Yes and No
The longer answer:
They don't suck if used properly. Maybe you should start reading about how indexes work, why they can work and why they sometimes don't work.
Good starting points:
http://www.sqlservercentral.com/articles/Indexing/
No, indexes don't suck, but you have to pay attention to how you use them or they can backfire on the performance of your queries.
First: Schema / design
Why would you create a table with only one column? That's probably taking normalization one step too far. Database design is one of the most important things to consider in optimizing performance.
Second: Indexes
In a nutshell, indexes help the database perform a binary search for your record. Without an index on a column (or set of columns) the database will often fall back to a table scan. A table scan is very expensive because it involves enumerating each and every record.
For index lookups it doesn't really matter THAT much how many records there are in the table. Because of the (balanced) binary tree search, doubling the number of records only adds one extra search step.
Determine the primary key of your table; SQL Server will automatically place a clustered index on that column (or columns). Clustered indexes perform really well. In addition you can place non-clustered indexes on columns that are used often in SELECT, JOIN, WHERE, GROUP BY and ORDER BY statements. Do remember that indexes have a certain overlap; try never to include your clustered index's columns in a non-clustered index. A minimal sketch of this setup follows below.
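In the sketch, all names are hypothetical: the PRIMARY KEY gets the clustered index by default, and a non-clustered index is added for a column that is filtered on frequently.
CREATE TABLE customers (
    customer_id INT NOT NULL PRIMARY KEY,  -- clustered index by default
    last_name   VARCHAR(100),
    city        VARCHAR(100)
);
CREATE NONCLUSTERED INDEX ix_customers_city ON customers (city);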
Also interesting might be the fill factor on the indexes. Do you want to optimize your table for reads (high fill factor: less storage, less IO) or for writes (low fill factor: more storage, less rebuilding of your database pages)?
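For example (continuing the hypothetical table above), the fill factor can be set per index; a lower value leaves free space on each leaf page for future inserts, at the cost of more pages to read:
CREATE NONCLUSTERED INDEX ix_customers_last_name
    ON customers (last_name)
    WITH (FILLFACTOR = 80);   -- leave roughly 20% of each leaf page free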
Third: Partitioning
One of the reasons to use partitioning is to optimize your data access. Let's say you have 1 million records of which 500,000 records are no longer relevant but stored for archiving purposes. In this case you could decide to partition the table and store the 500,000 old records on slow storage and the other 500,000 records on fast storage.
To measure is to know
The best way to get insight into what happens is to measure what happens to your CPU and IO. Microsoft SQL Server has tools like the Profiler and the execution plans in Management Studio that will tell you the duration of your query, the number of reads/writes, and the CPU usage. The execution plan will also tell you which indexes are being used, or IF they are being used at all. To your surprise you might see a table scan even though you didn't expect one.
Say I have a table with a large number of rows and one column which I want to index can have one of 20 values. If I were to put an index on the column would it be large?
The index size will be proportional to the number of your rows and the length of the indexed values.
The index keeps not only the indexed value, but also some kind of pointer to the row (the ROWID in Oracle, the ctid in PostgreSQL, the primary key in InnoDB, etc.).
If you have 10,000 rows and only 1 distinct value, you will still have 10,000 records in your index.
If so, why? If I were to partition the data into 20 tables, one for each value of the column, the index size would be trivial but the indexing effect would be the same
In this case, you would end up with 20 indexes whose total size is the same as your original one.
This technique is sometimes used in fact in such called partitioned indexes. It has its advantages and drawbacks.
Standard b-tree indexes are best suited to fairly selective indexes, which this example would not be. You don't say what DBMS you are using; Oracle has another type of index called a bitmap index which is more suited to low-selectivity indexes in OLAP environments (since these indexes are expensive to maintain, making them unsuitable for OLTP environments).
The optimiser will decide based on stats whether it thinks the index will help get the data in the fastest time; if it won't, the optimiser won't use it.
Partitioning is another strategy. In Oracle you can define a table as partitioned on some set of columns, and the optimiser can automatically perform "partition elimination" as you suggest.
Sorry, I'm not quite sure what you mean by "large".
If your index is clustered, all the data for each record will be on the same leaf page, thereby creating the most efficient index available to your table as long as you write your queries against it properly.
If your index is non-clustered, then only the index-related data will be on your leaf pages. Then, depending on such things as how many other indexes you have, coupled with details like your fill factor, your index may or may not be efficient. In general, if you don't have a ton of indexes on your table, you should be safe.
The efficiency of your index will also be determined by the data type of the 20 values you're speaking of going into the column. If those are pre-defined values, then their details should probably be in a lookup table with a simple primary key datatype (like Int/Number). Then add that column to your table as a foreign key with an index on the column.
Ultimately, you could have a perfect index on a column. But its best use will be determined for the most part by the queries you write. So if your queries make use of the indexes, you're golden.
Indexes are purely for performance. If an index doesn't boost performance for the queries you're interested in, then it sucks.
As for disk usage, you have to weigh your concerns. Different SQL providers build indexes differently, but as a client, you generally trust that they do the best that can be done. In the case you're describing, a clustered index may be optimal for both size and performance.
It would be large enough to hold those values for all the rows, in sorted order.
Say you have 20 different strings of 4 characters each and 1 million rows; it would take at least 4 million bytes (or 8 million if 16-bit Unicode) just to hold those values.
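If you would rather measure than estimate, one hedged way to check an index's actual size in SQL Server (the object name is a placeholder) is:
SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 AS size_kb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id
WHERE i.object_id = OBJECT_ID('dbo.YourTable')
GROUP BY i.name;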