I'm pretty new to databases, so forgive me if this is a silly question.
In modern databases, if I use an index to access a row, I believe this will be O(1) complexity. But if I do a query to select another column, will it be O(1) or O(n)? Does the database have to iterate through all the rows, or does it build a sorted list for each column?
Actually, I think access based on an index will be O(log(n)), because you'll still be searching down through a B-Tree-esque organization to get to your record.
To answer your literal question: yes, if there is no index on a column, the database engine will have to look at all rows.
In the more interesting case of selecting on multiple columns, some indexed and some not, the situation becomes more complex: if the query optimizer chooses to use the index, it will first select rows based on the index and then apply a filter with the remaining constraints, reducing the second filtering operation from O(number of rows) to O(number of rows selected by the index). The ratio between these two numbers is called selectivity, and it is an important statistic when choosing which index to use.
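A rough illustration (the table and index names below are hypothetical, not from the question): with an index on only one of the two filter columns, the engine can seek that index and then evaluate the remaining predicate only against the rows the seek returned:
-- Index on one of the two filter columns.
CREATE INDEX ix_orders_customer_id ON orders (customer_id);
-- The optimizer can seek ix_orders_customer_id to find this customer's rows,
-- then check status = 'OPEN' only against those rows instead of the whole table.
SELECT order_id, customer_id, status
FROM orders
WHERE customer_id = 42
  AND status = 'OPEN';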
Indexes are per column, so if you use a WHERE clause on an un-indexed column the engine will do a so-called table scan, which is O(n).
I don't know the answer, but keep in mind that big-O notation only gives you an indication of performance for data-set sizes which are arbitrarily large.
For example, the bottleneck for database performance is typically disk seeks. Therefore, performance is greatly increased if the working data-set can be kept in memory. Big-O notation won't tell you anything about such optimizations, because they are only relevant for finite data-sets.
B-trees do not yield O(log2 N) disk accesses; that is the complexity of a binary tree.
A B-tree is organised so that each node occupies an entire block, so once a node is found a single I/O operation can read the whole block.
With the number of items per node equal to the blocking factor bfr (records per block), a B-tree-optimised search will take O(log_(bfr/2+1) N) I/O operations instead of the O(N) I/O operations of seeking a record by key without an index. For example, with bfr = 100 and N = 1,000,000 records the base is 51, so a lookup costs about log_51(1,000,000) ≈ 4 block reads, versus on the order of N/bfr = 10,000 block reads for a scan.
You have indexes. Clustered indexes are physically sorted on disk; you can have only one per table. Non-clustered indexes are logically sorted, and you can have many of them (but be careful not to overdo it, since they can slow down writes).
If there is no index on your column, then I believe it falls back to the good old row-by-row method.
There are different types of indexes, different execution plans and different implementations for different databases; most of the code of a relational database is in its search-optimising algorithms. There is no single answer to your question. You can use a tool to visualise the execution plan when you want to know how a query is going to be executed.
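As a quick illustration (table and index names are made up; SQL Server syntax assumed), this is roughly how the two kinds of index are declared:
-- One clustered index per table: the rows themselves are physically ordered by this key.
CREATE CLUSTERED INDEX ix_products_product_id ON products (product_id);
-- Any number of non-clustered indexes: separate sorted structures that point back at the rows.
CREATE NONCLUSTERED INDEX ix_products_category_id ON products (category_id);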
For a table without an index, the data is stored in an unordered structure. When you search for some data, the engine has to "scan" all of the data in the table from start to end.
Case 1: query a table without an index for 1 record:
SQL query plan step: "table scan" of all data, O(N)
Case 2: query a table without an index for many records:
SQL query plan step: "table scan" of all data, O(N)
For a table with an index, the indexed data is kept in a B-tree structure, so when you search for a value on the indexed column the engine can use the B-tree to find the data.
Case 3: query a table on an indexed column for 1 record:
SQL query plan step: "index seek", O(log N)
Case 4: query a table on an indexed column for many records:
two plans are possible; the SQL query optimizer uses the index statistics to work out which step is faster:
(a) SQL query plan step: "index scan", O(N)
(b) SQL query plan step: "index seek", O(R log N) [R = number of records returned]
Related
I am joining two tables, Zasilka and Kapitola. Each one has a clustered index, and Kapitola also has a non-clustered index on the column I am joining on.
The query uses an index seek because it expects only 1 row to be returned.
The statistics on both tables are up to date.
I have tried disabling the index; the query then uses a merge join, but it first has to sort about 40,000 rows, which takes a lot of resources.
The index column is mostly ordered, but there are cases where it is not. I am trying to work out the best strategy for joining these tables while avoiding both the sort and the seek.
I also do not know why it does not use the non-clustered index to join with a merge join.
Execution plan
IO statistics for the seek
IO statistics for the merge join
You are misreading the information in the showplan, I believe. The estimate is per execution of the subtree: it estimates that it will return 1 row per subtree execution and that it will execute the subtree 71,000 times (it never estimates less than one row). Due to the containment assumption, it believes it will find a row when seeking (an assumption the optimizer makes based on typical customer behaviour). In actuality, you get around 46,000 rows back. So the optimizer is working as expected in this case.
In the future, please post the query text, schema, and whole plan shape. It is very hard to do more than guess when you take a screenshot with most of the plan shape covered up.
I'm studying some SQL, and regarding queries it says:
Create an index when the table is large and most queries are expected
to retrieve less than 2% to 4% of the rows in the table.
I just want to get a mental picture of what is meant by the statement. I understand that an index is there to make your query go much faster. Is it because the index will be focused on only that 2% to 4% of the table?
Databases store data in pages, data pages. One way to make queries more efficient is by reducing the number of data pages that need to be read.
A typical database page might be 8k - 64k in size. If the records are really big, there might be just one record per page. If the records are quite small, there might be hundreds or even thousands on each page.
When you have an index and a condition on the column in the where clause, this restricts the number of rows. The proportion is called the "selectivity" of the where clause.
The SQL engine has two ways of satisfying such a where clause. It can read every row and compare the values in each row to the condition; this is called a "full table scan". Or it can look up the matching values in an index; this is usually called an "index seek" (some engines call it an index range scan).
Now, when using an index for a where clause, what we want to do is reduce the number of data pages being read. This happens when we are reading, on average, less than one matching record per page. That is where the 2% - 4% comes from: for example, with 8k pages and rows of roughly 200 bytes, each page holds about 40 rows, so the break-even point is around 1/40 = 2.5% of the rows. Do note that if you have very large records, the number could be much larger.
However, there is a problem with this heuristic. Indexes are used for other purposes:
An index can be used to retrieve records in order, if the index matches the order by clause (and other conditions in the query are true).
An index can be used for joining records.
An index can be used to satisfy a query in its entirety, if the columns in the index are the only ones in the query (in this case, one says that the index "covers" the query).
So, the information you read is a heuristic. It is a useful guideline, but it is definitely not set in stone.
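To illustrate the last point, here is a minimal sketch of a covering index (hypothetical table, SQL Server INCLUDE syntax assumed) that can satisfy a query without touching the data pages at all:
-- Index on the filter column, carrying order_date and amount along in its leaf pages.
CREATE NONCLUSTERED INDEX ix_orders_customer_covering
    ON orders (customer_id)
    INCLUDE (order_date, amount);
-- Every column this query needs lives in the index, so the index "covers" the query
-- regardless of how many rows the filter matches.
SELECT order_date, amount
FROM orders
WHERE customer_id = 42;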
No, if your index is on the columns in your WHERE condition, you will not have to do a table scan. Avoiding a table scan is beneficial when a smaller portion of the rows is being returned.
If you are returning 100% of the rows, there is no major difference between a table scan or an index scan.
I hope this question is not too obvious...I have already found lots of good information on interpreting execution plans but there is one question I haven't found the answer to.
Is the plan (and more specifically the relative CPU cost) based on the schema only, or also the actual data currently in the database?
I am trying to do some analysis of where indexes are needed in my product's database, but I am working on my own test system, which does not have anywhere near the amount of data a product in the field would have. I am seeing some odd things, like the estimated CPU cost actually going slightly UP after adding an index, and am wondering if this is because my data set is so small.
I am using SQL Server 2005 and Management Studio to do the plans
It will be based on both Schema and Data. The Schema tells it what indexes are available, the Data tells it which is better.
The answer can vary in small degrees depending on the DBMS you are using (you have not said which), but they all maintain statistics against indexes to know whether an index will help. If an index splits 1000 rows into 900 distinct values, it is a good index to use. If an index only yields 3 different values for 1000 rows, it is not really selective, so it is not very useful.
SQL Server's optimizer is 100% cost-based. Other RDBMS optimizers are usually a mix of cost-based and rule-based, but SQL Server, for better or worse, is entirely cost driven. A rule-based optimizer would be one that can say, for example, that the order of the tables in the FROM clause determines the driving table in a join. There are no such rules in SQL Server. See SQL Statement Processing:
The SQL Server query optimizer is a cost-based optimizer. Each possible execution plan has an associated cost in terms of the amount of computing resources used. The query optimizer must analyze the possible plans and choose the one with the lowest estimated cost. Some complex SELECT statements have thousands of possible execution plans. In these cases, the query optimizer does not analyze all possible combinations. Instead, it uses complex algorithms to find an execution plan that has a cost reasonably close to the minimum possible cost.
The SQL Server query optimizer does not choose only the execution plan with the lowest resource cost; it chooses the plan that returns results to the user with a reasonable cost in resources and that returns the results the fastest. For example, processing a query in parallel typically uses more resources than processing it serially, but completes the query faster. The SQL Server optimizer will use a parallel execution plan to return results if the load on the server will not be adversely affected.
The query optimizer relies on distribution statistics when it estimates the resource costs of different methods for extracting information from a table or index. Distribution statistics are kept for columns and indexes. They indicate the selectivity of the values in a particular index or column. For example, in a table representing cars, many cars have the same manufacturer, but each car has a unique vehicle identification number (VIN). An index on the VIN is more selective than an index on the manufacturer. If the index statistics are not current, the query optimizer may not make the best choice for the current state of the table. For more information about keeping index statistics current, see Using Statistics to Improve Query Performance.
Both schema and data.
It takes the statistics into account when building a query plan, using them to approximate the number of rows returned by each step in the query (as this can have an effect on the performance of different types of joins, etc).
A good example of this is the fact that it doesn't bother to use indexes on very small tables, as performing a table scan is faster in this situation.
I can't speak for all RDBMS systems, but Postgres specifically uses estimated table sizes as part of its efforts to construct query plans. As an example, if a table has two rows, it may choose a sequential table scan for the portion of the JOIN that uses that table, whereas if it has 10,000+ rows, it may choose to use an index or hash scan (if either of those is available). Incidentally, it used to be possible to trigger poor query plans in Postgres by joining VIEWs instead of actual tables, since there were no estimated sizes for VIEWs.
Part of how Postgres constructs its query plans depend on tunable parameters in its configuration file. More information on how Postgres constructs its query plans can be found on the Postgres website.
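For example (hypothetical table names, assuming PostgreSQL), EXPLAIN ANALYZE shows both the plan that was chosen and how the row estimates compare with reality:
-- Shows the chosen plan (seq scan, index scan, hash join, ...) plus
-- estimated vs. actual row counts for every step.
EXPLAIN ANALYZE
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date > DATE '2024-01-01';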
For SQL Server, there are many factors that contribute to the final execution plan. On a basic level, Statistics play a very large role but they are based on the data but not always all of the data. Statistics are also not always up to date. When creating or rebuilding an Index, the statistics should be based on a FULL / 100% sample of the data. However, the sample rate for automatic statistics refreshing is much lower than 100% so it is possible to sample a range that is in fact not representative of much of the data. Estimated number of rows for the operation also plays a role which can be based on the number of rows in the table or the statistics on a filtered operation. So out-of-date (or incomplete) Statistics can lead the optimizer to choose a less-than-optimal plan just as a few rows in a table can cause it to ignore indexes entirely (which can be more efficient).
As mentioned in another answer, the more unique (i.e. selective) the data is, the more useful the index will be. But keep in mind that the only column guaranteed to have statistics is the leading (or "left-most" or "first") column of the index. SQL Server can, and does, collect statistics for other columns, even some not in any index, but only if the AUTO_CREATE_STATISTICS database option is set (and it is by default).
Also, the existence of Foreign Keys can help the optimizer when those fields are in a query.
But one area not considered in the question is that of the Query itself. A query, slightly changed but still returning the same results, can have a radically different Execution Plan. It is also possible to invalidate the use of an Index by using:
LIKE '%' + field
or wrapping the field in a function, such as:
WHERE DATEADD(DAY, -1, field) < GETDATE()
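As a side note (this rewrite is an illustration, not part of the original answer), such predicates can often be made sargable again by moving the arithmetic off the column so an index on field can still be seeked:
-- Non-sargable: the function wraps the column, so an index on "field" cannot be used for a seek.
WHERE DATEADD(DAY, -1, field) < GETDATE()
-- Equivalent sargable form: shift the date arithmetic to the other side of the comparison.
WHERE field < DATEADD(DAY, 1, GETDATE())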
Now, keep in mind that read operations are (ideally) faster with Indexes but DML operations (INSERT, UPDATE, and DELETE) are slower (taking more CPU and Disk I/O) as the Indexes need to be maintained.
Lastly, the "estimated" CPU, etc. values for cost are not always to be relied upon. A better test is to do:
SET STATISTICS IO ON
run query
SET STATISTICS IO OFF
and focus on "logical reads". If you reduce Logical Reads then you should be improving performance.
You will, in the end, need a set of data that comes somewhat close to what you have in Production in order to performance tune with regards to both Indexes and the Queries themselves.
Oracle specifics:
The stated cost is actually an estimated execution time, but it is given in a somewhat arcane unit of measure that has to do with estimated time for block reads. It's important to realize that the calculated cost doesn't say much about the runtime anyway, unless each and every estimate made by the optimizer was 100% perfect (which is never the case).
The optimizer uses the schema for a lot of things when deciding what transformations/heuristics can be applied to the query. Some examples of schema things that matter a lot when evaluating xplans:
Foreign key constraints (can be used for table elimination)
Partitioning (exclude entire ranges of data)
Unique constraints (index unique vs range scans for example)
Not null constraints (anti-joins are not available with not in() on nullable columns)
Data types (type conversions, specialized date arithmetics)
Materialized views (for rewriting a query against an aggregate)
Dimension Hierarchies (to determine functional dependencies)
Check constraints (the constraint is injected if it lowers cost)
Index types (b-tree(?), bitmap, joined, function based)
Column order in index (a = 1 on {a,b} = range scan, {b,a} = skip scan or FFS)
The core of the estimates comes from using the statistics gathered on actual data (or cooked). Statistics are gathered for tables, columns, indexes, partitions and probably something else too.
The following information is gathered:
Nr of rows in table/partition
Average row/col length (important for costing full scans, hash joins, sorts, temp tables)
Number of nulls in col (is_president = 'Y' is pretty much unique)
Distinct values in col (last_name is not very unique)
Min/max value in col (helps unbounded range conditions like date > x)
...to help estimate the nr of expected rows/bytes returned when filtering data. This information is used to determine what access paths and join mechanisms are available and suitable given the actual values from the SQL query compared to the statistics.
On top of all that, there is also the physical row order, which affects how "good" or attractive an index becomes compared to a full table scan. For indexes this is called the "clustering factor", a measure of how closely the physical row order matches the order of the index entries.
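If you want to see some of these numbers in Oracle (the table name below is just a placeholder), the gathered statistics are exposed through data-dictionary views, for example:
-- Per-column statistics: distinct values, nulls, density, min/max.
SELECT column_name, num_distinct, num_nulls, density, low_value, high_value
FROM user_tab_col_statistics
WHERE table_name = 'ORDERS';
-- Per-index clustering factor: how closely the physical row order matches the index order.
SELECT index_name, clustering_factor, distinct_keys, leaf_blocks
FROM user_indexes
WHERE table_name = 'ORDERS';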
I have created a script to find the selectivity of each column for every table. Some of those tables have fewer than 100 rows, but the selectivity of a column is more than 50%,
where selectivity = distinct values / total number of rows.
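For reference, the kind of calculation described above can be sketched like this (table and column names are placeholders):
-- Selectivity of one column: distinct values divided by total rows.
SELECT COUNT(DISTINCT some_column) * 1.0 / COUNT(*) AS selectivity
FROM some_table;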
So, are those columns eligible for an index?
Or can you tell me the minimum number of rows a table needs before creating an index is worthwhile?
I think I understand what you are trying to accomplish by calculating a 'selectivity' value for your data, but you cannot apply the rule blindly.
In fact, for certain queries the selectivity value might be really low and an index will still be very beneficial. For example:
Assume an 'inbox' table with millions of rows, where each row has a 'Read' boolean field. The number of distinct values over the number of rows is then extremely low. Yet if most items are read most of the time, finding the unread items with an index on this field will be very efficient.
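A minimal sketch of that idea (hypothetical table and columns, SQL Server filtered-index syntax assumed):
-- Index only the rare unread rows; the millions of read rows never enter the index.
CREATE NONCLUSTERED INDEX ix_inbox_unread
    ON inbox (user_id)
    INCLUDE (received_at)
    WHERE is_read = 0;
-- This query can be answered almost entirely from the small filtered index.
SELECT received_at
FROM inbox
WHERE user_id = 42 AND is_read = 0;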
Creating indexes comes at a cost: you get the benefit on reads, but you pay for it on writes and in disk usage.
I would rather recommend you profile your queries and index accordingly. You can also look at the data in sys.dm_db_missing_index_group_stats and the other dynamic management views, which will give you insight into index usage (and missing indexes).
You can create an index on a table with 0 rows, 1 row or 100 million rows. You can create an index where every column has the same value or where the values are all unique.
So yes, you can create an index. The question is really whether you should create an index, and no tool is going to tell you that, because indexes can also span multiple columns and it depends on what queries you run. Creating indexes is something you do when performance-tuning queries, or preemptively when you know you will be writing queries that will use them.
Every index comes with a cost in terms of the space and time required to do updates, inserts and deletes. You don't want to create them spuriously, so you are really going to have to do this by hand, not as the result of a script that measures how unique the values in a column are.
A general rule of thumb says that if you have a very large table (over 1 million rows), you should only use an index if a WHERE clause based on that index selects at most something in the neighborhood of 1-2% of the data.
If you have a "gender" column and roughly 50% of values are "male" and roughly 50% "female", then having an index on that really doesn't give you much - SQL Server and most other RDBMS will most likely still do a full table scan in this case, since on average, they'd have to scan at least half the table anyway, so the "detour" by using an index first and then looking up the actual full data based on that index value is just not worth it.
An index is excellent if you have something like unique keys (customer number), or a value that is quite selective. An index is not without cost - it uses up disk space, it needs to be maintained, and it will slightly slow down all operations besides SELECT - so tread carefully; it's not the best idea to just blindly index everything. Having too few indices is bad - but having too many, and the wrong ones, can be even worse! :-) Nobody ever claimed getting your indices right was easy.... :-)
But there's definitely help out there - the best source I know are Kimberly Tripp's excellent blog posts on SQL Server indexing (and many other topics).
Marc
Can a SELECT query use different indexes if I change the value in a WHERE condition?
The following two queries use different indexes, and the only difference between them is the value in the condition:
and typeenvoi='EXPORT' versus and typeenvoi='MAIL'
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='EXPORT'
and nbessais<1
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='MAIL'
and nbessais<1
Can anyone give me an explanation?
Details on indexes are stored as statistics in a histogram-type dataset in SQL Server.
Each index is chunked into ranges, and each range contains a summary of the key values within that range, things like:
range High value
number of values in the range
number of distinct values in the range (cardinality)
number of values equal to the High value
...and so on.
You can view the statistics on a given index with:
DBCC SHOW_STATISTICS(<tablename>, <indexname>)
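For instance (the table and index names here are made up for illustration; the question does not show the underlying table), that might look like:
-- Shows the statistics header, density vector and histogram for one index.
DBCC SHOW_STATISTICS ('dbo.Envoi', IX_Envoi_typeenvoi);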
Each index has a couple of characteristics like density, and ultimately selectivity, that tell the query optimiser how unique each value in an index is likely to be, and how efficient this index is at quickly locating records.
As your query has three columns in the where clause, it's likely that any of these columns might have an index that could be useful to the optimiser. It's also likely that the primary key index will be considered, in the event of the selectivity of other indexes not being high enough.
Ultimately, it boils down to the optimiser making a quick judgement call on how many page reads will be needed to read each of your non-clustered indexes plus the bookmark lookups and comparisons against the other values, versus doing a table scan.
The statistics that these judgements are based on can vary wildly too; SQL Server, by default, only samples a small percentage of any significant table's rows, so the selectivity of that index might not be representative of the whole. This is particularly problematic where you have highly non-unique keys in the index.
In this specific case, I'm guessing your typeenvoi index is highly non-unique. This being so, the statistics gathered probably indicate to the optimiser that one of the values is rarer than the other, and the likelihood of that index being chosen is increased.
The query optimiser in SQL Server (as in most modern DBMS platforms) uses a methodology known as 'cost based optimisation.' In order to do this it uses statistics about the tables in the database to estimate the amount of I/O needed. The optimiser will consider a number of semantically equivalent query plans that it generates by transforming a basic query plan generated by parsing the statement.
Each plan is evaluated for cost by a heuristic based on the statistics maintained about the tables. The statistics come in various flavours:
Table and index row counts
Distribution histograms of the values in individual columns.
If the occurrence of 'MAIL' vs. 'EXPORT' in the distribution histograms is significantly different, the query optimiser can come up with different optimal plans. This is probably what happened.
Probably has to do with the "cardinality", I believe the word is, of the values in the table. If there are a lot more rows that match that clause, SQL Server may decide that one query will be more efficient using an index for a different column. This is an extreme case, but if there was one row that matched 'MAIL', it would likely use that index. If every other row in the table was 'EXPORT', but only half of those 'EXPORT' rows had an etat of 0, then it would probably use the index on that column.