Performance of SQL query with condition vs. without where clause - sql

Which SQL query will execute in less time, one with a WHERE clause or one without, when:
the WHERE clause deals with an indexed field (e.g. a primary key field)
the WHERE clause deals with a non-indexed field
I suppose that when we're working with indexed fields, the query with WHERE will be faster. Am I right?

As has been mentioned, there is no fixed answer to this. It all depends on the particular context. But for the sake of an answer, take this simple query:
SELECT first_name FROM people WHERE last_name = 'Smith';
To process this query without an index, the last_name column must be checked for every row in the table (a full table scan).
With an index, you could just follow a B-tree data structure until 'Smith' was found.
Without an index the worst case is linear, O(n), whereas with a B-tree it is O(log n), hence computationally less expensive.
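To make the scan-versus-lookup difference visible, here is a minimal sketch using Python's built-in sqlite3 module. The table contents and the index name idx_people_last_name are made up for illustration, and the exact wording of the plan text varies between SQLite versions and will look different in other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
# Hypothetical data: 5000 filler rows plus a single 'Smith'.
conn.executemany(
    "INSERT INTO people (first_name, last_name) VALUES (?, ?)",
    [("Person%d" % i, "Name%d" % (i % 500)) for i in range(5000)] + [("John", "Smith")],
)

query = "SELECT first_name FROM people WHERE last_name = 'Smith'"

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail column as a single string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# No index yet: the engine has to visit every row.
plan_without_index = plan(query)

# With an index on last_name the engine can search the B-tree instead.
conn.execute("CREATE INDEX idx_people_last_name ON people (last_name)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```

The first plan should report a scan of the table and the second a search using the index. The query result is the same either way.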

Not sure what you mean by 'query with WHERE-clause or without', but you're correct that most of the time a query with a WHERE clause on an indexed field will outperform a query whose WHERE clause is on a non-indexed field.
One instance where the performance can be the same (i.e. indexing doesn't matter) is when you run a range-based query in your where clause (e.g. WHERE col1 > x) and the range covers a large share of the table. A B-tree index can serve a range predicate, but if too many rows match, the optimizer will often choose a table scan anyway, and the query will then run at the same speed as a range query on a non-indexed column.
Really, it depends on the columns you reference in the where clause, the types of data in the columns, the types of queries you're running, etc.

It may depend on the type of where clause you are writing. In a simple where clause, it is generally better to have an index on the field you are using (and indexes can and should be built on more than the PK). However, you have to write a sargable where clause for the index to make any difference. See this question for some guidelines on sargability:
What makes a SQL statement sargable?
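As a rough illustration of sargability, the sketch below (Python's sqlite3, with a made-up table t and index idx_t_foo) compares a predicate on the bare column with one that wraps the column in a function. Other engines print their plans differently, but the general pattern is similar.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, foo INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_t_foo ON t (foo)")
# Hypothetical data: foo runs from -50 to 49.
conn.executemany(
    "INSERT INTO t (foo, payload) VALUES (?, ?)",
    [(i - 50, "row %d" % i) for i in range(100)],
)

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail column as a single string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Sargable: the indexed column is compared directly.
sargable_plan = plan("SELECT payload FROM t WHERE foo > 5")

# Not sargable: the column is hidden inside a function call,
# so the plain index on foo cannot be used to narrow the search.
non_sargable_plan = plan("SELECT payload FROM t WHERE abs(foo) > 5")

print(sargable_plan)
print(non_sargable_plan)
```

The first plan should mention idx_t_foo and the second should not.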

There are cases where a where clause on the primary key will be slower.
The simplest is a table with one row. Using the index requires loading both the index and the data page -- two reads. No index cuts the work in half.
That is a degenerate case, but it points to the issue -- the proportion of the rows selected. Or, more accurately, the proportion of pages needed to resolve the query.
When the desired data is on all pages, using an index slows things down. For a non-primary key, this can be disastrous when the table is bigger than the page cache and the accesses are random.
Since pages are ordered by a primary key, the worst case is an additional index scan -- not too bad.
Some databases use statistics on tables to decide when to use an index and when to do a full table scan. Some don't.
In short, for queries that select only a small proportion of the rows, an index will improve performance. For queries that select a large proportion of the rows, using an index can result in marginally worse performance or dire performance, depending on various factors.

Some of my queries are quite complex, and applying a where clause was degrading the performance. As a workaround, I loaded the data into temp tables and then applied the where clause to them. This significantly improved the performance. It also improved the performance where I had joins, especially Left Outer Joins.

Related

SQL : Can WHERE clause increase a SELECT DISTINCT query's speed?

So here's the specific situation: each entry in the database has a unique, indexed primary key, but each row also has a secondID referring to an attribute of the entry, and as such, the secondIDs are not unique. There is also another attribute of these rows, let's call it isTitle, which is NULL by default, but each group of entries with the same secondID has at least one entry with isTitle = 1.
Considering the conditions above, would a WHERE clause increase the processing speed of the query or not? See the following:
SELECT DISTINCT secondID FROM table;
vs.
SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
EDIT:
The first query without the WHERE clause is faster, but could someone explain to me why? Algorithmically the process should be faster with only one 'if' in the cycle, no?
In general, to compare the performance of queries, you use a statement that returns the execution plan of the query it is given (every small step the engine performs to resolve your request).
You don't mention your database engine (e.g. PostgreSQL, SQL Server, MySQL), but in PostgreSQL, for example, the statement is the following:
EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
Going back to your question: since isTitle is not indexed, I think the first thing the engine will do is a full scan of the table to check that attribute, and then perform the SELECT. Hence, in my opinion, the first query:
SELECT DISTINCT secondID FROM table;
will be faster.
If you want to optimize it, you can create an index on the isTitle column. In that scenario, the query with the WHERE clause will become faster.
This is a very hard question to answer, particularly without specifying the database. Here are three important considerations:
Will the database engine use the index on secondID for select distinct? Any decent database optimizer should, but that doesn't mean that all do.
How wide is the table relative to the index? That is, is scanning the index really that much faster than scanning the table?
What is the ratio of isTitle = 1 to all rows with the same value of secondID?
For the first query, there are essentially two ways to process the query:
(1) Scan the index, taking each unique value as it comes.
(2) Scan the table, sort or hash the table, and choose the unique values.
If it is not obvious, (1) is much faster than (2), except perhaps in trivial cases where there are a small number of rows.
For the second query, the only real option is:
Scan the table, filter out the non-matching values, sort or hash the table, and choose the unique values.
The key issues here are how much data needs to be scanned and how much is filtered out. It is even possible -- if you had, say, zillions of rows per secondID, no additional columns, and a small number of values -- that this might be comparable to or slightly faster than (1) above. There is a little overhead for scanning an index, and sorting a small amount of data is often quite fast.
And, this method is almost certainly faster than (2).
As mentioned in the comments, you should test the queries on your system with your data (use a reasonable amount of data!). Or, update the table statistics and learn to read execution plans.
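If you want to try this without a full database server, here is a small sketch in Python's sqlite3. The table is called entries here because table is a reserved word, and the data and index name are invented to match the question's description. SQLite's planner is not the asker's engine, so treat the plans as an illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, secondID INTEGER, isTitle INTEGER)"
)
conn.execute("CREATE INDEX idx_entries_second ON entries (secondID)")

# 50 groups of 20 rows; exactly one row per group has isTitle = 1, the rest are NULL.
rows = []
for second_id in range(50):
    for n in range(20):
        rows.append((second_id, 1 if n == 0 else None))
conn.executemany("INSERT INTO entries (secondID, isTitle) VALUES (?, ?)", rows)

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail column as a single string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q_all = "SELECT DISTINCT secondID FROM entries"
q_filtered = "SELECT DISTINCT secondID FROM entries WHERE isTitle = 1"

# The unfiltered query can be answered from the secondID index alone.
plan_all = plan(q_all)
# The filtered query needs isTitle, which is not in that index,
# so the table rows themselves have to be read.
plan_filtered = plan(q_filtered)

print(plan_all)
print(plan_filtered)
```

Both queries return the same 50 values here, but only the first can be resolved from the index alone, which is one plausible reason for it being faster.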

SQL Indexing: None, Single Column, and Multiple Columns

How does indexing work in SQL and what benefits does it provide? What reason would there be for not indexing? And what is the difference between indexing a single column vs. indexing multiple columns?
How does indexing work in SQL and what benefits does it provide?
When you index columns you express your intent to query the indexed columns in conditional expressions, such as equality or range queries. With this information the storage engine can build a structure that makes such queries faster, often arranging the indexed values in a tree structure. B-trees are the most common, but many other structures exist, such as hash indices, R-tree indices for spatial data, etc. Each structure is specialized for a certain type of lookup. For instance, hash indices are very fast for equality conditions, such as:
SELECT * FROM example_table WHERE type = 'example';
SELECT * FROM example_table WHERE id = X;
B-trees are also fairly quick for equality lookups, but their main strength is that they support range queries:
SELECT * FROM example_table WHERE id > 5 AND id < 10;
SELECT * FROM example_table WHERE type = 'example' AND value > 25;
It is VERY important, however, when you build B-tree indices to understand that the tree is ordered in a "left-to-right" manner. I.e., if you build a B-tree index (let's call it A) on {type, value}, then you NEED to have a condition on the type column in order for the query to be able to utilize the index. The example index can NOT be used in a query where the condition depends solely on value.
Furthermore, if you mix equality and a range condition, make sure that the equality columns are listed first in the index, otherwise the index can only be partially used.
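The left-to-right rule can be checked with a short sketch in Python's sqlite3. The index name idx_type_value and the extra payload column are assumptions added for the demonstration, and some engines have skip-scan tricks that relax the rule in special cases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_table ("
    "id INTEGER PRIMARY KEY, type TEXT, value INTEGER, payload TEXT)"
)
# Composite B-tree index: ordered by type first, then by value within each type.
conn.execute("CREATE INDEX idx_type_value ON example_table (type, value)")

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail column as a single string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Equality on the leading column plus a range on the second: the index applies.
leading_plan = plan(
    "SELECT payload FROM example_table WHERE type = 'example' AND value > 25"
)

# Condition only on the second column: the index ordering does not help.
trailing_only_plan = plan("SELECT payload FROM example_table WHERE value > 25")

print(leading_plan)
print(trailing_only_plan)
```

Only the first plan should mention idx_type_value. The second falls back to scanning the table.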
What reason would there be for not indexing?
If the selectivity of the index is low, then you might not gain much over a table scan. Say, for instance, that you have an index on a field called gender. The selectivity of that index will be low, since a lookup on that index will return about half the rows of the original table. You can read a pretty simple explanation of selectivity, and the reasoning behind it, here: http://mattfleming.com/node/192
Also, maintaining an index has a cost. For each data manipulation the index might need restructuring. So it can be desirable to keep the number of indices to the minimum required for the queries against that table to perform well.
What is the difference between indexing a single column vs. indexing multiple columns?
Once again, it depends on the type of queries you issue. Indexing a single column gender might not be a good idea, since the selectivity is low. When the selectivity is high, such an index makes much more sense. For instance, an index on the primary key is a very good index, since the selectivity is high (actually, it is as high as it gets: each key in the index corresponds to exactly one record), and indices on columns with unique or highly varied values (such as slugs, password hashes and what not) are also good single-column indices.
There is also the concept of covering indices. Basically, each leaf in an index contains a pointer into the table where the row is stored (unless the index is a clustered index, in which case the leaf is the record). So for each index hit, the query engine has to fetch the corresponding table row, increasing the number of I/O operations. Since I/O is extremely slow, you want to keep this to a minimum. Now, let's say that you often need to query for something and also fetch some additional columns. Then you can create a covering index, trading storage space for query performance. Example: Let's find the name and email of all users who have joined in the last 6 months (assuming MySQL):
With index on {joined_at}:
SELECT first_name, last_name, email
FROM users
WHERE joined_at > NOW() - INTERVAL 6 MONTH;
Query explanation:
id  select_type  table  type  possible_keys  key   key_len  ref   rows  Extra
1   SIMPLE       users  ALL   test           NULL  NULL     NULL  873   Using where
As you can see in the type-column, the query engine resorted to a full table scan, since the index selectivity was too low to be worthwhile using in this query (too many results would be returned, and thus followed into the table, costing too much in I/O)
With index on {joined_at, first_name, last_name, email}:
id  select_type  table  type   possible_keys  key    key_len  ref   rows  Extra
1   SIMPLE       users  range  test,test2     test2  8        NULL  514   Using where; Using index
Now, since all the information that is necessary to complete the query is available in the index, the query engine evaluates that it is much better to use the index (with 514 rows) instead of doing a full table scan.
So as you can see, by using covering indices we can speed up queries for partial selects of a table, even if the selectivity of the index is quite low.
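For anyone who wants to reproduce the idea without MySQL, here is a sketch using Python's sqlite3. The users table layout, the index names and the literal cut-off date (standing in for "six months ago") are all assumptions. SQLite also decides differently from MySQL about when to give up on an index, so only the covering/non-covering distinction carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, "
    "email TEXT, bio TEXT, joined_at TEXT)"
)

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail column as a single string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# The literal date is a placeholder for "now minus six months".
query = (
    "SELECT first_name, last_name, email FROM users "
    "WHERE joined_at > '2024-01-01'"
)

# Index on joined_at only: every hit still needs a trip to the table row.
conn.execute("CREATE INDEX idx_joined ON users (joined_at)")
plan_narrow = plan(query)

# Index that also carries the selected columns: the table is never touched.
conn.execute("DROP INDEX idx_joined")
conn.execute(
    "CREATE INDEX idx_joined_covering "
    "ON users (joined_at, first_name, last_name, email)"
)
plan_covering = plan(query)

print(plan_narrow)
print(plan_covering)
```

The second plan should report a covering index, which plays the same role as "Using index" in the MySQL output above.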
How does indexing work in SQL
That's a pretty open question, but basically databases store a structure that enables faster lookup of information. That structure depends on the implementation, but it's typically a type of tree.
what benefits does it provide?
Queries that are SARGable can be significantly faster.*
What reason would there be for not indexing?
Some data modification queries can take longer and there is a storage cost to indexes, but generally speaking, both these considerations are negligible.
And what is the difference between indexing a single column vs. indexing multiple columns?
There isn't much difference, but sometimes people create covering indexes** that index multiple columns to increase the performance of a specific query.
*SARGable is from Search ARGument ABLE. Basically if you do WHERE FOO > 5 it can be faster if FOO is indexed. On the other hand WHERE h(FOO) > 5 probably won't benefit from an index.
** If all the fields used in the SELECT, JOIN and WHERE of a statement are also in an index, a database can retrieve all the information it needs without going back to the base table. This is called a covering index. If all the fields were in separate indexes, it would only use the ones for the joins and where, and then go back to the base table for the columns in the select.

Do queries make use of more than one index at a time?

If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query? Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
If the cost-based query optimizer determines that it's more efficient to use more than one index, yes, it will. If it's more efficient to do a scan (and often it is), then it may not use an index, even if you think it should.
Additionally, if I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Again, if the optimizer thinks it's efficient to do so, yes, it'll use that other index for the same query. If it determines the cost is higher with the index... it'll ignore it. Whether it uses the index depends on how selective the index is (or rather, how selective the optimizer thinks it is, based on the latest statistics). If it's not selective (won't narrow down the results much), it'll likely ignore it.
It depends on the optimizer and the query, but optimizers relatively seldom use two separate indexes on a single table in a single query. It is perfectly feasible to construct examples where they could, possibly even should - and some may actually do so. Consider:
A UNION query where the separate terms have filters on different columns (but a table scan may be as effective)
A self-join where the separate sides of the self-join have the different filters.
However, be wary of accusing the optimizer of not being efficient - there may still be advantages to resolving the query by other methods.
To answer your 'index on 4 columns' question: it is rather unlikely. In this scenario, it is likely that the 4-column index provides good selectivity and the query is most easily resolved by applying the extra filter condition to the rows retrieved by the index scan. (Note that the answer might be different depending on whether the extra condition is connected to the others by AND (as I assumed) or OR (where using the second index might be useful).)
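The AND versus OR distinction can be seen in a small sketch with Python's sqlite3. The table t and the index names idx_t_a and idx_t_b are made up. SQLite is only one optimizer among many, and some engines can also combine indexes for AND conditions, so this shows one engine's behaviour rather than a general rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, payload TEXT)")
# Two separate single-column indexes on the same table.
conn.execute("CREATE INDEX idx_t_a ON t (a)")
conn.execute("CREATE INDEX idx_t_b ON t (b)")

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail column as a single string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# OR: each branch can be resolved with its own index and the results merged.
plan_or = plan("SELECT payload FROM t WHERE a = 1 OR b = 2")

# AND: one index narrows the rows, the other condition is checked as a filter.
plan_and = plan("SELECT payload FROM t WHERE a = 1 AND b = 2")

print(plan_or)
print(plan_and)
```

In SQLite the OR query should mention both indexes, while the AND query should mention only one of them.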
It depends upon the queries emitted against those tables, the size of the tables and the selectivity of the data in the columns indexed.
The optimizer uses statistics to determine whether using an index will be beneficial.
1. If I have a table with an index each on a different column, does the database ever make use of both indexes when executing a query?
It certainly can, for example if you have the table
EMPLOYEE(
id (index1)
name
address
date (index2) )
and the table
TASKS(
id
employee_id (index3)
date (index 4)
category
description)
If you do the query:
select
TASKS.employee_id, TASKS.date, TASKS.category, TASKS.description
from EMPLOYEE, TASKS where
EMPLOYEE.id = TASKS.employee_id and
EMPLOYEE.date = TASKS.date
This will list all the tasks of each employee on each day, and it can use index1 and index2 along with index3 and index4. It would take much more time if either index1 or index2 were missing.
2. If I have an index on 4 columns, and an additional index on one other column, could a query against all 5 columns make use of this 2nd index, or would it just be a region scan after matching the first index?
Of course it can be done, but the query should include joins on both the 4 column index and also the single column index.

effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only very limited effect on the speed of the query but would have a potentially larger effect on the transfer speed of the data. The less data you select, the less data that would need to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
ID int NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
Name varchar(50) NOT NULL,
Status tinyint NOT NULL
)
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
WITH (DROP_EXISTING = ON)
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing. More columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections to somehow save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced so that should be fairly unusual, but it's still possible (especially if, for example, if one of the servers handles some other data whose usage might vary independently from the rest).
Yes, if your query can be covered by a nonclustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched.
To demonstrate what tvanfosson has already written, that there is a "transfer" cost I ran the following two statements on a MSSQL 2000 DB from query analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also because the fields are the same I would not expect indexing to factor here.

Same query uses different indexes?

Can a select query use different indexes if I change the value of a where condition?
The two following queries use different indexes, and the only difference is the value in the condition: typeenvoi='EXPORT' versus typeenvoi='MAIL'.
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='EXPORT'
and nbessais<1
select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
from v_envoiautomate
where etat=0 and typeenvoi='MAIL'
and nbessais<1
Can anyone give me an explanation?
Details on indexes are stored as statistics in a histogram-type dataset in SQL Server.
Each index is chunked into ranges, and each range contains a summary of the key values within that range, things like:
range High value
number of values in the range
number of distinct values in the range (cardinality)
number of values equal to the High value
...and so on.
You can view the statistics on a given index with:
DBCC SHOW_STATISTICS(<tablename>, <indexname>)
Each index has a couple of characteristics like density, and ultimately selectivity, that tell the query optimiser how unique each value in an index is likely to be, and how efficient this index is at quickly locating records.
As your query has three columns in the where clause, it's likely that any of these columns might have an index that could be useful to the optimiser. It's also likely that the primary key index will be considered, in the event of the selectivity of other indexes not being high enough.
Ultimately, it boils down to the optimiser making a quick judgement call on how many page reads will be necessary to read each of your non-clustered indexes plus bookmark lookups, compared with each other and with doing a table scan.
The statistics that these judgements are based on can vary wildly too; SQL Server, by default, only samples a small percentage of any significant table's rows, so the selectivity of that index might not be representative of the whole. This is particularly problematic where you have highly non-unique keys in the index.
In this specific case, I'm guessing your typeenvoi index is highly non-unique. This being so, the statistics gathered probably indicate to the optimiser that one of the values is rarer than the other, and the likelihood of that index being chosen is increased.
The query optimiser in SQL Server (as in most modern DBMS platforms) uses a methodology known as 'cost based optimisation.' In order to do this it uses statistics about the tables in the database to estimate the amount of I/O needed. The optimiser will consider a number of semantically equivalent query plans that it generates by transforming a basic query plan generated by parsing the statement.
Each plan is evaluated for cost by a heuristic based on the statistics maintained about the tables. The statistics come in various flavours:
Table and index row counts
Distribution histograms of the values in individual columns.
If the occurrence of 'MAIL' vs. 'EXPORT' in the distribution histograms is significantly different, the query optimiser can come up with different optimal plans. This is probably what happened.
Probably has to do with the "cardinality", I believe the word is, of the values in the table. If there are a lot more rows that match that clause, SQL Server may decide that one query will be more efficient using an index for a different column. This is an extreme case, but if there was one row that matched 'MAIL', it would likely use that index. If every other row in the table was 'EXPORT', but only half of those 'EXPORT' rows had an etat of 0, then it would probably use the index on that column.
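To make that reasoning concrete, here is a toy model in plain Python. It is not how SQL Server's optimiser actually works. It only picks the column whose equality predicate is estimated to match the fewest rows, using made-up value counts as the "statistics".

```python
from collections import Counter

# Hypothetical table contents as (etat, typeenvoi) pairs:
# 'EXPORT' is common, 'MAIL' is rare.
rows = (
    [(0, "EXPORT")] * 450
    + [(1, "EXPORT")] * 450
    + [(0, "MAIL")] * 5
    + [(1, "MAIL")] * 95
)

# Toy "statistics": how many rows carry each value, per column.
stats = {
    "etat": Counter(etat for etat, _ in rows),
    "typeenvoi": Counter(kind for _, kind in rows),
}

def choose_index(predicates):
    """Pick the column whose equality predicate is estimated to match the fewest rows."""
    estimates = {col: stats[col][value] for col, value in predicates.items()}
    best = min(estimates, key=estimates.get)
    return best, estimates

choice_export, est_export = choose_index({"etat": 0, "typeenvoi": "EXPORT"})
choice_mail, est_mail = choose_index({"etat": 0, "typeenvoi": "MAIL"})

print(choice_export, est_export)
print(choice_mail, est_mail)
```

With these numbers the 'EXPORT' query prefers the etat index (455 estimated rows against 900), while the 'MAIL' query prefers the typeenvoi index (100 against 455). The query text is the same in both cases, but the choice differs.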