SQL: Can a WHERE clause increase a SELECT DISTINCT query's speed?

So here's the specific situation: each entry in the database has a unique, indexed primary key, but each row also has a secondID referring to an attribute of the entry, and as such the secondIDs are not unique. There is another attribute of these rows, let's call it isTitle, which is NULL by default, but each group of entries with the same secondID has at least one entry with isTitle set to 1.
Considering the conditions above, would a WHERE clause increase the processing speed of the query or not? See the following:
SELECT DISTINCT secondID FROM table;
vs.
SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
EDIT:
The first query without the WHERE clause is faster, but could someone explain to me why? Algorithmically, shouldn't the process be faster with only one extra 'if' per iteration?

In general, to benchmark query performance, you use a statement that shows you the execution plan of the query it receives as input (every small step the engine performs to resolve your request).
You are not mentioning your database engine (e.g. PostgreSQL, SQL Server, MySQL), but in PostgreSQL, for example, the statement is the following:
EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
Going back to your question: since isTitle is not indexed, I think the first thing the engine will do is a full scan of the table to check that attribute, and only then perform the SELECT. Hence, in my opinion, the first query:
SELECT DISTINCT secondID FROM table;
will be faster.
If you want to optimize it, you can create an index on the isTitle column. In that scenario, the query with the WHERE clause will become faster.
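A minimal sketch of such an index, using the placeholder names from the question (the index names are my own, and in a real database you would substitute your actual table name):
CREATE INDEX idx_istitle ON table (isTitle);
-- a composite index can additionally serve the DISTINCT on secondID straight from the index:
CREATE INDEX idx_istitle_secondid ON table (isTitle, secondID);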

This is a very hard question to answer, particularly without specifying the database. Here are three important considerations:
Will the database engine use the index on secondID for select distinct? Any decent database optimizer should, but that doesn't mean that all do.
How wide is the table relative to the index? That is, is scanning the index really that much faster than scanning the table?
What is the ratio of rows with isTitle = 1 to all rows with the same value of secondID?
For the first query, there are essentially two ways to process the query:
Scan the index, taking each unique value as it comes.
Scan the table, sort or hash the table, and choose the unique values.
If it is not obvious, (1) is much faster than (2), except perhaps in trivial cases where there are a small number of rows.
For the second query, the only real option is:
Scan the table, filter out the non-matching values, sort or hash the table, and choose the unique values.
The key issues here are how much data needs to be scanned and how much is filtered out. It is even possible -- if you had, say, zillions of rows per secondID, no additional columns, and a small number of distinct values -- that this might be comparable to or slightly faster than (1) above. There is a little overhead in scanning an index, and sorting a small amount of data is often quite fast.
And, this method is almost certainly faster than (2).
As mentioned in the comments, you should test the queries on your system with your data (use a reasonable amount of data!). Or, update the table statistics and learn to read execution plans.
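For example, in PostgreSQL, a sketch using the placeholder names from the question (ANALYZE refreshes the table statistics; EXPLAIN ANALYZE runs the query and prints the actual plan with timings):
ANALYZE table;
EXPLAIN ANALYZE SELECT DISTINCT secondID FROM table;
EXPLAIN ANALYZE SELECT DISTINCT secondID FROM table WHERE isTitle = 1;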

Related

Performance of SQL query with condition vs. without where clause

Which SQL query will execute in less time, a query with a WHERE clause or one without, when:
WHERE-clause deals with indexed field (e.g. primary key field)
WHERE-clause deals with non-indexed field
I assume that when we're working with indexed fields, the query with WHERE will be faster. Am I right?
As has been mentioned there is no fixed answer to this. It all depends on the particular context. But just for the sake of an answer. Take this simple query:
SELECT first_name FROM people WHERE last_name = 'Smith';
To process this query without an index, the last_name column must be checked for every row in the table (a full table scan).
With an index, you could just follow a B-tree data structure until 'Smith' was found.
Without an index, the worst case is linear, O(n), whereas with a B-tree it is O(log n), hence computationally less expensive.
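A minimal sketch of creating such an index (the index name is illustrative):
CREATE INDEX idx_people_last_name ON people (last_name);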
Not sure what you mean by 'query with WHERE-clause or without', but you're correct that most of the time a query with a WHERE clause on an indexed field will outperform a query whose WHERE clause is on a non-indexed field.
One instance where the performance will be the same (i.e. indexing doesn't matter) is when you run a range-based query in your WHERE clause (i.e. WHERE col1 > x). This forces a scan of the table, and thus will be the same speed as a range query on a non-indexed column.
Really, it depends on the columns you reference in the where clause, the types of data in the columns, the types of queries your running, etc.
It may depend on the type of WHERE clause you are writing. In a simple WHERE clause, it is generally better to have an index on the field you are using (and indexes can and should be built on more than the PK). However, you have to write a sargable WHERE clause for the index to make any difference. See this question for some guidelines on sargability:
What makes a SQL statement sargable?
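A generic sketch of the difference (the orders table and order_date column are hypothetical):
-- not sargable: the function wrapped around the column blocks index use
SELECT * FROM orders WHERE YEAR(order_date) = 2015;
-- sargable: the bare column lets an index on order_date be used
SELECT * FROM orders WHERE order_date >= '2015-01-01' AND order_date < '2016-01-01';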
There are cases where a where clause on the primary key will be slower.
The simplest is a table with one row. Using the index requires loading both the index and the data page -- two reads. No index cuts the work in half.
That is a degenerate case, but it points to the issue -- the proportion of the rows selected. Or, more accurately, the proportion of pages needed to resolve the query.
When the desired data is on all pages, using an index slows things down. For a non-primary key, this can be disastrous when the table is bigger than the page cache and the accesses are random.
Since pages are ordered by a primary key, the worst case is an additional index scan -- not too bad.
Some databases use statistics on tables to decide when to use an index and when to do a full table scan. Some don't.
In short, for queries that select a small proportion of the rows, an index will improve performance. For queries that select a large proportion, using an index can result in marginally worse or even dire performance, depending on various factors.
Some of my queries are quite complex, and applying a WHERE clause was degrading performance. As a workaround, I used temp tables and then applied the WHERE clause to them. This significantly improved performance, especially where I had joins, in particular LEFT OUTER JOINs.
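A rough sketch of that workaround (the table and column names are made up for illustration; SELECT ... INTO #temp is SQL Server syntax):
-- materialize the expensive part once, without the WHERE clause
SELECT o.order_id, o.total, c.region
INTO #order_summary
FROM orders o
LEFT OUTER JOIN customers c ON c.customer_id = o.customer_id;
-- then filter the much smaller temp table
SELECT order_id, total FROM #order_summary WHERE region = 'EU';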

Effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only very limited effect on the speed of the query but would have a potentially larger effect on the transfer speed of the data. The less data you select, the less data that would need to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
    ID int NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
    Name varchar(50) NOT NULL,
    Status tinyint NOT NULL
);
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing. More columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections to somehow save a little I/O.
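A small sketch of that point (the table and column names are hypothetical): the ORDER BY below uses a column that never appears in the result set, so the engine must read it even though it is never projected:
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY MAX(order_date);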
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced so that should be fairly unusual, but it's still possible (especially if, for example, one of the servers handles some other data whose usage might vary independently from the rest).
Yes, if your query can be covered by a non-clustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched by the optimizer.
To demonstrate what tvanfosson has already written, that there is a "transfer" cost, I ran the following two statements against an MSSQL 2000 DB from Query Analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also, because the fields are the same, I would not expect indexing to be a factor here.

Why do SQL statements take so long when "limited"?

consider the following pgSQL statement:
SELECT DISTINCT some_field
FROM some_table
WHERE some_field LIKE 'text%'
LIMIT 10;
Consider also, that some_table consists of several million records, and that some_field has a b-tree index.
Why does the query take so long to execute (several minutes)? What I mean is, why doesn't it loop through, building the result set, and return it once it has 10 rows? It looks like the execution time is the same regardless of whether you include the 'LIMIT 10' or not.
Is this correct or am I missing something? Is there anything I can do to get it to return the first 10 results and 'screw' the rest?
UPDATE: If you drop the DISTINCT, the results are returned virtually instantaneously. I do know, however, that many of the some_table records are fairly unique already, and certainly when I run the query without the DISTINCT declaration, the first 10 results are in fact unique. I also eliminated the WHERE clause (eliminating it as a factor). So, my original question still remains: why isn't it terminating as soon as 10 matches are found?
You have a DISTINCT. This means that to find 10 distinct rows, it's necessary to scan all rows that match the predicate until 10 different some_fields are found.
Depending on your indices, the query optimizer may decide that scanning all rows is the best way to do this.
10 distinct rows could represent 10, a million, or an infinity of non-distinct rows.
Can you post the results of running EXPLAIN on the query? This will reveal what Postgres is doing to execute the query, and is generally the first step in diagnosing query performance problems.
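For example, using the query from the question (EXPLAIN ANALYZE actually executes the statement and reports real row counts and timings):
EXPLAIN ANALYZE
SELECT DISTINCT some_field
FROM some_table
WHERE some_field LIKE 'text%'
LIMIT 10;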
It may be sorting or constructing a hash table of the entire rowset to eliminate the non-distinct records before returning the first row to the LIMIT operator. It makes sense that the engine should be able to read a fraction of the records, returning one new distinct value at a time until the LIMIT clause has satisfied its quota of 10, but there may not be an operator implemented to make that work.
Is some_field unique? If not, it would be useless in locating distinct records. If it is, then the DISTINCT clause would be unnecessary, since the index already guarantees that each row is unique on some_field.
Any time there's an operation that involves aggregation, and DISTINCT certainly qualifies, the optimizer is going to do the aggregation before even thinking about what's next. And aggregation means scanning the whole table (in your case involving a sort, unless there's an index).
But the most likely deal-breaker is that you are grouping on an operation on a column, rather than a plain column value. The optimizer generally disregards a number of possible operations once you are operating on a column transformation of some kind. It's quite possibly not smart enough to know that the ordering of "LIKE 'text%'" and "= 'text'" is the same for grouping purposes.
And remember, you're doing an aggregation on an operation on a column.
How big is the table? Do you have any indexes on the table? Check your query execution plan to determine whether it's doing a table scan, an index scan, or an index seek. If it's doing a table scan then you most likely don't have any indexes.
Try putting an index on the field you're filtering by and/or the field you're selecting.
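For instance, a sketch in PostgreSQL (the index name is illustrative; assuming some_field is a text column, the text_pattern_ops operator class lets a b-tree index serve prefix LIKE 'text%' searches even in non-C locales):
CREATE INDEX idx_some_field_pattern ON some_table (some_field text_pattern_ops);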
I'm suspicious it's because you don't have an ORDER BY. Without ordering, you might have to cruise a whole lot of records to get 10 that meet your criterion.

Does limiting a query to one record improve performance

Will limiting a query to one result record improve performance in a large(ish) MySQL table if the table only has one matching result?
for example
select * from people where name = "Re0sless" limit 1
if there is only one record with that name? And what about if name was the primary key / set to unique? And is it worth updating the query, or will the gain be minimal?
If the column has
- a unique index: no, it's no faster
- a non-unique index: maybe, because it will prevent sending any additional rows beyond the first matched, if any exist
- no index: sometimes
  - if 1 or more rows match the query, yes, because the full table scan will be halted after the first row is matched
  - if no rows match the query, no, because it will need to complete a full table scan
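For reference, a sketch of adding a unique index on name in MySQL (the index name is my own):
ALTER TABLE people ADD UNIQUE INDEX idx_people_name (name);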
If you have a slightly more complicated query, with one or more joins, the LIMIT clause gives the optimizer extra information. If it expects to match two tables and return all rows, a hash join is typically optimal. A hash join is a type of join optimized for matching large amounts of data.
Now if the optimizer knows you've passed LIMIT 1, it knows that it won't be processing large amounts of data. It can revert to a loop join.
Based on the database (and even database version) this can have a huge impact on performance.
To answer your questions in order:
1) Yes, if there is no index on name. The query will end as soon as it finds the first record. Take off the limit and it has to do a full table scan every time.
2) No. Primary/unique keys are guaranteed to be unique. The query should stop running as soon as it finds the row.
I believe the LIMIT is something done after the data set is found and the result set is being built up so I wouldn't expect it to make any difference at all. Making name the primary key will have a significant positive effect though as it will result in an index being made for the column.
If "name" is unique in the table, then there may still be a (very very minimal) gain in performance by putting the limit constraint on your query. If name is the primary key, there will likely be none.
Yes, you will notice a performance difference when dealing with the data. One record takes up less space than multiple records. Unless you are dealing with many rows, this would not be much of a difference, but once you run the query, the data has to be displayed back to you (which is costly) or dealt with programmatically. Either way, one record is easier to handle than multiple.

Do indexes work with "IN" clause

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
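A sketch of that rewrite (the EmployeeType table and its columns are hypothetical; if the subquery could return duplicate values, the join form may need a DISTINCT):
-- subquery in the IN clause
SELECT EmployeeId FROM Employee
WHERE EmployeeTypeId IN (SELECT EmployeeTypeId FROM EmployeeType WHERE IsActive = 1);
-- restated as a join
SELECT e.EmployeeId
FROM Employee e
JOIN EmployeeType t ON t.EmployeeTypeId = e.EmployeeTypeId
WHERE t.IsActive = 1;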
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try to work out the best way to deal with it?
Whether an index is used doesn't vary so much with the type of query as with the type and distribution of data in the table(s), how up to date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, though this varies between DBMSs).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds));
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee USE INDEX (EmployeeTypeId) WHERE EmployeeTypeId IN (1,2,3);
This query (using MySQL's USE INDEX hint syntax) will search using the index you have created. It works for me. Please give it a try.