Why do SQL statements take so long when "limited"? - sql

consider the following pgSQL statement:
SELECT DISTINCT some_field
FROM some_table
WHERE some_field LIKE 'text%'
LIMIT 10;
Consider also, that some_table consists of several million records, and that some_field has a b-tree index.
Why does the query take so long to execute (several minutes)? What I mean is, why doesnt it loop through creating the result set, and once it gets 10 of them, return the result? It looks like the execution time is the same, regardless of whether or not you include a 'LIMIT 10' or not.
Is this correct or am I missing something? Is there anything I can do to get it to return the first 10 results and 'screw' the rest?
UPDATE: If you drop the distinct, the results are returned virtually instantaneously. I do know however, that many of the some_table records are fairly unique already, and certianly when I run the query without he distinct declaration, the first 10 results are in fact unique. I also eliminated the where clause (eliminating it as a factor). So, my original question still remains, why isnt it terminating as soon as 10 matches are found?

You have a DISTINCT. This means that to find 10 distinct rows, it's necessary to scan all rows that match the predicate until 10 different some_fields are found.
Depending on your indices, the query optimizer may decide that scanning all rows is the best way to do this.
10 distinct rows could represent 10, a million, an infinity of non-distinct rows.

Can you post the results of running EXPLAIN on the query? This will reveal what Postgres is doing to execute the query, and is generally the first step in diagnosing query performance problems.
It may be sorting or constructing a hash table of the entire rowset to eliminate the non-distinct records before returning the first row to the LIMIT operator. It makes sense that the engine should be able to read a fraction of the records, returning one new distinct at a time until the LIMIT clause has satisfied its 10 quota, but there may not be an operator implemented to make that work.
Is the some_field unique? If not, it would be useless in locating distinct records. If it is, then the DISTINCT clause would be unnecessary, since that index already guarantees that each row is unique on some_field.

Any time there's an operation that involves aggregation, and "DISTINCT" certainly qualifies, the optimizer is going to do the aggration before even thinking about what's next. And aggration means scanning the whole table (in your case involving a sort, unless there's an index).
But the most likely deal-breaker is that you are grouping on an operation on a column, rather than a plain column value. The optimizer generally disregards a number of possible operations once you are operating on a column transformation of some kind. It's quite possibly not smart enough to know that the ordering of "LIKE 'text%'" and "= 'text'" is the same for grouping purposes.
And remember, you're doing an aggregation on an operation on a column.

how big is the table? do you have any indexes on the table? check your query execution plan to determine if it's doing a table scan, an index scan, or an index seek. if it's doing a table scan then you most likely dont have any indexes.
try putting an index on the field your filtering by and/or the field your selecting.

I'm suspicious it's because you don't have an ORDER BY. Without ordering, you might have to cruise a whole lot of records to get 10 that meet your criterion.

Related

SQL : Can WHERE clause increase a SELECT DISTINCT query's speed?

So here's the specific situation: I have primary unique indexed keys set to each entry in the database, but each row has a secondID referring to an attribute of the entry, and as such, the secondIDs are not unique. There is also another attribute of these rows, let's call it isTitle, which is NULL by default, but each group of entries with the same secondID have at least one entry with 1 isTitle value.
Considering the conditions above, would a WHERE clause increase the processing speed of the query or not? See the following:
SELECT DISTINCT secondID FROM table;
vs.
SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
EDIT:
The first query without the WHERE clause is faster, but could someone explain me why? Algorithmically the process should be faster with having only one 'if' in the cycle, no?
In general, to benchmark performances of queries, you usually use queries that gives you the execution plan the query they receive in input (Every small step that the engine is performing to solve your request).
You are not mentioning your database engine (e.g. PostgreSQL, SQL Server, MySQL), but for example in PostgreSQL the query is the following:
EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
Going back to your question, since the isTitle is not indexed, I think the first action the engine will do is a full scan of the table to check that attribute and then perform the SELECT hence, in my opinion, the first query:
SELECT DISTINCT secondID FROM table;
will be faster.
If you want to optimize it, you can create an index on isTitle column. In such scenario, the query with the WHERE clause will become faster.
This is a very hard question to answer, particularly without specifying the database. Here are three important considerations:
Will the database engine use the index on secondID for select distinct? Any decent database optimizer should, but that doesn't mean that all do.
How wide is the table relative to the index? That is, is scanning the index really that much faster than scanning the table?
What is the ratio of isTitle = 1 to all rows with the same value of secondId?
For the first query, there are essentially two ways to process the query:
Scan the index, taking each unique value as it comes.
Scan the table, sort or hash the table, and choose the unique values.
If it is not obvious, (1) is much faster than (2), except perhaps in trivial cases where there are a small number of rows.
For the second query, the only real option is:
Scan the table, filter out the non-matching values, sort or hash the table, and choose the unique values.
The key issues here are how much data needs to be scanned and how much is filtered out. It is even possible -- if you had, say, zillions of rows per secondaryId, no additional columns, and small number of values -- that this might be comparable or slightly faster than (1) above. There is a little overhead for scanning an index and sorting a small amount of data is often quite fast.
And, this method is almost certainly faster than (2).
As mentioned in the comments, you should test the queries on your system with your data (use a reasonable amount of data!). Or, update the table statistics and learn to read execution plans.

Where clause is slowing my query from 2 seconds to 24 seconds

I am trying to write a simple query to count the results from a big table.
SELECT COUNT(*)
FROM DM.DM_CUSTOMER_SEG_BRIDGE_CORP_DW AL3
WHERE (AL3.REFERENCE_YEAR(+) =2012)
Above query is taking around 24 seconds to return me output. If I remove where clause and execute same query, it is giving me result in 2 seconds.
May i know what is the reason for that. I am relatively new to SQL queries.
Please help
Thanks,
Naveen
You might need an index on the table. Typically you will need an index on any columns used in the where clause
as for the (+) syntax I think it is redundant (i'm no Oracle expert) but see Difference between Oracle's plus (+) notation and ansi JOIN notation?
The reason may seem subtle. But there are multiple ways that Oracle could approach a query like this:
SELECT COUNT(*)
FROM DM.DM_CUSTOMER_SEG_BRIDGE_CORP_DW AL3
One way is to read all the rows in the table. Because this is a big table, that is not the most efficient approach. A second method would be to use statistics of some sort, where the number of rows are in the statistics. I don't think Oracle ever does this, but it is conceivable.
The final method is to read an index. Typically, an index would be much smaller than the table and it might already be in memory. The above query would be reading a much smaller amount of data. (Here is an interesting article on counting all the rows in a table.)
When you introduce the where clause,
WHERE (AL3.REFERENCE_YEAR(+) =2012)
Oracle can no longer scan just any index. It would have to scan the reference_year index. What is the problem? If it scanned an index, it would still need to fetch the data records to get the value of reference_year -- and that is equivalent (actually worse) than scanning the whole table.
Even with an index on reference_year, you are not guaranteed to use the index. The problem is something called selectivity. The number of rows that you are fetching may still be quite large, relative to the number of rows in the database (in this context, 10% is "quite large"). The Oracle optimize may choose to do a full table scan rather than read the index.

Does using the TOP X * format in SQL speed up queries significantly?

So lately when I run queries on huge tables I'll use the the top 10 * notation like so:
select top 10 * from BI_Sessions (nolock)
where SessionSID like 'b6d%'
and CreateDate between '03-15-2012' AND '05-18-2012'
I thought that it let's it run faster, but it doesn't seem so , this one took 4 minutes(or is that OK time)?
I guess I'm curious about whether the top functionality happens after it pulls all the data anyway(which would seem like it's inefficient).
thanks
It entirely depends on the query, with the exceptino of "Top 0". "Top 0" does return much faster.
In your case, the query has to look through the rows in a huge table to find rows that match the WHERE clause. If no rows are found, the number of rows being returned doesn't help. If the rows are at the end of the table scan, then the number of rows being returned doesn't help.
There are certain cases with more complicated queries where the "top" could affect performance. There is a difference between optimizing overall and for the first row returned. I'm not sure if SQL Server's optimizer recognizes this difference.
Well, it depends. If you do not have a covering index on BI_sessions and its a large database then the answer is probably. A good covering index may be something like: CreateDate, SessionSIS, and all the columns you actually need to return. If you do have a coveing index, then SQL will not even read the table, it will get all the data it needs from the covering index. Possibly if you specified the columns you actually need to return, 10 rows should come back in a fraction of a second.
for more useful info
http://www.mssqltips.com/sqlservertip/1078/improve-sql-server-performance-with-covering-index-enhancements/
and a bit more technical:
http://www.simple-talk.com/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/
also
http://www.sqlserverinternals.com/
and
http://www.insidesqlserver.com/thebooks.html

Faster to get one row from DB or count number of rows

I was wondering if its faster to retrieve the "count" of the number of rows or to retrieve just 1 row using limit. The purpose being to see whether theres any row when given certain Where conditions.
A count is always an expensive query because it will take a full table scan. You requirements are not really clear to me, but if you just want to see whether there is any data it would be cheaper to do a regular select with a limit to 1.
A count must physically count all rows that match your criteria, which is unnecessary work as you don't care about the number.
Look at using EXISTS.
It think it depends on which storage engine you use for the database...
Anywawy, the good practise to test wether there are or not results is to check the return value from the feth() function!
[I'm using Oracle as an example here, but the same concepts apply usually across the board.]
COUNT has to physically identify all the rows that will be returned. Depending on the complexity of the query, the query plan could require a table scan on one or more of the tables of the query. You'd need to do an EXPLAIN PLAN to know for sure.
Returning a single row may require the same processing if an ORDER BY is required. The database can't just give you the first row until
all the rows are identified and
the rows have been sorted.
Also, depending on the number of rows being returned, the complexity of the ORDER BY, and the SGA size, a temporary table might need to be created (which causes all sorts of other overhead).
If there's no ORDER BY, a single row should be faster because as soon as the data identifies a single row to return, it's done. That said, you're not guaranteed from one execution to the next that the rows are returned in the same order, so usually an ORDER BY is involved.
It's faster to retrieve the count of the number of rows

Does limiting a query to one record improve performance

Will limiting a query to one result record, improve performance in a large(ish) MySQL table if the table only has one matching result?
for example
select * from people where name = "Re0sless" limit 1
if there is only one record with that name? and what about if name was the primary key/ set to unique? and is it worth updating the query or will the gain be minimal?
If the column has
a unique index: no, it's no faster
a non-unique index: maybe, because it will prevent sending any additional rows beyond the first matched, if any exist
no index: sometimes
if 1 or more rows match the query, yes, because the full table scan will be halted after the first row is matched.
if no rows match the query, no, because it will need to complete a full table scan
If you have a slightly more complicated query, with one or more joins, the LIMIT clause gives the optimizer extra information. If it expects to match two tables and return all rows, a hash join is typically optimal. A hash join is a type of join optimized for large amounts of matching.
Now if the optimizer knows you've passed LIMIT 1, it knows that it won't be processing large amounts of data. It can revert to a loop join.
Based on the database (and even database version) this can have a huge impact on performance.
To answer your questions in order:
1) yes, if there is no index on name. The query will end as soon as it finds the first record. take off the limit and it has to do a full table scan every time.
2) no. primary/unique keys are guaranteed to be unique. The query should stop running as soon as it finds the row.
I believe the LIMIT is something done after the data set is found and the result set is being built up so I wouldn't expect it to make any difference at all. Making name the primary key will have a significant positive effect though as it will result in an index being made for the column.
If "name" is unique in the table, then there may still be a (very very minimal) gain in performance by putting the limit constraint on your query. If name is the primary key, there will likely be none.
Yes, you will notice a performance difference when dealing with the data. One record takes up less space than multiple records. Unless you are dealing with many rows, this would not be much of a difference, but once you run the query, the data has to be displayed back to you, which is costly, or dealt with programmatically. Either way, one record is easier than multiple.