When is the proper time to use the SQL statement "SELECT [results] FROM [tables] WHERE [conditions] FETCH FIRST [n] ROWS ONLY" - sql

I'm not quite sure when selecting only a certain number of rows would be better than simply writing a more specific SELECT statement. I have a feeling I'm missing something pretty straightforward, but I can't figure it out. I have less than six months' experience with SQL, and cursory at that, so I'm sorry if this is a really simple question, but I couldn't find a clear answer.

I know of two common uses:
Paging: be sure to specify an ordering. If an ordering isn't specified, many DB implementations use whatever is convenient for executing the query, and this "optimal" ordering behavior can give very unpredictable results.
SELECT top 10 CustomerName
FROM Customer
WHERE CustomerID > 200 --start of page
ORDER BY CustomerID
Subqueries: many places where a subquery can be issued require the result to be a single value. TOP 1 is just faster than MAX in many cases.
--give me some customer that ordered today
SELECT CustomerName
FROM Customer
WHERE CustomerID =
(
SELECT top 1 CustomerID
FROM Orders
WHERE OrderDate = #Today
)

Custom paging, typically.

We're using the statement for the following reasons:
Show only the most relevant results (say the top 100) without having to transfer all rows from the DB to the client. In this case, we also use ORDER BY.
We just want to know if there are matching rows and see a few examples. In this case, we don't order the results, and again, FETCH FIRST is much cheaper than having the DB prepare to transfer lots of rows only to throw them away at the client. This is usually during software development, when we need to get a feeling for whether a certain SQL statement is right.
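A minimal sketch of both usages (the customer table and its columns are made up for illustration):
-- show only the most relevant results, ordered
SELECT customer_name, total_spent
FROM customer
ORDER BY total_spent DESC
FETCH FIRST 100 ROWS ONLY

-- quick sanity check during development: any matching rows at all?
SELECT *
FROM customer
WHERE country = 'DE'
FETCH FIRST 5 ROWS ONLY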

When you want to display the values to a user, you are only likely to need N rows. Generally the database server can fetch the first N rows faster than it can fetch all of the rows, so your screen redraw can go a little bit faster.
Oracle even has a hint, called FIRST_ROWS, that suggests that getting data back quickly is more important than getting it all back efficiently.

The designers of SQL agreed with you for a long time, which is why standard SQL didn't include TOP/LIMIT and the like for many years; FETCH FIRST n ROWS ONLY was only standardized in SQL:2008.

Think of Google search results and the page numbers it typically shows for the results.
Obviously there's much more to it in their case, but that's the idea.

Aside from paging, any time you want the most or least [insert metric here] row from a table, ordering on [whatever metric] and limiting to 1 row is, IME, better than doing a subquery using MIN/MAX. Could vary by engine.
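For example, a sketch with made-up table/column names, finding the most recent order both ways:
-- ORDER BY + limit to one row
SELECT order_id, order_date
FROM orders
ORDER BY order_date DESC
FETCH FIRST 1 ROW ONLY

-- the same intent with a MAX() subquery (can return several rows on ties)
SELECT order_id, order_date
FROM orders
WHERE order_date = (SELECT MAX(order_date) FROM orders)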

Related

When would two consecutive SELECT queries run in the same session produce different results?

I got an assignment to find a situation when two consecutive select queries produce different results.
My idea is that if we first run the first query, then we modify some records in a parallel session, and then we run the second query, the results will obviously be different.
I'm curious if there are other situations besides the one mentioned above.
The scenario you described would definitely work, although note you'd have to commit those changes (not sure if that was implied or not, but it's probably a good idea to be explicit).
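To make that first scenario concrete, a minimal sketch (the accounts table is made up):
-- session A
SELECT COUNT(*) FROM accounts

-- session B
INSERT INTO accounts (id, balance) VALUES (42, 100)
COMMIT

-- session A again: the same query now returns a different count
SELECT COUNT(*) FROM accounts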
Another idea that may or may not be a valid solution here is to play around with the ordering. E.g., consider a query like SELECT num_col FROM my_table. Since there is no ORDER BY clause, the database is free to return the rows in any order it chooses. Creating an index on num_col between the two queries would probably make the database prefer to read the data from the index (a full index scan instead of a full table scan), and chances are you'll get the result in a different order with the index than without it.
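Roughly like this, reusing my_table/num_col from above (the index name is made up):
SELECT num_col FROM my_table   -- no ORDER BY: the row order is up to the database

CREATE INDEX my_table_num_col_idx ON my_table (num_col)

SELECT num_col FROM my_table   -- may now come back in index order (still not guaranteed)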
EDIT:
Another idea could be to query the current time (e.g., SELECT CURRENT_TIMESTAMP in PostgreSQL; other RDBMSs may have slightly different syntax) - no data in the database is changed, but consecutive calls to the same query will return different results as time moves forward.
There is no need to assume that underlying data changes.
The simplest solution is using a volatile function. For instance, this might return different results when run at different times -- even with no changes to the underlying data:
select t.*
from t
where created_at < current_timestamp - interval '1 year';
Or:
select t.*
from t
order by random()
fetch first 100 rows only;
Actually, a query as simple as:
select random()
also meets the requirements, without actually having to involve any tables.
As you've said, the same two consecutive selects would only produce different results if the data has changed between the two selects.
You could also run a select in a transaction where data was changed, then roll back and run the same select again and get different results, but once again that implies the underlying data has changed.
If you use the WITH (NOLOCK) hint, you can query the same data two times and get different results, for example if a big update was running during one of your selects. It's an edge case, but it can happen.
Brent Ozar has a very good explanation of the "NOLOCK" issue: https://www.brentozar.com/archive/2019/08/but-nolock-is-okay-when-the-data-isnt-changing-right/
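A rough SQL Server sketch of that edge case (the table name is made up):
-- dirty read: may see rows mid-update (or rows that are later rolled back),
-- so two consecutive runs can disagree
SELECT COUNT(*) FROM dbo.Orders WITH (NOLOCK)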

Speed of paged queries in Oracle

This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about some queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly ever want to fetch all of them; that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint).
Case 2, however, is a bit trickier with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMSs. So we nest our "business" query in a technical wrapper like this:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*, ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... USER SORTED business query ...]
) inner
)
WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated so we know how many pages there will be, even without fetching all the data. Now, this paging query is usually quite satisfactory. But every now and then (as I said, when querying 5M+ records, possibly including non-indexed searches), it runs for 2-3 minutes.
EDIT: Please note that a potential bottleneck is not so easy to circumvent, because of the sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?
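For reference, the ROW_NUMBER() variant mentioned above might look roughly like this (sort_col stands in for whatever the user sorted by):
SELECT *
FROM (
  SELECT q.*,
         ROW_NUMBER() OVER (ORDER BY q.sort_col) AS rnum,
         COUNT(*)     OVER ()                    AS total_rows
  FROM (
    [... USER business query, without its ORDER BY ...]
  ) q
)
WHERE rnum > ? AND rnum <= ?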
The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query, they will be called far fewer times.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
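For example, a sketch assuming a page size of 20 (in query text the hint is written FIRST_ROWS(n)):
SELECT /*+ FIRST_ROWS(20) */ x.*
FROM (
  [... USER SORTED business query ...]
) x
WHERE ROWNUM <= 20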
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?
You might want to trace the query that takes a lot of time and look at its explain plan. Most likely the performance bottleneck comes from the TOTAL_ROWS calculation: Oracle has to read all the data even if you only fetch one row. This is a common problem that all RDBMSs face with this type of query. No implementation of TOTAL_ROWS will get around that.
The radical way to speed up this type of query is to forgo the TOTAL_ROWS calculation. Just display that there are additional pages. Do your users really need to know that they can page through 52486 pages? An estimation may be sufficient. That's another solution, implemented by Google search for example: estimate the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial.
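If you forgo the total, one sketch for still showing a "next page" indicator is to fetch one row more than the page size, reusing the wrapper from the question (the bind placeholders hold the page boundaries):
SELECT w.*
FROM (
  SELECT q.*, ROWNUM AS rnum
  FROM ( [... USER SORTED business query ...] ) q
  WHERE ROWNUM <= ? + 1   -- page end + 1: the extra row only signals "more pages exist"
) w
WHERE w.rnum > ?          -- page start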
A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.
It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without any sorting or joining to all the tables (join table elimination based on declared foreign key constraints). This is what we generally do in our application. For performance-critical statements, we write a separate query that we know will return the correct count, as we can sometimes do better than Oracle.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.

Does SELECT DISTINCT imply a sort of the results

Does including DISTINCT in a SELECT query imply that the resulting set should be sorted?
I don't think it does, but I'm looking for an authoritative answer (web link).
I've got a query like this:
Select Distinct foo
From Bar
In Oracle, the results are distinct but are not in sorted order. In Jet/MS Access there seems to be some extra work being done to ensure that the results are sorted. I'm assuming that Oracle is following the spec in this case and MS Access is going beyond it.
Also, is there a way I can give the table a hint that it should be sorting on foo (unless otherwise specified)?
From the SQL92 specification:
If DISTINCT is specified, then let TXA be the result of eliminating redundant duplicate values from TX. Otherwise, let TXA be TX.
...
4) If an <order by clause> is not specified, then the ordering of the rows of Q is implementation-dependent.
Ultimately the real answer is that DISTINCT and ORDER BY are two separate parts of the SQL statement; If you don't have an ORDER BY clause, the results by definition will not be specifically ordered.
No. There are a number of circumstances in which a DISTINCT in Oracle does not imply a sort, the most important of which is the hashing algorithm used in 10g+ for both group by and distinct operations.
Always specify ORDER BY if you want an ordered result set, even in 9i and below.
There is no "authoritative" answer link, since this is something that no SQL server guarantees.
You will often see results in order when using DISTINCT, as a side effect of how those results are found. However, any number of other things can mix up the results, and some servers may hand the results back unsorted even if they had to sort to produce them.
Bottom line: if your server doesn't guarantee something you shouldn't count on it.
Not to my knowledge, no. The only reason I can think of is that SQL Server would internally sort the data in order to detect and filter out duplicates, and thus return it in a "pre-sorted" manner. But I wouldn't rely on that "side effect" :-)
No, it is not implying a sort. In my experience, it sorts by the known index, which may happen to be foo.
Why be subtle? Why not specify Select Distinct foo from Bar Order by foo?
On at least one server I've used (probably either Oracle or SQL Server, about six years ago), SELECT DISTINCT was rejected if you didn't have an ORDER BY clause. It was accepted on the "other" server (Oracle or SQL Server). Your mileage may vary.
No, the results are not sorted. If you want to give it a 'hint', you can certainly supply an ORDER BY:
select distinct foo
from bar
order by foo
But keep in mind that you might not want to sort just alphabetically on foo; instead you might want to sort on criteria in other fields. See:
http://weblogs.sqlteam.com/jeffs/archive/2007/12/13/select-distinct-order-by-error.aspx
As the answers mostly say, DISTINCT does not mandate a sort - only ORDER BY mandates that. However, one standard way of achieving DISTINCT results is to sort; the other is to hash the values (which tends to lead to semi-random sequencing). Relying on the sort effect of DISTINCT would be foolish.
In my case (SQL Server), as an example, I had a list of countries with a numerical value X assigned to each. When I did a SELECT DISTINCT * FROM Table ORDER BY X, it ordered by X, but the countries in the result set also came back ordered, which I hadn't explicitly requested.
From my experience, I'd say that DISTINCT does imply an implicit sort.
Yes, Oracle does use a sort to calculate a DISTINCT. You can see that if you look at the explain plan. The fact that it did a sort for that calculation does not in any way imply that the result set will be sorted. If you want the result set sorted, you are required to use the ORDER BY clause.

MIN/MAX vs ORDER BY and LIMIT

Out of the following queries, which method would you consider the better one? What are your reasons (code efficiency, better maintainability, less WTFery)...
SELECT MIN(`field`)
FROM `tbl`;
SELECT `field`
FROM `tbl`
ORDER BY `field`
LIMIT 1;
In the worst case, where you're looking at an unindexed field, using MIN() requires a single full pass of the table. Using SORT and LIMIT requires a filesort. If run against a large table, there would likely be a significant difference in perceived performance. As an anecdotal data point, MIN() took .36s while SORT and LIMIT took .84s against a 106,000 row table on my dev server.
If, however, you're looking at an indexed column, the difference is harder to notice (meaningless data point is 0.00s in both cases). Looking at the output of explain, however, it looks like MIN() is able to simply pluck the smallest value from the index ('Select tables optimized away' and 'NULL' rows) whereas the SORT and LIMIT still needs to do an ordered traversal of the index (106,000 rows). The actual performance impact is probably negligible.
It looks like MIN() is the way to go - it's faster in the worst case, indistinguishable in the best case, is standard SQL and most clearly expresses the value you're trying to get. The only case where it seems that using SORT and LIMIT would be desirable would be, as mson mentioned, where you're writing a general operation that finds the top or bottom N values from arbitrary columns and it's not worth writing out the special-case operation.
SELECT MIN(`field`)
FROM `tbl`;
Simply because it is ANSI-compatible. LIMIT 1 is particular to MySQL, as TOP is to SQL Server.
As mson and Sean McSomething have pointed out, MIN is preferable.
One other case where ORDER BY + LIMIT is useful is when you want to get the value of a different column along with the MIN column.
Example:
SELECT some_other_field, field
FROM tbl
ORDER BY field
LIMIT 1
I think the answer depends on what you are doing.
If you have a one-off query and the intent is as simple as you specified, SELECT MIN(field) is preferable.
However, it is common for these types of requirements to change into: grab the top n results, grab the nth-mth results, etc.
I don't think it's too terrible an idea to commit to your chosen database. Changing DBs is not something to do lightly, and having to revise queries is the price you pay when you make that move.
Why limit yourself now, for pain you may or may not feel later on?
I do think it's good to stay ANSI as much as possible, but that's just a guideline...
Given acceptable performance I would use the first one, because it is semantically closer to the intent.
If performance were an issue (most modern optimizers will probably optimize both to the same query plan, although you have to test to verify that), then of course I would use the faster one.
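A quick way to verify, assuming MySQL and the table from the question:
EXPLAIN SELECT MIN(`field`) FROM `tbl`;
EXPLAIN SELECT `field` FROM `tbl` ORDER BY `field` LIMIT 1;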
user650654 said that ORDER BY with LIMIT 1 is useful when one needs "to get the value of a different column than the MIN column". I think in that case we still get better performance with two single passes using MIN instead of sorting (hoping this is optimized):
SELECT some_other_field, field
FROM tbl
WHERE field=(SELECT MIN(field) FROM tbl)

Where should I do the rowcount when checking for existence: sql or php?

In the case where I want to check whether a certain entry in the database exists, I have two options.
I can create an SQL query using COUNT() and then check if the result is > 0...
...or I can just retrieve the record(s) and then count the number of rows in the returned row set, for example with $result->num_rows.
What's better/faster? In MySQL? In general?
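Sketched concretely (the users table and its email column are made up):
-- option 1: ask the database for a count and compare it to 0
SELECT COUNT(*) FROM users WHERE email = 'a@example.com';

-- option 2: fetch the matching rows and count them on the PHP side ($result->num_rows)
SELECT * FROM users WHERE email = 'a@example.com';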
YMMV, but I suspect that if you are only checking for existence, and don't need to use the retrieved data in any way, the COUNT() query will be faster. How much faster will depend on how much data.
The fastest is probably asking the database if something exists:
SELECT EXISTS ([your query here])
SELECT 1
FROM (SELECT 1) t
WHERE EXISTS( SELECT * FROM foo WHERE id = 42 )
Just tested, works fine on MySQL v5
COUNT(*) is generally less efficient if:
you can have duplicates (because the DBMS will have to exhaustively search all of the records/indexes to give you the exact answer), or
you can have NULL entries (for the same reason)
If you are COUNT'ing based on a WHERE clause that is guaranteed to produce a single record (or 0) and the DBMS knows this (based upon UNIQUE indexes), then it ought to be just as efficient. But, it is unlikely that you will always have this condition. Also, the DBMS may not always pick up on this depending on the version and DBMS.
Counting in the application (when you don't need the row) is almost always guaranteed to be slower/worse because:
You have to send data to the client, the client has to buffer it and do some work
You may bump out things in the DBMS MRU/LRU data cache that are more important
Your DBMS will (generally) have to do more disk I/O to fetch record data that you will never use
You have more network activity
Of course, if you want to DO something with the row if it exists, then it is definitely faster/best to simply try and fetch the row to begin with!
If all you are doing is checking for existence, then
Select count(*) ...
But if you will retrieve the data if it exists, then just get the data and check it in PHP, otherwise you'll have two calls.
For me, it's in the database.
Doing a COUNT(1) is faster than $result->num_rows, because with $result->num_rows you perform two operations: a select and then a count. If the select itself does the count, you get the result faster.
Except if you also want the actual data from the DB.
If you want raw speed, benchmark! In addition to the methods others have suggested:
SELECT 1 FROM table_name WHERE ... LIMIT 1
may be faster due to avoiding the subselect. Benchmark it.
SELECT COUNT(*) FROM table
is the best choice; this operation is extremely fast on both small and large tables. While it's possible that
SELECT id FROM table
is faster on small tables, the difference in speed will be microscopic. But if you have a large table, this operation can be very slow.
Therefore, your best bet is to always COUNT(*) the table (and it's faster to do * than to pick a specific column), as overall it will be the fastest operation.
I would definitely do it in PHP to decrease the load on the database.
In order to get a count and the returned rows in SQL, you would have to do two queries: a COUNT and then a SELECT.
The PHP way gives you everything you need in one result object.