Does ROW_NUMBER queried data require to fetch the entire result set? - sql

I'm trying to write query that's not using offset (because as I just have learnt offset fetches all data which causes performance overhead). with ROW_NUMBER window function. For instance:
SELECT id
FROM(
SELECT id, ROW_NUMBER() over (order by id) rn
FROM users) sq
WHERE rn > 1000
Does it require all rows to be fetched as it would be with offset 1000? I mean, does it make a sense to use such query instead of
SELECT if
FROM users
OFFSET 1000
? Do I get performance improvement on large amount of data?

Check out the window function docs. Window functions operate on the result set, after the fetch:
Window functions are permitted only in the SELECT list and the ORDER
BY clause of the query. They are forbidden elsewhere, such as in GROUP
BY, HAVING and WHERE clauses. This is because they logically execute
after the processing of those clauses. Also, window functions execute
after regular aggregate functions. This means it is valid to include
an aggregate function call in the arguments of a window function, but
not vice versa.

Does it make sense to use the row_number() query? Well, it produces the same result set. However, the query basically has to assign row_number() to all the rows in order to find the ones that meet the requirement.
The second query, however, is lacking an order by. When using offset, you should have an order by:
SELECT id
FROM users u
ORDER BY id
OFFSET 1000
I would imagine that this is more efficient than using row_number(), but actual timings would demonstrate that.

Related

Can't use ORDER BY in a derived table

I am trying to select the last 20 rows of my SQL Database, but I get this error:
[Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
My query is:
SELECT TOP 20 * FROM (SELECT * FROM TBArticles ORDER BY id_art DESC)
I think it's because I am using ORDER BY in this second expression... but what can I do to select the 20 last rows fixing this error?
You don't need a subquery for this:
SELECT TOP 20 *
FROM TBArticles
ORDER BY id_art DESC
The documentation is quite clear on the use of ORDER BY in subqueries:
The ORDER BY clause is not valid in views, inline functions, derived tables, and subqueries, unless either the TOP or OFFSET and FETCH clauses are also specified. When ORDER BY is used in these objects, the clause is used only to determine the rows returned by the TOP clause or OFFSET and FETCH clauses. The ORDER BY clause does not guarantee ordered results when these constructs are queried, unless ORDER BY is also specified in the query itself.
Gordon's answer is probably the most direct way to handle your requirement. However, if you wanted to use a query along the same lines as the pattern you were already using, you could use ROW_NUMBER here:
SELECT *
FROM
(
SELECT *, ROW_NUMBER() OVER (ORDER BY id_art DESC) rn
FROM TBArticles
) t
WHERE rn <= 20;
By computing a row number in the derived table, the ordering "sticks" in the same way your original query was expecting.

SQL ORDER BY in SQL table returning function

so I have simple function trying to get two fields from database. I'm trying to use order by for the results however I cannot use ORDER BY in return clause.
It tells me
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
Is is it possible to use ORDER BY in RETURN statement? I would like to avoid using order by when executing the function.
CREATE FUNCTION goalsGames1 () RETURNS TABLE
AS RETURN(
SELECT MAX(goals_scored) goals,
no_games
FROM Player
GROUP BY no_games
ORDER BY no_games DESC )
One trick to skip this error is using top as it is mentioned in the error message:
CREATE FUNCTION goalsGames1 () RETURNS TABLE
AS RETURN(
SELECT Top 100 Percent MAX(goals_scored) goals,
no_games
FROM Player
GROUP BY no_games
ORDER BY no_games DESC )
I would like to avoid using order by when executing the function.
If you are using the function and want the results in a particular order, then you need to use ORDER BY.
This is quite clearly stated in the documentation:
The ORDER clause does not guarantee ordered results when a SELECT query is executed, unless ORDER BY is also specified in the query.
use order by intimes of selection your function not in times of creation
so use here in select * from goalsGames1 order by col
and your error tells you where order by is invalid
You cannot order by inside a function, the idea is to order the resultset returned by the function.
select *
from dbo.goalsGames1()
order by no_games
Even if you would order by inside the function, there is no guaranty that this ordering would be preserved when the resultset is returned. The executing query (select * from functionname) has to be responsible for setting the order, not the function or view.
Who ever receives the rows is the only one that can order them, so in this case, the select * from goalsGames1() is the receiver, and this query has to order the results.

What is the execution order of the PARTITION BY clause compared to other SQL clauses?

I cannot find any source mentioning execution order for Partition By window functions in SQL.
Is it in the same order as Group By?
For example table like:
Select *, row_number() over (Partition by Name)
from NPtable
Where Name = 'Peter'
I understand if Where gets executed first, it will only look at Name = 'Peter', then execute window function that just aggregates this particular person instead of entire table aggregation, which is much more efficient.
But when the query is:
Select top 1 *, row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
Doesn't the window function need to be executed against the entire table first then applies the Date> condition otherwise the result is wrong?
Window functions are executed/calculated at the same stage as SELECT, stage 5 in your table. In other words, window functions are applied to all rows that are "visible" in the SELECT stage.
In your second example
Select top 1 *,
row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
WHERE is logically applied before Partition by Name of the row_number() function.
Note, that this is logical order of processing the query, not necessarily how the engine physically processes the data.
If query optimiser decides that it is cheaper to scan the whole table and later discard dates according to the WHERE filter, it can do it. But, any kind of these transformations must be performed in such a way that the final result is consistent with the order of the logical steps outlined in the table you showed.
It is part of the SELECT phase of the query execution. There are different types of SELECT clauses, based on the query.
SELECT FOR
SELECT GROUP BY
SELECT ORDER BY
SELECT OVER
SELECT INTO
SELECT HAVING
PARTITION BY comes in the SELECT OVER clause. Here, a window of the result set is generated out of the result set generated in the previous stages: FROM, WHERE, GROUP BY etc.
The OVER clause defines a window or user-specified set of rows within
a query result set. A window function then computes a value for each
row in the window. You can use the OVER clause with functions to
compute aggregated values such as moving averages, cumulative
aggregates, running totals, or a top N per group results.
OVER ( [ PARTITION BY value_expression ] [ order_by_clause ] )
Arguments
PARTITION BY Divides the query result set into partitions. The window
function is applied to each partition separately and computation
restarts for each partition.
value_expression Specifies the column by which the rowset is
partitioned. value_expression can only refer to columns made available
by the FROM clause. value_expression cannot refer to expressions or
aliases in the select list. value_expression can be a column
expression, scalar subquery, scalar function, or user-defined
variable.
Defines the logical order of the rows within each
partition of the result set. That is, it specifies the logical order
in which the window functioncalculation is performed.
order_by_expression Specifies a column or expression on which to sort.
order_by_expression can only refer to columns made available by the
FROM clause. An integer cannot be specified to represent a column name
or alias.
You can read more about it SELECT-OVER
row_number() (and other window functions) are allowed in two clauses:
SELECT
ORDER BY
The function is parsed along with the rest of the clause. After all, it is a function present in the clause. In both cases, the WHERE clause would be -- logically -- applied first, so the results would be after filtering.
Do note that this is a logical parsing of the query. The actual execution may have little to do with the structure of the query.

Does partition by / order by imply ordering in a query?

I see no difference in:
select
ID,
TYPE,
XTIME,
first_value(XTIME) over (partition by TYPE order by XTIME)
from SERIES;
and:
select
ID,
TYPE,
XTIME,
first_value(XTIME) over (partition by TYPE order by XTIME)
from SERIES
order by TYPE, XTIME;
Does partition by / order by imply ordering in a query?
No it does not.
The result set from a query is ordered only when it has an explicit order by clause. Otherwise the result set is in an arbitrary order. Often, the underlying algorithms used for running the query determine the ordering of the result set.
You can only depend on the final ordering when you have an order by clause for the outermost select. That the result sets are in the same order is simply a coincidence.
I am curious why you don't simply use this?
min(xtime) over (partition by type)
The difference between the two queries is that the second one would be ordered ascending by TYPE and then XTIME, while the first one would, in theory, have random ordering.
The PARTITION BY and ORDER BY clause used in the call to FIRST_VALUE have an effect on that particular analytic function, but do not by themselves affect the final order of the result set.
In general, if you want to order a result set in SQL, you need to use an ORDER BY clause. If both your queries happen to have the same ordering right now, it is only by chance, and is not guaranteed. How might this happen? Well, if Oracle is iterating the clustered index of your table, and that ordering happens to coincide with ORDER BY TYPE, XTIME, then you would see this order even without using ORDER BY.

What is the result set ordering when using window functions that have `order by` components?

I'm working on a query on the SEDE:
select top 20
row_number() over(order by "percentage approved" desc, approved desc),
row_number() over(order by "total edits" asc),
*
from editors
where "total edits" > 30
What is the ordering of the result set, taking into account the two window functions?
I suspect it's undefined but couldn't find a definitive answer. OTOH, results from queries with one such window function were ordered according to the over(order by ...) clause.
The results can be returned in any order.
Now, they will often be returned in the same order as specified in the OVER clause, but this is just because SQL Server is likely to pick a query plan that sorts the rows to calculate the aggregate. This is by no means guaranteed, as it could pick a different query plan at any time, especially as you make your query more complex which extends the space of possible query plans.
The result set of ANY SQL Server query that doesn't have an explicit ORDER BY is undefined.
This includes when you have window functions within the query, or an ORDER BY in a subquery. The result order will depend on a lot of factors, none of which are guaranteed unless you specify an ORDER BY.