ROW_NUMBER Without ORDER BY - sql

I have to add a row number to my existing query so that I can track how much data has been added to Redis. If my query fails, I can restart from that row number, which is stored in another table.
Query to get the data starting after row 1000 of the table:
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (Order by (select 1)) as rn ) as X where rn > 1000
The query works fine. Is there any way to get the row number without using ORDER BY?
What is (SELECT 1) here?
Is the query optimized, or can I do it another way? Please suggest a better solution.

There is no need to worry about specifying a constant in the ORDER BY expression. The following is quoted from Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions by Itzik Ben-Gan (it was available as a free download from the Microsoft free e-books site):
As mentioned, a window order clause is mandatory, and SQL Server
doesn’t allow the ordering to be based on a constant—for example,
ORDER BY NULL. But surprisingly, when passing an expression based on a
subquery that returns a constant—for example, ORDER BY (SELECT
NULL)—SQL Server will accept it. At the same time, the optimizer
un-nests, or expands, the expression and realizes that the ordering is
the same for all rows. Therefore, it removes the ordering requirement
from the input data. Here’s a complete query demonstrating this
technique:
SELECT actid, tranid, val,
ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM dbo.Transactions;
Observe in the properties of the Index Scan iterator that the Ordered
property is False, meaning that the iterator is not required to return
the data in index key order
The above means that when you order by a constant, no sorting is performed. I strongly recommend reading the book, as Itzik Ben-Gan describes in depth how window functions work and how to optimize various cases in which they are used.

Try just order by 1. Read the error message. Then reinstate the order by (select 1). Realise that whoever wrote this has, at some point, read the error message and then decided that the right thing to do is to trick the system into not raising an error rather than realising the fundamental truth that the error was trying to alert them to.
Tables have no inherent order. If you want some form of ordering that you can rely upon, it's up to you to provide enough deterministic expression(s) to any ORDER BY clause such that each row is uniquely identified and ordered.
Anything else, including tricking the system into not emitting errors, is hoping that the system will do something sensible without using the tools provided to you to ensure that it does something sensible - a well specified ORDER BY clause.
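As a minimal sketch of what a well-specified ORDER BY could look like for the resumable case in the question: the table name dbo.SourceData and the unique key column id below are hypothetical (neither appears in the question), and the sketch assumes such a unique column exists.

```sql
-- Hypothetical names: dbo.SourceData and its unique key column id are
-- assumptions for illustration only.
SELECT x.*
FROM (
    SELECT s.*,
           -- Ordering by a unique column leaves no ties, so the numbering
           -- is fully determined by the data.
           ROW_NUMBER() OVER (ORDER BY s.id) AS rn
    FROM dbo.SourceData AS s
) AS x
WHERE x.rn > 1000      -- resume after the last row number that was recorded
ORDER BY x.rn;
```

As long as the underlying rows do not change between runs, each row gets the same rn every time, which is what restarting from a stored row number relies on.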

You can use any literal value, for example:
order by (select 0)
order by (select null)
order by (select 'test')
etc
Refer to this for more information:
https://exploresql.com/2017/03/31/row_number-function-with-no-specific-order/

What is select 1 here?
In this scenario, the author of the query does not really have any particular sorting in mind.
ROW_NUMBER requires an ORDER BY clause, so providing one is a way to satisfy the parser.
Sorting by a "constant" produces a nondeterministic order (the query optimizer is free to choose whatever order it finds suitable).
Easiest way to think about it is as:
ROW_NUMBER() OVER(ORDER BY 1) -- error
ROW_NUMBER() OVER(ORDER BY NULL) -- error
There are a few possible ways to provide a constant expression to "trick" the query optimizer:
ROW_NUMBER() OVER(ORDER BY (SELECT 1)) -- already presented
Other options:
ROW_NUMBER() OVER(ORDER BY 1/0) -- should not be used
ROW_NUMBER() OVER(ORDER BY @@SPID)
ROW_NUMBER() OVER(ORDER BY DB_ID())
ROW_NUMBER() OVER(ORDER BY USER_ID())

Related

What SQL query operations can change their value dependent on order?

[see addendum below]
Recently I was going through an SQL script as part of a task to check the functionality of a data science process. I had a copy of a section of the script that had multiple subqueries, and I refactored it to move the subqueries to the top in a WITH clause. I usually think of this as an essentially syntactic refactoring operation that is semantically neutral. However, the behavior of the script changed.
Investigation showed that it was due to the use of a row number over a partition
in which the ordering within the partition was not complete. Changing the structure
of the code changed something in the execution plan, that changed the order within
the slack left by the incomplete ordering.
I made a note of this point and became less confident of this refactoring, although
I hold the position that order should not affect the semantics, at least as long as
it can be avoided.
My question is ...
other than assigning a row number, what operations have a value that is changed
by the ordering?
I realize now that the question was a bit too open - both answers below were useful to me, but I cannot pick one over the other as THE right answer. I have up-voted both. Thanks. [I rethought that, and will pick one of the answers, rather than none. The one I pick was a bit more on target].
I also realize that the core of the problem was my not having strongly enough in mind that any refactoring can potentially change the contingent order in which the rows are returned. From now on, if I refactor and it changes the result - I will look for issues with ordering.
When window functions are involved, especially ROW_NUMBER(), the first thing to check is whether the columns used for ordering produce a stable sort.
For instance:
CREATE TABLE t(id INT, grp VARCHAR(100), d DATE, val VARCHAR(100));
INSERT INTO t(id, grp, d, val)
VALUES (1, 'grpA', '2021-10-16', 'b')
,(2, 'grpA', '2021-10-16', 'a')
,(3, 'grpA', '2021-10-15', 'c')
,(4, 'grpA', '2021-10-14', 'd')
,(5, 'grpB', '2021-10-13', 'a')
,(6, 'grpB', '2021-10-13', 'g')
,(7, 'grpB', '2021-10-12', 'h');
-- the sort is not stable, d column has a tie
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY d DESC) AS rn
FROM t) sub
WHERE sub.rn = 1 AND sub.val = 'a';
Depending on the order of operations, it could return:
0 rows
1 row (id: 2)
1 row (id: 5)
2 rows (id: 2 and 5)
When a query is refactored, the optimizer may choose a different path to access the data, and therefore produce a different result.
To check whether the sort is stable, a windowed COUNT partitioned by all the columns used for partitioning and ordering can be used:
SELECT *
FROM (
SELECT t.*, COUNT(*) OVER(PARTITION BY grp, d ) AS cnt
FROM t) sub
WHERE cnt > 1;
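Once ties are found, one possible fix is to add a tie-breaker to the window ordering. This sketch reuses the sample table t from above and assumes id is unique (it is in the sample data, but the table definition does not enforce it):

```sql
-- Same query as above, with id added as a tie-breaker so that the
-- ordering inside each partition is complete.
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY d DESC, id) AS rn
    FROM t
) sub
WHERE sub.rn = 1 AND sub.val = 'a';
```

With the sample rows, id 1 now always ranks first in grpA and id 5 in grpB, so the query returns exactly one row (id: 5) regardless of the execution plan.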
So are you saying you had gaps in the row numbers, duplicate row numbers, or just row numbers that jumped around (unstable)?
Which functions are affected by an incomplete or unstable ORDER BY? All the ones where you put ORDER BY in the window function, so ROW_NUMBER, LAG or LEAD, for example.
In general, a sub-select and a CTE (WITH clause) are the same. The primary difference is that multiple things can JOIN to the same CTE (hence the "Common" part). This can be good or bad: you might save on some expensive calculation, but you might also slow down a critical path and make the whole execution slower.
Or the data might be processed a little more (due to JOINs etc.), and then the incomplete ORDER BY or instability might be exposed.
SQL is not based on set theory but on list theory. It is true that many join-based operations have an output such that the underlying bag of elements is a function of the underlying bag of elements in the input - but there are operations, such as row_number() as mentioned, in which this is not the case.
I would like to add a more obscure effect not mentioned in the other answers so far. Floating point arithmetic. Since the order of adding up floating point numbers does actually make a difference, it is possible that using a different ordering clause can produce different floating point values.
In the case mentioned in the posted question, this did actually happen - although only in the 10th decimal place. But that can be enough to change which value is bigger than another, and so make a discrete and significant change to the result of the outermost query.
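To illustrate the floating-point effect, here is a small sketch that does not use the data from the question. It only assumes that FLOAT is an IEEE 754 double, which is the default in SQL Server and Snowflake. The same three values are added with two different groupings:

```sql
-- The same three values, added in two different groupings.
SELECT
    (CAST(1e16 AS FLOAT) + CAST(-1e16 AS FLOAT)) + CAST(1 AS FLOAT) AS grouped_left,
    CAST(1e16 AS FLOAT) + (CAST(-1e16 AS FLOAT) + CAST(1 AS FLOAT)) AS grouped_right;
```

Under double precision, grouped_left should come out as 1 and grouped_right as 0, because the 1 is lost when it is added to -1e16 first. A SUM over FLOAT values can be affected in the same way when the rows are accumulated in a different order.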
Another example would be LISTAGG. I inherited some code that used LISTAGG, but didn't give consistent answers when I tweaked it, because it didn't include the ordering clause: WITHIN GROUP ( ORDER BY ...).
From the Snowflake docs:
If you do not specify the WITHIN GROUP (<orderby_clause>), the order
of elements within each list is unpredictable. (An ORDER BY clause
outside the WITHIN GROUP clause applies to the order of the output
rows, not to the order of the list elements within a row.)
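A sketch of what the ordered form could look like, reusing the sample table t from the earlier answer (that table is not the code I inherited, and the choice of id as tie-breaker is an assumption for illustration):

```sql
-- Snowflake-style LISTAGG with an explicit, complete ordering of the
-- elements inside each list.
SELECT grp,
       LISTAGG(val, ',') WITHIN GROUP (ORDER BY val, id) AS vals
FROM t
GROUP BY grp;
```

With those sample rows this gives 'a,b,c,d' for grpA and 'a,g,h' for grpB. Without the WITHIN GROUP clause, the element order inside each string would be unpredictable.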

bigquery resources exceeded due to order by

When I run the following query, I get a 'resources exceeded' error. If I remove the last line (the ORDER BY clause), it works:
SELECT
id,
INTEGER(-position / (CASE WHEN fallback = 0 THEN 2 ELSE 1 END)) AS major_sort
FROM (
SELECT
id,
fallback,
ROW_NUMBER() OVER(PARTITION BY fallback) AS position
FROM
[table] AS r
ORDER BY
r.score DESC ) AS r
ORDER BY major_sort DESC
Actually the entire last line would be:
ORDER BY major_sort DESC, r.score DESC
But that would probably make things even worse.
Any idea how I could change the query to circumvent this problem?
((If you wonder what this query does: the table contains a 'ranking' with multiple fallback strategies and I want to create an ordering like this: 'AABAABAABAAB' with 'A' and 'B' being the fallback strategies. If you have a better idea how to achieve this; please feel free to tell me :D))
A top-level ORDER BY will always serialize execution of your query: it will force all computation onto a single node for the purpose of sorting. That's the cause of the resources exceeded error.
I'm not sure I fully understand your goal with the query, so it's hard to suggest alternatives, but you might consider putting an ORDER BY clause within the OVER(PARTITION BY ...) clause. Sorting a single partition can be done in parallel and may be closer to what you want.
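A sketch of that suggestion, applied to the inner query from the question (legacy SQL syntax, with [table] left as the placeholder used there). Whether it produces the intended ranking is an assumption on my part:

```sql
-- The score ordering moves inside the window, so it is applied
-- per fallback partition instead of as a separate ORDER BY.
SELECT
  id,
  fallback,
  ROW_NUMBER() OVER (PARTITION BY fallback ORDER BY score DESC) AS position
FROM
  [table]
```

The inner ORDER BY r.score DESC is then no longer needed, since position already encodes the score order within each fallback group.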
More general advice on ordering:
Order is not preserved during BQ queries, so if there's an ordering that you want to preserve on the input rows, make sure it's encoded in your data as an extra field.
The use cases for large amounts of globally-sorted data are somewhat limited. Often when users run into resource limitations with ORDER BY, we find that they're actually looking for something slightly different (locally ordered data, or "top N"), and that it's possible to get rid of the global ORDER BY completely.

Calculating SQL Server ROW_NUMBER() OVER() for a derived table

In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:
ROW_NUMBER() OVER()
This is particularly useful when used with ordered derived tables, such as:
SELECT t.*, ROW_NUMBER() OVER()
FROM (
SELECT ...
ORDER BY
) t
How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:
-- This order here ---------------------vvvvvvvv
SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
FROM (
SELECT TOP 100 PERCENT ...
-- vvvvv ----redefines this order here
ORDER BY
) t
A concrete example (as can be seen on SQLFiddle):
SELECT v, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM (
SELECT TOP 100 PERCENT 1 UNION ALL
SELECT TOP 100 PERCENT 2 UNION ALL
SELECT TOP 100 PERCENT 3 UNION ALL
SELECT TOP 100 PERCENT 4
-- This descending order is not maintained in the outer query
ORDER BY 1 DESC
) t(v)
Also, I cannot reuse any expression from the derived table to reproduce the ORDER BY clause in my case, as the derived table might not be available as it may be provided by some external logic.
So how can I do it? Can I do it at all?
The Row_Number() OVER (ORDER BY (SELECT 1)) trick should NOT be seen as a way to avoid changing the order of underlying data. It is only a means to avoid causing the server to perform an additional and unneeded sort (it may still perform the sort but it's going to cost the minimum amount possible when compared to sorting by a column).
All queries in SQL Server ABSOLUTELY MUST have an ORDER BY clause in the outermost query for the results to be reliably ordered in a guaranteed way.
The concept of "retaining original order" does not exist in relational databases. Tables and queries must always be considered unordered until and unless an ORDER BY clause is specified in the outermost query.
You could try the same unordered query 100,000 times and always receive it with the same ordering, and thus come to believe you can rely on said ordering. But that would be a mistake, because one day, something will change and it will not have the order you expect. One example is when a database is upgraded to a new version of SQL Server--this has caused many a query to change its ordering. But it doesn't have to be that big a change. Something as little as adding or removing an index can cause differences. And more: Installing a service pack. Partitioning a table. Creating an indexed view that includes the table in question. Reaching some tipping point where a scan is chosen instead of a seek. And so on.
Do not rely on results to be ordered unless you have said "Server, ORDER BY".
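One way to apply this, sketched under the assumption that the ordering expression can be written at the point where the row number is computed (which the question says may not always be possible): compute ROW_NUMBER() with an explicit ORDER BY, then sort the outermost query by it. The inline VALUES list is only a stand-in for the real source.

```sql
SELECT t.v, t.rn
FROM (
    -- The intended order is stated in the window function itself.
    SELECT src.v,
           ROW_NUMBER() OVER (ORDER BY src.v DESC) AS rn
    FROM (VALUES (1), (2), (3), (4)) AS src(v)
) AS t
ORDER BY t.rn;   -- the outermost ORDER BY is what guarantees the output order
```

This returns v = 4, 3, 2, 1 with rn = 1, 2, 3, 4, and the row numbers no longer depend on whatever order the derived table happens to produce.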

What is the result set ordering when using window functions that have `order by` components?

I'm working on a query on the SEDE:
select top 20
row_number() over(order by "percentage approved" desc, approved desc),
row_number() over(order by "total edits" asc),
*
from editors
where "total edits" > 30
What is the ordering of the result set, taking into account the two window functions?
I suspect it's undefined but couldn't find a definitive answer. OTOH, results from queries with one such window function were ordered according to the over(order by ...) clause.
The results can be returned in any order.
Now, they will often be returned in the same order as specified in the OVER clause, but this is just because SQL Server is likely to pick a query plan that sorts the rows to calculate the aggregate. This is by no means guaranteed, as it could pick a different query plan at any time, especially as you make your query more complex which extends the space of possible query plans.
The result set of ANY SQL Server query that doesn't have an explicit ORDER BY is undefined.
This includes when you have window functions within the query, or an ORDER BY in a subquery. The result order will depend on a lot of factors, none of which are guaranteed unless you specify an ORDER BY.
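If a particular order is wanted, a sketch of the query with an explicit ORDER BY could look like this. The aliases approval_rank and edits_rank are made up here, and choosing approval_rank as the sort key is an assumption about the intent:

```sql
select top 20
    row_number() over(order by "percentage approved" desc, approved desc) as approval_rank,
    row_number() over(order by "total edits" asc) as edits_rank,
    *
from editors
where "total edits" > 30
order by approval_rank  -- also determines which 20 rows TOP keeps
```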

Sql server ROW_NUMBER() & Rank() function detail....how it works

I have never used the SQL Server ROW_NUMBER() function. I read some articles about ROW_NUMBER(), PARTITION and RANK(), but they are still not clear to me.
I found that the syntax looks like this:
SELECT top 10 ROW_NUMBER() OVER(ORDER BY JID DESC) AS 'Row Number',
JID,Specialist, jobstate, jobtype FROM bbajobs
SELECT top 10 ROW_NUMBER() OVER(PARTITION BY JID ORDER BY JID DESC) AS 'Row Number',
JID,Specialist, jobstate, jobtype FROM bbajobs
I have a few questions:
1) What does the OVER() function do? Why do we need to specify a column name in it, as in OVER(ORDER BY JID DESC)?
2) I sometimes see people use the PARTITION keyword. What is it?
It is also used in the OVER function, as in OVER(PARTITION BY JID ORDER BY JID DESC).
3) In what type of situation do we have to use the PARTITION keyword?
4) When we specify the PARTITION keyword in OVER, why do we also need to specify ORDER BY? Can the PARTITION keyword not be used on its own in the OVER clause?
5) In what type of situation should one use the RANK function?
6) What is a CTE, and what is the advantage of using one? It seems to be just like a temporary view.
Does anyone get a performance boost from using a CTE, other than reusability?
Please discuss my points in detail. It would be very helpful if someone could explain keywords like ROW_NUMBER(), PARTITION and RANK() with small, easy examples. Thanks.
OVER Clause (Transact-SQL)
Ranking Functions (Transact-SQL)
ROW_NUMBER (Transact-SQL)
RANK (Transact-SQL)
1) You need ORDER BY because sets have no order otherwise. You need it for a standard SELECT
2) PARTITION BY resets the COUNT per partition
3) Many
4) See point 1. You can use PARTITION by itself for SUM, COUNT etc
5) See MSDN
6) Separate question
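A small sketch to make points 1) to 4) concrete. The inline sample rows and the column names grp and score are made up for illustration and are not from the bbajobs table:

```sql
SELECT s.grp,
       s.score,
       -- Numbering restarts at 1 for each grp value.
       ROW_NUMBER() OVER (PARTITION BY s.grp ORDER BY s.score DESC) AS row_num,
       -- Tied scores share a rank, and the following rank is skipped.
       RANK()       OVER (PARTITION BY s.grp ORDER BY s.score DESC) AS rnk,
       -- An aggregate can use PARTITION BY without ORDER BY.
       SUM(s.score) OVER (PARTITION BY s.grp) AS grp_total
FROM (VALUES ('A', 10), ('A', 10), ('A', 7), ('B', 5)) AS s(grp, score);
```

Within group A, the two rows with score 10 both get RANK 1 and the row with score 7 gets RANK 3, while ROW_NUMBER assigns 1, 2, 3 (which of the two tied rows gets 1 is arbitrary). grp_total is 27 on every row of group A and 5 for group B.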