When I run the following query, I get a 'resources exceeded' error. If I remove the last line (the ORDER BY clause), it works:
SELECT
id,
INTEGER(-position / (CASE WHEN fallback = 0 THEN 2 ELSE 1 END)) AS major_sort
FROM (
SELECT
id,
fallback,
ROW_NUMBER() OVER(PARTITION BY fallback) AS position
FROM
[table] AS r
ORDER BY
r.score DESC ) AS r
ORDER BY major_sort DESC
Actually the entire last line would be:
ORDER BY major_sort DESC, r.score DESC
But that would probably make things even worse.
Any idea how I could change the query to circumvent this problem?
(If you wonder what this query does: the table contains a 'ranking' with multiple fallback strategies, and I want to create an ordering like 'AABAABAABAAB', with 'A' and 'B' being the fallback strategies. If you have a better idea of how to achieve this, please feel free to tell me :D)
A top-level ORDER BY will always serialize execution of your query: it will force all computation onto a single node for the purpose of sorting. That's the cause of the resources exceeded error.
I'm not sure I fully understand your goal with the query, so it's hard to suggest alternatives, but you might consider putting an ORDER BY clause within the OVER(PARTITION BY ...) clause. Sorting a single partition can be done in parallel and may be closer to what you want.
More general advice on ordering:
Order is not preserved during BQ queries, so if there's an ordering that you want to preserve on the input rows, make sure it's encoded in your data as an extra field.
The use cases for large amounts of globally-sorted data are somewhat limited. Often when users run into resource limitations with ORDER BY, we find that they're actually looking for something slightly different (locally ordered data, or "top N"), and that it's possible to get rid of the global ORDER BY completely.
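As a rough illustration of moving the ordering into the OVER clause, here is a minimal sketch using Python's built-in sqlite3 module instead of BigQuery (an assumption made only so the example can be run locally; window functions need SQLite 3.25 or newer). The table name ranking, its rows, and the block arithmetic used to get an 'AABAAB' pattern are all hypothetical and not taken from the question. The final interleave is done client-side purely to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the asker's table: fallback 0 = 'A', fallback 1 = 'B'.
conn.execute("CREATE TABLE ranking(id INTEGER, fallback INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO ranking VALUES (?, ?, ?)",
    [(1, 0, 9.0), (2, 0, 8.0), (3, 0, 7.0), (4, 0, 6.0), (5, 1, 5.0), (6, 1, 4.0)],
)

# The ordering lives inside the window, so each partition is numbered
# by score independently of how the rows happen to arrive.
rows = conn.execute(
    """
    SELECT id, fallback,
           ROW_NUMBER() OVER (PARTITION BY fallback ORDER BY score DESC) AS pos
    FROM ranking
    """
).fetchall()

positions = {row_id: pos for row_id, _, pos in rows}


def sort_key(row):
    """Two 'A' rows share a block with one 'B' row (assumed AAB pattern)."""
    _, fallback, pos = row
    block = (pos - 1) // 2 if fallback == 0 else pos - 1
    return (block, fallback, pos)


interleaved = [row_id for row_id, _, _ in sorted(rows, key=sort_key)]
print(interleaved)
```

Because the scores in this sample are unique within each partition, the computed positions do not depend on the order in which the engine scans the table.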
[see addendum below]
Recently I was going through an SQL script as part of a task to check functionality of a data science process. I had a copy of a section of the script which had multiple sub queries and I refactored it to put the sub queries up the top in a with-clause. I usually think of this as an essentially syntactic refactoring operation that is semantically neutral. However, the operation of the script changed.
Investigation showed that it was due to the use of a row number over a partition
in which the ordering within the partition was not complete. Changing the structure
of the code changed something in the execution plan, that changed the order within
the slack left by the incomplete ordering.
I made a note of this point and became less confident of this refactoring, although
I hold the position that order should not affect the semantics, at least as long as
it can be avoided.
My question is ...
other than assigning a row number, what operations have a value that is changed
by the ordering?
I realize now that the question was a bit too open - both answers below were useful to me, but I cannot pick one over the other as THE right answer. I have up-voted both. Thanks. [I rethought that, and will pick one of the answers, rather than none. The one I pick was a bit more on target].
I also realize that the core of the problem was my not having strongly enough in mind that any refactoring can potentially change the contingent order in which the rows are returned. From now on, if I refactor and it changes the result - I will look for issues with ordering.
When window functions are involved, especially ROW_NUMBER(), the first thing to check is whether the columns used for ordering produce a stable (deterministic) sort, that is, one without ties.
For instance:
CREATE TABLE t(id INT, grp VARCHAR(100), d DATE, val VARCHAR(100));
INSERT INTO t(id, grp, d, val)
VALUES (1, 'grpA', '2021-10-16', 'b')
,(2, 'grpA', '2021-10-16', 'a')
,(3, 'grpA', '2021-10-15', 'c')
,(4, 'grpA', '2021-10-14', 'd')
,(5, 'grpB', '2021-10-13', 'a')
,(6, 'grpB', '2021-10-13', 'g')
,(7, 'grpB', '2021-10-12', 'h');
-- the sort is not stable, d column has a tie
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY d DESC) AS rn
FROM t) sub
WHERE sub.rn = 1 AND sub.val = 'a';
Depending on the order of operations, it could return:
0 rows
1 row (id: 2)
1 row (id: 5)
2 rows (id: 2 and 5)
When the query is refactored, the engine may choose a different path to access the data, and thus return a different result.
To check whether the sort is stable, a windowed COUNT over all the columns used for partitioning and ordering can be used:
SELECT *
FROM (
SELECT t.*, COUNT(*) OVER(PARTITION BY grp, d ) AS cnt
FROM t) sub
WHERE cnt > 1;
db<>fiddle demo
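The tie check and one possible fix can be sketched as follows, using Python's built-in sqlite3 module (an assumption chosen only so the example is runnable; it needs SQLite 3.25 or newer for window functions). It reuses the sample table t from above and adds id as a tie-breaker, which assumes id is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(id INT, grp TEXT, d TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "grpA", "2021-10-16", "b"),
        (2, "grpA", "2021-10-16", "a"),
        (3, "grpA", "2021-10-15", "c"),
        (4, "grpA", "2021-10-14", "d"),
        (5, "grpB", "2021-10-13", "a"),
        (6, "grpB", "2021-10-13", "g"),
        (7, "grpB", "2021-10-12", "h"),
    ],
)

# Any (grp, d) pair that occurs more than once is a tie in the window ordering.
ties = conn.execute(
    """
    SELECT grp, d, COUNT(*) AS cnt
    FROM t
    GROUP BY grp, d
    HAVING COUNT(*) > 1
    ORDER BY grp
    """
).fetchall()

# Adding the (assumed unique) id as a tie-breaker makes rn = 1 unambiguous.
first_rows = conn.execute(
    """
    SELECT id, val
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY grp ORDER BY d DESC, id) AS rn
        FROM t
    ) AS sub
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

print(ties)
print(first_rows)
```

With the tie-breaker in place, the filter on val = 'a' from the original query would always see the same candidate rows, whatever plan is chosen.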
So are you saying you had gaps in the ROW_NUMBER() values? Or duplicate row numbers? Or did the row numbers just jump around (unstable)?
Which functions are affected by an incomplete or unstable ORDER BY? All the ones where you put ORDER BY in the window function, such as ROW_NUMBER, LAG, or LEAD.
But in general a sub-select and a CTE (WITH clause) are the same. The primary difference is that multiple things can JOIN to the same CTE (hence the "Common" part). This can be good or bad: you might save on some expensive calculation, but you might also slow down a critical path and make the whole execution slower.
Or the data might be a little more processed (due to JOINs etc.), and then the instability of the incomplete ORDER BY might be exposed.
SQL is not based on set theory but on list theory. It is true that many join-based operations have an output such that the underlying bag of elements is a function of the underlying bag of elements in the input - but there are operations, such as row_number() as mentioned, in which this is not the case.
I would like to add a more obscure effect not mentioned in the other answers so far. Floating point arithmetic. Since the order of adding up floating point numbers does actually make a difference, it is possible that using a different ordering clause can produce different floating point values.
In the case mentioned in the posted question, this did actually happen - although only in the 10th decimal place. But that can be enough to change which value is bigger than another, and so make a discrete and significant change to the result of the outermost query.
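The floating-point effect is easy to reproduce outside a database. The following is a small Python sketch with made-up values (not the data from the question) showing that the same three numbers add up differently depending on the order, and that math.fsum gives an order-independent result.

```python
import math

# Made-up values chosen so that the rounding effect is large and obvious.
values = [1e16, 1.0, -1e16]

# Adding 1.0 to 1e16 is lost to rounding, so the running total ends at 0.0.
left_to_right = (values[0] + values[1]) + values[2]

# Cancelling the large terms first keeps the 1.0.
reordered = (values[0] + values[2]) + values[1]

# math.fsum tracks the exact sum internally, so input order does not matter.
exact_forward = math.fsum(values)
exact_backward = math.fsum(list(reversed(values)))

print(left_to_right, reordered, exact_forward, exact_backward)
```

An aggregate such as SUM over floating-point columns can behave like the naive version, so a change in row order may change the low-order digits of the result.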
Another example would be LISTAGG. I inherited some code that used LISTAGG, but didn't give consistent answers when I tweaked it, because it didn't include the ordering clause: WITHIN GROUP ( ORDER BY ...).
From the Snowflake docs:
If you do not specify the WITHIN GROUP (<orderby_clause>), the order
of elements within each list is unpredictable. (An ORDER BY clause
outside the WITHIN GROUP clause applies to the order of the output
rows, not to the order of the list elements within a row.)
I have to add a row number to my existing query so that I can track how much data has been added to Redis. If my query fails, I can then restart from the row number recorded in another table.
This is the query to get the data starting after row 1000 of the table:
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (Order by (select 1)) as rn ) as X where rn > 1000
The query works fine. Is there any way to get the row number without using ORDER BY?
What is select 1 here?
Is the query optimized, or can I do it another way? Please suggest a better solution.
There is no need to worry about specifying a constant in the ORDER BY expression. The following is quoted from Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions by Itzik Ben-Gan (it was available as a free download from Microsoft's free e-books site):
As mentioned, a window order clause is mandatory, and SQL Server
doesn’t allow the ordering to be based on a constant—for example,
ORDER BY NULL. But surprisingly, when passing an expression based on a
subquery that returns a constant—for example, ORDER BY (SELECT
NULL)—SQL Server will accept it. At the same time, the optimizer
un-nests, or expands, the expression and realizes that the ordering is
the same for all rows. Therefore, it removes the ordering requirement
from the input data. Here’s a complete query demonstrating this
technique:
SELECT actid, tranid, val,
ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM dbo.Transactions;
Observe in the properties of the Index Scan iterator that the Ordered
property is False, meaning that the iterator is not required to return
the data in index key order
The above means that when you order by a constant, no sorting is performed. I strongly recommend reading the book, as Itzik Ben-Gan describes in depth how window functions work and how to optimize the various cases in which they are used.
Try just order by 1. Read the error message. Then reinstate the order by (select 1). Realise that whoever wrote this has, at some point, read the error message and then decided that the right thing to do is to trick the system into not raising an error rather than realising the fundamental truth that the error was trying to alert them to.
Tables have no inherent order. If you want some form of ordering that you can rely upon, it's up to you to provide enough deterministic expression(s) to any ORDER BY clause such that each row is uniquely identified and ordered.
Anything else, including tricking the system into not emitting errors, is hoping that the system will do something sensible without using the tools provided to you to ensure that it does something sensible - a well specified ORDER BY clause.
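For the resume-after-failure use case in the question, a deterministic numbering could look like the sketch below. It uses Python's built-in sqlite3 module rather than SQL Server (an assumption made so the example is runnable; SQLite 3.25 or newer is needed), and the items table, its id key, and the rows_after helper are hypothetical names introduced only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source table; id is assumed to be a unique key.
conn.execute("CREATE TABLE items(id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 11)],
)


def rows_after(connection, already_loaded):
    """Return the rows whose row number exceeds the saved checkpoint.

    Ordering by the unique id makes the numbering repeatable between runs,
    so a checkpoint stored after a failure still points at the same rows.
    """
    return connection.execute(
        """
        SELECT id, payload, rn
        FROM (
            SELECT id, payload, ROW_NUMBER() OVER (ORDER BY id) AS rn
            FROM items
        ) AS x
        WHERE rn > ?
        ORDER BY rn
        """,
        (already_loaded,),
    ).fetchall()


remaining = rows_after(conn, 7)
print(remaining)
```

With ORDER BY (SELECT 1), nothing guarantees that row 1001 of the second run is the row that followed row 1000 of the first run; ordering by a unique key removes that risk, provided no rows are inserted or deleted ahead of the checkpoint.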
You can use any literal value wrapped in a subquery, for example:
order by (select 0)
order by (select null)
order by (select 'test')
etc.
Refer to this for more information:
https://exploresql.com/2017/03/31/row_number-function-with-no-specific-order/
What is select 1 here?
In this scenario, the author of the query does not have any particular sorting in mind.
ROW_NUMBER requires an ORDER BY clause, so providing one is a way to satisfy the parser.
Sorting by a "constant" creates a non-deterministic order (the query optimizer is free to choose whatever order it finds suitable).
Easiest way to think about it is as:
ROW_NUMBER() OVER(ORDER BY 1) -- error
ROW_NUMBER() OVER(ORDER BY NULL) -- error
There are a few ways to provide a constant expression that "tricks" the query optimizer:
ROW_NUMBER() OVER(ORDER BY (SELECT 1)) -- already presented
Other options:
ROW_NUMBER() OVER(ORDER BY 1/0) -- should not be used
ROW_NUMBER() OVER(ORDER BY @@SPID)
ROW_NUMBER() OVER(ORDER BY DB_ID())
ROW_NUMBER() OVER(ORDER BY USER_ID())
db<>fiddle demo
In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:
ROW_NUMBER() OVER()
This is particularly useful when used with ordered derived tables, such as:
SELECT t.*, ROW_NUMBER() OVER()
FROM (
SELECT ...
ORDER BY
) t
How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:
-- This order here ---------------------vvvvvvvv
SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
FROM (
SELECT TOP 100 PERCENT ...
-- vvvvv ----redefines this order here
ORDER BY
) t
A concrete example (as can be seen on SQLFiddle):
SELECT v, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM (
SELECT TOP 100 PERCENT 1 UNION ALL
SELECT TOP 100 PERCENT 2 UNION ALL
SELECT TOP 100 PERCENT 3 UNION ALL
SELECT TOP 100 PERCENT 4
-- This descending order is not maintained in the outer query
ORDER BY 1 DESC
) t(v)
Also, I cannot reuse any expression from the derived table to reproduce the ORDER BY clause in my case, as the derived table might not be available as it may be provided by some external logic.
So how can I do it? Can I do it at all?
The Row_Number() OVER (ORDER BY (SELECT 1)) trick should NOT be seen as a way to avoid changing the order of underlying data. It is only a means to avoid causing the server to perform an additional and unneeded sort (it may still perform the sort but it's going to cost the minimum amount possible when compared to sorting by a column).
All queries in SQL Server ABSOLUTELY MUST have an ORDER BY clause in the outermost query for the results to be reliably ordered in a guaranteed way.
The concept of "retaining original order" does not exist in relational databases. Tables and queries must always be considered unordered until and unless an ORDER BY clause is specified in the outermost query.
You could try the same unordered query 100,000 times and always receive it with the same ordering, and thus come to believe you can rely on said ordering. But that would be a mistake, because one day, something will change and it will not have the order you expect. One example is when a database is upgraded to a new version of SQL Server--this has caused many a query to change its ordering. But it doesn't have to be that big a change. Something as little as adding or removing an index can cause differences. And more: Installing a service pack. Partitioning a table. Creating an indexed view that includes the table in question. Reaching some tipping point where a scan is chosen instead of a seek. And so on.
Do not rely on results to be ordered unless you have said "Server, ORDER BY".
I have a query:
SELECT ROW_NUMBER() OVER(ORDER BY LogId) AS RowNum
FROM [Log] l
where RowNum = 1
and I'm getting the following error:
Invalid column name 'RowNum'.
I did some searching here and found that column aliases are not available in WHERE,
so I tried the following and it worked:
select *
from
(
SELECT ROW_NUMBER() OVER(ORDER BY LogId) AS RowNum
FROM [Log] l
) as t
where t.RowNum = 1
Is there a better way, from performance point of view, to make this query?
Thanks in advance.
That's just the way it is.
Column aliases can not be used on the same logical level where they were defined. You will have to use the derived table (sub-query) as you have found out.
If you are concerned about performance, then don't. The derived table is mere syntactical sugar, it won't make the query slower (compared to the solution you tried first).
An alternative to this specific query, which won't perform any different but is simpler to write:
SELECT TOP 1 <col list> FROM dbo.[Log] ORDER BY LogId;
As @a_horse explained, don't assume that your second query is more expensive just because it looks like more code. If you want to measure the efficiency of different queries that get the same results, compare their execution plans, not code complexity.
I'm working on a query on the SEDE:
select top 20
row_number() over(order by "percentage approved" desc, approved desc),
row_number() over(order by "total edits" asc),
*
from editors
where "total edits" > 30
What is the ordering of the result set, taking into account the two window functions?
I suspect it's undefined but couldn't find a definitive answer. OTOH, results from queries with one such window function were ordered according to the over(order by ...) clause.
The results can be returned in any order.
Now, they will often be returned in the same order as specified in the OVER clause, but this is just because SQL Server is likely to pick a query plan that sorts the rows to calculate the aggregate. This is by no means guaranteed, as it could pick a different query plan at any time, especially as you make your query more complex which extends the space of possible query plans.
The order of the result set of ANY SQL Server query that doesn't have an explicit ORDER BY is undefined.
This includes when you have window functions within the query, or an ORDER BY in a subquery. The result order will depend on a lot of factors, none of which are guaranteed unless you specify an ORDER BY.
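A small sketch of the point, using Python's built-in sqlite3 module instead of SQL Server (an assumption made so the example is runnable; SQLite 3.25 or newer is needed). The editors rows and the column names pct_approved, approved and total_edits are hypothetical stand-ins for the SEDE columns. Both window functions keep their own ordering, and only the outer ORDER BY decides the order in which rows come back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data standing in for the SEDE editors table.
conn.execute(
    "CREATE TABLE editors(name TEXT, pct_approved REAL, approved INT, total_edits INT)"
)
conn.executemany(
    "INSERT INTO editors VALUES (?, ?, ?, ?)",
    [
        ("ann", 0.90, 90, 100),
        ("bob", 0.80, 40, 50),
        ("cy", 0.95, 38, 40),
        ("dee", 1.00, 10, 10),  # filtered out by total_edits > 30
    ],
)

rows = conn.execute(
    """
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY pct_approved DESC, approved DESC) AS by_pct,
           ROW_NUMBER() OVER (ORDER BY total_edits ASC) AS by_edits
    FROM editors
    WHERE total_edits > 30
    ORDER BY by_pct  -- the only clause that fixes the output order
    """
).fetchall()

print(rows)
```

Swapping the final clause to ORDER BY by_edits would return the same row numbers in a different row order, which is why the window orderings alone cannot be relied on.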