[see addendum below]
Recently I was going through an SQL script as part of a task to check functionality of a data science process. I had a copy of a section of the script which had multiple sub queries and I refactored it to put the sub queries up the top in a with-clause. I usually think of this as an essentially syntactic refactoring operation that is semantically neutral. However, the operation of the script changed.
Investigation showed that it was due to the use of a row number over a partition
in which the ordering within the partition was not complete. Changing the structure
of the code changed something in the execution plan, which in turn changed the order
of rows within the slack left by the incomplete ordering.
I made a note of this point and became less confident of this refactoring, although
I hold the position that order should not affect the semantics, at least as long as
it can be avoided.
My question is ...
other than assigning a row number, what operations have a value that is changed
by the ordering?
I realize now that the question was a bit too open - both answers below were useful to me, but I cannot pick one over the other as THE right answer. I have up-voted both. Thanks. [I rethought that, and will pick one of the answers, rather than none. The one I pick was a bit more on target].
I also realize that the core of the problem was my not having strongly enough in mind that any refactoring can potentially change the contingent order in which the rows are returned. From now on, if I refactor and it changes the result - I will look for issues with ordering.
When window functions are involved, especially ROW_NUMBER(), the first thing to check is whether the columns used for ordering produce a stable (fully determined) sort.
For instance:
CREATE TABLE t(id INT, grp VARCHAR(100), d DATE, val VARCHAR(100));
INSERT INTO t(id, grp, d, val)
VALUES (1, 'grpA', '2021-10-16', 'b')
,(2, 'grpA', '2021-10-16', 'a')
,(3, 'grpA', '2021-10-15', 'c')
,(4, 'grpA', '2021-10-14', 'd')
,(5, 'grpB', '2021-10-13', 'a')
,(6, 'grpB', '2021-10-13', 'g')
,(7, 'grpB', '2021-10-12', 'h');
-- the sort is not stable, d column has a tie
SELECT *
FROM (
SELECT t.*, ROW_NUMBER() OVER(PARTITION BY grp ORDER BY d DESC) AS rn
FROM t) sub
WHERE sub.rn = 1 AND sub.val = 'a';
Depending on the order in which tied rows are processed, it could return:
0 rows
1 row (id: 2)
1 row (id: 5)
2 rows (id: 2 and 5)
When the query is refactored, the engine may choose a different path to access the data, and thus return a different result.
To check whether the sort is stable, a windowed COUNT can be used, partitioned by all the columns used for partitioning and ordering:
SELECT *
FROM (
SELECT t.*, COUNT(*) OVER(PARTITION BY grp, d ) AS cnt
FROM t) sub
WHERE cnt > 1;
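A minimal sketch of both checks, run through Python's built-in sqlite3 module rather than the database from the question. It assumes the bundled SQLite is 3.25 or newer (window-function support) and that id is unique, so adding it to the ORDER BY works as a tie-breaker. The variable names are only illustrative.

```python
import sqlite3

# Assumption: the SQLite bundled with Python is 3.25+ (window functions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t(id INT, grp VARCHAR(100), d DATE, val VARCHAR(100));
INSERT INTO t(id, grp, d, val) VALUES
 (1, 'grpA', '2021-10-16', 'b'),
 (2, 'grpA', '2021-10-16', 'a'),
 (3, 'grpA', '2021-10-15', 'c'),
 (4, 'grpA', '2021-10-14', 'd'),
 (5, 'grpB', '2021-10-13', 'a'),
 (6, 'grpB', '2021-10-13', 'g'),
 (7, 'grpB', '2021-10-12', 'h');
""")

# Rows that share (grp, d) with another row: ORDER BY d alone cannot rank them.
tied_ids = [row[0] for row in conn.execute("""
    SELECT id
    FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY grp, d) AS cnt FROM t) sub
    WHERE cnt > 1
    ORDER BY id
""")]

# Assumption: id is unique, so (d DESC, id) is a complete ordering per group.
first_a_ids = [row[0] for row in conn.execute("""
    SELECT id
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY grp ORDER BY d DESC, id) AS rn
          FROM t) sub
    WHERE sub.rn = 1 AND sub.val = 'a'
    ORDER BY id
""")]

print(tied_ids)
print(first_a_ids)
```

With the tie-breaker in place, only one of the four outcomes listed above remains possible, whatever access path the engine picks.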
So are you saying you had gaps in the row numbers? Or duplicate row numbers? Or did the row numbers just jump around (unstable)?
Which functions are altered by an incomplete/unstable ORDER BY? All the ones where you put ORDER BY in the window function, thus ROW_NUMBER, LAG, or LEAD.
But in general a sub-select and a CTE (WITH clause) are the same. The primary difference is that multiple things can JOIN the same CTE (hence the "Common" part). This can be good or bad: you might save on an expensive calculation, but you might also slow down a critical path and make the whole execution slower.
Or the data might be a little more processed (due to JOINs etc.), and then the incomplete ORDER BY and its instability might be exposed.
SQL is not based on set theory but on list theory. It is true that many join-based operations have an output such that the underlying bag of elements is a function of the underlying bag of elements in the input - but there are operations, such as row_number() as mentioned, in which this is not the case.
I would like to add a more obscure effect not mentioned in the other answers so far. Floating point arithmetic. Since the order of adding up floating point numbers does actually make a difference, it is possible that using a different ordering clause can produce different floating point values.
In the case mentioned in the posted question, this did actually happen - although only in the 10th decimal place. But that can be enough to change which value is bigger than another, and so make a discrete and significant change to the result of the outermost query.
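A small illustration of the effect in Python, using made-up values chosen to exaggerate it. The helper naive_sum is hypothetical and adds strictly left to right; it is used instead of the built-in sum() because newer Python versions compensate for rounding inside sum(), which would hide the effect.

```python
def naive_sum(values):
    """Add the values strictly left to right, with no error compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three numbers in two different orders (illustrative values).
amounts = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]

# In the first order the 1.0 is lost to rounding when added to 1e16;
# in the second order the large values cancel first, so the 1.0 survives.
print(naive_sum(amounts))
print(naive_sum(reordered))
```

A database engine that adds rows in whatever order the plan delivers them can show the same kind of difference, only usually far down in the decimals.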
Another example would be LISTAGG. I inherited some code that used LISTAGG, but didn't give consistent answers when I tweaked it, because it didn't include the ordering clause: WITHIN GROUP ( ORDER BY ...).
From the Snowflake docs:
If you do not specify the WITHIN GROUP (<orderby_clause>), the order
of elements within each list is unpredictable. (An ORDER BY clause
outside the WITHIN GROUP clause applies to the order of the output
rows, not to the order of the list elements within a row.)
Related
Is there any special reason that SQL only implements FIRST_VALUE and LAST_VALUE as a windowing function instead of an aggregation function? I find it quite common to encounter problems such as "find the item with highest price in each category". While other languages (such as python) provide MIN/MAX functions with keywords such that
max(item_names, key=lambda x: revenue[x])
is possible, in SQL the only way to tackle this problem seems to be:
WITH temp as(
SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue) as fv
FROM catalog)
SELECT category, MAX(fv) -- MIN(fv) also OK
FROM temp
GROUP BY category;
Is there a special reason that there is no "aggregation version" of FIRST_VALUE such that
SELECT category, FIRST_VALUE(item_name, revenue)
FROM catalog
GROUP BY
category
or is it just the way it is?
That’s just the way it is, as far as I’m concerned. I suspect the only real answer would be “because it’s not in the SQL spec”, and the only people who could really answer as to why it’s not in the spec are the people who write it. Questions of the form “what was (name of relevant external authority) thinking when they mandated that (name of product) should operate like this” are actually typically off topic here because very few people can reliably and factually answer. I don’t even like my own answer here, as it feels like an extended comment on a question that cannot realistically be answered.
Aggregate functions work on sets of data and while some of them might require some implied ordering operation such as median, the functions are always about the column they’re operating on, not a “give me the value of this column based on the ordering of that column”.
There are plenty of window/analytic functions that don’t have a corresponding aggregate version, and window functions have a different intended use than aggregation. You could conceive that some of them perform aggregation and then join the aggregation result back to the main data in order to relate the agg result to the particular row, but I wouldn’t assume the two facilities (agg vs window) are related at all.
As far as I understand the python (not a python dev), it is not doing any aggregation, it's searching a list of item_name strings and looking each up in a dictionary that returns the revenue for that item, and returning the item_name that has the largest revenue. There wasn't any grouping there, it's much more like a SELECT TOP 1 item_name ORDER BY revenue and is only really good for returning a single item, rather than a load of items that are all maxes within their group, unless it's used within a loop that is processing a different list of item name each time
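To make that reading of the Python concrete, here is a short sketch with a made-up catalog (the rows and names are assumptions for illustration only). The max(..., key=...) call returns one winner for the whole list; getting one winner per category needs an explicit pass over the groups.

```python
# Hypothetical catalog rows: (category, item_name, revenue).
catalog = [
    ("fruit", "apple", 30),
    ("fruit", "pear", 45),
    ("tools", "hammer", 80),
    ("tools", "saw", 60),
]

# As in the question: look each item up in a dict and keep the largest.
revenue = {item: rev for _, item, rev in catalog}
best_overall = max(revenue, key=lambda x: revenue[x])

# One winner per category requires looping over the groups ourselves.
best_per_category = {}
for category, item, rev in catalog:
    current = best_per_category.get(category)
    if current is None or rev > current[1]:
        best_per_category[category] = (item, rev)

print(best_overall)
print(best_per_category)
```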
I know your question wasn't exactly about this particular SQL query but it may be helpful for you if I mention a couple of things on it. I'm not really sure what:
WITH temp as(
SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue) as fv
FROM catalog
)
SELECT category, MAX(fv) -- MIN(fv) also OK
FROM temp
GROUP BY category;
Gives you over something like:
SELECT DISTINCT category, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue) as fv
FROM catalog
The analytic/window function will produce the same value for every row in a category (the partition), so it seems that really all the extra GROUP BY is doing is reducing the repeated values. That could be done more simply by just getting the values you want and using DISTINCT to quash the duplicates (one of the few cases where I would advocate such).
In the more general sense of "I want the entire most X row as determined by highest/lowest Y" we typically use row number for that:
WITH temp as(
SELECT *, ROW_NUMBER() OVER(PARTITION BY category ORDER BY revenue) as rn
FROM catalog)
SELECT *
FROM temp
WHERE rn = 1;
Though I find it more compact/readable to dispense with the CTE and just use a sub query but YMMV
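A runnable sketch comparing the two forms, using Python's built-in sqlite3 module and a made-up catalog table (the data, and the choice of SQLite 3.25+ as the engine, are assumptions). Here the ordering is revenue DESC so that the highest-revenue item wins, with item_name added as a tie-breaker to keep the result deterministic.

```python
import sqlite3

# Assumption: SQLite 3.25+ (window functions); data is invented for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE catalog(category TEXT, item_name TEXT, revenue INT);
INSERT INTO catalog VALUES
 ('fruit', 'apple', 30), ('fruit', 'pear', 45),
 ('tools', 'hammer', 80), ('tools', 'saw', 60);
""")

# DISTINCT over the window result: one (category, item_name) pair per category.
distinct_rows = conn.execute("""
    SELECT DISTINCT category,
           FIRST_VALUE(item_name) OVER (
               PARTITION BY category ORDER BY revenue DESC, item_name) AS fv
    FROM catalog
    ORDER BY category
""").fetchall()

# ROW_NUMBER: the entire top row per category, not just one column.
row_number_rows = conn.execute("""
    WITH temp AS (
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY category ORDER BY revenue DESC, item_name) AS rn
        FROM catalog
    )
    SELECT category, item_name, revenue
    FROM temp
    WHERE rn = 1
    ORDER BY category
""").fetchall()

print(distinct_rows)
print(row_number_rows)
```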
I have to add a row number to my existing query so that I can track how much data has been added to Redis. If my query fails, I can then restart from the row number that was recorded in another table.
Query to get the data starting after row 1000 of the table:
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (Order by (select 1)) as rn ) as X where rn > 1000
The query is working fine. Is there any way to get the row number without using ORDER BY?
What is select 1 here?
Is the query optimized, or can I do it another way? Please suggest a better solution.
There is no need to worry about specifying constant in the ORDER BY expression. The following is quoted from the Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions written by Itzik Ben-Gan (it was available for free download from Microsoft free e-books site):
As mentioned, a window order clause is mandatory, and SQL Server
doesn’t allow the ordering to be based on a constant—for example,
ORDER BY NULL. But surprisingly, when passing an expression based on a
subquery that returns a constant—for example, ORDER BY (SELECT
NULL)—SQL Server will accept it. At the same time, the optimizer
un-nests, or expands, the expression and realizes that the ordering is
the same for all rows. Therefore, it removes the ordering requirement
from the input data. Here’s a complete query demonstrating this
technique:
SELECT actid, tranid, val,
ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM dbo.Transactions;
Observe in the properties of the Index Scan iterator that the Ordered
property is False, meaning that the iterator is not required to return
the data in index key order
The above means that when you order by a constant, no sorting is performed. I strongly recommend reading the book, as Itzik Ben-Gan describes in depth how window functions work and how to optimize various cases in which they are used.
Try just order by 1. Read the error message. Then reinstate the order by (select 1). Realise that whoever wrote this has, at some point, read the error message and then decided that the right thing to do is to trick the system into not raising an error rather than realising the fundamental truth that the error was trying to alert them to.
Tables have no inherent order. If you want some form of ordering that you can rely upon, it's up to you to provide enough deterministic expression(s) to any ORDER BY clause such that each row is uniquely identified and ordered.
Anything else, including tricking the system into not emitting errors, is hoping that the system will do something sensible without using the tools provided to you to ensure that it does something sensible - a well specified ORDER BY clause.
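For the restart-after-failure use case in the question, one alternative that follows this advice is to checkpoint on a unique key instead of a row number. The sketch below is not the asker's query: it swaps in keyset-style paging, and the table source_rows, its id column and the helper next_batch are all hypothetical. It runs against SQLite via Python's sqlite3 module purely for illustration.

```python
import sqlite3

# Hypothetical source table with a unique, ordered key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_rows(id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO source_rows(id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 11)],
)


def next_batch(conn, last_id, batch_size):
    """Return the next rows after the checkpoint, in a fully determined order."""
    return conn.execute(
        "SELECT id, payload FROM source_rows WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()


first = next_batch(conn, 0, 4)
checkpoint = first[-1][0]  # store this value, not a row number
second = next_batch(conn, checkpoint, 4)

print([row[0] for row in first])
print([row[0] for row in second])
```

Because the checkpoint is a key value and the ORDER BY is complete, rerunning after a failure picks up exactly where the previous batch ended.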
You can use any literal value, for example:
order by (select 0)
order by (select null)
order by (select 'test')
etc.
Refer to this for more information:
https://exploresql.com/2017/03/31/row_number-function-with-no-specific-order/
What is select 1 here?
In this scenario, the author of the query does not really have any particular sorting in mind.
ROW_NUMBER requires an ORDER BY clause, so providing one is a way to satisfy the parser.
Sorting by a "constant" produces a non-deterministic order (the query optimizer is free to choose whatever order it finds suitable).
Easiest way to think about it is as:
ROW_NUMBER() OVER(ORDER BY 1) -- error
ROW_NUMBER() OVER(ORDER BY NULL) -- error
There are a few possible ways to provide a constant expression that "tricks" the query optimizer:
ROW_NUMBER() OVER(ORDER BY (SELECT 1)) -- already presented
Other options:
ROW_NUMBER() OVER(ORDER BY 1/0) -- should not be used
ROW_NUMBER() OVER(ORDER BY @@SPID)
ROW_NUMBER() OVER(ORDER BY DB_ID())
ROW_NUMBER() OVER(ORDER BY USER_ID())
My understanding of using summary functions in SQL is that each field in the select statement that doesn't use a summary function, should be listed in the group by statement.
select a, b, c, sum(n) as sum_of_n
from table
group by a, b, c
My question is, why do we need to list the fields? Shouldn't the SQL syntax parser be implemented in a way that we can just tell it to group and it can figure out the groups based on whichever fields are in the select and aren't using summary functions?:
select a, b, c, sum(n) as sum_of_n
from table
group
I feel like I'm unnecessarily repeating myself when I write SQL code. What circumstances exist where we would not want it to automatically figure this out, or where it couldn't automatically figure this out?
To decrease the chances of errors in your statement. Explicitly spelling out the GROUP BY columns helps to ensure that the user wrote what they intended to write. You might be surprised at the number of posts that show up on Stack Overflow in which the user is grouping on columns that make no sense, but they have no idea why they aren't getting the data that they expect.
Also, consider the scenario where a user might want to group on more columns than are actually in the SELECT statement. For example, if I wanted the average of the most money that my customers have spent then I might write something like this:
SELECT
AVG(max_amt)
FROM (SELECT MAX(amt) AS max_amt FROM Invoices GROUP BY customer_id) SQ
In this case I can't simply use GROUP; I need to spell out the column(s) on which I'm grouping. The SQL engine could allow the user to explicitly list columns and fall back to a default when they are not listed, but then the chances of bugs drastically increase.
One way to think of it is like strongly typed programming languages. Making the programmer explicitly spell things out decreases the chance of bugs popping up because the engine made an assumption that the programmer didn't expect.
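The grouping-on-an-unselected-column case can be tried out with Python's built-in sqlite3 module. The Invoices rows below are invented for the demonstration, and the inner MAX(amt) is given the alias max_amt so the outer query can refer to it.

```python
import sqlite3

# Invented data: three customers with one or two invoices each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Invoices(customer_id INT, amt REAL);
INSERT INTO Invoices VALUES
 (1, 10.0), (1, 30.0),
 (2, 20.0), (2, 5.0),
 (3, 40.0);
""")

# customer_id drives the grouping but never appears in any SELECT list,
# so an implicit "group by whatever is selected" could not express this.
avg_of_max = conn.execute("""
    SELECT AVG(max_amt)
    FROM (SELECT MAX(amt) AS max_amt FROM Invoices GROUP BY customer_id) SQ
""").fetchone()[0]

print(avg_of_max)
```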
This is required so that you state explicitly how you want to group the records, because, for example, you may group by columns that are not listed in the result set.
However, there are RDBMSs, such as MySQL, that allow you to omit the GROUP BY clause when using aggregate functions.
My first reaction would be that 'it is what it is' =)
But on thinking it through, the reason TSQL works like this is because the SELECT and the GROUP BY are two distinct parts of all the operations going on in the query.
This might not be the best example, but it does show that you can GROUP on different (well, 'more') fields than you are actually SELECTing.
SELECT brand = Convert(varchar(100), ''), model = Convert(varchar(100), ''), some_number = Convert(int, 0)
INTO #test
WHERE 1 = 2
INSERT #test (brand, model, some_number)
VALUES ('Ford', 'Focus', 10),
('Ford', 'Focus', 25),
('Ford', 'Kagu', 23),
('DMC', '12', 88)
SELECT brand, model, MAX(some_number)
FROM #test
GROUP BY brand, model
SELECT brand, MAX(some_number)
FROM #test
GROUP BY brand, model
Not all RDBMSs are like this; e.g. MySQL allows omitting fields from the GROUP BY that are nevertheless in the SELECT part. From what I've seen, it then picks an arbitrary value ('there is no such thing as an implicit first') and uses that in the SELECT. I think so, at least: my knowledge of MySQL is rather limited, but I've seen some examples here and there and they always confused me, as I'm used to the strict requirement of TSQL you just described.
In addition, you can group by your columns in a different order than select
select a, b, c, sum(d)
from table
group by c,a,b
Also a lot of DBs allow you to skip column names, you can just specify which columns are going to be included in the group by using select position
select a, b, c, sum(d)
from table
group by 3,1,2
When I run the following query, I get a 'resource limited exceeded' error. If I remove the last line (the ORDER BY clause) it works:
SELECT
id,
INTEGER(-position / (CASE WHEN fallback = 0 THEN 2 ELSE 1 END)) AS major_sort
FROM (
SELECT
id,
fallback,
ROW_NUMBER() OVER(PARTITION BY fallback) AS position
FROM
[table] AS r
ORDER BY
r.score DESC ) AS r
ORDER BY major_sort DESC
Actually the entire last line would be:
ORDER BY major_sort DESC, r.score DESC
But that would probably make things even worse.
Any idea how I could change the query to circumvent this problem?
((If you wonder what this query does: the table contains a 'ranking' with multiple fallback strategies and I want to create an ordering like this: 'AABAABAABAAB' with 'A' and 'B' being the fallback strategies. If you have a better idea how to achieve this; please feel free to tell me :D))
A top-level ORDER BY will always serialize execution of your query: it will force all computation onto a single node for the purpose of sorting. That's the cause of the resources exceeded error.
I'm not sure I fully understand your goal with the query, so it's hard to suggest alternatives, but you might consider putting an ORDER BY clause within the OVER(PARTITION BY ...) clause. Sorting a single partition can be done in parallel and may be closer to what you want.
More general advice on ordering:
Order is not preserved during BQ queries, so if there's an ordering that you want to preserve on the input rows, make sure it's encoded in your data as an extra field.
The use cases for large amounts of globally-sorted data are somewhat limited. Often when users run into resource limitations with ORDER BY, we find that they're actually looking for something slightly different (locally ordered data, or "top N"), and that it's possible to get rid of the global ORDER BY completely.
In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:
ROW_NUMBER() OVER()
This is particularly useful when used with ordered derived tables, such as:
SELECT t.*, ROW_NUMBER() OVER()
FROM (
SELECT ...
ORDER BY
) t
How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:
-- This order here ---------------------vvvvvvvv
SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
FROM (
SELECT TOP 100 PERCENT ...
-- vvvvv ----redefines this order here
ORDER BY
) t
A concrete example (as can be seen on SQLFiddle):
SELECT v, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM (
SELECT TOP 100 PERCENT 1 UNION ALL
SELECT TOP 100 PERCENT 2 UNION ALL
SELECT TOP 100 PERCENT 3 UNION ALL
SELECT TOP 100 PERCENT 4
-- This descending order is not maintained in the outer query
ORDER BY 1 DESC
) t(v)
Also, I cannot reuse any expression from the derived table to reproduce the ORDER BY clause in my case, as the derived table might not be available as it may be provided by some external logic.
So how can I do it? Can I do it at all?
The Row_Number() OVER (ORDER BY (SELECT 1)) trick should NOT be seen as a way to avoid changing the order of underlying data. It is only a means to avoid causing the server to perform an additional and unneeded sort (it may still perform the sort but it's going to cost the minimum amount possible when compared to sorting by a column).
All queries in SQL server ABSOLUTELY MUST have an ORDER BY clause in the outermost query for the results to be reliably ordered in a guaranteed way.
The concept of "retaining original order" does not exist in relational databases. Tables and queries must always be considered unordered until and unless an ORDER BY clause is specified in the outermost query.
You could try the same unordered query 100,000 times and always receive it with the same ordering, and thus come to believe you can rely on said ordering. But that would be a mistake, because one day, something will change and it will not have the order you expect. One example is when a database is upgraded to a new version of SQL Server--this has caused many a query to change its ordering. But it doesn't have to be that big a change. Something as little as adding or removing an index can cause differences. And more: Installing a service pack. Partitioning a table. Creating an indexed view that includes the table in question. Reaching some tipping point where a scan is chosen instead of a seek. And so on.
Do not rely on results to be ordered unless you have said "Server, ORDER BY".
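A small sketch of what saying it explicitly looks like, for the case where the ordering expression is available to the outer query (which the question notes is not always so). It uses Python's built-in sqlite3 module with SQLite 3.25+ as a stand-in engine, so it illustrates the principle rather than SQL Server's behavior; the derived table carries no ORDER BY at all, and both the numbering and the output order are stated where they are needed.

```python
import sqlite3

# Assumption: SQLite 3.25+ (window functions), used here only as a stand-in.
conn = sqlite3.connect(":memory:")

rows = conn.execute("""
    SELECT v, ROW_NUMBER() OVER (ORDER BY v DESC) AS rn
    FROM (
        SELECT 1 AS v UNION ALL
        SELECT 2 UNION ALL
        SELECT 3 UNION ALL
        SELECT 4
    ) t
    ORDER BY rn  -- the outermost ORDER BY fixes the order of the result rows
""").fetchall()

print(rows)
```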