What is the result set ordering when using window functions that have `order by` components? - sql

I'm working on a query on the SEDE:
select top 20
row_number() over(order by "percentage approved" desc, approved desc),
row_number() over(order by "total edits" asc),
*
from editors
where "total edits" > 30
What is the ordering of the result set, taking into account the two window functions?
I suspect it's undefined but couldn't find a definitive answer. OTOH, results from queries with one such window function were ordered according to the over(order by ...) clause.

The results can be returned in any order.
Now, they will often be returned in the same order as specified in the OVER clause, but this is just because SQL Server is likely to pick a query plan that sorts the rows to calculate the aggregate. This is by no means guaranteed, as it could pick a different query plan at any time, especially as you make your query more complex which extends the space of possible query plans.

The result set of ANY SQL Server query that doesn't have an explicit ORDER BY is undefined.
This includes when you have window functions within the query, or an ORDER BY in a subquery. The result order will depend on a lot of factors, none of which are guaranteed unless you specify an ORDER BY.

Related

There is any equivalent for TOP and BOTTOM in SQL that available in influxdb?

I am porting the queries in influx DB to timescaleDB (Postgres SQL). I am currently stuck in the TOP and BOTTOM functions. Is there any equivalent in Postgres SQL or any suggestions to achieve it?
For constant one, I did like that,
TOP('field', 1) -> MAX('field')
BOTTOM('field', 1) -> MIN('field')
What about others like,
TOP('field', 5)
BOTTOM('field', 5)
Edit 1:
Does Using LIMIT with ORDER BY also work with GROUP BY because the limit is executed after group by Right What if want something like this
Thank You
Using window functions is probably the most versatile way to do this:
select *
from (
select t.*,
dense_rank() over (partition by ??? order by ??? asc) as rnk
from the_table t
) x
where x.rnk = 3; --<< adjust here
Rows in a relational database have no implied sort order. So "top" or "bottom" only makes sense if you also provide an order by. From your question is completely unclear what that would be.
Using order by .. asc returns the "bottom rows", using order by .. desc returns the "top rows"
If you want top/bottom for the entire table (instead of one row "per group"), the leave out the partition by
dense_rank() will return multiple rows with the same "rank" when the rows have the same highest (or lowest) value in the column you are sorting by. If you don't want that (and pick an arbitrary one from those "duplicates") then use row_number() instead.

Does partition by / order by imply ordering in a query?

I see no difference in:
select
ID,
TYPE,
XTIME,
first_value(XTIME) over (partition by TYPE order by XTIME)
from SERIES;
and:
select
ID,
TYPE,
XTIME,
first_value(XTIME) over (partition by TYPE order by XTIME)
from SERIES
order by TYPE, XTIME;
Does partition by / order by imply ordering in a query?
No it does not.
The result set from a query is ordered only when it has an explicit order by clause. Otherwise the result set is in an arbitrary order. Often, the underlying algorithms used for running the query determine the ordering of the result set.
You can only depend on the final ordering when you have an order by clause for the outermost select. That the result sets are in the same order is simply a coincidence.
I am curious why you don't simply use this?
min(xtime) over (partition by type)
The difference between the two queries is that the second one would be ordered ascending by TYPE and then XTIME, while the first one would, in theory, have random ordering.
The PARTITION BY and ORDER BY clause used in the call to FIRST_VALUE have an effect on that particular analytic function, but do not by themselves affect the final order of the result set.
In general, if you want to order a result set in SQL, you need to use an ORDER BY clause. If both your queries happen to have the same ordering right now, it is only by chance, and is not guaranteed. How might this happen? Well, if Oracle is iterating the clustered index of your table, and that ordering happens to coincide with ORDER BY TYPE, XTIME, then you would see this order even without using ORDER BY.

Does ROW_NUMBER queried data require to fetch the entire result set?

I'm trying to write query that's not using offset (because as I just have learnt offset fetches all data which causes performance overhead). with ROW_NUMBER window function. For instance:
SELECT id
FROM(
SELECT id, ROW_NUMBER() over (order by id) rn
FROM users) sq
WHERE rn > 1000
Does it require all rows to be fetched as it would be with offset 1000? I mean, does it make a sense to use such query instead of
SELECT if
FROM users
OFFSET 1000
? Do I get performance improvement on large amount of data?
Check out the window function docs. Window functions operate on the result set, after the fetch:
Window functions are permitted only in the SELECT list and the ORDER
BY clause of the query. They are forbidden elsewhere, such as in GROUP
BY, HAVING and WHERE clauses. This is because they logically execute
after the processing of those clauses. Also, window functions execute
after regular aggregate functions. This means it is valid to include
an aggregate function call in the arguments of a window function, but
not vice versa.
Does it make sense to use the row_number() query? Well, it produces the same result set. However, the query basically has to assign row_number() to all the rows in order to find the ones that meet the requirement.
The second query, however, is lacking an order by. When using offset, you should have an order by:
SELECT id
FROM users u
ORDER BY id
OFFSET 1000
I would imagine that this is more efficient than using row_number(), but actual timings would demonstrate that.

Calculating SQL Server ROW_NUMBER() OVER() for a derived table

In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:
ROW_NUMBER() OVER()
This is particularly useful when used with ordered derived tables, such as:
SELECT t.*, ROW_NUMBER() OVER()
FROM (
SELECT ...
ORDER BY
) t
How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:
-- This order here ---------------------vvvvvvvv
SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
FROM (
SELECT TOP 100 PERCENT ...
-- vvvvv ----redefines this order here
ORDER BY
) t
A concrete example (as can be seen on SQLFiddle):
SELECT v, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM (
SELECT TOP 100 PERCENT 1 UNION ALL
SELECT TOP 100 PERCENT 2 UNION ALL
SELECT TOP 100 PERCENT 3 UNION ALL
SELECT TOP 100 PERCENT 4
-- This descending order is not maintained in the outer query
ORDER BY 1 DESC
) t(v)
Also, I cannot reuse any expression from the derived table to reproduce the ORDER BY clause in my case, as the derived table might not be available as it may be provided by some external logic.
So how can I do it? Can I do it at all?
The Row_Number() OVER (ORDER BY (SELECT 1)) trick should NOT be seen as a way to avoid changing the order of underlying data. It is only a means to avoid causing the server to perform an additional and unneeded sort (it may still perform the sort but it's going to cost the minimum amount possible when compared to sorting by a column).
All queries in SQL server ABSOLUTELY MUST have an ORDER BY clause in the outermost query for the results to be reliably ordered in a guaranteed way.
The concept of "retaining original order" does not exist in relational databases. Tables and queries must always be considered unordered until and unless an ORDER BY clause is specified in the outermost query.
You could try the same unordered query 100,000 times and always receive it with the same ordering, and thus come to believe you can rely on said ordering. But that would be a mistake, because one day, something will change and it will not have the order you expect. One example is when a database is upgraded to a new version of SQL Server--this has caused many a query to change its ordering. But it doesn't have to be that big a change. Something as little as adding or removing an index can cause differences. And more: Installing a service pack. Partitioning a table. Creating an indexed view that includes the table in question. Reaching some tipping point where a scan is chosen instead of a seek. And so on.
Do not rely on results to be ordered unless you have said "Server, ORDER BY".

max(), group by and order by

I have following SQL statement.
SELECT t.client_id,max(t.points) AS "max" FROM sessions GROUP BY t.client_id;
It simply lists client id's with maximum amount of points they've achieved. Now I want to sort the results by max(t.points). Normally I would use ORDER BY, but I have no idea how to use it with groups. I know using value from SELECT list is prohibited in following clauses, so adding ORDER BY max at the end of query won't work.
How can I sort those results after grouping, then?
Best regards
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY t.client_id
order by max(t.points) desc
It is not quite correct that values from the SELECT list are prohibited in following clauses. In fact, ORDER BY is logically processed after the SELECT list and can refer to SELECT list result names (in contrast with GROUP BY). So the normal way to write your query would be
SELECT t.client_id, max(t.points) AS "max"
FROM sessions
GROUP BY t.client_id
ORDER BY max;
This way of expressing it is SQL-92 and should be very portable. The other way to do it is by column number, e.g.,
ORDER BY 2;
These are the only two ways to do this in SQL-92.
SQL:1999 and later also allow referring to arbitrary expressions in the sort list, so you could just do ORDER BY max(t.points), but that's clearly more cumbersome, and possibly less portable. The ordering by column number was removed in SQL:1999, so it's technically no longer standard, but probably still widely supported.
Since you have tagged as Postgres: Postgres allows a non-standard GROUP BY and ORDER BY column number. So you could have
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY 1
order by 2 desc
After parsing, this is identical to RedFilter’s solution.