Does partition by / order by imply ordering in a query? - sql

I see no difference in:
select
ID,
TYPE,
XTIME,
first_value(XTIME) over (partition by TYPE order by XTIME)
from SERIES;
and:
select
ID,
TYPE,
XTIME,
first_value(XTIME) over (partition by TYPE order by XTIME)
from SERIES
order by TYPE, XTIME;
Does partition by / order by imply ordering in a query?

No it does not.
The result set from a query is ordered only when it has an explicit order by clause. Otherwise the result set is in an arbitrary order. Often, the underlying algorithms used for running the query determine the ordering of the result set.
You can only depend on the final ordering when you have an order by clause for the outermost select. That the result sets are in the same order is simply a coincidence.
I am curious why you don't simply use this?
min(xtime) over (partition by type)

The difference between the two queries is that the second one would be ordered ascending by TYPE and then XTIME, while the first one would, in theory, have random ordering.
The PARTITION BY and ORDER BY clause used in the call to FIRST_VALUE have an effect on that particular analytic function, but do not by themselves affect the final order of the result set.
In general, if you want to order a result set in SQL, you need to use an ORDER BY clause. If both your queries happen to have the same ordering right now, it is only by chance, and is not guaranteed. How might this happen? Well, if Oracle is iterating the clustered index of your table, and that ordering happens to coincide with ORDER BY TYPE, XTIME, then you would see this order even without using ORDER BY.

Related

In SQL, does groupby on an ordered query behave the same as doing both in the same query?

Are the following queries identical, or might I get different results (in any major DB system, e.g. MSSQL, MySQL, Postgres, SQLite):
Doing both in the same query:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
vs. ordering in a subquery:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Looking at the first sample:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
Let's think about what GROUP BY does by looking at this imaginary sample data:
A B
- -
1 1
1 2
Then think about this query:
SELECT A
FROM SampleData
GROUP BY A
ORDER BY B
The GROUP BY clause puts the two rows into a single group. Then we want to order by B... but the two rows in the group have different values for B. Which should it use?
Obviously in this situation it doesn't really matter: there's only one row in the results, so the order is not relevant. But generally, how does the database know what to do?
The database could guess which one you want, or just take the first value, or the last — whatever those mean in a setting where the data is unordered by definition. And in fact this is what MySql will try to do for you: it will try to guess are your meaning. But this response is really inappropriate. You specified an in-exact query; the only correct thing to do is throw an error, which is what most databases will do.
Now let's look at the second sample:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Here it is important to remember databases have their roots in relational set theory, and what we think of as "tables" are more formally described as Unordered Relations. Again: the idea of being "unordered" is baked into the very nature of a table at the deepest level.
In this case the inner query can run and create results in the specified order, and then the outer query can use that with GROUP BY to create a new set... but just like tables, query results are unordered relations. Without an ORDER BY clause the final result is also unordered by definition.
Now you might tend to get results in the order you want, but the reality is all bets are off. In fact, the databases that run this query will tend to give you results in the order in which they first encountered each group, which will not tend to match the ORDER BY because the GROUP BY expression is looking at completely different columns. Other databases (Sql Server is in this group) will not even allow the query to run, though I might prefer a warning here.
So now we come to the final section, where we must re-think the question, like this:
How can I use GROUP BY on the one group column, while also ordering by some_other_column not in the group?
The answer is each group can contain multiple rows, and so you must tell the database which row to look at to get the correct (specific) some_other_column value. The typical way to do this is with another aggregate function, which might look like this:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_agg_func(some_other_column)
That code will run without error on pretty much any database.
Just be careful here. On one hand, when people want to do this it's often for the common case where they know every record for some_other_column in each group will have the same value. For example, you might GROUP BY UserID, but ORDER BY Email, where of course every record with the same UserID should have the same Email address. As humans, we have the ability to make that kind of inference. Computers, however, don't handle that kind of thinking as well, and so we help it out with an extra aggregate function like MIN() or MAX().
On the other hand, if you're not careful sometimes the two different aggregate functions don't match up, and you end up showing the value from one row in the group, while using a completely different row from the group for the ORDER BY expression in a way that is not good.
Tables are unordered sets of data. A query result is a table. So if you select from a subquery that contains an ORDER BY clause, that clause means nothing; the data set is unordered by definition. The DBMS is free to ignore the ORDER BY clause. Some DBMS may even issue a warning or error, but I suppose it's more common that the ORDER BY clause just has no effect - at least not guaranteed.
In this query
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
you try to order your results by some_other_value. If this is meant to be a column, you can't, because that other column is no part of your results. You'll get a syntax error. If some_other_value is a fixed value, then there is nothing ordered, because you'd have the same sort key for every row. But it can be an expression based on your result data (group key and aggreation results) and you can order your result rows by that.
In this query
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
the ORDER BY clause has no effect. You could just as well just select FROM my_table directly:
SELECT group, some_agg_func(some_value)
FROM my_table as alias
GROUP BY group
This gets the results unordered (or at least the order you see is not guaranteed to be thus every time you run that query), because your query doesn't have an ORDER BY clause.

SQL Group By Column Part Number giving the data from most recent received date

New qith SQL my group by is not working and I am wanting it to pull the most recent POReleases.DateReceived date and group by part number. Here is what I have
SELECT POReleases.PONum, POReleases.PartNo, POReleases.JobNo, POReleases.Qty, POReleases.QtyRejected, POReleases.QtyCanceled, POReleases.DueDate, POReleases.DateReceived, PODet.ProdCode, PODet.Unit, PODet.UnitCost, PODet.QtyOrd, PODet.QtyRec, PODet.QtyReject, PODet.QtyCancel
FROM Waples.dbo.PODet PODet, Waples.dbo.POReleases POReleases
WHERE PODet.PartNo = POReleases.PartNo AND PODet.PONum = POReleases.PONum AND ((POReleases.DateReceived>{ts '2010-01-01 00:00:00'}))
GROUP BY PartNo
For starters, columns specified in the GROUP BY should be present in the select statement too. Here in your case only "PartNo" is used in GROUP BY clause whereas so many columns are used in the SELECT statement.
You can try WITH CTE to achieve this,
WITH CTE AS (
SELECT *, ROW_NUMBER() OVER( PARTITION BY PartNo ORDER BY POReleases.DateReceived DESC) AS PartNoCount
FROM TABLENAME
) SELECT * FROM CTE
When you write an SQL statement, you should think about the logical flow, which might be technically slightly inaccurate due to optimizations, but still, it is a good thing to think about it like this:
without the from clause specifying the source relation, the filter cannot be evaluated, so at least logically, the from is the first thing to evaluate
without the where clause specifying which records should be kept from the source relation, the filtered records cannot be grouped, so, at least logically, the where precedes the group by
without the group by, specifying the groups, you cannot select values from the groups, so, at least logically, group by precedes select
So, the projection (select) is executed on the groups of filtered records, which are groups themselves. Since the groups have an attribute, namely PartNo, it becomes an aggregated column. The other columns, which were reachable before the group by, can no longer be reached in the select. If you want to reach them, you need to group by them as well, or use aggregated functions for them, since if you have a group by, you will be able to select only the aggregated columns, which are either aggregated functions or columns which became aggregated due to their presence in the group by.
Since you did not specify how this query is not working, I will have to assume that you have a syntax error in the selection, due to the fact that you refer to columns which are not aggregated. Also, you might want to use join instead of Descartes multiplication and finally, if you want to filter the groups, not the records of the initial relation (which is the result of a Descartes multiplication in your case), then you might consider using a having clause.

Does ROW_NUMBER queried data require to fetch the entire result set?

I'm trying to write query that's not using offset (because as I just have learnt offset fetches all data which causes performance overhead). with ROW_NUMBER window function. For instance:
SELECT id
FROM(
SELECT id, ROW_NUMBER() over (order by id) rn
FROM users) sq
WHERE rn > 1000
Does it require all rows to be fetched as it would be with offset 1000? I mean, does it make a sense to use such query instead of
SELECT if
FROM users
OFFSET 1000
? Do I get performance improvement on large amount of data?
Check out the window function docs. Window functions operate on the result set, after the fetch:
Window functions are permitted only in the SELECT list and the ORDER
BY clause of the query. They are forbidden elsewhere, such as in GROUP
BY, HAVING and WHERE clauses. This is because they logically execute
after the processing of those clauses. Also, window functions execute
after regular aggregate functions. This means it is valid to include
an aggregate function call in the arguments of a window function, but
not vice versa.
Does it make sense to use the row_number() query? Well, it produces the same result set. However, the query basically has to assign row_number() to all the rows in order to find the ones that meet the requirement.
The second query, however, is lacking an order by. When using offset, you should have an order by:
SELECT id
FROM users u
ORDER BY id
OFFSET 1000
I would imagine that this is more efficient than using row_number(), but actual timings would demonstrate that.

What is the result set ordering when using window functions that have `order by` components?

I'm working on a query on the SEDE:
select top 20
row_number() over(order by "percentage approved" desc, approved desc),
row_number() over(order by "total edits" asc),
*
from editors
where "total edits" > 30
What is the ordering of the result set, taking into account the two window functions?
I suspect it's undefined but couldn't find a definitive answer. OTOH, results from queries with one such window function were ordered according to the over(order by ...) clause.
The results can be returned in any order.
Now, they will often be returned in the same order as specified in the OVER clause, but this is just because SQL Server is likely to pick a query plan that sorts the rows to calculate the aggregate. This is by no means guaranteed, as it could pick a different query plan at any time, especially as you make your query more complex which extends the space of possible query plans.
The result set of ANY SQL Server query that doesn't have an explicit ORDER BY is undefined.
This includes when you have window functions within the query, or an ORDER BY in a subquery. The result order will depend on a lot of factors, none of which are guaranteed unless you specify an ORDER BY.

max(), group by and order by

I have following SQL statement.
SELECT t.client_id,max(t.points) AS "max" FROM sessions GROUP BY t.client_id;
It simply lists client id's with maximum amount of points they've achieved. Now I want to sort the results by max(t.points). Normally I would use ORDER BY, but I have no idea how to use it with groups. I know using value from SELECT list is prohibited in following clauses, so adding ORDER BY max at the end of query won't work.
How can I sort those results after grouping, then?
Best regards
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY t.client_id
order by max(t.points) desc
It is not quite correct that values from the SELECT list are prohibited in following clauses. In fact, ORDER BY is logically processed after the SELECT list and can refer to SELECT list result names (in contrast with GROUP BY). So the normal way to write your query would be
SELECT t.client_id, max(t.points) AS "max"
FROM sessions
GROUP BY t.client_id
ORDER BY max;
This way of expressing it is SQL-92 and should be very portable. The other way to do it is by column number, e.g.,
ORDER BY 2;
These are the only two ways to do this in SQL-92.
SQL:1999 and later also allow referring to arbitrary expressions in the sort list, so you could just do ORDER BY max(t.points), but that's clearly more cumbersome, and possibly less portable. The ordering by column number was removed in SQL:1999, so it's technically no longer standard, but probably still widely supported.
Since you have tagged as Postgres: Postgres allows a non-standard GROUP BY and ORDER BY column number. So you could have
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY 1
order by 2 desc
After parsing, this is identical to RedFilter’s solution.