Why can't I GROUP BY 1 when it's OK to ORDER BY 1? - sql

Why are column ordinals legal for ORDER BY but not for GROUP BY? That is, can anyone tell me why this query
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY OrgUnitID
cannot be written as
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY 1
When it's perfectly legal to write a query like
SELECT OrgUnitID FROM Employee AS e ORDER BY 1
?
I'm really wondering if there's something subtle about the relational calculus, or something, that would prevent the grouping from working right.
The thing is, my example is pretty trivial. It's common that the column that I want to group by is actually a calculation, and having to repeat the exact same calculation in the GROUP BY is (a) annoying and (b) makes errors during maintenance much more likely. Here's a simple example:
SELECT DATEPART(YEAR,LastSeenOn), COUNT(*)
FROM Employee AS e
GROUP BY DATEPART(YEAR,LastSeenOn)
I would think that SQL's rule of normalize to only represent data once in the database ought to extend to code as well. I'd want to only right that calculation expression once (in the SELECT column list), and be able to refer to it by ordinal in the GROUP BY.
Clarification: I'm specifically working on SQL Server 2008, but I wonder about an overall answer nonetheless.

One of the reasons is because ORDER BY is the last thing that runs in a SQL Query, here is the order of operations
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
so once you have the columns from the SELECT clause you can use ordinal positioning
EDIT, added this based on the comment
Take this for example
create table test (a int, b int)
insert test values(1,2)
go
The query below will parse without a problem, it won't run
select a as b, b as a
from test
order by 6
here is the error
Msg 108, Level 16, State 1, Line 3
The ORDER BY position number 6 is out of range of the number of items in the select list.
This also parses fine
select a as b, b as a
from test
group by 1
But it blows up with this error
Msg 164, Level 15, State 1, Line 3
Each GROUP BY expression must contain at least one column that is not an outer reference.

There is a lot of elementary inconsistencies in SQL, and use of scalars is one of them. For example, anyone might expect
select * from countries
order by 1
and
select * from countries
order by 1.00001
to be a similar queries (the difference between the two can be made infinitesimally small, after all), which are not.

I'm not sure if the standard specifies if it is valid, but I believe it is implementation-dependent. I just tried your first example with one SQL engine, and it worked fine.

use aliasses :
SELECT DATEPART(YEAR,LastSeenOn) as 'seen_year', COUNT(*) as 'count'
FROM Employee AS e
GROUP BY 'seen_year'
** EDIT **
if GROUP BY alias is not allowed for you, here's a solution / workaround:
SELECT seen_year
, COUNT(*) AS Total
FROM (
SELECT DATEPART(YEAR,LastSeenOn) as seen_year, *
FROM Employee AS e
) AS inline_view
GROUP
BY seen_year

databases that don't support this basically are choosing not to. understand the order of the processing of the various steps, but it is very easy (as many databases have shown) to parse the sql, understand it, and apply the translation for you. Where its really a pain is when a column is a long case statement. having to repeat that in the group by clause is super annoying. yes, you can do the nested query work around as someone demonstrated above, but at this point it is just lack of care about your users to not support group by column numbers.

Related

In SQL, does groupby on an ordered query behave the same as doing both in the same query?

Are the following queries identical, or might I get different results (in any major DB system, e.g. MSSQL, MySQL, Postgres, SQLite):
Doing both in the same query:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
vs. ordering in a subquery:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Looking at the first sample:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
Let's think about what GROUP BY does by looking at this imaginary sample data:
A B
- -
1 1
1 2
Then think about this query:
SELECT A
FROM SampleData
GROUP BY A
ORDER BY B
The GROUP BY clause puts the two rows into a single group. Then we want to order by B... but the two rows in the group have different values for B. Which should it use?
Obviously in this situation it doesn't really matter: there's only one row in the results, so the order is not relevant. But generally, how does the database know what to do?
The database could guess which one you want, or just take the first value, or the last — whatever those mean in a setting where the data is unordered by definition. And in fact this is what MySql will try to do for you: it will try to guess are your meaning. But this response is really inappropriate. You specified an in-exact query; the only correct thing to do is throw an error, which is what most databases will do.
Now let's look at the second sample:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Here it is important to remember databases have their roots in relational set theory, and what we think of as "tables" are more formally described as Unordered Relations. Again: the idea of being "unordered" is baked into the very nature of a table at the deepest level.
In this case the inner query can run and create results in the specified order, and then the outer query can use that with GROUP BY to create a new set... but just like tables, query results are unordered relations. Without an ORDER BY clause the final result is also unordered by definition.
Now you might tend to get results in the order you want, but the reality is all bets are off. In fact, the databases that run this query will tend to give you results in the order in which they first encountered each group, which will not tend to match the ORDER BY because the GROUP BY expression is looking at completely different columns. Other databases (Sql Server is in this group) will not even allow the query to run, though I might prefer a warning here.
So now we come to the final section, where we must re-think the question, like this:
How can I use GROUP BY on the one group column, while also ordering by some_other_column not in the group?
The answer is each group can contain multiple rows, and so you must tell the database which row to look at to get the correct (specific) some_other_column value. The typical way to do this is with another aggregate function, which might look like this:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_agg_func(some_other_column)
That code will run without error on pretty much any database.
Just be careful here. On one hand, when people want to do this it's often for the common case where they know every record for some_other_column in each group will have the same value. For example, you might GROUP BY UserID, but ORDER BY Email, where of course every record with the same UserID should have the same Email address. As humans, we have the ability to make that kind of inference. Computers, however, don't handle that kind of thinking as well, and so we help it out with an extra aggregate function like MIN() or MAX().
On the other hand, if you're not careful sometimes the two different aggregate functions don't match up, and you end up showing the value from one row in the group, while using a completely different row from the group for the ORDER BY expression in a way that is not good.
Tables are unordered sets of data. A query result is a table. So if you select from a subquery that contains an ORDER BY clause, that clause means nothing; the data set is unordered by definition. The DBMS is free to ignore the ORDER BY clause. Some DBMS may even issue a warning or error, but I suppose it's more common that the ORDER BY clause just has no effect - at least not guaranteed.
In this query
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
you try to order your results by some_other_value. If this is meant to be a column, you can't, because that other column is no part of your results. You'll get a syntax error. If some_other_value is a fixed value, then there is nothing ordered, because you'd have the same sort key for every row. But it can be an expression based on your result data (group key and aggreation results) and you can order your result rows by that.
In this query
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
the ORDER BY clause has no effect. You could just as well just select FROM my_table directly:
SELECT group, some_agg_func(some_value)
FROM my_table as alias
GROUP BY group
This gets the results unordered (or at least the order you see is not guaranteed to be thus every time you run that query), because your query doesn't have an ORDER BY clause.

SQL GROUP BY 1 2 3 and SQL Order of Execution

This may be a dumb question but I am really confused. So according to the SQL Query Order of Execution, the GROUP BY clause will be executed before the SELECT clause. However it allows to do something like:
SELECT field_1, SUM(field_2) FROM myTable GROUP BY 1
My confusion is that if GROUP BY clause happens before SELECT, in this scenario I provided, how does SQL know what 1 is? It works with ORDER BY clause and it makes sense to me because ORDER BY clause happens after SELECT.
Can someone help me out? Thanks in advance!
https://www.periscopedata.com/blog/sql-query-order-of-operations
My understanding is because it's ordinal notation and for the SELECT statement to pass syntax validation you have to have at least selected a column. So the 1 is stating the first column in the select statement since it knows you have a column selected.
EDIT:
I see people saying you can't use ordinal notation and they are right if you're using SQL Server. You can use it in MySQL though.
select a,b,c from emp group by 1,2,3. First it will group by column a then b and c. It works based on the column after the select statement.
Each GROUP BY expression must contain at least one column that is not an outer reference. You cannot group by 1 if it is not a column in your table.

SQL Server - Pagination Without Order By Clause

My situation is that a SQL statement which is not predictable, is given to the program and I need to do pagination on top of it. The final SQL statement would be similar to the following one:
SELECT * FROM (*Given SQL Statement*) b
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;
The problem here is that the *Given SQL Statement* is unpredictable. It may or may not contain order by clause. I am not able to change the query result of this SQL Statement and I need to do pagination on it.
I searched for solution on the Internet, but all of them suggested to use an arbitrary column, like primary key, in order by clause. But it will change the original order.
The short answer is that it can't be done, or at least can't be done properly.
The problem is that SQL Server (or any RDBMS) does not and can not guarantee the order of the records returned from a query without an order by clause.
This means that you can't use paging on such queries.
Further more, if you use an order by clause on a column that appears multiple times in your resultset, the order of the result set is still not guaranteed inside groups of values in said column - quick example:
;WITH cte (a, b)
AS
(
SELECT 1, 'a'
UNION ALL
SELECT 1, 'b'
UNION ALL
SELECT 2, 'a'
UNION ALL
SELECT 2, 'b'
)
SELECT *
FROM cte
ORDER BY a
Both result sets are valid, and you can't know in advance what will you get:
a b
-----
1 b
1 a
2 b
2 a
a b
-----
1 a
1 b
2 a
2 b
(and of course, you might get other sorts)
The problem here is that the *Given SQL Statement" is unpredictable. It may or may not contain order by clause.
your inner query(unpredictable sql statement) should not contain order by,even if it contains,order is not guaranteed.
To get guaranteed order,you have to order by some column.for the results to be deterministic,the ordered column/columns should be unique
Please note: what I'm about to suggest is probably horribly inefficient and should really only be used to help you go back to the project leader and tell them that pagination of an unordered query should not be done. Having said that...
From your comments you say you are able to change the SQL statement before it is executed.
You could write the results of the original query to a temporary table, adding row count field to be used for subsequent pagination ordering.
Therefore any original ordering is preserved and you can now paginate.
But of course the reason for needing pagination in the first place is to avoid sending large amounts of data to the client application. Although this does prevent that, you will still be copying data to a temp table which, depending on the row size and count, could be very slow.
You also have the problem that the page size is coming from the client as part of the SQL statement. Parsing the statement to pick that out could be tricky.
As other notified using anyway without using a sorted query will not be safe, But as you know about it and search about it, I can suggest using a query like this (But not recommended as a good way)
;with cte as (
select *,
row_number() over (order by (select 0)) rn
from (
-- Your query
) t
)
select *
from cte
where rn between (#pageNumber-1)*#pageSize+1 and #pageNumber*#pageSize
[SQL Fiddle Demo]
I finally found a simple way to do it without any order by on a specific column:
declare #start AS INTEGER = 1, #count AS INTEGER = 5;
select * from (SELECT *,ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS fakeCounter
FROM (select * from mytable) AS t) AS t2 order by fakeCounter OFFSET #start ROWS
FETCH NEXT #count ROWS ONLY
where select * from mytable can be any query

SQL Syntax - Why do we need to list individual fields in an SQL group-by statement?

My understanding of using summary functions in SQL is that each field in the select statement that doesn't use a summary function, should be listed in the group by statement.
select a, b, c, sum(n) as sum_of_n
from table
group by a, b, c
My question is, why do we need to list the fields? Shouldn't the SQL syntax parser be implemented in a way that we can just tell it to group and it can figure out the groups based on whichever fields are in the select and aren't using summary functions?:
select a, b, c, sum(n) as sum_of_n
from table
group
I feel like I'm unnecessarily repeating myself when I write SQL code. What circumstances exist where we would not want it to automatically figure this out, or where it couldn't automatically figure this out?
To decrease the chances of errors in your statement. Explicitly spelling out the GROUP BY columns helps to ensure that the user wrote would they intended to write. You might be surprised at the number of posts that show up on Stackoverflow in which the user is grouping on columns that make no sense, but they have no idea why they aren't getting the data that they expect.
Also, consider the scenario where a user might want to group on more columns than are actually in the SELECT statement. For example, if I wanted the average of the most money that my customers have spent then I might write something like this:
SELECT
AVG(max_amt)
FROM (SELECT MAX(amt) FROM Invoices GROUP BY customer_id) SQ
In this case I can't simply use GROUP, I need to spell out the column(s) on which I'm grouping. The SQL engine could allow the user to explicitly list columns, but use a default if they are not listed, but then the chances of bugs drastically increases.
One way to think of it is like strongly typed programming languages. Making the programmer explicitly spell things out decreases the chance of bugs popping up because the engine made an assumption that the programmer didn't expect.
This is required to determine explicitly how do you want to group the records because, for example, you may use columns for grouping that are not listed in result set.
However, there are RDBMS which allow to not specify GROUP BY clause using aggregate functions like MySQL.
My first reaction would be that 'it is what it is' =)
But on thinking it through, the reason TSQL works like this is because the SELECT and the GROUP BY are two distinct parts of all the operations going on in the query.
This might not be the best example, but it does show that you can GROUP on different (well, 'more') fields than you are actually SELECTing.
SELECT brand = Convert(varchar(100), ''), model = Convert(varchar(100), ''), some_number = Convert(int, 0)
INTO #test
WHERE 1 = 2
INSERT #test (brand, model, some_number)
VALUES ('Ford', 'Focus', 10),
('Ford', 'Focus', 25),
('Ford', 'Kagu', 23),
('DMC', '12', 88)
SELECT brand, model, MAX(some_number)
FROM #test
GROUP BY brand, model
SELECT brand, MAX(some_number)
FROM #test
GROUP BY brand, model
Not all RDBMS's are like this, e.g. MySQL allows for omitting fields from the GROUP BY that are nevertheless in the SELECT part. From what I've seen, it then picks a random value ('there is no such a thing as an implicit first') and uses that in the SELECT .. I think, my knowledge on MySQL is rather limited but I've seen some examples here and there and they always confused me as I'm used to the strict requirement of TSQL you just described.
In addition, you can group by your columns in a different order than select
select a, b, c, sum(d)
from table
group by c,a,b
Also a lot of DBs allow you to skip column names, you can just specify which columns are going to be included in the group by using select position
select a, b, c, sum(d)
from table
group by 3,1,2

Confused syntax in Where clause

what does the line (rowid,0) mean in the following query
select * from emp
WHERE (ROWID,0) in (
select rowid, mod(rownum,2) from emp
);
i dont get the line WHERE (ROWID,0).
what is it?
thanx in advance
IN clause in Oracle SQL can support column groups. You can do things like this:
select ...
from tab1
where (tab1.col1, tab1.col2) in (
select tab2.refcol1, tab2.refcol2
from tab2
)
That can be useful in many cases.
In your particular case, the subquery use for the second expression mod(rownum,2). Since there is no order by, that means that rownum will be in whichever order the database retrieves the rows - that might be a full table scan or a fast full index scan.
Then by using mod every other row in the subquery gets the value 0, every other row gets the value 1.
The IN clause then filters on second value in the subquery being equal to 0. The end result is that this query retrieves half of your employees. Which half will depend on which access path the optimizer chooses.
Not sure what dialect of sql you're using, but it appears that since the subquery in the IN clause has two columns in the select list, then the (ROWID,0) indicates which columns align with the subquery. I have never seen multiple columns in an IN statment's select list before.
This is a syntax used by some databases (but not all) that allows you to do in with multiple values.
With in, this is the same as:
where exists (select 1
from emp e2
where e2.rowid = emp.rowid and
mod(rownum, 2) = 0
)
I should note that if you are using Oracle (which allows this syntax), then you are using rownum in a subquery with no order by. The results are going to be rather arbitrary. However, the intention seems to be to return every other row, in some sense.