SQL Syntax - Why do we need to list individual fields in an SQL group-by statement? - sql

My understanding of using summary (aggregate) functions in SQL is that each field in the select statement that doesn't use a summary function should be listed in the group by statement.
select a, b, c, sum(n) as sum_of_n
from table
group by a, b, c
My question is, why do we need to list the fields? Shouldn't the SQL syntax parser be implemented so that we can just tell it to group, and it figures out the groups from whichever fields are in the select and aren't using summary functions? For example:
select a, b, c, sum(n) as sum_of_n
from table
group
I feel like I'm unnecessarily repeating myself when I write SQL code. What circumstances exist where we would not want it to automatically figure this out, or where it couldn't automatically figure this out?

To decrease the chances of errors in your statement. Explicitly spelling out the GROUP BY columns helps to ensure that the user wrote what they intended to write. You might be surprised at the number of posts that show up on Stack Overflow in which the user is grouping on columns that make no sense, but they have no idea why they aren't getting the data that they expect.
Also, consider the scenario where a user might want to group on more columns than are actually in the SELECT statement. For example, if I wanted the average of the most money that my customers have spent then I might write something like this:
SELECT
AVG(max_amt)
FROM (SELECT MAX(amt) AS max_amt FROM Invoices GROUP BY customer_id) SQ
In this case I can't simply use GROUP; I need to spell out the column(s) on which I'm grouping. The SQL engine could allow the user to explicitly list columns and fall back to a default when none are listed, but then the chances of bugs drastically increase.
One way to think of it is like strongly typed programming languages. Making the programmer explicitly spell things out decreases the chance of bugs popping up because the engine made an assumption that the programmer didn't expect.
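As a minimal sketch of the nested query above, here it is run against an in-memory SQLite database through Python's sqlite3 module. The Invoices rows are made-up sample data, and the max_amt alias is written out so the outer query has a column to refer to. The point is that customer_id drives the grouping even though it appears in no select list, so an implicit GROUP could not have inferred it.

```python
import sqlite3

# In-memory database with hypothetical sample invoices.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Invoices (customer_id INTEGER, amt REAL)")
conn.executemany(
    "INSERT INTO Invoices VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 50.0), (2, 20.0), (3, 40.0)],
)

# The inner query groups by customer_id without selecting it;
# the outer query averages the per-customer maximums.
avg_of_max = conn.execute(
    """
    SELECT AVG(max_amt)
    FROM (SELECT MAX(amt) AS max_amt
          FROM Invoices
          GROUP BY customer_id) AS SQ
    """
).fetchone()[0]

print(avg_of_max)
```

With this sample data the per-customer maximums are 30, 50 and 40, so the average is 40.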

This is required so that you state explicitly how you want to group the records because, for example, you may group by columns that are not listed in the result set.
However, some RDBMSs, such as MySQL, allow aggregate functions to be used without specifying the full GROUP BY clause.

My first reaction would be that 'it is what it is' =)
But on thinking it through, the reason TSQL works like this is because the SELECT and the GROUP BY are two distinct parts of all the operations going on in the query.
This might not be the best example, but it does show that you can GROUP on different (well, 'more') fields than you are actually SELECTing.
SELECT brand = Convert(varchar(100), ''), model = Convert(varchar(100), ''), some_number = Convert(int, 0)
INTO #test
WHERE 1 = 2
INSERT #test (brand, model, some_number)
VALUES ('Ford', 'Focus', 10),
('Ford', 'Focus', 25),
('Ford', 'Kagu', 23),
('DMC', '12', 88)
SELECT brand, model, MAX(some_number)
FROM #test
GROUP BY brand, model
SELECT brand, MAX(some_number)
FROM #test
GROUP BY brand, model
Not all RDBMSs are like this; e.g., MySQL allows omitting fields from the GROUP BY that are nevertheless in the SELECT part. From what I've seen, it then picks an arbitrary value from the group ('there is no such thing as an implicit first') and uses that in the SELECT. At least I think so: my knowledge of MySQL is rather limited, but I've seen some examples here and there, and they always confused me because I'm used to the strict requirement of TSQL you just described.
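The second T-SQL query above can be reproduced as a rough sketch in SQLite via Python's sqlite3 module. This is an adaptation, not the original temp-table script: the table is named test rather than #test, and the ORDER BY is added only to make the output deterministic. It shows that grouping on more columns than are selected yields one row per (brand, model) pair, so 'Ford' appears twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (brand TEXT, model TEXT, some_number INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [("Ford", "Focus", 10), ("Ford", "Focus", 25),
     ("Ford", "Kagu", 23), ("DMC", "12", 88)],
)

# Grouping by (brand, model) but selecting only brand:
# each (brand, model) group still produces its own output row.
rows = conn.execute(
    """
    SELECT brand, MAX(some_number)
    FROM test
    GROUP BY brand, model
    ORDER BY brand, model
    """
).fetchall()

print(rows)
```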

In addition, you can group by your columns in a different order than select
select a, b, c, sum(d)
from table
group by c,a,b
Also, a lot of DBs allow you to skip the column names: you can specify which columns are included in the group by using their position in the select list
select a, b, c, sum(d)
from table
group by 3,1,2
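A small sketch of the two forms side by side, using SQLite through Python's sqlite3 module. SQLite is one engine that accepts positional references in GROUP BY; not every engine does. The table t and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT, c TEXT, d INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [("x", "p", "m", 1), ("x", "p", "m", 2),
     ("x", "q", "m", 5), ("y", "p", "n", 7)],
)

# Group by column names, in a different order than the select list.
by_name = sorted(conn.execute(
    "SELECT a, b, c, SUM(d) FROM t GROUP BY c, a, b").fetchall())

# Group by select-list positions: 3 is c, 1 is a, 2 is b.
by_position = sorted(conn.execute(
    "SELECT a, b, c, SUM(d) FROM t GROUP BY 3, 1, 2").fetchall())

print(by_name == by_position)
```

The results are sorted in Python because neither query has an ORDER BY, so the row order is not guaranteed.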

Related

Aggregating on a column that is also being grouped on

I know there's a lot of confusion related to grouping/aggregation etc, and I thought that I had a pretty decent grasp on the whole thing until I saw something along the lines of
SELECT A, SUM(B)
FROM T
GROUP BY A
HAVING COUNT(A)>1;
At first this puzzled me, since performing an aggregate on a column that is also being grouped on seemed redundant: by definition the value for the group will be distinct. But then I thought about it, and it kind of makes sense for duplicate values in the table, if the aggregation was done before the grouping. In my head, it seems like it's treating it more like this kind of query
SELECT A, SUM(B)
FROM T
WHERE A in (SELECT A FROM T GROUP BY A HAVING COUNT(*)>1)
GROUP BY A;
As opposed to another selection operator on each group after the grouping is done (since to me that doesn't make much sense).
So my question is multifold: Can elements being grouped on be included in the HAVING clause at all? Can elements being grouped on be aggregated on (in the HAVING clause or elsewhere like SELECT clause)? If the previous statements hold, is my understanding of what this operation means correct?
NOTE: This question is mainly about standard (ansi) SQL but info on particular implementations would also be interesting
The arguments to an aggregation function can include the keys being aggregated.
That said, the more common way to count rows in each group is to use COUNT(*). I would recommend:
SELECT A, SUM(B)
FROM T
GROUP BY A
HAVING COUNT(*) > 1;
There is a slight overhead to using COUNT(A) because the value of A needs to be checked against NULL in each row.
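The NULL check is also where the two forms can differ in their results. The sketch below (SQLite via Python's sqlite3, with an invented table T) shows that COUNT(*) counts every row in a group while COUNT(A) skips rows where A is NULL, so the two HAVING clauses keep different groups when A contains NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (A TEXT, B INTEGER)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [("x", 1), ("x", 2), ("y", 3), (None, 4), (None, 5)],
)

# COUNT(*) counts rows, so the NULL group (two rows) passes the filter.
having_star = conn.execute(
    "SELECT A, SUM(B) FROM T GROUP BY A HAVING COUNT(*) > 1 ORDER BY A"
).fetchall()

# COUNT(A) ignores NULLs, so the NULL group counts as 0 and is dropped.
having_a = conn.execute(
    "SELECT A, SUM(B) FROM T GROUP BY A HAVING COUNT(A) > 1 ORDER BY A"
).fetchall()

print(having_star, having_a)
```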

Custom Sorting in SQL order by clause?

Here is the situation that I am trying to solve:
I have a query that could return a set of records. The field being sorted by could have a number of different values - for the sake of this question we will say that the value could be A, B, C, D, E or Z
Now depending on the results of the query, the sorting needs to behave as follows:
If only A-E records are found then sorting them "naturally" is okay. But if a Z record is in the results, then it needs to be the first result in the query, but the rest of the records should be in "natural" sort order.
For instance, if A C D are found, then the result should be
A
C
D
But if A B D E Z are found then the result should be sorted:
Z
A
B
D
E
Currently, the query looks like:
SELECT NAME, SOME_OTHER_FIELDS FROM TABLE ORDER BY NAME
I know I can code a sort function to do what I want, but I can't use one because the results are handled by a third-party library, to which I just pass the SQL query. The library processes the results itself, and there seem to be no hooks that would let me sort the results and then pass them in. It needs to run the SQL query itself, and I have no access to the source code of the library.
So for all of you SQL gurus out there, can you provide a query for me that will do what I want?
How do you identify the Z record? What sets it apart? Once you understand that, add it to your ORDER BY clause.
SELECT name, *
FROM [table]
WHERE (x)
ORDER BY
(
CASE
WHEN (record matches Z) THEN 0
ELSE 1
END
),
name
This way, only the Z record will match the first ordering, and all other records will be sorted by the second-order sort (name). You can exclude the second-order sort if you really don't need it.
For example, if Z is the character string 'Bob', then your query might be:
SELECT name, *
FROM [table]
WHERE (x)
ORDER BY
(
CASE
WHEN name='Bob' THEN 0
ELSE 1
END
), name
My examples are for T-SQL, since you haven't mentioned which database you're using.
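Here is a runnable sketch of the same CASE-based ordering, using SQLite through Python's sqlite3 module rather than T-SQL. The records table and the sorted_names helper are hypothetical names introduced only for this illustration.

```python
import sqlite3

def sorted_names(names, special="Z"):
    """Return names with `special` first and the rest in natural order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (name TEXT)")
    conn.executemany("INSERT INTO records VALUES (?)", [(n,) for n in names])
    rows = conn.execute(
        """
        SELECT name
        FROM records
        ORDER BY CASE WHEN name = ? THEN 0 ELSE 1 END, name
        """,
        (special,),
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]

with_z = sorted_names(["B", "Z", "A", "E", "D"])
without_z = sorted_names(["C", "A", "D"])
print(with_z, without_z)
```

When no Z row is present, every row gets the same first-level sort key, so the result falls back to plain ordering by name.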
There are a number of ways to solve this problem and the best solution depends on a number of factors that you don't discuss such as the nature of those A..Z values and what database product you're using.
If you have only a single value that has to sort on top, you can ORDER BY an expression that maps that value to the lowest possible sort value (with CASE or IIF or IFEQ, depending on your database).
If you have several different special sort values you could ORDER BY a more complicated expression or you could UNION together several SELECTs, with one SELECT for the default sorts and an extra SELECT for each special value. The SELECTs would include a sort column.
Finally, if you have quite a few values you can put the sort values into a separate table and JOIN that table into your query.
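The last option, a separate table of sort values, might look like the sketch below (SQLite via Python's sqlite3). The table names records and sort_order, the sort_rank column, and the fallback value 999 are all assumptions made for the example; names without an entry in sort_order share the fallback rank and are then ordered by name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT)")
conn.execute(
    "CREATE TABLE sort_order (name TEXT PRIMARY KEY, sort_rank INTEGER)")
conn.executemany("INSERT INTO records VALUES (?)", [(n,) for n in "ABDEZY"])

# Only the special values get an explicit rank.
conn.executemany("INSERT INTO sort_order VALUES (?, ?)", [("Z", 0), ("Y", 1)])

ordered = [row[0] for row in conn.execute(
    """
    SELECT r.name
    FROM records AS r
    LEFT JOIN sort_order AS s ON s.name = r.name
    ORDER BY COALESCE(s.sort_rank, 999), r.name
    """
)]
print(ordered)
```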
Not sure what DB you use - the following works for Oracle:
SELECT
NAME,
SOME_OTHER_FIELDS,
DECODE (NAME, 'Z', '_', NAME ) SORTFIELD
FROM TABLE
ORDER BY DECODE (NAME, 'Z', '_', NAME ) ASC

Why can't I GROUP BY 1 when it's OK to ORDER BY 1?

Why are column ordinals legal for ORDER BY but not for GROUP BY? That is, can anyone tell me why this query
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY OrgUnitID
cannot be written as
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY 1
When it's perfectly legal to write a query like
SELECT OrgUnitID FROM Employee AS e ORDER BY 1
?
I'm really wondering if there's something subtle about the relational calculus, or something, that would prevent the grouping from working right.
The thing is, my example is pretty trivial. It's common that the column that I want to group by is actually a calculation, and having to repeat the exact same calculation in the GROUP BY is (a) annoying and (b) makes errors during maintenance much more likely. Here's a simple example:
SELECT DATEPART(YEAR,LastSeenOn), COUNT(*)
FROM Employee AS e
GROUP BY DATEPART(YEAR,LastSeenOn)
I would think that SQL's rule of normalizing so that data is represented only once in the database ought to extend to code as well. I'd want to write that calculation expression only once (in the SELECT column list) and be able to refer to it by ordinal in the GROUP BY.
Clarification: I'm specifically working on SQL Server 2008, but I wonder about an overall answer nonetheless.
One of the reasons is because ORDER BY is the last thing that runs in a SQL Query, here is the order of operations
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
so once you have the columns from the SELECT clause you can use ordinal positioning
EDIT, added this based on the comment
Take this for example
create table test (a int, b int)
insert test values(1,2)
go
The query below will parse without a problem, but it won't run
select a as b, b as a
from test
order by 6
here is the error
Msg 108, Level 16, State 1, Line 3
The ORDER BY position number 6 is out of range of the number of items in the select list.
This also parses fine
select a as b, b as a
from test
group by 1
But it blows up with this error
Msg 164, Level 15, State 1, Line 3
Each GROUP BY expression must contain at least one column that is not an outer reference.
There are a lot of elementary inconsistencies in SQL, and the use of scalars is one of them. For example, anyone might expect
select * from countries
order by 1
and
select * from countries
order by 1.00001
to be similar queries (the difference between the two can be made infinitesimally small, after all), but they are not.
I'm not sure if the standard specifies if it is valid, but I believe it is implementation-dependent. I just tried your first example with one SQL engine, and it worked fine.
Use aliases:
SELECT DATEPART(YEAR,LastSeenOn) as 'seen_year', COUNT(*) as 'count'
FROM Employee AS e
GROUP BY 'seen_year'
** EDIT **
if GROUP BY alias is not allowed for you, here's a solution / workaround:
SELECT seen_year
, COUNT(*) AS Total
FROM (
SELECT DATEPART(YEAR,LastSeenOn) as seen_year, *
FROM Employee AS e
) AS inline_view
GROUP
BY seen_year
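The inline-view workaround can be tried out with the sketch below, which runs on SQLite through Python's sqlite3 module. SQLite has no DATEPART, so strftime('%Y', ...) stands in for it here; the EmployeeID column and the sample dates are invented. The year expression is written once, inside the derived table, and the outer query groups on its alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INTEGER, LastSeenOn TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [(1, "2007-03-01"), (2, "2007-11-15"), (3, "2008-01-20")],
)

# The calculation appears only in the inner SELECT; the outer query
# refers to it by the alias seen_year.
per_year = conn.execute(
    """
    SELECT seen_year, COUNT(*) AS Total
    FROM (SELECT strftime('%Y', LastSeenOn) AS seen_year
          FROM Employee) AS inline_view
    GROUP BY seen_year
    ORDER BY seen_year
    """
).fetchall()
print(per_year)
```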
Databases that don't support this are basically choosing not to. I understand the order in which the various steps are processed, but it is very easy (as many databases have shown) to parse the SQL, understand it, and apply the translation for you. Where it's really a pain is when a column is a long CASE statement; having to repeat that in the GROUP BY clause is super annoying. Yes, you can use the nested-query workaround that someone demonstrated above, but at this point not supporting GROUP BY column numbers is just a lack of care for your users.

The used SELECT statements have a different number of columns

For example, I don't know how many rows are in each table, and I tried this:
SELECT * FROM members
UNION
SELECT * FROM inventory
What can I put in the second SELECT instead of * to remove this error without adding NULLs?
Put the columns names explicitly rather than *, and make sure the number of columns and data types match for the same column in each select.
Update:
I really don't think you want to be UNIONing those tables, based on the tables names. They don't seem to contain related data. If you post your schema and describe what you are trying to achieve it is likely we can provide better help.
you could do
SELECT *
from members
UNION
SELECT inventory.*, 'dummy1' AS membersCol1, 'dummy2' AS membersCol2
from inventory;
Where membersCol1, membersCol2, etc. are the names of columns from members that are not in inventory. That way both queries in the union will have the same columns (assuming that all the columns in inventory are also in members, which seems very strange to me... but hey, it's your schema).
UPDATE:
As HLGEM pointed out, this will only work if inventory has columns with the same names as members, and in the same order. Naming all the columns explicitly is the best idea, but since I don't know the names I can't exactly do that. If I did, it might look something like this:
SELECT id, name, member_role, member_type
from members
UNION
SELECT id, name, '(dummy for union)' AS member_role, '(dummy for union)' AS member_type
from inventory;
I don't like using NULL for dummy values because then it's not always clear which part of the union a record came from - using 'dummy' makes it clear that the record is from the part of the union that didn't have that record (though sometimes this might not matter). The very idea of unioning these two tables seems very strange to me because I very much doubt they'd have more than 1 or 2 columns with the same name, but you asked the question in such a way that I imagine in your scenario this somehow makes sense.
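A runnable sketch of both the error and the explicit-column fix, using SQLite through Python's sqlite3 module. The column names follow the guessed schema above and the rows are invented, so treat the whole thing as illustrative. SQLite reports the mismatch as an OperationalError, with different wording than the MySQL message in the question title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members "
    "(id INTEGER, name TEXT, member_role TEXT, member_type TEXT)")
conn.execute("CREATE TABLE inventory (id INTEGER, name TEXT)")
conn.execute("INSERT INTO members VALUES (1, 'Ann', 'admin', 'full')")
conn.execute("INSERT INTO inventory VALUES (10, 'Widget')")

# SELECT * on both sides fails: 4 columns on the left, 2 on the right.
try:
    conn.execute("SELECT * FROM members UNION SELECT * FROM inventory")
    mismatch_rejected = False
except sqlite3.OperationalError:
    mismatch_rejected = True

# Naming the columns and padding the shorter side makes the counts match.
combined = conn.execute(
    """
    SELECT id, name, member_role, member_type FROM members
    UNION
    SELECT id, name, '(dummy for union)', '(dummy for union)' FROM inventory
    ORDER BY id
    """
).fetchall()
print(mismatch_rejected, combined)
```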
Are you sure you don't want a join instead? It is unlikely that UNION will give you what you want given the table names.
Try this
(SELECT * FROM members) ;
(SELECT * FROM inventory);
Just add semicolons after both the select statements and don't use union or anything else. This solved my error.
I don't know how many rows in each table
Are you sure this isn't what you want?
SELECT 'members' AS TableName, Count(*) AS Cnt FROM members
UNION ALL
SELECT 'inventory', Count(*) FROM inventory
Each SELECT statement within the MySQL UNION ALL operator must have the same number of fields in the result sets with similar data types
Visit https://www.techonthenet.com/mysql/union_all.php

Can I use non-aggregate columns with group by?

You cannot (should not) put non-aggregates in the SELECT line of a GROUP BY query.
I would, however, like to access one of the non-aggregates associated with the max. In plain English, I want a table with the oldest id of each kind.
CREATE TABLE stuff (
id int,
kind int,
age int
);
This query gives me the information I'm after:
SELECT kind, MAX(age)
FROM stuff
GROUP BY kind;
But it's not in the most useful form. I really want the id associated with each row so I can use it in later queries.
I'm looking for something like this:
SELECT id, kind, MAX(age)
FROM stuff
GROUP BY kind;
I can get that output by joining back to the aggregated subquery:
SELECT stuff.*
FROM
stuff,
( SELECT kind, MAX(age) AS age
FROM stuff
GROUP BY kind) maxes
WHERE
stuff.kind = maxes.kind AND
stuff.age = maxes.age
It really seems like there should be a way to get this information without needing to join. I just need the SQL engine to remember the other columns when it's calculating the max.
You can't get the Id of the row that MAX found, because there might not be only one id with the maximum age.
You cannot (should not) put non-aggregates in the SELECT line of a GROUP BY query.
You can, and have to, define what you are grouping by for the aggregate function to return the correct result.
MySQL (and SQLite) decided in their infinite wisdom that they would go against spec, and allow queries to accept GROUP BY clauses missing columns quoted in the SELECT - it effectively makes these queries not portable.
It really seems like there should be a way to get this information without needing to join.
Without access to the analytic/ranking/windowing functions that MySQL doesn't support, the self join to a derived table/inline view is the most portable means of getting the result you desire.
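A minimal sketch of that self join to a derived table, run on SQLite through Python's sqlite3 module with invented rows. Two rows of kind 1 deliberately tie for the maximum age, which shows why a single id per kind is not always well defined: the join returns both of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INTEGER, kind INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO stuff VALUES (?, ?, ?)",
    [(1, 1, 10), (2, 1, 30), (3, 1, 30), (4, 2, 5)],
)

# Find the max age per kind, then join back to recover the full rows.
oldest = conn.execute(
    """
    SELECT s.id, s.kind, s.age
    FROM stuff AS s
    JOIN (SELECT kind, MAX(age) AS max_age
          FROM stuff
          GROUP BY kind) AS maxes
      ON s.kind = maxes.kind AND s.age = maxes.max_age
    ORDER BY s.id
    """
).fetchall()
print(oldest)
```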
I think it's indeed tempting to ask the system to solve the problem in one pass rather than doing the job twice (find the max, and then find the corresponding id). You can do this using CONCAT (as suggested in the article Naktibalda referred to), though I'm not sure that would be more efficient:
SELECT MAX( CONCAT( LPAD(age, 10, '0'), '-', id) )
FROM stuff
GROUP BY kind;
Should work, you have to split the answer to get the age and the id.
(That's really ugly though)
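For what it's worth, here is a sketch of the same packing trick on SQLite via Python's sqlite3, including the splitting step. SQLite has no LPAD or CONCAT, so printf('%010d', ...) and the || operator stand in for them; the sample rows are invented. Zero-padding matters because the packed values are compared as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INTEGER, kind INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO stuff VALUES (?, ?, ?)",
    [(1, 1, 10), (2, 1, 30), (3, 2, 5)],
)

# Pack the zero-padded age and the id into one string so that MAX()
# carries the id along with the largest age.
packed = conn.execute(
    """
    SELECT kind, MAX(printf('%010d', age) || '-' || id)
    FROM stuff
    GROUP BY kind
    ORDER BY kind
    """
).fetchall()

# Split each packed value back into (age, id).
oldest = {}
for kind, value in packed:
    age_text, id_text = value.split("-", 1)
    oldest[kind] = (int(age_text), int(id_text))
print(oldest)
```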
In recent databases you can use max() over (partition by ...) to solve this problem:
select id, kind, age as max_age from (
select id, kind, age, max(age) over (partition by kind) as mage
from stuff) t
where age = mage
This can then be single pass
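A runnable version of the window-function approach, sketched with Python's sqlite3 module. It assumes the bundled SQLite library is version 3.25 or newer, which is when window functions were added; the rows are invented. As with the join, rows that tie for the maximum age are all returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INTEGER, kind INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO stuff VALUES (?, ?, ?)",
    [(1, 1, 10), (2, 1, 30), (3, 1, 30), (4, 2, 5)],
)

# MAX(age) OVER (PARTITION BY kind) attaches each kind's maximum age
# to every row; the outer query keeps the rows that match it.
oldest = conn.execute(
    """
    SELECT id, kind, age AS max_age
    FROM (SELECT id, kind, age,
                 MAX(age) OVER (PARTITION BY kind) AS mage
          FROM stuff) AS t
    WHERE age = mage
    ORDER BY id
    """
).fetchall()
print(oldest)
```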
PostgreSQL's DISTINCT ON will be useful here.
SELECT DISTINCT ON (kind) kind, id, age
FROM stuff
ORDER BY kind, age DESC;
This groups by kind and returns the first row in the ordered format. As we have ordered by age in descending order, we will get the row with max age for kind.
P.S. columns in DISTINCT ON should appear first in order by
You have to have a join because the aggregate function max reads many rows and returns only the maximum value.
So you need a join to pick out the row that the aggregate function has found.
To put it a different way how would you expect the query to behave if you replaced max with sum?
An inner join might be more efficient than your sub query though.