Select and Group by together - sql

I have my query like this:
Select
a.abc,
a.cde,
a.efg,
a.agh,
c.dummy,
p.test,
max(b.this),
sum(b.sugar),
sum(b.bucket),
sum(b.something)
followed by some outer and inner joins. With the following group by clause:
group by
a.abc,
a.cde,
a.efg,
a.agh,
c.dummy,
p.test
The query works fine. But if I remove any one of them from group by it gives:
SQLSTATE: 42803
Can anyone explain the cause of this error?

Generally, any column that isn't in the group by section can only be included in the select section if it has an aggregating function applied to it. Or, put another way, any non-aggregated data in the select section must be grouped on.
Otherwise, how would the DBMS know what you want done with it? For example, if you group on a.abc, there can only be one thing that a.abc can be for that grouped row (since all other values of a.abc will come out in a different row). Here's a short example, with a table containing:
LastName  FirstName  Salary
--------  ---------  ------
Smith     John       123456
Smith     George     111111
Diablo    Pax        999999
With the query select LastName, Salary from Employees group by LastName, you would expect to see:
LastName  Salary
--------  ------
Smith     ??????
Diablo    999999
The salary for the Smiths is incalculable since you don't know what function to apply to it, which is what's causing that error. In other words, the DBMS doesn't know what to do with 123456 and 111111 to get a single value for the grouped row.
If you instead used select LastName, sum(Salary) from Employees group by LastName (or max() or min() or avg() or any other aggregating function), the DBMS would know what to do. For sum(), it will simply add them and give you 234567.
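As a rough illustration of the aggregated form, here is a minimal sketch that runs the query through Python's built-in sqlite3 module. The choice of SQLite is an assumption made purely for convenience; SQLite is lenient about GROUP BY, so it would not raise SQLSTATE 42803 for the invalid form, and only the valid form is shown.

```python
import sqlite3

# In-memory database holding the three example rows from the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (LastName TEXT, FirstName TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [("Smith", "John", 123456), ("Smith", "George", 111111), ("Diablo", "Pax", 999999)],
)

# Every selected column is either grouped on (LastName) or aggregated (Salary).
totals = conn.execute(
    "SELECT LastName, SUM(Salary), MAX(Salary) FROM Employees "
    "GROUP BY LastName ORDER BY LastName"
).fetchall()
print(totals)
```

Each last name yields exactly one row, because the two Smith salaries have been reduced to a single value by each aggregate.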
In your query, the equivalent of trying to use Salary without an aggregating function is to change max(b.this) to just b.this but not include it in the group by section. Or alternatively, remove one of the group by columns without changing it to an aggregation in the select section.
In both cases, you'll have one row that has multiple possible values for the column.
The DB2 docs at publib for sqlstate 42803 describe your problem:
A column reference in the SELECT or HAVING clause is invalid, because it is not a grouping column; or a column reference in the GROUP BY clause is invalid.

SQL will insist that any column in the SELECT section is either included in the GROUP BY section or has an aggregate function applied to it in the SELECT section.
This article gives a nice explanation of why this is the case. The article is SQL Server specific, but the principle should be roughly similar for all RDBMSs.

Related

Why Is This Column Name Invalid? [duplicate]

I got an error -
Column 'Employee.EmpID' is invalid in the select list because it is
not contained in either an aggregate function or the GROUP BY clause.
select loc.LocationID, emp.EmpID
from Employee as emp full join Location as loc
on emp.LocationID = loc.LocationID
group by loc.LocationID
This situation fits into the answer given by Bill Karwin.
correction for above, fits into answer by ExactaBox -
select loc.LocationID, count(emp.EmpID) -- not count(*), don't want to count nulls
from Employee as emp full join Location as loc
on emp.LocationID = loc.LocationID
group by loc.LocationID
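The comment about count(*) versus count(emp.EmpID) can be checked with a small sketch. The sample rows are made up for illustration, SQLite (via Python's sqlite3) is an assumed engine, and a LEFT JOIN from Location stands in for the full join because older SQLite versions lack FULL JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Location (LocationID INTEGER);
CREATE TABLE Employee (EmpID INTEGER, LocationID INTEGER);
INSERT INTO Location VALUES (1), (2);
INSERT INTO Employee VALUES (10, 1), (11, 1);  -- location 2 has no employees
""")

# COUNT(emp.EmpID) skips the NULL produced by the outer join; COUNT(*) counts the row.
rows = conn.execute("""
    SELECT loc.LocationID, COUNT(emp.EmpID), COUNT(*)
    FROM Location AS loc
    LEFT JOIN Employee AS emp ON emp.LocationID = loc.LocationID
    GROUP BY loc.LocationID
    ORDER BY loc.LocationID
""").fetchall()
print(rows)
```

For the empty location, COUNT(emp.EmpID) reports 0 while COUNT(*) reports 1, which is why the column form is the one wanted here.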
ORIGINAL QUESTION -
For the SQL query -
select *
from Employee as emp full join Location as loc
on emp.LocationID = loc.LocationID
group by (loc.LocationID)
I don't understand why I get this error. All I want to do is join the tables and then group all the employees in a particular location together.
I think I have a partial explanation for my own question. Tell me if it's OK -
To group all employees that work in the same location we have to first mention the LocationID.
Then, we cannot/do not mention each employee ID next to it. Rather, we mention the total number of employees in that location, i.e. we should SUM() the employees working in that location. Why we do it the latter way, I am not sure.
So, this explains the "it is not contained in either an aggregate function" part of the error.
What is the explanation for the GROUP BY clause part of the error ?
Suppose I have the following table T:
a b
--------
1 abc
1 def
1 ghi
2 jkl
2 mno
2 pqr
And I do the following query:
SELECT a, b
FROM T
GROUP BY a
The output should have two rows, one row where a=1 and a second row where a=2.
But what should the value of b show on each of these two rows? There are three possibilities in each case, and nothing in the query makes it clear which value to choose for b in each group. It's ambiguous.
This demonstrates the single-value rule, which prohibits the undefined results you get when you run a GROUP BY query, and you include any columns in the select-list that are neither part of the grouping criteria, nor appear in aggregate functions (SUM, MIN, MAX, etc.).
Fixing it might look like this:
SELECT a, MAX(b) AS x
FROM T
GROUP BY a
Now it's clear that you want the following result:
a x
--------
1 ghi
2 pqr
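The fixed query can be tried against the same six rows. This is only a sketch using Python's sqlite3 module as an assumed engine; the table and data are the ones from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [(1, "abc"), (1, "def"), (1, "ghi"), (2, "jkl"), (2, "mno"), (2, "pqr")],
)

# MAX(b) picks one well-defined value of b per group, so the result is unambiguous.
rows = conn.execute("SELECT a, MAX(b) AS x FROM T GROUP BY a ORDER BY a").fetchall()
print(rows)
```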
Your query would work in MySQL if the ONLY_FULL_GROUP_BY SQL mode is disabled (it was off by default in older versions; MySQL 5.7.5 and later enable it by default). But in this case you are using a different RDBMS. So to make your query work, add all non-aggregated columns to your GROUP BY clause, e.g.
SELECT col1, col2, SUM(col3) totalSUM
FROM tableName
GROUP BY col1, col2
A non-aggregated column is one that is not passed into an aggregate function like SUM, MAX, COUNT, etc.
Basically, what this error is saying is that if you are going to use the GROUP BY clause, then your result is going to be a relation/table with one row for each group. So in your SELECT statement you can only select the columns that you are grouping by, plus aggregate functions applied to the other columns, because the other columns have no single value per group.
"All I want to do is join the tables and then group all the employees
in a particular location together."
It sounds like what you want is for the output of the SQL statement to list every employee in the company, but first all the people in the Anaheim office, then the people in the Buffalo office, then the people in the Cleveland office (A, B, C, get it, obviously I don't know what locations you have).
In that case, lose the GROUP BY clause. All you need is ORDER BY loc.LocationID
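A minimal sketch of the ORDER BY approach follows. The Name and City columns and the sample rows are hypothetical, added only so there is something to list, and SQLite via Python's sqlite3 is an assumed engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Location (LocationID INTEGER, City TEXT);
CREATE TABLE Employee (EmpID INTEGER, Name TEXT, LocationID INTEGER);
INSERT INTO Location VALUES (1, 'Anaheim'), (2, 'Buffalo');
INSERT INTO Employee VALUES (1, 'Ann', 2), (2, 'Bo', 1), (3, 'Cy', 2);
""")

# No GROUP BY: every employee stays a separate row, sorted so that
# employees of the same location end up next to each other.
rows = conn.execute("""
    SELECT loc.City, emp.Name
    FROM Employee AS emp
    JOIN Location AS loc ON emp.LocationID = loc.LocationID
    ORDER BY loc.LocationID, emp.EmpID
""").fetchall()
print(rows)
```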

Number of tuples outputted by GROUP BY primitive [duplicate]

I know that if you have one aggregate function in a SELECT statement, then all the other values in the statement must be either aggregate functions, or listed in a GROUP BY clause. I don't understand why that's the case.
If I do:
SELECT Name, 'Jones' AS Surname FROM People
I get:
NAME SURNAME
Dave Jones
Susan Jones
Amy Jones
So, the DBMS has taken a value from each row, and appended a single value to it in the result set. That's fine. But if that works, why can't I do:
SELECT Name, COUNT(Name) AS Surname FROM People
It seems like the same idea, take a value from each row and append a single value. But instead of:
NAME SURNAME
Dave 3
Susan 3
Amy 3
I get:
You tried to execute a query that does not include the specified expression 'ContactName' as part of an aggregate function.
I know it's not allowed, but the two circumstances seem so similar that I don't understand why. Is it to make the DBMS easier to implement? If anyone can explain to me why it doesn't work like I think it should, I'd be very grateful.
Aggregates don't work on a complete result, they only work on a group in a result.
Consider a table containing:
Person Pet
-------- --------
Amy Cat
Amy Dog
Amy Canary
Dave Dog
Susan Snake
Susan Spider
If you use a query that groups on Person, it will divide the data into these groups:
Amy:
Amy Cat
Amy Dog
Amy Canary
Dave:
Dave Dog
Susan:
Susan Snake
Susan Spider
If you use an aggregate, for example the count aggregate, it will produce one result for each group:
Amy:
Amy Cat
Amy Dog
Amy Canary count(*) = 3
Dave:
Dave Dog count(*) = 1
Susan:
Susan Snake
Susan Spider count(*) = 2
So, the query select Person, count(*) from People group by Person gives you one record for each group:
Amy 3
Dave 1
Susan 2
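The grouping described above can be reproduced with a short sketch. It assumes SQLite through Python's sqlite3 module and stores the Person/Pet rows in a table named People, as in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Person TEXT, Pet TEXT)")
conn.executemany(
    "INSERT INTO People VALUES (?, ?)",
    [("Amy", "Cat"), ("Amy", "Dog"), ("Amy", "Canary"),
     ("Dave", "Dog"), ("Susan", "Snake"), ("Susan", "Spider")],
)

# One output row per group; COUNT(*) is evaluated once for each group.
rows = conn.execute(
    "SELECT Person, COUNT(*) FROM People GROUP BY Person ORDER BY Person"
).fetchall()
print(rows)
```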
If you try to get the Pet field in the result also, that doesn't work because there may be multiple values for that field in each group.
(Some databases, like MySQL, do allow that anyway, and just return an arbitrary value from within the group, and it's your responsibility to know if the result is sensible or not.)
If you use an aggregate but don't specify any grouping, the query will still be grouped, and the entire result is a single group. So the query select count(*) from People will create a single group containing all records, and the aggregate can count the records in that group. The result contains one row from each group, and as there is only one group, there will be one row in the result.
Think about it this way: when you call COUNT without grouping, it "collapses" the table to a single group making it impossible to access the individual items within a group in a select clause.
You can still get your result using a subquery or a cross join:
SELECT p1.Name, COUNT(p2.Name) AS Surname FROM People p1 CROSS JOIN People p2 GROUP BY p1.Name
SELECT Name, (SELECT COUNT(Name) FROM People) AS Surname FROM People
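Both workarounds can be compared side by side in a small sketch. It assumes SQLite via Python's sqlite3 and a People table holding just the three names from the question; an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Name TEXT)")
conn.executemany("INSERT INTO People VALUES (?)", [("Dave",), ("Susan",), ("Amy",)])

# Cross join: each name is paired with every row, then counted per name.
cross = conn.execute(
    "SELECT p1.Name, COUNT(p2.Name) AS Surname "
    "FROM People p1 CROSS JOIN People p2 GROUP BY p1.Name ORDER BY p1.Name"
).fetchall()

# Scalar subquery: the total is computed separately and attached to each row.
sub = conn.execute(
    "SELECT Name, (SELECT COUNT(Name) FROM People) AS Surname "
    "FROM People ORDER BY Name"
).fetchall()
print(cross, sub)
```

Note that the cross join form only matches the subquery form when the names are unique; with repeated names the per-name counts would be multiplied.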
As others explained, when you have a GROUP BY or you are using an aggregate function like COUNT() in the SELECT list, you are doing a grouping of rows and therefore collapsing matching rows into one for every group.
When you only use aggregate functions in the SELECT list, without GROUP BY, think of it as you have a GROUP BY 1, so all rows are grouped, collapsed into one. So, if you have a hundred rows, the database can't really show you a name as there are a hundred of them.
However, for RDBMSs that have "windowing" functions, what you want is feasible: you can use aggregate functions without a GROUP BY.
Example for SQL-Server, where all rows (names) in the table are counted:
SELECT Name
, COUNT(*) OVER() AS cnt
FROM People
How does the above work?
It shows the Name as if the COUNT(*) OVER() AS cnt did not exist, and
it shows the COUNT(*) as if it were making a total grouping of the table.
Another example. If you have a Surname field on the table, you can have something like this to show all rows grouped by Surname and counting how many people have same Surname:
SELECT Name
, Surname
, COUNT(*) OVER(PARTITION BY Surname) AS cnt
FROM People
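The two window queries can be combined into one sketch. This assumes SQLite 3.25 or later (the first version with window functions) through Python's sqlite3, and the Surname values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Name TEXT, Surname TEXT)")
conn.executemany(
    "INSERT INTO People VALUES (?, ?)",
    [("Dave", "Jones"), ("Susan", "Jones"), ("Amy", "Smith")],
)

# No GROUP BY: every row is kept, and each window count is attached to it.
rows = conn.execute("""
    SELECT Name,
           Surname,
           COUNT(*) OVER (PARTITION BY Surname) AS cnt,  -- rows sharing the surname
           COUNT(*) OVER ()                     AS total -- rows in the whole table
    FROM People
    ORDER BY Name
""").fetchall()
print(rows)
```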
Your query implicitly asks for different types of rows in your result set, and that is not allowed. All rows returned should be of the same type and have the same kind of columns.
'SELECT name, surname' wants to return a row for every row in the table.
'SELECT COUNT(*)' wants to return a single row combining the results of all the rows in the table.
I think you're correct that in this case the database could plausibly just do both queries and then copy the result of 'SELECT COUNT(*)' into every result. One reason for not doing this is that it would be a stealth performance hit: you'd effectively be doing an extra self-join without declaring it anywhere.
Other answers have explained how to write a working version of this query, so I won't go into that.
The aggregate function and the group by clause aren't separate things, they're parts of the same thing that appear in different places in the query. If you wish to aggregate on a column, you must say what function to use for aggregation; if you wish to have an aggregation function, it has to be applied over some column.
The aggregate function takes values from multiple rows with a specific condition and combines them into one value. This condition is defined by the GROUP BY in your statement. So you can't combine an aggregate function with a non-aggregated column without a GROUP BY.
With
SELECT Name, 'Jones' AS Surname FROM People
you simply select an additional column with a fixed value... but with
SELECT Name, COUNT(Name) AS Surname FROM People GROUP BY Name
you tell the DBMS to select the Names, remember how often every Name occurred in the table and collapse them into one row. So if you omit the GROUP BY, the DBMS can't tell how to collapse the records.

Many Fields In the Group By Clause

I am learning SQL now, and I have a question. I recently came across a query that had a large number of column names in the group by clause. I've used group by clauses before, and I've only ever seen one column name included in it.
SELECT TransportType.Description, TransportType.CargoCapacity, TransportType.Range, Transport.SerialNumber, Transport.PurchaseDate, Transport.RetiredDate,
MAX(Repair.BeginWorkDate) AS LatestRepairDate
FROM Transport INNER JOIN
TransportType ON Transport.TransportTypeID = TransportType.TransportTypeID LEFT OUTER JOIN
Repair ON Transport.TransportNumber = Repair.TransportNumber
GROUP BY TransportType.Description, TransportType.CargoCapacity, TransportType.Range, Transport.SerialNumber, Transport.PurchaseDate,
Transport.RetiredDate
HAVING (Transport.RetiredDate IS NULL)
ORDER BY TransportType.Description, Transport.SerialNumber
Why are there so many columns in the group by clause?
Except in MySQL (with ONLY_FULL_GROUP_BY disabled) & SQLite, which are lenient about the GROUP BY with sometimes indeterminate results, most RDBMS require every column that appears in the SELECT list outside an aggregate function (MAX(), MIN(), SUM(), COUNT(), etc.) to be in the GROUP BY.
The behavior of MySQL & SQLite when columns from SELECT aren't listed in GROUP BY is not well defined. If for example, you execute a query like:
SELECT firstname, lastname, COUNT(*) FROM names GROUP BY lastname
MySQL would give you a result without complaint.
However, if your table included two different values of firstname having the same lastname, your resultant COUNT(*) would count both of them while only returning the firstname of one of them. What's more, which firstname MySQL chooses to return isn't defined so you can't really rely on it returning the first of the pair, for example.
From a table like:
firstname, lastname
--------------------
Jane       Smith
John       Smith
Peter      Jones
The not-fully-correct result might be:
firstname, lastname, COUNT(*)
-----------------------------
Jane       Smith     2         <----wrong!
Peter      Jones     1
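For comparison, here is a sketch of two portable ways to write this query so the result is well defined. It assumes SQLite via Python's sqlite3 as the engine; the rewrites themselves are illustrative choices, not something the answer prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (firstname TEXT, lastname TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [("Jane", "Smith"), ("John", "Smith"), ("Peter", "Jones")],
)

# Option 1: aggregate firstname explicitly, so the chosen value is defined.
per_lastname = conn.execute(
    "SELECT lastname, COUNT(*), MIN(firstname) FROM names "
    "GROUP BY lastname ORDER BY lastname"
).fetchall()

# Option 2: group on every non-aggregated column in the SELECT list.
per_person = conn.execute(
    "SELECT firstname, lastname, COUNT(*) FROM names "
    "GROUP BY firstname, lastname ORDER BY firstname"
).fetchall()
print(per_lastname, per_person)
```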
Outside MySQL & SQLite, non-aggregated columns referenced anywhere in the SELECT list but not appearing in the GROUP BY will result in a query parse error.
Commonly here on Stack Overflow, we encounter users with questions about the GROUP BY, having just begun working with an RDBMS that is stricter about its usage. If you learn aggregates in MySQL first, chances are you'll need to relearn to do them properly when moving to a different RDBMS.

JOIN on another table after GROUP BY and COUNT

I'm trying to make sense of the right way to use JOIN, COUNT(*), and GROUP BY to do a pretty simple query. I've actually gotten it to work (see below) but from what I've read, I'm using an extra GROUP BY that I shouldn't be.
(Note: The problem below isn't my actual problem (which deals with more complicated tables), but I've tried to come up with an analogous problem)
I have two tables:
Table: Person
-------------
key  name     cityKey
1    Alice    1
2    Bob      2
3    Charles  2
4    David    1
Table: City
-------------
key  name
1    Albany
2    Berkeley
3    Chico
I'd like to do a query on the People (with some WHERE clause) that returns
the number of matching people in each city
the key for the city
the name of the city.
If I do
SELECT COUNT(Person.key) AS count, City.key AS cityKey, City.name AS cityName
FROM Person
LEFT JOIN City ON Person.cityKey = City.key
GROUP BY Person.cityKey, City.name
I get the result that I want
count  cityKey  cityName
2      1        Albany
2      2        Berkeley
However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
So what's the right way to do this? I've been trying to google for an answer, but I feel like there's something fundamental that I'm just not getting.
I don't think that it's "wrong" in this case, because you've got a one-to-one relationship between city name and city key. You could rewrite it such that you join to a sub-select to get the count of persons to cities by key, to the city table again for the name, but it's debatable that that'd be better. It's a matter of style and opinion I guess.
select PC.ct, City.key, City.name
from City
join (select count(Person.key) ct, cityKey key from Person group by cityKey) PC
on City.key = PC.key
if my SQL isn't too rusty :-)
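The sub-select version can be checked against the sample tables with a sketch like the one below. It assumes SQLite via Python's sqlite3; the key columns are double-quoted because key is a keyword in several databases, and the sub-select keeps the cityKey name instead of aliasing it to key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person ("key" INTEGER, name TEXT, cityKey INTEGER);
CREATE TABLE City ("key" INTEGER, name TEXT);
INSERT INTO Person VALUES (1, 'Alice', 1), (2, 'Bob', 2), (3, 'Charles', 2), (4, 'David', 1);
INSERT INTO City VALUES (1, 'Albany'), (2, 'Berkeley'), (3, 'Chico');
""")

# Count persons per cityKey first, then join the counts to City for the name.
rows = conn.execute("""
    SELECT PC.ct, City."key", City.name
    FROM City
    JOIN (SELECT COUNT(Person."key") AS ct, cityKey
          FROM Person
          GROUP BY cityKey) AS PC
      ON City."key" = PC.cityKey
    ORDER BY City."key"
""").fetchall()
print(rows)
```

Chico does not appear because the inner join drops cities with no matching persons, just as in the original query's output.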
...I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
You misunderstand, you got it backwards.
Standard SQL requires you to specify in the GROUP BY all the columns mentioned in the SELECT that are not wrapped in aggregate functions. If you don't want certain columns in the GROUP BY, wrap them in aggregate functions. Depending on the database, you could use the analytic/windowing function OVER...
However, MySQL and SQLite provide the "feature" where you can omit these columns from the group by - which leads to no end of "why doesn't this port from MySQL to fill_in_the_blank database?!" questions on Stack Overflow and numerous other sites & forums.
However, I've read that throwing in
that last part of the GROUP BY clause
(City.name) just to make it work is
wrong.
It's not wrong. You have to understand how the Query Optimizer sees your query. The order in which it is parsed is what requires you to "throw the last part in." The optimizer sees your query in something akin to this order:
1. The required tables are joined.
2. The composite dataset is filtered through the WHERE clause.
3. The remaining rows are chopped into groups by the GROUP BY clause, and aggregated.
4. They are then filtered again, through the HAVING clause.
5. Finally, they are operated on by SELECT / ORDER BY, UPDATE or DELETE.
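The difference between the WHERE step and the HAVING step in that order can be sketched as follows. The filter conditions are invented for illustration, and SQLite via Python's sqlite3 is an assumed engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person ("key" INTEGER, name TEXT, cityKey INTEGER);
INSERT INTO Person VALUES (1, 'Alice', 1), (2, 'Bob', 2), (3, 'Charles', 2), (4, 'David', 1);
""")

# WHERE removes Alice before grouping, so city 1 is left with one person.
# HAVING then runs on the grouped rows and drops groups with fewer than 2.
rows = conn.execute("""
    SELECT cityKey, COUNT(*)
    FROM Person
    WHERE name <> 'Alice'
    GROUP BY cityKey
    HAVING COUNT(*) >= 2
    ORDER BY cityKey
""").fetchall()
print(rows)
```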
The point here is that it's not that the GROUP BY has to name all the columns in the SELECT, but in fact it is the opposite - the SELECT cannot include any columns not already in the GROUP BY.
Your query would only work on MySQL, because you group on Person.cityKey but select city.key. All other databases would require you to use an aggregate like min(city.key), or to add City.key to the group by clause.
Because the combination of city name and city key is unique, the following are equivalent:
select count(person.key), min(city.key), min(city.name)
...
group by person.citykey
Or:
select count(person.key), city.key, city.name
...
group by person.citykey, city.key, city.name
Or:
select count(person.key), city.key, max(city.name)
...
group by city.key
All rows in the group will have the same city name and key, so it doesn't matter if you use the max or min aggregate.
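To see that the three forms agree, here is a sketch that fills in the elided "..." with an inner join between Person and City. That join, the quoting of key, and the use of SQLite via Python's sqlite3 are all assumptions for the sake of a runnable example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person ("key" INTEGER, name TEXT, cityKey INTEGER);
CREATE TABLE City ("key" INTEGER, name TEXT);
INSERT INTO Person VALUES (1, 'Alice', 1), (2, 'Bob', 2), (3, 'Charles', 2), (4, 'David', 1);
INSERT INTO City VALUES (1, 'Albany'), (2, 'Berkeley'), (3, 'Chico');
""")

# Assumed FROM clause shared by all three variants.
joined = 'FROM Person JOIN City ON Person.cityKey = City."key"'

variants = [
    f'SELECT COUNT(Person."key"), MIN(City."key"), MIN(City.name) {joined} '
    'GROUP BY Person.cityKey ORDER BY 2',
    f'SELECT COUNT(Person."key"), City."key", City.name {joined} '
    'GROUP BY Person.cityKey, City."key", City.name ORDER BY 2',
    f'SELECT COUNT(Person."key"), City."key", MAX(City.name) {joined} '
    'GROUP BY City."key" ORDER BY 2',
]
results = [conn.execute(sql).fetchall() for sql in variants]
print(results[0])
```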
P.S. If you'd like to count only different persons, even if they have multiple rows, try:
count(DISTINCT person.key)
instead of
count(person.key)
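A quick sketch of the difference, using a hypothetical Visit table in which the same person key can appear on several rows (SQLite via Python's sqlite3 is again an assumed engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Visit (personKey INTEGER)")
conn.executemany("INSERT INTO Visit VALUES (?)", [(1,), (1,), (2,), (3,)])

# COUNT counts every non-NULL value; COUNT(DISTINCT ...) counts each value once.
counts = conn.execute(
    "SELECT COUNT(personKey), COUNT(DISTINCT personKey) FROM Visit"
).fetchone()
print(counts)
```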