I know that if you have one aggregate function in a SELECT statement, then all the other values in the statement must be either aggregate functions, or listed in a GROUP BY clause. I don't understand why that's the case.
If I do:
SELECT Name, 'Jones' AS Surname FROM People
I get:
NAME SURNAME
Dave Jones
Susan Jones
Amy Jones
So, the DBMS has taken a value from each row, and appended a single value to it in the result set. That's fine. But if that works, why can't I do:
SELECT Name, COUNT(Name) AS Surname FROM People
It seems like the same idea, take a value from each row and append a single value. But instead of:
NAME SURNAME
Dave 3
Susan 3
Amy 3
I get:
You tried to execute a query that does not include the specified expression 'ContactName' as part of an aggregate function.
I know it's not allowed, but the two circumstances seem so similar that I don't understand why. Is it to make the DBMS easier to implement? If anyone can explain to me why it doesn't work like I think it should, I'd be very grateful.
Aggregates doesn't work on a complete result, they only work on a group in a result.
Consider a table containing:
Person Pet
-------- --------
Amy Cat
Amy Dog
Amy Canary
Dave Dog
Susan Snake
Susan Spider
If you use a query that groups on Person, it will divide the data into these groups:
Amy:
Amy Cat
Amy Dog
Amy Canary
Dave:
Dave Dog
Susan:
Susan Snake
Susan Spider
If you use an aggreage, for exmple the count aggregate, it will produce one result for each group:
Amy:
Amy Cat
Amy Dog
Amy Canary count(*) = 3
Dave:
Dave Dog count(*) = 1
Susan:
Susan Snake
Susan Spider count(*) = 2
So, the query select Person, count(*) from People group by Person gives you one record for each group:
Amy 3
Dave 1
Susan 2
If you try to get the Pet field in the result also, that doesn't work because there may be multiple values for that field in each group.
(Some databases, like MySQL, does allow that anyway, and just returns any random value from within the group, and it's your responsibility to know if the result is sensible or not.)
If you use an aggregate, but doesn't specify any grouping, the query will still be grouped, and the entire result is a single group. So the query select count(*) from Person will create a single group containing all records, and the aggregate can count the records in that group. The result contains one row from each group, and as there is only one group, there will be one row in the result.
Think about it this way: when you call COUNT without grouping, it "collapses" the table to a single group making it impossible to access the individual items within a group in a select clause.
You can still get your result using a subquery or a cross join:
SELECT p1.Name, COUNT(p2.Name) AS Surname FROM People p1 CROSS JOIN People p2 GROUP BY p1.Name
SELECT Name, (SELECT COUNT(Name) FROM People) AS Surname FROM People
As others explained, when you have a GROUP BY or you are using an aggregate function like COUNT() in the SELECT list, you are doing a grouping of rows and therefore collapsing matching rows into one for every group.
When you only use aggregate functions in the SELECT list, without GROUP BY, think of it as you have a GROUP BY 1, so all rows are grouped, collapsed into one. So, if you have a hundred rows, the database can't really show you a name as there are a hundred of them.
However, for RDBMSs that have "windowing" functions, what you want is feasible. E.g. use aggregate functions without a GROUP BY.
Example for SQL-Server, where all rows (names) in the table are counted:
SELECT Name
, COUNT(*) OVER() AS cnt
FROM People
How does the above work?
It shows the Name like the
COUNT(*) OVER() AS cnt did not
exist and
It shows the COUNT(*) like if it was making a total grouping of the
table.
Another example. If you have a Surname field on the table, you can have something like this to show all rows grouped by Surname and counting how many people have same Surname:
SELECT Name
, Surname
, COUNT(*) OVER(PARTITION BY Surname) AS cnt
FROM People
Your query implicitly asks for different types of rows in your result set, and that is not allowed. All rows returned should be of the same type and have the same kind of columns.
'SELECT name, surname' wants to returns a row for every row in the table.
'SELECT COUNT(*)' wants to return a single row combining the results of all the rows in the table.
I think you're correct that in this case the database could plausibly just do both queries and then copy the result of 'SELECT COUNT(*)' into every result. One reason for not doing this is that it would be a stealth performance hit: you'd effectively be doing an extra self-join without declaring it anywhere.
Other answers have explained how to write a working version of this query, so I won't go into that.
The aggregate function and the group by clause aren't separate things, they're parts of the same thing that appear in different places in the query. If you wish to aggregate on a column, you must say what function to use for aggregation; if you wish to have an aggregation function, it has to be applied over some column.
The aggregate function takes values from multiple rows with a specific condition and combines them into one value. This condition is defined by the GROUP BYin your statement. So you can't use an aggregate function without a GROUP BY
With
SELECT Name, 'Jones' AS Surname FROM People
you simply select an additional column with a fixed value... but with
SELECT Name, COUNT(Name) AS Surname FROM People GROUP BY Name
you tell the DBMS to select the Names, remember how often every Name occured in the table and collapse them into one row. So if you omit the GROUP BY the DBMS can't tell, how to collapse the records
Related
I am learning SQL now, and I have a question. I recently came across a query that hand a large number of column names in the group by clause. I've used group by clauses before, and I've only ever seen one column name included in it.
SELECT TransportType.Description, TransportType.CargoCapacity, TransportType.Range, Transport.SerialNumber, Transport.PurchaseDate, Transport.RetiredDate,
MAX(Repair.BeginWorkDate) AS LatestRepairDate
FROM Transport INNER JOIN
TransportType ON Transport.TransportTypeID = TransportType.TransportTypeID LEFT OUTER JOIN
Repair ON Transport.TransportNumber = Repair.TransportNumber
GROUP BY TransportType.Description, TransportType.CargoCapacity, TransportType.Range, Transport.SerialNumber, Transport.PurchaseDate,
Transport.RetiredDate
HAVING (Transport.RetiredDate IS NULL)
ORDER BY TransportType.Description, Transport.SerialNumber
Why are there so many columns in the group by clause?
Except in MySQL & SQLite (which are lenient about the GROUP BY with sometimes indeterminate results), most RDBMS require every non-aggregated column (MAX(),MIN(),SUM(),COUNT(), etc) that appears in the SELECT list to be in the GROUP BY.
The behavior of MySQL & SQLite when columns from SELECT aren't listed in GROUP BY is not well defined. If for example, you execute a query like:
SELECT firstname, lastname, COUNT(*) FROM names GROUP BY lastname
MySQL would give you a result without complaint.
However, if your table included two different values of firstname having the same lastname, your resultant COUNT(*) would count both of them while only returning the firstname of one of them. What's more, which firstname MySQL chooses to return isn't defined so you can't really rely on it returning the first of the pair, for example.
From a table like:
firstname, lastname
--------------------
Jane Smith
John Smith
Peter Jones
The not-fully-correct result might be:
firstname, lastname, COUNT(*)
-----------------------------
Jane Smith 2 <----wrong!
Peter Jones 1
Outside MySQL & SQLite, columns referenced anywhere in the SELECT list not also appearing in the GROUP BY will result in a query parse error.
Commonly here on Stack Overflow, we encounter users with questions about the GROUP BY, having just begun working with an RDBMS that is stricter about its usage. If you learn aggregates in MySQL first, chances are you'll need to relearn to do them properly when moving to a different RDBMS.
I have my query like this:
Select
a.abc,
a.cde,
a.efg,
a.agh,
c.dummy
p.test
max(b.this)
sum(b.sugar)
sum(b.bucket)
sum(b.something)
followed by some outer join and inner join. Now the problem is when in group by
group by
a.abc,
a.cde,
a.efg,
a.agh,
c.dummy,
p.test
The query works fine. But if I remove any one of them from group by it gives:
SQLSTATE: 42803
Can anyone explain the cause of this error?
Generally, any column that isn't in the group by section can only be included in the select section if it has an aggregating function applied to it. Or, another way, any non-aggregated data in the select section must be grouped on.
Otherewise, how do you know what you want done with it. For example, if you group on a.abc, there can only be one thing that a.abc can be for that grouped row (since all other values of a.abc will come out in a different row). Here's a short example, with a table containing:
LastName FirstName Salary
-------- --------- ------
Smith John 123456
Smith George 111111
Diablo Pax 999999
With the query select LastName, Salary from Employees group by LastName, you would expect to see:
LastName Salary
-------- ------
Smith ??????
Diablo 999999
The salary for the Smiths is incalculable since you don't know what function to apply to it, which is what's causing that error. In other words, the DBMS doesn't know what to do with 123456 and 111111 to get a single value for the grouped row.
If you instead used select LastName, sum(Salary) from Employees group by LastName (or max() or min() or ave() or any other aggregating function), the DBMS would know what to do. For sum(), it will simply add them and give you 234567.
In your query, the equivalent of trying to use Salary without an aggregating function is to change sum(b.this) to just b.this but not include it in the group by section. Or alternatively, remove one of the group by columns without changing it to an aggregation in the select section.
In both cases, you'll have one row that has multiple possible values for the column.
The DB2 docs at publib for sqlstate 42803 describe your problem:
A column reference in the SELECT or HAVING clause is invalid, because it is not a grouping column; or a column reference in the GROUP BY clause is invalid.
SQL will insist that any column in the SELECT section is either included in the GROUP BY section or has an aggregate function applied to it in the SELECT section.
This article gives a nice explanation of why this is the case. The article is sql server specific but the principle should be roughly similar for all RDBMS
I know that if you have one aggregate function in a SELECT statement, then all the other values in the statement must be either aggregate functions, or listed in a GROUP BY clause. I don't understand why that's the case.
If I do:
SELECT Name, 'Jones' AS Surname FROM People
I get:
NAME SURNAME
Dave Jones
Susan Jones
Amy Jones
So, the DBMS has taken a value from each row, and appended a single value to it in the result set. That's fine. But if that works, why can't I do:
SELECT Name, COUNT(Name) AS Surname FROM People
It seems like the same idea, take a value from each row and append a single value. But instead of:
NAME SURNAME
Dave 3
Susan 3
Amy 3
I get:
You tried to execute a query that does not include the specified expression 'ContactName' as part of an aggregate function.
I know it's not allowed, but the two circumstances seem so similar that I don't understand why. Is it to make the DBMS easier to implement? If anyone can explain to me why it doesn't work like I think it should, I'd be very grateful.
Aggregates doesn't work on a complete result, they only work on a group in a result.
Consider a table containing:
Person Pet
-------- --------
Amy Cat
Amy Dog
Amy Canary
Dave Dog
Susan Snake
Susan Spider
If you use a query that groups on Person, it will divide the data into these groups:
Amy:
Amy Cat
Amy Dog
Amy Canary
Dave:
Dave Dog
Susan:
Susan Snake
Susan Spider
If you use an aggreage, for exmple the count aggregate, it will produce one result for each group:
Amy:
Amy Cat
Amy Dog
Amy Canary count(*) = 3
Dave:
Dave Dog count(*) = 1
Susan:
Susan Snake
Susan Spider count(*) = 2
So, the query select Person, count(*) from People group by Person gives you one record for each group:
Amy 3
Dave 1
Susan 2
If you try to get the Pet field in the result also, that doesn't work because there may be multiple values for that field in each group.
(Some databases, like MySQL, does allow that anyway, and just returns any random value from within the group, and it's your responsibility to know if the result is sensible or not.)
If you use an aggregate, but doesn't specify any grouping, the query will still be grouped, and the entire result is a single group. So the query select count(*) from Person will create a single group containing all records, and the aggregate can count the records in that group. The result contains one row from each group, and as there is only one group, there will be one row in the result.
Think about it this way: when you call COUNT without grouping, it "collapses" the table to a single group making it impossible to access the individual items within a group in a select clause.
You can still get your result using a subquery or a cross join:
SELECT p1.Name, COUNT(p2.Name) AS Surname FROM People p1 CROSS JOIN People p2 GROUP BY p1.Name
SELECT Name, (SELECT COUNT(Name) FROM People) AS Surname FROM People
As others explained, when you have a GROUP BY or you are using an aggregate function like COUNT() in the SELECT list, you are doing a grouping of rows and therefore collapsing matching rows into one for every group.
When you only use aggregate functions in the SELECT list, without GROUP BY, think of it as you have a GROUP BY 1, so all rows are grouped, collapsed into one. So, if you have a hundred rows, the database can't really show you a name as there are a hundred of them.
However, for RDBMSs that have "windowing" functions, what you want is feasible. E.g. use aggregate functions without a GROUP BY.
Example for SQL-Server, where all rows (names) in the table are counted:
SELECT Name
, COUNT(*) OVER() AS cnt
FROM People
How does the above work?
It shows the Name like the
COUNT(*) OVER() AS cnt did not
exist and
It shows the COUNT(*) like if it was making a total grouping of the
table.
Another example. If you have a Surname field on the table, you can have something like this to show all rows grouped by Surname and counting how many people have same Surname:
SELECT Name
, Surname
, COUNT(*) OVER(PARTITION BY Surname) AS cnt
FROM People
Your query implicitly asks for different types of rows in your result set, and that is not allowed. All rows returned should be of the same type and have the same kind of columns.
'SELECT name, surname' wants to returns a row for every row in the table.
'SELECT COUNT(*)' wants to return a single row combining the results of all the rows in the table.
I think you're correct that in this case the database could plausibly just do both queries and then copy the result of 'SELECT COUNT(*)' into every result. One reason for not doing this is that it would be a stealth performance hit: you'd effectively be doing an extra self-join without declaring it anywhere.
Other answers have explained how to write a working version of this query, so I won't go into that.
The aggregate function and the group by clause aren't separate things, they're parts of the same thing that appear in different places in the query. If you wish to aggregate on a column, you must say what function to use for aggregation; if you wish to have an aggregation function, it has to be applied over some column.
The aggregate function takes values from multiple rows with a specific condition and combines them into one value. This condition is defined by the GROUP BYin your statement. So you can't use an aggregate function without a GROUP BY
With
SELECT Name, 'Jones' AS Surname FROM People
you simply select an additional column with a fixed value... but with
SELECT Name, COUNT(Name) AS Surname FROM People GROUP BY Name
you tell the DBMS to select the Names, remember how often every Name occured in the table and collapse them into one row. So if you omit the GROUP BY the DBMS can't tell, how to collapse the records
I'm trying to make sense of the right way to use JOIN, COUNT(*), and GROUP BY to do a pretty simple query. I've actually gotten it to work (see below) but from what I've read, I'm using an extra GROUP BY that I shouldn't be.
(Note: The problem below isn't my actual problem (which deals with more complicated tables), but I've tried to come up with an analogous problem)
I have two tables:
Table: Person
-------------
key name cityKey
1 Alice 1
2 Bob 2
3 Charles 2
4 David 1
Table: City
-------------
key name
1 Albany
2 Berkeley
3 Chico
I'd like to do a query on the People (with some WHERE clause) that returns
the number of matching people in each city
the key for the city
the name of the city.
If I do
SELECT COUNT(Person.key) AS count, City.key AS cityKey, City.name AS cityName
FROM Person
LEFT JOIN City ON Person.cityKey = City.key
GROUP BY Person.cityKey, City.name
I get the result that I want
count cityKey cityName
2 1 Albany
2 2 Berkeley
However, I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
So what's the right way to do this? I've been trying to google for an answer, but I feel like there's something fundamental that I'm just not getting.
I don't think that it's "wrong" in this case, because you've got a one-to-one relationship between city name and city key. You could rewrite it such that you join to a sub-select to get the count of persons to cities by key, to the city table again for the name, but it's debatable that that'd be better. It's a matter of style and opinion I guess.
select PC.ct, City.key, City.name
from City
join (select count(Person.key) ct, cityKey key from Person group by cityKey) PC
on City.key = PC.key
if my SQL isn't too rusty :-)
...I've read that throwing in that last part of the GROUP BY clause (City.name) just to make it work is wrong.
You misunderstand, you got it backwards.
Standard SQL requires you to specify in the GROUP BY all the columns mentioned in the SELECT that are not wrapped in aggregate functions. If you don't want certain columns in the GROUP BY, wrap them in aggregate functions. Depending on the database, you could use the analytic/windowing function OVER...
However, MySQL and SQLite provide the "feature" where you can omit these columns from the group by - which leads to no end of "why doesn't this port from MySQL to fill_in_the_blank database?!" Stackoverflow and numerous other sites & forums.
However, I've read that throwing in
that last part of the GROUP BY clause
(City.name) just to make it work is
wrong.
It's not wrong. You have to understand how the Query Optimizer sees your query. The order in which it is parsed is what requires you to "throw the last part in." The optimizer sees your query in something akin to this order:
the required tables are joined
the composite dataset is filtered through the WHERE clause
the remaining rows are chopped into groups by the GROUP BY clause, and aggregated
they are then filtered again, through the HAVING clause
finally operated on, by SELECT / ORDER BY, UPDATE or DELETE.
The point here is that it's not that the GROUP BY has to name all the columns in the SELECT, but in fact it is the opposite - the SELECT cannot include any columns not already in the GROUP BY.
Your query would only work on MySQL, because you group on Person.cityKey but select city.key. All other databases would require you to use an aggregate like min(city.key), or to add City.key to the group by clause.
Because the combination of city name and city key is unique, the following are equivalent:
select count(person.key), min(city.key), min(city.name)
...
group by person.citykey
Or:
select count(person.key), city.key, city.name
...
group by person.citykey, city.key, city.name
Or:
select count(person.key), city.key, max(city.name)
...
group by city.key
All rows in the group will have the same city name and key, so it doesn't matter if you use the max or min aggregate.
P.S. If you'd like to count only different persons, even if they have multiple rows, try:
count(DISTINCT person.key)
instead of
count(person.key)
Whats the best way to do this, when looking for distinct rows?
SELECT DISTINCT name, address
FROM table;
I still want to return all fields, ie address1, city etc but not include them in the DISTINCT row check.
Then you have to decide what to do when there are multiple rows with the same value for the column you want the distinct check to check against, but with different val;ues in the other columns. In this case how does the query processor know which of the multiple values in the other columns to output, if you don't care, then just write a group by on the distinct column, with Min(), or Max() on all the other ones..
EDIT: I agree with comments from others that as long as you have multiple dependant columns in the same table (e.g., Address1, Address2, City, State ) That this approach is going to give you mixed (and therefore inconsistent ) results. If each column attribute in the table is independant ( if addresses are all in an Address Table and only an AddressId is in this table) then it's not as significant an issue... cause at least all the columns from a join to the Address table will generate datea for the same address, but you are still getting a more or less random selection of one of the set of multiple addresses...
This will not mix and match your city, state, etc. and should give you the last one added even:
select b.*
from (
select max(id) id, Name, Address
from table a
group by Name, Address) as a
inner join table b
on a.id = b.id
When you have a mixed set of fields, some of which you want to be DISTINCT and others that you just want to appear, you require an aggregate query rather than DISTINCT. DISTINCT is only for returning single copies of identical fieldsets. Something like this might work:
SELECT name,
GROUP_CONCAT(DISTINCT address) AS addresses,
GROUP_CONCAT(DISTINCT city) AS cities
FROM the_table
GROUP BY name;
The above will get one row for each name. addresses contains a comma delimted string of all the addresses for that name once. cities does the sames for all the cities.
However, I don't see how the results of this query are going to be useful. It will be impossible to tell which address belongs to which city.
If, as is often the case, you are trying to create a query that will output rows in the format you require for presentation, you're much better off accepting multiple rows and then processing the query results in your application layer.
I don't think you can do this because it doesn't really make sense.
name | address | city | etc...
abc | 123 | def | ...
abc | 123 | hij | ...
if you were to include city, but not have it as part of the distinct clause, the value of city would be unpredictable unless you did something like Max(city).
You can do
SELECT DISTINCT Name, Address, Max (Address1), Max (City)
FROM table
Use #JBrooks answer below. He has a better answer.
Return all Fields and Distinct Rows
If you're using SQL Server 2005 or above you can use the RowNumber function. This will get you the row with the lowest ID for each name. If you want to 'group' by more columns, add them in the PARTITION BY section of the RowNumber.
SELECT id, Name, Address, ...
(select id, Name, Address, ...,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY id) AS RowNo
from table) sub
WHERE RowNo = 1