Aggregating on a column that is also being grouped on - sql

I know there's a lot of confusion related to grouping/aggregation etc, and I thought that I had a pretty decent grasp on the whole thing until I saw something along the lines of
SELECT A, SUM(B)
FROM T
GROUP BY A
HAVING COUNT(A)>1;
At first this puzzled me, since performing an aggregate on a column that is also being grouped on seemed redundant: by definition, every row in a group has the same value for that column. But then I thought about it, and it kind of makes sense for duplicate values in the table, if the aggregation were done before the grouping. In my head, it seems like it's treating it more like this kind of query
SELECT A, SUM(B)
FROM T
WHERE A in (SELECT A FROM T GROUP BY A HAVING COUNT(*)>1)
GROUP BY A;
As opposed to another selection operator on each group after the grouping is done (since to me that doesn't make much sense).
So my question is multifold: Can elements being grouped on be included in the HAVING clause at all? Can elements being grouped on be aggregated on (in the HAVING clause or elsewhere like SELECT clause)? If the previous statements hold, is my understanding of what this operation means correct?
NOTE: This question is mainly about standard (ANSI) SQL, but info on particular implementations would also be interesting.

The arguments to an aggregation function can include the keys being grouped on.
That said, the more common way to count rows in each group is to use COUNT(*). I would recommend:
SELECT A, SUM(B)
FROM T
GROUP BY A
HAVING COUNT(*) > 1;
There is a slight overhead to using COUNT(A) because the value of A needs to be checked against NULL in each row.
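To see where the two forms actually differ, here is a minimal sketch that runs both HAVING variants against a tiny made-up table. It uses SQLite through Python's sqlite3 module purely as a convenient stand-in; the sample rows are my own assumption, and other engines may sort NULL differently in the ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (A INTEGER, B INTEGER)")
# Hypothetical sample data: A has duplicates and two NULLs.
conn.executemany(
    "INSERT INTO T (A, B) VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 5), (None, 7), (None, 8)],
)

# HAVING is evaluated once per group, after the grouping is done.
# COUNT(*) counts every row in the group.
by_count_star = conn.execute(
    "SELECT A, SUM(B) FROM T GROUP BY A HAVING COUNT(*) > 1 ORDER BY A"
).fetchall()

# COUNT(A) counts only the rows where A is not NULL,
# so the NULL group has a count of 0 and is filtered out.
by_count_a = conn.execute(
    "SELECT A, SUM(B) FROM T GROUP BY A HAVING COUNT(A) > 1 ORDER BY A"
).fetchall()

print(by_count_star)  # [(None, 15), (1, 30)]
print(by_count_a)     # [(1, 30)]
```

So the two queries only disagree about the group where A is NULL; for every other group COUNT(A) and COUNT(*) give the same number.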

Related

In SQL, does groupby on an ordered query behave the same as doing both in the same query?

Are the following queries identical, or might I get different results (in any major DB system, e.g. MSSQL, MySQL, Postgres, SQLite):
Doing both in the same query:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
vs. ordering in a subquery:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Looking at the first sample:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
Let's think about what GROUP BY does by looking at this imaginary sample data:
A B
- -
1 1
1 2
Then think about this query:
SELECT A
FROM SampleData
GROUP BY A
ORDER BY B
The GROUP BY clause puts the two rows into a single group. Then we want to order by B... but the two rows in the group have different values for B. Which should it use?
Obviously in this situation it doesn't really matter: there's only one row in the results, so the order is not relevant. But generally, how does the database know what to do?
The database could guess which one you want, or just take the first value, or the last (whatever those mean in a setting where the data is unordered by definition). And in fact this is what MySQL will try to do for you, at least when ONLY_FULL_GROUP_BY is disabled: it will try to guess at your meaning. But this response is really inappropriate. You specified an inexact query; the only correct thing to do is throw an error, which is what most databases will do.
Now let's look at the second sample:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Here it is important to remember databases have their roots in relational set theory, and what we think of as "tables" are more formally described as Unordered Relations. Again: the idea of being "unordered" is baked into the very nature of a table at the deepest level.
In this case the inner query can run and create results in the specified order, and then the outer query can use that with GROUP BY to create a new set... but just like tables, query results are unordered relations. Without an ORDER BY clause the final result is also unordered by definition.
Now you might tend to get results in the order you want, but the reality is all bets are off. In fact, the databases that run this query will tend to give you results in the order in which they first encountered each group, which will not tend to match the ORDER BY because the GROUP BY expression is looking at completely different columns. Other databases (SQL Server is in this group) will not even allow the query to run, though I might prefer a warning here.
So now we come to the final section, where we must re-think the question, like this:
How can I use GROUP BY on the one group column, while also ordering by some_other_column not in the group?
The answer is each group can contain multiple rows, and so you must tell the database which row to look at to get the correct (specific) some_other_column value. The typical way to do this is with another aggregate function, which might look like this:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_agg_func(some_other_column)
That code will run without error on pretty much any database.
Just be careful here. On one hand, when people want to do this it's often for the common case where they know every record for some_other_column in each group will have the same value. For example, you might GROUP BY UserID, but ORDER BY Email, where of course every record with the same UserID should have the same Email address. As humans, we have the ability to make that kind of inference. Computers, however, don't handle that kind of thinking as well, and so we help it out with an extra aggregate function like MIN() or MAX().
On the other hand, if you're not careful, the two aggregate functions sometimes don't match up: you end up showing a value taken from one row in the group while ordering by a value taken from a completely different row in the same group, and the resulting order looks wrong next to the values displayed.
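As a rough illustration of ordering by an aggregate, here is a small sketch using SQLite via Python's sqlite3 module. The column is named grp because group is a reserved word, and the rows are invented for the example. It also shows that the choice of aggregate in the ORDER BY changes the result order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (grp TEXT, some_value INTEGER, some_other_column INTEGER)"
)
# Hypothetical rows: each group has differing some_other_column values.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("x", 1, 30), ("x", 2, 40), ("y", 5, 10), ("y", 6, 50), ("z", 9, 20)],
)

# Order groups by the smallest some_other_column seen in each group.
by_min = conn.execute(
    """
    SELECT grp, SUM(some_value)
    FROM my_table
    GROUP BY grp
    ORDER BY MIN(some_other_column)
    """
).fetchall()

# Same grouping, but ordered by the largest value instead.
by_max = conn.execute(
    """
    SELECT grp, SUM(some_value)
    FROM my_table
    GROUP BY grp
    ORDER BY MAX(some_other_column)
    """
).fetchall()

print(by_min)  # [('y', 11), ('z', 9), ('x', 3)]
print(by_max)  # [('z', 9), ('x', 3), ('y', 11)]
```

When every row in a group really does share the same value (the UserID/Email case), MIN() and MAX() agree and the choice doesn't matter; when they don't, you have to decide which one you mean.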
Tables are unordered sets of data. A query result is a table. So if you select from a subquery that contains an ORDER BY clause, that clause means nothing; the data set is unordered by definition. The DBMS is free to ignore the ORDER BY clause. Some DBMS may even issue a warning or error, but I suppose it's more common that the ORDER BY clause just has no effect, or at least no guaranteed effect.
In this query
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
you try to order your results by some_other_value. If this is meant to be a column, you can't, because that other column is not part of your results. You'll get an error. If some_other_value is a fixed value, then there is nothing ordered, because you'd have the same sort key for every row. But it can be an expression based on your result data (group key and aggregation results), and you can order your result rows by that.
In this query
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
the ORDER BY clause has no effect. You could just as well just select FROM my_table directly:
SELECT group, some_agg_func(some_value)
FROM my_table as alias
GROUP BY group
This gets the results unordered (or at least the order you see is not guaranteed to be thus every time you run that query), because your query doesn't have an ORDER BY clause.

Still confusing the rules around selecting columns, group by, and joins

I am still confused by the syntax rules of using GROUP BY. I understand we use GROUP BY when there is some aggregate function. If I have even one aggregate function in a SQL statement, do I need to put all of my selected columns into my GROUP BY statement? I don't have a specific query to ask about but when I try to do joins, I get errors. In particular, when I use a count(*) in a statement and/or a join, I just seem to mess it up.
I use BigQuery at my job. I am regularly floored by strange gaps in knowledge.
Thank you!
This is a little complicated.
First, no aggregation functions are needed in an aggregation query. So this is allowed:
select a
from t
group by a;
This is equivalent, by the way, to:
select distinct a
from t;
If there are aggregation functions, then no group by is needed. So, this is allowed:
select max(a)
from t;
Such an aggregation query -- with no group by -- always returns one row. This is true even if the table is empty or a where clause filters out all the rows. In that case, most aggregation functions return NULL, with the notable exception of count(), which returns 0.
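Here is a quick sketch of both points, run against SQLite through Python's sqlite3 module (BigQuery should behave the same way for these queries, but I have only checked them in SQLite; the table contents are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")

# Empty table, aggregate without GROUP BY: exactly one row comes back.
no_group = conn.execute("SELECT MAX(a), COUNT(*) FROM t").fetchall()

# Empty table, aggregate with GROUP BY: no groups, so no rows.
with_group = conn.execute("SELECT a, COUNT(*) FROM t GROUP BY a").fetchall()

print(no_group)    # [(None, 0)]
print(with_group)  # []

# With some data, GROUP BY without aggregates matches SELECT DISTINCT.
conn.executemany("INSERT INTO t (a) VALUES (?)", [(1,), (1,), (2,)])
grouped = sorted(conn.execute("SELECT a FROM t GROUP BY a").fetchall())
distinct = sorted(conn.execute("SELECT DISTINCT a FROM t").fetchall())

print(grouped == distinct)  # True
```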
Next, if you mix aggregation functions and non-aggregation expressions in the select, then in general you want the non-aggregation, non-constant expressions in the group by. I should note that you can do:
select a, concat(a, 'bcd'), count(*)
from t
group by a;
This should work, but sometimes BigQuery gets confused and will want the expression in the group by.
Finally, the SQL standard supports a query like this:
select t.*, count(*)
from t join
u
using (foo)
group by t.a;
This is allowed when a is the primary key (or equivalent) in t. However, BigQuery does not have primary keys, so this is not relevant to that database.

SQL Syntax - Why do we need to list individual fields in an SQL group-by statement?

My understanding of using summary functions in SQL is that each field in the select statement that doesn't use a summary function, should be listed in the group by statement.
select a, b, c, sum(n) as sum_of_n
from table
group by a, b, c
My question is, why do we need to list the fields? Shouldn't the SQL syntax parser be implemented in a way that we can just tell it to group and it can figure out the groups based on whichever fields are in the select and aren't using summary functions?:
select a, b, c, sum(n) as sum_of_n
from table
group
I feel like I'm unnecessarily repeating myself when I write SQL code. What circumstances exist where we would not want it to automatically figure this out, or where it couldn't automatically figure this out?
To decrease the chances of errors in your statement. Explicitly spelling out the GROUP BY columns helps to ensure that the user wrote what they intended to write. You might be surprised at the number of posts that show up on Stack Overflow in which the user is grouping on columns that make no sense, but they have no idea why they aren't getting the data that they expect.
Also, consider the scenario where a user might want to group on more columns than are actually in the SELECT statement. For example, if I wanted the average of the most money that my customers have spent then I might write something like this:
SELECT
AVG(max_amt)
FROM (SELECT MAX(amt) AS max_amt FROM Invoices GROUP BY customer_id) SQ
In this case I can't simply use GROUP; I need to spell out the column(s) on which I'm grouping, because customer_id does not appear in the SELECT list at all. The SQL engine could allow the user to explicitly list columns and fall back to a default when they are not listed, but then the chances of bugs increase drastically.
One way to think of it is like strongly typed programming languages. Making the programmer explicitly spell things out decreases the chance of bugs popping up because the engine made an assumption that the programmer didn't expect.
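For what it's worth, here is a small runnable sketch of the "average of each customer's largest invoice" case, using SQLite through Python's sqlite3 module. The Invoices rows are invented for illustration. Note that the grouping column customer_id never appears in any SELECT list, so an implicit GROUP could not have inferred it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Invoices (customer_id INTEGER, amt INTEGER)")
# Hypothetical invoices for three customers.
conn.executemany(
    "INSERT INTO Invoices (customer_id, amt) VALUES (?, ?)",
    [(1, 100), (1, 300), (2, 50), (2, 100), (3, 200)],
)

# Inner query: one row per customer holding that customer's largest invoice.
# Outer query: average of those per-customer maxima (300, 100, 200).
avg_of_max = conn.execute(
    """
    SELECT AVG(max_amt)
    FROM (SELECT MAX(amt) AS max_amt FROM Invoices GROUP BY customer_id) SQ
    """
).fetchone()[0]

print(avg_of_max)  # 200.0
```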
This is required so that you state explicitly how you want to group the records because, for example, you may group on columns that are not listed in the result set.
However, there are RDBMSs, such as MySQL, that allow selected columns to be left out of the GROUP BY clause when aggregate functions are used.
My first reaction would be that 'it is what it is' =)
But on thinking it through, the reason TSQL works like this is because the SELECT and the GROUP BY are two distinct parts of all the operations going on in the query.
This might not be the best example, but it does show that you can GROUP on different (well, 'more') fields than you are actually SELECTing.
SELECT brand = Convert(varchar(100), ''), model = Convert(varchar(100), ''), some_number = Convert(int, 0)
INTO #test
WHERE 1 = 2
INSERT #test (brand, model, some_number)
VALUES ('Ford', 'Focus', 10),
('Ford', 'Focus', 25),
('Ford', 'Kagu', 23),
('DMC', '12', 88)
SELECT brand, model, MAX(some_number)
FROM #test
GROUP BY brand, model
SELECT brand, MAX(some_number)
FROM #test
GROUP BY brand, model
Not all RDBMSs are like this, e.g. MySQL allows for omitting fields from the GROUP BY that are nevertheless in the SELECT part. From what I've seen, it then picks an arbitrary value ('there is no such thing as an implicit first') and uses that in the SELECT. At least I think so; my knowledge of MySQL is rather limited, but I've seen some examples here and there and they always confused me, as I'm used to the strict requirement of TSQL you just described.
In addition, you can group by your columns in a different order than select
select a, b, c, sum(d)
from table
group by c,a,b
Also, a lot of DBs allow you to skip the column names; you can just specify which columns are included in the group by using their position in the select list
select a, b, c, sum(d)
from table
group by 3,1,2
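A quick way to convince yourself that the two spellings are equivalent is to run both. This sketch uses SQLite via Python's sqlite3 module (SQLite is one of the databases that accepts positions in GROUP BY; not every engine does). The table is called t here because table is a reserved word, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT, c TEXT, d INTEGER)")
# Hypothetical rows: the first two share the same (a, b, c) combination.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [("x", "p", "m", 1), ("x", "p", "m", 2), ("y", "q", "n", 5)],
)

# Grouping columns listed by name, in a different order than the select list.
by_name = conn.execute(
    "SELECT a, b, c, SUM(d) FROM t GROUP BY c, a, b ORDER BY a"
).fetchall()

# Same grouping, with columns referenced by select-list position.
by_position = conn.execute(
    "SELECT a, b, c, SUM(d) FROM t GROUP BY 3, 1, 2 ORDER BY a"
).fetchall()

print(by_name)      # [('x', 'p', 'm', 3), ('y', 'q', 'n', 5)]
print(by_position)  # [('x', 'p', 'm', 3), ('y', 'q', 'n', 5)]
```

The positional form is shorter but breaks silently if someone later reorders the select list, so I would treat it as a convenience for ad hoc queries.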

SQL Group By Column Part Number giving the data from most recent received date

I'm new to SQL. My GROUP BY is not working; I want it to pull the most recent POReleases.DateReceived date and group by part number. Here is what I have
SELECT POReleases.PONum, POReleases.PartNo, POReleases.JobNo, POReleases.Qty, POReleases.QtyRejected, POReleases.QtyCanceled, POReleases.DueDate, POReleases.DateReceived, PODet.ProdCode, PODet.Unit, PODet.UnitCost, PODet.QtyOrd, PODet.QtyRec, PODet.QtyReject, PODet.QtyCancel
FROM Waples.dbo.PODet PODet, Waples.dbo.POReleases POReleases
WHERE PODet.PartNo = POReleases.PartNo AND PODet.PONum = POReleases.PONum AND ((POReleases.DateReceived>{ts '2010-01-01 00:00:00'}))
GROUP BY PartNo
For starters, every non-aggregated column in the SELECT list must also appear in the GROUP BY clause. In your case only "PartNo" is used in the GROUP BY clause, whereas many other columns are used in the SELECT statement.
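You can use a CTE with ROW_NUMBER() to achieve this: number the rows within each PartNo starting from the most recent DateReceived, then keep only the first row of each partition,
WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY PartNo ORDER BY DateReceived DESC) AS PartNoCount
FROM TABLENAME
) SELECT * FROM CTE WHERE PartNoCount = 1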
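To make the "latest row per part" idea concrete, here is a cut-down sketch run against SQLite through Python's sqlite3 module. It assumes SQLite 3.25 or newer (needed for window functions). The table name po_releases, the reduced column set and the sample rows are all stand-ins of mine, not the poster's real schema. The key step is keeping only the rows numbered 1 within each PartNo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE po_releases (PartNo TEXT, PONum INTEGER, DateReceived TEXT)"
)
# Hypothetical data: two receipts per part, dates stored as ISO strings
# so that text ordering matches chronological ordering.
conn.executemany(
    "INSERT INTO po_releases VALUES (?, ?, ?)",
    [
        ("P1", 100, "2010-02-01"),
        ("P1", 101, "2011-05-10"),
        ("P2", 200, "2010-03-15"),
        ("P2", 201, "2010-01-20"),
    ],
)

latest = conn.execute(
    """
    WITH CTE AS (
        SELECT PartNo, PONum, DateReceived,
               ROW_NUMBER() OVER (
                   PARTITION BY PartNo ORDER BY DateReceived DESC
               ) AS rn
        FROM po_releases
    )
    SELECT PartNo, PONum, DateReceived
    FROM CTE
    WHERE rn = 1          -- keep only the most recent row per part
    ORDER BY PartNo
    """
).fetchall()

print(latest)  # [('P1', 101, '2011-05-10'), ('P2', 200, '2010-03-15')]
```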
When you write an SQL statement, you should think about the logical flow, which might be technically slightly inaccurate due to optimizations, but still, it is a good thing to think about it like this:
without the from clause specifying the source relation, the filter cannot be evaluated, so at least logically, the from is the first thing to evaluate
without the where clause specifying which records should be kept from the source relation, the filtered records cannot be grouped, so, at least logically, the where precedes the group by
without the group by, specifying the groups, you cannot select values from the groups, so, at least logically, group by precedes select
So, the projection (select) is executed on the groups of filtered records, not on the individual records. Since each group has an attribute, namely PartNo, that column becomes an aggregated column. The other columns, which were reachable before the group by, can no longer be reached in the select. If you want to reach them, you need to group by them as well or wrap them in aggregate functions: once you have a group by, you can select only aggregated columns, which are either aggregate functions or columns that became aggregated through their presence in the group by.
Since you did not specify how this query is not working, I will have to assume that you get an error in the selection, because you refer to columns which are not aggregated. Also, you might want to use an explicit join instead of a Cartesian product filtered in the where clause, and finally, if you want to filter the groups rather than the records of the initial relation (which is the result of a Cartesian product in your case), then you might consider using a having clause.
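The difference between filtering records (where) and filtering groups (having) is easy to see on a toy table. This is only a sketch, run in SQLite via Python's sqlite3 module, with a made-up sales table that is not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (part TEXT, qty INTEGER)")
# Hypothetical rows.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 5), ("a", 1), ("b", 2), ("b", 2), ("c", 10)],
)

# WHERE runs before grouping: the row ('a', 1) is dropped first,
# then the remaining rows are grouped and summed.
where_first = conn.execute(
    "SELECT part, SUM(qty) FROM sales WHERE qty > 1 GROUP BY part ORDER BY part"
).fetchall()

# HAVING runs after grouping: all rows are summed per part,
# then groups whose total is not above 5 are dropped.
having_after = conn.execute(
    "SELECT part, SUM(qty) FROM sales GROUP BY part HAVING SUM(qty) > 5 ORDER BY part"
).fetchall()

print(where_first)   # [('a', 5), ('b', 4), ('c', 10)]
print(having_after)  # [('a', 6), ('c', 10)]
```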

Why does SQL force me to repeat all non-aggregated fields from my SELECT clause in my GROUP BY clause? [closed]

This has bugged me for a long time.
99% of the time, the GROUP BY clause is an exact copy of the SELECT clause, minus the aggregate functions (MAX, SUM, etc.).
This breaks the Don't Repeat Yourself principle.
When can the GROUP BY clause not contain an exact copy of the SELECT clause minus the aggregate functions?
edit
I realise that some implementations allow you to have different fields in the GROUP BY than in the SELECT (hence 99%, not 100%), but surely that's a very minor exception?
Can someone explain what is supposed to be returned if you use different fields?
Thanks.
I tend to agree with you - this is one of many cases where SQL should have slightly smarter defaults to save us all some typing. For example, imagine if this were legal:
Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By *
where "*" meant "all the non-aggregate fields". If everybody knew that's how it worked, then there would be no confusion. You could sub in a specific list of fields if you wanted to do something tricky, but the splat means "all of 'em" (which in this context means, all the possible ones).
Granted, "*" means something different here than in the SELECT clause, so maybe a different character would work better:
Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By !
There are a few other areas like that where SQL just isn't as eloquent as it could be. But at this point, it's probably too entrenched to make many big changes like that.
Because they are two different things: you can group by items that aren't in the select clause.
EDIT:
Also, is it safe to make that assumption?
I have a SQL statement
Select ClientName, InvAmt, Sum(PayAmt) as PayTot
Is it "correct" for the server to assume I want to group by ClientName AND InvAmt?
I personally prefer (and think it's safer) to have this code
Select ClientName, InvAmt, Sum(PayAmt) as PayTot
Group By ClientName
throw an error, prompting me to change the code to
Select ClientName, Sum(InvAmt) as InvTot, Sum(PayAmt) as PayTot
Group By ClientName
I hope/expect we'll see something more comprehensive soon; a SQL history lesson on the subject would be useful and informative. Anyone? Anyone? Bueller?
In the meantime, I can observe the following:
SQL predates the DRY principle, at least as far as it was documented in The Pragmatic Programmer.
Not all DBs require the full list: Sybase, for example, will happily execute queries like
SELECT a, b, COUNT(*)
FROM some_table
GROUP BY a
... which (at least every time I accidentally ran such a monster) often leads to such enormous inadvertent recordsets that panic-stricken requests quickly ensue, begging the DBAs to bounce the server. The result is a sort of partial Cartesian product, but I think it may mostly be a failure on Sybase's part to implement the SQL standard properly.
Perhaps we need a shorthand form - call it GroupSelect
GroupSelect Field1, Field2, sum(Field3) From SomeTable Where (X = "3")
This way, the parser need only throw an error if you leave out an aggregate function.
The good reason for it is that you would get incorrect results more often than not if you did not specify all columns. Suppose you have three columns, col1, col2 and col3.
Suppose your data looks like this:
Col1 Col2 Col3
a b 1
a c 1
b b 2
a b 3
select col1, col2, sum(col3) from mytable group by col1, col2
would give the following results:
Col1 Col2 Col3
a b 4
a c 1
b b 2
How would it interpret
select col1, col2, sum(col3) from mytable group by col1
My guess would be
Col1 Col2 Col3
a b 5
a c 5
b b 2
These are clearly bad results. Of course the more complex the query and the more joins the less likely it would be that the query would return correct results or that the programmer would even know if they were incorrect.
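If you want to check the numbers above, this sketch loads the same four rows into SQLite (through Python's sqlite3 module) and runs the well-defined queries. Grouping by col1 alone is shown without col2 in the select list, since that is the only form whose result is unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 TEXT, col2 TEXT, col3 INTEGER)")
# The four sample rows from the table above.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [("a", "b", 1), ("a", "c", 1), ("b", "b", 2), ("a", "b", 3)],
)

# Every non-aggregated column is listed in the GROUP BY.
by_both = conn.execute(
    """
    SELECT col1, col2, SUM(col3)
    FROM mytable
    GROUP BY col1, col2
    ORDER BY col1, col2
    """
).fetchall()

# Grouping on col1 only: col2 has no single value per group,
# so it is left out of the select list.
by_col1 = conn.execute(
    "SELECT col1, SUM(col3) FROM mytable GROUP BY col1 ORDER BY col1"
).fetchall()

print(by_both)  # [('a', 'b', 4), ('a', 'c', 1), ('b', 'b', 2)]
print(by_col1)  # [('a', 5), ('b', 2)]
```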
Personally I'm glad that group by requires the fields.
I agree with GROUP BY ALL, GROUP BY *, or something similar. As mentioned in the original post, in 99% (perhaps more) of the cases you want to group by all non-aggregate columns/expressions.
Here is however one example where you would need GROUP BY columns, for backward compatibility reasons.
SELECT
MIN(COUNT(*)) min_same_combination_cnt,
MAX(COUNT(*)) max_same_comb_cnt,
AVG(COUNT(*)) avg_same_comb_cnt,
SUM(COUNT(*)) total_records,
COUNT(COUNT(*)) distinct_combinations_cnt
FROM <some table>
GROUP BY <list of columns>
This works in Oracle. I use it to estimate selectivity on columns. The group by is applied to the inner aggregate function. Then, the outer aggregate is applied.
It would be nice to put forward a suggestion for this improvement to the SQL standard. I just don't know how that works.
Actually, wouldn't that be 100% of the time? Is there a case in which you can have a (non-aggregate) column in the select that is not in the GROUP BY?
I don't have an answer though. It certainly does seem like an awkward moment for the language.
I share the OP's view that repeating is a bit annoying, especially if the non-aggregate fields contain elaborate expressions like ifs and functions and a whole lot of other things. It would be nice if there could be some shorthand in the group by clause, at least a column alias. Referring to the columns by number may be another option, albeit one that probably has its own problems.
There could be a situation where you need to extract one id from all the rows grouped, along with the sum of their quantities, for example. In this case you would, e.g., group them by name and leave the ids ungrouped. SQLite seems to work this way.
Since group by results in a single tuple for a whole group of tuples, any attribute that is not in the group by can only be used inside an aggregate function. If you add a non-grouped attribute to the select, SQL can't decide which value to pick from that group.