Order of fields in a group by clause - sql

In sql, if I have multiple columns in the group by clause, does it make a difference in which order I specify them?
For example:
select col1,col2, col3,sum(col4) from table1 group by col1,col2, col3?
Is it same as:
select col1,col2, col3,sum(col4) from table1 group by col2,col1, col3?

No. The order doesn't matter for the GROUP BY clause. As SQL is declarative. So it doesnt matter in which order you specify the columns in group by clause.
In MySql if you are using WITH ROLLUP then the order of group by may matter otherwise it will not. In SQL Server ROLL UP is deprecated.

The order of group by list do not matter.
You can interpret order by clause as build groups clause, that ensures that any one of its groups is at least in a field different from all others.
In this concept, order do not matter, the field list are just the "knives" that break apart the recordset.
Also I warn you: group by may or may not return records as its list says (it is not waranted). If a order (in the full output recordset) is needed, also use order by

Related

SQL - Difference between .* and * in aggregate function query

SELECT reviews.*, COUNT(comments.review_id)
AS comment_count
FROM reviews
LEFT JOIN comments ON comments.review_id = reviews.review_id
GROUP BY reviews.review_id
ORDER BY reviews.review_id ASC;
When I run this code I get exactly what I want from my SQL query, however if I run the following
SELECT *, COUNT(comments.review_id)
AS comment_count
FROM reviews
LEFT JOIN comments ON comments.review_id = reviews.review_id
GROUP BY reviews.review_id
ORDER BY reviews.review_id ASC;
then I get an error "column must appear in GROUP BY clause or be used in an aggregate function
Just wondered what the difference was and why the behaviour is different.
Thanks
In the first example, the column are taken only from the reviews table. Although not databases allow the use of SELECT * with GROUP BY, it is allowed by Standard SQL, assuming that review_id is the primary key.
The issue is that that you are including columns in the SELECT that are not included in the GROUP BY. This is only allowed -- in certain databases -- under very special circumstances, where the columns in the GROUP BY are declared to uniquely identify each row (which a primary key does).
The second example has columns from comments that do not meet this condition. Hence it is not allowed.
In the select part of the query with group by, you can chose only those columns which you used in group by.
Since you did group by reviews.review_id, you can get the output for the first case. In the second query you are try to get all the records and that is not possible with group by.
You can use window function if you need to select columns which are not present in your group by clause. Hope it makes sense.
https://www.windowfunctions.com/

In SQL, does groupby on an ordered query behave the same as doing both in the same query?

Are the following queries identical, or might I get different results (in any major DB system, e.g. MSSQL, MySQL, Postgres, SQLite):
Doing both in the same query:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
vs. ordering in a subquery:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Looking at the first sample:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
Let's think about what GROUP BY does by looking at this imaginary sample data:
A B
- -
1 1
1 2
Then think about this query:
SELECT A
FROM SampleData
GROUP BY A
ORDER BY B
The GROUP BY clause puts the two rows into a single group. Then we want to order by B... but the two rows in the group have different values for B. Which should it use?
Obviously in this situation it doesn't really matter: there's only one row in the results, so the order is not relevant. But generally, how does the database know what to do?
The database could guess which one you want, or just take the first value, or the last — whatever those mean in a setting where the data is unordered by definition. And in fact this is what MySql will try to do for you: it will try to guess are your meaning. But this response is really inappropriate. You specified an in-exact query; the only correct thing to do is throw an error, which is what most databases will do.
Now let's look at the second sample:
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
Here it is important to remember databases have their roots in relational set theory, and what we think of as "tables" are more formally described as Unordered Relations. Again: the idea of being "unordered" is baked into the very nature of a table at the deepest level.
In this case the inner query can run and create results in the specified order, and then the outer query can use that with GROUP BY to create a new set... but just like tables, query results are unordered relations. Without an ORDER BY clause the final result is also unordered by definition.
Now you might tend to get results in the order you want, but the reality is all bets are off. In fact, the databases that run this query will tend to give you results in the order in which they first encountered each group, which will not tend to match the ORDER BY because the GROUP BY expression is looking at completely different columns. Other databases (Sql Server is in this group) will not even allow the query to run, though I might prefer a warning here.
So now we come to the final section, where we must re-think the question, like this:
How can I use GROUP BY on the one group column, while also ordering by some_other_column not in the group?
The answer is each group can contain multiple rows, and so you must tell the database which row to look at to get the correct (specific) some_other_column value. The typical way to do this is with another aggregate function, which might look like this:
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_agg_func(some_other_column)
That code will run without error on pretty much any database.
Just be careful here. On one hand, when people want to do this it's often for the common case where they know every record for some_other_column in each group will have the same value. For example, you might GROUP BY UserID, but ORDER BY Email, where of course every record with the same UserID should have the same Email address. As humans, we have the ability to make that kind of inference. Computers, however, don't handle that kind of thinking as well, and so we help it out with an extra aggregate function like MIN() or MAX().
On the other hand, if you're not careful sometimes the two different aggregate functions don't match up, and you end up showing the value from one row in the group, while using a completely different row from the group for the ORDER BY expression in a way that is not good.
Tables are unordered sets of data. A query result is a table. So if you select from a subquery that contains an ORDER BY clause, that clause means nothing; the data set is unordered by definition. The DBMS is free to ignore the ORDER BY clause. Some DBMS may even issue a warning or error, but I suppose it's more common that the ORDER BY clause just has no effect - at least not guaranteed.
In this query
SELECT group, some_agg_func(some_value)
FROM my_table
GROUP BY group
ORDER BY some_other_value
you try to order your results by some_other_value. If this is meant to be a column, you can't, because that other column is no part of your results. You'll get a syntax error. If some_other_value is a fixed value, then there is nothing ordered, because you'd have the same sort key for every row. But it can be an expression based on your result data (group key and aggreation results) and you can order your result rows by that.
In this query
SELECT group, some_agg_func(some_value)
FROM (
SELECT group, some_value
FROM my_table
ORDER BY some_other_value
) as alias
GROUP BY group
the ORDER BY clause has no effect. You could just as well just select FROM my_table directly:
SELECT group, some_agg_func(some_value)
FROM my_table as alias
GROUP BY group
This gets the results unordered (or at least the order you see is not guaranteed to be thus every time you run that query), because your query doesn't have an ORDER BY clause.

Select Distinct On while Order By a different column

I'm coming from a MySQL background, where GROUP BY worked very differently than in Postgres. In Postgres - and apparently any standards-based SQL database - you have to group by all selected columns, while in MySQL you can handpick which ones to group by.
I read that you can get an equivalent effect with DISTINCT ON, and for the most part that's the case. The hitch is that you have to ORDER BY all the distinct columns, and this ordering has to be the left-most ordering. That's a problem when I want to order primarily by another column.
Right now my query looks like this:
SELECT
DISTINCT ON (eventable_id, eventable_type)
events.eventable_id, events.eventable_type, events.*
FROM events
WHERE <query>
ORDER BY eventable_id, eventable_type, events.created_at DESC
I would like to swap around the order by to look like this:
ORDER BY events.created_at, eventable_id, eventable_type DESC
Any advice for getting this to work?
Since you are selecting events.*, you shouldn't add eventable_id, and eventable_type to the output columns redundantly. Would result in duplicate column names. You know that you don't have to include the columns in the DISTINCT ON clause in target list, right?
Also, it's probably faster to use eventable_type DESC right away, since you have that in your final sort order. That's allowed, too.
SELECT DISTINCT ON (eventable_id, eventable_type)
*
FROM events
WHERE <condition>
ORDER BY eventable_id, eventable_type DESC, created_at DESC
#Denis already covers the rest: make that a subquery and order as you like in the outer query.
The alternative would be a subselect with GROUP BY and max(), but that yields multiple columns per group, when the latest created_at per group is not unique. (May or may not be desirable.) And it's probably still slower than DISTINCT ON with an additional ORDER BY step. Test with EXPLAIN ANALYZE.
SELECT e.*
FROM events e
JOIN (
SELECT eventable_id, eventable_type, max(created_at) AS created_at
FROM events
WHERE <condition>
GROUP BY 1, 2 DESC
) sub USING (eventable_id, eventable_type, created_at) -- maybe not unique
WHERE <repeat condition if dupes may be eliminated>
ORDER BY e.created_at, e.eventable_id, e.eventable_type DESC
If Postgres complains, use a subselect:
select * from ( ... ) q order by ...
(If it does, though, I'd take it as a hint that the query plan will suck.)

max(), group by and order by

I have following SQL statement.
SELECT t.client_id,max(t.points) AS "max" FROM sessions GROUP BY t.client_id;
It simply lists client id's with maximum amount of points they've achieved. Now I want to sort the results by max(t.points). Normally I would use ORDER BY, but I have no idea how to use it with groups. I know using value from SELECT list is prohibited in following clauses, so adding ORDER BY max at the end of query won't work.
How can I sort those results after grouping, then?
Best regards
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY t.client_id
order by max(t.points) desc
It is not quite correct that values from the SELECT list are prohibited in following clauses. In fact, ORDER BY is logically processed after the SELECT list and can refer to SELECT list result names (in contrast with GROUP BY). So the normal way to write your query would be
SELECT t.client_id, max(t.points) AS "max"
FROM sessions
GROUP BY t.client_id
ORDER BY max;
This way of expressing it is SQL-92 and should be very portable. The other way to do it is by column number, e.g.,
ORDER BY 2;
These are the only two ways to do this in SQL-92.
SQL:1999 and later also allow referring to arbitrary expressions in the sort list, so you could just do ORDER BY max(t.points), but that's clearly more cumbersome, and possibly less portable. The ordering by column number was removed in SQL:1999, so it's technically no longer standard, but probably still widely supported.
Since you have tagged as Postgres: Postgres allows a non-standard GROUP BY and ORDER BY column number. So you could have
SELECT t.client_id, max(t.points) AS "max"
FROM sessions t
GROUP BY 1
order by 2 desc
After parsing, this is identical to RedFilter’s solution.

Give priority to ORDER BY over a GROUP BY in MySQL without subquery

I have the following query which does what I want, but I suspect it is possible to do this without a subquery:
SELECT *
FROM (SELECT *
FROM 'versions'
ORDER BY 'ID' DESC) AS X
GROUP BY 'program'
What I need is to group by program, but returning the results for the objects in versions with the highest value of "ID".
In my past experience, a query like this should work in MySQL, but for some reason, it's not:
SELECT *
FROM 'versions'
GROUP BY 'program'
ORDER BY MAX('ID') DESC
What I want to do is have MySQL do the ORDER BY first and then the GROUP BY, but it insists on doing the GROUP BY first followed by the ORDER BY. i.e. it is sorting the results of the grouping instead of grouping the results of the ordering.
Of course it is not possible to write
SELECT * FROM 'versions' ORDER BY 'ID' DESC GROUP BY 'program'
Thanks.
By definition, ORDER BY is processed after grouping with GROUP BY. By definition, the conceptual way any SELECT statement is processed is:
Compute the cartesian product of all tables referenced in the FROM clause
Apply the join criteria from the FROM clause to filter the results
Apply the filter criteria in the WHERE clause to further filter the results
Group the results into subsets based on the GROUP BY clause, collapsing the results to a single row for each such subset and computing the values of any aggregate functions -- SUM(), MAX(), AVG(), etc. -- for each such subset. Note that if no GROUP BY clause is specified, the results are treated as if there is a single subset and any aggregate functions apply to the entire results set, collapsing it to a single row.
Filter the now-grouped results based on the HAVING clause.
Sort the results based on the ORDER BY clause.
The only columns allowed in the results set of a SELECT with a GROUP BY clause are, of course,
The columns referenced in the GROUP BY clause
Aggregate functions (such as MAX())
literal/constants
expresssions derived from any of the above.
Only broken SQL implementations allow things like select xxx,yyy,a,b,c FROM foo GROUP BY xxx,yyy — the references to colulmsn a, b and c are meaningless/undefined, given that the individual groups have been collapsed to a single row,
This should do it and work pretty well as long as there is a composite index on (program,id). The subquery should only inspect the very first id for each program branch, and quickly retrieve the required record from the outer query.
select v.*
from
(
select program, MAX(id) id
from versions
group by program
) m
inner join versions v on m.program=v.program and m.id=v.id
SELECT v.*
FROM (
SELECT DISTINCT program
FROM versions
) vd
JOIN versions v
ON v.id =
(
SELECT vi.id
FROM versions vi
WHERE vi.program = vd.program
ORDER BY
vi.program DESC, vi.id DESC
LIMIT 1
)
Create an index on (program, id) for this to work fast.
Regarding your original query:
SELECT * FROM 'versions' GROUP BY 'program' ORDER BY MAX('ID') DESC
This query would not parse in any SQL dialect except MySQL.
It abuses MySQL's ability to return ungrouped and unaggregated expressions from a GROUP BY statement.