Adding ORDER BY gives error "must appear in the group by clause or be used in an aggregate function" - sql

I want to add an ORDER BY on car_max_price to my query, but I don't want that column in the GROUP BY.
How do I fix the "must appear in the GROUP BY clause or be used in an aggregate function" error? Any idea what's wrong?
subquery = (
    session.query(
        garage.id,
        garage.name,
        func.max(func.coalesce(car.max_price, 0)).label("car_max_price"),
        func.jsonb_build_object(
            text("'type'"),
            car.type,
            text("'color'"),
            car.color,
            text("'max_price'"),
            func.max(car.max_price),
        ).label("some_cars"),
    )
    .group_by(
        garage,
        car.type,
    )
    .subquery("subquery")
)
query = (
    session.query(
        func.jsonb_build_object(
            text("'id'"),
            subquery.c.id,
            text("'name'"),
            subquery.c.name,
            text("'some_cars'"),
            func.jsonb_agg(
                func.distinct(subquery.c.some_cars),
            ).label("some_cars"),
        )
    )
    .select_from(subquery)
    .group_by(
        subquery.c.id,
        subquery.c.name,
    )
    .order_by(
        subquery.c.car_max_price
    )
)
return query

When aggregating, you cannot order by an un-aggregated column. There may be any number of different values in the same aggregated group of rows for it. So you have to define what to use exactly (by way of another aggregate function).
Order by func.min(subquery.c.car_max_price), func.max(subquery.c.car_max_price), func.avg(subquery.c.car_max_price), or whatever you actually need.
There is one exception to this rule: if the PRIMARY KEY of a table is listed in the GROUP BY clause, that covers all columns of that table. See:
PostgreSQL - GROUP BY clause
But while operating on a derived table (subquery) that exception cannot apply. A derived table cannot have a PK constraint.
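Here is a minimal sketch of the rule in plain SQL, run through Python's built-in sqlite3 rather than the asker's PostgreSQL/SQLAlchemy setup. The per_type table and its rows are made up to stand in for the subquery; the only point is that the outer ORDER BY wraps car_max_price in an aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for the derived table: one row per (garage, car type).
    CREATE TABLE per_type (id INTEGER, name TEXT, car_type TEXT, car_max_price INTEGER);
    INSERT INTO per_type VALUES
        (1, 'North', 'suv',   300),
        (1, 'North', 'sedan', 100),
        (2, 'South', 'suv',   200);
""")

# Each garage has several car_max_price values, so pick one with MAX()
# and sort the grouped rows by that aggregate.
rows = conn.execute("""
    SELECT id, name, MAX(car_max_price) AS garage_max_price
    FROM per_type
    GROUP BY id, name
    ORDER BY MAX(car_max_price)
""").fetchall()
print(rows)  # [(2, 'South', 200), (1, 'North', 300)]
```

In the SQLAlchemy query from the question, the matching change would presumably be .order_by(func.max(subquery.c.car_max_price)).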

Related

SQL - Difference between .* and * in aggregate function query

SELECT reviews.*, COUNT(comments.review_id)
AS comment_count
FROM reviews
LEFT JOIN comments ON comments.review_id = reviews.review_id
GROUP BY reviews.review_id
ORDER BY reviews.review_id ASC;
When I run this code I get exactly what I want from my SQL query, however if I run the following
SELECT *, COUNT(comments.review_id)
AS comment_count
FROM reviews
LEFT JOIN comments ON comments.review_id = reviews.review_id
GROUP BY reviews.review_id
ORDER BY reviews.review_id ASC;
then I get the error "column must appear in GROUP BY clause or be used in an aggregate function".
Just wondered what the difference was and why the behaviour is different.
Thanks
In the first example, the columns are taken only from the reviews table. Although not all databases allow the use of SELECT * with GROUP BY, it is allowed by Standard SQL, assuming that review_id is the primary key.
The issue is that you are including columns in the SELECT that are not included in the GROUP BY. This is only allowed, in certain databases, under very special circumstances, where the columns in the GROUP BY are declared to uniquely identify each row (which a primary key does).
The second example has columns from comments that do not meet this condition. Hence it is not allowed.
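To see which columns each form actually selects, here is a small sketch using Python's built-in sqlite3. The title, comment_id and body columns are invented for illustration; only review_id comes from the question. SQLite is lenient about GROUP BY, so this only shows the column expansion, not the error itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical column sets; only review_id is taken from the question.
    CREATE TABLE reviews  (review_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, review_id INTEGER, body TEXT);
""")

join = "FROM reviews LEFT JOIN comments ON comments.review_id = reviews.review_id"

# cursor.description has one entry per output column of the query.
only_reviews = [d[0] for d in conn.execute(f"SELECT reviews.* {join}").description]
both_tables = [d[0] for d in conn.execute(f"SELECT * {join}").description]

print(only_reviews)  # columns of reviews only
print(both_tables)   # columns of reviews followed by columns of comments
```

The extra columns in the second list are the ones that are neither grouped nor aggregated.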
In the SELECT list of a query with GROUP BY, you can choose only columns that appear in the GROUP BY clause, or columns wrapped in an aggregate function.
Since you grouped by reviews.review_id, you get output in the first case. In the second query you are trying to select every column of both tables, and that is not possible with this GROUP BY.
You can use a window function if you need to select columns that are not present in your GROUP BY clause. Hope that makes sense.
https://www.windowfunctions.com/
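A rough sketch of the window-function idea, again with Python's sqlite3 (window functions need SQLite 3.25 or newer). The title and body columns and the sample rows are assumptions made for the example. Unlike GROUP BY, the window version keeps one output row per joined row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reviews  (review_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, review_id INTEGER, body TEXT);
    INSERT INTO reviews  VALUES (1, 'first'), (2, 'second');
    INSERT INTO comments VALUES (10, 1, 'a'), (11, 1, 'b');
""")

# No GROUP BY: the count is computed per partition and attached to every row.
rows = conn.execute("""
    SELECT reviews.review_id,
           comments.body,
           COUNT(comments.review_id) OVER (PARTITION BY reviews.review_id) AS comment_count
    FROM reviews
    LEFT JOIN comments ON comments.review_id = reviews.review_id
    ORDER BY reviews.review_id, comments.comment_id
""").fetchall()
print(rows)  # [(1, 'a', 2), (1, 'b', 2), (2, None, 0)]
```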

Still confusing the rules around selecting columns, group by, and joins

I am still confused by the syntax rules of using GROUP BY. I understand we use GROUP BY when there is some aggregate function. If I have even one aggregate function in a SQL statement, do I need to put all of my selected columns into my GROUP BY statement? I don't have a specific query to ask about but when I try to do joins, I get errors. In particular, when I use a count(*) in a statement and/or a join, I just seem to mess it up.
I use BigQuery at my job. I am regularly floored by strange gaps in knowledge.
Thank you!
This is a little complicated.
First, no aggregation functions are needed in an aggregation query. So this is allowed:
select a
from t
group by a;
This is equivalent, by the way, to:
select distinct a
from t;
If there are aggregation functions, then no group by is needed. So, this is allowed:
select max(a)
from t;
Such an aggregation query -- with no group by -- always returns one row. This is true even if the table is empty or a where clause filters out all the rows. In that case, most aggregation functions return NULL, with the notable exception of count() that returns 0.
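These points are easy to check on a scratch table. A small sketch with Python's built-in sqlite3 (not BigQuery; the table t and column a follow the examples above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")

# Aggregates without GROUP BY over an empty table still yield exactly one row.
empty = conn.execute("SELECT MAX(a), COUNT(*) FROM t").fetchall()
print(empty)  # [(None, 0)]

conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,)])

# GROUP BY with no aggregate returns the same set of values as SELECT DISTINCT.
grouped = conn.execute("SELECT a FROM t GROUP BY a ORDER BY a").fetchall()
distinct = conn.execute("SELECT DISTINCT a FROM t ORDER BY a").fetchall()
print(grouped, distinct)  # [(1,), (2,)] [(1,), (2,)]
```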
Next, if you mix aggregation functions and non-aggregation expressions in the select, then in general you want the non-aggregation, non-constant expressions in the group by. I should note that you can do:
select a, concat(a, 'bcd'), count(*)
from t
group by a;
This should work, but sometimes BigQuery gets confused and will want the expression in the group by.
Finally, the SQL standard supports a query like this:
select t.*, count(*)
from t join
u
using (foo)
group by t.a;
This is valid when a is the primary key (or equivalent) in t. However, BigQuery does not have primary keys, so this is not relevant to that database.

How to call a function using the results of a query ordered

I'm trying to call a function on each of the values that fit my query ordered by date. The reason being that the (black box) function is internally aggregating values into a string and I need the aggregated values to be in the timestamp order.
The things that I have tried are below: (function returns a boolean value and does blackbox things that I do not know and cannot modify)
-- This doesn't work
SELECT
bool_and (
function(MT.id)
)
FROM my_table MT
WHERE ...conditions...
ORDER BY MT.timestamp_value, MT.id
I got the error column "mt.timestamp_value" must appear in the GROUP BY clause or be used in an aggregate function. If I remove the ORDER BY, as below, it works:
-- This works!
SELECT
bool_and (
function(MT.id)
)
FROM my_table MT
WHERE ...conditions...
I also tried removing the function and only selected MT.id and it worked, but with the function it doesn't. So I tried using the GROUP BY clause.
Doing that, I tried:
-- This also doesn't work
SELECT
bool_and(
function(MT.id)
)
FROM my_table MT
WHERE ...conditions...
GROUP BY MT.id, MT.timestamp_value
ORDER BY MT.timestamp_value, MT.id
but this gives the error more than one row returned by a subquery used as an expression. MT.id is the primary key, by the way. It also works without the function, using just SELECT MT.id.
Ideally, a fix to either one of the code bits above would be nice or otherwise something that fulfills the following:
-- Not real postgresql code but something I want it to do
SELECT FUNCTION(id)
FOR EACH id in (MY SELECT STATEMENT HERE ORDERED)
In response to @a_horse_with_no_name:
This code falls under a section of another query that looks like the below:
SELECT Function2()
WHERE true = (
(my_snippet)
AND (...)
AND (...)
...
)
The error is clear. The subquery SELECT function(MT.id) is returning more than 1 row and the calling function bool_and can only operate on 1 row at a time. Adjust the subquery so that it only returns 1 record.
Issue resolution:
I discovered the reason that everything was failing was because of the AND in
WHERE true = (
(my_snippet)
AND (...)
AND (...)
...
)
What happened was that using GROUP BY and using ORDER BY caused the value returned by my snippet to be multiple rows of true.
Without the GROUP BY and the ORDER BY, it only returned a single row of true.
So my solution was to wrap the code into another bool_and and use that.
SELECT
    bool_and(vals)
FROM (
    SELECT
        bool_and(
            function(MT.id)
        ) as vals
    FROM my_table MT
    WHERE ...conditions...
    GROUP BY MT.id, MT.timestamp_value
    ORDER BY MT.timestamp_value, MT.id
) t
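The row-count effect described here can be reproduced in a sketch with Python's sqlite3. SQLite has no bool_and, so MIN over a 0/1 column is used as a stand-in, and the ok column is a made-up substitute for the result of function(MT.id); the data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- ok is a hypothetical stand-in for the boolean returned by function(MT.id).
    CREATE TABLE my_table (id INTEGER PRIMARY KEY, timestamp_value INTEGER, ok INTEGER);
    INSERT INTO my_table VALUES (1, 30, 1), (2, 10, 1), (3, 20, 1);
""")

# Grouping by the primary key yields one aggregate row per id ...
per_group = conn.execute("""
    SELECT MIN(ok) AS vals
    FROM my_table
    GROUP BY id, timestamp_value
    ORDER BY timestamp_value, id
""").fetchall()

# ... so an outer aggregate is needed to collapse them back to a single value.
overall = conn.execute("""
    SELECT MIN(vals)
    FROM (
        SELECT MIN(ok) AS vals
        FROM my_table
        GROUP BY id, timestamp_value
    ) t
""").fetchall()
print(per_group, overall)  # [(1,), (1,), (1,)] [(1,)]
```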
Since I have to guess the reason, and the way that you are trying to accomplish your stated goal:
"return value of the function is aggregated into a string"
And since:
You are using: bool_and and therefore the return value of the function must be boolean, and
The only aggregation I can see is the bool_and aggregation into either true or false, and
You mention that the function is a black box to you,
I would presume that instead of:
"return value of the function is aggregated into a string"
You meant to say: function is aggregating (input/transformed input) values into a string,
And you need this aggregating to be in a certain order.
Further I assume that you own the my_table and can create indexes on it.
So, if you need the function being used in the context:
bool_and ( function(MT.id) )
to process (and therefore aggregate into a string) the MT.id inputs (or their transformed values) in a certain order, you can physically reorder my_table by an index in that order.
To accomplish that in PostgreSQL (instead of using the GROUP BY and ORDER BY), you need to:
create that_index, in the order you need for the aggregation, on my_table, and then
run CLUSTER my_table USING that_index to physically rewrite the table in that order, so that a sequential scan, and therefore the bool_and ( function(MT.id) ) aggregation, tends to see the rows in that order. Note that this is not a guarantee: PostgreSQL only promises a row order when ORDER BY is given, and CLUSTER is a one-time operation that later inserts and updates do not maintain.
(see CLUSTER for more info)

SQL ORDER BY in SQL table returning function

I have a simple function that gets two fields from the database. I'm trying to order the results, but I cannot use ORDER BY in the RETURN clause.
It tells me
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
Is it possible to use ORDER BY in the RETURN statement? I would like to avoid using ORDER BY when executing the function.
CREATE FUNCTION goalsGames1 () RETURNS TABLE
AS RETURN(
SELECT MAX(goals_scored) goals,
no_games
FROM Player
GROUP BY no_games
ORDER BY no_games DESC )
One trick to get around this error is to use TOP, as the error message mentions:
CREATE FUNCTION goalsGames1 () RETURNS TABLE
AS RETURN(
SELECT Top 100 Percent MAX(goals_scored) goals,
no_games
FROM Player
GROUP BY no_games
ORDER BY no_games DESC )
I would like to avoid using order by when executing the function.
If you are using the function and want the results in a particular order, then you need to use ORDER BY.
This is quite clearly stated in the documentation:
The ORDER clause does not guarantee ordered results when a SELECT query is executed, unless ORDER BY is also specified in the query.
Use ORDER BY when you select from the function, not when you create it.
So use it here: select * from goalsGames1() order by col
Your error message already tells you where ORDER BY is invalid.
You cannot order by inside a function, the idea is to order the resultset returned by the function.
select *
from dbo.goalsGames1()
order by no_games
Even if you could order by inside the function, there is no guarantee that this ordering would be preserved when the result set is returned. The executing query (select * from functionname) has to be responsible for setting the order, not the function or view.
Whoever receives the rows is the only one that can order them, so in this case, select * from goalsGames1() is the receiver, and this query has to order the results.
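The same principle can be tried outside SQL Server. In this sketch with Python's sqlite3, a view stands in for the table-valued function (an assumption, since SQLite has no such functions), and the Player rows are invented. The definition carries no ORDER BY; the calling query does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Player (goals_scored INTEGER, no_games INTEGER);
    INSERT INTO Player VALUES (5, 10), (7, 10), (3, 20), (9, 15);
    -- A view plays the role of the table-valued function; no ORDER BY inside.
    CREATE VIEW goalsGames1 AS
        SELECT MAX(goals_scored) AS goals, no_games
        FROM Player
        GROUP BY no_games;
""")

# The receiver of the rows decides the order.
rows = conn.execute("SELECT * FROM goalsGames1 ORDER BY no_games DESC").fetchall()
print(rows)  # [(3, 20), (9, 15), (7, 10)]
```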

Give priority to ORDER BY over a GROUP BY in MySQL without subquery

I have the following query which does what I want, but I suspect it is possible to do this without a subquery:
SELECT *
FROM (SELECT *
      FROM `versions`
      ORDER BY `ID` DESC) AS X
GROUP BY `program`
What I need is to group by program, but returning the results for the objects in versions with the highest value of "ID".
In my past experience, a query like this should work in MySQL, but for some reason, it's not:
SELECT *
FROM `versions`
GROUP BY `program`
ORDER BY MAX(`ID`) DESC
What I want to do is have MySQL do the ORDER BY first and then the GROUP BY, but it insists on doing the GROUP BY first followed by the ORDER BY. i.e. it is sorting the results of the grouping instead of grouping the results of the ordering.
Of course it is not possible to write
SELECT * FROM `versions` ORDER BY `ID` DESC GROUP BY `program`
Thanks.
By definition, ORDER BY is processed after grouping with GROUP BY. By definition, the conceptual way any SELECT statement is processed is:
Compute the cartesian product of all tables referenced in the FROM clause
Apply the join criteria from the FROM clause to filter the results
Apply the filter criteria in the WHERE clause to further filter the results
Group the results into subsets based on the GROUP BY clause, collapsing the results to a single row for each such subset and computing the values of any aggregate functions -- SUM(), MAX(), AVG(), etc. -- for each such subset. Note that if no GROUP BY clause is specified, the results are treated as if there is a single subset and any aggregate functions apply to the entire results set, collapsing it to a single row.
Filter the now-grouped results based on the HAVING clause.
Sort the results based on the ORDER BY clause.
The only columns allowed in the results set of a SELECT with a GROUP BY clause are, of course,
The columns referenced in the GROUP BY clause
Aggregate functions (such as MAX())
literal/constants
expressions derived from any of the above.
Only broken SQL implementations allow things like select xxx,yyy,a,b,c FROM foo GROUP BY xxx,yyy; the references to columns a, b and c are meaningless/undefined, given that the individual groups have been collapsed to a single row.
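The processing order above can be followed in one small query. This is an illustrative sketch with Python's built-in sqlite3; the sales table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 5), ('east', 50), ('west', 40), ('west', 70), ('north', 20);
""")

rows = conn.execute("""
    SELECT region, SUM(amount) AS total   -- computed once per group
    FROM sales
    WHERE amount >= 10                    -- 1. filter individual rows
    GROUP BY region                       -- 2. collapse rows into groups
    HAVING SUM(amount) > 30               -- 3. filter whole groups
    ORDER BY total DESC                   -- 4. sort the surviving groups
""").fetchall()
print(rows)  # [('west', 110), ('east', 50)]
```

The ('east', 5) row is gone before grouping, 'north' is dropped by HAVING after grouping, and the sort only ever sees the two remaining group rows.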
This should do it and work pretty well as long as there is a composite index on (program,id). The subquery should only inspect the very first id for each program branch, and quickly retrieve the required record from the outer query.
select v.*
from
(
select program, MAX(id) id
from versions
group by program
) m
inner join versions v on m.program=v.program and m.id=v.id
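A quick way to try the join-on-MAX(id) approach is this sketch with Python's sqlite3 instead of MySQL. The label column, the index name and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE versions (id INTEGER PRIMARY KEY, program TEXT, label TEXT);
    INSERT INTO versions VALUES
        (1, 'editor', '1.0'), (2, 'editor', '1.1'),
        (3, 'viewer', '0.9'), (4, 'editor', '2.0');
    -- Composite index as suggested above (hypothetical name).
    CREATE INDEX versions_program_id ON versions (program, id);
""")

# The derived table m holds the highest id per program; the join fetches that row.
rows = conn.execute("""
    SELECT v.*
    FROM (SELECT program, MAX(id) AS id FROM versions GROUP BY program) m
    JOIN versions v ON m.program = v.program AND m.id = v.id
    ORDER BY v.program
""").fetchall()
print(rows)  # [(4, 'editor', '2.0'), (3, 'viewer', '0.9')]
```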
SELECT v.*
FROM (
SELECT DISTINCT program
FROM versions
) vd
JOIN versions v
ON v.id =
(
SELECT vi.id
FROM versions vi
WHERE vi.program = vd.program
ORDER BY
vi.program DESC, vi.id DESC
LIMIT 1
)
Create an index on (program, id) for this to work fast.
Regarding your original query:
SELECT * FROM `versions` GROUP BY `program` ORDER BY MAX(`ID`) DESC
This query would not parse in any SQL dialect except MySQL.
It abuses MySQL's ability to return ungrouped and unaggregated expressions from a GROUP BY statement.