Impala mathematical operation containing avg fails with AnalysisException

I am attempting to subtract a value in a column (column_18) from the average of another column (avg(column_19)) and obtain this result as a third column (result) for each row of the table:
cur.execute("Select avg(column_19) - column_18 as result FROM test1")
This doesn't seem to be working well, and I get this error:
impala.error.HiveServer2Error: AnalysisException: select list expression not produced by aggregation output (missing from GROUP BY clause?): SUM(column_19) / COUNT(column_19) - column_18
I do not want the result to be grouped.

avg() in this context is an aggregate function, which means that it is applied to a group of rows, which may be specified with a GROUP BY clause (or to all rows if none is specified). The output of an aggregate expression is a single value per group, so it is not applied per row as you want.
However, you can accomplish what you're trying to do in a few ways. I think the easiest is by using avg() as an analytic function. For example, you can do something like:
select column_19, column_18, (avg(column_19) over () - column_18) as result from test1
See the documentation for more details about how aggregations and analytic functions work.
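As a rough illustration of the analytic-function approach, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database instead of Impala. The sample rows are made up, SQLite 3.25 or later is assumed for window-function support, and the scalar-subquery variant is just one other possible way to get the same per-row result.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (column_18 REAL, column_19 REAL)")
conn.executemany(
    "INSERT INTO test1 VALUES (?, ?)",
    [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)],
)

# avg(...) OVER () averages over all rows but still returns one row per input row.
analytic_rows = conn.execute(
    "SELECT column_18, avg(column_19) OVER () - column_18 AS result "
    "FROM test1 ORDER BY column_18"
).fetchall()

# Alternative sketch: compute the overall average in a scalar subquery.
subquery_rows = conn.execute(
    "SELECT column_18, (SELECT avg(column_19) FROM test1) - column_18 AS result "
    "FROM test1 ORDER BY column_18"
).fetchall()

print(analytic_rows)
print(subquery_rows)
```

Both queries return one row per row of test1, with the overall average (20.0 for this sample data) minus that row's column_18.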

Related

Why do I have to group by every column in redshift?

I had a query written for Aurora that worked fine, but now I need to run the same query in Redshift. When I do, it throws an error asking me to group by every column, which is obviously not what I want.
This is the query:
select
rut,
name,
id,
sum(cantidad_retornos) as cantidad_retornos,
sum(cantidad_aceptadas) as cantidad_aceptadas,
sum(cantidad_auto_accept) as cantidad_auto_accept,
sum(cantidad_rechazadas) as cantidad_rechazadas,
sum(cantidad_aceptadas) - sum(cantidad_auto_accept) as cantidad_aceptadas_manual,
coalesce((sum(cantidad_aceptadas) - sum(cantidad_auto_accept)) / nullif(sum(cantidad_aceptadas),0)) as per_aceptadas_manual,
coalesce(sum(cantidad_auto_accept) / nullif(sum(cantidad_aceptadas),0),0) as per_aceptadas_auto,
coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0),0) AS rechazo_per,
case
when coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0) ,0) < 0.1 or cantidad_retornos < 10 then 'Confiable'
when coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0),0) >= 0.1 and coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0),0) < 0.5 then 'Estándar'
when coalesce(sum(cantidad_rechazadas) / nullif(sum(cantidad_retornos),0),0) >= 0.5 then 'Poco confiable'
else 'Sin clasificar'
end as nivel_confianza
from table
where 1=1
group by id, name, rut
I tried to group by every column, but that doesn't return the result I need.
The error that I get:
ERROR: column "reporte_sellers_date.cantidad_retornos" must appear in the GROUP BY clause or be used in an aggregate function
If I group by the third column, it throws the same error, but for column number 4.

In the first branch of the CASE expression you use cantidad_retornos (in "or cantidad_retornos < 10") without any aggregate function such as SUM(). This is why Redshift says it needs to be in the GROUP BY. You also use this name as the alias for the sum of the column of the same name, so there is a choice the database needs to make about which one to use: the source column or the aggregate. It looks like Aurora chooses the aggregate but Redshift chooses the source column.
Using the same name for an aggregate as a source column is not a good idea as you are relying on the query compiler to make a choice for you. This means the query can break when the compiler is updated or if you port the query to a different database.
To fix this you can either add the SUM() aggregation to the use of cantidad_retornos in the CASE statement or use the aggregate from above in the query but give it a unique name.
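A trimmed-down sketch of that fix, run against SQLite rather than Redshift purely so it is executable. The table name reporte_sellers, the sample rows, the alias total_retornos, and the simplified CASE branches are all assumptions for illustration; the "* 1.0" is only there to avoid integer division in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reporte_sellers "
    "(id INTEGER, cantidad_retornos INTEGER, cantidad_rechazadas INTEGER)"
)
# Hypothetical rows: seller 1 has few returns, seller 2 has many rejections.
conn.executemany(
    "INSERT INTO reporte_sellers VALUES (?, ?, ?)",
    [(1, 6, 0), (1, 6, 0), (2, 10, 4), (2, 10, 8)],
)

rows = conn.execute(
    """
    SELECT id,
           SUM(cantidad_retornos) AS total_retornos,  -- alias differs from the source column
           CASE
               WHEN COALESCE(SUM(cantidad_rechazadas) * 1.0
                             / NULLIF(SUM(cantidad_retornos), 0), 0) < 0.1
                    OR SUM(cantidad_retornos) < 10     -- aggregate, not the bare column
                   THEN 'Confiable'
               WHEN COALESCE(SUM(cantidad_rechazadas) * 1.0
                             / NULLIF(SUM(cantidad_retornos), 0), 0) < 0.5
                   THEN 'Estándar'
               ELSE 'Poco confiable'
           END AS nivel_confianza
    FROM reporte_sellers
    GROUP BY id
    ORDER BY id
    """
).fetchall()

print(rows)
```

Because every reference to cantidad_retornos is now wrapped in SUM() and the alias no longer shadows the source column, the engine has nothing to guess.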

Access SQL GROUP BY problem (eg. tbl_Produktion.ID not part of the aggregation-function)

I want to group by two columns, however MS Access won't let me do it.
Here is the code I wrote:
SELECT
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter,
tbl_Produktion.ProduktionsID, tbl_Produktion.Linie,
tbl_Produktion.Schicht, tbl_Produktion.Anzahl_Schichten_P,
tbl_Produktion.Schichtteam, tbl_Produktion.Von, tbl_Produktion.Bis,
tbl_Produktion.Pause, tbl_Produktion.Kunde, tbl_Produktion.TeileNr,
tbl_Produktion.FormNr, tbl_Produktion.LabyNr,
SUM(tbl_Produktion.Stueckzahl_Prod),
tbl_Produktion.Stueckzahl_Ausschuss, tbl_Produktion.Ausschussgrund,
tbl_Produktion.Kommentar, tbl_Produktion.StvSchichtleiter,
tbl_Produktion.Von2, tbl_Produktion.Bis2, tbl_Produktion.Pause2,
tbl_Produktion.Arbeiter3, tbl_Produktion.Von3, tbl_Produktion.Bis3,
tbl_Produktion.Pause3, tbl_Produktion.Arbeiter4,
tbl_Produktion.Von4, tbl_Produktion.Bis4, tbl_Produktion.Pause4,
tbl_Produktion.Leiharbeiter5, tbl_Produktion.Von5,
tbl_Produktion.Bis5, tbl_Produktion.Pause5,
tbl_Produktion.Leiharbeiter6, tbl_Produktion.Von6,
tbl_Produktion.Bis6, tbl_Produktion.Pause6, tbl_Produktion.Muster
FROM
tbl_Personal
INNER JOIN
tbl_Produktion ON tbl_Personal.PersID = tbl_Produktion.Schichtleiter
GROUP BY
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter;
It works when I group it by all the columns, but not like this.
The error message says that the rest of the columns aren't part of the aggregate function (translated from German to English as best as I could).
PS.: I also need the sum of "tbl_Produktion.Stueckzahl_Prod" therefore I tried using the SUM function (couldn't try it yet).

Have you tried something along these lines?
SELECT
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter,
MAX(tbl_Produktion.ProduktionsID), MAX(tbl_Produktion.Linie),
MAX(tbl_Produktion.Schicht), MAX(tbl_Produktion.Anzahl_Schichten_P),
MAX(tbl_Produktion.Schichtteam), MAX(tbl_Produktion.Von), MAX(tbl_Produktion.Bis),
SUM(tbl_Produktion.Stueckzahl_Prod)
FROM
tbl_Personal
INNER JOIN
tbl_Produktion ON tbl_Personal.PersID = tbl_Produktion.Schichtleiter
GROUP BY
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter;
I have used the MAX function for all the data except the two items you specify in the GROUP BY and the one where you desire the SUM. I took the liberty of leaving out much of your data just to get started.
Using the MAX function turns out to be a convenient workaround when the data item is known to be unique within each group. We cannot know your data or your intent, so we cannot tell you whether MAX will yield the results you need.

If you use an aggregate function in the select clause, you must group by every selected column that is not an aggregate. If you don't want to do that for some reason (perhaps it changes the output of the aggregation in a way that you don't intend), you must either pick an aggregate to apply to those columns (average, max, min?) or run two selects: one for the aggregate and one for the non-aggregated columns. In the second case you still have to decide how to relate the non-aggregated fields to the aggregate in a way that makes sense (or show them all in a table, I suppose).
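The two options described in these answers can be sketched as follows. This uses SQLite instead of Access so that it runs anywhere, keeps only a handful of the question's columns, and uses invented sample rows; treat it as an illustration rather than a drop-in Access query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_Produktion "
    "(Datum TEXT, Schichtleiter INTEGER, Linie TEXT, Stueckzahl_Prod INTEGER)"
)
# Hypothetical rows: two production records for shift leader 1, one for leader 2.
conn.executemany(
    "INSERT INTO tbl_Produktion VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01", 1, "A", 100),
        ("2020-01-01", 1, "A", 50),
        ("2020-01-01", 2, "B", 70),
    ],
)

# Option 1: wrap each non-grouped column in an aggregate such as MAX.
aggregated = conn.execute(
    "SELECT Datum, Schichtleiter, MAX(Linie), SUM(Stueckzahl_Prod) "
    "FROM tbl_Produktion GROUP BY Datum, Schichtleiter "
    "ORDER BY Datum, Schichtleiter"
).fetchall()

# Option 2: keep every detail row and join the per-group total back onto it.
detail_with_total = conn.execute(
    "SELECT p.Datum, p.Schichtleiter, p.Linie, p.Stueckzahl_Prod, t.total "
    "FROM tbl_Produktion AS p "
    "JOIN (SELECT Datum, Schichtleiter, SUM(Stueckzahl_Prod) AS total "
    "      FROM tbl_Produktion GROUP BY Datum, Schichtleiter) AS t "
    "  ON t.Datum = p.Datum AND t.Schichtleiter = p.Schichtleiter "
    "ORDER BY p.Schichtleiter, p.Stueckzahl_Prod"
).fetchall()

print(aggregated)
print(detail_with_total)
```

Option 1 yields one row per (Datum, Schichtleiter) group, while option 2 keeps all three detail rows and repeats the group total on each.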

Same return with and without the SUM operator PostgreSQL

I'm using PostgreSQL 10 and trying to run this query. I started with a CTE which I am referencing as 'query.'
SELECT
ROW_NUMBER()OVER() AS my_new_id,
query.geom AS geom,
query.pop AS pop,
query.name,
query.distance AS dist,
query.amenity_size,
((amenity_size)/(distance)^2) AS attract_score,
SUM((amenity_size)/(distance)^2) AS tot_attract_score,
((amenity_size)/(distance)^2) / SUM((amenity_size)/(distance)^2) as marketshare
INTO table_mktshare
FROM query
WHERE
distance > 0
GROUP BY
query.name,
query.amenity_size,
query.geom,
query.pop,
query.distance
The query runs, but the problem lies in the 'marketshare' column. It returns the same answer with or without the SUM operator, and it always returns one, which suggests that attract_score and tot_attract_score are the same. Why does the SUM expression evaluate to the same value as the expression above it?

This is occurring specifically because each combination of columns in the group by clause uniquely identifies one row in the table. I don't know if this is intentional, but more normally, one would expect something like this:
SELECT ROW_NUMBER() OVER() AS my_new_id,
query.geom AS geom, query.pop AS pop, query.name,
SUM((amenity_size)/(distance)^2) AS tot_attract_score
INTO table_mktshare
FROM query
WHERE distance > 0
GROUP BY query.name, query.geom, query.pop;
This is not your intention, but it does give a flavor of what's expected.
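To make the one-row-per-group effect concrete, here is a small sketch using SQLite in place of PostgreSQL. The table name amenities and its rows are hypothetical, distance * distance stands in for PostgreSQL's ^ operator, and the second query (a window SUM over all rows, which needs SQLite 3.25 or later) is only one possible reading of what a total attraction score might mean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amenities (name TEXT, amenity_size REAL, distance REAL)")
conn.executemany(
    "INSERT INTO amenities VALUES (?, ?, ?)",
    [("a", 4.0, 1.0), ("b", 16.0, 2.0), ("c", 8.0, 1.0)],
)

# Every GROUP BY combination matches exactly one row, so SUM(x) equals x
# and the ratio is always 1.
grouped = conn.execute(
    "SELECT name, "
    "       amenity_size / (distance * distance) AS attract_score, "
    "       SUM(amenity_size / (distance * distance)) AS tot_attract_score, "
    "       amenity_size / (distance * distance) "
    "           / SUM(amenity_size / (distance * distance)) AS marketshare "
    "FROM amenities WHERE distance > 0 "
    "GROUP BY name, amenity_size, distance ORDER BY name"
).fetchall()

# Assumed alternative: a window SUM over all rows, with no GROUP BY at all.
windowed = conn.execute(
    "SELECT name, score, SUM(score) OVER () AS tot_attract_score, "
    "       score / SUM(score) OVER () AS marketshare "
    "FROM (SELECT name, amenity_size / (distance * distance) AS score "
    "      FROM amenities WHERE distance > 0) "
    "ORDER BY name"
).fetchall()

print(grouped)
print(windowed)
```

In the grouped query the marketshare column is 1.0 on every row, matching the behavior described in the question; in the windowed query the shares add up to 1 across rows.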

How to properly compute weighted average for zeroes in SQL

I have the following problem: I'm computing a weighted average in SQL as SUM(Value * Weight) / SUM(Weight). However, the weights in a group can sum to zero (SUM(Weight) = 0), and in that case the query fails. Is it possible to return 0 as the result in this case?
I have tried CASE SUM(Weight) WHEN 0 THEN 0 ELSE SUM(Value * Weight) / SUM(Weight) END, but I'm afraid that it evaluates SUM(Weight) twice, and that can be fairly expensive in my case.

Use NULLIF and ISNULL:
ISNULL(SUM(Value * Weight) / NULLIF(SUM(Weight),0),0)
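A quick way to see the effect of that expression is the sketch below. It runs on SQLite, so COALESCE takes the place of ISNULL (which is SQL Server syntax); the table name samples, the grouping column grp, and the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (grp TEXT, Value REAL, Weight REAL)")
# Group "a" has usable weights; group "b" has weights that sum to zero.
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("a", 10.0, 1.0), ("a", 20.0, 3.0), ("b", 5.0, 0.0)],
)

# NULLIF turns a zero denominator into NULL, the division then yields NULL,
# and COALESCE replaces that NULL with 0.
rows = conn.execute(
    "SELECT grp, "
    "       COALESCE(SUM(Value * Weight) / NULLIF(SUM(Weight), 0), 0) AS weighted_avg "
    "FROM samples GROUP BY grp ORDER BY grp"
).fetchall()

print(rows)
```

Group "a" gets (10 * 1 + 20 * 3) / 4 = 17.5, and group "b" gets 0 instead of a division-by-zero failure.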

The SQL engine doesn't compute sum(Weight) twice, just once. The conceptual process is:
1. Compute the full Cartesian join of all tables in the FROM clause.
2. Apply the join criteria to filter the results.
3. Apply the WHERE clause criteria to filter the results.
4. Partition this result set into groups as defined by the GROUP BY clause.
5. Collapse each such group into one row, computing any aggregate functions that have been specified and keeping only the columns listed in the result set (aggregate functions and grouping columns).
6. Apply the criteria in the HAVING clause to filter the grouped results.
7. Drop all columns except those specified in the query's result columns, creating those that are computed expressions.
8. Apply the ordering specified in the ORDER BY clause.
No actual SQL engine does this, but it must behave as if that is what happened. Your aggregate function is computed just once, along with any other aggregate functions, in a single pass.

Group by SQL statement

So I got this statement, which works fine:
SELECT MAX(patient_history_date_bio) AS med_date, medication_name
FROM biological
WHERE (patient_id = 12)
GROUP BY medication_name
But, I would like to have the corresponding medication_dose also. So I type this up
SELECT MAX(patient_history_date_bio) AS med_date, medication_name, medication_dose
FROM biological
WHERE (patient_id = 12)
GROUP BY medication_name
But it gives me an error saying:
"Column 'biological.medication_dose' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."
So I try adding medication_dose to the GROUP BY clause, but then it gives me extra rows that I don't want.
I would like to get the latest row for each medication in my table. (The latest row is determined by the max function, getting the latest date).
How do I fix this problem?

Use:
SELECT b.medication_name,
b.patient_history_date_bio AS med_date,
b.medication_dose
FROM BIOLOGICAL b
JOIN (SELECT y.medication_name,
MAX(y.patient_history_date_bio) AS max_date
FROM BIOLOGICAL y
WHERE y.patient_id = ? -- same patient, so the latest date is per patient
GROUP BY y.medication_name) x ON x.medication_name = b.medication_name
AND x.max_date = b.patient_history_date_bio
WHERE b.patient_id = ?
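Here is a runnable sketch of this join-back pattern, using SQLite and invented rows. One assumption beyond the query above: the subquery in this sketch is restricted to the same patient, so that another patient's later date for the same medication cannot hide the row we want; the patient id is therefore bound twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE biological (patient_id INTEGER, medication_name TEXT, "
    "patient_history_date_bio TEXT, medication_dose INTEGER)"
)
# Hypothetical history: patient 12 has two drug_a entries; patient 99 has a later one.
conn.executemany(
    "INSERT INTO biological VALUES (?, ?, ?, ?)",
    [
        (12, "drug_a", "2010-01-01", 50),
        (12, "drug_a", "2010-03-01", 25),
        (12, "drug_b", "2010-02-01", 10),
        (99, "drug_a", "2010-06-01", 75),
    ],
)

latest = conn.execute(
    """
    SELECT b.medication_name, b.patient_history_date_bio AS med_date, b.medication_dose
    FROM biological AS b
    JOIN (SELECT y.medication_name, MAX(y.patient_history_date_bio) AS max_date
          FROM biological AS y
          WHERE y.patient_id = ?          -- assumption: latest date per patient
          GROUP BY y.medication_name) AS x
      ON x.medication_name = b.medication_name
     AND x.max_date = b.patient_history_date_bio
    WHERE b.patient_id = ?
    ORDER BY b.medication_name
    """,
    (12, 12),
).fetchall()

print(latest)
```

Each medication appears once, with the dose taken from the row that carries the latest date for patient 12.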

If you really have to, as one quick workaround, you can apply an aggregate function to your medication_dose such as MAX(medication_dose).
However, note that this is normally an indication that you are either building the query incorrectly, or that you need to refactor/normalize your database schema. In your case, it looks like you are tackling the query incorrectly. The correct approach should be the one suggested by OMG Ponies in another answer.
You may be interested in checking out the following interesting article which describes the reasons behind this error:
But WHY Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?

You need to put max(medication_dose) in your select. Group by returns a result set that contains distinct values for fields in your group by clause, so apparently you have multiple records that have the same medication_name, but different doses, so you are getting two results.
By putting in max(medication_dose) it will return the maximum dose value for each medication_name. You can use any aggregate function on dose (max, min, avg, sum, etc.)
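One thing worth checking before relying on MAX(medication_dose): the two MAX calls are evaluated independently, so the dose returned is not necessarily the dose recorded on the latest date. A minimal sketch with SQLite and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE biological (medication_name TEXT, "
    "patient_history_date_bio TEXT, medication_dose INTEGER)"
)
# Hypothetical history: the dose was lowered from 50 to 25 on the later date.
conn.executemany(
    "INSERT INTO biological VALUES (?, ?, ?)",
    [("drug_a", "2010-01-01", 50), ("drug_a", "2010-03-01", 25)],
)

row = conn.execute(
    "SELECT medication_name, MAX(patient_history_date_bio), MAX(medication_dose) "
    "FROM biological GROUP BY medication_name"
).fetchone()

print(row)
```

Here the result pairs the March date with the January dose of 50, so this workaround only matches the "latest row" requirement when the dose happens to be highest on the latest date.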
