How to properly compute weighted average for zeroes in SQL - sql

I have the following problem: I'm computing a weighted average in SQL as SUM(Value * Weight) / SUM(Weight). However, the weights can sum to zero (for example, when the rows are empty, so SUM(Weight) = 0), and in that case the query fails. Is it possible to return 0 as the result in this case?
I have tried CASE SUM(Weight) WHEN 0 THEN 0 ELSE SUM(Value * Weight) / SUM(Weight) END, but I'm afraid that it evaluates SUM(Weight) twice, and that can be fairly expensive in my case.

Use NULLIF and ISNULL:
ISNULL(SUM(Value * Weight) / NULLIF(SUM(Weight),0),0)
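A minimal sketch of the same pattern, run against an in-memory SQLite database through Python's sqlite3 module so it can be executed as-is. The table name samples and the helper weighted_average are hypothetical. COALESCE stands in for the two-argument ISNULL, which SQLite does not provide. SQLite also returns NULL instead of failing on division by zero, so this only illustrates the shape of the expression, not the original error.

```python
import sqlite3

def weighted_average(rows):
    """Return SUM(value*weight)/SUM(weight), or 0 when the weights sum to 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (value REAL, weight REAL)")
    conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)
    # NULLIF turns a zero divisor into NULL; COALESCE maps the NULL result to 0.
    (result,) = conn.execute(
        "SELECT COALESCE(SUM(value * weight) / NULLIF(SUM(weight), 0), 0) "
        "FROM samples"
    ).fetchone()
    conn.close()
    return result

print(weighted_average([(10.0, 1.0), (20.0, 3.0)]))  # (10*1 + 20*3) / 4
print(weighted_average([(10.0, 0.0), (20.0, 0.0)]))  # all weights are zero
print(weighted_average([]))                          # no rows at all
```

The empty-table case also falls through to 0, because SUM over no rows is NULL and the division then yields NULL.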

The SQL engine doesn't compute sum(Weight) twice, just once. The conceptual process is:
1. Compute the full Cartesian join of all tables in the FROM clause.
2. Apply the join criteria to filter the results.
3. Apply the WHERE clause criteria to filter the results.
4. Partition this result set into groups as defined by the GROUP BY clause.
5. Collapse each such group into one row, computing any aggregate functions that have been specified and keeping only those columns listed in the result set (aggregate functions and grouping columns).
6. Apply the criteria in the HAVING clause to filter the grouped results.
7. Drop all columns but those specified in the query's result columns, creating those that are computed expressions.
8. Apply the ordering specified in the ORDER BY clause.
No actual SQL engine does this, but it must behave as if that is what happened. Your aggregate function is computed just once, along with any other aggregate functions, in a single pass.
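The logical order described above can be observed with a small experiment. This is only a sketch using Python's sqlite3 module. The orders table and its data are made up for illustration. The WHERE filter removes rows before grouping, and the HAVING filter removes whole groups after the aggregates exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 5.0), ("ann", 50.0), ("bob", 8.0), ("bob", 9.0), ("cy", 200.0)],
)

rows = conn.execute("""
    SELECT customer, SUM(amount) AS total, COUNT(*) AS n
    FROM orders
    WHERE amount >= 8          -- row filter: applied before grouping
    GROUP BY customer
    HAVING SUM(amount) > 20    -- group filter: applied after aggregation
    ORDER BY total DESC        -- ordering: applied last
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Ann's 5.0 order is dropped by WHERE before her total is computed, and bob's group (8.0 + 9.0) is dropped by HAVING because its total does not exceed 20.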

Related

Same return with and without the SUM operator PostgreSQL

I'm using PostgreSQL 10 and trying to run this query. I started with a CTE which I am referencing as 'query.'
SELECT
ROW_NUMBER()OVER() AS my_new_id,
query.geom AS geom,
query.pop AS pop,
query.name,
query.distance AS dist,
query.amenity_size,
((amenity_size)/(distance)^2) AS attract_score,
SUM((amenity_size)/(distance)^2) AS tot_attract_score,
((amenity_size)/(distance)^2) / SUM((amenity_size)/(distance)^2) as marketshare
INTO table_mktshare
FROM query
WHERE
distance > 0
GROUP BY
query.name,
query.amenity_size,
query.geom,
query.pop,
query.distance
The query runs, but the problem lies in the 'marketshare' column. It returns the same answer with or without the SUM operator, and that answer is one, which appears to make the attract_score and the tot_attract_score the same. Why does SUM return the same value as the expression above it?
This is occurring specifically because each combination of columns in the group by clause uniquely identifies one row in the table. I don't know if this is intentional, but more normally, one would expect something like this:
SELECT ROW_NUMBER() OVER() AS my_new_id,
query.geom AS geom, query.pop AS pop, query.name,
SUM((amenity_size)/(distance)^2) AS tot_attract_score
INTO table_mktshare
FROM query
WHERE distance > 0
GROUP BY query.name, query.geom, query.pop;
This may not be exactly what you intend, but it does give a flavor of what's expected.
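If the goal is each row's share of the overall total, one possible approach (not taken from the answer above) is to drop the GROUP BY and compute the total with a window function, SUM(...) OVER (). The sketch below assumes Python's sqlite3 module with SQLite 3.25 or later, a hypothetical amenities table, and distance * distance in place of distance ^ 2 because SQLite has no exponent operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amenities (name TEXT, amenity_size REAL, distance REAL)"
)
conn.executemany(
    "INSERT INTO amenities VALUES (?, ?, ?)",
    [
        ("a", 100.0, 10.0),  # attract_score = 100 / 10^2
        ("b", 300.0, 10.0),  # attract_score = 300 / 10^2
        ("c", 50.0, 0.0),    # removed by WHERE distance > 0
    ],
)

rows = conn.execute("""
    SELECT name,
           amenity_size / (distance * distance) AS attract_score,
           -- OVER () sums across all rows that survive the WHERE clause
           SUM(amenity_size / (distance * distance)) OVER () AS tot_attract_score,
           amenity_size / (distance * distance)
               / SUM(amenity_size / (distance * distance)) OVER () AS marketshare
    FROM amenities
    WHERE distance > 0
    ORDER BY name
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Because no rows are collapsed, attract_score stays per-row while tot_attract_score is the same on every row, so the ratio is no longer always one.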

Impala mathematical operation containing avg fails with AnalysisException

I am attempting to subtract a value in a column (column_18) from the average of another column (avg(column_19)) and obtain this result as a third column (result) for each row of the table:
cur.execute("Select avg(column_19) - column_18 as result FROM test1")
This doesn't seem to be working well, and I get this error:
impala.error.HiveServer2Error: AnalysisException: select list expression not produced by aggregation output (missing from GROUP BY clause?): SUM(column_19) / COUNT(column_19) - column_18
I do not want the result to be grouped
avg() in this context is an aggregate function, which means that it is applied to a group of rows, which may be specified with a GROUP BY clause (or to all rows if none is specified). The output of an aggregate expression is a single value per group, so it is not applied per row as you want.
However, you can accomplish what you're trying to do in a few ways. I think the easiest is by using avg() as an analytic function. For example, you can do something like:
select column_19, column_18, (avg(column_19) over () - column_18) as result from test1
See the documentation for more details about how aggregations and analytic functions work.
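As a rough check of the analytic-function approach, the sketch below runs the same query shape against SQLite (3.25 or later) through Python's sqlite3 module instead of Impala, so engine-specific behavior may differ. The sample data is made up. A scalar subquery is shown as one alternative way to get the same per-row result. That alternative is an assumption of this example, not something the answer spells out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (column_18 REAL, column_19 REAL)")
conn.executemany(
    "INSERT INTO test1 VALUES (?, ?)",
    [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)],
)

# Analytic form: the average is computed over all rows, but rows are not collapsed.
windowed = conn.execute("""
    SELECT column_19, column_18,
           AVG(column_19) OVER () - column_18 AS result
    FROM test1
    ORDER BY column_18
""").fetchall()

# Alternative: compute the average once in a scalar subquery.
subquery = conn.execute("""
    SELECT column_19, column_18,
           (SELECT AVG(column_19) FROM test1) - column_18 AS result
    FROM test1
    ORDER BY column_18
""").fetchall()

print(windowed)
print(windowed == subquery)
conn.close()
```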

MDX WHERE vs FILTER options

I have a query in which I need to do some filtering. I can do it in a subcube, but I am wondering if I could do this in a WHERE clause without a subcube. I think this solution would be faster and cleaner. I need to filter out product models with IB>0 in the last month. This is my solution so far (only part of the query):
SELECT {[Measures].[AFR],[Measures].[IB]} ON COLUMNS,
([dim_ProductModel].[ODM].children)*[Dim_Date].[Date Full].children ON ROWS
FROM
(
SELECT
FILTER([dim_ProductModel].[Product Model].children,
([Measures].[IB]*[Dim_Date].[Date Full].&[2014-04-01]>0)) ON COLUMNS FROM
[cub_dashboard_spares]
)
However, I would prefer to have it in one query without a subquery, something like this (it's not working, though):
SELECT {[Measures].[AFR],[Measures].[IB]} ON COLUMNS,
([dim_ProductModel].[ODM].children)*[Dim_Date].[Date Full].children ON ROWS
FROM
[cub_dashboard_spares]
WHERE FILTER([dim_ProductModel].[Product Model].children,
([Measures].[IB]*[Dim_Date].[Date Full].&[2014-04-01]>0))
I get an error message like:
The MDX function CURRENTMEMBER failed because the coordinate for the ... contains a set..
I basically understand why it is not accepted, since in a WHERE clause I should be more specific, but I wonder if there is some way to rewrite it so that it works.
I don't want that ProductModel appears in the results set.
SELECT {[Measures].[AFR],[Measures].[IB]} ON COLUMNS,
([dim_ProductModel].[ODM].children)*[Dim_Date].[Date Full].children ON ROWS
FROM
[cub_dashboard_spares]
WHERE
({[dim_ProductModel].[Product Model].children},
[Measures].[IB],
PERIODSTODATE(
[Dim_Date].[Date Full], // needs to be a level from your Dim_Date
[Dim_Date].[Date Full].&[2014-04-01]) // needs to be a member from the level you have used in the argument above
)

Use of the HAVING clause when using multiple sums

I was having a problem getting multiple sums from multiple tables. Short story: my problem was solved in the "sql sum data from multiple tables" thread on this site. Where it came up short is that I'd now like to show only sums that are greater than a certain amount. So while I have sub-selects in my select, I think I need to use a HAVING clause to filter out the summed amounts that are too low.
Example, using the code specified in the link above (more specifically the answer that the owner has chosen as correct), I would only like to see a query result if SUM(AP2.Value) > 1500. Any thoughts?
If you need to filter on the result of any aggregate function, you must use a HAVING clause. WHERE is applied at the row level as the DB scans the tables for matching rows. HAVING is applied, conceptually, just before the result set is sent out to the client. At the time WHERE operates, the aggregate function results are not (and cannot be) available, so you have to use a HAVING clause, which is applied after the main query is complete and all aggregate results are available.
So... long story short, yes, you'll need to do
SELECT ...
FROM ...
WHERE ...
HAVING (SUM_AP > 1500)
Note that some engines (MySQL, for example) let you use column aliases in the HAVING clause, while others (SQL Server, for example) require you to repeat the aggregate expression. In technical terms, HAVING on a query as above works essentially the same as wrapping the initial query in another query and applying another WHERE clause on the wrapper:
SELECT *
FROM (
SELECT ...
) AS child
WHERE (SUM_AP > 1500)
You could wrap that query as a subselect and then specify your criteria in the WHERE clause:
SELECT
PROJECT,
SUM_AP,
SUM_INV
FROM (
SELECT
AP1.[PROJECT],
(SELECT SUM(AP2.Value) FROM AP AS AP2 WHERE AP2.PROJECT = AP1.PROJECT) AS SUM_AP,
(SELECT SUM(INV2.Value) FROM INV AS INV2 WHERE INV2.PROJECT = AP1.PROJECT) AS SUM_INV
FROM AP AS AP1
INNER JOIN INV AS INV1 ON
AP1.[PROJECT] = INV1.[PROJECT]
WHERE
AP1.[PROJECT] = 'XXXXX'
GROUP BY
AP1.[PROJECT]
) SQ
WHERE
SQ.SUM_AP > 1500
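The two forms discussed above, HAVING on the aggregate and a WHERE on a wrapping query, can be compared side by side. This is a sketch only. It uses Python's sqlite3 module, a simplified hypothetical ap table, and a plain GROUP BY instead of the correlated sub-selects from the linked thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ap (project TEXT, value REAL)")
conn.executemany(
    "INSERT INTO ap VALUES (?, ?)",
    [("P1", 1000.0), ("P1", 800.0), ("P2", 400.0), ("P2", 300.0)],
)

# Form 1: filter the groups directly with HAVING.
having_rows = conn.execute("""
    SELECT project, SUM(value) AS sum_ap
    FROM ap
    GROUP BY project
    HAVING SUM(value) > 1500
""").fetchall()

# Form 2: wrap the aggregate query and filter its output with WHERE.
wrapped_rows = conn.execute("""
    SELECT *
    FROM (
        SELECT project, SUM(value) AS sum_ap
        FROM ap
        GROUP BY project
    ) AS child
    WHERE sum_ap > 1500
""").fetchall()

print(having_rows)
print(having_rows == wrapped_rows)
conn.close()
```

The HAVING clause repeats SUM(value) rather than using the alias, which keeps the query portable to engines that do not accept aliases there.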

Group by SQL statement

So I got this statement, which works fine:
SELECT MAX(patient_history_date_bio) AS med_date, medication_name
FROM biological
WHERE (patient_id = 12)
GROUP BY medication_name
But, I would like to have the corresponding medication_dose also. So I type this up
SELECT MAX(patient_history_date_bio) AS med_date, medication_name, medication_dose
FROM biological
WHERE (patient_id = 12)
GROUP BY medication_name
But it gives me an error saying:
"Column 'biological.medication_dose' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."
So I try adding medication_dose to the GROUP BY clause, but then it gives me extra rows that I don't want.
I would like to get the latest row for each medication in my table. (The latest row is determined by the max function, getting the latest date).
How do I fix this problem?
Use:
SELECT b.medication_name,
b.patient_history_date_bio AS med_date,
b.medication_dose
FROM BIOLOGICAL b
JOIN (SELECT y.medication_name,
MAX(y.patient_history_date_bio) AS max_date
FROM BIOLOGICAL y
WHERE y.patient_id = ? -- same patient, so max_date is that patient's latest date
GROUP BY y.medication_name) x ON x.medication_name = b.medication_name
AND x.max_date = b.patient_history_date_bio
WHERE b.patient_id = ?
If you really have to, as one quick workaround, you can apply an aggregate function to your medication_dose such as MAX(medication_dose).
However, note that this is normally an indication that you are either building the query incorrectly, or that you need to refactor/normalize your database schema. In your case, it looks like you are tackling the query incorrectly. The correct approach should be the one suggested by OMG Ponies in another answer.
You may be interested in checking out the following interesting article which describes the reasons behind this error:
But WHY Must That Column Be Contained in an Aggregate Function or the GROUP BY clause?
You need to put max(medication_dose) in your select. Group by returns a result set that contains distinct values for fields in your group by clause, so apparently you have multiple records that have the same medication_name, but different doses, so you are getting two results.
By putting in max(medication_dose) it will return the maximum dose value for each medication_name. You can use any aggregate function on dose (max, min, avg, sum, etc.)
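The difference between the join approach and the MAX(medication_dose) workaround shows up whenever the latest row does not carry the highest dose. The sketch below uses Python's sqlite3 module with made-up rows. Restricting the inner query to the same patient is an assumption of this example, made so that another patient's later date cannot hide the row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE biological (
        patient_id INTEGER,
        medication_name TEXT,
        medication_dose INTEGER,
        patient_history_date_bio TEXT  -- ISO dates sort correctly as text
    )
""")
conn.executemany(
    "INSERT INTO biological VALUES (?, ?, ?, ?)",
    [
        (12, "aspirin", 500, "2020-01-01"),
        (12, "aspirin", 100, "2020-06-01"),   # latest row, but lower dose
        (12, "ibuprofen", 200, "2020-03-01"),
        (13, "aspirin", 900, "2021-01-01"),   # a different patient
    ],
)

# Join each row to its medication's latest date: returns the actual latest row.
latest = conn.execute("""
    SELECT b.medication_name, b.patient_history_date_bio, b.medication_dose
    FROM biological b
    JOIN (SELECT medication_name, MAX(patient_history_date_bio) AS max_date
          FROM biological
          WHERE patient_id = ?
          GROUP BY medication_name) x
      ON x.medication_name = b.medication_name
     AND x.max_date = b.patient_history_date_bio
    WHERE b.patient_id = ?
    ORDER BY b.medication_name
""", (12, 12)).fetchall()

# Workaround: MAX(dose) is computed independently of MAX(date).
max_dose = conn.execute("""
    SELECT medication_name, MAX(patient_history_date_bio), MAX(medication_dose)
    FROM biological
    WHERE patient_id = ?
    GROUP BY medication_name
    ORDER BY medication_name
""", (12,)).fetchall()

print(latest)
print(max_dose)
conn.close()
```

For aspirin, the join returns the dose of 100 from the latest row, while the MAX workaround pairs the latest date with the dose of 500 from an older row.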