Is there a standard for SQL aggregate function calculation?

Is there a standard on SQL implementation for multiple calls to the same aggregate function in the same query?
For instance, consider the following query, based on a popular example schema:
SELECT Customer,SUM(OrderPrice) FROM Orders
GROUP BY Customer
HAVING SUM(OrderPrice)>1000
Presumably, it takes computation time to calculate the value of SUM(OrderPrice). Is this cost incurred for each reference to the aggregate function, or is the result stored for a particular query?
Or, is there no standard for SQL engine implementation for this case?

Although I have worked with many different DBMSs, I will only show the proof on SQL Server. Consider this query, which even includes a CAST in the expression. Looking at the query plan, the expression sum(cast(number as bigint)) is computed only once, defined as DEFINE:([Expr1005]=SUM([Expr1006])).
set showplan_text on
select type, sum(cast(number as bigint))
from master..spt_values
group by type
having sum(cast(number as bigint)) > 100000
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|--Filter(WHERE:([Expr1005]>(100000)))
     |--Hash Match(Aggregate, HASH:([Expr1004]), RESIDUAL:([Expr1004] = [Expr1004]) DEFINE:([Expr1005]=SUM([Expr1006])))
          |--Compute Scalar(DEFINE:([Expr1004]=CONVERT(nchar(3),[mssqlsystemresource].[sys].[spt_values].[type],0), [Expr1006]=CONVERT(bigint,[mssqlsystemresource].[sys].[spt_values].[number],0)))
               |--Index Scan(OBJECT:([mssqlsystemresource].[sys].[spt_values].[ix2_spt_values_nu_nc]))
It may not be very obvious above, since the plan doesn't show the SELECT output, so I have added a *10 to the query below. Notice that the plan now includes one extra step, DEFINE:([Expr1006]=[Expr1005]*(10)) (steps run bottom to top), which shows that the new expression required an extra calculation. Yet even this is optimized: it doesn't recalculate the entire expression - it merely takes Expr1005 and multiplies that by 10!
set showplan_text on
select type, sum(cast(number as bigint))*10
from master..spt_values
group by type
having sum(cast(number as bigint)) > 100000
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|--Compute Scalar(DEFINE:([Expr1006]=[Expr1005]*(10)))
     |--Filter(WHERE:([Expr1005]>(100000)))
          |--Hash Match(Aggregate, HASH:([Expr1004]), RESIDUAL:([Expr1004] = [Expr1004]) DEFINE:([Expr1005]=SUM([Expr1007])))
               |--Compute Scalar(DEFINE:([Expr1004]=CONVERT(nchar(3),[mssqlsystemresource].[sys].[spt_values].[type],0), [Expr1007]=CONVERT(bigint,[mssqlsystemresource].[sys].[spt_values].[number],0)))
                    |--Index Scan(OBJECT:([mssqlsystemresource].[sys].[spt_values].[ix2_spt_values_nu_nc]))
This is very likely how all the other DBMSs work as well, at least the major ones, e.g. PostgreSQL, Sybase, Oracle, DB2, Firebird, and MySQL.
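If you want to verify this on another engine, the same kind of plan inspection works there too. A minimal sketch for PostgreSQL, using the Orders schema from the question (EXPLAIN VERBOSE prints the plan nodes; the SUM should appear only once, inside the Aggregate node):
EXPLAIN VERBOSE
SELECT Customer, SUM(OrderPrice)
FROM Orders
GROUP BY Customer
HAVING SUM(OrderPrice) > 1000;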

Related

SQLite alias (AS) not working in the same query

I'm stuck on an (apparently) extremely trivial task that I can't make work, and I really feel I have no choice but to ask for advice.
I used to deal with PHP/MySQL more than 10 years ago and I might be quite rusty now that I'm dealing with an SQLite DB using Qt5.
Basically I'm selecting some records and want to perform some math operations on the fetched columns. I recall (and have re-read some documentation and examples) that the keyword "AS" conveniently renames (aliases) a value.
So for example I have this query, where "X" is an integer number that I render into this big Qt string before executing it with a QSqlQuery. This query lets me select all the electronic components used in a Project and calculate how many of them to order (rounding to the nearest multiple of 5) and the total price per component.
SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category,
Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer,
Inventory.price, UsedItems.qty_used as used_qty,
UsedItems.qty_used*X AS To_Order,
ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5,
Inventory.price*Nearest5 AS TotPrice
FROM Inventory
LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
WHERE UsedItems.pid='1'
ORDER BY RefDes, value ASC
So, for example, I aliased UsedItems.qty_used as used_qty. At first I tried to use it in the next field, multiplying it by X, writing "used_qty*X AS To_Order" ... the query failed. Well, no worries, I just put back the original table.field name and it worked.
Going further, I have a complex calculation and I want to use its result in the next field, but the same issue popped up: if I alias the ROUND(...) expression AS Nearest5 and then try to use that value by multiplying it in the next field, the query fails.
Please note: the query WORKS, but ONLY if I don't use aliases in the following fields, namely if I don't use the alias Nearest5 in the TotPrice field. I just want to avoid re-writing the whole ROUND(...) expression for the TotPrice field.
What am I missing/doing wrong? Either SQLite does not support aliases within the same query, or I am using the wrong syntax and am just too stuck/confused to see the mistake (which I'm sure has to be a really stupid one).
Column aliases defined in a SELECT cannot be used:
For other expressions in the same SELECT.
For filtering in the WHERE clause.
For join conditions in the FROM clause.
Many databases also restrict their use in GROUP BY and HAVING.
All databases support them in ORDER BY.
This is how SQL works. The issue is two things:
The logical order of processing clauses in the query (i.e. how they are compiled). This affects the scoping of identifiers.
The order of processing expressions within the SELECT. This is indeterminate; there is no required ordering among them.
For a simple example, what should x refer to in this query?
select x as a, y as x
from t
where x = 2;
By not allowing alias references there, SQL engines do not have to make a choice: the value is always t.x.
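To see these scoping rules in action, here is a minimal sketch (the items table and its columns are made up for illustration):
SELECT price * qty AS total
FROM items
ORDER BY total;              -- OK: every database lets ORDER BY see the alias

SELECT price * qty AS total,
       total * 2 AS doubled  -- fails: a SELECT expression cannot see a sibling alias
FROM items;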
You can try nested queries.
A SELECT query can be nested inside another SELECT query within the FROM clause;
multiple queries can be nested, for example following this pattern:
SELECT *, [your last Expression] AS LastExp
FROM (SELECT *, [your Middle Expression] AS MidExp
      FROM (SELECT *, [your first Expression] AS FirstExp
            FROM yourTables));
Naturally, visibility follows the nesting order: expressions defined in the innermost SELECT can be used by all the outer queries, while intermediate expressions can only be used by the queries that wrap them.
For your case, your query may be:
SELECT *, PRC*Nearest5 AS TotPrice
FROM (SELECT *, ROUND((used_qty*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5
      FROM (SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category,
                   Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer,
                   Inventory.price AS PRC, UsedItems.qty_used AS used_qty, UsedItems.qty_used*X AS To_Order
            FROM Inventory
            LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
            WHERE UsedItems.pid='1'))
ORDER BY RefDes, value ASC

SQL - HAVING (execution vs structure)

I'm a beginner, studying on my own... please help me clarify something about a query. I am working with a soccer database and trying to answer this question: list all seasons with an average goals-per-match rate over 1, counting only matches that didn't end in a draw.
The right query for it is:
select season,round((sum(home_team_goal+away_team_goal) *1.0) /count(id),3) as ratio
from match
where home_team_goal != away_team_goal
group by season
having ratio > 1
I don't understand 2 things about this query:
Why do I multiply by 1.0? Why is it necessary?
I know that SQL clauses are executed in this order:
from
where
group
having
select
So how can this query include having ratio > 1 if "ratio" is only defined in the "select", which is executed AFTER the HAVING?
Am I confused?
Thanks in advance for the help!
The multiplication is added as a typecast to convert INT to FLOAT, because by default the sum of ints is an int, and dividing two ints loses the decimal places.
HAVING. You can think of HAVING as a WHERE applied to the query results. Imagine the query being executed first without HAVING, and then the HAVING condition being applied to the result rows, leaving only the suitable ones.
In your case, you first select the grouped data and calculate the aggregated results, and then discard the unwanted aggregation results.
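A sketch of that mental model: rewriting the query so the aggregation runs in a subquery and the filter becomes an ordinary WHERE gives the same result as the HAVING version:
SELECT *
FROM (SELECT season,
             round((sum(home_team_goal+away_team_goal)*1.0)/count(id),3) AS ratio
      FROM match
      WHERE home_team_goal != away_team_goal
      GROUP BY season)
WHERE ratio > 1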
The *1.0 is used for its ".0" part: it tells the system to treat the expression as a decimal, and thus not perform an integer division, which would cut off the decimal part (e.g. 1 instead of 1.33).
About the second part: SELECT being at the end just means that the last thing to be done is showing the data. However, assigning an alias to a calculated field is done, you could say, at first priority. Still, I am a bit doubtful; I am almost certain field aliases cannot be used in the WHERE/GROUP BY/HAVING in, say, SQL Server.
There is no order of execution of a SQL query. SQL is a descriptive language not a procedural language. A SQL query describes the result set that the query is producing. The SQL engine can execute it however it likes. In fact, most SQL engines compile the query into a directed acyclic graph, which looks nothing like the original query.
What you are referring to might be better phrased as the "order of interpretation", which is described by a few simple rules. Column aliases can be used in the ORDER BY clause in any database. They cannot be used in the FROM, WHERE, or GROUP BY clauses. Some databases -- such as SQLite -- allow them to be referenced in the HAVING clause.
As for the * 1.0, it is needed because some databases -- such as SQLite -- do integer arithmetic. However, the logic you want is probably more simply expressed as:
round(avg((home_team_goal + away_team_goal) * 1.0), 3)
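To see why the * 1.0 matters, compare integer and decimal division directly (results shown for SQLite; engines that do integer arithmetic behave the same way):
SELECT 4 / 3;        -- 1       (integer division truncates)
SELECT 4 * 1.0 / 3;  -- 1.3333  (one decimal operand makes the whole division decimal)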

How to properly compute weighted average for zeroes in SQL

I have the following problem: I'm computing a weighted average in SQL as SUM(Value * Weight) / SUM(Weight). However, the rows can be empty (SUM(Weight) == 0), and in that case the query fails. Is it somehow possible to return 0 as the result in this case?
I have tried CASE SUM(Weight) WHEN 0 THEN 0 ELSE SUM(Value * Weight) / SUM(Weight) END, but I'm afraid that it evaluates SUM(Weight) twice, which could be fairly expensive in my case.
Use NULLIF and ISNULL:
ISNULL(SUM(Value * Weight) / NULLIF(SUM(Weight),0),0)
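ISNULL is SQL Server specific; the standard COALESCE works the same way in virtually every engine. A sketch, using a hypothetical Samples table:
SELECT COALESCE(SUM(Value * Weight) / NULLIF(SUM(Weight), 0), 0) AS weighted_avg
FROM Samples;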
The SQL engine doesn't compute sum(Weight) twice, just once. The conceptual process is:
compute the full cartesian join of all tables in the from clause
apply the join criteria to filter the results
apply the where clause criteria to filter the results
partition this result set into groups as defined by the group by clause
collapse each such group into one row, computing any aggregate functions that have been specified, keeping only those columns listed in the result set (aggregate functions and grouping columns),
apply the criteria in the having clause to filter the grouped results,
drop all columns but those specified in the queries result columns, creating those that are computed expressions.
apply the ordering specified in the order by clause.
No actual SQL engine does this, but it must behave as if that is what happened. Your aggregate function is computed just once, along with any other aggregate functions, in a single pass.
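That also means the CASE version you tried is safe: both SUM references are produced in the same aggregation pass rather than computed twice. A sketch with the same hypothetical Samples table:
SELECT CASE WHEN SUM(Weight) = 0 THEN 0
            ELSE SUM(Value * Weight) / SUM(Weight)
       END AS weighted_avg
FROM Samples;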

SQL Server aggregate performance

I am wondering whether SQL Server knows to 'cache', if you like, aggregates within a query if they are used again.
For example,
Select Sum(Field),
Sum(Field) / 12
From Table
Would SQL Server know that it has already calculated the Sum function on the first field and then just divide it by 12 for the second? Or would it run the Sum function again and then divide the result by 12?
Thanks
It calculates once
Select
Sum(Price),
Sum(Price) / 12
From
MyTable
The plan gives:
|--Compute Scalar(DEFINE:([Expr1004]=[Expr1003]/(12.)))
     |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
          |--Stream Aggregate(DEFINE:([Expr1010]=Count(*), [Expr1011]=SUM([myDB].[dbo].[MyTable].[Price])))
               |--Index Scan(OBJECT:([myDB].[dbo].[MyTable].[IX_SomeThing]))
This table has 1.35 million rows
Expr1011 = SUM
Expr1003 = some internal handling to do with "no rows" etc., but it is basically Expr1011
Expr1004 = Expr1011 / 12
According to the execution plan, it doesn't re-sum the column.
Good question. I think the answer is no, it doesn't cache it.
I ran a test query with around 3000 counts in it, and it was much slower than one with only a few. I still want to test whether the query would be just as slow selecting plain columns.
Edit: OK, I just tried selecting a large number of columns versus just one, and the number of columns (when thousands are returned) does affect the speed.
Overall, unless you are using that aggregate number a ton of times in your query, you should be fine. If push comes to shove, you could always save the outcome to a variable and do the math after the fact.
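A sketch of that variable approach in T-SQL, reusing the MyTable/Price names from the answer above (the bigint type is an assumption; match it to the column):
DECLARE @total bigint;
SELECT @total = SUM(Price) FROM MyTable;          -- aggregate computed once
SELECT @total AS Total, @total / 12 AS PerMonth;  -- further math reuses the stored value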

Behavior of SQL OR and AND operator

We have the following expression in a T-SQL query:
Exp1 OR Exp2
Is Exp2 evaluated when Exp1 is True? I think there is no need to evaluate it.
Similarly; for,
Exp1 AND Exp2
is Exp2 evaluated when Exp1 is false?
Unlike in some programming languages, you cannot count on short-circuiting in T-SQL WHERE clauses. It might happen, or it might not.
SQL Server doesn't necessarily evaluate expressions in left-to-right order. Evaluation order is controlled by the execution plan, and the plan is chosen based on the overall estimated cost for the whole query. So there is no certainty that SQL will perform the kind of short-circuit optimisation you are describing. That flexibility is what makes the optimiser useful. For example, it could be that the second expression in each case can be evaluated more efficiently than the first (if it is indexed or subject to some constraint, for example).
SQL also uses three-valued logic, which means that some of the equivalence rules used in two-valued logic don't apply (although that doesn't alter the specific example you describe).
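A classic illustration of why you cannot rely on short-circuiting (t and col are hypothetical; col is a varchar holding mostly numbers): the following can still fail with a conversion error, because the plan is free to evaluate the CAST before, or alongside, the ISNUMERIC check:
SELECT *
FROM t
WHERE ISNUMERIC(col) = 1      -- no guarantee this filter runs first...
  AND CAST(col AS int) > 10;  -- ...so the CAST can still see non-numeric rows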
SQL Server query operators OR and AND are commutative. There is no inherent order and the query optimizer is free to choose the path of least cost to begin evaluation. Once the plan is set, the other part is not evaluated if a result is pre-determined.
This knowledge allows queries like
select * from master..spt_values
where (type = 'P' or 1=@param1)
and (1=@param2 or number < 1000)
option (recompile)
Where the pattern of evaluation is guaranteed to short-circuit when the @param is set to 1. This pattern is typical of optional filters. Notice that it does not matter whether the @params are tested before or after the other part.
If you are very good with SQL and know for a fact that the query is best forced down a certain plan, you can game SQL Server using CASE expressions, which are always evaluated in nested order. The example below forces type='P' to always be evaluated first.
select *
from master..spt_values
where
case when type='P' then
    case when number < 100 then 1
    end
end = 1
If you don't believe the order of evaluation of the last query, try this:
select *
from master..spt_values
where
case when type='P' then
    case when number < 0 then
        case when 1/0=1 then 1
        end
    end
end = 1
Even though the constants in the expression 1/0=1 are the cheapest to evaluate, it is NEVER evaluated - otherwise the query would have resulted in a divide-by-zero error instead of returning no rows (there are no rows in master..spt_values matching both conditions).
SQL Server sometimes performs boolean short circuiting, and sometimes does not.
It depends upon the query execution plan that is generated. The execution plan chosen depends on several factors, including the selectivity of the columns in the WHERE clause, table size, available indexes etc.