I really want to use the median window function as an aggregate function.
I currently am forced to use the window function in a sub-select, and then aggregate over it like this:
SELECT id, MIN(avg) AS mean, MIN(median) AS median, COUNT(*)
FROM (
SELECT id, AVG(metric) OVER(PARTITION BY id) AS avg, MEDIAN(metric) OVER(PARTITION BY id) AS median
FROM data_table
) AS sub
GROUP BY id;
Is there a way to aggregate over a window function result so there's only one SELECT statement?
Strictly speaking, your example query could be rewritten:
SELECT id,
AVG(metric),
MEDIAN(metric),
COUNT(*)
FROM data_table
GROUP BY id;
But I'm wondering if you just picked a poor example that happens to be mathematically capable of simplification. This is a special case because the subquery and the main query are aggregating on the same field, and the outer aggregates are picking a minimum from what would be a set of identical values.
If that's not the case and your actual query and subquery are not grouping by the same field, then the answer is no, you need a subquery for two reasons:
First, by ANSI definition, window functions are evaluated after the WHERE, GROUP BY, and HAVING clauses. There is no clause to specify your desired behavior of aggregating after a window function, so you must use a subquery or CTE.
Second, even if you eliminated the windowing from the OVER() clause, you would still need to GROUP BY data that you only know after the first round of aggregation has been completed.
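A minimal sketch of the special case described above, run through Python's built-in sqlite3 module so it can be checked locally. It assumes SQLite 3.25 or later (for window functions), uses made-up rows, and leaves out MEDIAN because SQLite has no built-in MEDIAN function; AVG stands in to show that the windowed subquery and the plain GROUP BY give the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_table (id INTEGER, metric REAL);
    -- Hypothetical sample rows, not from the original question.
    INSERT INTO data_table VALUES (1, 10), (1, 20), (1, 30), (2, 5), (2, 15);
""")

# Window function in a subquery, collapsed with MIN() in the outer query.
windowed = conn.execute("""
    SELECT id, MIN(avg_metric) AS mean, COUNT(*) AS n
    FROM (
        SELECT id, AVG(metric) OVER (PARTITION BY id) AS avg_metric
        FROM data_table
    ) AS sub
    GROUP BY id
    ORDER BY id
""").fetchall()

# Plain aggregates with GROUP BY: one SELECT, no subquery.
grouped = conn.execute("""
    SELECT id, AVG(metric) AS mean, COUNT(*) AS n
    FROM data_table
    GROUP BY id
    ORDER BY id
""").fetchall()

print(windowed)
print(grouped)
```

Both queries return the same rows here only because the window partition and the outer GROUP BY use the same column.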
Question in the title. Thanks for the time.
Example:
SELECT
customer_id,
SUM(unit_price * quantity) AS total_price
FROM orders o
JOIN order_items oi
ON o.order_id = oi.order_id
GROUP BY customer_id
Yes, grouping goes hand in hand with aggregate functions: any resulting column not contained within an aggregate function must appear in the GROUP BY clause.
Grouping and the operations of aggregation are typically linked to three keywords:
any aggregation function
the GROUP BY clause
the DISTINCT (ON) modifier after the SELECT keyword
You can use:
aggregation functions alone when they are used on every field found inside the SELECT statement
the GROUP BY clause alone - allowed, but a bad practice, because you signal an aggregation without specifying any aggregation function inside the SELECT statement (DISTINCT states the same intent more directly)
the DISTINCT modifier alone to select distinct rows (not fields)
the DISTINCT ON modifier (PostgreSQL), which should be accompanied by an ORDER BY clause that defines which row is extracted for each group; without one, the row that is kept is unpredictable
aggregation functions + the GROUP BY clause, which must contain all fields found in the SELECT statement that are not being aggregated
the DISTINCT modifier + the GROUP BY clause - you can do it but the GROUP BY clause really is superfluous as it is already implied by the DISTINCT keyword itself.
You can't use:
aggregation functions + the GROUP BY clause when non-aggregated fields included in the SELECT statement are not found within the GROUP BY clause
aggregation functions alone when in the SELECT statement they are found together with non-aggregated fields.
When you can't aggregate because you need to select multiple fields, window functions combined with a filtering operation (a WHERE clause in an outer query) typically come in handy for computing values row by row and removing unneeded records.
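As a rough illustration of the window function plus filter pattern, the sketch below keeps the most recent order per customer while still returning every column. It runs through Python's sqlite3 module (SQLite 3.25 or later assumed); the order_date and total columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         order_date TEXT, total REAL);
    -- Hypothetical sample rows.
    INSERT INTO orders VALUES
        (1, 100, '2024-01-05', 20.0),
        (2, 100, '2024-02-10', 35.0),
        (3, 200, '2024-01-20', 15.0);
""")

# The window function numbers each customer's orders, newest first.
# The outer WHERE then removes everything except the first row per customer.
latest = conn.execute("""
    SELECT order_id, customer_id, order_date, total
    FROM (
        SELECT o.*,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY order_date DESC) AS rn
        FROM orders o
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()

print(latest)
```

The filter has to live in the outer query because the window function result does not exist yet when the inner WHERE clause is evaluated.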
I distinctly recall that T-SQL will never let you mix LAG and WHERE. For example,
SELECT FOO
WHERE LAG(BAR) OVER (ORDER BY DATE) > 7
will never work. T-SQL will not run it no matter what you do. But does T-SQL ever let you mix LAG with HAVING?
Note: All that an answer needs to do is either give a theory-based or documentation-based reason why it does not, or give any example at all of where it does.
From Logical Processing Order of the SELECT statement:
The following steps show the logical processing order, or binding
order, for a SELECT statement......
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
Window functions are evaluated at the level of SELECT, which comes after HAVING, so the answer is no: you can't use window functions in the HAVING clause.
The HAVING clause filters groups: those produced by GROUP BY or, when there is no GROUP BY, the whole result treated as a single group. Its condition can therefore reference only grouping columns and aggregate functions such as MIN, MAX, SUM, and COUNT. LAG is an analytic function that is evaluated later, so it cannot be used in the HAVING clause.
To filter on a LAG result, compute it in a CTE or subquery and apply the condition in the outer query.
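The CTE workaround can be sketched as follows. This is not T-SQL: it uses SQLite through Python's sqlite3 module (3.25 or later assumed) as a stand-in, and the readings table, its columns, and its rows are hypothetical. SQLite also rejects a window function in WHERE, which the first query shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (reading_date TEXT, bar INTEGER);
    -- Hypothetical sample rows.
    INSERT INTO readings VALUES
        ('2024-01-01', 5), ('2024-01-02', 9),
        ('2024-01-03', 3), ('2024-01-04', 8);
""")

# A window function directly in WHERE is rejected when the query is prepared.
try:
    conn.execute("""
        SELECT reading_date FROM readings
        WHERE LAG(bar) OVER (ORDER BY reading_date) > 7
    """)
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Compute LAG in a CTE first, then filter on its result in the outer query.
rows = conn.execute("""
    WITH lagged AS (
        SELECT reading_date, bar,
               LAG(bar) OVER (ORDER BY reading_date) AS prev_bar
        FROM readings
    )
    SELECT reading_date, bar, prev_bar
    FROM lagged
    WHERE prev_bar > 7
    ORDER BY reading_date
""").fetchall()

print(rejected, rows)
```

The same shape (CTE or derived table, then an outer WHERE) is what the logical processing order implies for any engine that evaluates window functions at the SELECT step.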
I am still confused by the syntax rules of using GROUP BY. I understand we use GROUP BY when there is some aggregate function. If I have even one aggregate function in a SQL statement, do I need to put all of my selected columns into my GROUP BY statement? I don't have a specific query to ask about but when I try to do joins, I get errors. In particular, when I use a count(*) in a statement and/or a join, I just seem to mess it up.
I use BigQuery at my job. I am regularly floored by strange gaps in knowledge.
Thank you!
This is a little complicated.
First, no aggregation functions are needed in an aggregation query. So this is allowed:
select a
from t
group by a;
This is equivalent, by the way, to:
select distinct a
from t;
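A quick check of that equivalence, sketched with Python's sqlite3 module rather than BigQuery; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a TEXT);
    -- Hypothetical sample rows with duplicates.
    INSERT INTO t VALUES ('x'), ('y'), ('x'), ('z'), ('y');
""")

# Neither query guarantees an output order, so sort before comparing.
grouped = sorted(conn.execute("SELECT a FROM t GROUP BY a").fetchall())
distinct = sorted(conn.execute("SELECT DISTINCT a FROM t").fetchall())

print(grouped)
print(distinct)
```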
If there are aggregation functions, then no group by is needed. So, this is allowed:
select max(a)
from t;
Such an aggregation query -- with no group by -- always returns one row. This is true even if the table is empty or a where clause filters out all the rows. In that case, most aggregation functions return NULL, with the notable exception of count() that returns 0.
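The one-row behaviour is easy to confirm. The sketch below uses Python's sqlite3 module as a stand-in for BigQuery, with hypothetical tables t and empty_t.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER);
    INSERT INTO t VALUES (1), (2), (3);
    CREATE TABLE empty_t (a INTEGER);  -- deliberately left empty
""")

# No GROUP BY: exactly one row comes back, even from an empty table.
from_empty = conn.execute("SELECT MAX(a), COUNT(*) FROM empty_t").fetchall()

# The same holds when a WHERE clause filters out every row.
from_filtered = conn.execute(
    "SELECT MAX(a), COUNT(*) FROM t WHERE a > 100"
).fetchall()

print(from_empty)
print(from_filtered)
```

In both results MAX comes back as NULL (None in Python) and COUNT(*) as 0.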
Next, if you mix aggregation functions and non-aggregation expressions in the select, then in general you want the non-aggregation, non-constant expressions in the group by. I should note that you can do:
select a, concat(a, 'bcd'), count(*)
from t
group by a;
This should work, but sometimes BigQuery gets confused and will want the expression in the group by.
Finally, the SQL standard supports a query like this:
select t.*, count(*)
from t join
u
using (foo)
group by t.a;
This is allowed when a is the primary key (or equivalent) in t. However, BigQuery does not have primary keys, so this is not relevant to that database.
I cannot find any source mentioning execution order for Partition By window functions in SQL.
Is it in the same order as Group By?
For example table like:
Select *, row_number() over (Partition by Name)
from NPtable
Where Name = 'Peter'
I understand if Where gets executed first, it will only look at Name = 'Peter', then execute window function that just aggregates this particular person instead of entire table aggregation, which is much more efficient.
But when the query is:
Select top 1 *, row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
Doesn't the window function need to be executed against the entire table first, with the Date > condition applied afterwards? Otherwise, wouldn't the result be wrong?
Window functions are executed/calculated at the same stage as SELECT, stage 5 in your table. In other words, window functions are applied to all rows that are "visible" in the SELECT stage.
In your second example
Select top 1 *,
row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
WHERE is logically applied before Partition by Name of the row_number() function.
Note that this is the logical order of processing the query, not necessarily how the engine physically processes the data.
If the query optimiser decides that it is cheaper to scan the whole table and later discard dates according to the WHERE filter, it can do so. But any such transformation must be performed in such a way that the final result is consistent with the order of the logical steps outlined in the table you showed.
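To make the logical order concrete, here is a small sketch using Python's sqlite3 module (SQLite 3.25 or later assumed) in place of SQL Server, so TOP is left out. The rows are invented. The first query filters and numbers in one SELECT; the second numbers the whole table in a subquery and filters afterwards, which is what you would write if you wanted the numbering computed before the Date condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE NPtable (Name TEXT, Date TEXT);
    -- Hypothetical sample rows.
    INSERT INTO NPtable VALUES
        ('Peter', '2018-01-01 00:00:00'),
        ('Peter', '2018-01-03 00:00:00'),
        ('Peter', '2018-01-05 00:00:00'),
        ('Anna',  '2018-01-04 00:00:00');
""")

# WHERE is applied first, so row_number() only sees the surviving rows.
filtered_first = conn.execute("""
    SELECT Name, Date,
           row_number() OVER (PARTITION BY Name ORDER BY Date) AS rn
    FROM NPtable
    WHERE Date > '2018-01-02 00:00:00'
    ORDER BY Name, Date
""").fetchall()

# Numbering the full table in a subquery, then filtering, keeps the
# original positions.
numbered_first = conn.execute("""
    SELECT Name, Date, rn
    FROM (
        SELECT Name, Date,
               row_number() OVER (PARTITION BY Name ORDER BY Date) AS rn
        FROM NPtable
    ) AS numbered
    WHERE Date > '2018-01-02 00:00:00'
    ORDER BY Name, Date
""").fetchall()

print(filtered_first)
print(numbered_first)
```

Peter's rows are numbered 1 and 2 in the first result and 2 and 3 in the second, because his earliest row is removed before numbering in one case and after it in the other.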
It is part of the SELECT phase of the query execution. There are different types of SELECT clauses, based on the query.
SELECT FOR
SELECT GROUP BY
SELECT ORDER BY
SELECT OVER
SELECT INTO
SELECT HAVING
PARTITION BY comes in the SELECT OVER clause. Here, a window of the result set is generated out of the result set generated in the previous stages: FROM, WHERE, GROUP BY etc.
The OVER clause defines a window or user-specified set of rows within
a query result set. A window function then computes a value for each
row in the window. You can use the OVER clause with functions to
compute aggregated values such as moving averages, cumulative
aggregates, running totals, or a top N per group results.
OVER ( [ PARTITION BY value_expression ] [ order_by_clause ] )
Arguments
PARTITION BY Divides the query result set into partitions. The window
function is applied to each partition separately and computation
restarts for each partition.
value_expression Specifies the column by which the rowset is
partitioned. value_expression can only refer to columns made available
by the FROM clause. value_expression cannot refer to expressions or
aliases in the select list. value_expression can be a column
expression, scalar subquery, scalar function, or user-defined
variable.
order_by_clause Defines the logical order of the rows within each
partition of the result set. That is, it specifies the logical order
in which the window function calculation is performed.
order_by_expression Specifies a column or expression on which to sort.
order_by_expression can only refer to columns made available by the
FROM clause. An integer cannot be specified to represent a column name
or alias.
You can read more about it in the SELECT - OVER clause documentation.
row_number() (and other window functions) are allowed in two clauses:
SELECT
ORDER BY
The function is parsed along with the rest of the clause. After all, it is a function present in the clause. In both cases, the WHERE clause would be -- logically -- applied first, so the results would be after filtering.
Do note that this is a logical parsing of the query. The actual execution may have little to do with the structure of the query.
I have a table DATE_VALUE like this:
Date Value
---- -----
01/01/2012 1.5
02/01/2012 1.7
03/01/2012 1.3
04/01/2012 2.1
05/01/2012 3.4
I want to calculate variance between differences of value between 2 consecutive dates.
However this simple query does not work:
select variance(lead( value,1) OVER (order by date) - value)
from DATE_VALUE
I got an error:
ORA-30483: window functions are not allowed here
30483. 00000 - "window functions are not allowed here"
*Cause: Window functions are allowed only in the SELECT list of a query.
And, window function cannot be an argument to another window or group
function.
The query works fine if I move the variance function out of the query:
select variance(difvalue) from (
select lead( value,1) OVER (order by date) - value as difvalue
from DATE_VALUE
);
I wonder if there is any way to modify the query such that there is no sub-query used?
From Oracle reference:
Analytic functions are the last set of operations performed in a query
except for the final ORDER BY clause. All joins and all WHERE, GROUP
BY, and HAVING clauses are completed before the analytic functions are
processed. Therefore, analytic functions can appear only in the select
list or ORDER BY clause.
Aggregate functions are commonly used with the GROUP BY clause in a
SELECT statement, where Oracle Database divides the rows of a queried
table or view into groups. In a query containing a GROUP BY clause,
the elements of the select list can be aggregate functions, GROUP BY
expressions, constants, or expressions involving one of these. Oracle
applies the aggregate functions to each group of rows and returns a
single result row for each group.
If you omit the GROUP BY clause, then Oracle applies aggregate
functions in the select list to all the rows in the queried table or
view.
So you cannot put analytic functions inside aggregate functions, because aggregate functions are performed before analytic functions (but you can use aggregate functions inside analytic functions).
P.S. What's wrong with subqueries by the way?
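For reference, the two-step shape (analytic function inside, aggregate outside) can be sketched like this. It uses SQLite via Python's sqlite3 module (3.25 or later assumed), not Oracle. SQLite has no VARIANCE aggregate, so the sample variance is spelled out by hand. The dates are rewritten in ISO form so they sort correctly as text, and the column name rundate is taken from the working query in the question.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DATE_VALUE (rundate TEXT, value REAL);
    INSERT INTO DATE_VALUE VALUES
        ('2012-01-01', 1.5), ('2012-01-02', 1.7), ('2012-01-03', 1.3),
        ('2012-01-04', 2.1), ('2012-01-05', 3.4);
""")

# Inner query: LEAD produces one difference per row (NULL for the last row).
# Outer query: ordinary aggregates over those differences; NULL is ignored.
(sql_variance,) = conn.execute("""
    SELECT (SUM(difvalue * difvalue)
            - SUM(difvalue) * SUM(difvalue) / COUNT(difvalue))
           / (COUNT(difvalue) - 1)
    FROM (
        SELECT LEAD(value, 1) OVER (ORDER BY rundate) - value AS difvalue
        FROM DATE_VALUE
    ) AS diffs
""").fetchone()

# Cross-check against the sample variance computed in Python.
values = [1.5, 1.7, 1.3, 2.1, 3.4]
diffs = [later - earlier for earlier, later in zip(values, values[1:])]
py_variance = statistics.variance(diffs)

print(sql_variance, py_variance)
```

Both values come out at roughly 0.5425 for the sample data in the question.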