Oracle - Use analytic function inside aggregate function - sql

I have a table DATE_VALUE like this:
Date Value
---- -----
01/01/2012 1.5
02/01/2012 1.7
03/01/2012 1.3
04/01/2012 2.1
05/01/2012 3.4
I want to calculate the variance of the differences in value between consecutive dates.
However, this simple query does not work:
select variance(lead( value,1) OVER (order by date) - value)
from DATE_VALUE
I got an error:
ORA-30483: window functions are not allowed here
30483. 00000 - "window functions are not allowed here"
*Cause: Window functions are allowed only in the SELECT list of a query.
And, window function cannot be an argument to another window or group
function.
The query works fine if I move the variance function out of the query:
select variance(difvalue) from (
select lead( value,1) OVER (order by rundate) - value as difvalue
from DATE_VALUE
);
I wonder if there is any way to modify the query such that there is no sub-query used?

From Oracle reference:
Analytic functions are the last set of operations performed in a query
except for the final ORDER BY clause. All joins and all WHERE, GROUP
BY, and HAVING clauses are completed before the analytic functions are
processed. Therefore, analytic functions can appear only in the select
list or ORDER BY clause.
Aggregate functions are commonly used with the GROUP BY clause in a
SELECT statement, where Oracle Database divides the rows of a queried
table or view into groups. In a query containing a GROUP BY clause,
the elements of the select list can be aggregate functions, GROUP BY
expressions, constants, or expressions involving one of these. Oracle
applies the aggregate functions to each group of rows and returns a
single result row for each group.
If you omit the GROUP BY clause, then Oracle applies aggregate
functions in the select list to all the rows in the queried table or
view.
So you cannot put analytic functions inside aggregate functions, because aggregate functions are evaluated before analytic functions (but you can use aggregate functions inside analytic functions).
P.S. What's wrong with subqueries by the way?
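To illustrate the two-step evaluation, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Oracle. SQLite has no VARIANCE function, so the sample variance is spelled out with the sum-of-squares formula; the CTE name diffs and the column name rundate (taken from the working query in the question) are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE date_value (rundate TEXT, value REAL)")
conn.executemany(
    "INSERT INTO date_value VALUES (?, ?)",
    # Date strings copied from the question; they happen to sort correctly as text.
    [("01/01/2012", 1.5), ("02/01/2012", 1.7), ("03/01/2012", 1.3),
     ("04/01/2012", 2.1), ("05/01/2012", 3.4)],
)

# Step 1 (the CTE): the window function produces one difference per row.
# Step 2 (the outer SELECT): plain aggregates consume those differences.
# The last row has no successor, so its difference is NULL and is ignored.
n, variance = conn.execute("""
    WITH diffs AS (
        SELECT LEAD(value, 1) OVER (ORDER BY rundate) - value AS difvalue
        FROM date_value
    )
    SELECT COUNT(difvalue),
           (SUM(difvalue * difvalue)
              - SUM(difvalue) * SUM(difvalue) / COUNT(difvalue))
           / (COUNT(difvalue) - 1)
    FROM diffs
""").fetchone()

print(n, variance)
```

The CTE is just the subquery from the question under another spelling: the window function still has to finish in one query level before an aggregate can read its output.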

Related

Can LAG be used with HAVING?

I distinctly recall that T-SQL will never let you mix LAG and WHERE. For example,
SELECT FOO
WHERE LAG(BAR) OVER (ORDER BY DATE) > 7
will never work. T-SQL will not run it no matter what you do. But does T-SQL ever let you mix LAG with HAVING?
Note: All that an answer needs to do is either give a theory-based or documentation-based reason why it does not, or give any example at all of where it does.
From Logical Processing Order of the SELECT statement:
The following steps show the logical processing order, or binding
order, for a SELECT statement......
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
Window functions are evaluated at the level of SELECT, which comes after HAVING, so the answer is no: you can't use window functions in the HAVING clause.
The HAVING clause filters groups produced by GROUP BY (or the whole result treated as a single implicit group when there is no GROUP BY), so its predicates can only reference grouping columns and aggregate functions such as MIN, MAX, SUM, and COUNT. LAG is a window function, not an aggregate, so it cannot be combined with HAVING directly.
To filter on a LAG result, compute it in a CTE or subquery and filter in the outer query.
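The CTE workaround can be sketched as follows. This uses Python's sqlite3 module rather than SQL Server, so the exact error differs from what T-SQL reports, but SQLite follows the same logical ordering and also rejects a window function in WHERE. The table readings and the column prev_bar are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day INTEGER, bar INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(1, 5), (2, 9), (3, 8), (4, 3), (5, 10)])

# A window function directly in WHERE is rejected when the statement is prepared.
try:
    conn.execute(
        "SELECT day FROM readings WHERE LAG(bar) OVER (ORDER BY day) > 7")
    where_rejected = False
except sqlite3.OperationalError:
    where_rejected = True

# Computing LAG in a CTE first turns it into an ordinary column,
# which the outer query is free to filter on.
rows = conn.execute("""
    WITH lagged AS (
        SELECT day, bar, LAG(bar) OVER (ORDER BY day) AS prev_bar
        FROM readings
    )
    SELECT day FROM lagged WHERE prev_bar > 7 ORDER BY day
""").fetchall()

print(where_rejected, rows)
```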

Still confused by the rules around selecting columns, group by, and joins

I am still confused by the syntax rules of using GROUP BY. I understand we use GROUP BY when there is some aggregate function. If I have even one aggregate function in a SQL statement, do I need to put all of my selected columns into my GROUP BY statement? I don't have a specific query to ask about but when I try to do joins, I get errors. In particular, when I use a count(*) in a statement and/or a join, I just seem to mess it up.
I use BigQuery at my job. I am regularly floored by strange gaps in knowledge.
Thank you!
This is a little complicated.
First, no aggregation functions are needed in an aggregation query. So this is allowed:
select a
from t
group by a;
This is equivalent, by the way, to:
select distinct a
from t;
Conversely, a query that uses aggregation functions does not need a group by. So, this is allowed:
select max(a)
from t;
Such an aggregation query -- with no group by -- always returns exactly one row. This is true even if the table is empty or a where clause filters out all the rows. In that case, most aggregation functions return NULL, with the notable exception of count(), which returns 0.
Next, if you mix aggregation functions and non-aggregation expressions in the select, then in general you want the non-aggregation, non-constant expressions in the group by. I should note that you can do:
select a, concat(a, 'bcd'), count(*)
from t
group by a;
This should work, but sometimes BigQuery gets confused and will want the expression in the group by.
Finally, the SQL standard supports a query like this:
select t.*, count(*)
from t join
u
using (foo)
group by t.a;
This is valid when a is the primary key (or equivalent) in t. However, BigQuery does not have primary keys, so this is not relevant to that database.
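The first three rules above are easy to check on a toy table. This is a small sketch using Python's sqlite3 module rather than BigQuery (so concat is written with SQLite's || operator); the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("x",), ("y",), ("x",)])

# GROUP BY with no aggregation function behaves like SELECT DISTINCT.
grouped = conn.execute("SELECT a FROM t GROUP BY a ORDER BY a").fetchall()
distinct = conn.execute("SELECT DISTINCT a FROM t ORDER BY a").fetchall()

# Aggregation with no GROUP BY returns one row even when nothing matches:
# MAX gives NULL (None in Python), COUNT gives 0.
no_match = conn.execute(
    "SELECT MAX(a), COUNT(*) FROM t WHERE a = 'no such value'").fetchall()

# Mixing: the non-aggregated column goes in GROUP BY; an expression that
# depends only on that column is accepted here as well.
mixed = conn.execute(
    "SELECT a, a || 'bcd', COUNT(*) FROM t GROUP BY a ORDER BY a").fetchall()

print(grouped, distinct, no_match, mixed)
```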

Count()over() have repeated records

I often use sum() over() to calculate cumulative values, but today I tried count() over() and the result was not what I expected. Can someone explain why the result has repeated records for the same day?
I know the regular way is to count(distinct id) grouped by date and then sum() over (order by date); I am just curious about the result of "count(id) over (order by date)".
Select pre.date,count(person_id) over (order by pre.date)
From (select distinct person_id, date from events) pre
The result will be repeated records for the same day.
This happens because your outer query neither filters nor aggregates the rows from the inner query, so it returns the same number of rows.
You want aggregation:
select pre.date, count(*) as cnt_on_date,
sum(count(*)) over (order by pre.date) as running_count
from (select distinct person_id, date from events) pre
group by pre.date;
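As a quick check of that query shape, here is a sketch using Python's sqlite3 module with made-up data. The column is renamed to event_date only to avoid any clash with the date keyword in other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (person_id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-01"), (1, "2024-01-01"),  # person 1 twice
     (3, "2024-01-02"), (4, "2024-01-03"), (5, "2024-01-03")],
)

# GROUP BY collapses each date to one row first; the window function then
# runs over those grouped rows, summing the per-date counts.
running = conn.execute("""
    SELECT pre.event_date,
           COUNT(*) AS cnt_on_date,
           SUM(COUNT(*)) OVER (ORDER BY pre.event_date) AS running_count
    FROM (SELECT DISTINCT person_id, event_date FROM events) AS pre
    GROUP BY pre.event_date
    ORDER BY pre.event_date
""").fetchall()

print(running)
```

Each date now appears once, with its own count and the cumulative count side by side.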
Almost all analytic functions (row_number() is the exception that comes to mind) do not differentiate between ties, that is, rows with the same values in the ORDER BY columns. Some documentation states this directly:
Oracle
If you specify a logical window with the RANGE keyword, then the function returns the same result for each of the rows
Postgresql
By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause.
MySQL
With 'ORDER BY': The default frame includes rows from the partition start through the current row, including all peers of the current row (rows equal to the current row according to the ORDER BY clause).
In general, adding ORDER BY to the analytic clause implicitly sets the window specification to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The window is computed for each row, and with the default RANGE frame all rows that share the same ORDER BY value fall into the same window and therefore produce the same result. So to get a true running total, specify ROWS BETWEEN explicitly or add a more detailed column to the ORDER BY of the analytic clause. Functions that do not support a windowing clause are exceptions to this rule, but this is not always documented directly, so I will not try to list them here. Functions that can also be used as aggregates are generally not exceptions and produce the same value for tied rows.
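The difference between the default RANGE frame and an explicit ROWS frame can be seen side by side. This is a sketch with Python's sqlite3 module and made-up rows, not any of the three databases quoted above, although SQLite uses the same default frame.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pre (person_id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO pre VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-01"),
     (3, "2024-01-02"),
     (4, "2024-01-03"), (5, "2024-01-03")],
)

# Default frame (RANGE ... CURRENT ROW): rows tied on event_date are peers,
# so they all see the same window and report the same count.
range_rows = conn.execute("""
    SELECT event_date, COUNT(person_id) OVER (ORDER BY event_date)
    FROM pre ORDER BY 1, 2
""").fetchall()

# Explicit ROWS frame: each physical row gets its own window. Which tied row
# comes first is arbitrary, but the counts increase by one per row.
rows_rows = conn.execute("""
    SELECT event_date,
           COUNT(person_id) OVER (
               ORDER BY event_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM pre ORDER BY 1, 2
""").fetchall()

print(range_rows)
print(rows_rows)
```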

What is the execution order of the PARTITION BY clause compared to other SQL clauses?

I cannot find any source mentioning execution order for Partition By window functions in SQL.
Is it in the same order as Group By?
For example, with a query like:
Select *, row_number() over (Partition by Name)
from NPtable
Where Name = 'Peter'
I understand that if Where is executed first, the query only looks at Name = 'Peter' and then runs the window function over just that person's rows instead of the entire table, which is much more efficient.
But when the query is:
Select top 1 *, row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
Doesn't the window function need to be executed against the entire table first, with the Date > condition applied afterwards? Otherwise, isn't the result wrong?
Window functions are executed/calculated at the same stage as SELECT, stage 5 in your table. In other words, window functions are applied to all rows that are "visible" in the SELECT stage.
In your second example
Select top 1 *,
row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
WHERE is logically applied before Partition by Name of the row_number() function.
Note that this is the logical order of processing the query, not necessarily how the engine physically processes the data.
If the query optimiser decides that it is cheaper to scan the whole table and discard dates later according to the WHERE filter, it can do so. But any such transformation must be performed in a way that keeps the final result consistent with the order of the logical steps outlined in the table you showed.
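A small experiment makes the logical order visible. This sketch uses Python's sqlite3 module with made-up rows and leaves out the T-SQL-specific TOP 1; the point is only that row numbers are assigned after the WHERE filter has removed rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NPtable (Name TEXT, Date TEXT)")
conn.executemany(
    "INSERT INTO NPtable VALUES (?, ?)",
    [("Peter", "2018-01-01 00:00:00"),   # removed by WHERE
     ("Peter", "2018-01-03 00:00:00"),
     ("Peter", "2018-01-05 00:00:00"),
     ("Mary",  "2018-01-02 12:00:00")],
)

rows = conn.execute("""
    SELECT Name, Date,
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Date) AS rn
    FROM NPtable
    WHERE Date > '2018-01-02 00:00:00'
    ORDER BY Name, Date
""").fetchall()

for row in rows:
    print(row)
```

Peter's 2018-01-03 row gets rn = 1, not 2, because his 2018-01-01 row was already filtered out before the window function ran. If you need numbering over the whole table, compute it in a subquery or CTE and filter outside.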
It is part of the SELECT phase of the query execution. There are different types of SELECT clauses, based on the query.
SELECT FOR
SELECT GROUP BY
SELECT ORDER BY
SELECT OVER
SELECT INTO
SELECT HAVING
PARTITION BY comes in the SELECT OVER clause. Here, a window of the result set is generated out of the result set generated in the previous stages: FROM, WHERE, GROUP BY etc.
The OVER clause defines a window or user-specified set of rows within
a query result set. A window function then computes a value for each
row in the window. You can use the OVER clause with functions to
compute aggregated values such as moving averages, cumulative
aggregates, running totals, or a top N per group results.
OVER ( [ PARTITION BY value_expression ] [ order_by_clause ] )
Arguments
PARTITION BY Divides the query result set into partitions. The window
function is applied to each partition separately and computation
restarts for each partition.
value_expression Specifies the column by which the rowset is
partitioned. value_expression can only refer to columns made available
by the FROM clause. value_expression cannot refer to expressions or
aliases in the select list. value_expression can be a column
expression, scalar subquery, scalar function, or user-defined
variable.
order_by_clause Defines the logical order of the rows within each
partition of the result set. That is, it specifies the logical order
in which the window function calculation is performed.
order_by_expression Specifies a column or expression on which to sort.
order_by_expression can only refer to columns made available by the
FROM clause. An integer cannot be specified to represent a column name
or alias.
You can read more about it in the SELECT - OVER documentation.
row_number() (and other window functions) are allowed in two clauses:
SELECT
ORDER BY
The function is parsed along with the rest of the clause. After all, it is a function present in the clause. In both cases, the WHERE clause would be -- logically -- applied first, so the results would be after filtering.
Do note that this is a logical parsing of the query. The actual execution may have little to do with the structure of the query.

Aggregate window function output in postgresql (redshift)

I really want to use the median window function as an aggregate function.
I currently am forced to use the window function in a sub-select, and then aggregate over it like this:
SELECT id, MIN(avg) AS mean, MIN(median) AS median, COUNT(*)
FROM (
SELECT id, AVG(metric) OVER(PARTITION BY id), MEDIAN(metric) OVER(PARTITION BY id)
FROM data_table
)
GROUP BY id;
Is there a way to aggregate over a window function result so there's only one SELECT statement?
Strictly speaking, your example query could be rewritten:
SELECT id,
AVG(metric),
MEDIAN(metric),
COUNT(*)
FROM data_table
GROUP BY id;
But I'm wondering if you just picked a poor example that happens to be mathematically capable of simplification. This is a special case because the subquery and the main query are aggregating on the same field, and the outer aggregates are picking a minimum from what would be a set of identical values.
If that's not the case and your actual query and subquery are not grouping by the same field, then the answer is no, you need a subquery for two reasons:
First, by ANSI definition, window functions are evaluated after the WHERE, GROUP BY, and HAVING clauses. There is no clause to specify your desired behavior of aggregating after a window function, so you must use a subquery or CTE.
Second, even if you eliminated the windowing from the OVER() clause you need to GROUP BY data you only know after the first round of aggregation has been completed.
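The simplification in the first rewrite above can be checked on toy data. This is only a sketch with Python's sqlite3 module: SQLite has no MEDIAN function, so only AVG and COUNT are compared, and the alias avg_metric and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_table (id INTEGER, metric REAL)")
conn.executemany(
    "INSERT INTO data_table VALUES (?, ?)",
    [(1, 1.0), (1, 3.0), (2, 10.0), (2, 20.0), (2, 30.0)],
)

# Window function in a subquery, then MIN over the identical per-id values.
windowed = conn.execute("""
    SELECT id, MIN(avg_metric), COUNT(*)
    FROM (
        SELECT id, AVG(metric) OVER (PARTITION BY id) AS avg_metric
        FROM data_table
    ) AS sub
    GROUP BY id
    ORDER BY id
""").fetchall()

# Plain aggregation on the same grouping column.
direct = conn.execute("""
    SELECT id, AVG(metric), COUNT(*)
    FROM data_table
    GROUP BY id
    ORDER BY id
""").fetchall()

print(windowed)
print(direct)
```

Both queries return the same rows here only because the window partition and the outer GROUP BY use the same column; with different columns the subquery is still required.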