What is the execution order of the PARTITION BY clause compared to other SQL clauses?

I cannot find any source mentioning execution order for Partition By window functions in SQL.
Is it in the same order as Group By?
For example, with a query like:
Select *, row_number() over (Partition by Name)
from NPtable
Where Name = 'Peter'
I understand that if WHERE is executed first, the query only looks at Name = 'Peter' and then runs the window function over just that person's rows instead of the entire table, which is much more efficient.
But when the query is:
Select top 1 *, row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
Doesn't the window function need to be executed against the entire table first, with the Date > condition applied afterwards? Otherwise the result would be wrong.

Window functions are executed/calculated at the same stage as SELECT, stage 5 in your table. In other words, window functions are applied to all rows that are "visible" in the SELECT stage.
In your second example
Select top 1 *,
row_number() over (Partition by Name order by Date)
from NPtable
Where Date > '2018-01-02 00:00:00'
WHERE is logically applied before Partition by Name of the row_number() function.
Note that this is the logical order of processing the query, not necessarily how the engine physically processes the data.
If the query optimiser decides that it is cheaper to scan the whole table and later discard dates according to the WHERE filter, it can do so. But any such transformation must be performed in such a way that the final result is consistent with the order of the logical steps outlined in the table you showed.
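The logical order described above can be checked on a small data set. The following is only a sketch: it assumes SQLite 3.25 or later (reached through Python's sqlite3 module) rather than SQL Server, the sample rows are made up, and TOP 1 is left out because SQLite has no TOP keyword.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NPtable (Name TEXT, Date TEXT)")
conn.executemany(
    "INSERT INTO NPtable VALUES (?, ?)",
    [
        ("Peter", "2018-01-01 00:00:00"),
        ("Peter", "2018-01-03 00:00:00"),
        ("Peter", "2018-01-05 00:00:00"),
        ("Mary", "2018-01-04 00:00:00"),
    ],
)

# WHERE is logically applied first, so row_number() only sees the rows that
# survive the filter and starts at 1 among them within each partition.
rows = conn.execute(
    """
    SELECT Name, Date,
           row_number() OVER (PARTITION BY Name ORDER BY Date) AS rn
    FROM NPtable
    WHERE Date > '2018-01-02 00:00:00'
    ORDER BY Name, Date
    """
).fetchall()

for row in rows:
    print(row)
```

Peter's row from 2018-01-01 is filtered out before numbering, so his row from 2018-01-03 gets rn = 1 rather than 2.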

It is part of the SELECT phase of the query execution. There are different types of SELECT clauses, based on the query.
SELECT FOR
SELECT GROUP BY
SELECT ORDER BY
SELECT OVER
SELECT INTO
SELECT HAVING
PARTITION BY comes in the SELECT OVER clause. Here, a window of the result set is generated out of the result set generated in the previous stages: FROM, WHERE, GROUP BY etc.
The OVER clause defines a window or user-specified set of rows within
a query result set. A window function then computes a value for each
row in the window. You can use the OVER clause with functions to
compute aggregated values such as moving averages, cumulative
aggregates, running totals, or a top N per group results.
OVER ( [ PARTITION BY value_expression ] [ order_by_clause ] )
Arguments
PARTITION BY Divides the query result set into partitions. The window
function is applied to each partition separately and computation
restarts for each partition.
value_expression Specifies the column by which the rowset is
partitioned. value_expression can only refer to columns made available
by the FROM clause. value_expression cannot refer to expressions or
aliases in the select list. value_expression can be a column
expression, scalar subquery, scalar function, or user-defined
variable.
order_by_clause Defines the logical order of the rows within each
partition of the result set. That is, it specifies the logical order
in which the window function calculation is performed.
order_by_expression Specifies a column or expression on which to sort.
order_by_expression can only refer to columns made available by the
FROM clause. An integer cannot be specified to represent a column name
or alias.
You can read more about it in the SELECT - OVER clause documentation.

row_number() (and other window functions) are allowed in two clauses:
SELECT
ORDER BY
The function is parsed along with the rest of the clause. After all, it is a function present in the clause. In both cases, the WHERE clause would be -- logically -- applied first, so the results would be after filtering.
Do note that this is a logical parsing of the query. The actual execution may have little to do with the structure of the query.

Related

Count()over() have repeated records

I often use sum() over() to calculate a cumulative value, but today I tried count() over() and the result was not what I expected. Can someone explain why the result has repeated records on the same day?
I know the regular way is to count(distinct id) grouped by date and then sum() over (order by date); I am just curious about the result of "count(id) over (order by date)".
Select pre.date,count(person_id) over (order by pre.date)
From (select distinct person_id, date from events) pre
The result will be repeated records for the same day.
Because your outer query has not filtered or aggregated the results from the inner query. It returns the same number of rows.
You want aggregation:
select pre.date, count(*) as cnt_on_date,
sum(count(*)) over (order by pre.date) as running_count
from (select distinct person_id, date from events) pre
group by pre.date;
Almost all analytical functions, except row_number() which comes to mind, do not differentiate ties for the same value of columns in order by clause. In some documentation it is stated directly:
Oracle
If you specify a logical window with the RANGE keyword, then the function returns the same result for each of the rows
Postgresql
By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause.
MySQL
With 'ORDER BY': The default frame includes rows from the partition start through the current row, including all peers of the current row (rows equal to the current row according to the ORDER BY clause).
In general, adding ORDER BY to the analytic clause implicitly sets the window specification to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The window calculation is made for each row in the defined window, so with the default RANGE, rows with the same value of the ORDER BY columns fall into the same window and produce the same result. To get a real running total, either specify ROWS BETWEEN explicitly or add a more detailed column to the ORDER BY part of the analytic clause. Functions that do not support a windowing clause are an exception to this rule, but this is sometimes not documented directly, so I will not try to list them here. Functions that can also be used as aggregates are generally not an exception and produce the same value for tied rows.
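The difference between the default RANGE frame and an explicit ROWS frame can be seen on a tiny example. This is a sketch under assumptions: it runs on SQLite 3.25 or later through Python's sqlite3 module, the events rows are invented, and person_id is added to the ORDER BY of the second query as a tie-breaker so the ROWS result is deterministic.

```python
import sqlite3

# Hypothetical data: two people on the first day, one on the second.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (person_id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2020-01-01"), (2, "2020-01-01"), (3, "2020-01-02")],
)

# Default frame (RANGE ... CURRENT ROW): rows tied on date are peers,
# so both rows of the first day report the same count.
default_frame = conn.execute(
    """
    SELECT date, COUNT(person_id) OVER (ORDER BY date) AS cnt
    FROM events
    ORDER BY date, person_id
    """
).fetchall()

# Explicit ROWS frame with a tie-breaker: the count grows row by row.
rows_frame = conn.execute(
    """
    SELECT date,
           COUNT(person_id) OVER (
               ORDER BY date, person_id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cnt
    FROM events
    ORDER BY date, person_id
    """
).fetchall()

print(default_frame)
print(rows_frame)
```

With the default frame both rows for 2020-01-01 show a count of 2, while the ROWS frame yields 1, 2, 3.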

Window functions in view called with "where" gives bad execution plan

I have a view that uses LAG.
CREATE VIEW V_ImportedReadingDay2
AS
SELECT
ID,
PlacementID,
LAG(Reading, 1) OVER (PARTITION BY MeterNumber ORDER BY Date) AS Val
FROM dbo.ImportedReadingDay
If I call it using "WHERE" it gets an execution plan much worse than if just calling the query.
SELECT
ID,
PlacementID,
LAG(Reading, 1) OVER (PARTITION BY MeterNumber ORDER BY Date) AS Val
FROM dbo.ImportedReadingDay
WHERE (PlacementID = 12404)
SELECT *
FROM V_ImportedReadingDay2
WHERE (PlacementID = 12404)
This is a known problem. You can google the problem.
I have found two solutions. Either use a table valued function or move the LAG outside of the view.
BUT I'd like to know if there are any other solutions since none of these work for me since I have to use the view in a client software.
Your two queries aren't logically the same. So, of course, they don't get the same execution plan.
Consider these queries:
select name,LAG(column_id) OVER (ORDER BY system_type_id) as cid
from sys.columns
where name='name'
select * from (
select name,LAG(column_id) OVER (ORDER BY system_type_id) as cid
from sys.columns
) t
where name='name'
Because of the logical processing order of queries, the WHERE clause is processed before the SELECT clause. So, for the first query, we first filter the sys.columns table to only retrieve rows with a particular name, and then we apply the LAG() function just on this filtered set (so, the lagged value will definitely come from another row which matches the filter).
For the second query, we first (logically) process the subquery. We're performing the LAG() function across the whole set of rows (because the subquery doesn't have any filters/WHERE clause) and then (in the outer query) we're filtering the set of rows. Importantly, that means that the lagged value may have been pulled from a row which doesn't match the final filter.
Well, when you use a view, it's similar to my second query. The value of Val retrieved when you use your view is not guaranteed to be from a row with a PlacementID equal to 12404.
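To make the difference concrete, here is a small sketch of filtering before versus after LAG(). It assumes SQLite 3.25 or later via Python's sqlite3 module instead of SQL Server, drops the dbo schema prefix, and uses made-up readings; the column names follow the view in the question.

```python
import sqlite3

# Hypothetical readings: one meter, with a row from a different placement
# sitting between the two rows for placement 12404.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ImportedReadingDay "
    "(ID INTEGER, PlacementID INTEGER, MeterNumber TEXT, Date TEXT, Reading INTEGER)"
)
conn.executemany(
    "INSERT INTO ImportedReadingDay VALUES (?, ?, ?, ?, ?)",
    [
        (1, 12404, "M1", "2020-01-01", 10),
        (2, 99999, "M1", "2020-01-02", 20),
        (3, 12404, "M1", "2020-01-03", 30),
    ],
)

# Filter first: LAG only sees rows with PlacementID = 12404.
filtered_first = conn.execute(
    """
    SELECT ID, LAG(Reading, 1) OVER (PARTITION BY MeterNumber ORDER BY Date) AS Val
    FROM ImportedReadingDay
    WHERE PlacementID = 12404
    ORDER BY ID
    """
).fetchall()

# Filter afterwards (what the view does): LAG sees every row of the meter.
filtered_after = conn.execute(
    """
    SELECT ID, Val
    FROM (
        SELECT ID, PlacementID,
               LAG(Reading, 1) OVER (PARTITION BY MeterNumber ORDER BY Date) AS Val
        FROM ImportedReadingDay
    ) AS v
    WHERE PlacementID = 12404
    ORDER BY ID
    """
).fetchall()

print(filtered_first)
print(filtered_after)
```

For row 3 the first query returns the previous reading of the same placement (10), while the second returns the reading of the row with another PlacementID (20), so the optimizer cannot simply push the filter below the window function.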
This was a simplified view just for this example.
In the real one I partition the LAG.
I found out that partitioning the LAG by the same column used in the "WHERE" (in this case PlacementID) solved the performance issue.

SQL Group By Column Part Number giving the data from most recent received date

I am new to SQL and my GROUP BY is not working. I want to pull the most recent POReleases.DateReceived date and group by part number. Here is what I have:
SELECT POReleases.PONum, POReleases.PartNo, POReleases.JobNo, POReleases.Qty, POReleases.QtyRejected, POReleases.QtyCanceled, POReleases.DueDate, POReleases.DateReceived, PODet.ProdCode, PODet.Unit, PODet.UnitCost, PODet.QtyOrd, PODet.QtyRec, PODet.QtyReject, PODet.QtyCancel
FROM Waples.dbo.PODet PODet, Waples.dbo.POReleases POReleases
WHERE PODet.PartNo = POReleases.PartNo AND PODet.PONum = POReleases.PONum AND ((POReleases.DateReceived>{ts '2010-01-01 00:00:00'}))
GROUP BY PartNo
For starters, every non-aggregated column in the SELECT list must also appear in the GROUP BY clause. In your case only "PartNo" is used in the GROUP BY clause, whereas many other columns are used in the SELECT statement.
You can try a CTE with ROW_NUMBER() to achieve this, keeping only the most recent row for each part number:
WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY PartNo ORDER BY DateReceived DESC) AS PartNoCount
FROM TABLENAME
)
SELECT * FROM CTE
WHERE PartNoCount = 1
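A runnable sketch of the row-numbering approach follows. It is an illustration under assumptions, not the asker's real schema: it uses SQLite 3.25 or later through Python's sqlite3 module, a cut-down POReleases table with three invented rows, and a hypothetical alias rn for the row number.

```python
import sqlite3

# Hypothetical, simplified version of the POReleases table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE POReleases (PartNo TEXT, PONum INTEGER, DateReceived TEXT)")
conn.executemany(
    "INSERT INTO POReleases VALUES (?, ?, ?)",
    [
        ("A", 1, "2010-02-01"),
        ("A", 2, "2010-05-01"),
        ("B", 3, "2010-03-01"),
    ],
)

# Number the rows of each part from newest to oldest, then keep the newest one.
latest_per_part = conn.execute(
    """
    WITH numbered AS (
        SELECT PartNo, PONum, DateReceived,
               ROW_NUMBER() OVER (PARTITION BY PartNo ORDER BY DateReceived DESC) AS rn
        FROM POReleases
    )
    SELECT PartNo, PONum, DateReceived
    FROM numbered
    WHERE rn = 1
    ORDER BY PartNo
    """
).fetchall()

print(latest_per_part)
```

The filter on rn has to live in the outer query because window functions are evaluated after the WHERE clause of the query in which they appear.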
When you write an SQL statement, you should think about the logical flow, which might be technically slightly inaccurate due to optimizations, but still, it is a good thing to think about it like this:
without the from clause specifying the source relation, the filter cannot be evaluated, so at least logically, the from is the first thing to evaluate
without the where clause specifying which records should be kept from the source relation, the filtered records cannot be grouped, so, at least logically, the where precedes the group by
without the group by, specifying the groups, you cannot select values from the groups, so, at least logically, group by precedes select
So the projection (select) is executed on groups of filtered records. Since the groups have an attribute, namely PartNo, it becomes an aggregated column. The other columns, which were reachable before the group by, can no longer be reached in the select. If you want to reach them, you need to group by them as well or wrap them in aggregate functions, because with a group by you can select only aggregated columns: either aggregate functions or columns that became aggregated due to their presence in the group by.
Since you did not specify how this query is not working, I will have to assume that you get a syntax error in the selection, because you refer to columns which are not aggregated. Also, you might want to use an explicit join instead of a Cartesian product, and finally, if you want to filter the groups rather than the records of the initial relation (which is the result of a Cartesian product in your case), then you might consider using a having clause.

Does ROW_NUMBER queried data require to fetch the entire result set?

I'm trying to write a query that does not use offset (because, as I have just learnt, offset fetches all the data, which causes performance overhead) and uses the ROW_NUMBER window function instead. For instance:
SELECT id
FROM(
SELECT id, ROW_NUMBER() over (order by id) rn
FROM users) sq
WHERE rn > 1000
Does it require all rows to be fetched, as it would with offset 1000? I mean, does it make sense to use such a query instead of
SELECT id
FROM users
OFFSET 1000
? Do I get a performance improvement on a large amount of data?
Check out the window function docs. Window functions operate on the result set, after the fetch:
Window functions are permitted only in the SELECT list and the ORDER
BY clause of the query. They are forbidden elsewhere, such as in GROUP
BY, HAVING and WHERE clauses. This is because they logically execute
after the processing of those clauses. Also, window functions execute
after regular aggregate functions. This means it is valid to include
an aggregate function call in the arguments of a window function, but
not vice versa.
Does it make sense to use the row_number() query? Well, it produces the same result set. However, the query basically has to assign row_number() to all the rows in order to find the ones that meet the requirement.
The second query, however, is lacking an order by. When using offset, you should have an order by:
SELECT id
FROM users u
ORDER BY id
OFFSET 1000
I would imagine that this is more efficient than using row_number(), but actual timings would demonstrate that.
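That the two forms return the same rows can be checked with a short sketch. It assumes SQLite 3.25 or later via Python's sqlite3 module and a made-up users table with 1005 even ids; SQLite needs a LIMIT before OFFSET, so LIMIT -1 (no limit) is used here, which is a SQLite-specific spelling. The sketch says nothing about relative speed, which depends on the engine and the indexes.

```python
import sqlite3

# Hypothetical users table with ids 2, 4, ..., 2010 (1005 rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users VALUES (?)", [(2 * i,) for i in range(1, 1006)])

# Variant 1: number every row, then keep those after the first 1000.
via_row_number = conn.execute(
    """
    SELECT id
    FROM (
        SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM users
    ) AS sq
    WHERE rn > 1000
    ORDER BY id
    """
).fetchall()

# Variant 2: ORDER BY with OFFSET (LIMIT -1 means "no limit" in SQLite).
via_offset = conn.execute(
    "SELECT id FROM users ORDER BY id LIMIT -1 OFFSET 1000"
).fetchall()

print(via_row_number == via_offset, via_offset)
```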

Aggregate window function output in postgresql (redshift)

I really want to use the median window function as an aggregate function.
I currently am forced to use the window function in a sub-select, and then aggregate over it like this:
SELECT id, MIN(avg) AS mean, MIN(median) AS median, COUNT(*)
FROM (
SELECT id, AVG(metric) OVER(PARTITION BY id), MEDIAN(metric) OVER(PARTITION BY id)
FROM data_table
)
GROUP BY id;
Is there a way to aggregate over a window function result so there's only one SELECT statement?
Strictly speaking, your example query could be rewritten:
SELECT id,
AVG(metric),
MEDIAN(metric),
COUNT(*)
FROM data_table
GROUP BY id;
But I'm wondering if you just picked a poor example that happens to be mathematically capable of simplification. This is a special case because the subquery and the main query are aggregating on the same field, and the outer aggregates are picking a minimum from what would be a set of identical values.
If that's not the case and your actual query and subquery are not grouping by the same field, then the answer is no, you need a subquery for two reasons:
First, by ANSI definition, window functions are evaluated after the WHERE, GROUP BY, and HAVING clauses. There is no clause to specify your desired behavior of aggregating after a window function, so you must use a subquery or CTE.
Second, even if you eliminated the windowing from the OVER() clause you need to GROUP BY data you only know after the first round of aggregation has been completed.
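The special case discussed above, where the window partition and the outer GROUP BY use the same column, can be sketched as follows. Assumptions: SQLite 3.25 or later through Python's sqlite3 module rather than Redshift, invented rows, a hypothetical alias avg_metric, and only AVG is shown because SQLite has no built-in MEDIAN.

```python
import sqlite3

# Hypothetical sample data for data_table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_table (id TEXT, metric REAL)")
conn.executemany(
    "INSERT INTO data_table VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# Window function in a subquery, then aggregated in the outer query.
via_subquery = conn.execute(
    """
    SELECT id, MIN(avg_metric) AS mean, COUNT(*) AS n
    FROM (
        SELECT id, AVG(metric) OVER (PARTITION BY id) AS avg_metric
        FROM data_table
    ) AS w
    GROUP BY id
    ORDER BY id
    """
).fetchall()

# Plain aggregate: equivalent here because both levels group on id.
via_group_by = conn.execute(
    """
    SELECT id, AVG(metric) AS mean, COUNT(*) AS n
    FROM data_table
    GROUP BY id
    ORDER BY id
    """
).fetchall()

print(via_subquery)
print(via_group_by)
```

When the window partition and the outer grouping differ, the simplification no longer applies and the subquery (or a CTE) is still needed.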