How does the window SUM function work with OVER internally? - sql

I'm trying to understand how window functions work internally.
ID,Amt
A,1
B,2
C,3
D,4
E,5
If I run this, it gives the sum of all amounts in the total column against every record:
Select ID, SUM (AMT) OVER () total from table
But when I run this, it gives me a cumulative sum:
Select ID, SUM (AMT) OVER (order by ID) total from table
I'm trying to understand what happens with OVER() versus OVER(order by ID).
What I've understood is that when no partition is defined in OVER, everything is treated as a single partition. What I can't understand is why adding order by ID inside OVER() makes it compute a cumulative sum.
Can anyone share what's happening behind the scenes here?

That is an interesting case. Based on the documentation, here is the explanation with an example.
If PARTITION BY is not specified, the function treats all rows of the
query result set as a single partition. Function will be applied on
all rows in the partition if you don't specify ORDER BY clause.
So if you specify ORDER BY, then:
If it is specified, and a ROWS/RANGE is not specified, then default
RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used as default for
window frame by the functions that can accept optional ROWS/RANGE
specification (for example min or max).
So technically these two commands are the same:
SELECT ID, SUM(AMT) OVER (ORDER BY ID) total FROM table
SELECT ID, SUM(AMT) OVER (ORDER BY ID RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) total FROM table
You can read more in the documentation: https://learn.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-ver15
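To see the three behaviours side by side, here is a minimal sketch that runs the question's data through SQLite from Python. The table name amounts and the column aliases are made up for illustration, and it assumes the bundled SQLite is 3.25 or newer (the first release with window functions). SQLite is used only because it is easy to run; the answer above is about SQL Server.

```python
import sqlite3

# Hypothetical table "amounts" holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amounts (id TEXT, amt INTEGER)")
conn.executemany(
    "INSERT INTO amounts VALUES (?, ?)",
    [("A", 1), ("B", 2), ("C", 3), ("D", 4), ("E", 5)],
)

query = """
SELECT id,
       SUM(amt) OVER ()            AS grand_total,      -- no ORDER BY: whole partition
       SUM(amt) OVER (ORDER BY id) AS running_default,  -- ORDER BY: default frame applies
       SUM(amt) OVER (ORDER BY id
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
                                   AS running_explicit  -- the default frame spelled out
FROM amounts
ORDER BY id
"""
totals = conn.execute(query).fetchall()
for row in totals:
    print(row)
```

The second and third columns come out identical on every row, which is the point of the answer: the ORDER BY only changes the result because it switches on the default frame.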

This is not related to Oracle itself, but it's part of the SQL Standard and behaves the same way in many databases including Oracle, DB2, PostgreSQL, SQL Server, MySQL, MariaDB, H2, etc.
By definition, when you include the ORDER BY clause the engine will produce "running values" (cumulative aggregation) inside each partition; without the ORDER BY clause it produces the same, single value that aggregates the whole partition.
Now, the partition itself is mainly defined by the PARTITION BY clause. In its absence, the whole result set is considered as a single partition.
Finally, as a more advanced topic, the set of rows within the partition that each calculation sees can be further tweaked using a "frame" clause (ROWS and RANGE) and by a "frame exclusion" clause (EXCLUDE).
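The difference between RANGE and ROWS frames only shows up when the ORDER BY column has ties. The sketch below uses a hypothetical payments table with two rows sharing the same grp value, run through SQLite from Python (assuming SQLite 3.25 or newer). With RANGE, tied rows are peers and are summed together; with ROWS, each physical row is added one at a time.

```python
import sqlite3

# Hypothetical table: two rows tie on grp = 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (grp INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", [(1, 10), (1, 10), (2, 30)])

query = """
SELECT grp, amt,
       SUM(amt) OVER (ORDER BY grp
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS by_range,
       SUM(amt) OVER (ORDER BY grp
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  AS by_rows
FROM payments
"""
frame_rows = conn.execute(query).fetchall()

# Sort the totals so the comparison does not depend on how ties are ordered.
range_totals = sorted(r[2] for r in frame_rows)
rows_totals = sorted(r[3] for r in frame_rows)
print(range_totals)  # [20, 20, 50]
print(rows_totals)   # [10, 20, 50]
```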

Related

How to deal with ports that are neither grouped by nor aggregated in Informatica PowerCenter

We are working on converting Informatica mappings to Google BigQuery SQL. In one of the mappings there are a couple of ports/columns, say A and B, which are not grouped by in the Aggregator transformation and have no aggregation function such as sum or avg applied to them.
According to senior devs in my org, Informatica returns the last values of these ports/columns after the aggregator. My question is: how do we reproduce this behaviour in BigQuery SQL? We cannot select columns that are not present in the GROUP BY clause, and we don't want to group by these columns.
For getting the last value of a column, BigQuery has the LAST_VALUE() analytic function, but even then we cannot use GROUP BY and an analytic function in the same select statement.
I would really appreciate some help!
Use some aggregation function.
In Informatica you will get the LAST value. This is not deterministic. It basically means that either
you have the same value across the whole column,
you don't care which one you get, or
you have a specific order, from which the last value is taken.
The first two cases mean you can use MIN, MAX, or any other aggregate. The result will be the same, or you don't care.
If the last one is your case, ARRAY_AGG should help you, as per this answer.
To convert an Informatica mapping with an aggregator to BigQuery SQL, I would use row_number() over (partition by id order by id) as rn and then filter on rn = 1 in the outer query.
Informatica aggregator: id is the group-by column.
The equivalent SQL should look like this:
select a, b, id
from
(select a, b, id,
        -- This mimics the Informatica aggregator; id is the group-by port. If a sorter precedes the aggregator, list its ports in the order by, in the same sequence but with the direction reversed (asc/desc).
        row_number() over (partition by id order by id desc) as rn
from mytable) rs
where rs.rn = 1 -- this picks the latest row per id
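Here is a runnable sketch of the same row_number() pattern. BigQuery is not available locally, so it runs against SQLite from Python (assuming SQLite 3.25 or newer). The seq column is an assumption: it stands in for whatever sorter ports define "last" in the real mapping, since ordering by id alone inside a partition on id leaves the choice of row arbitrary.

```python
import sqlite3

# Hypothetical data: "seq" stands in for the sorter ports that define row order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, seq INTEGER, a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, 1, "a1", "b1"),
        (1, 2, "a2", "b2"),
        (2, 1, "x1", "y1"),
        (2, 2, "x2", "y2"),
        (2, 3, "x3", "y3"),
    ],
)

query = """
SELECT id, a, b
FROM (
    SELECT id, a, b,
           -- Number rows within each id, newest (highest seq) first.
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq DESC) AS rn
    FROM mytable
) AS rs
WHERE rs.rn = 1   -- keep only the last row of each group
ORDER BY id
"""
last_rows = conn.execute(query).fetchall()
print(last_rows)  # [(1, 'a2', 'b2'), (2, 'x3', 'y3')]
```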

Evaluating multiple window functions

If a window is provided multiple times in the same query, how is it evaluated? Does the query parser check whether one window is the same as another, or can easily be 'derived' from another? For example, in the following:
SELECT
MAX(val) OVER (PARTITION BY product_id ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) one,
MAX(val) OVER (PARTITION BY product_id ORDER BY date ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) two,
MAX(val) OVER (PARTITION BY product_id ORDER BY date ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) three
FROM
table
How do database engines 'optimize' this query, if they do at all? Does it involve calculating a single window and altering it for the other calculations, or does this create three distinct windows? Where might I be able to find more information on how/when the window functions are evaluated (any backend is fine: oracle, mysql, sqlserver, postgres)?
This depends on the database. That said, it is the partition by and order by that incur the overhead of processing the data. There is a good chance that the database will not need to redo that work just because the window frame specification ("rows between") differs slightly.
Of course, different partition by and order by conditions would mean that the data could not be reused and would need to be reprocessed.
So, given your specification with its slight differences, there is an opportunity for a good optimizer to reuse intermediate results. However, it is easy to modify the clauses so that they cannot be reused.
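One way to make the shared part explicit is a named window: the partition by and order by are written once and only the frame varies. The sketch below does this in SQLite from Python, with a hypothetical sales table and made-up values. It assumes SQLite 3.28 or newer, which accepts a base window name followed by a frame. Whether an engine actually reuses the sort is still up to its optimizer; the named window only shows that the three definitions differ in nothing but the frame.

```python
import sqlite3

# Hypothetical table and values for a single product.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, date TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 5),
        (1, "2024-01-02", 3),
        (1, "2024-01-03", 1),
        (1, "2024-01-04", 2),
        (1, "2024-01-05", 4),
    ],
)

query = """
SELECT date,
       MAX(val) OVER (w ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS one,
       MAX(val) OVER (w ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS two,
       MAX(val) OVER (w ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS three
FROM sales
WINDOW w AS (PARTITION BY product_id ORDER BY date)  -- shared partitioning and ordering
ORDER BY date
"""
moving_max = conn.execute(query).fetchall()
for row in moving_max:
    print(row)
```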

Does adding ROW_NUMBER to query get sorted automatically?

It seems if I add ROW_NUMBER in a simple select query, the results are sorted automatically by the ROW_NUMBER column even without order by added to the select query at the end. I tried this on
without ROW_NUMBER - results are in random order
with ROW_NUMBER over (order by some_col) - results automatically ordered by this ROW_NUMBER column
with ROW_NUMBER over (order by some_col desc) - results again automatically order by this ROW_NUMBER column reflecting the new direction
Why is it behaving like this? Is there an implicit order by when using ROW_NUMBER?
If this is vendor specific, I was testing on MSSQL2014
The ORDER BY clause in the window function only controls the order in which rows are considered by the window function. It does not guarantee the order of the final result set.
Over clause
Note: The ORDER BY clause in the OVER clause only controls the order that the rows in the partition will be utilized by the window
function. It does not control the order of the final result set.
Without an ORDER BY clause on the query itself, the order of the rows
is not guaranteed. You may notice that your query may be returning in
the order of the last specified OVER clause – this is due to the way
that this is currently implemented in SQL Server. If the SQL Server
team at Microsoft changes the way that it works, it may no longer
order your results in the manner that you are currently observing. If
you need a specific order for the result set, you must provide an
ORDER BY clause against the query itself.
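A small sketch of the two orderings being independent, run in SQLite from Python (assuming SQLite 3.25 or newer; the items table is made up). The window numbers the rows in descending order of some_col, while the query-level ORDER BY returns them ascending, so the row numbers come back reversed.

```python
import sqlite3

# Hypothetical table with three values inserted out of order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (some_col INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(30,), (10,), (20,)])

query = """
SELECT some_col,
       ROW_NUMBER() OVER (ORDER BY some_col DESC) AS rn  -- numbering order
FROM items
ORDER BY some_col ASC                                    -- output order
"""
numbered = conn.execute(query).fetchall()
print(numbered)  # [(10, 3), (20, 2), (30, 1)]
```

Only the outer ORDER BY is a guarantee about the output; drop it and the engine is free to return the rows in any order.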

IIF or case when in window function

I have some timestamps in a table and I want to create a dummy variable (0 or 1) that tests if the row above is equal to the current row, after the timestamps are sorted. I need to do this in different partitions. Is there a window function that can do this in SQL Server?
So I know my partition by column and order by column. From my knowledge of window functions I need to perhaps use a rank function, but is there a way to write this with nested functions using IIF and LEAD or LAG to check for some condition between the rows in a partition?
SQL tables represent unordered sets. If you have an ordering column separate from the timestamps, you can use:
select t.*,
(case when lag(timestamp) over (partition by <partition col> order by <order col>) = timestamp
then 1 else 0
end) as flag
from t;
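Here is a runnable version of that query on made-up data, using SQLite from Python (assuming SQLite 3.25 or newer). The events table and the grp, seq and ts columns are hypothetical stand-ins for the partition column, the ordering column and the timestamp. Note that the first row of each partition gets 0, because LAG returns NULL there and the comparison is not true.

```python
import sqlite3

# Hypothetical table: grp = partition column, seq = ordering column, ts = timestamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (grp TEXT, seq INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("x", 1, "10:00"),
        ("x", 2, "10:00"),
        ("x", 3, "10:05"),
        ("y", 1, "10:05"),
        ("y", 2, "10:05"),
    ],
)

query = """
SELECT grp, seq,
       CASE WHEN LAG(ts) OVER (PARTITION BY grp ORDER BY seq) = ts
            THEN 1 ELSE 0
       END AS flag   -- 1 when the previous row in the partition has the same ts
FROM events
ORDER BY grp, seq
"""
flags = conn.execute(query).fetchall()
print(flags)
# [('x', 1, 0), ('x', 2, 1), ('x', 3, 0), ('y', 1, 0), ('y', 2, 1)]
```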
SQL is a set language, so there is no "above" or "below", although something like that was added, with limited portability, by popular demand. You can also write a procedure that processes the ordered rows while remembering prior values, priming that memory with something for the first row. That still gives no access to the next row. Alternatively, hold the current and prior rows until the next row arrives and output each row one row late, but then the last row needs special handling. Even in a spreadsheet there is a top and a bottom row.

I assume Row_Number doesn’t act only on rows of the window frame

a) Quote is taken from http://www.postgresql.org/docs/current/static/tutorial-window.html
for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause
I assume Row_Number doesn’t act only on rows of the window frame, but instead always act on all rows of a partition?
b)
By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause
I assume that is only true for those window functions that act only on rows of the window frame ( thus above quote isn't true for ROW_NUMBER() function )?
c) http://www.postgresql.org/docs/current/static/tutorial-window.html talks about PostgreSQL 8.4's windowing functions. Is everything in that article also true for SQL Server 2008's windowing functions?
Thanks
The ORDER BY clause in aggregate window functions is not supported by SQL Server 2008; support was added in SQL Server 2012.
http://msdn.microsoft.com/en-us/library/ms189461.aspx
On SQL Server 2008, the query below gives a syntax error:
SELECT salary, sum(salary) OVER (ORDER BY salary) FROM empsalary
row_number() is not an aggregate window function, so it does not act only on the rows of the window frame. It acts on the whole partition. Aggregate functions used as window functions work on the window frame (as do first_value, last_value and nth_value in PostgreSQL), and the frame only narrows below the whole partition where the ORDER BY clause is supported.
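The difference is easy to see by giving both functions the same, deliberately tiny frame. The sketch below uses SQLite from Python (assuming SQLite 3.25 or newer, where row_number() accepts a frame clause but ignores it) rather than SQL Server 2008, and the salary values are made up. The frame is limited to the current row only: SUM respects it and returns just that row's salary, while row_number() keeps counting across the whole partition.

```python
import sqlite3

# Hypothetical salaries; the table name follows the answer's example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE empsalary (salary INTEGER)")
conn.executemany("INSERT INTO empsalary VALUES (?)", [(100,), (200,), (300,)])

query = """
SELECT salary,
       ROW_NUMBER() OVER (ORDER BY salary
                          ROWS BETWEEN CURRENT ROW AND CURRENT ROW) AS rn,        -- frame ignored
       SUM(salary)  OVER (ORDER BY salary
                          ROWS BETWEEN CURRENT ROW AND CURRENT ROW) AS frame_sum  -- frame applied
FROM empsalary
ORDER BY salary
"""
frame_demo = conn.execute(query).fetchall()
print(frame_demo)  # [(100, 1, 100), (200, 2, 200), (300, 3, 300)]
```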