Using nested window function in Snowflake

Using nested window function in Snowflake - sql

I've seen a lot of questions about this general error, but I don't get why I have it, maybe because of nested window functions...
With the below query, I get the error for Col_C, Col_D, ... and almost everything I tried
SQL compilation error: [eachColumn] is not a valid group by expression
SELECT
Col_A,
Col_B,
FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
MAX(Col_D) OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
FIRST_VALUE(CASE WHEN Col_T = 'testvalue'
THEN LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ELSE NULL END) IGNORE NULLS
OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM mytable
So, is there a way to used nested window functions in Snowflake (with case when ...) and if so, how/what am I doing wrong ?

So deconstructing your logic to show it's the second FIRST_VALUE that causes the problem
WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
SELECT * FROM VALUES
(1,1,1,1,1,'testvalue',10),
(1,1,2,3,2,'value',11)
)
SELECT
Col_A,
Col_B,
FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
MAX(Col_D) OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,
IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
/*,FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as the_problem*/
FROM data
ORDER BY Col_A,Col_B, col_timestamp
;
if we uncomment the_problem we have it.. compare to PostgreSQL (my background) just getting to reuse so many prior results/steps is a gift, so here I just bust out another SELECT layer.
WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
SELECT * FROM VALUES
(1,1,1,1,1,'testvalue',10),
(1,1,2,3,2,'value',11)
)
SELECT *,
FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as not_a_problem
FROM (
SELECT
Col_A,
Col_B,
FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
MAX(Col_D) OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,
IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
,Col_TimeStamp
FROM data
)
ORDER BY Col_A,Col_B, Col_TimeStamp
And then it all works. This also happens if you LAG then IFF/FIRST_VALUE and then LAG that second result.

"I've seen a lot of questions about this general error, but I don't get why I have it, maybe because of nested window functions..."
Snowflake supports reusing expressions at the same level(sometimes called "lateral column alias reference" )
It is perfectly fine to write:
SELECT 1+1 AS col1,
col1 *2 AS col2,
CASE WHEN col1 > col2 THEN 'Y' ELSE 'NO' AS col3
...
In standard SQL you will either have to use multiple levels of query(cte) or use LATERAL JOIN. Related: PostgreSQL: using a calculated column in the same query
Unfortunately the same syntax will not work for analytic functions(and I am now aware of any RDMBS that supports it):
SELECT ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) AS rn
,MAX(rn) OVER(PARTITION BY <different than prev) AS m
FROM tab;
In the SQL Standard 2016 there is optional feature: T619 Nested window functions.
Here an article how the nested analytic function query could look like: Nested window functions in SQL.
It means that current way to nest windowed function is usage of derived table/cte:
WITH cte AS (
SELECT ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) AS rn
,*
FROM tab
)
SELECT *, MAX(rn) OVER(PARTITION BY <different than prev) AS m
FROM cte

Related

Optimizing windowing query in presto

I have a table with fields such as user_id, col1, col2, col3, updated_at, is_deleted, day.
And current query looks like this -
SELECT DISTINCT
user_id,
first_value(col1) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS col1,
first_value(col2) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS col2,
first_value(col3) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS col3,
bool_or(is_deleted) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS is_deleted
FROM
my_table
WHERE
day >= '2021-05-25'
Basically, I want the latest(first) value of each column, for each user id. Since each value column can be null, I am having to run same windowing query multiple times(for each column).
Currently, 66% of the time is being spent on windowing.
Any way to optimize?

seems like you want this :
select * from (
select * , row_number() over (partition by user_id ORDER BY updated_at DESC) rn
from my_table
where day >= '2021-05-25'
) t
where rn = 1

Can Db2 LAG function refer to itself?

I'm trying to put information to identify GROUP ID by replicating this Excel formula:
IF(OR(A2<>A1,AND(B2<>"000",B1="000")),D1+1,D1)
This formula is written when my cursor is in "D2", meaning I've referred to the newly added column value in the previous row to generate the current value.
I'd like to this with Db2 SQL, but I'm not sure how to because I'll need to do LAG function on the column I'm going to add and referring their value.
Kindly advise if having better way to do.
Thanks.

You need nested OLAP-functions, assuming ORDER BY SERIAL_NUMBER, EVENT_TIMESTAMP returns the order shown in Excel:
with cte as
(
select ...
case --IF(OR(A2<>A1,AND(B2<>"000",B1="000"))
when (lag(OPERATION)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) = '000'
and OPERATION <> '000')
or lag(SERIAL_NUMBER,1,'')
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) <> SERIAL_NUMBER
then 1
else 0
end as flag -- start of new group
from tab
)
select ...
sum(flag)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP
rows unbounded preceding) as GROUP_ID
from cte

Your code is counting the number of "breaks" in your data, where a "break" is defined as 000 or the value in the first column changing.
In SQL, you can do this as a cumulative sum:
select t.*,
sum(case when prev_serial_number = serial_number or operation <> '000'
then 0 else 1
end) over (order by event_timestamp rows between unbounded preceding and current row) as column_d
from (select t.*,
lag(serial_number) over (order by event_timestamp) as prev_serial_number
from t
) t

Populate one column based on the previous values of another

I am trying to create a column which populates the transaction ID for every row up until the row where that transaction was completed - in this example every "add to basket" event before an order.
So far I have tried using FIRST_VALUE:
SELECT
UserID, date, session_id, hitnumber, add_to_basket, transactionid,
first_value(transactionid) over (partition by trans_part order by date, transactionid) AS t_id
FROM(
select UserID, date, session_id, hitnumber, add_to_basket, transactionid,
SUM(CASE WHEN transactionid IS NULL THEN 0 ELSE 1 END) OVER (ORDER BY date, transactionid) AS trans_part,
FIRST_VALUE(transactionid IGNORE NULLS)
OVER (PARTITION BY userid ORDER BY hitnumber ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS t_id,
from q1
join q2 using (session_id)
order by 1,2,3,4
)
But the result I am getting is the inverse of what I want, populating the transaction ID of the previous order against the basket events which happened after this transaction.
How can I change my code so that I will see the transaction id of the order AFTER the basket events that led up to it? For example, in the table below I want to see the transaction id ending in ...095 instead of the id ending in ...383 for the column t_id.
Based on Gordon's answer below I have also tried:
last_value(transactionid ignore nulls) over(
order by hitnumber
rows between unbounded preceding and current row) as t_id2,
But this is not populating the event rows which proceed a transaction with a transaction id (seen below as t_id2):

You can use last_value(ignore nulls):
select . . . ,
last_value(transaction_id ignore nulls) over (
order by hitnumber
rows between unbounded preceding and current row
) as t_id
from q1 join
q2 using (session_id);
The difference from your answer is the windowing clause which ends at the current row.
EDIT:
It looks like there is one t_id per session_id, so just use max():
select . . . ,
max(transaction_id) over (partition by session_id) as t_id
from q1 join
q2 using (session_id);

How to use SQL LAG function with condition

I have a table as the following rows:
tipoProtocolo numeroProtocolo dataReferencia dataAssinatura dataVencimento
------------- --------------- -------------- -------------- --------------
1 47676 NULL 20150112 20151231
1 47676 20151231 20150209 NULL
1 47676 NULL 20150224 NULL
1 47676 NULL 20151005 NULL
1 47676 NULL 20151021 NULL
1 47676 NULL 20151026 NULL
1 47676 NULL 20151120 NULL
I've implemented a piece of code that gets the value from the dataVencimento column (previous row) to the dataRefencia column (red arrow in the image). However, I would like to check if the column dataVencimento (from the previous row) is NULL. If this condition is true I need to copy the value from the column dataReferencia from the previous row (blue arrow in the image).
SELECT tipoProtocolo,
numeroProtocolo,
LAG(dataVencimento, 1 ) OVER(
PARTITION BY numeroProtocolo, tipoProtocolo
ORDER BY dataAssinatura
) dataReferencia,
dataAssinatura,
dataVencimento
FROM cte_ContratoAditivo
Here is my SQL code:
SELECT tipoProtocolo, numeroProtocolo,
LAG(dataVencimento, 1) OVER(
PARTITION BY numeroProtocolo, tipoProtocolo
ORDER BY dataAssinatura
) dataReferencia,
dataAssinatura, dataVencimento
FROM cte_ContratoAditivo

What you want is lag(ignore nulls). Unfortunately, SQL Server does not support this.
If the dates are increasing, you can use a cumulative max:
select . . .,
max(dataVencimento) over (
partition by numeroProtocolo, tipoProtocolo
order by dataAssinatura
rows between unbounded preceding and 1 preceding
) as dataReferencia
If this is not the case, you can use two levels of aggregation:
select ca.*,
max(dataVencimento) over (
partition by numeroProtocolo, tipoProtocolo
order by dataAssinatura
) as dataReferencia
from (select ca.*,
count(dataVencimento) over (
partition by numeroProtocolo, tipoProtocolo
order by dataAssinatura
) as grouping
from cte_ContratoAditivo ca
) ca;
The subquery counts the number of valid values. This is really to assign a group number to the rows. The outer query then spreads the value over the entire group.

As the OP did respond I've gone with the literal and the guessed answer. The first is the literal answer; if the prior row is NULL use that the one prior:
WITH VTE AS (
SELECT *
FROM (VALUES(1,47676,CONVERT(date,NULL),CONVERT(date,'20150112'),CONVERT(date,'20151231')),
(1,47676,CONVERT(date,'20151231'),CONVERT(date,'20150209'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20150224'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20151005'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20151021'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20151026'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20151120'),CONVERT(date,NULL))) V(tipoProtocolo,numeroProtocolo,dataReferencia,dataAssinatura,dataVencimento)),
CTE AS(
SELECT V.tipoProtocolo,
V.numeroProtocolo,
V.dataReferencia,
V.dataAssinatura,
V.dataVencimento,
LAG(dataVencimento) OVER (PARTITION BY numeroProtocolo, tipoProtocolo ORDER BY dataAssinatura) AS dataReferencia1,
LAG(dataVencimento,2) OVER (PARTITION BY numeroProtocolo, tipoProtocolo ORDER BY dataAssinatura) AS dataReferencia2
FROM VTE V)
SELECT C.tipoProtocolo,
C.numeroProtocolo,
C.dataReferencia,
C.dataAssinatura,
C.dataVencimento,
ISNULL(C.dataReferencia1,C.dataReferencia2) AS dataReferencia
FROM CTE C;
The other is what I suspect the OP really means and that they want the last non-NULLvalue. If this is the case, this is a "classic" gaps and islands problem:
WITH VTE AS (
SELECT *
FROM (VALUES(1,47676,CONVERT(date,NULL),CONVERT(date,'20150112'),CONVERT(date,'20151231')),
(1,47676,CONVERT(date,'20151231'),CONVERT(date,'20150209'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20150224'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20151005'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20151021'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20151026'),CONVERT(date,NULL)),
(1,47676,CONVERT(date,NULL),CONVERT(date,'20151120'),CONVERT(date,NULL))) V(tipoProtocolo,numeroProtocolo,dataReferencia,dataAssinatura,dataVencimento)),
Grps AS(
SELECT V.tipoProtocolo,
V.numeroProtocolo,
V.dataReferencia,
V.dataAssinatura,
V.dataVencimento,
COUNT(dataVencimento) OVER (PARTITION BY numeroProtocolo, tipoProtocolo ORDER BY dataAssinatura
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM VTE V)
SELECT G.tipoProtocolo,
G.numeroProtocolo,
G.dataReferencia,
G.dataAssinatura,
G.dataVencimento,
MAX(dataVencimento) OVER (PARTITION BY G.Grp) AS dataReferencia,
G.Grp
FROM Grps G
ORDER BY dataAssinatura;
I will note that is seems odd that you call the column with the LAG expression dataReferencia, despite that the expression is on dataVencimento (and there is already a column called dataReferencia).

How create a calculated column in google bigquery?

I have a data in Google Bigquery like this
id yearmonth value
00007BR0011 201705 8.0
00007BR0011 201701 3.0
and I need to create a table where per id shows the subtraction by year in order to create something like this
id value
00007BR0011 5
The value 5 is the subtraction of the value in 201705 minus the value in 201701
I am using standard SQL, but don't know how to create the column with the calculation.
Sorry in advance if it is too basic, but didn't find anything yet useful.

Perhaps a single table/result set would work for your purposes:
select id,
(max(case when yearmonth = 201705 then value end) -
max(case when yearmonth = 201701 then value end) -
)
from t
where yearmonth in (201705, 201701)
group by id;

It's difficult to answer this based on the current level of detail, but if the smaller value is always subtracted from the larger (and both are never null), you could handle it this way using GROUP BY:
SELECT
id,
MAX(value) - MIN(value) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id
From here, you could save these results as a new table, or save this query as a view definition (which would be similar to having it calculated on the fly if the underlying data is changing).
Another option is to add a column under the table schema, then run an UPDATE query to populate it.
If the smaller value isn't always subtracted from the larger, but rather the lower date is what matters (and there are always two), another way to do this would be to use analytic (or window) functions to select the value with the lowest date:
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
LAST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) AS new_value
FROM
`your-project.your_dataset.your_table`
Because analytic functions operate on the source rows, DISTINCT is needed to eliminate the duplicate rows.
If there could be more than two rows and you need all the prior values subtracted from the latest value, you could handle it this way (which would also be safe against NULLs or only having one row):
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
(
SUM(value) OVER(PARTITION BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
You could technically do the same thing with grouping and ARRAY_AGG with dereferencing, although this method will be significantly slower on larger data sets:
SELECT
id,
(
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
-
(
SUM(value)
-
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Using nested window function in Snowflake - sql

Related

Optimizing windowing query in presto

Can Db2 LAG function refer to itself?

Populate one column based on the previous values of another

How to use SQL LAG function with condition

How create a calculated column in google bigquery?

Categories

Resources