ORDER BY clause in a Window function with a frame clause - sql

I want to take the min and max of a column within each partition.
See the example below (both methods give the correct answer). I do not understand why I have to add the ORDER BY clause.
When min and max are the aggregate functions, what possible difference can the ORDER BY make?
DROP TABLE IF EXISTS #HELLO;
CREATE TABLE #HELLO (Category char(2), q int);
INSERT INTO #HELLO (Category, q)
VALUES ('A',1), ('A',5), ('A',6), ('B',0), ('B',3)
SELECT *,
min(q) OVER (PARTITION BY category ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS minvalue
,max(q) OVER (PARTITION BY category ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS maxvalue
,min(q) OVER (PARTITION BY category ORDER BY q ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS minvalue2
,max(q) OVER (PARTITION BY category ORDER BY q ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS maxvalue2
FROM #HELLO;

If you use a ROWS or RANGE clause in an OVER clause, you need to provide an ORDER BY clause, because you are typically telling the OVER clause how many rows to look behind and ahead of the current row, and that is only well defined once the rows are ordered.
However, in your case the frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, i.e. every row in the partition, so you need neither the frame nor the ORDER BY. The following produces the same results:
SELECT *,
min(q) OVER (PARTITION BY category) AS minvalue
,max(q) OVER (PARTITION BY category) AS maxvalue
,min(q) OVER (PARTITION BY category) AS minvalue2
,max(q) OVER (PARTITION BY category) AS maxvalue2
FROM #HELLO;
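To see where ORDER BY does change the result, here is a minimal sketch using Python's sqlite3 module as a stand-in for SQL Server (an assumption: it needs SQLite 3.25 or later for window functions, and the table is named hello rather than #HELLO). It compares the whole-partition window with an ordered window that has no explicit frame:

```python
import sqlite3

# SQLite stands in for SQL Server here; the data mirrors the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hello (category TEXT, q INTEGER)")
conn.executemany(
    "INSERT INTO hello (category, q) VALUES (?, ?)",
    [("A", 1), ("A", 5), ("A", 6), ("B", 0), ("B", 3)],
)

# No ORDER BY and no frame: the window is the whole partition.
whole_partition = conn.execute(
    """
    SELECT category, q,
           min(q) OVER (PARTITION BY category) AS minvalue,
           max(q) OVER (PARTITION BY category) AS maxvalue
    FROM hello
    ORDER BY category, q
    """
).fetchall()

# ORDER BY without an explicit frame: the default frame ends at the
# current row, so max() turns into a running maximum.
running = conn.execute(
    """
    SELECT category, q,
           max(q) OVER (PARTITION BY category ORDER BY q) AS running_max
    FROM hello
    ORDER BY category, q
    """
).fetchall()

print(whole_partition)
print(running)
```

So ORDER BY only matters for min and max when the frame stops short of the full partition; with an unbounded frame on both sides it has no effect on the result.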

Related

Optimizing windowing query in presto

I have a table with fields such as user_id, col1, col2, col3, updated_at, is_deleted, day.
The current query looks like this:
SELECT DISTINCT
user_id,
first_value(col1) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS col1,
first_value(col2) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS col2,
first_value(col3) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS col3,
bool_or(is_deleted) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS is_deleted
FROM
my_table
WHERE
day >= '2021-05-25'
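Basically, I want the latest (first) value of each column for each user id. Since each value column can be null, I have to run the same windowing query multiple times (once per column).
Currently, 66% of the time is being spent on windowing.
Is there any way to optimize this?
It seems like you want the latest row per user, which needs only one window function. Note that this returns that row as a whole, so a column that is NULL in the latest row stays NULL: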
select * from (
select * , row_number() over (partition by user_id ORDER BY updated_at DESC) rn
from my_table
where day >= '2021-05-25'
) t
where rn = 1
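A small sketch of the row_number() approach, run through Python's sqlite3 module instead of Presto (an assumption, as are the sample rows and the reduced column list; SQLite 3.25+ is needed for window functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (user_id INTEGER, col1 TEXT, col2 TEXT, updated_at TEXT)"
)
# Made-up rows: user 1's latest row has a NULL in col1.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        (1, "a", "x", "2021-05-25"),
        (1, None, "y", "2021-05-26"),
        (2, "b", None, "2021-05-25"),
    ],
)

# Number the rows of each user from newest to oldest and keep the newest.
latest = conn.execute(
    """
    SELECT user_id, col1, col2, updated_at
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
        FROM my_table
    ) AS t
    WHERE rn = 1
    ORDER BY user_id
    """
).fetchall()

print(latest)
```

For user 1 this yields NULL in col1, where first_value(col1) ignore nulls would have fallen back to 'a'. Whether that difference matters depends on how often the latest row has gaps.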

Populate one column based on the previous values of another

I am trying to create a column which populates the transaction ID for every row up until the row where that transaction was completed - in this example every "add to basket" event before an order.
So far I have tried using FIRST_VALUE:
SELECT
UserID, date, session_id, hitnumber, add_to_basket, transactionid,
first_value(transactionid) over (partition by trans_part order by date, transactionid) AS t_id
FROM(
select UserID, date, session_id, hitnumber, add_to_basket, transactionid,
SUM(CASE WHEN transactionid IS NULL THEN 0 ELSE 1 END) OVER (ORDER BY date, transactionid) AS trans_part,
FIRST_VALUE(transactionid IGNORE NULLS)
OVER (PARTITION BY userid ORDER BY hitnumber ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS t_id,
from q1
join q2 using (session_id)
order by 1,2,3,4
)
But the result I am getting is the inverse of what I want, populating the transaction ID of the previous order against the basket events which happened after this transaction.
How can I change my code so that I will see the transaction id of the order AFTER the basket events that led up to it? For example, in the table below I want to see the transaction id ending in ...095 instead of the id ending in ...383 for the column t_id.
Based on Gordon's answer below I have also tried:
last_value(transactionid ignore nulls) over(
order by hitnumber
rows between unbounded preceding and current row) as t_id2,
But this does not populate the event rows that precede a transaction with that transaction's ID (seen below as t_id2):
You can use last_value(ignore nulls):
select . . . ,
last_value(transactionid ignore nulls) over (
order by hitnumber
rows between unbounded preceding and current row
) as t_id
from q1 join
q2 using (session_id);
The difference from your query is the window frame, which ends at the current row.
EDIT:
It looks like there is one t_id per session_id, so just use max():
select . . . ,
max(transactionid) over (partition by session_id) as t_id
from q1 join
q2 using (session_id);
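A minimal sketch of the max() variant, assuming there really is at most one transaction per session. It uses Python's sqlite3 module rather than BigQuery, a single hypothetical hits table in place of the q1/q2 join, and made-up IDs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (session_id TEXT, hitnumber INTEGER, transactionid TEXT)"
)
# Basket events carry no transaction ID; only the order hit does.
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [
        ("s1", 1, None),
        ("s1", 2, None),
        ("s1", 3, "T095"),
        ("s2", 1, None),
        ("s2", 2, "T383"),
    ],
)

# max() skips NULLs, so every row of a session gets that session's ID.
filled = conn.execute(
    """
    SELECT session_id, hitnumber,
           max(transactionid) OVER (PARTITION BY session_id) AS t_id
    FROM hits
    ORDER BY session_id, hitnumber
    """
).fetchall()

print(filled)
```

If a session can contain several transactions, this would assign the largest ID to all of its rows, so the assumption is worth checking against the real data first.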

How to create a calculated column in Google BigQuery?

I have data in Google BigQuery like this:
id           yearmonth  value
00007BR0011  201705     8.0
00007BR0011  201701     3.0
and I need to create a table that shows, per id, the result of subtracting one value from the other, something like this:
id           value
00007BR0011  5
The value 5 is the value for 201705 minus the value for 201701.
I am using standard SQL, but I don't know how to create the column with the calculation.
Sorry in advance if this is too basic, but I haven't found anything useful yet.
Perhaps a single table/result set would work for your purposes:
select id,
(max(case when yearmonth = 201705 then value end) -
max(case when yearmonth = 201701 then value end)
)
from t
where yearmonth in (201705, 201701)
group by id;
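The conditional aggregation above can be tried out with a small sketch in Python's sqlite3 module (an assumption; the question is about BigQuery). The 201703 row is made up to show that the WHERE clause keeps other months out of the calculation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, yearmonth INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("00007BR0011", 201705, 8.0),
        ("00007BR0011", 201701, 3.0),
        ("00007BR0011", 201703, 100.0),  # hypothetical extra month, filtered out
    ],
)

# Each CASE picks the value of one month; max() collapses it to one row per id.
diff = conn.execute(
    """
    SELECT id,
           max(CASE WHEN yearmonth = 201705 THEN value END)
         - max(CASE WHEN yearmonth = 201701 THEN value END) AS diff
    FROM t
    WHERE yearmonth IN (201705, 201701)
    GROUP BY id
    """
).fetchall()

print(diff)
```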
It's difficult to answer this based on the current level of detail, but if the smaller value is always subtracted from the larger (and both are never null), you could handle it this way using GROUP BY:
SELECT
id,
MAX(value) - MIN(value) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id
From here, you could save these results as a new table, or save this query as a view definition (which would be similar to having it calculated on the fly if the underlying data is changing).
Another option is to add a column under the table schema, then run an UPDATE query to populate it.
If the smaller value isn't always subtracted from the larger, but rather the date order is what matters (and there are always two rows per id), another way to do this is to use analytic (window) functions to subtract the value with the earliest yearmonth from the value with the latest one:
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
LAST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) AS new_value
FROM
`your-project.your_dataset.your_table`
Because analytic functions operate on the source rows, DISTINCT is needed to eliminate the duplicate rows.
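To illustrate how the date-ordered version differs from MAX(value) - MIN(value), here is a sketch using Python's sqlite3 module in place of BigQuery (an assumption). The second id, X, is made up so that its later value is the smaller one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, yearmonth INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("00007BR0011", 201705, 8.0),
        ("00007BR0011", 201701, 3.0),
        ("X", 201705, 4.0),  # hypothetical id whose value went down
        ("X", 201701, 9.0),
    ],
)

# Latest value minus earliest value, decided by yearmonth rather than by size.
by_date = conn.execute(
    """
    SELECT DISTINCT id,
           first_value(value) OVER (PARTITION BY id ORDER BY yearmonth DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
         - last_value(value) OVER (PARTITION BY id ORDER BY yearmonth DESC
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS new_value
    FROM t
    ORDER BY id
    """
).fetchall()

# Larger value minus smaller value, regardless of date.
by_size = conn.execute(
    "SELECT id, MAX(value) - MIN(value) AS new_value FROM t GROUP BY id ORDER BY id"
).fetchall()

print(by_date)
print(by_size)
```

The two queries agree for the id from the question and differ only in sign for X, which is the case the window version is meant to handle.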
If there could be more than two rows and you need all the prior values subtracted from the latest value, you could handle it this way (which would also be safe against NULLs or only having one row):
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
(
SUM(value) OVER(PARTITION BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
You could technically do the same thing with grouping and ARRAY_AGG with dereferencing, although this method will be significantly slower on larger data sets:
SELECT
id,
(
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
-
(
SUM(value)
-
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id

SQL query on redshift to get the first and the last value

I have a data set like this.
I need to write a query which gives me the below output
For every SessionID and VisitID, it should sort by the date_time column and give me the first Category and the last Category.
I have used the following code
rank() OVER( PARTITION BY SessionID
, VisitID
ORDER by
date_Time DESC ) as click_rank_last
where click_rank_last = 1
to get the last Category. But what I need is to get the first and the last in a single query with as few hits to the database as possible, as the data is huge and querying is costly.
I need the most efficient query!
One way would be:
select distinct
sessionid,
visitid,
first_value(category) over (
partition by sessionid, visitid
order by date_time
rows between unbounded preceding and unbounded following) as first_category,
last_value(category) over (
partition by sessionid, visitid
order by date_time
rows between unbounded preceding and unbounded following) as last_category
from tbl
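A short sketch of why the explicit frame matters for last_value, using Python's sqlite3 module as a stand-in for Redshift (an assumption, as are the sample rows; SQLite 3.25+ is required):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (sessionid INTEGER, visitid INTEGER, date_time TEXT, category TEXT)"
)
# Made-up click stream for two sessions.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2021-01-01 10:00", "home"),
        (1, 1, "2021-01-01 10:05", "search"),
        (1, 1, "2021-01-01 10:09", "cart"),
        (2, 1, "2021-01-02 09:00", "home"),
    ],
)

# Full frame: both functions see the entire partition.
first_last = conn.execute(
    """
    SELECT DISTINCT sessionid, visitid,
           first_value(category) OVER (PARTITION BY sessionid, visitid ORDER BY date_time
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_category,
           last_value(category) OVER (PARTITION BY sessionid, visitid ORDER BY date_time
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_category
    FROM tbl
    ORDER BY sessionid, visitid
    """
).fetchall()

# Default frame: it ends at the current row, so last_value just echoes that row.
default_frame = conn.execute(
    """
    SELECT last_value(category) OVER (PARTITION BY sessionid, visitid ORDER BY date_time)
    FROM tbl
    WHERE sessionid = 1
    ORDER BY date_time
    """
).fetchall()

print(first_last)
print(default_frame)
```

In this sketch the default frame makes last_value return each row's own category, which is why the query spells out the unbounded frame.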

Hive HQL - optimizing repetitive WINDOW clause

I have the following HQL:
SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
min(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) minTime,
max(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) maxTime
FROM t21_pam6
How can I define the three identical window clauses just once?
The documentation (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics) shows this example
SELECT a, SUM(b) OVER w
FROM T;
WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING)
But it doesn't seem to work: WINDOW w AS ... is rejected as not being a valid HQL command.
This type of optimization is something that the compiler would need to do. I don't think there is a way to ensure this programmatically.
That said, the calculation for the minimum time is totally unnecessary. Because of the order by, it should be the time in the current row. Similarly, if you can handle null values, then the expression can be simplified to:
SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
event.time as minTime,
lead(event.time, 2) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time) as maxTime
FROM t21_pam6;
Note that the maxtime calculation is slightly different because it will return NULL for the last two values matching the conditions.
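The difference between the two maxTime calculations can be seen in a small sketch with Python's sqlite3 module (an assumption; the question is about Hive). It uses a hypothetical one-column events table and drops the partitioning for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (t INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(10,), (20,), (30,), (40,)])

# max() over a shrinking frame still returns a value near the end of the data;
# lead(t, 2) has nothing to look at for the last two rows and returns NULL.
compare = conn.execute(
    """
    SELECT t,
           max(t) OVER (ORDER BY t ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS max_time,
           lead(t, 2) OVER (ORDER BY t) AS lead_time
    FROM events
    ORDER BY t
    """
).fetchall()

print(compare)
```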
As @sergey-khudyakov responded, there was a bug in the documentation: the WINDOW clause is part of the SELECT statement, so the semicolon must not come before it. This variant works fine:
SELECT count(*) OVER w,
min(event.time) OVER w,
max(event.time) OVER w
FROM ar3.t21_pam6
WINDOW w AS (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
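The same named-window pattern can be tried in a sketch with Python's sqlite3 module, which also accepts a WINDOW clause after FROM (using SQLite instead of Hive is an assumption, and the events table with its hwid and t columns is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (hwid INTEGER, t INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [(1, 10), (1, 20), (1, 30), (1, 40)]
)

# The window is defined once as w and reused by all three functions.
stats = conn.execute(
    """
    SELECT t,
           count(*) OVER w AS pocet,
           min(t)   OVER w AS min_time,
           max(t)   OVER w AS max_time
    FROM events
    WINDOW w AS (PARTITION BY hwid ORDER BY t ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
    ORDER BY t
    """
).fetchall()

print(stats)
```

Here min_time always equals the current row's t, which matches the earlier observation that the minimum does not need a window function at all.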