Optimizing windowing query in presto - sql

I have a table with fields such as user_id, col1, col2, col3, updated_at, is_deleted, day.
The current query looks like this:
SELECT DISTINCT
user_id,
first_value(col1) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS col1,
first_value(col2) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS col2,
first_value(col3) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS col3,
bool_or(is_deleted) ignore nulls OVER (partition BY user_id
ORDER BY
updated_at DESC rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED following) AS is_deleted
FROM
my_table
WHERE
day >= '2021-05-25'
Basically, I want the latest (first) non-null value of each column for each user id. Since each value column can be null, I have to run the same windowing query once per column.
Currently, 66% of the query time is spent on windowing.
Is there any way to optimize this?

It seems like you want this:
select * from (
select * , row_number() over (partition by user_id ORDER BY updated_at DESC) rn
from my_table
where day >= '2021-05-25'
) t
where rn = 1
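One thing to check before switching: row_number() with rn = 1 keeps the newest row as a whole, so a column that is NULL in that row stays NULL, whereas first_value(...) ignore nulls looks further back for each column separately. Below is a minimal sketch of that difference. It assumes SQLite (through Python's sqlite3) as a stand-in for Presto, uses made-up sample rows, and emulates the per-column "latest non-null" lookup with correlated subqueries purely for illustration. It is not a tuned Presto query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (user_id INTEGER, col1 TEXT, col2 TEXT, updated_at INTEGER)"
)
# Hypothetical sample data: user 1's newest row has a NULL col1.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [(1, "a", "x", 1), (1, None, "y", 2), (2, "b", None, 1)],
)

# Newest row per user (the row_number() approach).
latest_row = conn.execute("""
    SELECT user_id, col1, col2
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
        FROM my_table
    ) t
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()

# Newest non-null value per column per user (what "ignore nulls" asks for).
latest_non_null = conn.execute("""
    SELECT a.user_id,
           (SELECT b.col1 FROM my_table b
             WHERE b.user_id = a.user_id AND b.col1 IS NOT NULL
             ORDER BY b.updated_at DESC LIMIT 1) AS col1,
           (SELECT b.col2 FROM my_table b
             WHERE b.user_id = a.user_id AND b.col2 IS NOT NULL
             ORDER BY b.updated_at DESC LIMIT 1) AS col2
    FROM (SELECT DISTINCT user_id FROM my_table) a
    ORDER BY a.user_id
""").fetchall()

print(latest_row)       # user 1 keeps the NULL col1 of its newest row
print(latest_non_null)  # user 1 falls back to the older non-null col1
```

If the newest row per user never has NULLs in the value columns, the two results coincide and the row_number() form is enough.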

Related

ORDER BY clause in a Window function with a frame clause

I want to take the min and max for a column within each partition.
See example below (both methods give the correct answer). I do not understand why I have to add the ORDER BY clause.
When using min and max as the aggregate function what possible difference will the ORDER BY have?
DROP TABLE IF EXISTS #HELLO;
CREATE TABLE #HELLO (Category char(2), q int);
INSERT INTO #HELLO (Category, q)
VALUES ('A',1), ('A',5), ('A',6), ('B',0), ('B',3)
SELECT *,
min(q) OVER (PARTITION BY category ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS minvalue
,max(q) OVER (PARTITION BY category ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS maxvalue
,min(q) OVER (PARTITION BY category ORDER BY q ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS minvalue2
,max(q) OVER (PARTITION BY category ORDER BY q ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS maxvalue2
FROM #HELLO;
In SQL Server, if you use the ROWS or RANGE clause in an OVER clause, you must also provide an ORDER BY clause, because a frame normally says how many rows to look backward and forward from the current row, and that is only defined once the rows are ordered.
However, in your case the frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, i.e. all rows in the partition, so you need neither the frame nor the ORDER BY. The following produces the same results:
SELECT *,
min(q) OVER (PARTITION BY category) AS minvalue
,max(q) OVER (PARTITION BY category) AS maxvalue
,min(q) OVER (PARTITION BY category) AS minvalue2
,max(q) OVER (PARTITION BY category) AS maxvalue2
FROM #HELLO;
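To see where ORDER BY does make a difference, here is a small sketch using the same five rows. It assumes SQLite via Python's sqlite3 rather than SQL Server, and the table name hello and the helper run() are made up for the example. With a full-partition frame the ordering is irrelevant, but ORDER BY without an explicit frame falls back to the default frame (up to the current row), which turns min/max into running values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hello (category TEXT, q INTEGER)")
conn.executemany(
    "INSERT INTO hello VALUES (?, ?)",
    [("A", 1), ("A", 5), ("A", 6), ("B", 0), ("B", 3)],
)

def run(over_clause):
    """Return (category, q, min, max) using the given OVER specification."""
    sql = f"""
        SELECT category, q,
               MIN(q) OVER ({over_clause}) AS minvalue,
               MAX(q) OVER ({over_clause}) AS maxvalue
        FROM hello
        ORDER BY category, q
    """
    return conn.execute(sql).fetchall()

# Explicit full-partition frame: ORDER BY has no effect on min/max.
full_frame = run(
    "PARTITION BY category ORDER BY q "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING"
)
# No ORDER BY and no frame: the aggregate sees the whole partition.
no_order = run("PARTITION BY category")
# ORDER BY with the default frame: running min/max up to the current row.
running = run("PARTITION BY category ORDER BY q")

print(full_frame == no_order)  # True
print(running)
```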

Impala Last_Value() Not giving result as expected

I have a table in Impala that stores time as Unix time (at a frequency of 1 ms) along with three variables, as shown below:
ts Val1 Val2 Val3
1.60669E+12 7541.76 0.55964607 267.1613
1.60669E+12 7543.04 0.5607262 267.27805
1.60669E+12 7543.04 0.5607241 267.22308
1.60669E+12 7543.6797 0.56109643 267.25974
1.60669E+12 7543.6797 0.56107396 267.30624
1.60669E+12 7543.6797 0.56170875 267.2643
I want to resample the data and get the last value in each new time window. For example, when resampling to a 10-second frequency, the output should be the last value of each 10-second window, as shown below:
ts val1_Last Val2_Last Val3_Last
2020-11-29 22:30:00 7541.76 0.55964607 267.1613
2020-11-29 22:30:10 7542.3994 0.5613486 267.31238
2020-11-29 22:30:20 7542.3994 0.5601791 267.22842
2020-11-29 22:30:30 7544.32 0.56069416 267.20248
To have this result, I am running the following query:
select distinct *
from (
select ts,
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,
last_value(Val2) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts ,
Val1 as Val1,
Val2 as Val2,
Val3 as Val3
FROM Sensor_Data.Table where unit='Unit1'
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt
order by ts
I have read on some forums that LAST_VALUE() sometimes causes problems, so I tried to achieve the same thing using FIRST_VALUE with ORDER BY ... DESC. The query is given below:
select distinct *
from (
select ts,
first_value(Val1) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val1,
first_value(Val2) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val2,
first_value(Val3) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts ,
Val1 as Val1,
val2 as Val2,
Val3 as Val3
FROM product_sofcdtw_ops.as_operated_full_backup where unit='FCS05-09'
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt
order by ts
But in both cases I am not getting the expected result. The resampled time ts appears as expected (in 10-second windows), but the values for Val1, Val2 and Val3 look like they are picked at random from within each window (0-9 s, 10-19 s, ...).
Logically the query looks fine to me and I didn't find any problem. Could anybody explain why this query does not give the right answer?
Thanks !!!
The problem is this line:
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,
You are partitioning and ordering by the same column, ts -- so there is no ordering (or more specifically ordering by a value that is constant throughout the partition results in an arbitrary ordering). You need to preserve the original ts to make this work, using that for ordering:
select ts,
last_value(Val1) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val1,
last_value(Val2) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts_10,
t.*
FROM Sensor_Data.Table t
WHERE unit = 'Unit1' AND
cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00'
) t
Incidentally, the issue with last_value() is that it has unexpected behavior when you leave out the window frame (the rows or range part of the window function specification).
The issue is that the default specification is range between unbounded preceding and current row, meaning that last_value() just picks up the value in the current row.
On the other hand, first_value() works fine with the default frame. However, both are equivalent if you include an explicit frame.
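Both points can be reproduced on a toy table. The sketch below assumes SQLite via Python's sqlite3 in place of Impala, a made-up readings table with ts already in whole seconds, and integer division to build the 10-second bucket ts_10. It partitions by the bucket while ordering by the original ts, then shows what happens when the frame is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, val REAL)")
# Hypothetical readings: three in the 0-9 s window, two in the 10-19 s window.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(0, 1.0), (4, 2.0), (9, 3.0), (10, 4.0), (15, 5.0)],
)

# Partition by the bucket, order by the ORIGINAL timestamp, explicit full frame.
explicit_frame = conn.execute("""
    SELECT DISTINCT ts_10,
           LAST_VALUE(val) OVER (
               PARTITION BY ts_10 ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS val_last
    FROM (SELECT (ts / 10) * 10 AS ts_10, ts, val FROM readings) AS t
    ORDER BY ts_10
""").fetchall()

# Same window without a frame: the default frame stops at the current row,
# so LAST_VALUE just returns each row's own value.
default_frame = conn.execute("""
    SELECT ts,
           LAST_VALUE(val) OVER (PARTITION BY ts_10 ORDER BY ts) AS val_last
    FROM (SELECT (ts / 10) * 10 AS ts_10, ts, val FROM readings) AS t
    ORDER BY ts
""").fetchall()

print(explicit_frame)  # one row per 10-second window, holding its last reading
print(default_frame)   # every row paired with its own value
```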

Populate one column based on the previous values of another

I am trying to create a column which populates the transaction ID for every row up until the row where that transaction was completed - in this example every "add to basket" event before an order.
So far I have tried using FIRST_VALUE:
SELECT
UserID, date, session_id, hitnumber, add_to_basket, transactionid,
first_value(transactionid) over (partition by trans_part order by date, transactionid) AS t_id
FROM(
select UserID, date, session_id, hitnumber, add_to_basket, transactionid,
SUM(CASE WHEN transactionid IS NULL THEN 0 ELSE 1 END) OVER (ORDER BY date, transactionid) AS trans_part,
FIRST_VALUE(transactionid IGNORE NULLS)
OVER (PARTITION BY userid ORDER BY hitnumber ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS t_id,
from q1
join q2 using (session_id)
order by 1,2,3,4
)
But the result I am getting is the inverse of what I want, populating the transaction ID of the previous order against the basket events which happened after this transaction.
How can I change my code so that I will see the transaction id of the order AFTER the basket events that led up to it? For example, in the table below I want to see the transaction id ending in ...095 instead of the id ending in ...383 for the column t_id.
Based on Gordon's answer below I have also tried:
last_value(transactionid ignore nulls) over(
order by hitnumber
rows between unbounded preceding and current row) as t_id2,
But this does not populate the event rows which precede a transaction with that transaction's id (seen below as t_id2):
You can use last_value(ignore nulls):
select . . . ,
last_value(transactionid ignore nulls) over (
order by hitnumber
rows between unbounded preceding and current row
) as t_id
from q1 join
q2 using (session_id);
The difference from your query is the windowing clause, which ends at the current row.
EDIT:
It looks like there is one t_id per session_id, so just use max():
select . . . ,
max(transactionid) over (partition by session_id) as t_id
from q1 join
q2 using (session_id);
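A quick way to sanity-check the max() variant is a toy table like the one below. This is only a sketch: it assumes SQLite via Python's sqlite3, a single made-up hits table in place of the q1/q2 join, invented transaction ids, and, as in the EDIT above, at most one transaction per session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (session_id TEXT, hitnumber INTEGER, transactionid TEXT)"
)
# Hypothetical data: basket events (NULL id) followed by the order hit.
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [
        ("s1", 1, None),
        ("s1", 2, None),
        ("s1", 3, "T095"),
        ("s2", 1, None),
        ("s2", 2, "T383"),
    ],
)

# MAX() skips NULLs, so every hit in a session receives that session's id,
# including the basket events that came before the order.
rows = conn.execute("""
    SELECT session_id, hitnumber,
           MAX(transactionid) OVER (PARTITION BY session_id) AS t_id
    FROM hits
    ORDER BY session_id, hitnumber
""").fetchall()

for row in rows:
    print(row)
```

If a session can contain several orders, this no longer holds and the rows would need to be split per order first.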

Vertica/SQL: Getting rows immediately proceeding events

Consider a simple query
select * from tbl where status = 'MELTDOWN'
I would like to now create a table that in addition to including these rows, also includes the previous p rows and the subsequent n rows, so that I can get a sense as to what happens in the surrounding time of these MELTDOWNs. Any hints?
You can do this with window functions by getting the seqnum of the meltdown rows. I prefer to do this with lag()/lead() ignore nulls, but Vertica doesn't support that. I think this is the equivalent with first_value()/last_value():
with t as (
select t.*, row_number() over (order by id) as seqnum
from tbl t
),
tt as (
select t.*,
last_value(case when status = 'meltdown' then seqnum end ignore nulls) over (order by seqnum rows between unbounded preceding and current row) as prev_meltdown_seqnum,
first_value(case when status = 'meltdown' then seqnum end ignore nulls) over (order by seqnum rows between current row and unbounded following) as next_meltdown_seqnum
from t
)
select tt.*
from tt
where seqnum between prev_meltdown_seqnum and prev_meltdown_seqnum + 7 or
seqnum between next_meltdown_seqnum - 5 and next_meltdown_seqnum;
WITH
grouped AS
(
SELECT
SUM(
CASE WHEN status = 'Meltdown' THEN 1 ELSE 0 END
)
OVER (
ORDER BY timeStamp
)
AS GroupID,
tbl.*
FROM
tbl
),
sorted AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY timeStamp ASC ) AS incPos,
ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY timeStamp DESC) AS decPos,
MAX(GroupID) OVER () AS LastGroup,
grouped.*
FROM
grouped
)
SELECT
sorted.*
FROM
sorted
WHERE
(incPos <= 8 AND GroupID > 0 ) -- Meltdown and the 7 events following it
OR (decPos <= 6 AND GroupID <> LastGroup) -- and the 6 events preceding a Meltdown
ORDER BY
timeStamp
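The running-sum grouping above can be tried out on a small table. The following is a sketch under assumptions: SQLite via Python's sqlite3 instead of Vertica, a made-up eight-row tbl with a ts column, and window sizes P (rows before) and N (rows after) set to 1 so the result is easy to verify by eye.

```python
import sqlite3

P, N = 1, 1  # hypothetical window: 1 row before and 1 row after each meltdown

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (ts INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(1, "ok"), (2, "ok"), (3, "MELTDOWN"), (4, "ok"),
     (5, "ok"), (6, "ok"), (7, "MELTDOWN"), (8, "ok")],
)

query = """
WITH grouped AS (
    -- Running count of meltdowns: each meltdown row starts a new group.
    SELECT tbl.*,
           SUM(CASE WHEN status = 'MELTDOWN' THEN 1 ELSE 0 END)
               OVER (ORDER BY ts) AS grp
    FROM tbl
),
sorted AS (
    SELECT grouped.*,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ts ASC)  AS inc_pos,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ts DESC) AS dec_pos,
           MAX(grp) OVER () AS last_grp
    FROM grouped
)
SELECT ts
FROM sorted
WHERE (inc_pos <= 1 + :n AND grp > 0)         -- the meltdown plus N rows after it
   OR (dec_pos <= :p AND grp <> last_grp)     -- the P rows before the next meltdown
ORDER BY ts
"""
kept = [row[0] for row in conn.execute(query, {"n": N, "p": P})]
print(kept)  # meltdowns at ts 3 and 7, each with one neighbour on either side
```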

selecting a consolidated value based on another value in the same select query?

Sorry for the poorly worded question, but I am consolidating customer records using the following query:
select
customer_key
,FIRST_VALUE(name IGNORE NULLS) OVER(PARTITION BY customer_key ORDER BY last_updated_date desc ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS NAME
,FIRST_VALUE(county IGNORE NULLS) OVER(PARTITION BY customer_key ORDER BY last_updated_date desc ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS COUNTRY
,FIRST_VALUE(country_code IGNORE NULLS) OVER(PARTITION BY customer_key ORDER BY last_updated_date desc ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS COUNTRY_CODE
from customers cust
This selects the most recent non-null value of each column for a customer, using the customer_key. For country, however, I need the country code to come from the same row as the country, with the country field as the driver. The problem is that the country_code column is a NOT NULL field, so IGNORE NULLS never skips it.
For example, this raw data:
customer Country Country_Code Date
Dave NULL 0 30/08/2017
David UK 1 29/08/2017
Needs to display as:
customer Country Country_Code
Dave UK 1
Dave UK 1
But using the select query I'm currently using I get this:
customer Country Country_Code
Dave UK 0
Dave UK 0
Any suggestions?
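The following query should work. The CASE expression turns country_code into NULL on rows where the country is NULL, so IGNORE NULLS skips those rows and the code is taken from the same row as the country: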
SELECT FIRST_VALUE(NAME IGNORE NULLS) OVER (
PARTITION BY customer_key ORDER BY last_updated_date DESC ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
) AS NAME
,FIRST_VALUE(county IGNORE NULLS) OVER (
PARTITION BY customer_key ORDER BY last_updated_date DESC ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
) AS COUNTRY
,FIRST_VALUE((
CASE
WHEN county IS NULL
THEN NULL
ELSE country_code
END
) IGNORE NULLS) OVER (
PARTITION BY customer_key ORDER BY last_updated_date DESC ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
) AS COUNTRY_CODE
FROM customers cust;