Impala LAST_VALUE() not giving the expected result - sql

I have a table in Impala that contains time information as Unix time (with a 1 ms sampling frequency) and values for three variables, as shown below:
ts Val1 Val2 Val3
1.60669E+12 7541.76 0.55964607 267.1613
1.60669E+12 7543.04 0.5607262 267.27805
1.60669E+12 7543.04 0.5607241 267.22308
1.60669E+12 7543.6797 0.56109643 267.25974
1.60669E+12 7543.6797 0.56107396 267.30624
1.60669E+12 7543.6797 0.56170875 267.2643
I want to resample the data and take the last value in each new time window. For example, resampling at a 10-second frequency should return the last value of each 10-second window, as shown below:
ts val1_Last Val2_Last Val3_Last
2020-11-29 22:30:00 7541.76 0.55964607 267.1613
2020-11-29 22:30:10 7542.3994 0.5613486 267.31238
2020-11-29 22:30:20 7542.3994 0.5601791 267.22842
2020-11-29 22:30:30 7544.32 0.56069416 267.20248
To get this result, I am running the following query:
select distinct *
from (
select ts,
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,
last_value(Val2) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val2,
last_value(Val3) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts ,
Val1 as Val1,
Val2 as Val2,
Val3 as Val3
FROM Sensor_Data.Table where unit='Unit1'
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt
order by ts
I have read on some forums that LAST_VALUE() sometimes causes problems, so I tried to achieve the same thing using FIRST_VALUE() with ORDER BY ... DESC. The query is given below:
select distinct *
from (
select ts,
first_value(Val1) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val1,
first_value(Val2) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val2,
first_value(Val3) over (partition by ts order by ts desc rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts ,
Val1 as Val1,
Val2 as Val2,
Val3 as Val3
FROM Sensor_Data.Table where unit='Unit1'
and cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00') as ttt) as tttt
order by ts
But in both cases I am not getting the expected result. The resampled time ts appears as expected (in 10-second windows), but I am getting seemingly random values for Val1, Val2 and Val3 within the 0-9 s, 10-19 s, ... windows.
Logic-wise this query looks fine and I didn't find any problem. Could anybody explain why I am not getting the right answer with this query?
Thanks !!!

The problem is this line:
last_value(Val1) over (partition by ts order by ts rows between unbounded preceding and unbounded following) as Val1,
You are partitioning and ordering by the same column, ts, so there is effectively no ordering: ordering by a value that is constant throughout the partition results in an arbitrary ordering. You need to preserve the original ts and use it for ordering, partitioning instead by the truncated 10-second timestamp:
select distinct ts_10 as ts,
       last_value(Val1) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val1,
       last_value(Val2) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val2,
       last_value(Val3) over (partition by ts_10 order by ts rows between unbounded preceding and unbounded following) as Val3
from (SELECT cast(cast(unix_timestamp(cast(ts/1000 as TIMESTAMP))/10 as bigint)*10 as TIMESTAMP) as ts_10,
             t.*
      FROM Sensor_Data.Table t
      WHERE unit = 'Unit1' AND
            cast(ts/1000 as TIMESTAMP) BETWEEN '2020-11-29 22:30:00' and '2020-12-01 01:51:00'
     ) t
Incidentally, the issue with last_value() is that it behaves unexpectedly when you leave out the window frame (the ROWS or RANGE part of the window specification).
The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, meaning that last_value() just picks up the value in the current row.
On the other hand, first_value() works fine with the default frame. However, the two are equivalent if you include an explicit frame.
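To make the frame difference concrete, here is a minimal sketch (illustrative only, not part of the original answer), where t stands for the derived table from the corrected query above:
select ts_10,
       -- default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):
       -- this just returns Val1 of the current row
       last_value(Val1) over (partition by ts_10 order by ts) as val1_default_frame,
       -- explicit frame over the whole partition: this really is the last
       -- Val1 of the 10-second bucket
       last_value(Val1) over (partition by ts_10 order by ts
                              rows between unbounded preceding and unbounded following) as val1_true_last
from t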

Related

ORDER BY clause in a Window function with a frame clause

I want to take the min and max for a column within each partition.
See example below (both methods give the correct answer). I do not understand why I have to add the ORDER BY clause.
When using MIN and MAX as the aggregate functions, what possible difference will the ORDER BY make?
DROP TABLE IF EXISTS #HELLO;
CREATE TABLE #HELLO (Category char(2), q int);
INSERT INTO #HELLO (Category, q)
VALUES ('A',1), ('A',5), ('A',6), ('B',0), ('B',3)
SELECT *,
min(q) OVER (PARTITION BY category ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS minvalue
,max(q) OVER (PARTITION BY category ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS maxvalue
,min(q) OVER (PARTITION BY category ORDER BY q ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS minvalue2
,max(q) OVER (PARTITION BY category ORDER BY q ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS maxvalue2
FROM #HELLO;
If you use a ROWS or RANGE clause in an OVER clause, then you need to provide an ORDER BY clause, because you are typically telling the OVER clause how many rows to look behind and ahead, and that can only be determined if you have an ORDER BY.
However, in your case, because you use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, i.e. all rows in the partition, you don't need any of it. The following produces the same results:
SELECT *,
min(q) OVER (PARTITION BY category) AS minvalue
,max(q) OVER (PARTITION BY category) AS maxvalue
,min(q) OVER (PARTITION BY category) AS minvalue2
,max(q) OVER (PARTITION BY category) AS maxvalue2
FROM #HELLO;
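By contrast, the ORDER BY does change the result as soon as the frame is not the whole partition. A short sketch against the same #HELLO table (not part of the original answer): a running minimum whose frame ends at the current row depends entirely on the row order.
SELECT *,
       -- the frame ends at the current row, so the ordering determines
       -- which rows have been "seen" so far, and hence the running minimum
       min(q) OVER (PARTITION BY category ORDER BY q
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_min
FROM #HELLO;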

Optimizing windowing query in presto

I have a table with fields such as user_id, col1, col2, col3, updated_at, is_deleted, day.
The current query looks like this:
SELECT DISTINCT
    user_id,
    first_value(col1) IGNORE NULLS OVER (PARTITION BY user_id ORDER BY updated_at DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS col1,
    first_value(col2) IGNORE NULLS OVER (PARTITION BY user_id ORDER BY updated_at DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS col2,
    first_value(col3) IGNORE NULLS OVER (PARTITION BY user_id ORDER BY updated_at DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS col3,
    bool_or(is_deleted) IGNORE NULLS OVER (PARTITION BY user_id ORDER BY updated_at DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS is_deleted
FROM my_table
WHERE day >= '2021-05-25'
Basically, I want the latest (first) value of each column for each user_id. Since each value column can be null, I am having to run the same windowing query multiple times (once per column).
Currently, 66% of the time is being spent on windowing.
Any way to optimize?
It seems like you want this:
select * from (
select * , row_number() over (partition by user_id ORDER BY updated_at DESC) rn
from my_table
where day >= '2021-05-25'
) t
where rn = 1

Using nested window function in Snowflake

I've seen a lot of questions about this general error, but I don't get why I have it, maybe because of nested window functions...
With the query below, I get this error for Col_C, Col_D, and almost everything else I tried:
SQL compilation error: [eachColumn] is not a valid group by expression
SELECT
Col_A,
Col_B,
FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
MAX(Col_D) OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
FIRST_VALUE(CASE WHEN Col_T = 'testvalue'
THEN LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ELSE NULL END) IGNORE NULLS
OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM mytable
So, is there a way to use nested window functions in Snowflake (with CASE WHEN ...), and if so, how? What am I doing wrong?
Deconstructing your logic shows it's the second FIRST_VALUE that causes the problem:
WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
SELECT * FROM VALUES
(1,1,1,1,1,'testvalue',10),
(1,1,2,3,2,'value',11)
)
SELECT
Col_A,
Col_B,
FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
MAX(Col_D) OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,
IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
/*,FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as the_problem*/
FROM data
ORDER BY Col_A,Col_B, col_timestamp
;
If we uncomment the_problem, we get the error. Coming from PostgreSQL (my background), being able to reuse so many prior results/steps at the same level is a gift, so here I just add another SELECT layer:
WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
SELECT * FROM VALUES
(1,1,1,1,1,'testvalue',10),
(1,1,2,3,2,'value',11)
)
SELECT *,
FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as not_a_problem
FROM (
SELECT
Col_A,
Col_B,
FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
MAX(Col_D) OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
ORDER BY Col_TimeStamp DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,
IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
,Col_TimeStamp
FROM data
)
ORDER BY Col_A,Col_B, Col_TimeStamp
And then it all works. The same thing happens if you LAG, then IFF/FIRST_VALUE on the result, and then LAG that second result.
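A minimal sketch of that LAG variant, using the same data CTE (illustrative only; uncommenting same_problem reproduces the error, and the fix is again an extra SELECT layer):
WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
    SELECT * FROM VALUES
    (1,1,1,1,1,'testvalue',10),
    (1,1,2,3,2,'value',11)
)
SELECT
    Col_A,
    LAG(col_e) OVER (PARTITION BY Col_A, Col_B ORDER BY Col_TimeStamp) AS prev_e,
    IFF(col_t = 'testvalue', prev_e, NULL) AS if_test_prev_e
    /*,LAG(if_test_prev_e) OVER (PARTITION BY Col_A, Col_B
        ORDER BY Col_TimeStamp) AS same_problem*/
FROM data
ORDER BY Col_A, Col_B, Col_TimeStamp;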
"I've seen a lot of questions about this general error, but I don't get why I have it, maybe because of nested window functions..."
Snowflake supports reusing expressions at the same level (sometimes called a "lateral column alias reference").
It is perfectly fine to write:
SELECT 1+1 AS col1,
       col1 * 2 AS col2,
       CASE WHEN col1 > col2 THEN 'Y' ELSE 'NO' END AS col3
...
In standard SQL you would either have to use multiple levels of queries (CTEs) or a LATERAL JOIN. Related: PostgreSQL: using a calculated column in the same query
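For example, a rough sketch of the CTE workaround for the example above (standard SQL, names taken from that example):
WITH base AS (
    SELECT 1+1 AS col1
), derived AS (
    SELECT col1, col1 * 2 AS col2
    FROM base
)
SELECT col1,
       col2,
       CASE WHEN col1 > col2 THEN 'Y' ELSE 'NO' END AS col3
FROM derived;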
Unfortunately, the same syntax does not work for analytic functions (and I am not aware of any RDBMS that supports it):
SELECT ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) AS rn
,MAX(rn) OVER(PARTITION BY <different than prev) AS m
FROM tab;
The SQL:2016 standard includes an optional feature, T619 Nested window functions.
Here is an article showing how such a nested analytic function query could look: Nested window functions in SQL.
For now, the way to nest window functions is to use a derived table/CTE:
WITH cte AS (
SELECT ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) AS rn
,*
FROM tab
)
SELECT *, MAX(rn) OVER(PARTITION BY <different than prev) AS m
FROM cte
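A concrete sketch of that pattern, with hypothetical table and column names (orders, customer_id, order_ts, region): number the rows per customer in the CTE, then window over the numbering in the outer level.
WITH cte AS (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts) AS rn
    FROM orders t
)
SELECT *,
       MAX(rn) OVER (PARTITION BY region) AS m
FROM cte;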

How to create a calculated column in Google BigQuery?

I have data in Google BigQuery like this:
id yearmonth value
00007BR0011 201705 8.0
00007BR0011 201701 3.0
and I need to create a table that shows, per id, the difference between the values of the two year-months, like this:
id value
00007BR0011 5
The value 5 is the value in 201705 minus the value in 201701.
I am using standard SQL but don't know how to create the column with this calculation.
Sorry in advance if this is too basic, but I haven't found anything useful yet.
Perhaps a single table/result set would work for your purposes:
select id,
       (max(case when yearmonth = 201705 then value end) -
        max(case when yearmonth = 201701 then value end)
       ) as value
from t
where yearmonth in (201705, 201701)
group by id;
It's difficult to answer this based on the current level of detail, but if the smaller value is always subtracted from the larger (and both are never null), you could handle it this way using GROUP BY:
SELECT
id,
MAX(value) - MIN(value) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id
From here, you could save these results as a new table, or save this query as a view definition (which would be similar to having it calculated on the fly if the underlying data is changing).
Another option is to add a column to the table schema and then run an UPDATE query to populate it.
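For instance, a minimal sketch of the view option, using the placeholder names above (your_view is hypothetical):
CREATE OR REPLACE VIEW `your-project.your_dataset.your_view` AS
SELECT
  id,
  MAX(value) - MIN(value) AS new_value
FROM
  `your-project.your_dataset.your_table`
GROUP BY
  id;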
If the smaller value isn't always subtracted from the larger, but rather the lower date is what matters (and there are always two), another way to do this would be to use analytic (or window) functions to select the value with the lowest date:
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
LAST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) AS new_value
FROM
`your-project.your_dataset.your_table`
Because analytic functions operate on the source rows, DISTINCT is needed to eliminate the duplicate rows.
If there could be more than two rows and you need all the prior values subtracted from the latest value, you could handle it this way (which would also be safe against NULLs or only having one row):
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
(
SUM(value) OVER(PARTITION BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
You could technically do the same thing with grouping and ARRAY_AGG with dereferencing, although this method will be significantly slower on larger data sets:
SELECT
id,
(
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
-
(
SUM(value)
-
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id

Hive HQL - optimizing repetitive WINDOW clause

I have the following HQL:
SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
min(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) minTime,
max(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) maxTime
FROM t21_pam6
How can I combine the three identical window specifications into one?
The documentation (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics) shows this example:
SELECT a, SUM(b) OVER w
FROM T;
WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING)
But I don't think it works; WINDOW w AS ... does not seem to be accepted as an HQL command.
This type of optimization is something that the compiler would need to do. I don't think there is a way to ensure this programmatically.
That said, the calculation for the minimum time is totally unnecessary. Because of the order by, it should be the time in the current row. Similarly, if you can handle null values, then the expression can be simplified to:
SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
event.time as minTime,
lead(event.time, 2) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time) as maxTime
FROM t21_pam6;
Note that the maxTime calculation is slightly different because it will return NULL for the last two rows matching the conditions in each partition.
As @sergey-khudyakov responded, there was a bug in the documentation. This variant works fine:
SELECT count(*) OVER w,
min(event.time) OVER w,
max(event.time) OVER w
FROM ar3.t21_pam6
WINDOW w AS (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)