SQL - Window function to get values from previous row where value is not null - sql

I am using Exasol, in other DBMS it was possible to use analytical functions such LAST_VALUE() and specify some condition for the ORDER BY clause withing the OVER() function, like:
select ...
LAST_VALUE(customer)
OVER (PARTITION BY ID ORDER BY date_x DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) as the_last
Unfortunately I get the following error:
ERROR: [0A000] Feature not supported: windowing clause (Session:
1606983630649130920)
the same do not happen if instead of AND 1 PRECEDING I use: CURRENT ROW.
Basically what I wanted is to get the last value according the Order by that is NOT the current row. In this example it would be the $customer of the previous row.
I know that I could use the LAG(customer,1) OVER ( ...) but the problem is that I want the previous customer that is NOT null, so the offset is not always 1...
How can I do that?
Many thanks!

Does this work?
select lag(customer) over (partition by id
order by (case when customer is not null then 1 else 0 end),
date
)
You can do this with two steps:
select t.*,
max(customer) over (partition by id, max_date) as max_customer
from (select t.*,
max(case when customer is not null then date end) over (partition by id order by date) as max_date
from t
) t;

Related

Complex Ranking in SQL (Teradata)

I have a peculiar problem at hand. I need to rank in the following manner:
Each ID gets a new rank.
rank #1 is assigned to the ID with the lowest date. However, the subsequent dates for that particular ID can be higher but they will get the incremental rank w.r.t other IDs.
(E.g. ADF32 series will be considered to be ranked first as it had the lowest date, although it ends with dates 09-Nov, and RT659 starts with 13-Aug it will be ranked subsequently)
For a particular ID, if the days are consecutive then ranks are same, else they add by 1.
For a particular ID, ranks are given in date ASC.
How to formulate a query?
You need two steps:
select
id_col
,dt_col
,dense_rank()
over (order by min_dt, id_col, dt_col - rnk) as part_col
from
(
select
id_col
,dt_col
,min(dt_col)
over (partition by id_col) as min_dt
,rank()
over (partition by id_col
order by dt_col) as rnk
from tab
) as dt
dt_col - rnk caluclates the same result for consecutives dates -> same rank
Try datediff on lead/lag and then perform partitioned ranking
select t.ID_COL,t.dt_col,
rank() over(partition by t.ID_COL, t.date_diff order by t.dt_col desc) as rankk
from ( SELECT ID_COL,dt_col,
DATEDIFF(day, Lag(dt_col, 1) OVER(ORDER BY dt_col),dt_col) as date_diff FROM table1 ) t
One way to think about this problem is "when to add 1 to the rank". Well, that occurs when the previous value on a row with the same id_col differs by more than one day. Or when the row is the earliest day for an id.
This turns the problem into a cumulative sum:
select t.*,
sum(case when prev_dt_col = dt_col - 1 then 0 else 1
end) over
(order by min_dt_col, id_col, dt_col) as ranking
from (select t.*,
lag(dt_col) over (partition by id_col order by dt_col) as prev_dt_col,
min(dt_col) over (partition by id_col) as min_dt_col
from t
) t;

Can Db2 LAG function refer to itself?

I'm trying to put information to identify GROUP ID by replicating this Excel formula:
IF(OR(A2<>A1,AND(B2<>"000",B1="000")),D1+1,D1)
This formula is written when my cursor is in "D2", meaning I've referred to the newly added column value in the previous row to generate the current value.
I'd like to this with Db2 SQL, but I'm not sure how to because I'll need to do LAG function on the column I'm going to add and referring their value.
Kindly advise if having better way to do.
Thanks.
You need nested OLAP-functions, assuming ORDER BY SERIAL_NUMBER, EVENT_TIMESTAMP returns the order shown in Excel:
with cte as
(
select ...
case --IF(OR(A2<>A1,AND(B2<>"000",B1="000"))
when (lag(OPERATION)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) = '000'
and OPERATION <> '000')
or lag(SERIAL_NUMBER,1,'')
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) <> SERIAL_NUMBER
then 1
else 0
end as flag -- start of new group
from tab
)
select ...
sum(flag)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP
rows unbounded preceding) as GROUP_ID
from cte
Your code is counting the number of "breaks" in your data, where a "break" is defined as 000 or the value in the first column changing.
In SQL, you can do this as a cumulative sum:
select t.*,
sum(case when prev_serial_number = serial_number or operation <> '000'
then 0 else 1
end) over (order by event_timestamp rows between unbounded preceding and current row) as column_d
from (select t.*,
lag(serial_number) over (order by event_timestamp) as prev_serial_number
from t
) t

How create a calculated column in google bigquery?

I have a data in Google Bigquery like this
id yearmonth value
00007BR0011 201705 8.0
00007BR0011 201701 3.0
and I need to create a table where per id shows the subtraction by year in order to create something like this
id value
00007BR0011 5
The value 5 is the subtraction of the value in 201705 minus the value in 201701
I am using standard SQL, but don't know how to create the column with the calculation.
Sorry in advance if it is too basic, but didn't find anything yet useful.
Perhaps a single table/result set would work for your purposes:
select id,
(max(case when yearmonth = 201705 then value end) -
max(case when yearmonth = 201701 then value end) -
)
from t
where yearmonth in (201705, 201701)
group by id;
It's difficult to answer this based on the current level of detail, but if the smaller value is always subtracted from the larger (and both are never null), you could handle it this way using GROUP BY:
SELECT
id,
MAX(value) - MIN(value) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id
From here, you could save these results as a new table, or save this query as a view definition (which would be similar to having it calculated on the fly if the underlying data is changing).
Another option is to add a column under the table schema, then run an UPDATE query to populate it.
If the smaller value isn't always subtracted from the larger, but rather the lower date is what matters (and there are always two), another way to do this would be to use analytic (or window) functions to select the value with the lowest date:
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
LAST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) AS new_value
FROM
`your-project.your_dataset.your_table`
Because analytic functions operate on the source rows, DISTINCT is needed to eliminate the duplicate rows.
If there could be more than two rows and you need all the prior values subtracted from the latest value, you could handle it this way (which would also be safe against NULLs or only having one row):
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
(
SUM(value) OVER(PARTITION BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
You could technically do the same thing with grouping and ARRAY_AGG with dereferencing, although this method will be significantly slower on larger data sets:
SELECT
id,
(
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
-
(
SUM(value)
-
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id

SQL Find the minimum date based on consecutive values

I'm having trouble constructing a query that can find consecutive values meeting a condition. Example data below, note that Date is sorted DESC and is grouped by ID.
To be selected, for each ID, the most recent RESULT must be 'Fail', and what I need back is the earliest date in that run of 'Fails'. For ID==1, only the 1st two values are of interest (the last doesn't count due to prior 'Complete'. ID==2 doesn't count at all, failing the first condition, and for ID==3, only the first value matters.
A result table might be:
The trick seems to be doing some type of run-length encoding, but even with several attempts manipulating ROW_NUM and an attempt at the tabibitosan method for grouping consecutive values, I've been unable to gain traction.
Any help would be appreciated.
If your database supports window functions, you can do
select id, case when result='Fail' then earliest_fail_date end earliest_fail_date
from (
select t.*
,row_number() over(partition by id order by dt desc) rn
,min(case when result = 'Fail' then dt end) over(partition by id) earliest_fail_date
from tablename t
) x
where rn=1
Use row_number to get the latest row in the table. min() over() to get the earliest fail date for each id. If the first row has status Fail, you select the earliest_fail_date or else it would be null.
It should be noted that the expected result for id=1 is wrong. It should be 2016-09-20 as it is the earliest fail date.
Edit: Having re-read the question, i think this is what you might be looking for. Getting the minimum Fail date from the latest consecutive groups of Fail rows.
with grps as (
select t.*,row_number() over(partition by id order by dt desc) rn
,row_number() over(partition by id order by dt)-row_number() over(partition by id,result order by dt) grp
from tablename t
)
,maxfailgrp as (
select g.*,
max(case when result = 'Fail' then grp end) over(partition by id) maxgrp
from grps g
)
select id,
case when result = 'Fail' then (select min(dt) from maxfailgrp where id = m.id and grp=m.maxgrp) end earliest_fail_date
from maxfailgrp m
where rn=1
Sample Demo

Window functions assume null sorts low - expected behavior?

The following will not properly count nulls within a window...
select *,
sum(case when value is null then 1 else 0 end)
over(partition by id
order by coalesce(value,9999999)) as NumNulls,
row_number()
over(partition by id
order by coalesce(value,9999999)) as RN
from temp
Obviously, the problem can be solved using rows between unbounded preceding and unbounded following, so it's not a big deal. But, given my understanding of SQL, I would not have expected this result. Have I missed a finer point of the language, or is this unexpected behavior?
Here's a SQL Fiddle: http://sqlfiddle.com/#!6/98285/3
To reach your expected results you need to this
SELECT *
,SUM(CASE WHEN value IS NULL THEN 1 ELSE 0
END) OVER ( PARTITION BY id ORDER BY value ) AS NumNulls
,ROW_NUMBER() OVER ( PARTITION BY id ORDER BY value) AS RN
FROM #temp
because you used coalesce it could not partition properly this your count was off.