Window functions assume null sorts low - expected behavior? - sql

The following will not properly count nulls within a window...
select *,
sum(case when value is null then 1 else 0 end)
over(partition by id
order by coalesce(value,9999999)) as NumNulls,
row_number()
over(partition by id
order by coalesce(value,9999999)) as RN
from temp
Obviously, the problem can be solved using rows between unbounded preceding and unbounded following, so it's not a big deal. But, given my understanding of SQL, I would not have expected this result. Have I missed a finer point of the language, or is this unexpected behavior?
Here's a SQL Fiddle: http://sqlfiddle.com/#!6/98285/3

To reach your expected results you need to this
SELECT *
,SUM(CASE WHEN value IS NULL THEN 1 ELSE 0
END) OVER ( PARTITION BY id ORDER BY value ) AS NumNulls
,ROW_NUMBER() OVER ( PARTITION BY id ORDER BY value) AS RN
FROM #temp
because you used coalesce it could not partition properly this your count was off.

Related

Row_number skip values

I have a table like this:
The idea were to count only when I have "Include" at column include_appt, when it finds NULL, it should skip set is as "NULL" or "0" and on next found "Include" back to counting where it stopped.
The screenshot above I was almost able to do it but unfortunately the count didn't reset on next value.
PS: I can't use over partition because I have to keep the order by id ASC
I suggest using the DENSE_RANK() with the columns you have hidden (--*,):
SELECT
row_num AS id,
include_appt,
CASE WHEN include_appt is not null
THEN ROW_NUMBER() OVER(ORDER BY (SELECT 0))
+ 1
- DENSE_RANK() OVER(
PARTITION BY /*some hidden columns*/
ORDER BY/*some hidden columns*/)
ELSE NULL
END AS row_num2
FROM C
ORDER BY row_num
Then the result will be:
enter image description here
If you are trying to prevent row numbers being added for NULL/0 values, why not try a query like this instead?
SELECT
row_num AS id,
include_appt,
ROW_NUMBER() OVER
(
ORDER BY (SELECT 0)
) AS row_num2
FROM C
WHERE ISNULL(C.include_appt, 0) <> 0
ORDER BY row_num
I would recommend reconsidering the column names/aliases you want to have displayed in your final result to avoid confusion, but the above should effectively do what you are wanting.
You need a PARTITION BY clause
SELECT
row_num AS id,
include_appt,
CASE WHEN include_appt IS NULL
THEN 0
ELSE
ROW_NUMBER() OVER (PARTITION BY include_appt ORDER BY (SELECT 0))
END AS row_num2
FROM C
ORDER BY row_num
SELECT id, include_appt,
CASE WHEN include_appt IS NULL THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY include_appt ORDER BY id ASC)
END AS row_num
FROM #1 ORDER BY id asc
This can be easily done with a partition by include_appt as in another answer below, yet after playing around with the query plans I've decided that it is still worthwhile to consider this slightly different approach which might offer a performance boost. I believe the benefit is gained by being able to use the clustered index without involving a sort on the flag column:
select id, flag,
case when flag is not null
then row_number() over (order by id)
- count(case when flag is null then 1 end) over (order by id)
else 0 end /* count up the skips */ as new_rn
from T
order by id
Examples (including a "reset" behavior): https://dbfiddle.uk/?rdbms=sqlserver_2014&fiddle=c9f4c187c494d2a402e43a3b24924581
Performance comparison:
https://dbfiddle.uk/?rdbms=sqlserver_2014&fiddle=719f7bd26135ab498d11c786f1b1b28b

Can Db2 LAG function refer to itself?

I'm trying to put information to identify GROUP ID by replicating this Excel formula:
IF(OR(A2<>A1,AND(B2<>"000",B1="000")),D1+1,D1)
This formula is written when my cursor is in "D2", meaning I've referred to the newly added column value in the previous row to generate the current value.
I'd like to this with Db2 SQL, but I'm not sure how to because I'll need to do LAG function on the column I'm going to add and referring their value.
Kindly advise if having better way to do.
Thanks.
You need nested OLAP-functions, assuming ORDER BY SERIAL_NUMBER, EVENT_TIMESTAMP returns the order shown in Excel:
with cte as
(
select ...
case --IF(OR(A2<>A1,AND(B2<>"000",B1="000"))
when (lag(OPERATION)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) = '000'
and OPERATION <> '000')
or lag(SERIAL_NUMBER,1,'')
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) <> SERIAL_NUMBER
then 1
else 0
end as flag -- start of new group
from tab
)
select ...
sum(flag)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP
rows unbounded preceding) as GROUP_ID
from cte
Your code is counting the number of "breaks" in your data, where a "break" is defined as 000 or the value in the first column changing.
In SQL, you can do this as a cumulative sum:
select t.*,
sum(case when prev_serial_number = serial_number or operation <> '000'
then 0 else 1
end) over (order by event_timestamp rows between unbounded preceding and current row) as column_d
from (select t.*,
lag(serial_number) over (order by event_timestamp) as prev_serial_number
from t
) t

How create a calculated column in google bigquery?

I have a data in Google Bigquery like this
id yearmonth value
00007BR0011 201705 8.0
00007BR0011 201701 3.0
and I need to create a table where per id shows the subtraction by year in order to create something like this
id value
00007BR0011 5
The value 5 is the subtraction of the value in 201705 minus the value in 201701
I am using standard SQL, but don't know how to create the column with the calculation.
Sorry in advance if it is too basic, but didn't find anything yet useful.
Perhaps a single table/result set would work for your purposes:
select id,
(max(case when yearmonth = 201705 then value end) -
max(case when yearmonth = 201701 then value end) -
)
from t
where yearmonth in (201705, 201701)
group by id;
It's difficult to answer this based on the current level of detail, but if the smaller value is always subtracted from the larger (and both are never null), you could handle it this way using GROUP BY:
SELECT
id,
MAX(value) - MIN(value) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id
From here, you could save these results as a new table, or save this query as a view definition (which would be similar to having it calculated on the fly if the underlying data is changing).
Another option is to add a column under the table schema, then run an UPDATE query to populate it.
If the smaller value isn't always subtracted from the larger, but rather the lower date is what matters (and there are always two), another way to do this would be to use analytic (or window) functions to select the value with the lowest date:
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
LAST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) AS new_value
FROM
`your-project.your_dataset.your_table`
Because analytic functions operate on the source rows, DISTINCT is needed to eliminate the duplicate rows.
If there could be more than two rows and you need all the prior values subtracted from the latest value, you could handle it this way (which would also be safe against NULLs or only having one row):
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
(
SUM(value) OVER(PARTITION BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
You could technically do the same thing with grouping and ARRAY_AGG with dereferencing, although this method will be significantly slower on larger data sets:
SELECT
id,
(
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
-
(
SUM(value)
-
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id

SQL - Window function to get values from previous row where value is not null

I am using Exasol, in other DBMS it was possible to use analytical functions such LAST_VALUE() and specify some condition for the ORDER BY clause withing the OVER() function, like:
select ...
LAST_VALUE(customer)
OVER (PARTITION BY ID ORDER BY date_x DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) as the_last
Unfortunately I get the following error:
ERROR: [0A000] Feature not supported: windowing clause (Session:
1606983630649130920)
the same do not happen if instead of AND 1 PRECEDING I use: CURRENT ROW.
Basically what I wanted is to get the last value according the Order by that is NOT the current row. In this example it would be the $customer of the previous row.
I know that I could use the LAG(customer,1) OVER ( ...) but the problem is that I want the previous customer that is NOT null, so the offset is not always 1...
How can I do that?
Many thanks!
Does this work?
select lag(customer) over (partition by id
order by (case when customer is not null then 1 else 0 end),
date
)
You can do this with two steps:
select t.*,
max(customer) over (partition by id, max_date) as max_customer
from (select t.*,
max(case when customer is not null then date end) over (partition by id order by date) as max_date
from t
) t;

SQL Find the minimum date based on consecutive values

I'm having trouble constructing a query that can find consecutive values meeting a condition. Example data below, note that Date is sorted DESC and is grouped by ID.
To be selected, for each ID, the most recent RESULT must be 'Fail', and what I need back is the earliest date in that run of 'Fails'. For ID==1, only the 1st two values are of interest (the last doesn't count due to prior 'Complete'. ID==2 doesn't count at all, failing the first condition, and for ID==3, only the first value matters.
A result table might be:
The trick seems to be doing some type of run-length encoding, but even with several attempts manipulating ROW_NUM and an attempt at the tabibitosan method for grouping consecutive values, I've been unable to gain traction.
Any help would be appreciated.
If your database supports window functions, you can do
select id, case when result='Fail' then earliest_fail_date end earliest_fail_date
from (
select t.*
,row_number() over(partition by id order by dt desc) rn
,min(case when result = 'Fail' then dt end) over(partition by id) earliest_fail_date
from tablename t
) x
where rn=1
Use row_number to get the latest row in the table. min() over() to get the earliest fail date for each id. If the first row has status Fail, you select the earliest_fail_date or else it would be null.
It should be noted that the expected result for id=1 is wrong. It should be 2016-09-20 as it is the earliest fail date.
Edit: Having re-read the question, i think this is what you might be looking for. Getting the minimum Fail date from the latest consecutive groups of Fail rows.
with grps as (
select t.*,row_number() over(partition by id order by dt desc) rn
,row_number() over(partition by id order by dt)-row_number() over(partition by id,result order by dt) grp
from tablename t
)
,maxfailgrp as (
select g.*,
max(case when result = 'Fail' then grp end) over(partition by id) maxgrp
from grps g
)
select id,
case when result = 'Fail' then (select min(dt) from maxfailgrp where id = m.id and grp=m.maxgrp) end earliest_fail_date
from maxfailgrp m
where rn=1
Sample Demo