BigQuery: Get the first non-null value in a window? - google-bigquery

I would like to obtain the first non-null, non-"undefined" value in a list of values as part of a window.
Minimal example:
Given the following code:
SELECT
FIRST_VALUE(
CASE WHEN val = "undefined" THEN NULL ELSE val END
IGNORE NULLS
)
OVER (ORDER BY order_key)
AS res
FROM (
SELECT 1 AS order_key, CAST(NULL AS STRING) AS val
UNION ALL
SELECT 2 AS order_key, "undefined" AS val
UNION ALL
SELECT 3 AS order_key, "value" AS val
) base
I'd expect
res
value
value
value
as the result set. Yet, the result given by the above is the following:
res
null
null
value
The documentation states the following:
FIRST_VALUE (value_expression [{RESPECT | IGNORE} NULLS])
Returns the value of the value_expression for the first row in the current window frame.
This function includes NULL values in the calculation unless IGNORE NULLS is present. If IGNORE NULLS is present, the function excludes NULL values from the calculation.
Yet it seems like value_expression is not what is tested for NULLs in this case.
It seems that instead FIRST_VALUE checks NULLs against the source field, not the CASE statement (effectively value_expression in the above).
While the problem can easily be fixed by doing the case as part of the subquery, I'd like to better understand why this is an issue. Why does FIRST_VALUE not ignore the NULLs provided through the CASE statement?

Alternative to the logic above:
If you are willing to remodel your query, instead of using a window function (FIRST_VALUE), the same effect can be achieved via an ARRAY_AGG(expr IGNORE NULLS ORDER BY ordering)[OFFSET(0)]:
SELECT
id,
ARRAY_AGG(
CASE WHEN val = 'undefined' THEN NULL ELSE val END
IGNORE NULLS
ORDER BY order_key
)[OFFSET(0)]
AS res
FROM (
SELECT 1 AS id, 1 AS order_key, CAST(NULL AS STRING) AS val
UNION ALL
SELECT 1 AS id, 2 AS order_key, 'undefined' AS val
UNION ALL
SELECT 1 AS id, 3 AS order_key, "value" AS val
UNION ALL
SELECT 2 AS id, 1 AS order_key, CAST(NULL AS STRING) AS val
UNION ALL
SELECT 2 AS id, 2 AS order_key, 'undefined' AS val
UNION ALL
SELECT 2 AS id, 3 AS order_key, "value" AS val
) base
GROUP BY id
Given an empty record set for the group, ARRAY_AGG(...)[OFFSET(0)] will return NULL.
Given a non-empty record set for the group, ARRAY_AGG(...)[OFFSET(0)] will return the first non-NULL result of value_expression, ordered by the ORDER BY clause provided.
The only downside (besides maybe performance?) is that you'll need to create a common table expression with this logic and then join it with the table that was using window functions.
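The join mentioned above could look roughly like the following. This is a minimal sketch, assuming the per-row columns are still needed next to the aggregated value; the CTE name first_values and the trimmed sample data are made up for illustration.

```sql
WITH base AS (
  SELECT 1 AS id, 1 AS order_key, CAST(NULL AS STRING) AS val UNION ALL
  SELECT 1 AS id, 2 AS order_key, 'undefined' AS val UNION ALL
  SELECT 1 AS id, 3 AS order_key, 'value' AS val
),
-- Hypothetical CTE: one row per id with its first usable value.
first_values AS (
  SELECT
    id,
    ARRAY_AGG(
      CASE WHEN val = 'undefined' THEN NULL ELSE val END
      IGNORE NULLS
      ORDER BY order_key
    )[OFFSET(0)] AS res
  FROM base
  GROUP BY id
)
-- Attach the per-id result back to every original row.
SELECT b.id, b.order_key, b.val, f.res
FROM base AS b
LEFT JOIN first_values AS f
  ON f.id = b.id
ORDER BY b.id, b.order_key
```

Each row of base keeps its own columns and gains the res value computed for its id.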

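To get the expected result, you just need to add DESC to the ORDER BY, as below. With an ORDER BY and no explicit frame, the window only covers the rows up to the current one, so sorting in descending order puts the non-NULL row first and every frame contains it.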
SELECT
FIRST_VALUE(
CASE WHEN val = "undefined" THEN NULL ELSE val END
IGNORE NULLS
)
OVER (ORDER BY order_key DESC)
AS res
FROM (
SELECT 1 AS order_key, CAST(NULL AS STRING) AS val UNION ALL
SELECT 2 AS order_key, "undefined" AS val UNION ALL
SELECT 3 AS order_key, "value" AS val
) base
so the result now is
res
value
value
value
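An alternative that keeps the ascending order is to spell out the window frame. When a window has an ORDER BY but no frame clause, BigQuery defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so for the first two rows of the question's data the frame holds only NULLs once the CASE has run. IGNORE NULLS is applied to value_expression as documented; there is simply no non-NULL value in the frame yet. A minimal sketch, reusing the sample data from the question:

```sql
SELECT
  order_key,
  FIRST_VALUE(
    CASE WHEN val = "undefined" THEN NULL ELSE val END
    IGNORE NULLS
  )
  OVER (
    ORDER BY order_key
    -- Widen the frame to the whole partition so later rows are visible too.
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS res
FROM (
  SELECT 1 AS order_key, CAST(NULL AS STRING) AS val UNION ALL
  SELECT 2 AS order_key, "undefined" AS val UNION ALL
  SELECT 3 AS order_key, "value" AS val
) base
```

With the full frame, every row sees the row with order_key 3, so res is "value" on all three rows.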

Related

BIGQUERY An internal error occurred and the request could not be completed Error: 80038528

I am trying to create a function, and I don't understand what the problem is. The function is created correctly when:
I delete the PIVOT
I select from the table and unnest, instead of from the unnest only (so FROM the table, UNNEST(a))
CREATE OR REPLACE FUNCTION `dataset.function_naming` (a ARRAY<STRUCT<ROW_ID STRING, KEY STRING, VALUE STRING>>, id_one STRING, id_two STRING, start_date DATE, end_date DATE) RETURNS INT64
AS (
with tmp1 as (
select ROW_ID,X,Y,Z,W
from
(
select prop.ROW_ID,prop.KEY, prop.VALUE
from unnest(a) prop
where prop.KEY in ('X','Y','Z','W')
)
PIVOT
(
MAX(VALUE)
FOR UPPER(KEY) in('X','Y','Z','W')
) as PIVOT
)
select case when X is not null then 1,
when Y is not null then 2,
when Z is not null then 2,
when W is not null then 2
else 0
from tmp1
);
Thanks all.
I see a few minor issues in your code:
missing extra (...) around the function body
extra commas (,) within the CASE statement
missing END to close the CASE statement
So, try the below:
CREATE OR REPLACE FUNCTION `dataset.function_naming` (
a ARRAY<STRUCT<ROW_ID STRING, KEY STRING, VALUE STRING>>,
id_one STRING,
id_two STRING,
start_date DATE,
end_date DATE
) RETURNS INT64
AS ((
with tmp1 as (
select ROW_ID,X,Y,Z,W
from
(
select prop.ROW_ID,prop.KEY, prop.VALUE
from unnest(a) prop
where prop.KEY in ('X','Y','Z','W')
)
PIVOT
(
MAX(VALUE)
FOR UPPER(KEY) in('X','Y','Z','W')
) as PIVOT
)
select case when X is not null then 1
when Y is not null then 2
when Z is not null then 2
when W is not null then 2
else 0
end
from tmp1
));
It seems there is an internal issue when using PIVOT together with UNNEST on the array. You can use the following, which implements similar logic, and also open a case in the issue tracker as a BigQuery issue with Google Cloud Support.
CREATE OR REPLACE FUNCTION `<dataset>.function_naming` (
a ARRAY<STRUCT<ROW_ID STRING, KEY STRING, VALUE STRING>>,
id_one STRING,
id_two STRING,
start_date DATE,
end_date DATE
) RETURNS INT64
AS (( WITH tmp AS (
SELECT
CASE
WHEN KEY="X" THEN 1
WHEN KEY="Y" THEN 2
WHEN KEY="Z" THEN 2
WHEN KEY="W" THEN 2
ELSE
0
END
teste_column
#-- FROM ( SELECT UPPER(prop.KEY) KEY, MAX(prop.VALUE) VALUE FROM -- following your query pattern, but not really necessary
FROM ( SELECT UPPER(prop.KEY) KEY FROM
UNNEST(a) prop
WHERE
UPPER(key) IN ('X', 'Y', 'Z', 'W')
GROUP BY key )
ORDER BY teste_column DESC LIMIT 1 )
SELECT * FROM tmp
UNION ALL
SELECT 0 teste_column
FROM (SELECT 1)
LEFT JOIN tmp
ON FALSE
WHERE NOT EXISTS ( SELECT 1 FROM tmp)
));
#--- Testing the function:
select `<project>.<dataset>.function_naming`([STRUCT("1" AS ROW_ID, "x" AS KEY, "10"AS VALUE), STRUCT("1" AS ROW_ID, "x" AS KEY, "20"AS VALUE), STRUCT("1" AS ROW_ID, "w" AS KEY, "20"AS VALUE), STRUCT("1" AS ROW_ID, "y" AS KEY, "20"AS VALUE)], "1", "2", "2022-12-10", "2022-12-10")
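Another way to sidestep PIVOT is conditional aggregation. The sketch below assumes the array describes a single ROW_ID, so one INT64 is enough, and it matches keys case-insensitively like the rewrite above; it keeps the priority of the original CASE (X first, then Y, Z, W). It has not been checked against the internal error itself.

```sql
CREATE OR REPLACE FUNCTION `dataset.function_naming` (
  a ARRAY<STRUCT<ROW_ID STRING, KEY STRING, VALUE STRING>>,
  id_one STRING,
  id_two STRING,
  start_date DATE,
  end_date DATE
) RETURNS INT64
AS ((
  SELECT
    CASE
      -- X has priority, mirroring the order of the original CASE branches.
      WHEN COUNTIF(UPPER(prop.KEY) = 'X' AND prop.VALUE IS NOT NULL) > 0 THEN 1
      WHEN COUNTIF(UPPER(prop.KEY) IN ('Y', 'Z', 'W') AND prop.VALUE IS NOT NULL) > 0 THEN 2
      ELSE 0
    END
  -- An aggregate without GROUP BY yields exactly one row,
  -- even when the array is empty, so the ELSE 0 branch covers that case.
  FROM UNNEST(a) AS prop
));
```

Because the body is a single aggregate, there is no need for the extra UNION ALL branch that supplies a default 0.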

ARRAY_CONCAT() returns empty array

I am compiling a list of values per user from 2 different columns into a single array like:
with test as (
select 1 as userId, 'something' as value1, cast(null as string) as value2
union all
select 1 as userId, cast(null as string), cast(null as string)
)
select
userId,
ARRAY_CONCAT(
ARRAY_AGG(distinct value1 ignore nulls ),
ARRAY_AGG(distinct value2 ignore nulls )
) as combo,
from test
group by userId
Everything works fine up to ARRAY_AGG(), but then ARRAY_CONCAT() just won't have it and returns an empty array [], whereas I expect it to be ['something'].
I am at a loss as to why this is happening and whether I can force a workaround here.
"I am at a loss as to why this is happening ..."
The ARRAY_CONCAT function returns NULL if any input argument is NULL. Here every value2 is NULL, so ARRAY_AGG(distinct value2 ignore nulls) returns NULL rather than an empty array, and the resulting NULL array is displayed as [].
"... and whether I can force a workaround here"
Use the workaround below:
select
userid,
array_concat(
ifnull(array_agg(distinct value1 ignore nulls ), []),
ifnull(array_agg(distinct value2 ignore nulls ), [])
) as combo,
from test
group by userid
If applied to the sample data in your question, the output is a single row: userId 1 with combo ['something'].
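The NULL propagation can be seen in isolation with literal arrays. This is a small sketch; the column aliases are made up for illustration.

```sql
SELECT
  -- One NULL argument makes the whole result NULL (shown as [] in results).
  ARRAY_CONCAT(['something'], CAST(NULL AS ARRAY<STRING>)) IS NULL AS concat_is_null,
  -- Swapping the NULL array for an empty one keeps the other elements.
  ARRAY_CONCAT(
    ['something'],
    IFNULL(CAST(NULL AS ARRAY<STRING>), [])
  ) AS concat_with_ifnull
```

The first column makes the NULL visible, since a NULL array and an empty array look the same in the query results.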

Window function for is_unique?

I'm looking to see if a value is unique in the column. For example:
; with tbl (value) as (
select 'hello' UNION ALL
select 'hello' UNION ALL
select 'abc' UNION ALL
select null
) select
value,
COUNT(1) OVER (PARTITION BY VALUE) = 1 value_is_unique
from tbl
And the result:
VALUE VALUE_IS_UNIQUE
hello FALSE
hello FALSE
abc TRUE
NULL TRUE
Is there a window function that basically does what I'm doing with the COUNT(1) OVER (PARTITION BY VALUE) = 1? Or is the above the suggested way to do this?
https://docs.snowflake.com/en/sql-reference/functions-analytic.html
There's no built-in is_unique function. Counting and comparing to one, as you did, is probably the best approach to achieve this functionality.
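One detail worth checking in the counting approach is how NULL should be treated. The sketch below reuses the sample data and compares COUNT(1) with COUNT(value); the two column aliases are made up for illustration.

```sql
with tbl (value) as (
  select 'hello' union all
  select 'hello' union all
  select 'abc' union all
  select null
)
select
  value,
  -- COUNT(1) counts rows, so the single NULL row is reported as unique.
  count(1) over (partition by value) = 1 as unique_counting_rows,
  -- COUNT(value) skips NULLs, so the NULL row gets a count of 0 and is not unique.
  count(value) over (partition by value) = 1 as unique_ignoring_nulls
from tbl
```

Which of the two is right depends on whether a lone NULL should count as a unique value.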

Find and replace pattern inside BigQuery string

Here is my BigQuery table. I am trying to find out the URLs that were displayed but not viewed.
create table dataset.url_visits(ID INT64 ,displayed_url string , viewed_url string);
select * from dataset.url_visits;
ID Displayed_URL Viewed_URL
1 url11,url12 url12
2 url9,url12,url13 url9
3 url1,url2,url3 NULL
In this example, I want to display
ID Displayed_URL Viewed_URL unviewed_URL
1 url11,url12 url12 url11
2 url9,url12,url13 url9 url12,url13
3 url1,url2,url3 NULL url1,url2,url3
Split each string into an array and unnest it. Use a CASE to check whether each displayed URL appears among the viewed URLs, then combine the results into an array or a string.
Select ID, string_agg(viewing ) as viewed,
string_agg(not_viewing ) as not_viewed,
array_agg(viewing ignore nulls) as viewed_array
from (
Select ID ,
case when display in unnest(split(Viewed_URL)) then display else null end as viewing,
case when display in unnest(split(Viewed_URL)) then null else display end as not_viewing,
from (
Select 1 as ID, "url11,url12" as Displayed_URL, "url12" as Viewed_URL UNION ALL
Select 2, "url9,url12,url13", "url9" UNION ALL
Select 3, "url1,url2,url3", NULL UNION ALL
Select 4, "url9,url12,url13", "url9,url12"
),unnest(split(Displayed_URL)) as display
)
group by 1
Consider below approach
select *, (
select string_agg(url)
from unnest(split(Displayed_URL)) url
where url != ifnull(Viewed_URL, '')
) unviewed_URL
from `project.dataset.table`
If applied to the sample data in your question, the output matches the expected table shown in the question.
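The comparison url != ifnull(Viewed_URL, '') treats Viewed_URL as a single URL, which holds for the question's sample data. If Viewed_URL can itself be a comma-separated list (an assumption, as in row 4 of the first answer), a variant along these lines may be closer to what is needed:

```sql
select *, (
  select string_agg(url order by pos)
  from unnest(split(Displayed_URL)) url with offset pos
  -- Compare against each viewed URL, not the whole comma-separated string.
  where url not in unnest(split(ifnull(Viewed_URL, '')))
) unviewed_URL
from (
  select 1 as ID, 'url11,url12' as Displayed_URL, 'url12' as Viewed_URL union all
  select 3, 'url1,url2,url3', null union all
  select 4, 'url9,url12,url13', 'url9,url12'
)
```

The order by pos keeps the unviewed URLs in the order they were displayed.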

Scalar subquery produced more than one element exception when aggregating multiple unnest elements

I have the following query for the BigQuery instance:
CREATE TABLE my_dataset.PRODUCT AS (
SELECT "1,2,3" AS PRODUCT_DESCRIPTION_IDS UNION ALL
SELECT "2,3" AS PRODUCT_DESCRIPTION_IDS UNION ALL
SELECT "1" AS PRODUCT_DESCRIPTION_IDS
);
CREATE TABLE my_dataset.DESCRIPTION AS (
SELECT "1" AS DESCRIPTION_ID, "VALUE1" AS DESCRIPTION_VALUE UNION ALL
SELECT "2" AS DESCRIPTION_ID, "VALUE2" AS DESCRIPTION_VALUE UNION ALL
SELECT "3" AS DESCRIPTION_ID, "VALUE3" AS DESCRIPTION_VALUE
);
SELECT
FORMAT('%T', ARRAY_AGG(ELEMENT)) AS desc_ids,
FORMAT('%T', ARRAY_AGG((SELECT DESCRIPTION_VALUE FROM my_dataset.DESCRIPTION WHERE DESCRIPTION_ID = ELEMENT))) AS desc_values,
FROM UNNEST((
SELECT
SPLIT(PRODUCT_DESCRIPTION_IDS, ',') as arr
FROM my_dataset.PRODUCT
limit 1
)) AS ELEMENT
It executes fine but only when I have limit 1 specified, otherwise I receive an exception:
Scalar subquery produced more than one element
How should I update my query to receive not only one resulting row but all of them?
Consider below
select
format('%T', array_agg(ELEMENT)) as desc_ids,
format('%T', array_agg(DESCRIPTION_VALUE)) as desc_values
from PRODUCT t, unnest(split(PRODUCT_DESCRIPTION_IDS)) as ELEMENT
left join DESCRIPTION
on ELEMENT = DESCRIPTION_ID
group by format('%T',t)
If applied to the sample data in your question, the output has one row per PRODUCT row, each pairing the IDs with their description values.
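The original query fails without limit 1 because the subquery passed to UNNEST must return a single array, while the PRODUCT table returns one array per row. A sketch of the same join idea that also pins the element order, assuming each PRODUCT_DESCRIPTION_IDS string is unique so it can serve as the grouping key (the sample data is inlined as CTEs):

```sql
with PRODUCT as (
  select '1,2,3' as PRODUCT_DESCRIPTION_IDS union all
  select '2,3' union all
  select '1'
),
DESCRIPTION as (
  select '1' as DESCRIPTION_ID, 'VALUE1' as DESCRIPTION_VALUE union all
  select '2', 'VALUE2' union all
  select '3', 'VALUE3'
)
select
  p.PRODUCT_DESCRIPTION_IDS,
  -- WITH OFFSET exposes each element's position so both arrays keep the source order.
  array_agg(ELEMENT order by pos) as desc_ids,
  -- IGNORE NULLS drops IDs without a matching description instead of failing on a NULL element.
  array_agg(d.DESCRIPTION_VALUE ignore nulls order by pos) as desc_values
from PRODUCT as p
cross join unnest(split(p.PRODUCT_DESCRIPTION_IDS, ',')) as ELEMENT with offset as pos
left join DESCRIPTION as d
  on d.DESCRIPTION_ID = ELEMENT
group by p.PRODUCT_DESCRIPTION_IDS
```

Without the ORDER BY inside ARRAY_AGG, the order of the elements in each array is not guaranteed.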