Extracting an element from a cell that contains a dictionary - google-bigquery

I have the value category:ops,client:acompany,type:sometype which, as you can see, is effectively a dictionary. I would like to extract the value for the dictionary key client; in other words, I want to extract acompany.
Here's how I've done it:
select CASE WHEN INSTR(client_step1, ",") > 0 THEN SUBSTR(client_step1, 0, INSTR(client_step1, ",") - 1)
ELSE client_step1
END AS client
from (
select CASE WHEN INSTR(dict, "client") > 0 THEN SUBSTR(dict, INSTR(dict, "client") + 7)
ELSE CAST(NULL as STRING)
END as client_step1
from (
select "category:ops,client:acompany,type:sometype" as dict
)
)
but that seems rather verbose (and frankly, slicing up strings with a combination of INSTR(), SUBSTR() and derived tables feels a bit meh). I'm wondering if there's a better way to do it that I don't know about (I'm fairly new to bq).
thanks in advance

It sounds like you want the REGEXP_EXTRACT function. Here is an example:
SELECT REGEXP_EXTRACT(dict, r'client:([^,:]+)') AS client_step1
FROM (
SELECT "category:ops,client:acompany,type:sometype" AS dict
)
This returns the string acompany as its result. The regexp looks for client: inside the string and matches everything after it up until the next , or :, or the end of the string.
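If you want to sanity-check the pattern outside BigQuery, a quick sketch in Python works, since `re` handles this pattern the same way BigQuery's RE2 engine does (the variable names here are just for illustration):

```python
import re

dict_str = "category:ops,client:acompany,type:sometype"

# 'client:' anchors the match; ([^,:]+) captures everything up to
# the next ',' or ':' or the end of the string
match = re.search(r'client:([^,:]+)', dict_str)
client = match.group(1) if match else None
print(client)  # acompany
```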

Another option for parsing a dictionary like yours is below (for BigQuery Standard SQL)
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "category:ops,client:acompany,type:sometype" AS dict
)
SELECT id,
ARRAY(
SELECT AS STRUCT
SPLIT(x, ':')[OFFSET(0)] key,
SPLIT(x, ':')[OFFSET(1)] value
FROM UNNEST(SPLIT(dict)) x
) items
FROM `project.dataset.table`
with result as below

Row  id  items.key  items.value
1    1   category   ops
         client     acompany
         type       sometype
As you can see, this parses out all the dictionary items.
If you still need only a specific element's value, you can use the below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "category:ops,client:acompany,type:sometype" AS dict
)
SELECT id,
( SELECT
SPLIT(x, ':')[OFFSET(1)]
FROM UNNEST(SPLIT(dict)) x
WHERE SPLIT(x, ':')[OFFSET(0)] = 'client'
LIMIT 1
) client
FROM `project.dataset.table`
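The SPLIT-twice logic above (first on ',' for items, then on ':' for key/value) is the same as building a plain dictionary. For comparison, a minimal Python sketch of that parsing (the function name is mine, not part of any API):

```python
def parse_dict(s: str) -> dict:
    # split on ',' for items, then on ':' for key/value, like the SQL above
    return dict(item.split(':', 1) for item in s.split(','))

d = parse_dict("category:ops,client:acompany,type:sometype")
print(d['client'])  # acompany
```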

Related

Regex that matches strings with specific text not between text in BigQuery

I have the following strings:
step_1->step_2->step_3
step_1->step_3
step_1->step_2->step_1->step_3
step_1->step_2->step_1->step_2->step_3
What I would like to do is to capture the ones where there is no step_2 between step_1 and step_3.
The results should be like this:
string result
step_1->step_2->step_3 false
step_1->step_3 true
step_1->step_2->step_1->step_3 true
step_1->step_2->step_1->step_2->step_3 false
I have tried to use the negative lookahead but I found out that BigQuery doesn't support it. Any ideas?
You are essentially looking for when the pattern does not exist. The following regex, embedded in a CASE statement, supports that. It would not handle a string containing both conditions at once; however, that scenario did not appear in your sample data.
Try the following:
with sample_data as (
select 'step_1->step_2->step_3' as string union all
select 'step_1->step_3' union all
select 'step_1->step_2->step_1->step_3' union all
select 'step_1->step_2->step_1->step_2->step_3' union all
select 'step_1->step_2->step_1->step_2->step_2->step_3' union all
select 'step_1->step_2->step_1->step_2->step_2'
)
select
string,
-- CASE WHEN regexp_extract(string, r'step_1->(\w+)->step_3') IS NULL THEN TRUE
CASE WHEN regexp_extract(string, r'1(->step_2)+->step_3') IS NULL THEN TRUE
ELSE FALSE END as result
from sample_data
This returns the expected true/false results for the sample data (including the two extra edge cases added above).
Consider also below option
select string,
not regexp_contains(string, r'step_1->(step_2->)+step_3\b') as result
from your_table
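Both answers rely on the same idea: instead of a negative lookahead, match the unwanted pattern positively and negate the result. A hedged Python sketch of that logic against the sample strings (Python's `re` accepts the same pattern as BigQuery's RE2):

```python
import re

# match the *unwanted* pattern (only step_2s sitting between step_1 and
# step_3), then negate: no match means the string is one we want to capture
bad = re.compile(r'step_1->(step_2->)+step_3\b')

samples = {
    'step_1->step_2->step_3': False,
    'step_1->step_3': True,
    'step_1->step_2->step_1->step_3': True,
    'step_1->step_2->step_1->step_2->step_3': False,
}
for s, expected in samples.items():
    result = bad.search(s) is None
    assert result == expected, s
```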
I believe @Daniel_Zagales' answer is the one you were expecting. However, here is a broader solution that may be interesting in your use case: it consists of using arrays
WITH sample AS (
SELECT 'step_1->step_2->step_3' AS path
UNION ALL SELECT 'step_1->step_3'
UNION ALL SELECT 'step_1->step_2->step_1->step_3'
UNION ALL SELECT 'step_1->step_2->step_1->step_2->step_3'
),
temp AS (
SELECT
path,
SPLIT(REGEXP_REPLACE(path,'step_', ''), '->') AS sequences
FROM
sample)
SELECT
path,
position,
flattened AS current_step,
LAG(flattened) OVER (PARTITION BY path ORDER BY position) AS previous_step,
LEAD(flattened) OVER (PARTITION BY path ORDER BY position) AS following_step
FROM
temp,
temp.sequences AS flattened
WITH
OFFSET AS position
This query returns, for each path, every step alongside its previous and following step.
The concept is to get an array of the step numbers (splitting on '->' and erasing 'step_') and to keep the OFFSET (crucial, as UNNESTing arrays does not guarantee preserving their order).
The table obtained contains, for each path and each step of that path, the previous and following step. It is therefore easy to test, for instance, whether successive steps differ by 1
(SELECT * FROM <previous> WHERE ABS(current_step - previous_step) != 1, for example)
(CASTing to INT64 required)
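The prev/next comparison this answer describes can be sketched in Python by zipping the step numbers against themselves shifted by one, which mirrors LAG() (the function name `step_jumps` is mine, purely for illustration):

```python
def step_jumps(path: str) -> list:
    # split on '->' and strip the 'step_' prefix,
    # like SPLIT(REGEXP_REPLACE(path, 'step_', ''), '->')
    steps = [int(p.replace('step_', '')) for p in path.split('->')]
    # pair each step with its predecessor (the LAG equivalent) and
    # keep the pairs whose difference is not 1
    return [(prev, cur) for prev, cur in zip(steps, steps[1:])
            if abs(cur - prev) != 1]

print(step_jumps('step_1->step_2->step_1->step_3'))  # [(1, 3)]
```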

cast string to array<float64> in standard-sql

I've imported a csv into a table in BigQuery, and some columns hold a string which really should be a list of floats.
So now I'm trying to convert these strings to arrays. Thanks to SO I've managed to convert a string to a list of floats, but now I get one row per element of the list instead of one row per initial row, i.e. the array is "unnested". That's a problem, as it would generate a huge amount of duplicated data. Would someone please know how to do the conversion from STRING to ARRAY<FLOAT64>?
partial code:
with tbl as (
select "id1" as id,
"10000\t10001\t10002\t10003\t10004" col1_str,
"10000\t10001.1\t10002\t10003\t10004" col2_str
)
select id, cast(elem1 as float64) col1_floatarray, cast(elem2 as float64) col2_floatarray
from tbl
, unnest(split(col1_str, "\t")) elem1
, unnest(split(col2_str, "\t")) elem2
expected:
1 row, with 3 columns of types STRING id, ARRAY<FLOAT64> col1_floatarray, ARRAY<FLOAT64> col2_floatarray
Thank you!
Use below
select id,
array(select cast(elem as float64) from unnest(split(col1_str, "\t")) elem) col1_floatarray,
array(select cast(elem as float64) from unnest(split(col2_str, "\t")) elem) col2_floatarray
from tbl
If applied to the sample data in your question, this returns a single row with id id1 and the two ARRAY<FLOAT64> columns.
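For reference, the shape of the conversion (one row in, one array out, no unnesting of the outer row) is the same as this Python sketch (names are mine, for illustration only):

```python
def to_float_array(s: str) -> list:
    # SPLIT(col, "\t") then CAST each element, keeping the row intact
    return [float(x) for x in s.split('\t')]

row = {
    'id': 'id1',
    'col1_floatarray': to_float_array('10000\t10001\t10002\t10003\t10004'),
    'col2_floatarray': to_float_array('10000\t10001.1\t10002\t10003\t10004'),
}
print(row['col2_floatarray'])  # [10000.0, 10001.1, 10002.0, 10003.0, 10004.0]
```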

SQL Array with Null

I'm trying to group BigQuery columns using an array like so:
with test as (
select 1 as A, 2 as B
union all
select 3, null
)
select *,
[A,B] as grouped_columns
from test
However, this won't work, since there is a null value in column B row 2.
In fact this won't work either:
select [1, null] as test_array
When reading the documentation on BigQuery though, it says Nulls should be allowed.
In BigQuery, an array is an ordered list consisting of zero or more
values of the same data type. You can construct arrays of simple data
types, such as INT64, and complex data types, such as STRUCTs. The
current exception to this is the ARRAY data type: arrays of arrays are
not supported. Arrays can include NULL values.
There doesn't seem to be any attribute or SAFE prefix that can be used with ARRAY() to handle nulls.
So what is the best approach for this?
Per the documentation for the ARRAY type:
Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
BigQuery translates NULL ARRAY into empty ARRAY in the query result, although inside the query NULL and empty ARRAYs are two distinct values.
So, for your example, you can use the below "trick"
with test as (
select 1 as A, 2 as B union all
select 3, null
)
select *,
array(select cast(el as int64) el
from unnest(split(translate(format('%t', t), '()', ''), ', ')) el
where el != 'NULL'
) as grouped_columns
from test t
For the sample above, this gives [1, 2] and [3] as grouped_columns.
Note: this approach does not require explicitly referencing all the involved columns!
My current solution---and I'm not a fan of it---is to use a combo of IFNULL(), UNNEST() and ARRAY() like so:
select
*,
array(
select *
from unnest(
[
ifnull(A, ''),
ifnull(B, '')
]
) as grouping
where grouping <> ''
) as grouped_columns
from test
An alternative way: you can replace NULL values with some non-NULL value using IFNULL(), e.g. IFNULL(null, 0), as shown below:
with test as (
select 1 as A, 2 as B
union all
select 3, IFNULL(null, 0)
)
select *,
[A,B] as grouped_columns
from test
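All three workarounds boil down to the same operation: build the list, then either drop the NULLs or replace them before materialising the array. In Python terms (function names are mine, for illustration):

```python
def grouped_columns(*cols):
    # equivalent of ARRAY(SELECT ... WHERE el != 'NULL'): drop the NULLs
    return [c for c in cols if c is not None]

def grouped_columns_ifnull(*cols, default=0):
    # equivalent of wrapping each column in IFNULL(col, 0)
    return [default if c is None else c for c in cols]

print(grouped_columns(3, None))         # [3]
print(grouped_columns_ifnull(3, None))  # [3, 0]
```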

Extract a string in Postgresql and remove null/empty elements

I need to extract values from a string with PostgreSQL.
But for my special scenario, if an element value is null I want to remove it and bring the next element one index closer.
e.g.
assume my string is: "a$$b"
If i will use
select string_to_array('a$$b','$')
The result is:
{a,,b}
If I'm trying
SELECT unnest(string_to_array('a__b___d_','_')) EXCEPT SELECT ''
it changes the order:
1. d
2. a
3. b
The order changes, which is bad for me.
I have found another solution with:
select array_remove( string_to_array(a||','||b||','||c,',') , '')
from (
select
split_part('a__b','_',1) a,
split_part('a__b','_',2) b,
split_part('a__b','_',3) c
) inn
Returns
{a,b}
And then from the array I need to extract values by index,
e.g. Extract(ARRAY, 2)
But this seems like overkill to me - is there something better or simpler to use?
You can use with ordinality to preserve the index information during unnesting:
select a.c
from unnest(string_to_array('a__b___d_','_')) with ordinality as a(c,idx)
where nullif(trim(c), '') is not null
order by idx;
If you want that back as an array:
select array_agg(a.c order by a.idx)
from unnest(string_to_array('a__b___d_','_')) with ordinality as a(c,idx)
where nullif(trim(c), '') is not null;
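with ordinality does the same job as enumerate() in Python: tag every element with its position before filtering, so the original order survives. A minimal sketch of that idea (the function name is mine):

```python
def split_drop_empty(s: str, delim: str) -> list:
    # enumerate() plays the role of WITH ORDINALITY; filtering empties
    # cannot reorder because we sort by the original index afterwards
    indexed = list(enumerate(s.split(delim)))
    kept = [(i, p) for i, p in indexed if p.strip() != '']
    return [p for i, p in sorted(kept)]

print(split_drop_empty('a__b___d_', '_'))  # ['a', 'b', 'd']
```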

Proper Case in Big Query

I have the sentence "i want to buy bananas" in column 'Bananas' in BigQuery.
I want to get "I Want To Buy Bananas". How do I do it? I was expecting a PROPER(Bananas) function when I saw LOWER and UPPER, but it seems like proper case is not supported?
DZ
October 2020 Update:
BigQuery now supports the INITCAP function - it takes a STRING and returns it with the first character in each word in uppercase and all other characters in lowercase. Non-alphabetic characters remain the same.
So the below type of fancy-shmancy UDF is not needed anymore - instead you just use
#standardSQL
SELECT str, INITCAP(str) proper_str
FROM `project.dataset.table`
-- ~~~~~~~~~~~~~~~~~~
Below example is for BigQuery Standard SQL
#standardSQL
CREATE TEMP FUNCTION PROPER(str STRING) AS ((
SELECT STRING_AGG(CONCAT(UPPER(SUBSTR(w,1,1)), LOWER(SUBSTR(w,2))), ' ' ORDER BY pos)
FROM UNNEST(SPLIT(str, ' ')) w WITH OFFSET pos
));
WITH `project.dataset.table` AS (
SELECT 'i Want to buy bananas' str
)
SELECT str, PROPER(str) proper_str
FROM `project.dataset.table`
result is
Row str proper_str
1 i Want to buy bananas I Want To Buy Bananas
I expanded on Mikhail Berlyant's answer to also capitalise after hyphens (-), as I needed proper case for place names. I had to switch from the SPLIT function to a regex to do this.
I test for an empty string at the start and return an empty string (as opposed to null) to match the behaviour of the native UPPER and LOWER functions.
CREATE TEMP FUNCTION PROPER(str STRING) AS ((
SELECT
IF(str = '', '',
STRING_AGG(
CONCAT(
UPPER(SUBSTR(single_words,1,1)),
LOWER(SUBSTR(single_words,2))
),
'' ORDER BY position
)
)
FROM UNNEST(REGEXP_EXTRACT_ALL(str, r' +|-+|.[^ -]*')) AS single_words
WITH OFFSET AS position
));
WITH test_table AS (
SELECT 'i Want to buy bananas' AS str
UNION ALL
SELECT 'neWCASTle upon-tyne' AS str
)
SELECT str, PROPER(str) AS proper_str
FROM test_table
Output
Row str proper_str
1 i Want to buy bananas I Want To Buy Bananas
2 neWCASTle upon-tyne Newcastle Upon-Tyne
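To see what that REGEXP_EXTRACT_ALL pattern tokenises, here is the same split-capitalise-join logic sketched in Python (Python's re accepts the identical pattern; this is an illustration, not the UDF itself):

```python
import re

def proper(s: str) -> str:
    if s == '':
        return ''
    # ' +' keeps runs of spaces, '-+' keeps runs of hyphens, and
    # '.[^ -]*' grabs each word, so the separators survive the join
    words = re.findall(r' +|-+|.[^ -]*', s)
    return ''.join(w[0].upper() + w[1:].lower() for w in words)

print(proper('neWCASTle upon-tyne'))  # Newcastle Upon-Tyne
```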