BigQuery query optimisation - Unnest field to REPEATED STRUCT

I currently have the below query, which works perfectly, but I would like to know if it can be optimized (perhaps avoiding the UNNEST followed by a GROUP BY, and doing the transformation in one step).
with src as (
  select 1 as row_key, "key_A:value_A,key_B:value_B,key_C:value_C" as field_raw
), tmp as (
  select
    row_key
    , STRUCT(
        split(field_items, ':')[offset(0)] as key
        , split(field_items, ':')[offset(1)] as value
      ) AS field_items
  from src
  , unnest(split(field_raw, ',')) field_items
)
select
  row_key
  , ARRAY_AGG(field_items) as field_items
from tmp
group by row_key
Input:

row_key | field_raw
--------+-------------------------------------------
1       | key_A:value_A,key_B:value_B,key_C:value_C

Expected output:

row_key | field_items.key | field_items.value
--------+-----------------+------------------
1       | key_A           | value_A
        | key_B           | value_B
        | key_C           | value_C
Thanks for your help :)

Consider the refactored approach below:
select row_key,
array(select as struct
split(kv, ':')[offset(0)] as key,
split(kv, ':')[offset(1)] as value
from t.arr as kv
) as field_items
from src,
unnest([struct(regexp_extract_all(field_raw, r'\w+:\w+') as arr)]) t
Applied to the sample data in your question, it produces the expected output shown above.
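For reference, here is a self-contained sketch that inlines the sample row from the question as a CTE, so the refactored query can be run as-is:

with src as (
  select 1 as row_key, "key_A:value_A,key_B:value_B,key_C:value_C" as field_raw
)
select row_key,
  array(select as struct
          split(kv, ':')[offset(0)] as key,
          split(kv, ':')[offset(1)] as value
        from t.arr as kv
  ) as field_items
from src,
unnest([struct(regexp_extract_all(field_raw, r'\w+:\w+') as arr)]) t

Note how regexp_extract_all does the splitting on ',' in one pass, and the array() subquery builds the REPEATED STRUCT without a separate GROUP BY step.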

Related

How can I split a column with nested delimiters into a list of STRUCTs in Bigquery SQL?

I've got a column that has entries (variable numbers of 2-tuples) like this:
D000001:Term1;D00007:Term19;D00008:Term781 (mesh_terms column below) and I'd like to split those so that I end up with an ARRAY<STRUCT<code STRING, term STRING>> for each row.
The query below works as needed, but I'm curious if anyone has suggestions for improvements in terms of readability, performance (on BigQuery, so not too big a problem), or best practices.
with t1 as (
SELECT
pmid,
split(mesh_terms, ';') as l1
FROM `omicidx_etl.pm1`
),
t2 as (
select
t1.pmid,
x
from t1,
unnest(t1.l1) as x
),
t3 as (
select
pmid,
split(x, ':') as y
from t2
)
select
pmid,
array_agg(STRUCT(t3.y[offset(0)] as code, t3.y[offset(1)] as term)) as mesh_terms
from
t3
group by pmid
Use the query below:
select pmid,
array(
select as struct split(kv, ':')[offset(0)] code, split(kv, ':')[safe_offset(1)] term
from unnest(regexp_extract_all(mesh_terms, r'[^:;]+:[^:;]+')) kv
) mesh_terms
from your_table
Applied to sample data like that in your question, it produces the expected ARRAY<STRUCT<code STRING, term STRING>> per row.
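As a quick sanity check, here is a self-contained variant with the sample value from the question inlined (the pmid value of 1 is an assumption):

with pm1 as (
  select 1 as pmid, 'D000001:Term1;D00007:Term19;D00008:Term781' as mesh_terms
)
select pmid,
  array(
    select as struct split(kv, ':')[offset(0)] code, split(kv, ':')[safe_offset(1)] term
    from unnest(regexp_extract_all(mesh_terms, r'[^:;]+:[^:;]+')) kv
  ) mesh_terms
from pm1

The regexp_extract_all pattern pulls out each code:term pair directly, so the separate split/unnest/group by steps of the original collapse into one array() expression.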

Separating out key/value pairs

In T-SQL, how can one separate a string into two columns, one with the key and the other with the value, for strings that follow the pattern below?
Examples of the strings that need to be processed are:
country_code: "US"province_name: "NY"city_name: "Old Chatham"
postal_code: "11746-8031"country_code: "US"province_name: "NY"street_address: "151 Millet Street North"city_name: "Dix Hills"
street_address: "1036 Main Street, Holbrook, NY 11741"
Desired outcome for example 1 would be:
Key           | Value
--------------+-------------
country_code  | US
province_name | NY
city_name     | Old Chatham
Nice to see Old Chatham ... a little touch of home
My first thought was to "correct" the JSON string, but that got risky.
Here is an option that will parse and pair the key/values
Example:
Select A.*
      ,C.*
 From YourTable A
 -- normalize the string so every key and value is separated by a single ':'
 Cross Apply ( values ( replace(replace(replace(SomeCol,'"',':'),': :',':'),'::',':') ) ) B(CleanString)
 Cross Apply (
    Select [Key]  = max(case when Seq=1 then Val end)
          ,[Value]= max(case when Seq=0 then Val end)
     From (
            -- order by the numeric array index rather than the string
            -- (as strings, "10" < "2"), so longer strings still pair up correctly
            Select Seq = row_number() over (order by cast([Key] as int)) % 2
                  ,Grp = (row_number() over (order by cast([Key] as int))-1) / 2
                  ,Val = Value
             From OpenJSON( '["'+replace(string_escape(CleanString,'json'),':','","')+'"]' )
             Where ltrim(Value)<>''
          ) C1
     Group By Grp
 ) C
The results match the desired outcome shown above.
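If you want to try this locally, a minimal setup might look like the following (the table and column names here are assumptions matching the query above):

-- hypothetical sample data taken from the question
declare @YourTable table (SomeCol nvarchar(500));
insert into @YourTable values
 ('country_code: "US"province_name: "NY"city_name: "Old Chatham"')
,('postal_code: "11746-8031"country_code: "US"province_name: "NY"street_address: "151 Millet Street North"city_name: "Dix Hills"')
,('street_address: "1036 Main Street, Holbrook, NY 11741"');

Then run the query above with @YourTable in place of YourTable.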

How to write a faster SQL query to break apart a custom dictionary in a Postgres column

I'm working with this fun data that has this custom dictionary-ish format in one of the columns.
id | parameters
---+-------------------------------------------
1  | x_id x_value; y_id y_value; z_id z_value;
2  | y_id y_value2; z_id z_value2;

And looking to get it into this format...

id | x_id    | y_id     | z_id
---+---------+----------+----------
1  | x_value | y_value  | z_value
2  | NULL    | y_value2 | z_value2
I'd prefer for it to be completely dynamic, but I can live with knowing all possible column names in advance if it reduces complexity/improves performance. I can also guarantee this pattern is consistent. No additional nesting of dictionaries, etc.
I'm not a SQL master, so this is the naive implementation I came up with, but it seems quite slow. Is there a more performant way to do this?
select
  params.id,
  string_agg(x_id, ',') as x_id,
  string_agg(y_id, ',') as y_id,
  string_agg(z_id, ',') as z_id
from (
  select
    t.id,
    case when 'x_id' LIKE kvs.key then kvs.value else null end as x_id,
    case when 'y_id' LIKE kvs.key then kvs.value else null end as y_id,
    case when 'z_id' LIKE kvs.key then kvs.value else null end as z_id
  from my_table as t
  join (
    SELECT
      id,
      split_part(trim(both ' ' FROM unnest(string_to_array(parameters, ';'))), ' ', 1) "key",
      split_part(trim(both ' ' FROM unnest(string_to_array(parameters, ';'))), ' ', 2) "value"
    FROM my_table
  ) as kvs
  on t.id = kvs.id
) as params
group by params.id
I would convert this to a JSON value, then you can access each key quite easily:
select id,
parameters ->> 'x_id' as x_id,
parameters ->> 'y_id' as y_id,
parameters ->> 'z_id' as z_id
from (
select t.id,
jsonb_object_agg(split_part(p.parm, ' ', 1), split_part(p.parm, ' ', 2)) as parameters
from the_table t
left join unnest(regexp_split_to_array(trim(';' from parameters), '\s*;\s*')) as p(parm) on true
group by id
) x
order by id;
The trim(';' from parameters) is necessary to remove the trailing ;, otherwise regexp_split_to_array() would return an extra, empty array element.
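To illustrate with a quick sketch:

-- without trim, the trailing ';' produces an empty trailing element
select regexp_split_to_array('x_id x_value; y_id y_value;', '\s*;\s*');
-- => {"x_id x_value","y_id y_value",""}

-- with trim, the trailing ';' is removed first
select regexp_split_to_array(trim(';' from 'x_id x_value; y_id y_value;'), '\s*;\s*');
-- => {"x_id x_value","y_id y_value"}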
You can put this into a function to make things easier:
create or replace function convert_to_json(p_input text)
returns jsonb
as
$$
select jsonb_object_agg(elements[1], elements[2]) as parameters
from (
select string_to_array(p.parm, ' ') as elements
from unnest(regexp_split_to_array(trim(';' from p_input), '\s*;\s*')) as p(parm)
) t1;
$$
language sql
immutable;
Then this gets a bit simpler:
select id,
convert_to_json(parameters) ->> 'x_id' as x_id,
convert_to_json(parameters) ->> 'y_id' as y_id,
convert_to_json(parameters) ->> 'z_id' as z_id
from the_table
order by id;
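To test this without creating a table, the sample rows can be inlined (assuming the convert_to_json function above has been created):

with the_table (id, parameters) as (
  values (1, 'x_id x_value; y_id y_value; z_id z_value;'),
         (2, 'y_id y_value2; z_id z_value2;')
)
select id,
       convert_to_json(parameters) ->> 'x_id' as x_id,
       convert_to_json(parameters) ->> 'y_id' as y_id,
       convert_to_json(parameters) ->> 'z_id' as z_id
from the_table
order by id;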

How to Group By in BigQuery

I have a RECORD data structure in BigQuery. When I run the following query, my output is as follows:
Query: select v.key, v.value from table, unnest(dimensions.key_value) v;
key     value
region  region1
loc     location1
region  region1
loc     location1
region  region2
loc     location2
Now I want to group by region and location, so my output will be as follows:
groupBy            Count
region1,location1  2
region2,location2  1
If I needed to group by only one key, it would be a simple query:
SELECT v.key, count(*) from table, unnest(dimensions.key_value) v group by v.key;
But how do I do it for more than one key?
Maybe pivot it first?
with pivotted as (
select
(select value from t.dimensions.key_value where key = 'region') as region,
(select value from t.dimensions.key_value where key = 'loc') as loc
from table t
)
select region, loc, count(*)
from pivotted
group by region, loc
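Note that from t.dimensions.key_value inside those subqueries is BigQuery shorthand for an implicit UNNEST of the correlated array; written out explicitly, the pivoting step is:

select
  (select value from unnest(t.dimensions.key_value) where key = 'region') as region,
  (select value from unnest(t.dimensions.key_value) where key = 'loc') as loc
from table t

This assumes each row's array holds at most one 'region' and one 'loc' entry; otherwise the scalar subqueries would error out.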
Hmmm . . . You seem to be assuming that the ordering of the values is important. This is not a good way to store repeating pairs of data -- arrays of structs seem better. But you can use WITH OFFSET and some arithmetic.
Assuming that region and loc are the only values and they are interleaved (as in your example):
with t as (
select struct(array[struct('region' as key, 'region1' as value),
struct('loc', 'location1'),
struct('region', 'region1'),
struct('loc', 'location1'),
struct('region', 'region2'),
struct('loc', 'location2')
] as key_value) as dimensions
)
select rl.region, rl.loc, count(*)
from (select (select array_agg(region_loc) as region_locs
from (select struct(max(case when kv.key = 'region' then kv.value end) as region,
max(case when kv.key = 'loc' then kv.value end) as loc
) as region_loc
from unnest(dimensions.key_value) kv with offset n
group by floor( n / 2 )
) rl2
) as region_locs
from t
) rl3 cross join
unnest(rl3.region_locs) rl
group by 1, 2;
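The pairing arithmetic is easiest to see in isolation: floor(n / 2) maps each adjacent pair of array offsets to the same group, so each (region, loc) pair aggregates together. A small sketch:

select n, floor(n / 2) as grp
from unnest(generate_array(0, 5)) n
-- n:   0  1  2  3  4  5
-- grp: 0  0  1  1  2  2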

Scan columns for a value and use the result to select other columns

I have one BigQuery table with 30 columns:

strcol, strcol1, strcol2, ...   startstr,  startstr1, startstr2, ...   endstr,  endstr1, endstr2, ...
1111,   2343,    1012,    ...   "car",     "boat",    "scooter", ...   "plane", "bike",  "boat"
9999,   1012,    3333,    ...   "scooter", "boat",    "scooter", ...   "boat",  "bike",  "boat"
I need to scan all the "strcol" columns to find a number. If I find the number, I need to use the corresponding "startstr" and "endstr".
Example:
Number 1012 is present in column "strcol2", so I need to use "startstr2" and "endstr2". My select statement only needs two columns. The result would be something like:
start, end
scooter, boat
boat, bike
I was thinking of creating an array of strcol...strcol9, trying to find the number and return its index, and then using this index to find the correct startstr and endstr. But I don't know how to do this. Maybe there are much better alternatives? Any ideas?
The number 1012 will always be fixed and will never change.
Cheers,
Cris
Below is for BigQuery Standard SQL; it might look less "advanced/verbose", so it might be easier for you to handle/maintain.
#standardSQL
select
regexp_extract(line, r'"startstr' || pattern) start,
regexp_extract(line, r'"endstr' || pattern) `end`,
from `project.dataset.table` t,
unnest([to_json_string(t)]) line,
unnest(generate_array(0, array_length(split(line))/3 - 1)) index,
unnest([if(index > 0, cast(index as string), '') || '":"?([^,"]+)"?']) pattern
where regexp_extract(line, r'"strcol' || pattern) = cast(1012 as string)
When applied to the sample data from your question, it returns the expected start/end pair for each row.
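The key trick above is to_json_string(t), which serializes the whole row into a single string that can then be searched with one regular expression. For the first sample row it would look roughly like this (values assumed from the question):

select to_json_string(t) as line
from `project.dataset.table` t
-- line: {"strcol":"1111","strcol1":"2343","strcol2":"1012","startstr":"car", ... }

Depending on the column types, the values may appear with or without quotes, which is why the pattern makes the quote characters optional ("?). Another answer takes a simpler, more manual route: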
SELECT (
CASE
WHEN strcol = 1012 THEN STRUCT(startstr, endstr)
WHEN strcol1 = 1012 THEN STRUCT(startstr1, endstr1)
WHEN strcol2 = 1012 THEN STRUCT(startstr2, endstr2)
END
).*
FROM dataset.table
WHERE 1012 IN (strcol, strcol1, strcol2)
Below is for BigQuery Standard SQL
#standardSQL
select start, `end`
from (
select
max(if(key='strcol', value, null)) as strcol,
cast(max(if(key='strcol', value, null)) as int64) as value,
max(if(key='startstr', value, null)) as start,
max(if(key='endstr', value, null)) as `end`
from (
select format('%t', t) id,
regexp_extract(col, r'^[^\d]+') key,
regexp_extract(col, r'[\d]*$') offset,
value
from `project.dataset.table` t,
unnest(split(translate(to_json_string(t), '{}"', ''))) as kv,
unnest([struct(split(kv, ':')[offset(0)] as col, split(kv, ':')[offset(1)] as value)])
)
group by id, offset
)
where value = 1012
When applied to the sample data from your question, this also produces the expected output.
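For intuition, the two regexes above split each serialized column name into its base key and numeric suffix, which is what lets the query re-group the flattened columns (the query aliases the suffix as offset). A small sketch:

select
  col,
  regexp_extract(col, r'^[^\d]+') as key,     -- 'strcol2'   -> 'strcol'
  regexp_extract(col, r'[\d]*$') as suffix    -- 'strcol2'   -> '2' ('' when no digits)
from unnest(['strcol', 'strcol2', 'startstr1']) col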