BigQuery how to use MERGE to load array columns

I have a test table that I am trying to load from GCS storage:
CREATE OR REPLACE TABLE ta_producer_conformed.test
(
  id NUMERIC,
  array_string ARRAY<STRING>,
  array_struct_string_string ARRAY<STRUCT<key STRING, value STRING>>,
  array_struct_string_numeric ARRAY<STRUCT<key STRING, value NUMERIC>>,
  array_struct_string_int64 ARRAY<STRUCT<key STRING, value INT64>>
)
I have defined an external storage table as:
{
  "autodetect": true,
  "csvOptions": {
    "encoding": "UTF-8",
    "quote": "\"",
    "fieldDelimiter": "\t"
  },
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://my_bucket/test/input/*.tsv"
  ]
}
In it I am using JSON to hold the ARRAY types:
"id" "array_string" "struct_string_string" "struct_string_numberic" "struct_string_int64"
1 ["one", "two", "three"] [{"key":"one", "value":"1"},{"key":"two", "value":"2"},{"key":"three", "value":"3"}] [{"key":"one", "value":1.1},{"key":"two", "value":2.2}] [{"key":"one", "value":11},{"key":"two", "value":22}]
2 ["four", "five", "six"] [{"key":"four", "value":"4"},{"key":"five", "value":"5"},{"key":"six", "value":"6"}] [{"key":"three", "value":3.3},{"key":"four", "value":4.4}] [{"key":"three", "value":33},{"key":"four", "value":44}]
I then want to use a MERGE to upsert the data into the target table. When I run this:
CREATE TEMPORARY FUNCTION ARRAY_OF(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_STRING_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_NUMERIC_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value NUMERIC>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_INT64_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value INT64>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
MERGE ta_producer_conformed.test T
USING ta_producer_raw.test_raw S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, array_string, array_struct_string_string, array_struct_string_numeric, array_struct_string_int64)
VALUES (
id,
ARRAY_OF(array_string),
ARRAY_STRUCT_STRING_STRING_OF(struct_string_string),
ARRAY_STRUCT_STRING_NUMERIC_OF(struct_string_numberic),
ARRAY_STRUCT_STRING_INT64_OF(struct_string_int64)
)
WHEN MATCHED THEN UPDATE SET
T.id = S.id,
T.array_string = ARRAY_OF(S.array_string),
T.struct_string_string = ARRAY_STRUCT_STRING_STRING_OF(S.struct_string_string),
T.ARRAY_STRUCT_STRING_NUMERIC_OF(S.struct_string_numberic),
T.ARRAY_STRUCT_STRING_INT64_OF(S.struct_string_int64)
I get this error:
Error in query string: Error processing job 'xxxx-10843454-datamesh-
dev:bqjob_r4c426875_00000173fcfd2294_1': Syntax error: Expected "." or "=" or
"[" but got "(" at [1:1312]
If I delete the whole last WHEN MATCHED section, so that the statement only inserts, the temporary functions work fine. So the problem appears to be that I cannot use the temporary functions in the final THEN UPDATE SET section.
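Note, incidentally, that the error position lines up with the last two UPDATE SET items, which omit the target column and the "=" sign; each item must have the form column = expression, so after T.ARRAY_STRUCT_STRING_NUMERIC_OF the parser expects ".", "=" or "[" and instead finds "(". A syntactically well-formed version of those two assignments would presumably be:
T.array_struct_string_numeric = ARRAY_STRUCT_STRING_NUMERIC_OF(S.struct_string_numberic),
T.array_struct_string_int64 = ARRAY_STRUCT_STRING_INT64_OF(S.struct_string_int64)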
How can I get data types such as ARRAY<STRING> and ARRAY<STRUCT<STRING,STRING>> to load from an external bucket ideally using a single MERGE statement?
Update: I tried to use a Common Table Expression to pre-process the data using:
WITH cteConvertJason AS (
SELECT
id,
ARRAY_OF(array_string) AS array_string,
ARRAY_STRUCT_STRING_STRING_OF(struct_string_string) AS struct_string_string,
ARRAY_STRUCT_STRING_NUMERIC_OF(struct_string_numberic) AS struct_string_numberic,
ARRAY_STRUCT_STRING_INT64_OF(struct_string_int64) AS struct_string_int64
FROM
ta_producer_raw.test_raw
)
MERGE ta_producer_conformed.test T
USING cteConvertJason S
...
That gave an error, so it looks like you cannot combine WITH and MERGE.
Update: We were trying out TSV for legacy reasons. It is a far better idea to use NEWLINE_DELIMITED_JSON as the format, so that you do not need to explicitly parse the nested or repeated columns.
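For example, a minimal external table definition for that format might look like the following (the *.json source URI here is illustrative):
{
  "autodetect": true,
  "sourceFormat": "NEWLINE_DELIMITED_JSON",
  "sourceUris": [
    "gs://my_bucket/test/input/*.json"
  ]
}
With that, BigQuery reads the nested and repeated columns directly, so no JavaScript UDFs are needed in the MERGE.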

It turns out that in MERGE target USING source, the source can be a query. That query can run the temporary functions to pre-process the source data, and then the rest of the MERGE statement can be vanilla and works:
CREATE TEMPORARY FUNCTION ARRAY_OF(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_STRING_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_NUMERIC_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value NUMERIC>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_INT64_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value INT64>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
MERGE ta_producer_conformed.test T
USING
(
SELECT
id,
ARRAY_OF(array_string) AS array_string,
ARRAY_STRUCT_STRING_STRING_OF(struct_string_string) AS struct_string_string,
ARRAY_STRUCT_STRING_NUMERIC_OF(struct_string_numberic) AS struct_string_numberic,
ARRAY_STRUCT_STRING_INT64_OF(struct_string_int64) AS struct_string_int64
FROM
ta_producer_raw.test_raw
)
S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, array_string, array_struct_string_string, array_struct_string_numeric, array_struct_string_int64)
VALUES (
id,
array_string,
struct_string_string,
struct_string_numberic,
struct_string_int64
)
WHEN MATCHED THEN UPDATE SET
T.array_string = S.array_string,
T.array_struct_string_string = S.struct_string_string,
T.array_struct_string_numeric = S.struct_string_numberic,
T.array_struct_string_int64 = S.struct_string_int64

Related

BigQuery UDF user defined functions return types

I am using the following SQL (from another question) which contains temporary functions.
create temp function extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp function extract_all_leaves(input string) returns string language js as '''
function flattenObj(obj, parent = '', res = {}){
  for(let key in obj){
    let propName = parent ? parent + '.' + key : key;
    if(typeof obj[key] == 'object'){
      flattenObj(obj[key], propName, res);
    } else {
      res[propName] = obj[key];
    }
  }
  return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';
select col || replace(replace(key, 'value', ''), '.', '-') as col, value,
from your_table,
unnest([struct(extract_all_leaves(data) as json)]),
unnest(extract_keys(json)) key with offset
join unnest(extract_values(json)) value with offset
using(offset)
I want to save the above query as a view, but I cannot include the temporary functions, so I planned to define these as user-defined functions that can be called as part of the view.
When defining the functions, I'm having some trouble getting the input and output types defined correctly. Here are the three user-defined functions.
CREATE OR REPLACE FUNCTION `dataset.json_extract_all_leaves`(Obj String)
RETURNS String
LANGUAGE js AS """
function flattenObj(obj, parent = '', res = {}){
  for(let key in obj){
    let propName = parent ? parent + '.' + key : key;
    if(typeof obj[key] == 'object'){
      flattenObj(obj[key], propName, res);
    } else {
      res[propName] = obj[key];
    }
  }
  return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
"""
CREATE OR REPLACE FUNCTION `dataset.json_extract_keys`(input String)
RETURNS Array<String>
LANGUAGE js AS """
return Object.keys(JSON.parse(input));
"""
and finally:
CREATE OR REPLACE FUNCTION `dataform.json_extract_values`(input STRING)
RETURNS Array<String>
LANGUAGE js AS """
return Object.values(JSON.parse(input));
"""
Those three functions are created successfully, but when I come to use them in this view
WITH extract_all AS (
select
id,
field,
created,
key || replace(replace(key, 'value', ''), '.', '-') as key_name, value,
FROM `dataset.raw_keys_and_values`,
unnest([struct(`dataset.json_extract_all_leaves`(setting_value) as json)]),
unnest(`dataset.json_extract_keys`(json)) key with offset
join unnest(`dataset.json_extract_values`(json)) value with offset
using(offset)
)
SELECT *
FROM
extract_all
This fails with the following error:
Error: Multiple errors occurred during the request. Please see the `errors` array for complete details. 1. Failed to coerce output value "{\"value\":true}" to type ARRAY<STRING>
I understand there's a mismatch somewhere around the expected return value of json_extract_values, but I can't tell whether it's in the SQL or in the JavaScript UDF.
Revised Answer
I've given the original ask another read and contrasted it with some experimentation in my test data set.
While I'm unable to reproduce the given error, I did experience related difficulty with the following line:
unnest([struct(`dataset.json_extract_all_leaves`(setting_value) as json)]),
Put simply, the function being called takes a string (presumably a stringified JSON value) and returns a similarly stringified JSON value with the result. Because UNNEST can only be used with arrays, the author surrounds the output with [struct and ], which may be the issue. So, in an effort to yield the same result as I do below but using the original functions, I would propose that the SQL block be updated to the following:
create temp function extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp function extract_all_leaves(input string) returns string language js as '''
function flattenObj(obj, parent = '', res = {}){
  for(let key in obj){
    let propName = parent ? parent + '.' + key : key;
    if(typeof obj[key] == 'object'){
      flattenObj(obj[key], propName, res);
    } else {
      res[propName] = obj[key];
    }
  }
  return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';
WITH extract_all AS (
select
id,
field,
created,
properties
FROM
UNNEST([
STRUCT<id int, field string, created DATE, properties string>(1, 'michael', DATE(2022, 5, 1), '[[{"name":"Andy","age":7},{"name":"Mark","age":5},{"name":"Courtney","age":6}], [{"name":"Austin","age":8},{"name":"Erik","age":6},{"name":"Michaela","age":6}]]'),
STRUCT<id int, field string, created DATE, properties string>(2, 'sarah', DATE(2022, 5, 2), '[{"name":"Angela","age":9},{"name":"Ryan","age":7},{"name":"Andrew","age":7}]'),
STRUCT<id int, field string, created DATE, properties string>(3, 'rosy', DATE(2022, 5, 3), '[{"name":"Brynn","age":4},{"name":"Cameron","age":3},{"name":"Rebecca","age":5}]')
])
AS myData
)
SELECT
id,
field,
created,
key,
value
FROM (
SELECT
*
FROM extract_all,
UNNEST(extract_keys(extract_all_leaves(properties))) key WITH OFFSET
JOIN UNNEST(extract_values(extract_all_leaves(properties))) value WITH OFFSET
USING(OFFSET)
)
Put simply: remove the extract_all_leaves line with its array casting, perform that call in the offset-joined pair of keys and values instead, and then put all of that in a subquery so you can cleanly pull out just the columns you want.
And to explicitly answer the question as asked: I believe the issue is in the SQL, because of the type casting in the offending line and my own inability to get it to pair cleanly with the subsequent UNNEST queries against its output.
Original Answer
I gather that you've got some sort of JSON object in your setting_value field, and you're trying to sift out a result that shows the keys and values of that object alongside the other columns in your dataset.
As others mentioned in the comments, without any sample data it's a bit of a puzzle to figure out precisely why your query isn't working, so I'm happy to revisit this if you can provide a record or two that I can drop in to validate against. In lieu of that, I've created some sample records intended to be in the same spirit as what you provided, and here's an end-to-end that yields my guess as to what you're aiming for.
Based on your use of joining by the offset, I'm supposing that you're really just wanting to see all the keys and their values, paired with the other columns. Assuming that's true, I propose using a different JavaScript function that yields an array of all the key/value pairs instead of two separate functions to yield their own arrays. It simplifies the query (and more importantly, works):
create temp function extract_all_leaves(input string) returns string language js as r'''
function flattenObj(obj, parent = '', res = {}){
  for(let key in obj){
    let propName = parent ? parent + '.' + key : key;
    if(typeof obj[key] == 'object'){
      flattenObj(obj[key], propName, res);
    } else {
      res[propName] = obj[key];
    }
  }
  return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';
create temp function extract_key_values(input string) returns array<struct<key string, value string>> language js as r"""
var parsed = JSON.parse(input);
var keys = Object.keys(parsed);
var result = [];
for (var ii = 0; ii < keys.length; ii++) {
  var o = {key: keys[ii], value: parsed[keys[ii]]};
  result.push(o);
}
return result;
""";
WITH extract_all AS (
select
id,
field,
created,
properties
FROM
UNNEST([
--STRUCT<id int, field string, created DATE, properties string>(1, 'michael', DATE(2022, 5, 1), '[[{"name":"Andy","age":7},{"name":"Mark","age":5},{"name":"Courtney","age":6}], [{"name":"Austin","age":8},{"name":"Erik","age":6},{"name":"Michaela","age":6}]]'),
STRUCT<id int, field string, created DATE, properties string>(2, 'sarah', DATE(2022, 5, 2), '[{"name":"Angela","age":9},{"name":"Ryan","age":7},{"name":"Andrew","age":7}]'),
STRUCT<id int, field string, created DATE, properties string>(3, 'rosy', DATE(2022, 5, 3), '[{"name":"Brynn","age":4},{"name":"Cameron","age":3},{"name":"Rebecca","age":5}]')
])
AS myData
)
SELECT
id,
field,
created,
key,
value
FROM (
SELECT
*
FROM extract_all
CROSS JOIN UNNEST(extract_key_values(extract_all_leaves(properties)))
)
And I believe this yields a result more like what you're seeking:
id  field  created     key     value
2   sarah  2022-05-02  0.name  Angela
2   sarah  2022-05-02  0.age   9
2   sarah  2022-05-02  1.name  Ryan
2   sarah  2022-05-02  1.age   7
2   sarah  2022-05-02  2.name  Andrew
2   sarah  2022-05-02  2.age   7
3   rosy   2022-05-03  0.name  Brynn
3   rosy   2022-05-03  0.age   4
3   rosy   2022-05-03  1.name  Cameron
3   rosy   2022-05-03  1.age   3
3   rosy   2022-05-03  2.name  Rebecca
3   rosy   2022-05-03  2.age   5
Of course, happy to revisit this if it isn't at all in the right direction of where you're trying to get to.

BigQuery UDF to remove accents/diacritics in a string

Using this javascript code we can remove accents/diacritics in a string.
var originalText = "éàçèñ"
var result = originalText.normalize('NFD').replace(/[\u0300-\u036f]/g, "")
console.log(result) // eacen
If we create a BigQuery UDF, it does not work (even with doubled backslashes).
CREATE OR REPLACE FUNCTION project.remove_accent(x STRING)
RETURNS STRING
LANGUAGE js AS """
return x.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
""";
SELECT project.remove_accent("éàçèñ") --"éàçèñ"
Any thoughts on that?
Consider below approach
select originalText,
regexp_replace(normalize(originalText, NFD), r"\pM", '') output
if applied to the sample data in your question, the output is eacen
You can easily wrap it in a SQL UDF if you wish.
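A minimal sketch of that wrapper (shown as a temp function; a persistent CREATE OR REPLACE FUNCTION works the same way):
CREATE TEMP FUNCTION remove_accent(x STRING) AS (
  REGEXP_REPLACE(NORMALIZE(x, NFD), r"\pM", '')
);
SELECT remove_accent("éàçèñ") AS output; -- eacen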

Snowflake error in JavaScript procedure: column index/name does not exist in resultSet getColumnValue

I have a procedure on Snowflake which executes the following query:
select
  array_size(split($1, ',')) as NO_OF_COL,
  split($1, ',') as COLUMNS_ARRAY
from
  @mystage/myfile.csv(file_format => 'ONE_COLUMN_FILE_FORMAT')
limit 1;
And the result comes back as expected, with NO_OF_COL and COLUMNS_ARRAY columns. But when I run this query in a procedure:
CREATE OR REPLACE PROCEDURE ADD_TEMPORARY_TABLE(TEMP_TABLE_NAME STRING, FILE_FULL_PATH STRING, ONE_COLUMN_FORMAT_FILE STRING, FILE_FORMAT_NAME STRING)
RETURNS variant
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
try{
  var final_result = [];
  var nested_obj = {};
  var nbr_rows = 0;
  var NO_OF_COL = 0;
  var COLUMNS_ARRAY = [];
  var get_length_and_columns_array = "select array_size(split($1,',')) as NO_OF_COL, "+
    "split($1,',') as COLUMNS_ARRAY from "+FILE_FULL_PATH+" "+
    "(file_format=>"+ONE_COLUMN_FORMAT_FILE+") limit 1";
  var stmt = snowflake.createStatement({sqlText: get_length_and_columns_array});
  var array_result = stmt.execute();
  array_result.next();
  //return array_result.getColumnValue('COLUMNS_ARRAY');
  NO_OF_COL = array_result.getColumnValue('NO_OF_COL');
  COLUMNS_ARRAY = array_result.getColumnValue('COLUMNS_ARRAY');
  return COLUMNS_ARRAY;
}
...
$$;
It returns an error like the following:
{
  "code": 100183,
  "message": "Given column name/index does not exist: NO_OF_COL",
  "stackTraceTxt": "At ResultSet.getColumnValue, line 16 position 29",
  "state": "P0000",
  "toString": {}
}
The other issue is that if I keep trying, it sometimes returns the desired array, but most of the time it returns this error.
if I keep trying, it sometimes returns the desired array
If it works one time and not another time, my educated guess is that the stored procedure is called from different schemas.
From the documentation on querying a stage:
( FILE_FORMAT => '<namespace>.<named_file_format>' )
If referencing a file format in the current namespace for your user session, you can omit the single quotes around the format identifier.
In the standalone query we can see:
select
  array_size(split($1, ',')) as NO_OF_COL,
  split($1, ',') as COLUMNS_ARRAY
from
  @mystage/myfile.csv(file_format => 'ONE_COLUMN_FILE_FORMAT')
-- note the single quotes around the file format name
But in the stored procedure body:
"(file_format=>"+ONE_COLUMN_FORMAT_FILE+") limit 1";
--here the text is appended, but without wrapping with ''
=>
"(file_format=>'"+ONE_COLUMN_FORMAT_FILE+"') limit 1";
Suggestion: always provide the file format as a string wrapped in single quotes, preferably prefixed with the namespace, '<schema_name>.<format_name>'.
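With the quoting and namespace applied, the SQL that the procedure builds should come out like this (MY_SCHEMA is an illustrative schema name):
select
  array_size(split($1, ',')) as NO_OF_COL,
  split($1, ',') as COLUMNS_ARRAY
from
  @mystage/myfile.csv(file_format => 'MY_SCHEMA.ONE_COLUMN_FILE_FORMAT')
limit 1;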

BigQuery Standard SQL UDF mapping to struct is returning internal error

I have a code block below for parsing query params using a UDF. It works fine when the value passed to the function is hardcoded, as in the example. But when I try to parse the same value fetched from a table, I get:
An internal error occurred and the request could not be completed. (error code: internalError)
CREATE TEMPORARY FUNCTION parse(queryString STRING) RETURNS ARRAY<STRUCT<key STRING, value STRING>> LANGUAGE js AS
"""
var params = {}
var array = []
// split into key/value pairs
var queries = queryString.split('&');
var ind = 0
// convert the array of strings into an object
for (var i = 0; i < queries.length; i++ ) {
  var temp = queries[i].split('=');
  if (temp.length < 2) continue;
  array[ind++] = { key: temp[0], value: decodeURI(temp[1]) }
}
return array;
""";
select parse('ca_chid=2002810&ca_source=gaw&ca_ace=&ca_nw=g&ca_dev=c&ca_pl=&ca_pos=1t3&ca_agid=32438864366&ca_caid=260997846&ca_adid=151983037851&ca_kwt=florists%20in%20walsall&ca_mt=e&ca_fid=&ca_tid=aud-117534990726:kwd-420175760&ca_lp=9045676&ca_li=&ca_devm=&ca_plt=&ca_sadt=&ca_smid=&ca_spc=&ca_spid=&ca_sco=&ca_sla=&ca_sptid=&ca_ssc=&gclid=CLaDoa6ZrdACFcyRGwodG8IFvQ') as params
--not working
--select parse(page_urlquery) from (
--SELECT page_urlquery FROM `query_param_snapshot` where page_urlquery != '' LIMIT 1)
Also reported on the issue tracker (we are working on a fix). One workaround is to use a SQL function rather than a JavaScript function, e.g.:
CREATE TEMPORARY FUNCTION parse(queryString STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
(SELECT
ARRAY_AGG(STRUCT(
entry[OFFSET(0)] AS key,
entry[OFFSET(1)] AS value))
FROM (
SELECT SPLIT(pairString, '=') AS entry
FROM UNNEST(SPLIT(queryString, '&')) AS pairString)
)
);
SELECT parse('ca_chid=2002810&ca_source=gaw&ca_ace=&ca_nw=g&ca_dev=c&ca_pl=&ca_pos=1t3&ca_agid=32438864366&ca_caid=260997846&ca_adid=151983037851&ca_kwt=florists%20in%20walsall&ca_mt=e&ca_fid=&ca_tid=aud-117534990726:kwd-420175760&ca_lp=9045676&ca_li=&ca_devm=&ca_plt=&ca_sadt=&ca_smid=&ca_spc=&ca_spid=&ca_sco=&ca_sla=&ca_sptid=&ca_ssc=&gclid=CLaDoa6ZrdACFcyRGwodG8IFvQ') AS params;

JSON_EXTRACT in BigQuery Standard SQL?

I'm converting some SQL code from BigQuery legacy SQL to BigQuery Standard SQL.
I can't seem to find JSON_EXTRACT_SCALAR in Bigquery Standard SQL, is there an equivalent?
Edit: we implemented the JSON functions a while back. You can read about them in the documentation.
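For instance, the same lookup used in the workaround below now runs natively in Standard SQL and returns "3":
SELECT JSON_EXTRACT_SCALAR('{"a": ["x", {"b":3}]}', '$.a[1].b') AS str;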
Not that I know of, but there is always a workaround.
Let's assume we want to mimic the example from the JSON_EXTRACT_SCALAR documentation:
SELECT JSON_EXTRACT_SCALAR('{"a": ["x", {"b":3}]}', '$.a[1].b') as str
The code below does the same:
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING)
RETURNS STRING
LANGUAGE js AS """
try {
  var parsed = JSON.parse(json);
} catch (e) {
  return null;
}
return parsed.a[1].b;
""";
SELECT CUSTOM_JSON_EXTRACT('{"a": ["x", {"b":3}]}') AS str
I think this can be a good starting point to experiment with.
See more on scalar UDFs in BigQuery Standard SQL.
Quick update
After a cup of coffee, I decided to complete this "exercise" myself.
Looks like a good short-term solution to me :o)
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js AS """
try {
  var parsed = JSON.parse(json);
} catch (e) {
  return null;
}
return eval(json_path.replace("$", "parsed"));
""";
SELECT
CUSTOM_JSON_EXTRACT('{"a": ["x", {"b":3}]}', '$.a[1].b') AS str1,
CUSTOM_JSON_EXTRACT('{"a": ["x", {"b":3}]}', '$.a[0]') AS str2,
CUSTOM_JSON_EXTRACT('{"a": 1, "b": [4, 5]}', '$.b') AS str3