JSON_EXTRACT in BigQuery Standard SQL? - google-bigquery

I'm converting some SQL code from BigQuery Legacy SQL to BigQuery Standard SQL.
I can't seem to find JSON_EXTRACT_SCALAR in BigQuery Standard SQL; is there an equivalent?

Edit: we implemented the JSON functions a while back. You can read about them in the documentation.
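For example, the documented JSON_EXTRACT_SCALAR call discussed below now runs directly in Standard SQL:
#standardSQL
SELECT JSON_EXTRACT_SCALAR('{"a": ["x", {"b":3}]}', '$.a[1].b') AS str -- returns "3"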

Not that I know of, but there is always a workaround.
Let's assume we want to mimic the example from the JSON_EXTRACT_SCALAR documentation:
SELECT JSON_EXTRACT_SCALAR('{"a": ["x", {"b":3}]}', '$.a[1].b') as str
The code below does the same:
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    var parsed = JSON.parse(json);
  } catch (e) {
    return null;
  }
  return parsed.a[1].b;
""";
SELECT CUSTOM_JSON_EXTRACT('{"a": ["x", {"b":3}]}') AS str
I think this can be a good starting point to experiment with.
See Scalar UDFs in BigQuery Standard SQL for more.
Quick update
After a cup of coffee, I decided to complete this "exercise" myself.
Looks like a good short-term solution to me :o)
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    var parsed = JSON.parse(json);
  } catch (e) {
    return null;
  }
  // Rewrite the JSONPath-style expression against the parsed object and
  // evaluate it. Note: eval() is only safe for trusted json_path values.
  return eval(json_path.replace("$", "parsed"));
""";
SELECT
CUSTOM_JSON_EXTRACT('{"a": ["x", {"b":3}]}', '$.a[1].b') AS str1,
CUSTOM_JSON_EXTRACT('{"a": ["x", {"b":3}]}', '$.a[0]') AS str2,
CUSTOM_JSON_EXTRACT('{"a": 1, "b": [4, 5]}', '$.b') AS str3

Related

BigQuery UDF to remove accents/diacritics in a string

Using this JavaScript code, we can remove accents/diacritics from a string:
var originalText = "éàçèñ"
var result = originalText.normalize('NFD').replace(/[\u0300-\u036f]/g, "")
console.log(result) // eacen
But if we create a BigQuery UDF with the same code, it does not work (even with a double backslash, \\):
CREATE OR REPLACE FUNCTION project.remove_accent(x STRING)
RETURNS STRING
LANGUAGE js AS """
return x.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
""";
SELECT project.remove_accent("éàçèñ") -- returns "éàçèñ" unchanged
Any thoughts on that?
Consider the approach below:
select originalText,
  regexp_replace(normalize(originalText, NFD), r"\pM", '') output
Applied to the sample data in your question, the output is eacen.
You can easily wrap it in a SQL UDF if you wish.
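For example, a minimal sketch of such a wrapper (the function name remove_accent_sql is illustrative):
-- Same expression as above, wrapped in a SQL UDF (name is hypothetical)
CREATE OR REPLACE FUNCTION project.remove_accent_sql(x STRING)
RETURNS STRING
AS (
  REGEXP_REPLACE(NORMALIZE(x, NFD), r"\pM", '')
);
SELECT project.remove_accent_sql("éàçèñ") -- eacen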

BigQuery how to use MERGE to load array columns

I have a test table that I am trying to load from GCS storage:
CREATE OR REPLACE TABLE ta_producer_conformed.test
(
id NUMERIC,
array_string ARRAY<STRING>,
array_struct_string_string ARRAY<STRUCT<key STRING, value STRING>>,
array_struct_string_numeric ARRAY<STRUCT<key STRING, value NUMERIC>>,
array_struct_string_int64 ARRAY<STRUCT<key STRING, value INT64>>
)
I have defined an external storage table as:
{
"autodetect": true,
"csvOptions": {
"encoding": "UTF-8",
"quote": "\"",
"fieldDelimiter": "\t"
},
"sourceFormat": "CSV",
"sourceUris": [
"gs://my_bucket/test/input/*.tsv"
]
}
In it I am using JSON to hold the ARRAY types:
"id" "array_string" "struct_string_string" "struct_string_numberic" "struct_string_int64"
1 ["one", "two", "three"] [{"key":"one", "value":"1"},{"key":"two", "value":"2"},{"key":"three", "value":"3"}] [{"key":"one", "value":1.1},{"key":"two", "value":2.2}] [{"key":"one", "value":11},{"key":"two", "value":22}]
2 ["four", "five", "six"] [{"key":"four", "value":"4"},{"key":"five", "value":"5"},{"key":"six", "value":"6"}] [{"key":"three", "value":3.3},{"key":"four", "value":4.4}] [{"key":"three", "value":33},{"key":"four", "value":44}]
I then want to use a MERGE to upsert the data into the target table. When I run this:
CREATE TEMPORARY FUNCTION ARRAY_OF(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_STRING_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_NUMERIC_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value NUMERIC>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_INT64_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value INT64>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
MERGE ta_producer_conformed.test T
USING ta_producer_raw.test_raw S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, array_string, array_struct_string_string, array_struct_string_numeric, array_struct_string_int64)
VALUES (
id,
ARRAY_OF(array_string),
ARRAY_STRUCT_STRING_STRING_OF(struct_string_string),
ARRAY_STRUCT_STRING_NUMERIC_OF(struct_string_numberic),
ARRAY_STRUCT_STRING_INT64_OF(struct_string_int64)
)
WHEN MATCHED THEN UPDATE SET
T.id = S.id,
T.array_string = ARRAY_OF(S.array_string),
T.struct_string_string = ARRAY_STRUCT_STRING_STRING_OF(S.struct_string_string),
T.ARRAY_STRUCT_STRING_NUMERIC_OF(S.struct_string_numberic),
T.ARRAY_STRUCT_STRING_INT64_OF(S.struct_string_int64)
I get this error:
Error in query string: Error processing job 'xxxx-10843454-datamesh-
dev:bqjob_r4c426875_00000173fcfd2294_1': Syntax error: Expected "." or "=" or
"[" but got "(" at [1:1312]
If I delete the whole WHEN MATCHED section so that the statement only INSERTs, the temporary functions work fine. So the problem appears to be that I cannot use the temporary functions in the final THEN UPDATE SET section.
How can I get data types such as ARRAY<STRING> and ARRAY<STRUCT<STRING,STRING>> to load from an external bucket ideally using a single MERGE statement?
Update: I tried to use a Common Table Expression to pre-process the data using:
WITH cteConvertJason AS (
SELECT
id,
ARRAY_OF(array_string) AS array_string,
ARRAY_STRUCT_STRING_STRING_OF(struct_string_string) AS struct_string_string,
ARRAY_STRUCT_STRING_NUMERIC_OF(struct_string_numberic) AS struct_string_numberic,
ARRAY_STRUCT_STRING_INT64_OF(struct_string_int64) AS struct_string_int64
FROM
ta_producer_raw.test_raw
)
MERGE ta_producer_conformed.test T
USING cteConvertJason S
...
That gave an error, so it looks like you cannot combine WITH and MERGE.
Update: We were trying out TSV for legacy reasons. It is a far better idea to use NEWLINE_DELIMITED_JSON as the format, so that you do not need to explicitly parse the nested or repeated columns.
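With that format, the external table definition could look something like this (a sketch, assuming the same bucket layout but with .json files):
{
"autodetect": true,
"sourceFormat": "NEWLINE_DELIMITED_JSON",
"sourceUris": [
"gs://my_bucket/test/input/*.json"
]
}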
It turns out that in MERGE target USING source, the source can be a query. That query can run the temporary functions to pre-process the source data, and then the rest of the MERGE statement can stay vanilla, and it works:
CREATE TEMPORARY FUNCTION ARRAY_OF(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_STRING_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value STRING>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_NUMERIC_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value NUMERIC>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
CREATE TEMPORARY FUNCTION ARRAY_STRUCT_STRING_INT64_OF(json STRING)
RETURNS ARRAY<STRUCT<key STRING, value INT64>>
LANGUAGE js AS """
let parsed = JSON.parse(json);
return parsed;
""";
MERGE ta_producer_conformed.test T
USING
(
SELECT
id,
ARRAY_OF(array_string) AS array_string,
ARRAY_STRUCT_STRING_STRING_OF(struct_string_string) AS struct_string_string,
ARRAY_STRUCT_STRING_NUMERIC_OF(struct_string_numberic) AS struct_string_numberic,
ARRAY_STRUCT_STRING_INT64_OF(struct_string_int64) AS struct_string_int64
FROM
ta_producer_raw.test_raw
)
S
ON
S.id = T.id
WHEN NOT MATCHED THEN
INSERT (id, array_string, array_struct_string_string, array_struct_string_numeric, array_struct_string_int64)
VALUES (
id,
array_string,
struct_string_string,
struct_string_numberic,
struct_string_int64
)
WHEN MATCHED THEN UPDATE SET
T.array_string = S.array_string,
T.array_struct_string_string = S.struct_string_string,
T.array_struct_string_numeric = S.struct_string_numberic,
T.array_struct_string_int64 = S.struct_string_int64

DBArrayList to List<Map> Conversion after Query

Currently, I have a SQL query that returns information to me in a DBArrayList.
It returns data in this format: [{id=2kjhjlkerjlkdsf324523}]
For the next step, I need it to be in a List<Map> format without the id: [2kjhjlkerjlkdsf324523]
The Datatypes being used are DBArrayList, and List.
If it helps any, the next step is a function to collect the list and then replace any single quotes [SQL injection prevention], using:
listMap = listMap.collect() { "'" + Util.removeSingleQuotes(it) + "'" }
public static String removeSingleQuotes(s) {
    // strip single quotes so values can be safely wrapped in quotes
    return s ? s.replaceAll(/'/, '') : s
}
I spent this morning working on it, and I found out that I needed to actually collect the DBArrayList like this:
listMap = dbArrayList.collect { it.getAt('id')}
If you're in a bind like I was and constrained to a specific schema, this might help, but @ou_ryperd has the correct answer!
While using a DBArrayList is not wrong, Groovy's idiom is to use the db result as a collection. I would suggest you use it that way directly from the db:
Map myMap = [:]
dbhandle.eachRow("select fieldSomeID, fieldSomeVal from yourTable;") { row ->
    // build the map directly from the result set, stripping single quotes
    myMap[row.fieldSomeID] = row.fieldSomeVal.replaceAll(/'/, '')
}

Strings concatenation in Spark SQL query

I'm experimenting with Spark and Spark SQL and I need to concatenate a value at the beginning of a string field that I retrieve as output from a select (with a join) like the following:
val result = sim.as('s)
.join(
event.as('e),
Inner,
Option("s.codeA".attr === "e.codeA".attr))
.select("1"+"s.codeA".attr, "e.name".attr)
Let's say my tables contain:
sim:
codeA,codeB
0001,abcd
0002,efgh
events:
codeA,name
0001,freddie
0002,mercury
And I would want as output:
10001,freddie
10002,mercury
In SQL or HiveQL I know I have the concat function available, but it seems Spark SQL doesn't support this feature. Can somebody suggest a workaround for my issue?
Thank you.
Note:
I'm using Language Integrated Queries, but I could just as well use a "standard" Spark SQL query, depending on the eventual solution.
The output you add at the end does not seem to be part of your selection or your SQL logic, if I understand correctly. Why don't you proceed by formatting the output stream as a further step?
val results = sqlContext.sql("SELECT s.codeA, e.code FROM foobar")
results.map(t => ("1" + t(0), t(1))).collect()
It's relatively easy to implement new Expression types directly in your project. Here's what I'm using:
case class Concat(children: Expression*) extends Expression {
  override type EvaluatedType = String
  override def foldable: Boolean = children.forall(_.foldable)
  def nullable: Boolean = children.exists(_.nullable)
  def dataType: DataType = StringType
  def eval(input: Row = null): EvaluatedType = {
    children.map(_.eval(input)).mkString
  }
}
val result = sim.as('s)
.join(
event.as('e),
Inner,
Option("s.codeA".attr === "e.codeA".attr))
.select(Concat("1", "s.codeA".attr), "e.name".attr)

Working with url-encoded values in BigQuery

I work with gzipped log files which contain url-encoded columns. (a space character is encoded as "%20", etc).
My plan was to import these files directly from Google Cloud Storage into BigQuery.
I did not find any option in the Load config to automatically decode values during the import.
I guess you wouldn't advise using a series of REGEXP_REPLACE in all my queries.
Any idea that would avoid parsing all the logs and escaping all these characters before importing them into BigQuery (which would be dangerous if one of them is the separator)?
The accepted answer is for Legacy SQL.
For Standard SQL:
#standardSQL
CREATE TEMPORARY FUNCTION DECODE_URI_COMPONENT(path STRING)
RETURNS STRING
LANGUAGE js AS """
  if (path == null) return null;
  try {
    return decodeURIComponent(path);
  } catch (e) {
    return path;
  }
""";
WITH source AS (SELECT "/work.json?myfield=R%C3%A9gions%2CSport" AS path)
SELECT DECODE_URI_COMPONENT(REGEXP_EXTRACT(path, r"[?&]myfield=([^&]+)")) AS myfield FROM source
This returns:
myfield
---------------
Régions,Sport
Most likely you already ended up with something like below :o)
SELECT url FROM
js(
  (SELECT url FROM
    (SELECT 'http://example.com/query?q=my%20query%20string' AS url),
    (SELECT 'http://example.com/query?q=your%20query%20string' AS url),
    (SELECT 'http://example.com/query?q=his%20query%20string' AS url)
  ),
  // Input columns.
  url,
  // Output schema.
  "[{name: 'url', type: 'string'}]",
  // The function.
  "function(r, emit) {
    var url = decodeURI(r.url);
    emit({url: url});
  }"
)
https://cloud.google.com/bigquery/user-defined-functions