Original Question - Transform Array into columns in BigQuery
The solution suggested in the original question works well when you want to extract the same information from every array element. In my case, however, the information I want to extract can differ from element to element. For example, in the original question the third array element has no jsonPayload; it has nameValuePairs instead. If I use PIVOT there, unnecessary fields get created. How can I avoid them? I know I can use EXCEPT, but I don't think that is a good solution: if I have to pick different fields from each array element it quickly becomes a mess, since the payloads array can hold 10+ payloads.
SQL -
select * from (
select
json_value(payload,'$.pool') as pool,
json_value(payloadArr, '$.name') as name,
json_value(payloadArr, '$.fullpath') as fullPath,
json_value(payloadArr, '$.jsonPayload.body') as payload,
json_value(payloadArr, '$.nameValuePairs.data.one') as nv,
from table t
, unnest(json_extract_array(payload, '$.payloads')) payloadArr
)
pivot (any_value(fullPath) as fullPath , any_value(payload) as payload, any_value(nv) as nv for name in ('request', 'response', 'attributes') )
Use below
select * from (
select
json_value(payload,'$.pool') as pool,
json_value(payloadArr, '$.name') as name,
json_value(payloadArr, '$.fullpath') as fullPath,
coalesce(
json_value(payloadArr, '$.jsonPayload.body'),
json_value(payloadArr, '$.nameValuePairs.data.one')
) as payload,
from table t
, unnest(json_extract_array(payload, '$.payloads')) payloadArr
)
pivot (any_value(fullPath) as fullPath , any_value(payload) as payload for name in ('request', 'response', 'attributes') )
with output
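The coalesce-then-pivot logic of the query above can be checked outside BigQuery with a small Python sketch. This is only an illustration under assumptions: the sample document, the helper name pivot_payloads and the field values are hypothetical, and the column names follow the alias_value pattern (for example payload_request) that PIVOT produces for aliased aggregates.

```python
import json


def pivot_payloads(payload_json, names=("request", "response", "attributes")):
    """Mimic COALESCE + PIVOT for the `payload` JSON string of a single row."""
    doc = json.loads(payload_json)
    row = {"pool": doc.get("pool")}
    # Pre-fill every pivoted column with None: PIVOT yields NULL for absent names.
    for name in names:
        row[f"fullPath_{name}"] = None
        row[f"payload_{name}"] = None
    for element in doc.get("payloads", []):
        name = element.get("name")
        if name not in names:
            continue  # values not listed in the IN (...) clause are ignored
        body = element.get("jsonPayload", {}).get("body")
        if body is None:
            # Fall back to the second location, like the second COALESCE argument.
            body = element.get("nameValuePairs", {}).get("data", {}).get("one")
        row[f"fullPath_{name}"] = element.get("fullpath")
        row[f"payload_{name}"] = body
    return row


# Hypothetical sample shaped like the question's data (not taken from it).
sample = json.dumps({
    "pool": "p1",
    "payloads": [
        {"name": "request", "fullpath": "/req", "jsonPayload": {"body": "req-body"}},
        {"name": "response", "fullpath": "/res", "jsonPayload": {"body": "res-body"}},
        {"name": "attributes", "fullpath": "/attr",
         "nameValuePairs": {"data": {"one": "attr-one"}}},
    ],
})
result = pivot_payloads(sample)
print(result)
```

Because the two source paths are merged into one payload value before pivoting, each name contributes a single payload column and no extra nv_* columns appear.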
I'm working on building a follow network from GitHub's data available on Google BigQuery, e.g.: https://bigquery.cloud.google.com/table/githubarchive:day.20210606
The key data is contained in the "payload" field, STRING type. I managed to unnest the data contained in that field and convert it to an array, but how can I get the last element?
Here is what I have so far...
select type,
array(select trim(val) from unnest(split(trim(payload, '[]'))) val) payload
from `githubarchive.day.20210606`
where type = 'MemberEvent'
Which outputs:
How can I get only the last element, "Action":"added"} ?
I know that
select array_reverse(your_array)[offset(0)]
should do the trick, however I'm unsure how to combine that in my code. I've been trying different options without success, for example:
with payload as ( select array(select trim(val) from unnest(split(trim(payload, '[]'))) val) payload from `githubarchive.day.20210606`)
select type, ARRAY_REVERSE(payload)[ORDINAL(1)]
from `githubarchive.day.20210606` where type = 'MemberEvent'
The desired output should look like:
To get last element in array you can use below approach
select array_reverse(your_array)[offset(0)]
I'm unsure how to combine that in my code
select type, array_reverse(array(
select trim(val)
from unnest(split(trim(payload, '[]'))) val
))[offset(0)]
from `githubarchive.day.20210606`
where type = 'MemberEvent'
There is a solution without reversing the array.
SELECT event[OFFSET(ARRAY_LENGTH(event) - 1)]
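To see what the split-and-take-last approach actually returns, here is a minimal Python sketch that mirrors it. The sample payload and the helper names last_piece and action_of are assumptions made for illustration; the second helper shows the alternative of parsing the JSON rather than splitting on commas.

```python
import json


def last_piece(payload: str) -> str:
    """Mirror split(trim(payload, '[]')) and then take the last element."""
    pieces = [part.strip() for part in payload.strip("[]").split(",")]
    # Equivalent to event[OFFSET(ARRAY_LENGTH(event) - 1)] in the SQL above.
    return pieces[len(pieces) - 1]


def action_of(payload: str):
    """Alternative: parse the JSON and read the key directly."""
    return json.loads(payload).get("action")


# Hypothetical payload shaped like a MemberEvent payload string.
sample = '{"member":{"login":"octocat"},"action":"added"}'
print(last_piece(sample))  # the raw trailing fragment, closing brace included
print(action_of(sample))   # just the value
```

Splitting on commas returns a fragment of JSON text rather than a value, and it depends on the wanted key being last, so parsing the document is usually the sturdier choice when the payload is valid JSON.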
I'm currently working on creating JSON on SQL Server 2016. For this, I use the FOR JSON clause.
SELECT TOP 2
'12.00' AS [time]
,GUID AS [ID]
,'action value' AS [EVENT.ACTION]
,'category value' AS [EVENT.CATEGORY]
,'username' AS [user.name]
FROM TABLE_NAME
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
This code produces:
{"time":"12.00","ID":"16AE8C15-084C-4C84-8F5D-0000193F8E74","EVENT":{"ACTION":"action value","CATEGORY":"category value"},"user":{"name":"username"}},{"time":"12.00","ID":"D5667AF4-5922-4D30-9C8A-00001AB928F6","EVENT":{"ACTION":"action value","CATEGORY":"category value"},"user":{"name":"username"}}
The problem is, every object gets displayed on line 1, but I would like to have one line per object.
This would look like this:
{"time":"12.00","ID":"16AE8C15-084C-4C84-8F5D-0000193F8E74","EVENT":{"ACTION":"action value","CATEGORY":"category value"},"user":{"name":"username"}},
{"time":"12.00","ID":"D5667AF4-5922-4D30-9C8A-00001AB928F6","EVENT":{"ACTION":"action value","CATEGORY":"category value"},"user":{"name":"username"}}
I have not found any snippets to do this. How can I create such a
If you need to build a separate JSON object for each row, you may try the following statement:
SELECT TOP 2
(
SELECT
'12.00' AS [time]
,GUID AS [ID]
,'action value' AS [EVENT.ACTION]
,'category value' AS [EVENT.CATEGORY]
,'username' AS [user.name]
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
) AS JsonData
FROM TABLE_NAME
But if the expected output is one JSON array (with or without the array wrapper) for the entire table, the format of that output is controlled by the FOR JSON clause itself. The visual representation of the output (one line per JSON object) should be produced in the presentation layer. As an additional option in T-SQL, a small string manipulation is a possible solution:
SELECT REPLACE (
(
SELECT TOP 2
'12.00' AS [time]
,GUID AS [ID]
,'action value' AS [EVENT.ACTION]
,'category value' AS [EVENT.CATEGORY]
,'username' AS [user.name]
FROM TABLE_NAME
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
),
'},{',
'},' + CHAR(10) + '{'
)
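If the formatting is done in the presentation layer instead, one way to sketch it is to serialize each row separately and join the results with a comma and a newline. The Python below is only an illustration: the row data is copied from the question's output, and the helper name rows_to_json_lines is hypothetical.

```python
import json


def rows_to_json_lines(rows):
    """Serialize each row on its own line, separated by a comma and a newline."""
    return ",\n".join(json.dumps(row, separators=(",", ":")) for row in rows)


rows = [
    {"time": "12.00", "ID": "16AE8C15-084C-4C84-8F5D-0000193F8E74",
     "EVENT": {"ACTION": "action value", "CATEGORY": "category value"},
     "user": {"name": "username"}},
    {"time": "12.00", "ID": "D5667AF4-5922-4D30-9C8A-00001AB928F6",
     "EVENT": {"ACTION": "action value", "CATEGORY": "category value"},
     "user": {"name": "username"}},
]
output = rows_to_json_lines(rows)
print(output)
```

Serializing per row avoids one weakness of the REPLACE approach: a string value that happens to contain the text },{ would also get a line break inserted into it.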
I'm trying to extract two keys from every JSON object in an array of JSON objects (using legacy SQL).
Currently I am using the json_extract function:
json_extract(json_column , '$[1].X') AS X,
json_extract(json_column , '$[1].Y') AS Y,
How can I make it run on every JSON object in the JSON array column, and not just [1] (for example)?
An example json:
[
{"blabla":000,"X":1,"blabla":000,"blabla":000,"blabla":000,,"Y":"2"},
{"blabla":000,"X":3,"blabla":000,"blabla":000,"blabla":000,,"Y":"4"},
]
thanks in advance!
Update 2020: JSON_EXTRACT_ARRAY()
Now BigQuery supports JSON_EXTRACT_ARRAY():
https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions#json_extract_array
For example, to solve this particular question:
SELECT id
, ARRAY(
SELECT JSON_EXTRACT_SCALAR(x, '$.author.email')
FROM UNNEST(JSON_EXTRACT_ARRAY(payload, "$.commits"))x
) emails
FROM `githubarchive.day.20180830`
WHERE type='PushEvent'
AND id='8188163772'
Previous answer
Let's start with a similar problem - this is not a very convenient way to extract all emails from a json array:
SELECT id
, [ JSON_EXTRACT_SCALAR(JSON_EXTRACT(payload, '$.commits'), '$[0].author.email')
, JSON_EXTRACT_SCALAR(JSON_EXTRACT(payload, '$.commits'), '$[1].author.email')
, JSON_EXTRACT_SCALAR(JSON_EXTRACT(payload, '$.commits'), '$[2].author.email')
, JSON_EXTRACT_SCALAR(JSON_EXTRACT(payload, '$.commits'), '$[3].author.email')
] emails
FROM `githubarchive.day.20180830`
WHERE type='PushEvent'
AND id='8188163772'
The best way we have right now to deal with this is to use some JavaScript in a UDF to split a JSON array into a SQL array:
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return JSON.parse(json).map(x=>JSON.stringify(x));
""";
SELECT * EXCEPT(array_commits),
ARRAY(SELECT JSON_EXTRACT_SCALAR(x, '$.author.email') FROM UNNEST(array_commits) x) emails
FROM (
SELECT id
, json2array(JSON_EXTRACT(payload, '$.commits')) array_commits
FROM `githubarchive.day.20180830`
WHERE type='PushEvent'
AND id='8188163772'
)
May 1st, 2020 Update
A new function, JSON_EXTRACT_ARRAY, has just been added to the list of JSON functions. It lets you extract the contents of a JSON document as a string array.
So the CUSTOM_JSON_EXTRACT UDF used further down can be replaced with the built-in JSON_EXTRACT_ARRAY function, as in the example below:
#standardSQL
SELECT
JSON_EXTRACT_SCALAR(json , '$.X') AS X,
JSON_EXTRACT_SCALAR(json , '$.Y') AS Y
FROM t, UNNEST(JSON_EXTRACT_ARRAY(json_column , '$')) json
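As a way to sanity-check the expected rows, the same unnest-then-extract idea can be sketched in Python. The helper name extract_xy is hypothetical, and the sample array is the cleaned-up example from this thread. One difference to keep in mind: JSON_EXTRACT_SCALAR returns STRING values, while this sketch keeps the parsed JSON types.

```python
import json


def extract_xy(json_column: str):
    """Return one (X, Y) pair per element of a JSON array string."""
    # json.loads plays the role of JSON_EXTRACT_ARRAY(json_column, '$') + UNNEST.
    return [(item.get("X"), item.get("Y")) for item in json.loads(json_column)]


json_column = """
[
  {"blabla1":1,"X":1,"blabla2":3,"blabla3":5,"blabla4":7,"Y":"2"},
  {"blabla1":2,"X":3,"blabla2":4,"blabla3":6,"blabla4":8,"Y":"4"}
]
"""
pairs = extract_xy(json_column)
for x, y in pairs:
    print(x, y)
```

Each array element becomes its own output row, which is what the UNNEST in the query above does.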
==============
The example below is for BigQuery Standard SQL. It keeps you close to the standard way of working with JSONPath and needs no extra manipulation: you simply call the CUSTOM_JSON_EXTRACT(json, json_path) function.
#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
return jsonPath(JSON.parse(json), json_path);
"""
OPTIONS (
library="gs://your_bucket/jsonpath-0.8.0.js"
);
WITH t AS (
SELECT '''
[
{"blabla1":1,"X":1,"blabla2":3,"blabla3":5,"blabla4":7,"Y":"2"},
{"blabla1":2,"X":3,"blabla2":4,"blabla3":6,"blabla4":8,"Y":"4"}
]
''' AS json_column
)
SELECT
CUSTOM_JSON_EXTRACT(json_column , '$[*].X') AS X,
CUSTOM_JSON_EXTRACT(json_column , '$[*].Y') AS Y
FROM t
result will be
Row X Y
1 1 2
3 4
Note: to overcome current BigQuery's "limitation" for JsonPath, above solution uses custom function along with external library - jsonpath-0.8.0.js that can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and uploaded to Google Cloud Storage - gs://your_bucket/jsonpath-0.8.0.js
Just re-read Felipe's answer - for his example above solution will look like below (just as FYI)
SELECT
id,
CUSTOM_JSON_EXTRACT(payload, '$.commits[*].author.email') emails
FROM `githubarchive.day.20180830`
WHERE type='PushEvent'
AND id='8188163772'
I need help extracting the data below from XML messages. I have a table that stores the XML message in a CLOB column. I am trying the query below, but it does not return any data. I need to extract all the values from the XML message.
<iORDERS:iORDERS xmlns:iORDERS="urn:iORDERS-abcdonline-com:Integration:v1">
<ORDER_NOTIFY>
<MESSAGE_DATETIME>2017-06-13T12:20:51+10:00</MESSAGE_DATETIME>
<MESSAGE_SEQ>1</MESSAGE_SEQ>
<MESSAGE_TYPE>PLACED</MESSAGE_TYPE>
<ORDER_HEAD>
<ORDER_ID>1111</ORDER_ID>
<DROP_SHIP_ORDER_NO></DROP_SHIP_ORDER_NO>
<CUSTOMER_ORDER_NO>22222</CUSTOMER_ORDER_NO>
<DISPATCH_LOCATION>
<SKU>2323234</SKU>
<UPC>4549432533626</UPC>
<REQUESTED_QTY>1</REQUESTED_QTY>
<DISPATCH_ASSIGNMENT>7777</DISPATCH_ASSIGNMENT>
<PROVIDER_ID>100</PROVIDER_ID>
<PKG_TYPE>SAT</PKG_TYPE>
</DISPATCH_LOCATION>
</ORDER_HEAD>
</ORDER_NOTIFY>
</iORDERS:iORDERS>
query :
select wor.batch_no,wor.web_service_no,x.*
from web_orders wo
cross join XMLTABLE (
XMLNAMESPACES(DEFAULT 'urn:iORDERS-abcdonline-com:Integration:v1'),
'iORDERS/ORDER_NOTIFY/ORDER_HEAD/DISPATCH_LOCATION'
passing xmltype(wo.xml_message)
columns
MESSAGE_TYPE varchar(120) path './../../../MESSAGE_TYPE') x;
You need to provide the named namespace identifier rather than a default, and your column path is going up one too many levels:
select wo.batch_no,wo.web_service_no,x.*
from web_orders wo
cross join XMLTABLE (
XMLNAMESPACES('urn:iORDERS-abcdonline-com:Integration:v1' as "iORDERS"),
'iORDERS:iORDERS/ORDER_NOTIFY/ORDER_HEAD/DISPATCH_LOCATION'
passing xmltype(wo.xml_message)
columns
MESSAGE_TYPE varchar(120) path './../../MESSAGE_TYPE') x;
BATCH_NO WEB_SERVICE_NO MESSAGE_TYPE
---------- -------------- ------------------------------------------------------------------------------------------------------------------------
1 2 PLACED
Presumably you're planning on getting more information than that from the XML, and/or expect to have multiple nodes; otherwise, to just get the message type you could simplify to:
select wo.batch_no,wo.web_service_no,x.*
from web_orders wo
cross join XMLTABLE (
XMLNAMESPACES('urn:iORDERS-abcdonline-com:Integration:v1' as "iORDERS"),
'iORDERS:iORDERS/ORDER_NOTIFY'
passing xmltype(wo.xml_message)
columns
MESSAGE_TYPE varchar(120) path 'MESSAGE_TYPE') x;
or even, with a single node:
select wo.batch_no,wo.web_service_no,XMLQuery(
'declare namespace iORDERS="urn:iORDERS-abcdonline-com:Integration:v1"; (: :)
iORDERS:iORDERS/ORDER_NOTIFY/MESSAGE_TYPE/text()'
passing xmltype(wo.xml_message)
returning content).getStringVal() as message_type
from web_orders wo;
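The namespace point can be illustrated outside Oracle with Python's standard xml.etree.ElementTree. This is a sketch under assumptions: the XML is a trimmed copy of the message in the question, and the helper name dispatch_rows and the output keys are hypothetical. It shows that only the root element is namespace-qualified, while its children are plain names.

```python
import xml.etree.ElementTree as ET

NS = "urn:iORDERS-abcdonline-com:Integration:v1"

# Trimmed copy of the message from the question.
XML_MESSAGE = """<iORDERS:iORDERS xmlns:iORDERS="urn:iORDERS-abcdonline-com:Integration:v1">
  <ORDER_NOTIFY>
    <MESSAGE_TYPE>PLACED</MESSAGE_TYPE>
    <ORDER_HEAD>
      <ORDER_ID>1111</ORDER_ID>
      <DISPATCH_LOCATION>
        <SKU>2323234</SKU>
        <REQUESTED_QTY>1</REQUESTED_QTY>
      </DISPATCH_LOCATION>
    </ORDER_HEAD>
  </ORDER_NOTIFY>
</iORDERS:iORDERS>"""


def dispatch_rows(xml_text):
    """Return one dict per DISPATCH_LOCATION, carrying the parent MESSAGE_TYPE."""
    root = ET.fromstring(xml_text)
    # The prefix applies to the root only; child elements are in no namespace.
    if root.tag != "{%s}iORDERS" % NS:
        raise ValueError("unexpected root element: %s" % root.tag)
    rows = []
    for notify in root.findall("ORDER_NOTIFY"):
        message_type = notify.findtext("MESSAGE_TYPE")
        for location in notify.findall("ORDER_HEAD/DISPATCH_LOCATION"):
            rows.append({
                "message_type": message_type,
                "sku": location.findtext("SKU"),
                "requested_qty": location.findtext("REQUESTED_QTY"),
            })
    return rows


rows = dispatch_rows(XML_MESSAGE)
print(rows)
```

Here the message type is read at the ORDER_NOTIFY level and carried down, which corresponds to the ../../MESSAGE_TYPE path in the XMLTABLE version: two levels up from DISPATCH_LOCATION, not three.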
Is there an easy way to do URL decoding within the BigQuery query language? I'm working with a table that has a column containing URL-encoded strings in some values. For example:
http://xyz.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
I extract the "url" parameter like so:
SELECT REGEXP_EXTRACT(column_name, "url=([^&]+)") as url
from [mydataset.mytable]
which gives me:
http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345
What I would like to do is something like:
SELECT URL_DECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) as url
from [mydataset.mytable]
thereby returning:
http://www.example.com/hello?v=12345
I would like to avoid using multiple REGEXP_REPLACE() statements (replacing %20, %3A, etc...) if possible.
Ideas?
The query below builds on @sigpwned's answer, slightly refactored and wrapped in a SQL UDF (which does not have the limitations of a JS UDF, so it is safe to use):
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
SELECT
column_name,
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url
FROM `project.dataset.table`
can be tested with example from question as below
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
WITH `project.dataset.table` AS (
SELECT 'http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz' column_name
)
SELECT
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url,
column_name
FROM `project.dataset.table`
with result
Row url column_name
1 http://www.example.com/hello?v=12345 http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
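For cross-checking the UDF's results, the same token-by-token decoding can be sketched in Python and compared with the standard library's urllib.parse.unquote. The helper name url_decode is hypothetical; the regular expression is the one used in the UDF above.

```python
import re
from urllib.parse import unquote

# Same pattern as the SQL UDF: a %XX escape, or a run of non-% characters.
TOKEN = re.compile(r"%[0-9a-fA-F]{2}|[^%]+")


def url_decode(url: str) -> str:
    """Decode percent-escapes by rebuilding the byte string, as the UDF does."""
    parts = []
    for token in TOKEN.findall(url):
        if token.startswith("%"):
            parts.append(bytes.fromhex(token[1:]))  # like FROM_HEX(SUBSTR(y, 2))
        else:
            parts.append(token.encode("utf-8"))     # like CAST(y AS BYTES)
    # Invalid UTF-8 becomes U+FFFD, similar to SAFE_CONVERT_BYTES_TO_STRING.
    return b"".join(parts).decode("utf-8", errors="replace")


encoded = "http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345"
print(url_decode(encoded))
print(url_decode(encoded) == unquote(encoded))
```

Joining bytes before converting to a string is what makes multi-byte sequences such as %c5%82 decode correctly. One behaviour of this sketch worth knowing: a % that is not followed by two hex digits matches neither branch of the pattern and is dropped, whereas unquote leaves it in place.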
Update: a further optimized SQL UDF
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT STRING_AGG(
IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'),
SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), ''
ORDER BY i
)
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
WITH OFFSET AS i
));
It's a good feature request, but currently there is no built in BigQuery function that provides URL decoding.
One more workaround is using a user-defined function.
#standardSQL
CREATE TEMPORARY FUNCTION URL_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
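try {
// decodeURIComponent also decodes reserved characters such as %3A and %2F; decodeURI leaves them encoded
return decodeURIComponent(enc);
} catch (e) { return null }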
""";
SELECT ven_session,
URL_DECODE(REGEXP_EXTRACT(para,r'&kw=(\w|[^&]*)')) AS q
FROM raas_system.weblog_20170327
WHERE para like '%&kw=%'
LIMIT 10
I agree with everyone here that URLDECODE should be a native function. However, until that happens, it is possible to write a "native" URLDECODE:
SELECT id, SAFE_CONVERT_BYTES_TO_STRING(ARRAY_TO_STRING(ps, b'')) FROM (SELECT
id,
ARRAY_AGG(CASE
WHEN REGEXP_CONTAINS(y, r"^%") THEN FROM_HEX(SUBSTR(y, 2))
ELSE CAST(y AS bytes)
END ORDER BY i) AS ps
FROM (SELECT x AS id, REGEXP_EXTRACT_ALL(x, r"%[0-9a-fA-F]{2}|[^%]+") AS element FROM UNNEST(ARRAY['domodossola%e2%80%93locarno railway', 'gabu%c5%82t%c3%b3w']) AS x) AS x
CROSS JOIN UNNEST(x.element) AS y WITH OFFSET AS i GROUP BY id);
In this example, I've tried and tested the implementation with a couple of percent-encoded page names from Wikipedia as the input. It should work with your input, too.
Obviously, this is extremely unwieldy! For that reason, I'd suggest building a materialized join table, or wrapping this in a view, rather than using this expression "naked" in your query. However, it does appear to get the job done, and it doesn't hit the UDF limits.
EDIT: @MikhailBerylyant's post has wrapped this cumbersome implementation into a nice, tidy little SQL UDF. That's a much better way to handle this!