Extract information from a JSON string in BigQuery - google-bigquery

I am storing a table in BigQuery with the results of a classification algorithm. The table schema is INT, STRING and looks something like this:
ID      Output
1001    {'Apple Cider': 0.7, 'Coffee' : 0.2, 'Juice' : 0.1}
1002    {'Black Coffee':0.9, 'Tea':0.1}
The problem is how to fetch the first (or second, or nth) element of each string together with its score. JSON_EXTRACT does not seem applicable here, and it could most likely be done with JavaScript; I was wondering what an elegant solution would look like.

Consider the approach below:
select ID,
  trim(split(kv, ':')[offset(0)], " '") element,
  cast(split(kv, ':')[offset(1)] as float64) score,
  element_position
from `project.dataset.table` t,
unnest(regexp_extract_all(trim(Output, '{}'), r"'[^':']+'\s?:\s?[^,]+")) kv with offset as element_position
If applied to the sample data in your question, the output is:

ID      element         score   element_position
1001    Apple Cider     0.7     0
1001    Coffee          0.2     1
1001    Juice           0.1     2
1002    Black Coffee    0.9     0
1002    Tea             0.1     1
Note: you can use a less verbose unnest statement if you wish:
unnest(split(trim(Output, '{}'))) kv with offset as element_position
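To make the regular expression in the main query concrete, here is a minimal standalone sketch that runs it against the first sample row as a literal:

-- the regex carves the trimmed string into key:value fragments
select regexp_extract_all(
  trim("{'Apple Cider': 0.7, 'Coffee' : 0.2, 'Juice' : 0.1}", '{}'),
  r"'[^':']+'\s?:\s?[^,]+") kv_pairs
-- kv_pairs: ["'Apple Cider': 0.7", "'Coffee' : 0.2", "'Juice' : 0.1"]

The split/trim/cast steps in the main query then separate each fragment into its element name and numeric score.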

Related

pyspark hive sql convert array(map(varchar, varchar)) to string by rows

I would like to transform a column of
array(map(varchar, varchar))
to string, as rows of a table, on a Presto DB via PySpark Hive SQL, programmatically from a Jupyter notebook (Python 3).
Example:

user_id    sport_ids
'aca'      [ {'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'} ]

Expected results:

user_id    sport_ids
'aca'      '5818'
'aca'      '6712'
'aca'      '1065'
I have tried
sql_q= """
select distinct, user_id, transform(sport_ids, x -> element_at(x, 'sport_id')
from tab """
spark.sql(sql_q)
but got error:
'->' cannot be resolved
I have also tried
sql_q= """
select distinct, user_id, sport_ids
from tab"""
spark.sql(sql_q)
but got error:
org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column request_features[0] is map<string,string>;;
Did I miss something?
I have tried the following, but they were not helpful:
hive convert array<map<string, string>> to string
Extract map(varchar, array(varchar)) - Hive SQL
thanks
Let's try using higher-order functions to find the map values and explode them into individual rows:
from pyspark.sql.functions import explode, expr

df.withColumn('sport_ids', explode(expr("transform(sport_ids, x -> map_values(x)[0])"))).show()
+-------+---------+
|user_id|sport_ids|
+-------+---------+
| aca| 5818|
| aca| 6712|
| aca| 1065|
+-------+---------+
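If you want to stay with spark.sql as in your original attempt, here is a sketch of a corrected query (the stray comma after distinct was one problem; this version also avoids higher-order functions entirely, so it sidesteps the '->' error on older Spark versions, by exploding the array with LATERAL VIEW and indexing each map directly):

sql_q = """
select user_id, x['sport_id'] as sport_id
from tab
lateral view explode(sport_ids) t as x
"""
spark.sql(sql_q).show()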
You can process the JSON data (json_parse, cast to array of JSON, and json_extract_scalar; see the Presto documentation for more JSON functions) and flatten (unnest) on the Presto side:
-- sample data
WITH dataset(user_id, sport_ids) AS (
VALUES
('aca', '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]')
)
-- query
select user_id,
json_extract_scalar(record, '$.sport_id') sport_id
from dataset,
unnest(cast(json_parse(sport_ids) as array(json))) as t(record)
Output:

user_id    sport_id
aca        5818
aca        6712
aca        1065

Extract complex json with random key field

I am trying to extract the following JSON into its own rows, like the table below, in a Presto query. The issue here is that the key / AV engine name is different for each row, and I am stuck on how to extract and iterate over the keys without knowing them in advance.
The JSON is the value of a table row:
{
  "Bkav": {
    "detected": false,
    "result": null
  },
  "Lionic": {
    "detected": true,
    "result": "Trojan.Generic.3611249"
  },
  ...
AV Engine Name    Detected Virus    Result
Bkav              false             null
Lionic            true              Trojan.Generic.3611249
I have tried to use json_extract, following the documentation here: https://teradata.github.io/presto/docs/141t/functions/json.html, but there is no mention of extraction when we don't know the key. I am trying to find a solution that works in both Presto and Hive; is there a common query that is applicable to both?
You can cast your JSON to map(varchar, json) and process it with unnest to flatten:
-- sample data
WITH dataset (json_str) AS (
VALUES (
'{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
)
)
--query
select k "AV Engine Name",
  json_extract_scalar(v, '$.detected') "Detected Virus",
  json_extract_scalar(v, '$.result') "Result"
from (
select cast(json_parse(json_str) as map(varchar, json)) as m
from dataset
)
cross join unnest (map_keys(m), map_values(m)) t(k, v)
Output:

AV Engine Name    Detected Virus    Result
Bkav              false             null
Lionic            true              Trojan.Generic.3611249
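As a side note, Presto can also UNNEST a map directly into key/value columns, so the map_keys/map_values pair is not strictly required; a sketch of the same query with that shortcut:

select k "AV Engine Name",
  json_extract_scalar(v, '$.detected') "Detected Virus",
  json_extract_scalar(v, '$.result') "Result"
from (
  select cast(json_parse(json_str) as map(varchar, json)) as m
  from dataset
)
cross join unnest(m) as t(k, v)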
The Presto query suggested by @Guru works, but for Hive there is no easy way. I had to:
1. Extract the JSON
2. Parse it with replace to remove some characters and brackets
3. Convert it back to a map, and repeat one more time to get the nested value out
SELECT
av_engine,
str_to_map(regexp_replace(engine_result, '\\}', ''),',', ':') AS output_map
FROM (
SELECT
str_to_map(regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{',''),'\\},', ':') AS key_val_map
FROM restricted_antispam.abuse_malware_scanning
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result
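For reference, str_to_map(text, delimiter1, delimiter2) splits entries on delimiter1 and each key/value pair on delimiter2; a minimal sketch of its behavior:

-- entries split on ',', key/value split on ':'
SELECT str_to_map('a:1,b:2', ',', ':');
-- returns {"a":"1", "b":"2"}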

How can I get the last element of an array? SQL BigQuery

I'm working on building a follow network from GitHub's data available on Google BigQuery, e.g.: https://bigquery.cloud.google.com/table/githubarchive:day.20210606
The key data is contained in the "payload" field (STRING type). I managed to unnest the data contained in that field and convert it to an array, but how can I get the last element?
Here is what I have so far...
select type,
array(select trim(val) from unnest(split(trim(payload, '[]'))) val) payload
from `githubarchive.day.20210606`
where type = 'MemberEvent'
Which outputs each payload as an array of string elements.
How can I get only the last element, "Action":"added"} ?
I know that
select array_reverse(your_array)[offset(0)]
should do the trick, however I'm unsure how to combine that in my code. I've been trying different options without success, for example:
with payload as ( select array(select trim(val) from unnest(split(trim(payload, '[]'))) val) payload from `githubarchive.day.20210606`)
select type, ARRAY_REVERSE(payload)[ORDINAL(1)]
from `githubarchive.day.20210606` where type = 'MemberEvent'
The desired output should contain the type together with only that last element, "Action":"added"}.
To get the last element in an array you can use the approach below:
select array_reverse(your_array)[offset(0)]
I'm unsure how to combine that in my code
select type, array_reverse(array(
select trim(val)
from unnest(split(trim(payload, '[]'))) val
))[offset(0)]
from `githubarchive.day.20210606`
where type = 'MemberEvent'
There is also a solution without reversing the array:
SELECT event[OFFSET(ARRAY_LENGTH(event) - 1)]
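Putting that together with the array built in the question (a sketch, using a payload_arr alias for clarity; ORDINAL is 1-based, so ARRAY_LENGTH itself points at the last element):

select type,
  payload_arr[ordinal(array_length(payload_arr))] as last_element
from (
  select type,
    array(select trim(val) from unnest(split(trim(payload, '[]'))) val) payload_arr
  from `githubarchive.day.20210606`
  where type = 'MemberEvent'
)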

Alias nested struct columns

How can I alias field1 as index and field2 as value?
The query gives me an error:
#standardsql
with q1 as (select 1 x, ARRAY<struct<id string, cd ARRAY<STRUCT<index STRING,value STRING>>>>
[struct('h1',[('1','a')
,('2','b')
])
,('h2',[('3','c')
,('4','d')
])] hits
)
Select * from q1
ORDER by x
Error: Array element type STRUCT<STRING, ARRAY<STRUCT<STRING, STRING>>> does not coerce to STRUCT<id STRING, cd ARRAY<STRUCT<index STRING, value STRING>>> at [5:26]
Thanks a lot for your time in responding
Cheers!
#standardsql
WITH q1 AS (
SELECT
1 AS x,
[
STRUCT('h1' AS id, [STRUCT('1' AS index, 'a' AS value), ('2','b')] AS cd),
STRUCT('h2', [STRUCT('3' AS index, 'c' AS value), ('4','d')] AS cd)
] AS hits
)
SELECT *
FROM q1
-- ORDER BY x
or below might be even more "readable"
#standardsql
WITH q1 AS (
SELECT
1 AS x,
[
STRUCT('h1' AS id, [STRUCT<index STRING, value STRING>('1', 'a'), ('2','b')] AS cd),
STRUCT('h2', [STRUCT<index STRING, value STRING>('3', 'c'), ('4','d')] AS cd)
] AS hits
)
SELECT *
FROM q1
-- ORDER BY x
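To confirm the aliases took effect, you can replace the final SELECT * FROM q1 in either variant with a query that unnests and references the fields by name (a quick sketch):

-- each row exposes the aliased fields id, index and value
SELECT x, h.id, c.index, c.value
FROM q1, UNNEST(hits) h, UNNEST(h.cd) c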
When I simulate data in BigQuery using Standard SQL, I usually try to name all variables and aliases wherever possible. For instance, your data works if you build it like so:
with q1 as (
  select 1 x,
    ARRAY<struct<id string, cd ARRAY<STRUCT<index STRING, value STRING>>>>[
      struct('h1' as id, [STRUCT('1' as index, 'a' as value), STRUCT('2' as index, 'b' as value)] as cd),
      STRUCT('h2', [STRUCT('3' as index, 'c' as value), STRUCT('4' as index, 'd' as value)] as cd)
    ] hits
)
select * from q1
order by x
Notice I've built structs and put the aliases inside them in order for this to work. If you remove the aliases and the structs it might not work; I found this to be rather intermittent, but if you fully describe the variables it works every time.
Also, as a recommendation, I try to build simulated data piece by piece. First I create the struct and test it to see if BigQuery accepts it; after the validator is green, I proceed to add more values. If you try to build everything at once you might find it a rather challenging task.

BigQuery URL decode

Is there an easy way to do URL decoding within the BigQuery query language? I'm working with a table that has a column containing URL-encoded strings in some values. For example:
http://xyz.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
I extract the "url" parameter like so:
SELECT REGEXP_EXTRACT(column_name, "url=([^&]+)") as url
from [mydataset.mytable]
which gives me:
http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345
What I would like to do is something like:
SELECT URL_DECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) as url
from [mydataset.mytable]
thereby returning:
http://www.example.com/hello?v=12345
I would like to avoid using multiple REGEXP_REPLACE() statements (replacing %20, %3A, etc...) if possible.
Ideas?
Below is built on top of @sigpwned's answer, but slightly refactored and wrapped in a SQL UDF (which does not have the limitations a JS UDF has, so it is safe to use):
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
SELECT
column_name,
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url
FROM `project.dataset.table`
It can be tested with the example from the question as below:
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
WITH `project.dataset.table` AS (
SELECT 'http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz' column_name
)
SELECT
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url,
column_name
FROM `project.dataset.table`
with the result:

Row    url                                     column_name
1      http://www.example.com/hello?v=12345    http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
Update: a further, quite optimized SQL UDF:
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT STRING_AGG(
IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'),
SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), ''
ORDER BY i
)
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
WITH OFFSET AS i
));
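A quick check of this optimized version with the value from the question (a sketch; the temp function above must be created in the same script):

SELECT URLDECODE('http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345') AS url
-- url: http://www.example.com/hello?v=12345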
It's a good feature request, but currently there is no built-in BigQuery function that provides URL decoding.
One more workaround is to use a user-defined function:
#standardSQL
CREATE TEMPORARY FUNCTION URL_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
try {
  // decodeURIComponent also decodes reserved characters such as %3A and %2F,
  // which decodeURI would leave encoded
  return decodeURIComponent(enc);
} catch (e) {
  return null;
}
""";
SELECT ven_session,
URL_DECODE(REGEXP_EXTRACT(para,r'&kw=(\w|[^&]*)')) AS q
FROM raas_system.weblog_20170327
WHERE para like '%&kw=%'
LIMIT 10
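As a sanity check, the UDF round-trips the encoded value from the question (a sketch):

SELECT URL_DECODE('http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345') AS url
-- returns http://www.example.com/hello?v=12345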
I agree with everyone here that URLDECODE should be a native function. However, until that happens, it is possible to write a "native" URLDECODE:
SELECT id, SAFE_CONVERT_BYTES_TO_STRING(ARRAY_TO_STRING(ps, b''))
FROM (
  SELECT
    id,
    ARRAY_AGG(CASE
      WHEN REGEXP_CONTAINS(y, r"^%") THEN FROM_HEX(SUBSTR(y, 2))
      ELSE CAST(y AS bytes)
    END ORDER BY i) AS ps
  FROM (
    SELECT x AS id, REGEXP_EXTRACT_ALL(x, r"%[0-9a-fA-F]{2}|[^%]+") AS element
    FROM UNNEST(ARRAY['domodossola%e2%80%93locarno railway', 'gabu%c5%82t%c3%b3w']) AS x
  ) AS x
  CROSS JOIN UNNEST(x.element) AS y WITH OFFSET AS i
  GROUP BY id
);
In this example, I've tried and tested the implementation with a couple of percent-encoded page names from Wikipedia as input (they decode to 'domodossola–locarno railway' and 'gabułtów'). It should work with your input, too.
Obviously, this is extremely unwieldy! For that reason, I'd suggest building a materialized join table, or wrapping this in a view, rather than using this expression "naked" in your query. However, it does appear to get the job done, and it doesn't hit the UDF limits.
EDIT: @MikhailBerylyant's answer above has wrapped this cumbersome implementation into a nice, tidy little SQL UDF. That's a much better way to handle this!