BigQuery extract repeated JSON - google-bigquery

I have the following data in one column in BigQuery:
{"id": "81", "type": ["{'id2': '12', 'type2': 'main'}", "{'id2': '15', 'type2': 'sub'}"]}
I would like to parse this and have 'id2' and 'type2' as nested fields. I tried JSON_VALUE_ARRAY(data, "$.type"), which correctly creates the nested rows, but I couldn't extract 'id2' and 'type2' from them. I believe the extra double quotes inside the list are the issue; how can I get past those?
UPDATE:
This is the format I would like to achieve.

Consider below approach
select
  json_value(json, '$.id') id,
  array(
    select as struct
      json_value(trim(type, '"'), '$.id2') as id2,
      json_value(trim(type, '"'), '$.type2') as type2
    from unnest(json_extract_array(json, '$.type')) type
  ) type
from your_table
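To try this without creating a table, you can inline the sample row from the question (the column is assumed to be named json, as in the query above):
-- inline the sample row from the question so the query can be run as-is
with your_table as (
  select '''{"id": "81", "type": ["{'id2': '12', 'type2': 'main'}", "{'id2': '15', 'type2': 'sub'}"]}''' as json
)
select
  json_value(json, '$.id') id,
  array(
    select as struct
      json_value(trim(type, '"'), '$.id2') as id2,
      json_value(trim(type, '"'), '$.type2') as type2
    from unnest(json_extract_array(json, '$.type')) type
  ) type
from your_table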

Related

How can I only get keys from an array in BigQuery

I have an event table in BigQuery with two columns: event_name and event_parameters. I need to extract the parameter keys.
I'm looking for something similar to
dict.keys()
from Python.
for example:
event_name | event_parameters
event_a    | {"from": "main_page","ID":"123"}
event_b    | {"ID":"242","custom_value":"true"}
output table:
event_name | event_parameters
event_a    | {"from","ID"}
event_b    | {"ID","custom_value"}
You can extract the keys from a json column using the following SQL code as a template:
WITH data AS (
  SELECT '{"from":"main_page","ID":"123"}' AS params
)
SELECT keys from (
  select
    params,
    array(select trim(split(kv, ':')[offset(0)]) from t.kv kv) as keys,
    array(
      select as struct
        trim(split(kv, ':')[offset(0)]) as key,
        trim(split(kv, ':')[offset(1)]) as value
      from t.kv kv
    ) as key_value
  from data,
  unnest([struct(split(translate(params, '{}"', '')) as kv)]) t
)
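For reference, here is the same template run against both example rows from the question, keeping just the keys column (assuming the second row's JSON closes with a brace):
WITH data AS (
  SELECT 'event_a' AS event_name, '{"from":"main_page","ID":"123"}' AS params UNION ALL
  SELECT 'event_b', '{"ID":"242","custom_value":"true"}'
)
SELECT event_name, keys from (
  select
    event_name,
    -- strip braces and quotes, split into key:value pairs, keep the part before ':'
    array(select trim(split(kv, ':')[offset(0)]) from t.kv kv) as keys
  from data,
  unnest([struct(split(translate(params, '{}"', '')) as kv)]) t
)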
Credit to @Mikhail Berlyant for providing the original code for the following solution:
BigQuery: extract keys from json object, convert json from object to key-value array

JSON_Extract from list of json string

I want to extract some values for particular keys from a table with a JSON string column, as below.
raw_data
...
{"label": "XXX", "lines":[{"amount":1000, "category": "A"}, {"amount":100, "category": "B"}, {"amount":10, "category": "C"}]}
...
I am expecting an outcome like
label | amount          | category
XXX   | [1000, 100, 10] | ['A', 'B', 'C']
I am using the following SQL query to achieve that:
select
  JSON_EXTRACT(raw_data, '$.lines[*].amount') AS amount,
  JSON_EXTRACT(raw_data, '$.lines[*].category') AS category,
  JSON_EXTRACT(raw_data, '$.label') AS label
from table
I can get a specific element of the list with [0], [1], etc., but the SQL does not work with [*]. I am getting the following error:
Invalid JSON path: '$.lines[*].amount'
Edit
I am using Presto
JSON path support in Presto is very limited, so you need to do some of the processing manually, for example with casts and array functions:
-- sample data
with dataset (raw_data) as (
  values '{"label": "XXX", "lines":[{"amount":1000, "category": "A"}, {"amount":100, "category": "B"}, {"amount":10, "category": "C"}]}'
)
-- query
select label,
       transform(lines, l -> l['amount']) amount,
       transform(lines, l -> l['category']) category
from (
  select JSON_EXTRACT(raw_data, '$.label') AS label,
         cast(JSON_EXTRACT(raw_data, '$.lines') as array(map(varchar, json))) lines
  from dataset
);
Output:
label | amount          | category
XXX   | [1000, 100, 10] | ["A", "B", "C"]
In Trino, JSON path support has been vastly improved, so you can do the following:
-- query
select JSON_EXTRACT(raw_data, '$.label') label,
       JSON_QUERY(raw_data, 'lax $.lines[*].amount' WITH ARRAY WRAPPER) amount,
       JSON_QUERY(raw_data, 'lax $.lines[*].category' WITH ARRAY WRAPPER) category
from dataset;
You can use json_table and json_arrayagg:
select json_extract(t.raw_data, '$.label'),
       (select json_arrayagg(t1.v)
        from json_table(t.raw_data, '$.lines[*]' columns (v int path '$.amount')) t1),
       (select json_arrayagg(t1.v)
        from json_table(t.raw_data, '$.lines[*]' columns (v text path '$.category')) t1)
from tbl t
I was able to get the expected output using unnest to flatten and array_agg to aggregate in Presto. Below is the SQL used:
WITH dataset AS (
  SELECT * FROM sf_73535794
)
SELECT raw_data.label,
       array_agg(t.lines.amount) AS amount,
       array_agg(t.lines.category) AS category
FROM dataset
CROSS JOIN UNNEST(raw_data.lines) AS t(lines)
GROUP BY 1
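For reference, if raw_data arrives as a plain JSON string rather than an already typed ROW (as sf_73535794 apparently has), a cast makes the same idea self-contained; the ROW field types below are assumptions based on the sample:
WITH dataset AS (
  SELECT CAST(
           json_parse('{"label": "XXX", "lines":[{"amount":1000, "category": "A"}, {"amount":100, "category": "B"}, {"amount":10, "category": "C"}]}')
           AS ROW(label varchar, lines ARRAY(ROW(amount integer, category varchar)))
         ) AS raw_data
)
SELECT raw_data.label,
       array_agg(t.lines.amount) AS amount,
       array_agg(t.lines.category) AS category
FROM dataset
CROSS JOIN UNNEST(raw_data.lines) AS t(lines)
GROUP BY 1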

Presto Build JSON Array with Different Data Types

My goal is to get a JSON array of varchar name, varchar age, and a LIST of books_read (array(varchar)) for EACH id
books_read has the following format: ["book1", "book2"]
Example Given Table:
id | name | age | books_read
1  | John | 21  | ["book1", "book2"]
Expected Output:
id | info
1  | [{"name":"John", "age":"21", "books_read":["book1", "book2"]}]
When I use the following query I get an error (All ARRAY elements must be the same type: row(varchar, varchar)) because books_read is not of type varchar like name and age.
select id,
       array_agg(CAST(MAP_FROM_ENTRIES(ARRAY[
         ('name', name),
         ('age', age),
         ('books_read', books_read)
       ]) AS JSON)) AS info
from tbl
group by id
Is there an alternative method that allows multiple types as input to the array?
I've also tried doing MAP_CONCAT(MAP_AGG(name), MAP_AGG(age), MULTIMAP_AGG(books_read)) but it also gives me an issue with the books_read column: Unexpected parameters for the function map_concat
Cast data to json before placing it into the map:
-- sample data
WITH dataset (id, name, age, books_read) AS (
  VALUES (1, 'John', 21, array['book1', 'book2'])
)
-- query
select id,
       cast(
         map(
           array['name', 'age', 'books_read'],
           array[cast(name as json), cast(age as json), cast(books_read as json)]
         ) as json
       ) info
from dataset
Output:
id | info
1  | {"age":21,"books_read":["book1","book2"],"name":"John"}

Extract complex json with random key field

I am trying to extract the following JSON into its own rows, like the table below, in a Presto query. The issue here is that the key (the AV engine name) is different for each entry, and I am stuck on how to extract and iterate over the keys without knowing them in advance.
The JSON is the value of a table row:
{
  "Bkav": {
    "detected": false,
    "result": null
  },
  "Lionic": {
    "detected": true,
    "result": "Trojan.Generic.3611249"
  },
  ...
}
AV Engine Name | Detected Virus | Result
Bkav           | false          | null
Lionic         | true           | Trojan.Generic.3611249
I have tried to use json_extract, following the documentation here https://teradata.github.io/presto/docs/141t/functions/json.html, but there is no mention of extraction when we don't know the key :( I am trying to find a solution that works in both Presto and Hive; is there a common query that is applicable to both?
You can cast your json to map(varchar, json) and process it with unnest to flatten:
-- sample data
WITH dataset (json_str) AS (
  VALUES (
    '{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
  )
)
-- query
select k "AV Engine Name",
       json_extract_scalar(v, '$.detected') "Detected Virus",
       json_extract_scalar(v, '$.result') "Result"
from (
  select cast(json_parse(json_str) as map(varchar, json)) as m
  from dataset
)
cross join unnest (map_keys(m), map_values(m)) t(k, v)
Output:
AV Engine Name | Detected Virus | Result
Bkav           | false          |
Lionic         | true           | Trojan.Generic.3611249
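An equivalent formulation, if you prefer a single unnest over zipping map_keys and map_values, uses map_entries (same sample data):
with dataset (json_str) as (
  values ('{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}')
)
select k "AV Engine Name",
       json_extract_scalar(v, '$.detected') "Detected Virus",
       json_extract_scalar(v, '$.result') "Result"
from (
  select cast(json_parse(json_str) as map(varchar, json)) as m
  from dataset
)
-- map_entries gives array(row(key, value)); unnest expands the row into the k and v columns
cross join unnest(map_entries(m)) t(k, v)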
The Presto query suggested by @Guru works, but for Hive there is no easy way. I had to:
extract the JSON,
parse it with replace to remove some characters and brackets,
then convert it back to a map, and repeat one more time to get the nested value out.
SELECT
  av_engine,
  str_to_map(regexp_replace(engine_result, '\\}', ''), ',', ':') AS output_map
FROM (
  SELECT
    str_to_map(
      regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{', ''),
      '\\},', ':'
    ) AS key_val_map
  FROM restricted_antispam.abuse_malware_scanning
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result
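To also get the detected/result columns from the Hive output above, you can index into output_map; a sketch that wraps the answer's query unchanged, assuming the string cleanup leaves the keys exactly as detected and result:
SELECT av_engine,
       output_map['detected'] AS detected,
       output_map['result']   AS result
FROM (
  -- the query from the answer above, unchanged
  SELECT
    av_engine,
    str_to_map(regexp_replace(engine_result, '\\}', ''), ',', ':') AS output_map
  FROM (
    SELECT str_to_map(
             regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{', ''),
             '\\},', ':'
           ) AS key_val_map
    FROM restricted_antispam.abuse_malware_scanning
  ) AS S
  LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result
) parsed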

SQL server query on json string for stats

I have this SQL Server database that holds contest participations. In the Participation table, I have various fields and a special one called ParticipationDetails, which is a varchar(MAX). This field is used to throw in all contest-specific data in JSON format. Example rows:
Id,ParticipationDetails
1,"{'Phone evening': '6546546541', 'Store': 'StoreABC', 'Math': '2', 'Age': '01/01/1951'}"
2,"{'Phone evening': '6546546542', 'Store': 'StoreABC', 'Math': '2', 'Age': '01/01/1952'}"
3,"{'Phone evening': '6546546543', 'Store': 'StoreXYZ', 'Math': '2', 'Age': '01/01/1953'}"
4,"{'Phone evening': '6546546544', 'Store': 'StoreABC', 'Math': '3', 'Age': '01/01/1954'}"
I'm trying to get a query running that will yield this result:
Store, Count
StoreABC, 3
StoreXYZ, 1
I used to run this query:
SELECT TOP (20) ParticipationDetails, COUNT(*) Count FROM Participation GROUP BY ParticipationDetails ORDER BY Count DESC
This works as long as I want counts of unique ParticipationDetails values. How can I change this to "sub-query" into my JSON strings? I've gotten to this query, but I'm kind of stuck here:
SELECT 'StoreABC' Store, Count(*) Count FROM Participation WHERE ParticipationDetails LIKE '%StoreABC%'
This query gets me the results I want for a specific store, but I want the store value to be "anything that was put in there".
Thanks for the help!
First of all, I suggest avoiding any JSON handling with T-SQL, since it is not natively supported. If you have an application layer, let it manage that kind of formatted data (e.g. the .NET Framework and non-MS frameworks have JSON serializers available).
However, you can convert your JSON strings using the function described in this link.
You can also write your own query that works on the strings directly. Something like the following:
SELECT
    T.Store,
    COUNT(*) AS [Count]
FROM
(
    SELECT
        STUFF(
            STUFF(ParticipationDetails, 1, CHARINDEX('"Store"', ParticipationDetails) + 9, ''),
            CHARINDEX('"Math"', STUFF(ParticipationDetails, 1, CHARINDEX('"Store"', ParticipationDetails) + 9, '')) - 3,
            LEN(STUFF(ParticipationDetails, 1, CHARINDEX('"Store"', ParticipationDetails) + 9, '')),
            ''
        ) AS Store
    FROM
        Participation
) AS T
GROUP BY
    T.Store
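Note that on SQL Server 2016 or later, and provided the stored strings are valid JSON (double-quoted keys and values, which the single-quoted sample rows are not), the built-in JSON_VALUE function gives a much simpler query; a minimal sketch:
-- requires SQL Server 2016+ and well-formed JSON in ParticipationDetails
SELECT
    JSON_VALUE(ParticipationDetails, '$.Store') AS Store,
    COUNT(*) AS [Count]
FROM Participation
GROUP BY JSON_VALUE(ParticipationDetails, '$.Store')
ORDER BY [Count] DESC;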