BigQuery Dynamic JSON attributes as columnar data - sql

I have a table with one of the columns containing JSON.
Col_A | Col_B | Col_C
------+-------+-------------------------------------------------------------------
1     | Abc1  | {"a": "a_val", "b": "b_val"}
2     | Abc2  | {"a": "a_val2", "c": "c_val2"}
3     | Abc3  | {"b": "b_val3", "c": "c_val3", "d": {"x": "x_val", "y": "y_val"}}
How can I put together BigQuery SQL to extract the attributes of the JSON as additional columns? I need to go only one level deep into the JSON, so the output should look like:
Col_A | Col_B | A      | B      | C      | D
------+-------+--------+--------+--------+------------------------------
1     | Abc1  | a_val  | b_val  |        |
2     | Abc2  | a_val2 |        | c_val2 |
3     | Abc3  |        | b_val3 | c_val3 | {"x": "x_val", "y": "y_val"}

Consider below approach
create temp function json_keys(input string) returns array<string> language js as """
  return Object.keys(JSON.parse(input));
""";

create temp function json_path(json string, json_path string)
returns string
language js
options (library = "gs://my-storage/jsonpath-0.8.0.js")
as """
  try {
    var parsed = JSON.parse(json);
    return JSON.stringify(jsonPath(parsed, json_path));
  } catch (e) {
    return null;
  }
""";
select * from (
  select t.* except(col_c), key, trim(json_path(col_c, '$.' || key), '"[]') value
  from your_table t,
  unnest(json_keys(col_c)) key
)
pivot (min(value) for key in ('a', 'b', 'c', 'd'))
If applied to the sample data in your question:
with your_table as (
  select 1 col_a, 'abc1' col_b, '{"a": "a_val", "b": "b_val"}' col_c union all
  select 2, 'abc2', '{"a": "a_val2", "c": "c_val2"}' union all
  select 3, 'abc3', '{"b": "b_val3", "c": "c_val3", "d": {"x": "x_val", "y": "y_val"}}'
)
the output is:

col_a | col_b | a      | b      | c      | d
------+-------+--------+--------+--------+---------------------------
1     | abc1  | a_val  | b_val  | null   | null
2     | abc2  | a_val2 | null   | c_val2 | null
3     | abc3  | null   | b_val3 | c_val3 | {"x":"x_val","y":"y_val"}
To use the above, you need to upload jsonpath-0.8.0.js (it can be downloaded at https://code.google.com/archive/p/jsonpath/downloads) into your GCS bucket gs://my-storage/.
As you can see, the above solution assumes you know the key names in advance.
Obviously, when you know the keys in advance you would simply use json_extract, as sketched below, but that would not work when you don't know the keys ahead of time!
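For reference, a minimal sketch of the known-keys variant against the sample your_table (json_extract_scalar unquotes scalar values; the nested object under "d" stays a JSON string):

select col_a, col_b,
  json_extract_scalar(col_c, '$.a') a,
  json_extract_scalar(col_c, '$.b') b,
  json_extract_scalar(col_c, '$.c') c,
  json_extract(col_c, '$.d') d
from your_table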
So, if you don't know the keys, the above solution can be used as a template for dynamically generating the query (with the real keys in place of for key in ('a', 'b', 'c', 'd')) to be executed with EXECUTE IMMEDIATE, for example:
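A minimal sketch of that dynamic variant (kv is a scratch temp table name made up here; the IN list is assembled from the keys actually present in the data):

create temp table kv as (
  select t.* except(col_c), key, trim(json_path(col_c, '$.' || key), '"[]') value
  from your_table t, unnest(json_keys(col_c)) key
);
execute immediate (select '''
  select * from kv
  pivot (min(value) for key in ("''' || string_agg(distinct key, '","') || '"))'
  from kv
);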

Related

Sum values in Athena table where column having key/value pair json

I have an Athena table with one column containing JSON key/value pairs.
Ex:
Select test_client, test_column from ABC;
test_client | test_column
------------+----------------------------------
john        | {"d":13, "e":210}
mark        | {"a":1,"b":10,"c":1}
john        | {"e":100,"a":110,"b":10, "d":10}
mark        | {"a":56,"c":11,"f":9, "e": 10}
And I need to sum the values per key for each client, returning something like the below (the output format doesn't matter; I just want the sums):
john: d: 23, e:310, a:110, b:10
mark: a:57, b:10, c:12, f:9, e:10
It is a combination of a few useful functions in Trino:
WITH example_table AS (
  SELECT 'john' as person, '{"d":13, "e":210}' as _json UNION ALL
  SELECT 'mark', '{"a":1,"b":10,"c":1}' UNION ALL
  SELECT 'john', '{"e":100,"a":110,"b":10, "d":10}' UNION ALL
  SELECT 'mark', '{"a":56,"c":11,"f":9, "e": 10}'
)
SELECT person,
  reduce(
    array_agg(CAST(json_parse(_json) AS MAP(VARCHAR, INTEGER))),
    MAP(ARRAY['a'], ARRAY[0]),
    (s, x) -> map_zip_with(
      s, x,
      (k, v1, v2) -> if(v1 is null, 0, v1) + if(v2 is null, 0, v2)
    ),
    s -> s
  ) AS sums
FROM example_table
GROUP BY person
json_parse - Parses the string to a JSON object
CAST ... AS MAP... - Creates a MAP from the JSON object
array_agg - Aggregates the maps for each Person based on the group by
reduce - steps through the aggregated array and reduce it to a single map
map_zip_with - applies a function on each similar key in two maps
if(... is null ...) - puts 0 instead of null if the key is not present
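For instance, map_zip_with on its own merges two maps key by key; a toy example (the expected result is shown in the comment):

SELECT map_zip_with(
    MAP(ARRAY['a', 'b'], ARRAY[1, 2]),    -- left map
    MAP(ARRAY['b', 'c'], ARRAY[10, 20]),  -- right map
    (k, v1, v2) -> if(v1 is null, 0, v1) + if(v2 is null, 0, v2)
) AS merged
-- merged => {a=1, b=12, c=20}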

JSON_Extract from list of json string

I want to extract the values of particular keys from a table with a JSON string column, as below.
raw_data
...
{"label": "XXX", "lines":[{"amount":1000, "category": "A"}, {"amount":100, "category": "B"}, {"amount":10, "category": "C"}]}
...
I am expecting an outcome like:

label | amount          | category
------+-----------------+-----------------
XXX   | [1000, 100, 10] | ['A', 'B', 'C']
I am using the following SQL query to achieve that:

select
  JSON_EXTRACT(raw_data, '$.lines[*].amount') AS amount,
  JSON_EXTRACT(raw_data, '$.lines[*].category') AS category,
  JSON_EXTRACT(raw_data, '$.label') AS label
from table
I can get a specific element of the list with [0], [1], etc., but the SQL does not work with [*]. I am getting the following error:
Invalid JSON path: '$.lines[*].amount'
Edit: I am using Presto.
JSON path support in Presto is very limited, so you need to do some processing manually, for example with casts and array functions:
-- sample data
with dataset (raw_data) as (
  values '{"label": "XXX", "lines":[{"amount":1000, "category": "A"}, {"amount":100, "category": "B"}, {"amount":10, "category": "C"}]}'
)
-- query
select label,
  transform(lines, l -> l['amount']) amount,
  transform(lines, l -> l['category']) category
from (
  select JSON_EXTRACT(raw_data, '$.label') AS label,
    cast(JSON_EXTRACT(raw_data, '$.lines') as array(map(varchar, json))) lines
  from dataset
);
Output:

label | amount          | category
------+-----------------+-----------------
XXX   | [1000, 100, 10] | ["A", "B", "C"]
In Trino, JSON path support was vastly improved, so you can do the following:
-- query
select JSON_EXTRACT(raw_data, '$.label') label,
  JSON_QUERY(raw_data, 'lax $.lines[*].amount' WITH ARRAY WRAPPER) amount,
  JSON_QUERY(raw_data, 'lax $.lines[*].category' WITH ARRAY WRAPPER) category
from dataset;
You can also use json_table and json_arrayagg (note that these functions are available in databases such as MySQL 8.0 and Oracle, not in Presto/Trino):
select json_extract(t.raw_data, '$.label'),
  (select json_arrayagg(t1.v)
   from json_table(t.raw_data, '$.lines[*]' columns (v int path '$.amount')) t1),
  (select json_arrayagg(t1.v)
   from json_table(t.raw_data, '$.lines[*]' columns (v text path '$.category')) t1)
from tbl t
I was able to get the expected output using unnest to flatten and array_agg to aggregate in Presto. Below is the SQL used and output generated:
WITH dataset AS (
  SELECT * from sf_73535794
)
SELECT raw_data.label,
  array_agg(t.lines.amount) as amount,
  array_agg(t.lines.category) as category
FROM dataset
CROSS JOIN UNNEST(raw_data.lines) as t(lines)
GROUP BY 1
Output:

label | amount          | category
------+-----------------+--------------
XXX   | [1000, 100, 10] | [A, B, C]
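Note that this version assumes raw_data was ingested as a ROW (struct) type, so raw_data.label and raw_data.lines can be dereferenced directly. If raw_data is a plain JSON string, as in the original question, a cast along these lines (a sketch; field names assumed) would be needed first:

select cast(json_parse(raw_data) as
  row(label varchar, lines array(row(amount integer, category varchar)))) as raw_data
from dataset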

How to extract a value as a column from JSON with multiple key-value lists using (a materialized view compatible) SQL?

There is a table with measurements. One of its columns, named measurement (JSON type), contains lists of named parameter values.
A sample table with one key-value list called parameters can be defined as follows:
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement
The same in table form:

id | measurement
---+------------------------------------------------------------------------------------------------------
1  | {"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}
2  | {"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}
3  | {"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}
Now, I want to extract some values from the list into columns. For example, if I want parameters aaa and bbb, I would expect output like:

id | aaa | bbb
---+-----+-----
1  | 10  | 20
2  | 11  | 22
3  | 111 | 222
CASE-WHEN-GROUP-BY method
I can achieve this using 4 sub-queries. It already starts getting complex, but is still bearable:
with measurements AS (
  select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
  select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
  select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement
),
parameters AS (
  select id, JSON_QUERY_ARRAY(measurements.measurement.parameters) measurements_list
  from measurements
),
param_values as (
  select id, JSON_VALUE(ml.name) name, JSON_VALUE(ml.value) value
  from parameters, parameters.measurements_list ml
),
trimmed_values as (
  select id,
    case when name="aaa" then value else null end as aaa,
    case when name="bbb" then value else null end as bbb
  from param_values
  where name in ("aaa", "bbb")
)
select id, max(aaa) aaa, max(bbb) bbb from trimmed_values group by id
JSONPath method
I can also use a full-featured JSONPath function, as suggested by Mikhail. Then things start looking more manageable:

select id,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb
from `sap-clm-analytics-dev.ag_experiment.measurements`

(It may be less efficient than the CASE-WHEN-GROUP-BY method because of the external UDF call, but let's focus on maintainability for now.)
Adding another list of values
Now I add another list of key-value pairs named colors:
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement
Let's pick the values for green from the list of colors. Then the output will be:

id | aaa | bbb | green
---+-----+-----+------
1  | 10  | 20  | A
2  | 11  | 22  | AA
3  | 111 | 222 | AAA
The JSONPath solution from above can be trivially extended to cover this case:

select id,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb,
  bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.colors), '$.[?(@.name=="green")].value') green
from measurements
With the CASE-WHEN approach, things start getting tricky. The query below is already complex, and it is simply wrong:
with measurements AS (
  select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
  select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
  select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement
),
parameters_colors AS (
  select id,
    JSON_QUERY_ARRAY(measurements.measurement.parameters) parameters_list,
    JSON_QUERY_ARRAY(measurements.measurement.colors) colors_list
  from measurements
),
param_color_values AS (
  select id,
    JSON_VALUE(parameters_list.name) param_name, JSON_VALUE(parameters_list.value) param_value,
    JSON_VALUE(colors_list.name) color_name, JSON_VALUE(colors_list.value) color_value
  from parameters_colors, parameters_colors.parameters_list, parameters_colors.colors_list
),
trimmed_values AS (
  select id,
    case when param_name="aaa" then param_value else null end as aaa,
    case when param_name="bbb" then param_value else null end as bbb,
    case when color_name="green" then color_value else null end as green
  from param_color_values
  where param_name in ("aaa", "bbb") and color_name = "green"
)
select id, max(aaa) aaa, max(bbb) bbb, max(green) green from trimmed_values group by 1
Wrong result:

id | aaa | bbb | green
---+-----+-----+------
1  | 10  | 20  | A
2  | 10  | 20  | AA
3  | 10  | 20  | AAA
The cartesian product in param_color_values is fine, but trimmed_values incorrectly fills the permutations with nulls; apparently the "green" values need their own level of grouping.
It would apparently be possible to fix my example. For illustration, here is a minimal sketch of one such fix, using one scalar subquery per extracted key so the two lists are never cross-joined:
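select id,
  (select json_value(p.value)
   from unnest(json_query_array(measurement.parameters)) p
   where json_value(p.name) = 'aaa') aaa,
  (select json_value(p.value)
   from unnest(json_query_array(measurement.parameters)) p
   where json_value(p.name) = 'bbb') bbb,
  (select json_value(c.value)
   from unnest(json_query_array(measurement.colors)) c
   where json_value(c.name) = 'green') green
from measurements

Even so, every new key means copying yet another subquery, so it probably won't stay maintainable after another list of parameters. So, I want to phrase my question differently.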
Question
What would be a maintainable way to extract multiple values from such data structures in SQL?
Materialized view
Ideally, I'd like to persist such a query as a BigQuery materialized view. The original data object is huge, so I want to create a stage in the data pipeline that persists a curated subset of it, clustered differently, and have BigQuery manage the refreshes of this object. Materialized views support only a limited set of functions; for example, UDFs (like CUSTOM_JSON_VALUE) are not supported.
My current state
I tend to drop the idea of using the materialized view in favor of the maintainability of the UDF/JSONPath method and organize the refresh of the extracted dataset myself using scheduled queries.
Am I overlooking any trivial pure-SQL solution that is optionally materialized-view compatible and easy to extend to more complex cases?
Consider below approach (not compatible with materialized view)
create temp function get_keys(input string) returns array<string> language js as """
  return Object.keys(JSON.parse(input));
""";

create temp function get_values(input string) returns array<string> language js as """
  return Object.values(JSON.parse(input));
""";

create temp function get_leaves(input string) returns string language js as '''
  function flattenObj(obj, parent = '', res = {}){
    for(let key in obj){
      let propName = parent ? parent + '.' + key : key;
      if(typeof obj[key] == 'object'){
        flattenObj(obj[key], propName, res);
      } else {
        res[propName] = obj[key];
      }
    }
    return JSON.stringify(res);
  }
  return flattenObj(JSON.parse(input));
''';
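To see what get_leaves produces, here is a quick check that can run in the same script: it flattens the nested JSON into dot-separated leaf paths, which the query below then splits back into key and type.

select get_leaves('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20}]}') leaves
-- returns {"parameters.0.name":"aaa","parameters.0.value":10,"parameters.1.name":"bbb","parameters.1.value":20}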
with temp as (
  select id, val,
    if(ends_with(key, '.name'), 'name', 'value') type,
    regexp_replace(key, r'\.name$|\.value$', '') key
  from your_table,
    unnest([struct(get_leaves(json_extract(to_json_string(measurement), '$')) as leaves)]),
    unnest(get_keys(leaves)) key with offset
  join unnest(get_values(leaves)) val with offset using(offset)
)
select * from (
  select * except(key)
  from temp
  pivot (any_value(val) for type in ('name', 'value'))
)
pivot (any_value(value) for name in ('aaa', 'bbb', 'ccc', 'green', 'yellow'))
If applied to the sample data in your question, the output is:

id | aaa | bbb | ccc | green | yellow
---+-----+-----+-----+-------+-------
1  | 10  | 20  | 30  | A     | B
2  | 10  | 20  | 30  | AA    | BB
3  | 10  | 20  | 30  | AAA   | BBB
In case the keys are not known in advance, or there are too many to manage manually, you can use the dynamic version below:
create temp function get_keys(input string) returns array<string> language js as """
  return Object.keys(JSON.parse(input));
""";

create temp function get_values(input string) returns array<string> language js as """
  return Object.values(JSON.parse(input));
""";

create temp function get_leaves(input string) returns string language js as '''
  function flattenObj(obj, parent = '', res = {}){
    for(let key in obj){
      let propName = parent ? parent + '.' + key : key;
      if(typeof obj[key] == 'object'){
        flattenObj(obj[key], propName, res);
      } else {
        res[propName] = obj[key];
      }
    }
    return JSON.stringify(res);
  }
  return flattenObj(JSON.parse(input));
''';
create temp table temp as (
  select * except(key) from (
    select id, val,
      if(ends_with(key, '.name'), 'name', 'value') type,
      regexp_replace(key, r'\.name$|\.value$', '') key
    from your_table,
      unnest([struct(get_leaves(json_extract(to_json_string(measurement), '$')) as leaves)]),
      unnest(get_keys(leaves)) key with offset
    join unnest(get_values(leaves)) val with offset using(offset)
  )
  pivot (any_value(val) for type in ('name', 'value'))
);

execute immediate (select '''
  select * from temp
  pivot (any_value(value) for name in ("''' || string_agg(distinct name, '","') || '"))'
  from temp
);

how to return "sparse" json (choose a number of attributes ) from PostgreSQL

MongoDB has a way of choosing which fields of a JSON document are returned as the result of a query. I am looking for the same in PostgreSQL.
Let's assume that I've got JSON like this:
{
  "a": "valuea",
  "b": "valueb",
  "c": "valuec",
  ...
  "z": "valuez"
}
The particular values may be either simple values or subobjects with further nesting.
I want to have a way of returning JSON documents containing only the attributes I choose, something like:
SELECT json_col including_only a,b,c,g,n from table where...
I know that there is the "-" operator, which allows removing specific attributes, but is there an operator that does exactly the opposite?
In trivial cases you can use jsonb_to_record(jsonb):

with data(json_col) as (
  values ('{"a": 1, "b": 2, "c": 3, "d": 4}'::jsonb)
)
select *, to_jsonb(rec) as result
from data
cross join jsonb_to_record(json_col) as rec(a int, d int)
json_col | a | d | result
----------------------------------+---+---+------------------
{"a": 1, "b": 2, "c": 3, "d": 4} | 1 | 4 | {"a": 1, "d": 4}
(1 row)
See JSON Functions and Operators.
If you need a more generic tool, this function does the job:
create or replace function jsonb_sparse(jsonb, text[])
returns jsonb language sql immutable as $$
  select $1 - (
    select array_agg(key)
    from jsonb_object_keys($1) as key
    where key <> all($2)
  )
$$;
-- use:
select jsonb_sparse('{"a": 1, "b": 2, "c": 3, "d": 4}', '{a, d}')
Test it in db<>fiddle.
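If you prefer to build the result up rather than subtract the unwanted keys, an equivalent sketch (jsonb_pick is a name made up here) could use jsonb_object_agg:

create or replace function jsonb_pick(jsonb, text[])
returns jsonb language sql immutable as $$
  -- keep only the requested keys that actually exist in the input
  select coalesce(jsonb_object_agg(key, $1 -> key), '{}'::jsonb)
  from unnest($2) as key
  where $1 ? key
$$;

-- use:
select jsonb_pick('{"a": 1, "b": 2, "c": 3, "d": 4}', '{a, d}')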

How to refactor adding properties to json through nvl2 to avoid repetition?

We are creating json of tag names and respective tag values:
{
  "name": "bob",
  "surname": "dylan",
  ...
}
This is done by first creating a main select and then, within it, including tens of nvl2 calls that invoke the appropriate procedures for getting each tag's value:
select u_json_pck.JsonPropertyObject(null,
  nclob_tt(
    nvl2(ExecuteSelectToCheckIfValueExists(),
      U_JSON_PCK.JsonProperty('TAG_NAME', ExecuteAVerySimilarSelectToGetValue()),
      decode(v_remove_empty_tags, 1,
        U_JSON_PCK.JsonProperty('TAG_NAME', ''), null)),
    nvl2(......),
    nvl2(...)...
(1) check if any value exists (e.g., for a tag "meetingParticipants" no participants might exist)
(2) if it exists, then call the procedure that actually gets that value and forms it into necessary nclob, and add this and the tag to json
(3) if it doesn't exist, then check if empty tags should be added to json, and then either add an empty one or don't add one
Can this be refactored so that ExecuteSelectToCheckIfValueExists() isn't called at all? We could check v_remove_empty_tags inside ExecuteAVerySimilarSelectToGetValue() and, if that function finds no results, return -1 or null. But how do we form the appropriate JSON from that result?
I don't know what your u_json_pck package and nclob_tt type do exactly, but why not do this with the built-in JSON syntax? Something like:
select json_object
( 'myobject' value json_arrayagg
( json_object
( 'tag_name1' value value1
, 'tag_name2' value value2
, 'tag_name3' value value3
absent on null
)
returning clob
)
) json_data
from
( select null value1, 'BBB' value2, 'CCC' value3 from dual
union all
select null value1, 'BBB' value2, null value3 from dual
union all
select 123 value1, null value2, 'DDD' value3 from dual
union all
select 456 value1, 'EEE' value2, null value3 from dual
);
which returns:
{
"myobject": [
{
"tag_name2": "BBB",
"tag_name3": "CCC"
},
{
"tag_name2": "BBB"
},
{
"tag_name1": 123,
"tag_name3": "DDD"
},
{
"tag_name1": 456,
"tag_name2": "EEE"
}
]
}
You just need to change absent on null to null on null according to the value of v_remove_empty_tags - so either have a separate select statement for each, or construct the SQL dynamically to specify that, for example:
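A minimal PL/SQL sketch of the dynamic variant (the inline source query is made up for illustration):

declare
  v_remove_empty_tags pls_integer := 1;  -- the flag from the question
  v_on_null           varchar2(20);
  v_json              clob;
begin
  -- pick the null-handling clause based on the flag
  v_on_null := case v_remove_empty_tags when 1 then 'absent on null' else 'null on null' end;
  execute immediate
    'select json_object(''tag_name1'' value value1, ''tag_name2'' value value2 '
    || v_on_null || ' returning clob) from (select 123 value1, null value2 from dual)'
    into v_json;
  dbms_output.put_line(v_json);  -- {"tag_name1":123} or {"tag_name1":123,"tag_name2":null}
end;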