Key value table to json in BigQuery - sql

Hey all,
I have a table that looks like this:
row | key | val
----|-----|---------
1   | a   | 100
2   | b   | 200
3   | c   | "apple"
4   | d   | {}
I want to convert it into JSON:
{
"a": 100,
"b": 200,
"c": "apple",
"d": {}
}
Note: the number of rows can change, so this is only an example.
Thanks in advance!

With string manipulation,
WITH sample_table AS (
  SELECT 'a' key, '100' value UNION ALL
  SELECT 'b', '200' UNION ALL
  SELECT 'c', '"apple"' UNION ALL
  SELECT 'd', '{}'
)
SELECT '{' || STRING_AGG(FORMAT('"%s": %s', key, value)) || '}' json
FROM sample_table;
you can get a result similar to your expected output.
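If your project has the JSON type available, the same string can also be wrapped in PARSE_JSON to get a validated JSON value instead of a STRING; a sketch reusing the same sample_table:

SELECT PARSE_JSON('{' || STRING_AGG(FORMAT('"%s": %s', key, value)) || '}') AS json
FROM sample_table;

PARSE_JSON raises an error if any assembled val is not a legal JSON literal, which is a useful sanity check here.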

Related

Aggregate jsonb map with string array as value in postgresql

I have a postgresql table with a jsonb column containing maps with strings as keys and string arrays as values. I want to aggregate all the maps into a single jsonb map. There should be no duplicate values in the string arrays. How can I do this in Postgres?
Eg:
Input: {"a": ["1", "2"]}, {"a": ["2", "3"], "b": ["5"]}
Output: {"a": ["1", "2", "3"], "b": ["5"]}
I tried the '||' operator, but it overwrites values when the same key exists in both maps.
Eg:
Input: SELECT '{"a": ["1", "2"]}'::jsonb || '{"a": ["3"], "b": ["5"]}'::jsonb;
Output: {"a": ["3"], "b": ["5"]}
Using jsonb_object_agg with a series of cross joins:
select jsonb_object_agg(t.key, t.a)
from (
  select v.key, jsonb_agg(distinct v1.value) a
  from objects o
  cross join jsonb_each(o.tags) v
  cross join jsonb_array_elements(v.value) v1
  group by v.key
) t;
See fiddle.
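To reproduce this locally (the fiddle link did not survive), a minimal setup matching the table and column names used in the query above could be:

create table objects (tags jsonb);
insert into objects (tags) values
  ('{"a": ["1", "2"]}'),
  ('{"a": ["2", "3"], "b": ["5"]}');

which should yield {"a": ["1", "2", "3"], "b": ["5"]}.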
You can use the jsonb_object_agg aggregate function to achieve this. jsonb_object_agg takes a set of key/value pairs and returns a single JSONB object. To merge the arrays as well, first unnest each array with jsonb_array_elements, aggregate the distinct elements per key, and then fold the keys into one object. Here is an example query:
SELECT jsonb_object_agg(key, value)
FROM (
  SELECT key, jsonb_agg(DISTINCT elem) AS value
  FROM (
    SELECT 'a' AS key, '["1", "2"]'::jsonb AS value
    UNION ALL
    SELECT 'a' AS key, '["3"]'::jsonb AS value
    UNION ALL
    SELECT 'b' AS key, '["5"]'::jsonb AS value
  ) subq
  CROSS JOIN LATERAL jsonb_array_elements(subq.value) AS elem
  GROUP BY key
) subq2;
This will give you the following result:
{"a": ["1", "2", "3"], "b": ["5"]}

How to query on fields from nested records without referring to the parent records in BigQuery?

I have data structured as follows:
{
"results": {
"A": {"first": 1, "second": 2, "third": 3},
"B": {"first": 4, "second": 5, "third": 6},
"C": {"first": 7, "second": 8, "third": 9},
"D": {"first": 1, "second": 2, "third": 3},
... },
...
}
i.e. nested records, where the lowest level has the same schema for all records in the level above. The schema would be similar to this:
results RECORD NULLABLE
results.A RECORD NULLABLE
results.A.first INTEGER NULLABLE
results.A.second INTEGER NULLABLE
results.A.third INTEGER NULLABLE
results.B RECORD NULLABLE
results.B.first INTEGER NULLABLE
...
Is there a way to do (e.g. aggregate) queries in BigQuery on fields from the lowest level, without knowledge of the keys on the (direct) parent level? Put differently, can I do a query on first for all records in results without having to specify A, B, ... in my query?
I would for example want to achieve something like
SELECT SUM(results.*.first) FROM table
in order to get 1+4+7+1 = 13,
but SELECT results.*.first isn't supported.
(I've tried playing around with STRUCTs, but haven't gotten far.)
The trick below is for BigQuery Standard SQL:
#standardSQL
SELECT id, (
SELECT AS STRUCT
SUM(first) AS sum_first,
SUM(second) AS sum_second,
SUM(third) AS sum_third
FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
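The [a]||[b]||[c]||[d] expression wraps each child struct in a one-element array and concatenates them, so the correlated subquery aggregates over four rows, one per child record (A through D).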
You can test and play with the above using the dummy/sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 AS id, STRUCT(
STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
) AS results
)
SELECT id, (
SELECT AS STRUCT
SUM(first) AS sum_first,
SUM(second) AS sum_second,
SUM(third) AS sum_third
FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
with output:

Row | id | sum_first | sum_second | sum_third
1   | 1  | 13        | 17         | 21
Is there a way to do (e.g. aggregate) queries in BigQuery on fields from the lowest level, without knowledge of the keys on the (direct) parent level?
Below is for BigQuery Standard SQL and totally avoids referencing parent records (A, B, C, D, etc.)
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
SELECT id,
Nested_SUM(results, 'first') AS first_sum,
Nested_SUM(results, 'second') AS second_sum,
Nested_SUM(results, 'third') AS third_sum,
Nested_SUM(results, 'fourth') AS fourth_sum -- no such field: demonstrates the NULL result
FROM `project.dataset.table`
Applied to the sample data from your question, as in the example below:
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
WITH `project.dataset.table` AS (
SELECT 1 AS id, STRUCT(
STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
) AS results
)
SELECT id,
Nested_SUM(results, 'first') AS first_sum,
Nested_SUM(results, 'second') AS second_sum,
Nested_SUM(results, 'third') AS third_sum,
Nested_SUM(results, 'fourth') AS fourth_sum
FROM `project.dataset.table`
the output is:

Row | id | first_sum | second_sum | third_sum | fourth_sum
1   | 1  | 13        | 17         | 21        | null
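To see why the regular expression works, it helps to inspect what TO_JSON_STRING produces for a results struct; the pattern r'":{(.*?)}' then captures each child record's body. A sketch with two of the sample records:

SELECT TO_JSON_STRING(STRUCT(
  STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
  STRUCT(4 AS first, 5 AS second, 6 AS third) AS B
)) AS json
-- returns {"A":{"first":1,"second":2,"third":3},"B":{"first":4,"second":5,"third":6}}
-- so each captured group looks like "first":1,"second":2,"third":3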
I adapted Mikhail's answer in order to support grouping on the values of the lowest-level fields:
#standardSQL
CREATE TEMP FUNCTION Nested_AGGREGATE(entries ANY TYPE, field_name STRING) AS ((
SELECT ARRAY(
SELECT AS STRUCT TRIM(SPLIT(kv, ':')[OFFSET(1)], '"') AS value, COUNT(SPLIT(kv, ':')[OFFSET(1)]) AS count
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
GROUP BY TRIM(SPLIT(kv, ':')[OFFSET(1)], '"')
)
));
SELECT id,
Nested_AGGREGATE(results, 'first') AS first_agg,
Nested_AGGREGATE(results, 'second') AS second_agg,
Nested_AGGREGATE(results, 'third') AS third_agg
FROM `project.dataset.table`
For the sample input

WITH `project.dataset.table` AS (
  SELECT 1 AS id, STRUCT(
    STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
    STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
    STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
    STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
  ) AS results
)

the output is:
Row | id | first_agg.value | first_agg.count | second_agg.value | second_agg.count | third_agg.value | third_agg.count
1   | 1  | 1               | 2               | 2                | 2                | 3               | 2
    |    | 4               | 1               | 5                | 1                | 6               | 1
    |    | 7               | 1               | 8                | 1                | 9               | 1

Redshift Postgresql - How to Parse Nested JSON

I am trying to parse a JSON text using JSON_EXTRACT_PATH_TEXT() function.
JSON sample:
{
"data":[
{
"name":"ping",
"idx":0,
"cnt":27,
"min":16,
"max":33,
"avg":24.67,
"dev":5.05
},
{
"name":"late",
"idx":0,
"cnt":27,
"min":8,
"max":17,
"avg":12.59,
"dev":2.63
}
]
}
I tried JSON_EXTRACT_PATH_TEXT(event, '{"name":"late"}', 'avg') to get 'avg' for name = "late", but it returns blank.
Can anyone help, please?
Thanks
This is a rather complicated task in Redshift, which, unlike Postgres, has poor support for managing JSON and no function to unnest arrays.
Here is one way to do it using a numbers table; you need to populate the table with incrementing numbers starting at 0, like:
create table nums as
select 0 as i union all select 1 union all select 2 union all select 3
union all select 4 union all select 5 union all select 6
union all select 7 union all select 8 union all select 9
;
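If your arrays can hold more than ten elements, you can widen the range by cross joining the table with itself; a sketch (nums100 is a hypothetical name):

create table nums100 as
select t.i * 10 + u.i as i  -- covers 0..99
from nums t cross join nums u;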
Once the table is created, you can use it to walk the JSON array using json_extract_array_element_text(), and check its content with json_extract_path_text():
select json_extract_path_text(item, 'avg') as my_avg
from (
select json_extract_array_element_text(t.items, n.i, true) as item
from (
select json_extract_path_text(mycol, 'data', true ) as items
from mytable
) t
inner join nums n on n.i < json_array_length(t.items, true)
) t
where json_extract_path_text(item, 'name') = 'late';
You'll need to use json_array_elements for that:
select obj->'avg'
from foo f, json_array_elements(f.event->'data') obj
where obj->>'name' = 'late';
Working example
create table foo (id int, event json);
insert into foo values (1,'{
"data":[
{
"name":"ping",
"idx":0,
"cnt":27,
"min":16,
"max":33,
"avg":24.67,
"dev":5.05
},
{
"name":"late",
"idx":0,
"cnt":27,
"min":8,
"max":17,
"avg":12.59,
"dev":2.63
}]}');
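With this sample row, the query above returns 12.59, the avg for the element whose name is 'late'.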

pivot from multiple rows to multiple columns in hive

I have a hive table like following
(id:int, vals: Map<String, int> , type: string)
id, vals, type
1, {"foo": 1}, "a"
1, {"foo": 2}, "b"
2, {"foo": 3}, "a"
2, {"foo": 1}, "b"
Now, there are only two types
I want to change this to following schema
id, type_a_vals, type_b_vals
1, {"foo", 1}, {"foo": 2}
2, {"foo": 3}, {"foo": 1}
and if any "type" is missing, it can be null?
An easy way, keeping in mind the map column, would be a self join. Coalescing the ids covers rows where one of the types is missing:
select coalesce(ta.id, tb.id) as id, ta.vals as type_a_vals, tb.vals as type_b_vals
from (select * from tbl where type = 'a') ta
full join (select * from tbl where type = 'b') tb on ta.id = tb.id;
You can usually solve questions like these with conditional aggregation, as below; however, doing so on a map column produces an error.
select id
,max(case when type = 'a' then vals end) as type_a_vals
,max(case when type = 'b' then vals end) as type_b_vals
from tbl
group by id
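If the map keys are known up front ('foo' in the sample data), one hedged workaround is to aggregate the scalar values and rebuild the maps with Hive's map() constructor; note that a missing type then yields {"foo": null} rather than a NULL map:

select id
,map('foo', max(case when type = 'a' then vals['foo'] end)) as type_a_vals
,map('foo', max(case when type = 'b' then vals['foo'] end)) as type_b_vals
from tbl
group by id;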

PostgreSQL get any value from jsonb object

I want to get the value of either key 'a' or 'b' if either one exists. If neither exists, I want the value of any key in the map.
Example:
'{"a": "aaa", "b": "bbbb", "c": "cccc"}' should return aaa.
'{"b": "bbbb", "c": "cccc"}' should return bbb.
'{"c": "cccc"}' should return cccc.
Currently I'm doing it like this:
SELECT COALESCE(o ->> 'a', o ->> 'b', o->> 'c') FROM...
The problem is that I don't really want to name key 'c' explicitly since there are objects that can have any key.
So how do I achieve the desired effect of "Get value of either 'a' or 'b' if either exists. If neither exists, grab anything that exists."?
I am using postgres 9.6.
maybe too long:
t=# with c(j) as (values('{"a": "aaa", "b": "bbbb", "c": "cccc"}'::jsonb))
, m as (select j,jsonb_object_keys(j) k from c)
, f as (select * from m where k not in ('a','b') limit 1)
t-# select COALESCE(j ->> 'a', j ->> 'b', j->>k) from f;
coalesce
----------
aaa
(1 row)
and with no a,b keys:
t=# with c(j) as (values('{"a1": "aaa", "b1": "bbbb", "c": "cccc"}'::jsonb))
, m as (select j,jsonb_object_keys(j) k from c)
, f as (select * from m where k not in ('a','b') limit 1)
t-# select COALESCE(j ->> 'a', j ->> 'b', j->>k) from f;
coalesce
----------
cccc
(1 row)
The idea is to extract all the keys with jsonb_object_keys, take the first "random" one (limit 1; random because nothing is ordered by), and then use it in the last COALESCE argument.
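One caveat: the f CTE filters with k not in ('a','b'), so an object whose only keys are a and b produces no row at all. A more compact variant of the same idea that avoids this, sketched with the column o from the question (my_table is a placeholder):

SELECT COALESCE(
         o ->> 'a',
         o ->> 'b',
         (SELECT value FROM jsonb_each_text(o) LIMIT 1)
       )
FROM my_table;

jsonb_each_text returns the key/value pairs in arbitrary order, which matches the "grab anything that exists" requirement.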