Hive: Parse nested json list - hive

I have data which comprises nested json list, like:
{"id":"aaa", "list":[{"eventId":222},{"details":[{"sub1":333},{"sub2":444}]},{"name":555}]}
The target is to extract the "outer" list, like
id data
aaa {"eventId":222}
aaa {"details":[{"sub1":333},{"sub2":444}]}
aaa {"name":555}
How to explode the list without split the inner nested json list? Any help is appreciated.

You'll have to use the built in function get_json_object or json_tupple to get the json objects together with hive posexplode to retrieve values from list
For example,
SELECT get_json_object(json_object, $.id) as id,
posexplode(get_json_object(json_object, $.list)) as pos, val
FROM your table;
From that you can just use get_json_object on your pos,val columns.
You could also use explode with lateral VIEW on hive
Note
This SQL code was only for ilustrating, it may have some errors
edit:
As pointed by #leftjoin, would be impossible to handle this king with the return from get_json_object. Maybe se solution would be use a udf to handle this case.

Related

How to access nested arrays and JSON in AWS Athena

I'm trying to process some data from s3 logs in Athena that has a complex type I cannot figure out how to work with.
I have a table with rows such as:
data
____
"[{\"k1\":\"value1\", \"key2\":\"value2\"...}]"
I'd like to treat it as (1) an array to extract the first element, and then that first element as the JSON that it is.
Everything is confused because the data naturally is a string, that contains an array, that contains json and I don't even know where to start
You can use the following combination of JSON commands:
SELECT
JSON_EXTRACT_SCALAR(
JSON_EXTRACT_SCALAR('"[{\"k1\":\"value1\", \"key2\":\"value2\"...}]"','$'),
'$[0].k1'
)
The inner JSON_EXTRACT_SCALAR will return the JSON ARRAY [{"k1":"value1", "key2":"value2"...}] and the outer will return the relevant value value1
Another similar option is to use CAST(JSON :
SELECT
JSON_EXTRACT_SCALAR(
CAST(JSON '"[{\"k1\":\"value1\", \"key2\":\"value2\"...}]"' as VARCHAR),
'$[0].k1'
)

Convert strings into table columns in biq query

I would like to convert this table
to something like this
the long string can be dynamic so it's important to me that it's not a fixed solution for these values specifically
Please help, i'm using big query
You could start by using SPLIT SPLIT(value[, delimiter]) to convert your long string into separate key-value pairs in an array.
This will be sensitive to you having commas as part of your values.
SPLIT(session_experiments, ',')
Then you could either FLATTEN that array or access each element, and then use some REGEXs to separate the key and the value.
If you share more context on your restrictions and intended result I could try and put together a query for you that does exactly what you want.
It's not possible what you want, however, there is a better practice for BigQuery.
You can use arrays of structs to store that information in a table.
Let's say you have a table like that
You can use that sample query to understand how to use it.
with rawdata AS
(
SELECT 1 as id, 'test1-val1,test2-val2,test3-val3' as experiments union all
SELECT 1 as id, 'test1-val1,test3-val3,test5-val5' as experiments
)
select
id,
(select array_agg(struct(split(param, '-')[offset(0)] as experiment, split(param, '-')[offset(1)] as value)) from unnest(split(experiments)) as param ) as experiments
from rawdata
The output will look like that:
After having that output, it's more convenient to manipulate the data

Athena/Presto : complex structure/array

Think that I am asking the impossible here, but throwing it out there.
Trying to query some json in Athena.
The data I'm working with looks like this (excerpt)
condition={
"foranyvalue:stringlike":{"s3:prefix":["lala","hehe"]},
"forallvalues:stringlike":{"s3:prefix":["apples","bananas"]
}
.. and I need to get to here :
... PLUS:the key names are not fixed, so one day I might get:
condition={"something not seen before":{"surprise":["haha","hoho"]}}
With that last point, I was hoping to treat this an an array, and start by splitting the 'foranyvalue' and 'forallvalues' parts into separate rows.
but with everything wrapped in {}, it refuses to unnest.
But despite the above failed plan - ANY tips on solving this by ANY means gratefully received !
Thank You
When you have JSON data that does not have a schema that is easy to describe you can use STRING as the type of the column and then use Athena/Presto's JSON functions to query them, in combination with casting to MAP and UNNEST to flatten the structures.
One way of achieving what I think you're trying to do would be something like this:
WITH the_table AS (
SELECT CAST(condition AS MAP(VARCHAR, JSON)) AS condition
FROM (
VALUES
(JSON '{"foranyvalue:stringlike":{"s3:prefix":["lala","hehe"]},"forallvalues:stringlike":{"s3:prefix":["apples","bananas"]}}'),
(JSON '{"something not seen before":{"surprise":["haha","hoho"]}}')
) AS t (condition)
),
first_flattening AS (
SELECT
SPLIT(first_level_key, ':', 2) AS first_level_key,
CAST(first_level_value AS MAP(VARCHAR, JSON)) AS first_level_value
FROM the_table
CROSS JOIN UNNEST (condition) AS t (first_level_key, first_level_value)
),
second_flattening AS (
SELECT
first_level_key,
second_level_key,
second_level_value
FROM first_flattening
CROSS JOIN UNNEST (first_level_value) AS t (second_level_key, second_level_value)
)
SELECT
first_level_key[1] AS "for",
TRY(first_level_key[2]) AS condition,
second_level_key AS "left",
second_level_value AS "right"
FROM second_flattening
I've included the two examples you gave as an inline VALUES list in the first CTE, and exactly what to do in the table declaration (i.e. what type for the column to use) and what processing to do in the query (i.e. the cast) depends on your data and how you want/can set up the table. YMMV.
The query flattens the JSON structure in a couple of separate steps, first flattening the first level of keys and values, then the keys and values of the inner documents. It might be possible to do this in one step, but doing it in two at least makes it easier to read.
Since the first level keys don't always have the colon I've used TRY to make sure that accessing the second value doesn't break anything. You could perhaps filter out values without a colon earlier and avoid this, since you're not interested in them.

unnest() not exploding array, returns error Column alias list has 1 entries but 't' has 2 columns available

I have some json data which includes a property 'characters' and it looks like this:
select json_data['characters'] from latest_snapshot_events
Returns: [{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":60,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":10,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":3},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":50,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":39,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":2},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":80,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":6801450488388220,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":1,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":4},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":85,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":8355588830097610,"shards":0,"CHAR_TPIECES":5,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":4}]
This is returned on a single row. I would like a single row for each item within the array.
I found several SO posts and other blogs advising me to use unnest(). I've tried this several times and cannot get a result to return. For example, here is the documentation from presto. The bottom covers unnest as a stand in for hive's lateral view explode:
SELECT student, score
FROM tests
CROSS JOIN UNNEST(scores) AS t (score);
So I tried to apply this to my table:
characters as (
select
jdata.characters
from latest_snapshot_events
cross join unnest(json_data) as t(jdata)
)
select * from characters;
where json_data is the field in latest_snapshot_events that contains the the property 'characters' which is an array like the one shown above.
This returns an error:
[Simba]AthenaJDBC An error has been thrown from the AWS Athena client. SYNTAX_ERROR: line 69:12: Column alias list has 1 entries but 't' has 2 columns available
How can I unnest/explode latest_snapshot_events.json_data['characters'] onto multiple rows?
Since characters is a JSON array in textual representation, you'll have to:
Parse the JSON text with json_parse to produce a value of type JSON.
Convert the JSON value into a SQL array using CAST.
Explode the array using UNNEST.
For instance:
WITH data(characters) AS (
VALUES '[{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":60,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":10,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":3},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":50,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":39,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":2},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":80,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":6801450488388220,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":1,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":4},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":85,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":8355588830097610,"shards":0,"CHAR_TPIECES":5,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":4}]'
)
SELECT entry
FROM data, UNNEST(CAST(json_parse(characters) AS array(json))) t(entry)
which produces:
entry
-----------------------------------------------------------------------
{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":60,"CHAR_A3_LVL":1,...
{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":50,"CHAR_A3_LVL":1,...
{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":80,"CHAR_A3_LVL":1,...
{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":85,"CHAR_A3_LVL":1,...
In the example above, I convert the JSON value into an array(json), but
you can further convert it to something more concrete if the values inside each
array entry have a regular schema. For example, for your data, it is
possible to cast it to an array(map(varchar, json)) since every element in the
array is a JSON object.
json_parse works if your initial data is a JSON string. However, for array(row) types (i.e. an array of objects/dictionaries), casting to array(json) will convert each row into an array, removing all keys from the object and preventing you from using dot notation or json_extract functions.
To unnest array(row) data, the syntax is much simpler:
CROSS JOIN UNNEST(my_array) AS my_row
I got stuck with this error trying to unpivot data.
This might help someone:
SELECT a_col, b_col
FROM
(
SELECT MAP(
ARRAY['a', 'b', 'c', 'd'],
ARRAY[1, 2, 3, 4]
) my_col
) CROSS JOIN UNNEST(my_col) as t(a_col, b_col)
t() allows you define multiple columns as outputs.

How to use array_agg() for varchar[]

I have a column in our database called min_crew that has varying character arrays such as '{CA, FO, FA}'.
I have a query where I'm trying to get aggregates of these arrays without success:
SELECT use.user_sched_id, array_agg(se.sched_entry_id) AS seids
, array_agg(se.min_crew)
FROM base.sched_entry se
LEFT JOIN base.user_sched_entry use ON se.sched_entry_id = use.sched_entry_id
WHERE se.sched_entry_id = ANY(ARRAY[623, 625])
GROUP BY user_sched_id;
Both 623 and 625 have the same use.user_sched_id, so the result should be the grouping of the seids and the min_crew, but I just keep getting this error:
ERROR: could not find array type for data type character varying[]
If I remove the array_agg(se.min_crew) portion of the code, I do get a table returned with the user_sched_id = 2131 and seids = '{623, 625}'.
The standard aggregate function array_agg() only works for base types, not array types as input.
(But Postgres 9.5+ has a new variant of array_agg() that can!)
You could use the custom aggregate function array_agg_mult() as defined in this related answer:
Selecting data into a Postgres array
Create it once per database. Then your query could work like this:
SELECT use.user_sched_id, array_agg(se.sched_entry_id) AS seids
,array_agg_mult(ARRAY[se.min_crew]) AS min_crew_arr
FROM base.sched_entry se
LEFT JOIN base.user_sched_entry use USING (sched_entry_id)
WHERE se.sched_entry_id = ANY(ARRAY[623, 625])
GROUP BY user_sched_id;
There is a detailed rationale in the linked answer.
Extents have to match
In response to your comment, consider this quote from the manual on array types:
Multidimensional arrays must have matching extents for each dimension.
A mismatch causes an error.
There is no way around that, the array type does not allow such a mismatch in Postgres. You could pad your arrays with NULL values so that all dimensions have matching extents.
But I would rather translate the arrays to a comma-separated lists with array_to_string() for the purpose of this query and use string_agg() to aggregate the text - preferably with a different separator. Using a newline in my example:
SELECT use.user_sched_id, array_agg(se.sched_entry_id) AS seids
,string_agg(array_to_string(se.min_crew, ','), E'\n') AS min_crews
FROM ...
Normalize
You might want to consider normalizing your schema to begin with. Typically, you would implement such an n:m relationship with a separate table like outlined in this example:
How to implement a many-to-many relationship in PostgreSQL?