How to access nested arrays and JSON in AWS Athena

I'm trying to process some data from s3 logs in Athena that has a complex type I cannot figure out how to work with.
I have a table with rows such as:
data
____
"[{\"k1\":\"value1\", \"key2\":\"value2\"...}]"
I'd like to (1) treat it as an array to extract the first element, and then (2) treat that first element as the JSON that it is.
Everything is confusing because the data is a string that contains an array that contains JSON, and I don't even know where to start.

You can use the following combination of JSON commands:
SELECT
JSON_EXTRACT_SCALAR(
JSON_EXTRACT_SCALAR('"[{\"k1\":\"value1\", \"key2\":\"value2\"...}]"','$'),
'$[0].k1'
)
The inner JSON_EXTRACT_SCALAR returns the JSON array [{"k1":"value1", "key2":"value2"...}] as a string, and the outer call returns the requested value, value1.
Another similar option is to use CAST(JSON ... AS VARCHAR):
SELECT
JSON_EXTRACT_SCALAR(
CAST(JSON '"[{\"k1\":\"value1\", \"key2\":\"value2\"...}]"' as VARCHAR),
'$[0].k1'
)
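To see why two extraction steps are needed, here is a rough Python sketch of the same double decoding. Python's json module only stands in for Athena's JSON functions, and the sample row is the one from the question with the elided "..." left out (an assumption, so that it parses):

```python
import json

# The stored value is a JSON string whose content is itself a JSON array.
# Sample row from the question, without the elided "..." (assumption).
raw = r'"[{\"k1\":\"value1\", \"key2\":\"value2\"}]"'

# Step 1: decode the outer JSON string (the role of the inner call with path '$').
inner_text = json.loads(raw)

# Step 2: parse the array text and read key k1 of element 0 (path '$[0].k1').
first_k1 = json.loads(inner_text)[0]["k1"]

print(inner_text)
print(first_k1)
```

The first step yields plain text that happens to be a JSON array; only the second step treats it as an array.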

Related

separate values from JSON in PostgreSQL

I am working with a table that has one JSON column (JSONcolumn).
The values in it appear like this:
["91601","85202","78746"]
Is there any way to get all the objects from that JSON list separated into rows? I want the result to be like this:
JSONcolumn
91601
85202
78746
I read a lot of answers on how to do it, but the difference I noticed is that in my case the JSON contains a LIST, while in most cases people answered with queries that work if the JSON contains a DICT.
First, these are a JSON array and a JSON object, not a list and a dict. List and dict are Python terms.
Second, the documentation has a whole section on JSON operators and functions that covers this.
Third, what you want is:
select * from json_array_elements_text('["91601","85202","78746"]');
value
-------
91601
85202
78746
--Or
select * from json_array_elements('["91601","85202","78746"]');
value
---------
"91601"
"85202"
"78746"

Hive: Parse nested json list

I have data which comprises nested json list, like:
{"id":"aaa", "list":[{"eventId":222},{"details":[{"sub1":333},{"sub2":444}]},{"name":555}]}
The target is to extract the "outer" list, like
id data
aaa {"eventId":222}
aaa {"details":[{"sub1":333},{"sub2":444}]}
aaa {"name":555}
How to explode the list without split the inner nested json list? Any help is appreciated.
You'll have to use the built-in function get_json_object or json_tuple to get the JSON objects, together with Hive's posexplode to retrieve the values from the list.
For example:
SELECT get_json_object(json_object, '$.id') as id,
posexplode(get_json_object(json_object, '$.list')) as pos, val
FROM your_table;
From that you can just use get_json_object on your pos, val columns.
You could also use explode with LATERAL VIEW in Hive.
Note
This SQL code is only for illustration; it may contain errors.
edit:
As pointed out by #leftjoin, it is not possible to handle this kind of data with the return value of get_json_object, because it returns a string rather than an array. Maybe the solution would be to use a UDF to handle this case.
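As a rough idea of what such a UDF would have to do, here is a minimal Python sketch (not Hive code; the function name explode_outer_list is made up for illustration). It parses the whole record and re-serializes only the elements of the outer list, so the nested list inside "details" is never split:

```python
import json

def explode_outer_list(record_text):
    """Return one (id, element_json) pair per element of the outer list."""
    record = json.loads(record_text)
    # Compact separators keep the output in the same shape as the input.
    return [
        (record["id"], json.dumps(item, separators=(",", ":")))
        for item in record["list"]
    ]

row = '{"id":"aaa", "list":[{"eventId":222},{"details":[{"sub1":333},{"sub2":444}]},{"name":555}]}'
exploded = explode_outer_list(row)

for row_id, data in exploded:
    print(row_id, data)
```

The point is that the list has to be parsed as JSON; splitting the string on "},{" would also cut the inner list apart.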

unnest() not exploding array, returns error Column alias list has 1 entries but 't' has 2 columns available

I have some json data which includes a property 'characters' and it looks like this:
select json_data['characters'] from latest_snapshot_events
Returns: [{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":60,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":10,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":3},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":50,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":39,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":2},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":80,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":6801450488388220,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":1,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":4},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":85,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":8355588830097610,"shards":0,"CHAR_TPIECES":5,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":4}]
This is returned on a single row. I would like a single row for each item within the array.
I found several SO posts and other blogs advising me to use unnest(). I've tried this several times and cannot get a result to return. For example, here is the documentation from Presto. The bottom covers UNNEST as a stand-in for Hive's LATERAL VIEW explode:
SELECT student, score
FROM tests
CROSS JOIN UNNEST(scores) AS t (score);
So I tried to apply this to my table:
characters as (
select
jdata.characters
from latest_snapshot_events
cross join unnest(json_data) as t(jdata)
)
select * from characters;
where json_data is the field in latest_snapshot_events that contains the property 'characters', which is an array like the one shown above.
This returns an error:
[Simba]AthenaJDBC An error has been thrown from the AWS Athena client. SYNTAX_ERROR: line 69:12: Column alias list has 1 entries but 't' has 2 columns available
How can I unnest/explode latest_snapshot_events.json_data['characters'] onto multiple rows?
Since characters is a JSON array in textual representation, you'll have to:
Parse the JSON text with json_parse to produce a value of type JSON.
Convert the JSON value into a SQL array using CAST.
Explode the array using UNNEST.
For instance:
WITH data(characters) AS (
VALUES '[{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":60,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":10,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":3},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":50,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":39,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":2},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":80,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":6801450488388220,"shards":0,"CHAR_TPIECES":0,"CHAR_A5_LVL":1,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":4},{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":85,"CHAR_A3_LVL":1,"CHAR_TIER":1,"ITEM":8355588830097610,"shards":0,"CHAR_TPIECES":5,"CHAR_A5_LVL":0,"CHAR_A2_LVL":1,"CHAR_A4_LVL":1,"ITEM_CATEGORY":"Character","ITEM_LEVEL":4}]'
)
SELECT entry
FROM data, UNNEST(CAST(json_parse(characters) AS array(json))) t(entry)
which produces:
entry
-----------------------------------------------------------------------
{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":60,"CHAR_A3_LVL":1,...
{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":50,"CHAR_A3_LVL":1,...
{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":80,"CHAR_A3_LVL":1,...
{"CHAR_STARS":1,"CHAR_A1_LVL":1,"ITEM_POWER":85,"CHAR_A3_LVL":1,...
In the example above, I convert the JSON value into an array(json), but
you can further convert it to something more concrete if the values inside each
array entry have a regular schema. For example, for your data, it is
possible to cast it to an array(map(varchar, json)) since every element in the
array is a JSON object.
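If it helps to see the three steps outside SQL, here is a small Python sketch of the same idea. It is only an analogy: json.loads plays the role of json_parse, the list check plays the role of the CAST, and the loop plays the role of UNNEST. The sample is trimmed to two objects and three keys from the question's data (an assumption for brevity):

```python
import json

# Trimmed sample of the 'characters' text from the question (assumption).
characters = '[{"ITEM":10,"ITEM_POWER":60,"ITEM_LEVEL":3},{"ITEM":39,"ITEM_POWER":50,"ITEM_LEVEL":2}]'

# Step 1: parse the JSON text (json_parse).
parsed = json.loads(characters)

# Step 2: make sure it really is an array (CAST ... AS array(json)).
if not isinstance(parsed, list):
    raise TypeError("expected a JSON array")

# Step 3: one output row per array element (UNNEST).
entries = [json.dumps(entry, separators=(",", ":")) for entry in parsed]

# With a more concrete type such as array(map(varchar, json)),
# individual keys can be read from each element.
item_powers = [entry["ITEM_POWER"] for entry in parsed]

for entry in entries:
    print(entry)
```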
json_parse works if your initial data is a JSON string. However, for array(row) types (i.e. an array of objects/dictionaries), casting to array(json) will convert each row into an array, removing all keys from the object and preventing you from using dot notation or json_extract functions.
To unnest array(row) data, the syntax is much simpler:
CROSS JOIN UNNEST(my_array) AS my_row
I got stuck with this error trying to unpivot data.
This might help someone:
SELECT a_col, b_col
FROM
(
SELECT MAP(
ARRAY['a', 'b', 'c', 'd'],
ARRAY[1, 2, 3, 4]
) my_col
) CROSS JOIN UNNEST(my_col) as t(a_col, b_col)
t() allows you to define multiple columns as outputs.
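The reason two aliases are needed is that unnesting a map yields a key and a value for every entry. A tiny Python analogy of the query above (a dict standing in for the MAP, nothing Presto-specific):

```python
# MAP(ARRAY['a', 'b', 'c', 'd'], ARRAY[1, 2, 3, 4]) as a Python dict.
my_col = dict(zip(["a", "b", "c", "d"], [1, 2, 3, 4]))

# Unnesting a map produces two columns per row: the key and the value.
rows = [(a_col, b_col) for a_col, b_col in my_col.items()]

for a_col, b_col in rows:
    print(a_col, b_col)
```

Each row carries two values, which is why a single alias such as t(jdata) does not match what the map produces.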

Extract Array from JSON in HiveQL

I have a column which contains a large JSON object. For example, let's call the column Column1, and this is a typical element:
{"key1":value,"key2":[{"subK11":val,"subK12":val},{"subK21":val,"subK22":val}]}
So, I can extract a normal element fine using:
select get_json_object(Column1,'$.key1') as key1
But I have been unable to figure out how to extract the ARRAY in a usable form, as this:
select get_json_object(Column1,'$.key2') as key2
Returns a STRING type. So I can't select elements from the array like normal. That is, this query will fail:
select key2[1] as first_element
from
(select get_json_object(Column1,'$.key2') as key2)
OR
select explode(key2)
from
(select get_json_object(Column1,'$.key2') as key2 )
Both give errors; the latter says "explode() requires array type". So the issue, I think, is that get_json_object returns a string. I need it to recognize that key2 contains an ARRAY, but I have no idea how to do that.
I'm new to Hive SQL, mainly an SQL user, so please let me know if there's anything crazy obvious I'm missing. I have not found a solution to this type of problem on any of the other questions.
You can use hive-third-functions. It provides a json_array_extract function, so you can extract JSON array info like this:
json_array_extract("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]", "$.a.b"); => ["\"13\"","\"18\"","\"12\""]
json_array_extract_scalar("[{\"a\":{\"b\":\"13\"}}, {\"a\":{\"b\":\"18\"}}, {\"a\":{\"b\":\"12\"}}]", "$.a.b") => ["13","18","12"]
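To make the difference between the two results concrete, here is a Python sketch that mimics the outputs shown above. It is not the library's implementation; the helper _lookup is hypothetical and only handles simple "$.key.key" paths:

```python
import json

def _lookup(value, path):
    # Only simple "$.key.key" paths are handled here (an assumption for brevity).
    if not path.startswith("$."):
        raise ValueError("unsupported path: " + path)
    for key in path[2:].split("."):
        value = value[key]
    return value

def json_array_extract(array_text, path):
    """Each matched value is returned as JSON text, so strings keep their quotes."""
    return [json.dumps(_lookup(elem, path)) for elem in json.loads(array_text)]

def json_array_extract_scalar(array_text, path):
    """Each matched value is returned as plain text, without JSON quoting."""
    return [str(_lookup(elem, path)) for elem in json.loads(array_text)]

data = '[{"a":{"b":"13"}}, {"a":{"b":"18"}}, {"a":{"b":"12"}}]'
as_json = json_array_extract(data, "$.a.b")
as_text = json_array_extract_scalar(data, "$.a.b")

print(as_json)
print(as_text)
```

Either way the result is a real array, which is what explode() needs.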

Get an average value for element in column of arrays of json data in postgres

I have some data in a postgres table that is a string representation of an array of json data, like this:
[
{"UsageInfo"=>"P-1008366", "Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0},
{"Role"=>"Text", "ProjectCode"=>"", "PublicationCode"=>"", "RetailPrice"=>2},
{"Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0, "ParentItemId"=>"396487"}
]
This is the data in one cell from a single column of similar data in my database.
The datatype of this stored in the db is varchar(max).
My goal is to find the average RetailPrice of EVERY json item with "Role"=>"Abstract", including all of the json elements in the array, and all of the rows in the database.
Something like:
SELECT avg(json_extract_path_text(json_item, 'RetailPrice'))
FROM (
SELECT cast(json_items to varchar[]) as json_item
FROM my_table
WHERE json_extract_path_text(json_item, 'Role') like 'Abstract'
)
Now, obviously this particular query wouldn't work for a few reasons. Postgres doesn't let you directly convert a varchar to a varchar[]. Even after I had an array, this query would do nothing to iterate through the array. There are probably other issues with it too, but I hope it helps to clarify what it is I want to get.
Any advice on how to get the average retail price from all of these arrays of json data in the database?
It does not seem like Redshift would support the json data type per se. At least, I found nothing in the online manual.
But I found a few JSON functions in the manual, which should be instrumental:
JSON_ARRAY_LENGTH
JSON_EXTRACT_ARRAY_ELEMENT_TEXT
JSON_EXTRACT_PATH_TEXT
Since generate_series() is not supported, we have to substitute for that ...
SELECT tbl_id
, round(avg((json_extract_path_text(elem, 'RetailPrice'))::numeric), 2) AS avg_retail_price
FROM (
SELECT *, json_extract_array_element_text(json_items, pos) AS elem
FROM (VALUES (0),(1),(2),(3),(4),(5)) a(pos)
CROSS JOIN tbl
) sub
WHERE json_extract_path_text(elem, 'Role') = 'Abstract'
GROUP BY 1;
I substituted with a poor man's solution: A dummy table counting from 0 to n (the VALUES expression). Make sure you count up to the maximum number of possible elements in your array. If you need this on a regular basis create an actual numbers table.
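The logic of that query, sketched in Python to show how the numbers table drives the element lookup. This is only an illustration: the helper below imitates json_extract_array_element_text by returning None for positions past the end of the array, and the sample data is the question's cell with "=>" rewritten as ":" so that it is valid JSON (both are assumptions):

```python
import json

def extract_array_element_text(json_array_text, pos):
    """Return the element at pos as JSON text, or None if pos is out of range."""
    items = json.loads(json_array_text)
    if 0 <= pos < len(items):
        return json.dumps(items[pos])
    return None

# The question's cell, with "=>" replaced by ":" (assumption).
json_items = """[
 {"UsageInfo":"P-1008366", "Role":"Abstract", "RetailPrice":2, "EffectivePrice":0},
 {"Role":"Text", "ProjectCode":"", "PublicationCode":"", "RetailPrice":2},
 {"Role":"Abstract", "RetailPrice":2, "EffectivePrice":0, "ParentItemId":"396487"}
]"""

prices = []
for pos in range(6):  # the dummy numbers table: 0 to 5
    elem = extract_array_element_text(json_items, pos)
    if elem is None:
        continue  # position beyond the end of this array
    item = json.loads(elem)
    if item.get("Role") == "Abstract":  # the WHERE clause
        prices.append(float(item["RetailPrice"]))

avg_retail_price = round(sum(prices) / len(prices), 2)
print(avg_retail_price)
```

Positions beyond the array length simply contribute nothing, which is why the numbers table only has to be at least as long as the longest array.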
Modern Postgres has much better options, like json_array_elements() to unnest a json array. Compare to your sibling question for Postgres:
Can get an average of values in a json array using postgres?
I tested in Postgres with the related operator ->>, where it works:
SQL Fiddle.