SQL JSON extract scalar with different keys in JSON

I have a 'dateinfo' column that I extract from a table in Athena, which contains JSON like the examples below.
[{"pickuprequesteddate":"2022-08-09T00:00:00"}, {"deliveryrequesteddate":"2022-08-09T00:00:00"}]
[{"departureestimateddate":"2022-08-25T00:00:00"}, {"arrivalestimateddate":"2022-10-07T00:00:00"}, {}, {}]
As you can see, the objects inside the JSON array have different keys. I am interested in extracting the values for 'pickuprequesteddate' and 'deliveryrequesteddate' if they are present in the array. That is, for the examples above I would like to obtain a single column with the following values:
[2022-08-09T00:00:00, 2022-08-09T00:00:00]
[null, null, null, null]
I know how to extract the values of each key but separately, using
TRANSFORM(CAST(stopinfo AS ARRAY<JSON>), x -> JSON_EXTRACT_SCALAR(x, '$.dateinfo.pickuprequesteddate')) as pickup,
TRANSFORM(CAST(stopinfo AS ARRAY<JSON>), x -> JSON_EXTRACT_SCALAR(x, '$.dateinfo.deliveryrequesteddate')) as delivery
However, this gives me two separate columns.
How could I extract the values the way I want?

If each object contains at most one of the two keys, you can use coalesce:
WITH dataset (stopinfo) AS (
VALUES (JSON '[{"pickuprequesteddate":"2022-08-09T00:00:00"}, {"deliveryrequesteddate":"2022-08-09T00:00:00"}]'),
(JSON '[{"departureestimateddate":"2022-08-25T00:00:00"}, {"arrivalestimateddate":"2022-10-07T00:00:00"}, {}, {}]')
)
-- query
select TRANSFORM(
CAST(stopinfo AS ARRAY(JSON)),
x-> coalesce(
JSON_EXTRACT_SCALAR(x, '$.pickuprequesteddate'),
JSON_EXTRACT_SCALAR(x, '$.deliveryrequesteddate')
)
)
from dataset;
Output:
_col0
[2022-08-09T00:00:00, 2022-08-09T00:00:00]
[NULL, NULL, NULL, NULL]
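To make the per-element logic of the query above concrete, here is a minimal Python sketch of the same idea (it is not Athena code). For every object in the array it takes the first of the two keys that is present, which is what coalesce does inside TRANSFORM. The function names `first_present` and `extract_dates` are made up for illustration.

```python
import json

def first_present(obj, keys):
    """Return the value of the first key in `keys` found in obj, else None (like COALESCE)."""
    for key in keys:
        value = obj.get(key)
        if value is not None:
            return value
    return None

def extract_dates(stopinfo_json):
    # Mirrors TRANSFORM(CAST(stopinfo AS ARRAY(JSON)), x -> coalesce(...)):
    # one output element per input object, None where neither key exists.
    wanted = ["pickuprequesteddate", "deliveryrequesteddate"]
    return [first_present(obj, wanted) for obj in json.loads(stopinfo_json)]

row1 = '[{"pickuprequesteddate":"2022-08-09T00:00:00"}, {"deliveryrequesteddate":"2022-08-09T00:00:00"}]'
row2 = '[{"departureestimateddate":"2022-08-25T00:00:00"}, {"arrivalestimateddate":"2022-10-07T00:00:00"}, {}, {}]'

print(extract_dates(row1))
print(extract_dates(row2))
```

If an object could contain both keys, this approach keeps only the first one, so it fits only the "at most one key per object" case.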

Related

How can I get result in JSON format from AWS Athena?

I want to get the result in JSON format. Below is the structure of the object obj:
UUID string
MobileCountryCode string
UserState []struct {
Mobile int64
IsNewUser bool
}
When I do a simple select of the object from Athena via Redash:
select obj FROM table_name
I get output in this format:
{uuid=UCLPKJWPZXH,mobile_country_code=,user_state=[{mobile=9988998899, is_new_user=false}]}
Clearly, the output has no double quotes around the JSON keys and uses = instead of :, so it is not valid JSON.
I even tried casting it to JSON:
select CAST(obj AS JSON) AS json_obj FROM table_name
but I get only the values, without the JSON keys, like below:
["UCLPKJWPZXH","",[["9988998899",false]]]
But I want it to look like this, with the JSON keys:
{"uuid":"UCLPKJWPZXH","mobile_country_code":"","user_state":[{"mobile":9988998899,"is_new_user":false}]}
Athena is based on Presto/Trino (engine v3 should use Trino functions) and in Trino cast(... as json) should work:
select cast(r as json)
from (values (1, CAST(ROW('UUID123', array[row(1, TRUE)]) AS ROW(UUID varchar, UserState array(row(Mobile int, IsNewUser boolean)))))) as t(id, r);
Output:
_col0
----------------------------------------------------------------
{"uuid":"UUID123","userstate":[{"mobile":1,"isnewuser":true}]}
Try upgrading to engine v3. If you are already on v3, the upgrade does not help, or you cannot upgrade, the only way is to convert the ROW into a MAP, because Presto treats ROWs as arrays (docs):
When casting from ROW to JSON, the result is a JSON array rather than a JSON object. This is because positions are more important than names for rows in SQL.
And converting row to map can be quite cumbersome:
select cast(
map(array['UUID', 'MobileCountryCode', 'UserState'],
array[
cast(r.UUID as json),
cast(r.MobileCountryCode as json),
cast(
transform(r.UserState,
e -> map(
array['Mobile', 'IsNewUser'],
array[cast(e.Mobile as json), cast(e.IsNewUser as json)]))
as json)
])
as json)
from (values (1, CAST(ROW('UUID123', 'US', array[row(1, TRUE)]) AS ROW (UUID varchar, MobileCountryCode varchar,
UserState array(row(Mobile int, IsNewUser boolean)))))) as t(id, r);
Output:
_col0
--------------------------------------------------------------------------------------------
{"MobileCountryCode":"US","UUID":"UUID123","UserState":[{"IsNewUser":true,"Mobile":1}]}
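The difference between the positional and the named representation can be illustrated outside Athena. The following Python sketch is only an analogy: a tuple stands in for a ROW (values identified by position) and a dict stands in for a MAP (values identified by name). The variable names are assumptions for illustration.

```python
import json

# A row modeled positionally (values only), similar to casting ROW to JSON
# on engines that ignore field names.
row_as_tuple = ("UUID123", "US", [(1, True)])

# The same data modeled as a mapping, similar to the MAP-based workaround.
row_as_dict = {
    "UUID": "UUID123",
    "MobileCountryCode": "US",
    "UserState": [{"Mobile": 1, "IsNewUser": True}],
}

positional = json.dumps(row_as_tuple, separators=(",", ":"))
named = json.dumps(row_as_dict, sort_keys=True, separators=(",", ":"))

print(positional)  # a JSON array: the field names are lost
print(named)       # a JSON object: the field names are kept
```

The field names survive only in the second form, which is why the MAP conversion is needed when the cast does not preserve them.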

How do I Unnest varchar to json in Athena

I am crawling data from Google BigQuery and staging it in Athena.
One of the columns, crawled as a string, contains JSON:
{
"key": "Category",
"value": {
"string_value": "something"
}
}
I need to unnest and flatten these so I can use them in a query. I require the key and the string value (so in my query it will be where Category = something).
I have tried the following:
WITH dataset AS (
SELECT cast(json_column as json) as json_column
from "thedatabase"
LIMIT 10
)
SELECT
json_extract_scalar(json_column, '$.value.string_value') AS string_value
FROM dataset
which is returning null.
Casting the json_column as json adds \ into them :
"[{\"key\":\"something\",\"value\":{\"string_value\":\"app\"}}
If I use replace on the json, it doesn't allow me as it's not a varchar object.
So how do I extract the values from the json_column field?
Presto's json_extract_scalar can extract directly from a varchar (string) value:
-- sample data
WITH dataset(json_column) AS (
values ('{
"key": "Category",
"value": {
"string_value": "something"
}}')
)
--query
SELECT
json_extract_scalar(json_column, '$.value.string_value') AS string_value
FROM dataset;
Output:
string_value
something
Casting to json encodes the data as JSON rather than parsing it, so a string ends up double-encoded. To parse a string, use json_parse (it is not needed in this particular case, but there are cases where you will want it):
-- query
SELECT
json_extract_scalar(json_parse(json_column), '$.value.string_value') AS string_value
FROM dataset;
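The difference between encoding and parsing is easy to see in any JSON library. As a rough analogy (Python, not Athena), `json.dumps` plays the role of the cast and `json.loads` the role of json_parse; the variable names are illustrative.

```python
import json

raw = '{"key": "Category", "value": {"string_value": "something"}}'

# Encoding the string as JSON wraps it in quotes and escapes the inner quotes,
# which is where the backslashes in the question come from.
encoded = json.dumps(raw)

# Parsing the string yields a structure that can actually be navigated.
parsed = json.loads(raw)

print(encoded)
print(parsed["value"]["string_value"])
```

The encoded value is a single JSON string, so a path such as `$.value.string_value` has nothing to descend into, which matches the null result in the question.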

Extracting JSON returns null (Presto Athena)

I'm working with SQL Presto in Athena and in a table I have a column named "data.input.additional_risk_data.basket" that has a json like this:
[
{
"data.input.additional_risk_data.basket.val.brand":null,
"data.input.additional_risk_data.basket.val.category":null,
"data.input.additional_risk_data.basket.val.item_reference":"26484651",
"data.input.additional_risk_data.basket.val.name":"Nike Force 1",
"data.input.additional_risk_data.basket.val.product_name":null,
"data.input.additional_risk_data.basket.val.published_date":null,
"data.input.additional_risk_data.basket.val.quantity":"1",
"data.input.additional_risk_data.basket.val.size":null,
"data.input.additional_risk_data.basket.val.subCategory":null,
"data.input.additional_risk_data.basket.val.unit_price":769.0,
"data.input.additional_risk_data.basket.val.upc":null,
"data.input.additional_risk_data.basket.val.url":null
}
]
I need to extract some of the data there, for example data.input.additional_risk_data.basket.val.item_reference. I'm not used to working with jsons but I tried a few things:
json_extract("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference')
json_extract_scalar("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference')
They all returned null. I'm wondering what the correct way is to get the values from that JSON.
Thank you!
There are multiple "problems" with your data and JSON path selector. The keys are unconventional: they contain dots, which the path syntax reads as separators (and I have not found a way to tell Athena to escape them). Your JSON is also an array of JSON objects rather than a single object. What you can do is cast the data to an array of maps and process it. For example:
-- sample data
WITH dataset (json_val) AS (
VALUES (json '[
{
"data.input.additional_risk_data.basket.val.brand":null,
"data.input.additional_risk_data.basket.val.category":null,
"data.input.additional_risk_data.basket.val.item_reference":"26484651",
"data.input.additional_risk_data.basket.val.name":"Nike Force 1",
"data.input.additional_risk_data.basket.val.product_name":null,
"data.input.additional_risk_data.basket.val.published_date":null,
"data.input.additional_risk_data.basket.val.quantity":"1",
"data.input.additional_risk_data.basket.val.size":null,
"data.input.additional_risk_data.basket.val.subCategory":null,
"data.input.additional_risk_data.basket.val.unit_price":769.0,
"data.input.additional_risk_data.basket.val.upc":null,
"data.input.additional_risk_data.basket.val.url":null
}
]')
)
--query
select arr[1]['data.input.additional_risk_data.basket.val.item_reference'] item_reference -- or use unnest if there are actually more than 1 element in array expected
from(
select cast(json_val as array(map(varchar, json))) arr
from dataset
)
Output:
item_reference
"26484651"

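Why the path lookup fails while the map lookup works can be sketched in Python (an analogy only, not how Athena is implemented). The hypothetical helper `walk_dotted_path` treats every dot as a separator, as a path expression would, while the direct lookup uses the whole dotted string as one key.

```python
import json

doc = json.loads('[{"data.input.additional_risk_data.basket.val.item_reference": "26484651"}]')
key = "data.input.additional_risk_data.basket.val.item_reference"

def walk_dotted_path(value, path):
    """Naive path walker that treats every dot as a separator."""
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

first = doc[0]  # Python lists are 0-based; Presto arrays are 1-based (arr[1])

print(walk_dotted_path(first, key))  # None: there is no nested "data" object
print(first[key])                    # 26484651: lookup by the full literal key
```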
Extract complex json with random key field

I am trying to extract the following JSON into its own rows like the table below in Presto query. The issue here is the name of the key/av engine name is different for each row, and I am stuck on how I can extract and iterate on the keys without knowing the value of the key.
The json is a value of a table row
{
"Bkav":
{
"detected": false,
"result": null,
},
"Lionic":
{
"detected": true,
"result": "Trojan.Generic.3611249",
},
...
AV Engine Name | Detected Virus | Result
Bkav | false | null
Lionic | true | Trojan.Generic.3611249
I have tried to use json_extract following the documentation here https://teradata.github.io/presto/docs/141t/functions/json.html but there is no mention of extraction if we don't know the key :( I am trying to find a solution that works in both presto & hive query, is there a common query that is applicable to both?
You can cast your json to map(varchar, json) and process it with unnest to flatten:
-- sample data
WITH dataset (json_str) AS (
VALUES (
'{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
)
)
--query
select k "AV Engine Name", json_extract_scalar(v, '$.detected') "Detected Virus", json_extract_scalar(v, '$.result') "Result"
from (
select cast(json_parse(json_str) as map(varchar, json)) as m
from dataset
)
cross join unnest (map_keys(m), map_values(m)) t(k, v)
Output:
AV Engine Name | Detected Virus | Result
Bkav | false | NULL
Lionic | true | Trojan.Generic.3611249
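The shape of this transformation, one output row per top-level key regardless of what the keys are called, can be sketched in Python as follows. This is an illustration of the idea rather than Presto or Hive code, and `flatten_engines` is a made-up name.

```python
import json

json_str = '{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'

def flatten_engines(text):
    """Return one (engine, detected, result) tuple per top-level key, whatever the keys are."""
    engines = json.loads(text)
    # Iterating over items() plays the role of unnest(map_keys(m), map_values(m)).
    return [(name, info.get("detected"), info.get("result")) for name, info in engines.items()]

for row in flatten_engines(json_str):
    print(row)
```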
The Presto query suggested by @Guru works, but for Hive there is no easy way.
I had to extract the JSON,
strip some characters and brackets from it with regexp_replace,
then convert it back to a map, and repeat the process once more to get the nested values out:
SELECT
av_engine,
str_to_map(regexp_replace(engine_result, '\\}', ''),',', ':') AS output_map
FROM (
SELECT
str_to_map(regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{',''),'\\},', ':') AS key_val_map
FROM restricted_antispam.abuse_malware_scanning
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result

In Snowflake, how to get a list of all values of a certain key out of a list of key-value objects

I have a column holding a large semi-structured object. One of its parts is itself a list of key-value objects, which I can get like so:
t.payload:questions_and_answers
which gives:
[{"answer":"yes","position":0,"question":"would you"},
{"answer":"because","position":1,"question":"what"}]
I want to get from that:
yes, because
any ideas?
Using FLATTEN:
CREATE OR REPLACE TABLE t
AS
SELECT PARSE_JSON('{questions_and_answers:[{"answer":"yes","position":0,"question":"would you"},
{"answer":"because","position":1,"question":"what"}]}') AS payload;
Query:
SELECT s.value:answer::STRING
FROM t
,TABLE(FLATTEN (input => t.payload, PATH =>'questions_and_answers')) s;
Or if single output is required:
SELECT LISTAGG(s.value:answer::STRING, ', ') AS result
FROM t
,TABLE(FLATTEN (input => t.payload, PATH =>'questions_and_answers')) s;
Output: