How can I get results in JSON format from AWS Athena?

I want to get the result in JSON format. Below is the structure of the object obj:
UUID string
MobileCountryCode string
UserState []struct {
    Mobile int64
    IsNewUser bool
}
When I do a simple select of the object from Athena via Redash:
select obj FROM table_name
I get output in this format:
{uuid=UCLPKJWPZXH,mobile_country_code=,user_state=[{mobile=9988998899, is_new_user=false}]}
We can clearly see that the output doesn't put double quotes around the JSON keys and uses = instead of : as the key/value separator, so it is not valid JSON.
I even tried casting it to JSON:
select CAST(obj AS JSON) AS json_obj FROM table_name
but I am getting only the values, without the JSON keys:
["UCLPKJWPZXH","",[["9988998899",false]]]
But I want it to look like this, with JSON keys:
{"uuid":"UCLPKJWPZXH","mobile_country_code":"","user_state":[{"mobile":9988998899,"is_new_user":false}]}

Athena is based on Presto/Trino (engine v3 should use Trino functions), and in Trino cast(... as json) should work:
select cast(r as json)
from (values (1, CAST(ROW('UUID123', array[row(1, TRUE)])
                      AS ROW(UUID varchar, UserState array(row(Mobile int, IsNewUser boolean))))))
  as t(id, r);
Output:
_col0
----------------------------------------------------------------
{"uuid":"UUID123","userstate":[{"mobile":1,"isnewuser":true}]}
Try upgrading to the v3 engine. If you are already on the v3 engine, if it still does not work after upgrading, or if you can't upgrade, the only way is to convert the ROW into a MAP, because Presto treats ROWs as arrays (from the docs):
When casting from ROW to JSON, the result is a JSON array rather than a JSON object. This is because positions are more important than names for rows in SQL.
And converting a row to a map can be quite cumbersome:
select cast(
    map(array['UUID', 'MobileCountryCode', 'UserState'],
        array[
            cast(r.UUID as json),
            cast(r.MobileCountryCode as json),
            cast(
                transform(r.UserState,
                    e -> map(
                        array['Mobile', 'IsNewUser'],
                        array[cast(e.Mobile as json), cast(e.IsNewUser as json)]))
                as json)
        ])
    as json)
from (values (1, CAST(ROW('UUID123', 'US', array[row(1, TRUE)])
                      AS ROW(UUID varchar, MobileCountryCode varchar,
                             UserState array(row(Mobile int, IsNewUser boolean))))))
  as t(id, r);
Output:
_col0
--------------------------------------------------------------------------------------------
{"MobileCountryCode":"US","UUID123":"UUID123","UserState":[{"IsNewUser":true,"Mobile":1}]}

Related

How do I unnest varchar to JSON in Athena

I am crawling data from Google BigQuery and staging it in Athena.
One of the columns, crawled as a string, contains JSON:
{
    "key": "Category",
    "value": {
        "string_value": "something"
    }
}
I need to unnest these and flatten them to be able to use them in a query. I require the key and the string value (so in my query it will be WHERE Category = 'something').
I have tried the following:
WITH dataset AS (
    SELECT cast(json_column as json) as json_column
    from "thedatabase"
    LIMIT 10
)
SELECT
    json_extract_scalar(json_column, '$.value.string_value') AS string_value
FROM dataset
which returns null.
Casting json_column to json adds backslashes:
"[{\"key\":\"something\",\"value\":{\"string_value\":\"app\"}}
If I use replace on the JSON, it doesn't let me, as it's not a varchar object.
So how do I extract the values from the json_column field?
Presto's json_extract_scalar actually supports extracting directly from a varchar (string) value:
-- sample data
WITH dataset(json_column) AS (
    values ('{
  "key": "Category",
  "value": {
    "string_value": "something"
  }}')
)
-- query
SELECT
    json_extract_scalar(json_column, '$.value.string_value') AS string_value
FROM dataset;
Output:
string_value
something
Casting to json encodes the data as JSON (for a string you get a double-encoded value), it does not parse it; use json_parse instead (in this particular case it is not needed, but there are cases where you will want it):
-- query
SELECT
json_extract_scalar(json_parse(json_column), '$.value.string_value') AS string_value
FROM dataset;
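To see the difference between the two, compare both on a literal (the value '{"a":1}' is just a stand-in for this demo):
-- cast encodes the varchar as a JSON string; json_parse parses it into a JSON object
SELECT
    json_format(CAST('{"a":1}' AS json)) AS encoded, -- "{\"a\":1}"
    json_format(json_parse('{"a":1}'))   AS parsed;  -- {"a":1}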

json_extract_scalar with different keys in the JSON

I have a 'dateinfo' column that I extract from a table in Athena, which contains JSON like the examples below.
[{"pickuprequesteddate":"2022-08-09T00:00:00"}, {"deliveryrequesteddate":"2022-08-09T00:00:00"}]
[{"departureestimateddate":"2022-08-25T00:00:00"}, {"arrivalestimateddate":"2022-10-07T00:00:00"}, {}, {}]
As you can see, the JSON objects contain different keys. I am interested in extracting the values for 'pickuprequesteddate' and 'deliveryrequesteddate' if they are present in the JSON array. That is, for the examples above I would like to obtain a column with the following values:
[2022-08-09T00:00:00, 2022-08-09T00:00:00]
[null, null, null, null]
I know how to extract the values of each key but separately, using
TRANSFORM(CAST(stopinfo AS ARRAY<JSON>), x -> JSON_EXTRACT_SCALAR(x, '$.dateinfo.pickuprequesteddate')) as pickup,
TRANSFORM(CAST(stopinfo AS ARRAY<JSON>), x -> JSON_EXTRACT_SCALAR(x, '$.dateinfo.deliveryrequesteddate')) as delivery
However, this gives me two separate columns.
How could I extract the values the way I want?
If only one of them can be present in an object, you can use coalesce:
WITH dataset (stopinfo) AS (
    VALUES (JSON '[{"pickuprequesteddate":"2022-08-09T00:00:00"}, {"deliveryrequesteddate":"2022-08-09T00:00:00"}]'),
           (JSON '[{"departureestimateddate":"2022-08-25T00:00:00"}, {"arrivalestimateddate":"2022-10-07T00:00:00"}, {}, {}]')
)
-- query
select TRANSFORM(
        CAST(stopinfo AS ARRAY(JSON)),
        x -> coalesce(
            JSON_EXTRACT_SCALAR(x, '$.pickuprequesteddate'),
            JSON_EXTRACT_SCALAR(x, '$.deliveryrequesteddate')
        )
    )
from dataset;
Output:
_col0
[2022-08-09T00:00:00, 2022-08-09T00:00:00]
[NULL, NULL, NULL, NULL]
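If you would rather drop the entries that match neither key instead of keeping NULLs, wrapping the result in filter is one option; a sketch on the same sample data:
select filter(
        TRANSFORM(
            CAST(stopinfo AS ARRAY(JSON)),
            x -> coalesce(
                JSON_EXTRACT_SCALAR(x, '$.pickuprequesteddate'),
                JSON_EXTRACT_SCALAR(x, '$.deliveryrequesteddate')
            )
        ),
        v -> v is not null
    )
from dataset;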

Extracting JSON returns null (Presto Athena)

I'm working with Presto SQL in Athena, and in a table I have a column named "data.input.additional_risk_data.basket" that contains JSON like this:
[
{
"data.input.additional_risk_data.basket.val.brand":null,
"data.input.additional_risk_data.basket.val.category":null,
"data.input.additional_risk_data.basket.val.item_reference":"26484651",
"data.input.additional_risk_data.basket.val.name":"Nike Force 1",
"data.input.additional_risk_data.basket.val.product_name":null,
"data.input.additional_risk_data.basket.val.published_date":null,
"data.input.additional_risk_data.basket.val.quantity":"1",
"data.input.additional_risk_data.basket.val.size":null,
"data.input.additional_risk_data.basket.val.subCategory":null,
"data.input.additional_risk_data.basket.val.unit_price":769.0,
"data.input.additional_risk_data.basket.val.upc":null,
"data.input.additional_risk_data.basket.val.url":null
}
]
I need to extract some of the data there, for example data.input.additional_risk_data.basket.val.item_reference. I'm not used to working with JSON, but I tried a few things:
json_extract("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference')
json_extract_scalar("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference')
They all returned null. I'm wondering what the correct way is to get the values from that JSON.
Thank you!
There are multiple "problems" with your data and JSON path selector. The keys are not conventional (and I have not found a way to tell Athena to escape them), and your JSON is actually an array of JSON objects. What you can do is cast the data to an array and process it. For example:
-- sample data
WITH dataset (json_val) AS (
VALUES (json '[
{
"data.input.additional_risk_data.basket.val.brand":null,
"data.input.additional_risk_data.basket.val.category":null,
"data.input.additional_risk_data.basket.val.item_reference":"26484651",
"data.input.additional_risk_data.basket.val.name":"Nike Force 1",
"data.input.additional_risk_data.basket.val.product_name":null,
"data.input.additional_risk_data.basket.val.published_date":null,
"data.input.additional_risk_data.basket.val.quantity":"1",
"data.input.additional_risk_data.basket.val.size":null,
"data.input.additional_risk_data.basket.val.subCategory":null,
"data.input.additional_risk_data.basket.val.unit_price":769.0,
"data.input.additional_risk_data.basket.val.upc":null,
"data.input.additional_risk_data.basket.val.url":null
}
]')
)
--query
select arr[1]['data.input.additional_risk_data.basket.val.item_reference'] item_reference -- or use unnest if more than one element is expected in the array
from (
    select cast(json_val as array(map(varchar, json))) arr
    from dataset
)
Output:
item_reference
"26484651"

Extract complex json with random key field

I am trying to extract the following JSON into its own rows, like the table below, in a Presto query. The issue here is that the name of the key/AV engine is different for each entry, and I am stuck on how I can extract and iterate over the keys without knowing the value of the key.
The JSON is a value in a table row:
{
    "Bkav":
    {
        "detected": false,
        "result": null
    },
    "Lionic":
    {
        "detected": true,
        "result": "Trojan.Generic.3611249"
    },
    ...
}
AV Engine Name | Detected Virus | Result
Bkav | false | null
Lionic | true | Trojan.Generic.3611249
I have tried to use json_extract following the documentation here: https://teradata.github.io/presto/docs/141t/functions/json.html, but there is no mention of extraction when we don't know the key. I am trying to find a solution that works in both Presto and Hive; is there a common query that is applicable to both?
You can cast your json to map(varchar, json) and process it with unnest to flatten:
-- sample data
WITH dataset (json_str) AS (
    VALUES (
        '{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
    )
)
-- query
select k "AV Engine Name",
       json_extract_scalar(v, '$.detected') "Detected Virus",
       json_extract_scalar(v, '$.result') "Result"
from (
    select cast(json_parse(json_str) as map(varchar, json)) as m
    from dataset
)
cross join unnest (map_keys(m), map_values(m)) t(k, v)
Output:
AV Engine Name | Detected Virus | Result
Bkav | false | NULL
Lionic | true | Trojan.Generic.3611249
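If the per-engine objects always have the same shape, a variant of the same idea casts straight to typed rows so that no json_extract_scalar calls are needed; a sketch, assuming detected and result are always the only fields:
select k "AV Engine Name", v.detected "Detected Virus", v.result "Result"
from (
    select cast(json_parse(json_str) as map(varchar, row(detected boolean, result varchar))) as m
    from dataset
)
cross join unnest (m) t(k, v)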
The Presto query suggested by @Guru works, but for Hive there is no easy way. I had to:
extract the JSON,
parse it with replace to remove some characters and brackets,
then convert it back to a map, and repeat once more to get the nested value out:
SELECT
    av_engine,
    str_to_map(regexp_replace(engine_result, '\\}', ''), ',', ':') AS output_map
FROM (
    SELECT
        str_to_map(regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{', ''), '\\},', ':') AS key_val_map
    FROM restricted_antispam.abuse_malware_scanning
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result

Cast DATETIME to STRING working on ARRAY of STRUCT on BigQuery Standard SQL

In my table I have an attribute called 'messages' with this exact data type:
ARRAY<STRUCT<created_time DATETIME ,`from` STRUCT<id STRING,
name STRING,email STRING>, id STRING, message STRING>>
and I have defined a UDF named my_func().
Because UDFs in BigQuery don't support the DATETIME type, I need to cast the created_time attribute.
So I tried this:
safe_cast(messages as ARRAY<STRUCT<created_time STRING,
    `from` STRUCT<id STRING, name STRING, email STRING>,
    id STRING, message STRING>>) as messages_casted
and I get this error
Casting between arrays with incompatible element types is not
supported: Invalid cast from...
Is there an error in the way I cast the array of structs?
Is there some way to use a UDF with this data structure, or is the only way to flatten the array and do the cast?
My goal is to get the array into the JS execution environment in order to do the aggregation with JS code.
When working with JavaScript UDFs, you don't need to specify complex input data types explicitly. Instead, you can use the TO_JSON_STRING function. In your case, you can have the UDF take messages as a STRING, then parse it inside the UDF. You would call my_func(TO_JSON_STRING(messages)), for example.
Here is an example from the documentation:
CREATE TEMP FUNCTION SumFieldsNamedFoo(json_row STRING)
RETURNS FLOAT64
LANGUAGE js AS """
function SumFoo(obj) {
var sum = 0;
for (var field in obj) {
if (obj.hasOwnProperty(field) && obj[field] != null) {
if (typeof obj[field] == "object") {
sum += SumFoo(obj[field]);
} else if (field == "foo") {
sum += obj[field];
}
}
}
return sum;
}
var row = JSON.parse(json_row);
return SumFoo(row);
""";
WITH Input AS (
SELECT STRUCT(1 AS foo, 2 AS bar, STRUCT('foo' AS x, 3.14 AS foo) AS baz) AS s, 10 AS foo UNION ALL
SELECT NULL, 4 AS foo UNION ALL
SELECT STRUCT(NULL, 2 AS bar, STRUCT('fizz' AS x, 1.59 AS foo) AS baz) AS s, NULL AS foo
)
SELECT
TO_JSON_STRING(t) AS json_row,
SumFieldsNamedFoo(TO_JSON_STRING(t)) AS foo_sum
FROM Input AS t;
+---------------------------------------------------------------------+---------+
| json_row | foo_sum |
+---------------------------------------------------------------------+---------+
| {"s":{"foo":1,"bar":2,"baz":{"x":"foo","foo":3.14}},"foo":10} | 14.14 |
| {"s":null,"foo":4} | 4 |
| {"s":{"foo":null,"bar":2,"baz":{"x":"fizz","foo":1.59}},"foo":null} | 1.59 |
+---------------------------------------------------------------------+---------+
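Applied to the question's schema, the call pattern could look like this; a sketch in which the JS body is a placeholder (here it just counts the messages) rather than your actual aggregation:
CREATE TEMP FUNCTION my_func(messages_json STRING)
RETURNS INT64
LANGUAGE js AS """
  // messages_json is the JSON serialization of the ARRAY<STRUCT<...>> column
  var messages = JSON.parse(messages_json);
  return messages == null ? 0 : messages.length;
""";
SELECT my_func(TO_JSON_STRING(messages)) AS message_count
FROM `project.dataset.table`;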
"Because UDFs in BigQuery don't support the DATETIME type, I need to cast the created_time attribute."
Below is for BigQuery Standard SQL and is a very simple way of casting a specific element of an ARRAY while leaving everything else as is.
In the example below, it casts created_time from DATETIME to STRING (you can use any compatible type you need in your case, though):
#standardSQL
SELECT messages,
ARRAY(SELECT AS STRUCT * REPLACE (SAFE_CAST(created_time AS STRING) AS created_time)
FROM UNNEST(messages) message
) casted_messages
FROM `project.dataset.table`
If you run it against your data, you will see the original and casted messages side by side; all elements should be the same (value/type) with the exception of (as expected) created_time, which will be of the casted type (STRING in this particular case), or NULL if the cast is not compatible.
You can test / play with the above using dummy data as below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT [STRUCT<created_time DATETIME,
`from` STRUCT<id STRING, name STRING, email STRING>,
id STRING,
message STRING>
(DATETIME '2018-01-01 13:00:00', ('1', 'mike', 'zzz#ccc'), 'a', 'abc'),
(DATETIME '2018-01-02 14:00:00', ('2', 'john', 'yyy#bbb'), 'b', 'xyz')
] messages
)
SELECT messages,
ARRAY(SELECT AS STRUCT * REPLACE (SAFE_CAST(created_time AS STRING) AS created_time)
FROM UNNEST(messages) message
) casted_messages
FROM `project.dataset.table`
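Once created_time is a STRING, the element type contains only JS-supported types, so you could also pass the casted array to the UDF directly instead of serializing it with TO_JSON_STRING; a hypothetical sketch (the UDF body is a placeholder, here it just collects the timestamps):
CREATE TEMP FUNCTION my_func(messages ARRAY<STRUCT<created_time STRING,
    `from` STRUCT<id STRING, name STRING, email STRING>,
    id STRING, message STRING>>)
RETURNS STRING
LANGUAGE js AS """
  // structs arrive as plain JS objects, arrays as JS arrays
  return messages.map(function(m) { return m.created_time; }).join(',');
""";
SELECT my_func(casted_messages) AS created_times
FROM (
  SELECT ARRAY(SELECT AS STRUCT * REPLACE (SAFE_CAST(created_time AS STRING) AS created_time)
               FROM UNNEST(messages) message) AS casted_messages
  FROM `project.dataset.table`
);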