Spark extract nested JSON array items using purely SQL-query - apache-spark-sql

Note: this is NOT a duplicate of the following (or several other similar discussions):
Spark SQL JSON dataset query nested datastructures
How to use Spark SQL to parse the JSON array of objects
Querying Spark SQL DataFrame with complex types
I have a Hive table that I must read and process purely via a Spark SQL query. This table has a string-type column that contains JSON dumps from APIs; as expected, it holds deeply nested stringified JSON.
Let's take this example (it depicts the exact depth/complexity of the data I'm trying to process):
{
  "key1": ..
  "key2": ..
  ..
  "bill_summary": {
    "key1": ..
    "key2": ..
    ..
    "items": [
      {
        "item": {
          "key1": ..
          "key2": ..
          ..
          "type": "item_type_1"
          ..
          "total_cost": 57.65
          ..
        }
      },
      {
        "item": {
          "key1": ..
          "key2": ..
          ..
          "total_cost": 23.31
          ..
        }
      }
      ..
      {
        "item": {
          "key1": ..
          "key2": ..
          ..
          "type": "item_type_1"
          ..
        }
      }
    ]
    ..
  }
  ..
}
I'm interested in the items array. I'm able to access it via
get_json_object(get_json_object(var6, '$.bill_summary'), '$.items') AS items
Now here's the problem:
I need to extract all (type, total_cost) tuples from the array.
But I only need those entries where both fields are present, whereas several item objects have only one of them, or neither.
I've also managed to separately collect all the type fields and all the total_cost fields into two separate arrays, but because of the absent fields mentioned above, I end up losing the relationship between them.
What I get in the end (using the following snippet) are two arrays, possibly of different lengths, with no certainty that corresponding elements of the two arrays belong to the same item.
This snippet shows only part of my rather long SQL query; it uses CTEs.
..
split(get_json_object(get_json_object(var6, '$.bill_summary'), '$.items[*].item.type'), ',') AS types_array,
split(get_json_object(get_json_object(var6, '$.bill_summary'), '$.items[*].item.total_cost'), ',') AS total_cost_array
..
Now here are the limitations:
I have no control over the source Hive table's schema or data.
I want to do this using a pure Spark SQL query.
I cannot use DataFrame manipulation.
I do not want to employ a registered UDF (I'm keeping that as a last resort).
I've spent several hours on docs and forums, but the Spark SQL docs are sparse and the discussions mostly revolve around the DataFrame API, which I cannot use. Is this problem even solvable by a SQL query alone?

After hours of scouring the web, this answer hinted to me that I can cast a stringified JSON array to an array of structs in Spark SQL. Here's what I did:
..
var6_items AS
  (SELECT hash_id,
          entity1,
          dt,
          get_json_object(get_json_object(var6, '$.bill_summary'), '$.items[*].item') AS items_as_string
   FROM rows_with_appversion
   WHERE appversion >= 14),
filtered_var6_items AS
  (SELECT *
   FROM var6_items
   WHERE items_as_string IS NOT NULL
     AND items_as_string != '')
SELECT from_json(items_as_string, 'array<struct<type:string,total_cost:string>>') AS items_as_struct_array
FROM filtered_var6_items
..
Explanation:
The expression get_json_object(get_json_object(var6, '$.bill_summary'), '$.items[*].item') AS items_as_string produces items_as_string containing the following (stringified) JSON (note that one level of redundant nesting around each item has also been removed):
[
  {
    "key1": "val1",
    "key2": "val2",
    "type": "item_type_1",
    "total_cost": 57.65
  },
  {
    "key1": "val1",
    "key2": "val2",
    "total_cost": 57.65
  }
  ..
  {
    "key1": "val1",
    "key2": "val2",
    "type": "item_type_1"
  }
]
Thereafter, the from_json function allows casting the above string into an array of structs. Once that is obtained, I can filter out the structs that have both type and total_cost not NULL.
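For completeness, a minimal sketch of that last filtering step using the filter higher-order function (available in Spark 2.4+; a LATERAL VIEW explode with a WHERE clause would work on older versions too):

SELECT filter(
         from_json(items_as_string, 'array<struct<type:string,total_cost:string>>'),
         x -> x.type IS NOT NULL AND x.total_cost IS NOT NULL
       ) AS complete_items
FROM filtered_var6_items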
References
Spark-SQL's from_json function
user6910411's answer to How to cast an array of struct in a spark dataframe using selectExpr?

Related

How to extract a field from an array of JSON objects in AWS Athena?

I have the following JSON data structure in a column in AWS Athena:
[
  {
    "event_type": "application_state_transition",
    "data": {
      "event_id": "-3368023833341021830"
    }
  },
  {
    "event_type": "application_state_transition",
    "data": {
      "event_id": "5692882176024811076"
    }
  }
]
I would like to somehow extract the values of event_id field, e.g. in the form of a list:
["-3368023833341021830", "5692882176024811076"]
(Though I don't insist on exactly this as long as I can get my event IDs.)
I wanted to use the JSON_EXTRACT function and thought it uses the very same syntax as jq. In jq, I can easily get what I want using the following query syntax:
.[].data.event_id
However, in AWS Athena this results in an error, as apparently the syntax is not entirely compatible with jq. Is there an alternative way to achieve the result I want?
JSON_EXTRACT supports quite a limited set of JSON paths. Depending on the Athena engine version, you can either process the column by casting it to an array of maps and handling that array via array functions:
-- sample data
with dataset(json_col) as (
    values ('[
      {
        "event_type": "application_state_transition",
        "data": {
          "event_id": "-3368023833341021830"
        }
      },
      {
        "event_type": "application_state_transition",
        "data": {
          "event_id": "5692882176024811076"
        }
      }
    ]')
)
-- query
select transform(
         cast(json_parse(json_col) as array(map(varchar, json))),
         m -> json_extract(m['data'], '$.event_id'))
from dataset;
Output:
_col0
["-3368023833341021830", "5692882176024811076"]
Or, with Athena engine version 3, you can try Trino's json_query:
-- query
select JSON_QUERY(json_col, 'lax $[*].data.event_id' WITH ARRAY WRAPPER)
from dataset;
Note that the return types of the two differ: in the first case you get array(json), and in the second just varchar.
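If you need the second form as a real array as well, one option (a sketch, not tested) is to parse the varchar result back into JSON and cast it:

select cast(
         json_parse(json_query(json_col, 'lax $[*].data.event_id' WITH ARRAY WRAPPER))
         as array(varchar))
from dataset;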

AWS Glue / Hive struct with undetermined struct

I'm adding data to an AWS Glue table where one of the columns is a struct in which one of the values has an undetermined form.
More specifically, there's a known key called 'name' that is a string, and another called 'metadata' that can be a dict with any structure.
Ex:
# Row 1
{
  "name": "Jane",
  "metadata": {
    "foo": 123,
    "bar": "something"
  }
}

# Row 2
{
  "name": "Bill",
  "metadata": {
    "baz": "something else"
  }
}
Note how metadata is a different dictionary in the two entries.
How can this be specified as a struct?
struct<
name:string,
metadata:?
>
I ended up doing what I mentioned in the comment, which is to make the column a string and serialize the JSON blob to a string.
SQL queries then need to deserialize the JSON blob, which is supported by several different implementations, including AWS Athena (the one I'm using).
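For example, with metadata stored as a string column, an Athena query might look roughly like this (the table name and the '$.foo' / '$.bar' paths are purely illustrative):

-- my_table and the JSON paths below are hypothetical
SELECT name,
       json_extract_scalar(metadata, '$.foo') AS foo,
       json_extract_scalar(metadata, '$.bar') AS bar
FROM my_table;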

BigQuery: Get field names of a STRUCT

I have some data in a STRUCT in BigQuery. Below I have visualised an example of the data as JSON:
{
  ...
  siblings: {
    david: { a: 1 }
    sarah: { b: 1, c: 1 }
  }
  ...
}
I want to produce a field from a query that resembles ["david", "sarah"]. Essentially I just want to get the keys from the STRUCT (object). Note that every user will have different key names in the siblings STRUCT.
Is this possible in BigQuery?
Your struct's schema must be consistent throughout the table. Structs can't change their keys, because the keys are part of the table schema; to get the keys, you simply look at the table schema.
If the values change, they're probably values in an array. I guess you might have something like this:
WITH t AS (
  SELECT 1 AS id, [STRUCT('david' AS name, 33 AS age), ('sarah', 42)] AS siblings
  UNION ALL
  SELECT 2, [('ken', 19), ('ryu', 21), ('chun li', 23)]
)
SELECT * FROM t
If you tried to introduce new keys in the second row or within the array, you'd get an error Array elements of types {...} do not have a common supertype at ....
The first row of the above example looks like this in JSON representation:
{
  "id": "1",
  "siblings": [
    {
      "name": "david",
      "age": "33"
    },
    {
      "name": "sarah",
      "age": "42"
    }
  ]
}
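With that structure, the list of sibling names (which play the role of the original "keys") can be produced per row with a correlated subquery, e.g. something like:

SELECT id,
       ARRAY(SELECT s.name FROM UNNEST(siblings) AS s) AS sibling_names
FROM t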

Returning unknown JSON in a query

Here is my scenario. I have data in a Cosmos DB and I want to return c.this, c.that etc. as the indexer for Azure Cognitive Search. One field I want to return is JSON of an unknown structure. The one thing I do know about it is that it is flat. However, it is my understanding that the return value for an indexer needs to be known. How, using SQL in a SELECT, would I return all JSON elements in the flat object? Here is an example value I would be querying:
{
  "BusinessKey": "SomeKey",
  "Source": "flat",
  "id": "SomeId",
  "attributes": {
    "Source": "flat",
    "Element": "element",
    "SomeOtherElement": "someOtherElement"
  }
}
So I would want my select to be maybe something like:
SELECT
    c.BusinessKey,
    c.Source,
    c.id,
    -- SOMETHING HERE TO LIST OUT ALL ATTRIBUTES IN THE JSON AS FIELDS IN THE RESULT
And I would want the result to be:
{
  "BusinessKey": "SomeKey",
  "Source": "flat",
  "id": "SomeId",
  "attributes": [{"Source":"flat"},{"Element":"element"},{"SomeOtherElement":"someotherelement"}]
}
Currently we are calling ToString on c.attributes, which is the JSON of unknown structure, but that adds all the escape characters. When we want to search the index, we have to add all those escape characters back, and it's getting really unruly.
Is there a way to do this using SQL?
You could use a UDF in Cosmos DB SQL.
UDF code:
function userDefinedFunction(object) {
    var returnArray = [];
    for (var key in object) {
        var map = {};
        map[key] = object[key];
        returnArray.push(map);
    }
    return returnArray;
}
SQL:
SELECT
    c.BusinessKey,
    c.Source,
    c.id,
    udf.test(c.attributes) AS attributes
FROM c
Output:

Querying Deep JSONb Information - PostgreSQL

I have the following JSON array stored on a row:
{
  "openings": [
    {
      "visibleFormData": {
        "productName": "test"
      }
    }
  ]
}
I'm trying to get the value of productName. So far I've tried something like this:
SELECT tbl.column->'openings'->'0'->'visibleFormData'->>'productName'
The theory being that this would grab the first object (index 0) in the openings array and then grab the productName attribute from that object's visibleFormData object.
All I'm getting is null, though. I've tried multiple configurations of this. I'm thinking it has to do with the grabbing of index zero, but I am unsure. I am not a regular PSQL user, so it's proving a tad tricky to debug.
The JSON array index is an integer, so use 0 instead of '0':
with tbl(col) as (
  values (
    '{
      "openings": [
        {
          "visibleFormData": {
            "productName": "test"
          }
        }
      ]
    }'::jsonb)
)
SELECT tbl.col->'openings'->0->'visibleFormData'->>'productName'
FROM tbl
?column?
----------
test
(1 row)
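As a shorthand, the same lookup can be written with the #>> operator, which takes a text-array path and returns the value as text:

SELECT tbl.col #>> '{openings,0,visibleFormData,productName}'
FROM tbl;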