Avoid flattening of query result on nested JSON with spark.sql - apache-spark-sql

I have a nested JSON file that I've loaded into a Dataset.
For example:
{"name":"Y", "address":{"city":"C","state":"O"}}
{"name":"M", "address":{"city":"h", "state":"C"}}
I want to write a SQL query with spark.sql that extracts the nested fields, but I need the result structure to stay nested rather than be flattened. For example:
SELECT address.city FROM T1
should return a column that keeps the original two levels of nesting:
address
-------
city
-------
C
h
Thanks
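A minimal sketch of one way to keep the nesting (assuming T1 is the temp view registered over this dataset, and using Spark SQL's named_struct function): rebuild the struct around the extracted field, so the result is again a struct column rather than a flat one:
-- named_struct re-wraps the extracted leaf in a new struct, so the
-- result column 'address' still has a nested 'city' field inside it
SELECT named_struct('city', address.city) AS address FROM T1
Calling printSchema on the resulting DataFrame then shows address as a struct with a single city field instead of a flattened top-level column.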

Related

Can we flatten a column which contains JSON values in a Hive table?

I have one Hive column 'events' with JSON values. How can I flatten this JSON to create one Hive table with columns for the key fields of the JSON? Is it even possible?
For example, I need the Hive table columns to be events, start_date, id, details with the corresponding values.
| events |
|[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}] |
|[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}]|
Demo:
select events,
       get_json_object(element, '$.id') as id,
       get_json_object(element, '$.start_date') as start_date,
       get_json_object(element, '$.details') as details
from (
    select '[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}]' as events
    union all
    select '[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}]' as events
) s
lateral view outer explode(split(regexp_replace(events, '\\[|\\]', ''), '(?<=\\}),(?=\\{)')) e as element
The initial string is split on the commas between curly brackets: the pattern '(?<=\\}),(?=\\{)' uses lookbehind and lookahead to match only a comma preceded by } and followed by {. The resulting array is exploded with a lateral view, and each JSON object is parsed using get_json_object.
Result:
events id start_date details
[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}] 3245ret 20201230 Imp
[{"start_date":20201230,"id":"3245ret","details":"Imp"},{"start_date":20201228,"id":"3245rtr","details":"NoImp"}] 3245rtr 20201228 NoImp
[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}] 3245ret 20191230 vImp
[{"start_date":20191230,"id":"3245ret","details":"vImp"},{"start_date":20191228,"id":"3245rwer","details":"NoImp"}] 3245rwer 20191228 NoImp

How can I use a WHERE clause in AWS Athena JSON queries?

I have a table where I've stored some information from a JSON object:
Table:
investment
unit(string)
data(string)
If I run the query SELECT * FROM "db"."investment" LIMIT 10; I get the following result:
Unit Data
CH [{"from":"CH","when":"2021-02-16","who":"pp#gmail.com"}]
AB [{"from":"AB","when":"2020-02-16","who":"jj#gmail.com"}]
Now, I run the following basic query to return a value from within the nested JSON object:
SELECT json_extract_scalar(Data, '$[0].who') email FROM "db"."investment";
and I got the following result:
email
jj#gmail.com
pp#gmail.com
How can I filter this query with a WHERE clause to return just a single value?
I've tried the query below, but obviously it doesn't work the way a normal SQL table with rows and columns would:
SELECT json_extract_scalar(Data, '$[0].who') email FROM "db"."investment" WHERE email = "pp#gmail.com";
Any help with this?
A couple of notes first: Athena is case insensitive, and column names are converted to lower case (even if you quote them). Also, string literals in Athena use single quotes; double-quoted "pp#gmail.com" is parsed as an identifier, not a string.
The main issue is that a column alias defined in the SELECT list is not accessible to the rest of the query, so in the WHERE clause you have to repeat the full expression that extracts the email from the JSON document.
here's a self contained example:
with test (unit, data) as (
values
('CH', JSON '[{"from":"CH","when":"2021-02-16","who":"pp#gmail.com"}]'),
('AB', JSON '[{"from":"AB","when":"2020-02-16","who":"jj#gmail.com"}]')
)
select json_extract_scalar(data, '$[0].who') email
from test
where json_extract_scalar(data, '$[0].who') = 'pp#gmail.com';
outputs:
| email        |
+--------------+
| pp#gmail.com |
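If you'd rather not repeat the json_extract_scalar expression, wrapping it in a derived table makes the alias visible to the outer WHERE. A sketch against the same inline test data:
with test (unit, data) as (
    values
        ('CH', JSON '[{"from":"CH","when":"2021-02-16","who":"pp#gmail.com"}]'),
        ('AB', JSON '[{"from":"AB","when":"2020-02-16","who":"jj#gmail.com"}]')
)
-- the inner query names the extracted value once; the outer query filters on it
select email
from (
    select json_extract_scalar(data, '$[0].who') as email
    from test
) t
where email = 'pp#gmail.com';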

Merge left on load data from BigQuery

I have an input table: input and one or more maptables, where input contains data for multiple identifiers and dates stacked under each other. The schemas are as follows:
#input
Id: string (might contain empty values)
Id2: string (might contain empty values)
Id3: string (might contain empty values)
Date: datetime
Value: number
#maptable_1
Id: string
Id2: string
Target_1: string
#maptable_2
Id3: string
Target_2: string
What I do now is run a pipeline that, for each date/(id, id2, id3) combination, loads the data from input and applies a left merge in Python against one or more maptables (both as DataFrames). I then stream the results to a third table named output with the schema:
#output
Id: string
Id2: string
Id3: string
Date: datetime
Value: number
Target_1: string (from maptable_1)
Target_2: string (from maptable_2)
Target_x: ...
Now I was thinking that this is not really efficient: if I change one value in a maptable, I have to redo all the pipelines for each date/(id, id2, id3) combination.
Therefore I was wondering whether it's possible to apply the left merge directly when loading the data. What would such a query look like?
In the case of multiple maptables and target columns, would it also be beneficial to do the same? Wouldn't the query become too complex or unreadable, in particular since the id columns are not the same?
What would such a query look like?
Below is for BigQuery Standard SQL
INSERT `project.dataset.output`
SELECT *
FROM `project.dataset.input` i
LEFT JOIN `project.dataset.maptable_1` m1 USING(id, id2)
LEFT JOIN `project.dataset.maptable_2` m2 USING(id3)
In the case of multiple maptables and target columns ...
If all your map tables are the same as or similar to the two maps in your question, each extra map is just one extra LEFT JOIN; see the sketch below.
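If you ever need the insert to be independent of the * expansion order, an explicit column list does the same job (a sketch of the same query; all column names come from the schemas above):
INSERT `project.dataset.output` (id, id2, id3, date, value, target_1, target_2)
SELECT i.id, i.id2, i.id3, i.date, i.value, m1.target_1, m2.target_2
FROM `project.dataset.input` i
-- ON instead of USING so every column reference can stay qualified
LEFT JOIN `project.dataset.maptable_1` m1 ON m1.id = i.id AND m1.id2 = i.id2
LEFT JOIN `project.dataset.maptable_2` m2 ON m2.id3 = i.id3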

Flatten nested data in Big Query to a single row

(The question included two screenshots, not reproduced here: one of the source data, with an EnquiryReference column and a repeated Destinations record containing Name and Duration, and one of the desired single-row output.)
I just need the flattened data to show Destination 1 and Destination 2 as well as Duration 1 and Duration 2. I have used the UNNEST function in BigQuery, but it creates multiple rows, and I am unable to use any aggregation to group those rows because the data is non-numeric. Thank you for helping!
Below is for BigQuery Standard SQL
#standardSQL
SELECT EnquiryReference,
Destinations[OFFSET(0)].Name AS Destination1,
Destinations[SAFE_OFFSET(1)].Name AS Destination2,
Destinations[OFFSET(0)].Duration AS Duration1,
Destinations[SAFE_OFFSET(1)].Duration AS Duration2
FROM `project.dataset.table`
Note the mix of OFFSET and SAFE_OFFSET: OFFSET(0) errors if the array is empty, while SAFE_OFFSET(1) returns NULL when there is no second destination. Applied to the sample data from your question, this produces one row per EnquiryReference with the first two destinations and their durations as columns (the result screenshot is not reproduced here).
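For a runnable illustration, here is a self-contained variant with made-up sample data (the exact field names in Destinations are an assumption based on the query above):
#standardSQL
WITH `project.dataset.table` AS (
  -- hypothetical sample row standing in for the screenshot data
  SELECT 'ENQ-001' AS EnquiryReference,
         [STRUCT('Paris' AS Name, '3 nights' AS Duration),
          STRUCT('Rome' AS Name, '2 nights' AS Duration)] AS Destinations
)
SELECT EnquiryReference,
  Destinations[OFFSET(0)].Name AS Destination1,
  Destinations[SAFE_OFFSET(1)].Name AS Destination2,
  Destinations[OFFSET(0)].Duration AS Duration1,
  Destinations[SAFE_OFFSET(1)].Duration AS Duration2
FROM `project.dataset.table`
This returns a single row: ENQ-001, Paris, Rome, 3 nights, 2 nights.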

How to explode nested array of structure with unknown array length in Hive?

I have a Hive table emp_test with the schema below:
name string
testing array<struct<code:string, tests:array<struct<testtype:string, errorline:string>>>>
The column values are "name" = "JOHN" and "testing" =
[{"code":"cod1234","tests":[{"testtype":"java","errorline":"100"},{"testtype":"C++","errorline":"10000"}]},
 {"code":"cod6790","tests":[{"testtype":"hive","errorline":"10"},{"testtype":"pig","errorline":"978"},{"testtype":"spark","errorline":"35"}]}]
How can I select these values and store them in another table emp_test_detail(name, code, testtype, errorline) as:
JOHN cod1234 java 100
JOHN cod1234 C++ 10000
JOHN cod6790 hive 10
JOHN cod6790 pig 978
JOHN cod6790 spark 35
I have tried the query below but got an error:
insert into emp_test_detail select
    emp_tasting.code,
    emp_tasting.emp_tests.testtype,
    emp_tasting.emp_tests.errorline
from emp_test
lateral view explode(testing) mytest as emp_tasting
lateral view explode(testing[0].tests) mytest as emp_tasting;
Also, I don't know the exact length of the testing array, so how do I reference the array fields? Please help me with this.
In your example query, the error is likely caused by using emp_tasting as the column alias for both lateral view explode lines; they need different aliases.
To un-nest an array two levels deep, you need to explode the first array, then refer to the alias of that exploded array when exploding the nested array.
For example, you wanted name, code, testtype, errorline
name is available directly in the table
code is available from the first explode
testtype and errorline are available from the nested explode.
Note that I am working from your schema rather than the data you've listed, since the schema is easier to reason about.
This query should do what you want:
SELECT
    name,
    testingelement.code,
    test.testtype,
    test.errorline
FROM emp_test
LATERAL VIEW explode(testing) testingarray AS testingelement
LATERAL VIEW explode(testingelement.tests) testsarray AS test;
Table and column aliases
Note that explode has two aliases added after it, the first is for the table expression it generates, the second is for the column(s).
So in this example
LATERAL VIEW explode(testing) testingarray as testingelement
testingarray is the table alias
testingelement is the array column alias you need to reference to extract the fields within the struct.
Skipping the first explode
If you only want fields directly from the table and from the nested array, you can shorten that query to a single LATERAL VIEW explode:
LATERAL VIEW explode(testing.tests) testsarray AS test
The problem with that is that it will also explode empty arrays, and you can't use * star expansion; you have to refer to field names explicitly. That's not a bad thing.
What is a bad thing is having to use array indexes in a query. As soon as you start writing field[0] then something smells funky. That would only ever get the first element of the array, and as you've said it relies on knowing the size of the array beforehand which would have very limited use cases.
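To actually store the result in emp_test_detail as the question asks, the same SELECT can feed an INSERT (a sketch, assuming emp_test_detail already exists with columns name, code, testtype, errorline):
-- fully flattens the two-level array and loads the rows into the detail table
INSERT INTO TABLE emp_test_detail
SELECT name,
       testingelement.code,
       test.testtype,
       test.errorline
  FROM emp_test
       LATERAL VIEW explode(testing) testingarray AS testingelement
       LATERAL VIEW explode(testingelement.tests) testsarray AS test;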