Spark SQL to explode array of struct - apache-spark-sql

I have the JSON structure below, which I am trying to convert to a structure with each element as a column, as shown below, using Spark SQL. explode(control) is not working. Can someone please suggest a way to do this?
Input:
{
"control" : [
{
"controlIndex": 0,
"customerValue": 100.0,
"guid": "abcd",
"defaultValue": 50.0,
"type": "discrete"
},
{
"controlIndex": 1,
"customerValue": 50.0,
"guid": "pqrs",
"defaultValue": 50.0,
"type": "discrete"
}
]
}
Desired output:
controlIndex  customerValue  guid  defaultValue  type
0             100.0          abcd  50.0          discrete
1             50.0           pqrs  50.0          discrete

In addition to Paul Leclercq's answer, here is what can work.
import org.apache.spark.sql.functions.explode
df.select(explode($"control").as("control")).select("control.*")
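For reference, here is a minimal end-to-end sketch of the same approach (assuming the JSON above is saved to a file and read with the multiLine option; the file path is illustrative):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("explode-control").getOrCreate()
import spark.implicits._

// multiLine is needed because the JSON object spans several lines
val df = spark.read.option("multiLine", true).json("/path/to/control.json")

// explode creates one row per element of the "control" array,
// then "control.*" expands the struct fields into top-level columns
val flat = df.select(explode($"control").as("control")).select("control.*")
flat.show()
// +------------+-------------+------------+----+--------+
// |controlIndex|customerValue|defaultValue|guid|    type|
// +------------+-------------+------------+----+--------+
// |           0|        100.0|        50.0|abcd|discrete|
// |           1|         50.0|        50.0|pqrs|discrete|
// +------------+-------------+------------+----+--------+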

Explode will create a new row for each element in the given array or map column
import org.apache.spark.sql.functions.explode
df.select(
explode($"control")
)

Explode will not work here as it's not a normal array column but an array of structs. You might want to try something like
import org.apache.spark.sql.functions.col
df.select(col("control.controlIndex"), col("control.customerValue"), col("control.guid"), col("control.defaultValue"), col("control.type"))

Related

How to load a jsonl file into BigQuery when the file has mixed data fields as columns

During my workflow, after extracting the data from the API, the JSON has the following structure:
[
{
"fields":
[
{
"meta": {
"app_type": "ios"
},
"name": "app_id",
"value": 100
},
{
"meta": {},
"name": "country",
"value": "AE"
},
{
"meta": {
"name": "Top"
},
"name": "position",
"value": 1
}
],
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
}
]
Then it is stored as .jsonl and put on GCS. However, when I load it into BigQuery for further extraction, the automatic schema inference returns the following error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Could not convert value to string. Field: value; Value: 100
I want to convert it into the following structure:
app_type  app_id  country  position  click  price  count
ios       100     AE       Top       1      1      1
Is there a way to define a manual schema in BigQuery to achieve this result? Or do I have to preprocess the jsonl file before putting it on BigQuery?
One of the limitations in loading JSON data from GCS to BigQuery is that it does not support maps or dictionaries in JSON.
An invalid example would be:
"metrics": {
"click": 1,
"price": 1,
"count": 1
}
Your jsonl file should be something like this:
{"app_type":"ios","app_id":"100","country":"AE","position":"Top","click":"1","price":"1","count":"1"}
I already tested it and it works fine.
So wherever you convert the json files to jsonl files and store them on GCS, you will have to do some preprocessing.
You probably have two options:
precreate the target table with the app_id field as an INTEGER
preprocess the json file and enclose 100 in quotes, like "100"
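If the preprocessing happens on the JVM, here is a minimal sketch of the second option, assuming circe (the JSON library used elsewhere in this thread) is available; all names are illustrative:
import io.circe.Json
import io.circe.parser.parse

// Rewrite every JSON number as a quoted string so that mixed-type "value"
// fields no longer break BigQuery's schema auto-detection.
def quoteNumbers(json: Json): Json = json.fold(
  Json.Null,                                               // null
  b => Json.fromBoolean(b),                                // boolean
  n => Json.fromString(n.toString),                        // number -> quoted string
  s => Json.fromString(s),                                 // string
  arr => Json.fromValues(arr.map(quoteNumbers)),           // array: recurse
  obj => Json.fromJsonObject(obj.mapValues(quoteNumbers))  // object: recurse
)

// Example: {"name": "app_id", "value": 100} becomes {"name": "app_id", "value": "100"}
val fixed = parse("""{"meta": {"app_type": "ios"}, "name": "app_id", "value": 100}""").map(quoteNumbers)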

Splitting concatenated values in a table column using DataWeave

I have sample data in my Salesforce table as given below:
I am on Mule 3.9 and running DataWeave 1.0.
I need to use dataweave to read the above data (from a Salesforce table) and transform it into a JSON as given below:
[
{"Id": "634594cc","Name": "Alpha","List": "AB01"},
{"Id": "634594cc","Name": "Alpha","List": "AB02"},
{"Id": "634594cc","Name": "Alpha","List": "AB03"},
{"Id": "5d839e9c","Name": "Bravo","List": "CD01"},
{"Id": "5d839e9c","Name": "Bravo","List": "CD02"},
{"Id": "3a5f34d3","Name": "Charlie","List": null}
]
As you can see above, the "List" column is what I need to split into separate entries in the final JSON. Its data has semicolons at the beginning, in between, and at the end.
Thanks in advance for your help.
I took the liberty of creating some sample data based on the screenshot (by the way, it's best not to post data as a screenshot). :)
Try this:
%dw 1.0
%output application/dw
%var data = [
{
id: "ABC123",
name: "A",
list: ";AB1;AB2;AB3;"
},
{
id: "ZXY321",
name: "B",
list: null
}
]
---
data reduce (e,result=[]) -> (
result ++ using (
list = e.list default "" splitBy /;/ filter ($ != ""),
sharedFields = {
Id: e.id,
Name: e.name
}
) (
using (
flist = list when ((sizeOf list) > 0) otherwise [null]
) (
flist map {
(sharedFields),
List: $
}
)
)
)
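If it helps to see the intent of the transform outside DataWeave, here is the same logic sketched in plain Scala (the sample values mirror the ones above; the names are illustrative only):
// Each input record has an Id, a Name and a semicolon-concatenated List (possibly null)
case class Row(id: String, name: String, list: Option[String])

val rows = Seq(
  Row("ABC123", "A", Some(";AB1;AB2;AB3;")),
  Row("ZXY321", "B", None)
)

val expanded: Seq[Map[String, Any]] = rows.flatMap { r =>
  // Split on ';' and drop the empty fragments caused by leading/trailing semicolons
  val items = r.list.getOrElse("").split(";").filter(_.nonEmpty).toSeq
  // An empty or null list still produces one output row, with List set to null
  val values: Seq[Any] = if (items.isEmpty) Seq(null) else items
  values.map(v => Map("Id" -> r.id, "Name" -> r.name, "List" -> v))
}
// expanded now holds one entry per (Id, Name, List) combination, matching the target JSON shape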

Filter an object array to modify json with circe

I am evaluating Circe and couldn't find out how to use a filter on arrays to transform JSON. I read the guide on its website and the API docs, but still have no clue. Help much appreciated.
Sample data:
{
"Department" : "HR",
"Employees" :[{ "name": "abc", "age": 25 }, {"name":"def", "age" : 30 }]
}
Task:
How do I use a filter on Employees to transform the JSON into another JSON, for example keeping only employees older than 50?
In case you ask: for some reason I can't filter at the data source before the JSON is generated.
Thanks
One possible way of doing this:
import io.circe.{Json, ParsingFailure}
import io.circe.parser.parse

val data = """{"Department" : "HR","Employees" :[{ "name": "abc", "age": 25 }, {"name":"def", "age":30}]}"""
// Keep only the array elements whose "age" is greater than 26
def ageFilter(j: Json): Json = j.withArray { x =>
  Json.fromValues(x.filter(_.hcursor.downField("age").as[Int].map(_ > 26).getOrElse(false)))
}
val y: Either[ParsingFailure, Json] =
  parse(data).map(_.hcursor.downField("Employees").withFocus(ageFilter).top.get)
println(s"$y")
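If you need the age > 50 cut from the question, the same approach works with the predicate parameterised (a sketch, reusing the imports and data above):
// Keep only the employees strictly older than minAge
def olderThan(minAge: Int)(j: Json): Json = j.withArray { employees =>
  Json.fromValues(employees.filter(_.hcursor.get[Int]("age").exists(_ > minAge)))
}

val over50: Either[ParsingFailure, Json] =
  parse(data).map(_.hcursor.downField("Employees").withFocus(olderThan(50)).top.get)
// With the sample data above this leaves Employees as an empty array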

BigQuery: Create column of JSON datatype

I am trying to load json with the following schema into BigQuery:
{
key_a:value_a,
key_b:{
key_c:value_c,
key_d:value_d
},
key_e:{
key_f:value_f,
key_g:value_g
}
}
The keys under key_e are dynamic, i.e. in one response key_e will contain key_f and key_g, and in another response it will instead contain key_h and key_i. New keys can be created at any time, so I cannot create a record with nullable fields for all possible keys.
Instead I want to create a column with a JSON datatype that can then be queried using the JSON_EXTRACT() function. I have tried loading key_e as a column with the STRING datatype, but its value is analysed as JSON and so the load fails.
How can I load a section of JSON into a single BigQuery column when there is no JSON datatype?
Having your JSON as a single string column inside BigQuery is definitely an option. If you have a large volume of data this can end up with a high query price, as all your data will end up in one column, and the actual querying logic can become quite messy.
If you have the luxury of slightly changing your "design", I would recommend considering the one below, where you can employ REPEATED mode.
Table schema:
[
{ "name": "key_a",
"type": "STRING" },
{ "name": "key_b",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{ "name": "key",
"type": "STRING"},
{ "name": "value",
"type": "STRING"}
]
},
{ "name": "key_e",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{ "name": "key",
"type": "STRING"},
{ "name": "value",
"type": "STRING"}
]
}
]
Example of JSON to load
{"key_a": "value_a1", "key_b": [{"key": "key_c", "value": "value_c"}, {"key": "key_d", "value": "value_d"}], "key_e": [{"key": "key_f", "value": "value_f"}, {"key": "key_g", "value": "value_g"}]}
{"key_a": "value_a2", "key_b": [{"key": "key_x", "value": "value_x"}, {"key": "key_y", "value": "value_y"}], "key_e": [{"key": "key_h", "value": "value_h"}, {"key": "key_i", "value": "value_i"}]}
Please note: it should be newline-delimited JSON, so each row must be on one line.
You can't do this directly with BigQuery, but you can make it work in two passes:
(1) Import your JSON data as a CSV file with a single string column.
(2) Transform each row to pack your "any-type" field into a string. Write a UDF that takes a string and emits the final set of columns you would like. Append the output of this query to your target table.
Example
I'll start with some JSON:
{"a": 0, "b": "zero", "c": { "woodchuck": "a"}}
{"a": 1, "b": "one", "c": { "chipmunk": "b"}}
{"a": 2, "b": "two", "c": { "squirrel": "c"}}
{"a": 3, "b": "three", "c": { "chinchilla": "d"}}
{"a": 4, "b": "four", "c": { "capybara": "e"}}
{"a": 5, "b": "five", "c": { "housemouse": "f"}}
{"a": 6, "b": "six", "c": { "molerat": "g"}}
{"a": 7, "b": "seven", "c": { "marmot": "h"}}
{"a": 8, "b": "eight", "c": { "badger": "i"}}
Import it into BigQuery as a CSV with a single STRING column (I called it 'blob'). I had to set the delimiter character to something arbitrary and unlikely (thorn -- 'þ') or it tripped over the default ','.
Verify your table imported correctly. You should see your simple one-column schema and the preview should look just like your source file.
Next, we write a query to transform it into your desired shape. For this example, we'd like the following schema:
a (INTEGER)
b (STRING)
c (STRING -- packed JSON)
We can do this with a UDF:
// Map a JSON string column ('blob') => { a (integer), b (string), c (json-string) }
bigquery.defineFunction(
'extractAndRepack', // Name of the function exported to SQL
['blob'], // Names of input columns
[{'name': 'a', 'type': 'integer'}, // Output schema
{'name': 'b', 'type': 'string'},
{'name': 'c', 'type': 'string'}],
function (row, emit) {
var parsed = JSON.parse(row.blob);
var repacked = JSON.stringify(parsed.c);
emit({a: parsed.a, b: parsed.b, c: repacked});
}
);
And a corresponding query:
SELECT a, b, c FROM extractAndRepack(JsonAnyKey.raw)
Now you just need to run the query (selecting your desired target table) and you'll have your data in the form you like.
Row a b c
1 0 zero {"woodchuck":"a"}
2 1 one {"chipmunk":"b"}
3 2 two {"squirrel":"c"}
4 3 three {"chinchilla":"d"}
5 4 four {"capybara":"e"}
6 5 five {"housemouse":"f"}
7 6 six {"molerat":"g"}
8 7 seven {"marmot":"h"}
9 8 eight {"badger":"i"}
One way to do it is to load this file as CSV instead of JSON (and quote the values or eliminate newlines in the middle); then it will become a single STRING column inside BigQuery.
P.S. You are right that having a native JSON data type would have made this scenario much more natural, and the BigQuery team is well aware of it.

How to get each JSON child of an array, PostgreSQL

I have PostgreSQL 9.3 and am working with json; my json field in the DB looks like:
{
"route_json": [
{
"someKeys": "someValues",
"time": 123
},
{
"someKeys": "someValues",
"time": 123
}, ... N
]
}
In my case I need to extract the 'time' element from each element of the route_json array and collect them into a new array. Is there any way to do this?
It’s not pretty:
SELECT
value->'time'
FROM
json_array_elements('{"route_json": [{"someKeys": "someValues","time": 123},{"someKeys": "someValues","time": 456}]}'::json->'route_json');