Extract Spark data frame from a nested structure

I have a DataFrame with nested structure (originally an Avro output from a mapreduce job). I would like to flatten it. The schema of original DataFrame looks like this (simplified):
|-- key: struct
| |-- outcome: boolean
| |-- date: string
| |-- age: int
| |-- features: map
| | |-- key: string
| | |-- value: double
|-- value: struct (nullable = true)
| |-- nullString: string (nullable = true)
In JSON representation, one row of the data looks like this:
{"key":
{"outcome": false,
"date": "2015-01-01",
"age" : 20,
"features": {
{"f1": 10.0,
"f2": 11.0,
...
"f100": 20.1
}
},
"value": null
}
The features map has the same structure for all rows, i.e. the key set is the same (f1, f2, ..., f100). By "flatten" I mean the following.
+----------+----------+---+----+----+-...-+------+
| outcome| date|age| f1| f2| ... | f100|
+----------+----------+---+----+----+-...-+------+
| true|2015-01-01| 20|10.0|11.0| ... | 20.1|
...
(truncated)
I am using Spark 2.1.0 and the spark-avro package from https://github.com/databricks/spark-avro.
The original DataFrame is read in by:
import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
// it's nested
df.show()
+--------------------+------+
| key| value|
+--------------------+------+
|[false,2015... |[null]|
|[false,2015... |[null]|
...
(truncated)
Any help is greatly appreciated!

In Spark you can extract data from a nested Avro file. For example, the JSON you have provided:
{"key":
{"outcome": false,
"date": "2015",
"features": {
{"f1": v1,
"f2": v2,
...
}
},
"value": null
}
after being read from Avro:
import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
can be flattened. For that, you can write something like this:
df.select("key.*").show
+----+------------+-------+
|date| features |outcome|
+----+------------+-------+
|2015| [v1,v2,...]| false|
+----+------------+-------+
...
(truncated)
df.select("key.*").printSchema
root
|-- date: string (nullable = true)
|-- features: struct (nullable = true)
| |-- f1: string (nullable = true)
| |-- f2: string (nullable = true)
| |-- ...
|-- outcome: boolean (nullable = true)
or something like this:
df.select("key.features.*").show
+---+---+---
| f1| f2|...
+---+---+---
| v1| v2|...
+---+---+---
...
(truncated)
df.select("key.features.*").printSchema
root
|-- f1: string (nullable = true)
|-- f2: string (nullable = true)
|-- ...
Hopefully this is the output you are expecting.
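If you also want the single flat table from the question, note that in the question's schema features is a map rather than a struct, so the map entries have to be selected explicitly with getItem (features.* only works for structs). A minimal sketch, assuming the fixed key set f1 ... f100 described in the question:
import org.apache.spark.sql.functions.col
// Assumes key.features is a MapType with the fixed, known key set f1 ... f100.
val featureKeys = (1 to 100).map(i => s"f$i")
val flatCols =
  Seq(col("key.outcome"), col("key.date"), col("key.age")) ++
  featureKeys.map(k => col("key.features").getItem(k).as(k))
val flatDf = df.select(flatCols: _*)
flatDf.show()
If features really comes in as a struct (as in the schema shown above), a plain df.select("key.outcome", "key.date", "key.age", "key.features.*") is enough.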

Related

How to filter a struct array in a spark dataframe?

I have the following code and outputs.
import org.apache.spark.sql.functions.{collect_list, struct}
import sqlContext.implicits._
val df = Seq(
("john", "tomato", 1.99),
("john", "carrot", 0.45),
("bill", "apple", 0.99),
("john", "banana", 1.29),
("bill", "taco", 2.59)
).toDF("name", "food", "price")
val grouped = df.groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))
grouped.show(false)
grouped.printSchema
Output and Schema:
+----+---------------------------------------------+
|name|foods |
+----+---------------------------------------------+
|john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
|bill|[[apple,0.99], [taco,2.59]] |
+----+---------------------------------------------+
root
|-- name: string (nullable = true)
|-- foods: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- food: string (nullable = true)
| | |-- price: double (nullable = false)
I want to filter based on df("foods.price") > 1.00. How do I filter this to get the output below?
+----+---------------------------------------------+
|name|foods |
+----+---------------------------------------------+
|john|[[banana,1.29], [tomato,1.99]] |
|bill|[[taco,2.59]]                                |
+----+---------------------------------------------+
I have tried df.filter($"foods.food" > 1.00), but this does not work as I'm getting an error. Anything else I can try?
You are trying to apply a filter on an array column, so it throws an error because that comparison is not valid on an array. You can filter on price first and then do the grouping/transformation as needed:
val cf = df.filter("price > 1.0").groupBy($"name").agg(collect_list(struct($"food", $"price")).as("foods"))
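If you are on Spark 2.4 or later, another option (not part of the original answer) is to group first and then prune each foods array with the filter higher-order function in a SQL expression:
import org.apache.spark.sql.functions.{collect_list, struct, expr}
// Build the grouped frame, then drop the cheap items from each array.
// Assumes Spark 2.4+, where SQL higher-order functions are available.
val grouped = df
  .groupBy($"name")
  .agg(collect_list(struct($"food", $"price")).as("foods"))
val filtered = grouped.withColumn("foods", expr("filter(foods, x -> x.price > 1.0)"))
filtered.show(false)
This variant keeps a name's row even if all of its items were cheap (the array would just be empty), whereas filtering before the groupBy drops such rows entirely.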

Rearrange Array of Struct to Array of a Struct with a field as Array in Pyspark

I have a "simple" dataframe with an array of struct(nome, h_0, h_1, ..., h_23) column, and I want to rearrange this column as an array of struct(nome, array(h_0, h_1, ..., h_23)).
as-is:
root
|-- array_area: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- nome: string (nullable = true)
| | |-- h_0: string (nullable = true)
| | |-- h_1: string (nullable = true)
| | |-- h_10: string (nullable = true)
| | |-- h_11: string (nullable = true)
| | |-- h_12: string (nullable = true)
| | |-- h_13: string (nullable = true)
| | |-- h_14: string (nullable = true)
| | |-- h_15: string (nullable = true)
| | |-- h_16: string (nullable = true)
| | |-- h_17: string (nullable = true)
| | |-- h_18: string (nullable = true)
| | |-- h_19: string (nullable = true)
| | |-- h_2: string (nullable = true)
| | |-- h_20: string (nullable = true)
| | |-- h_21: string (nullable = true)
| | |-- h_22: string (nullable = true)
| | |-- h_23: string (nullable = true)
| | |-- h_3: string (nullable = true)
| | |-- h_4: string (nullable = true)
| | |-- h_5: string (nullable = true)
| | |-- h_6: string (nullable = true)
| | |-- h_7: string (nullable = true)
| | |-- h_8: string (nullable = true)
| | |-- h_9: string (nullable = true)
I want:
root
|-- array_area: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- nome: string (nullable = true)
| | |-- circadiana: array (nullable = true)
| | | |--element: struct (containsNull = true)
| | | | |--h_0: string (nullable = true)
| | | | |--h_1: string (nullable = true)
| | | | |--h_2: string (nullable = true)
| | | | |--... until h_23
I've used UDF like:
concat_udf = F.udf(lambda arr: F.array(F.struct(x["nome"], F.array(x["h_0"],x["h_1"],x["h_2"],x["h_3"],x["h_4"],x["h_5"],x["h_6"],x["h_7"],x["h_8"],x["h_9"],x["h_10"],x["h_11"],x["h_12"],x["h_13"],x["h_14"],x["h_15"],x["h_16"],x["h_17"],x["h_18"],x["h_19"],x["h_20"],x["h_21"],x["h_22"],x["h_23"])) for x in arr),
ArrayType(StructType([StructField("nome", StringType(), True),StructField("circadiana", ArrayType(StringType()), True)])))
The printSchema output is OK, but when I view the data with show():
df_new = df.withColumn("area_provenienza_X", concat_udf(F.col("array_area"))).show()
I get this error:
File "/opt/giotto/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
process()
File "/opt/giotto/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/giotto/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 345, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/opt/giotto/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/opt/giotto/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 334, in _batched
for item in iterator:
File "<string>", line 1, in <lambda>
File "/opt/giotto/spark/python/lib/pyspark.zip/pyspark/worker.py", line 83, in <lambda>
return lambda *a: toInternal(f(*a))
File "/opt/giotto/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "LoadFileSIMO112_dati_aggregati.py", line 150, in <lambda>
x["h_23"])) for x in arr),
File "/opt/giotto/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 1869, in array
jc = sc._jvm.functions.array(_to_seq(sc, cols, _to_java_column))
AttributeError: 'NoneType' object has no attribute '_jvm'
Example Data:
"area": [{
"nome": "Extra",
"h_0": "0",
"h_1": "0",
"h_2": "0",
"h_3": "0",
"h_4": "0",
"h_5": "0",
"h_6": "1",
"h_7": "0",
"h_8": "0",
"h_9": "0",
"h_10": "1",
"h_11": "1",
"h_12": "0",
"h_13": "1",
"h_14": "0",
"h_15": "0",
"h_16": "0",
"h_17": "1",
"h_18": "0",
"h_19": "1",
"h_20": "0",
"h_21": "1",
"h_22": "0",
"h_23": "1"
},
{
"nome": "ROMA CAP",
"h_0": "130",
"h_1": "94",
"h_2": "116",
"h_3": "61",
"h_4": "54",
"h_5": "47",
"h_6": "58",
"h_7": "57",
"h_8": "87",
"h_9": "0",
"h_10": "0",
"h_11": "0",
"h_12": "0",
"h_13": "0",
"h_14": "0",
"h_15": "0",
"h_16": "0",
"h_17": "0",
"h_18": "0",
"h_19": "0",
"h_20": "0",
"h_21": "0",
"h_22": "0",
"h_23": "124"
}]
I want:
"area": [{
"nome": "Extra",
"circadiana":[0,0,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,1,0,1,0,1,0,1]
},
{
"nome": "ROMA CAP",
"circadiana":[130,94,116,61,54,47,58,87,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,124]
}]
But there aren't any null values in my dataframe (df).
Thanks for the help.
Regards
The error comes from calling pyspark.sql.functions (F.array, F.struct) inside a UDF: those functions build Column expressions and need the driver's SparkContext, which is not available on the executors, hence 'NoneType' object has no attribute '_jvm'. Instead of a UDF, you can use Spark's built-in transform function to convert each element of the array into the desired struct.
First, get all h_x field names present in the structs:
import pyspark.sql.functions as F
h_fields = [c for c in df.select(F.expr("inline(area)")).columns if c != "nome"]
Then, using the transform function on the area array, for each element we create a struct with 2 fields. The first field holds nome and the second field is an array circadiana built from all the other fields (h_0, ...):
transform_expr = f"""
transform(area,
x -> struct(
x.nome as nome,
array({','.join([f'x.{c}' for c in h_fields])}) as circadiana
)
)
"""
df1 = df.withColumn("area", F.expr(transform_expr))
df1.printSchema()
#root
# |-- area: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- nome: string (nullable = true)
# | | |-- circadiana: array (nullable = false)
# | | | |-- element: string (containsNull = true)
If the list of h_x fields is fixed you can actually use:
transform_expr = f"""
transform(area,
x -> struct(
x.nome as nome,
array({','.join([f'x.h_{i}' for i in range(24)])}) as circadiana
)
)
"""

Update array of structs - Spark

I have the following Spark Delta table structure:
+---+------------------------------------------------------+
|id |addresses |
+---+------------------------------------------------------+
|1 |[{"Address":"ABC", "Street": "XXX"}, {"Address":"XYZ", "Street": "YYY"}]|
+---+------------------------------------------------------+
Here the addresses column is an array of structs.
I need to update the Address inside each array element with the value of that element's Street attribute, without changing anything else in the list.
So "ABC" should be updated to "XXX" and "XYZ" should be updated to "YYY".
You can assume I have many more attributes in the struct (street, zipcode, etc.), so I want to leave them untouched and just update the value of Address from the Street attribute.
How can I do this in Spark, Databricks, or SQL?
Schema:
|-- id: string (nullable = true)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Address: string (nullable = true)
| | |-- Street: string (nullable = true)
Cheers!
Please check the code below.
scala> vdf.show(false)
+---+--------------+
|id |addresses |
+---+--------------+
|1 |[[ABC], [XYZ]]|
+---+--------------+
scala> vdf.printSchema
root
|-- id: integer (nullable = false)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Address: string (nullable = true)
scala> val new_address = array(struct(lit("AAA").as("Address")))
scala> val except_first = array_except($"addresses",array($"addresses"(0)))
scala> val addresses = array_union(new_address,except_first).as("addresses")
scala> vdf.select($"id",addresses).select($"id",$"addresses",to_json($"addresses").as("json_addresses")).show(false)
+---+--------------+-------------------------------------+
|id |addresses |json_addresses |
+---+--------------+-------------------------------------+
|1 |[[AAA], [XYZ]]|[{"Address":"AAA"},{"Address":"XYZ"}]|
+---+--------------+-------------------------------------+
Updated
scala> vdf.withColumn("addresses",explode($"addresses")).groupBy($"id").agg(collect_list(struct($"addresses.Street".as("Address"),$"addresses.Street")).as("addresses")).withColumn("json_data",to_json($"addresses")).show(false)
+---+------------------------+-------------------------------------------------------------------+
|id |addresses |json_data |
+---+------------------------+-------------------------------------------------------------------+
|1 |[[XXX, XXX], [YYY, YYY]]|[{"Address":"XXX","Street":"XXX"},{"Address":"YYY","Street":"YYY"}]|
+---+------------------------+-------------------------------------------------------------------+
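As an alternative (my addition, not part of the original answer), on Spark 3.1+ you can rewrite each element in place with transform and Column.withField, which copies Street into Address while leaving any other struct fields (zipcode, etc.) untouched:
import org.apache.spark.sql.functions.{col, to_json, transform}
// Assumes Spark 3.0+ for functions.transform and Spark 3.1+ for Column.withField.
val updatedDf = vdf.withColumn(
  "addresses",
  transform(col("addresses"), x => x.withField("Address", x.getField("Street")))
)
updatedDf.withColumn("json_addresses", to_json(col("addresses"))).show(false)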

Scala Spark Dataframe: how to explode an array of Int and array of struct at the same time

I'm new to Scala/Spark and I'm trying to explode a dataframe that has an array column and an array of struct column, so that I end up with no arrays and no structs.
Here's an example
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(4,5,6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
df.printSchema
df.show
df has the following schema
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
and the data looks like
+---+---------+--------------------+
| id| before| after|
+---+---------+--------------------+
| 1|[4, 5, 6]|[[07:00, 07:30, 7...|
+---+---------+--------------------+
How do I explode the dataframe so I get the following schema
|-- id: string (nullable = true)
|-- before: integer (containsNull = false)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- after_area: string (nullable = true)
The resulting data should have 3 rows and 5 columns
+---+---------+--------------------+--------------------+--------+
| id| before|    after_start_time|      after_end_time|after_area|
+---+---------+--------------------+--------------------+--------+
| 1| 4| 07:00| 07:30| 70|
| 1| 5| 08:00| 08:30| 80|
| 1| 6| 09:00| 09:30| 90|
+---+---------+--------------------+--------------------+--------+
I'm using Spark 2.3.0 (arrays_zip is not available), and the only solutions I can find are either for exploding two arrays of strings or one array of structs.
Use arrays_zip to combine the two arrays, then explode to flatten the zipped array column, and use as to rename the required columns.
Since arrays_zip is not available in Spark 2.3, create a UDF to perform the same operation:
val arrays_zip = udf((before:Seq[Int],after: Seq[Area]) => before.zip(after))
Execution time with the built-in (Spark 2.4.2) arrays_zip - Time taken: 1146 ms
Execution time with the arrays_zip UDF - Time taken: 1165 ms
Check the code below.
scala> df.show(false)
+---+---------+------------------------------------------------------------+
|id |before |after |
+---+---------+------------------------------------------------------------+
|1 |[4, 5, 6]|[[07:00, 07:30, 70], [08:00, 08:30, 80], [09:00, 09:30, 90]]|
+---+---------+------------------------------------------------------------+
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.printSchema
root
|-- id: string (nullable = true)
|-- before: integer (nullable = true)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- area: string (nullable = true)
Output
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.show(false)
+---+------+----------------+--------------+----+
|id |before|after_start_time|after_end_time|area|
+---+------+----------------+--------------+----+
|1 |4 |07:00 |07:30 |70 |
|1 |5 |08:00 |08:30 |80 |
|1 |6 |09:00 |09:30 |90 |
+---+------+----------------+--------------+----+
To handle a more complex struct you can:
Declare two beans, Area (input) and Area2 (output)
Map each row to an Area2 bean
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import scala.collection.mutable
object ExplodeTwoArrays {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
val df = Seq((
"1", Seq(4, 5, 6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
val outDf = df.map(row=> {
val id = row.getString(0)
val beforeArray : Seq[Int]= row.getSeq[Int](1)
val afterArray : mutable.WrappedArray[Area2] =
row.getAs[mutable.WrappedArray[GenericRowWithSchema]](2) // Need to map Array(Struct) to the something compatible
.zipWithIndex // Require to iterate with indices
.map{ case(element,i) => {
Area2(element.getAs[String]("start_time"),
element.getAs[String]("end_time"),
element.getAs[String]("area"),
beforeArray(i))
}}
(id,afterArray) // Return row(id,Array(Area2(...)))
}).toDF("id","after")
outDf.printSchema()
outDf.show()
}
}
case class Area(start_time: String, end_time: String, area: String)
case class Area2(start_time: String, end_time: String, area: String, before: Int)
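The outDf above still carries an array column (id, after: Array[Area2]). A possible final step (my addition, meant to go inside main, where spark.implicits._ is already imported) to reach the flat five-column shape requested in the question:
import org.apache.spark.sql.functions.explode
// Explode the Area2 array and pull its fields out as top-level columns.
outDf
  .select($"id", explode($"after").as("a"))
  .select(
    $"id",
    $"a.before".as("before"),
    $"a.start_time".as("after_start_time"),
    $"a.end_time".as("after_end_time"),
    $"a.area".as("after_area")
  )
  .show(false)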

Schema Conversion from String datatype to Array(Map(Array)) datatype in Pyspark

We are reading data from DynamoDB, so we get the datatype as string, but we want to write the string data type as array(map(array)).
String data:
{"policy_details":[{"cdhid":" 123","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}, {"cdhid":" 1234","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}]}
Output required:
The string data type needs to be converted to ARRAY(MAP(ARRAY)).
We have tried the below schema:
ArrayType([
StructField("policy_num", MapType(ArrayType([
StructField("cdhid", StringType(), True),
StructField("role_id", StringType(), True),
StructField("role_desc", StringType(), True)
])))
])
and are getting the below issue:
elementType [StructField(cdhid,StringType,true),
StructField(role_id,StringType,true),
StructField(role_desc,StringType,true)] should be an instance of < class 'pyspark.sql.types.DataType' >
Regarding your data, the schema you want is not the one that fits.
The schema of your data is:
from pyspark.sql import types as T
schm = T.StructType([T.StructField("policy_details",T.ArrayType(T.StructType([
T.StructField("cdhid", T.StringType(), True),
T.StructField("p2cid", T.StringType(), True),
T.StructField("roleDesc", T.StringType(), True),
T.StructField("positionnum", T.StringType(), True),
])), True)])
Then, you just need to use the from_json function.
from pyspark.sql import functions as F
df.show()
+--------------------+
| db_data|
+--------------------+
|{"policy_details"...|
+--------------------+
new_df = df.select(F.from_json("db_data", schm).alias("data"))
new_df.printSchema()
root
|-- data: struct (nullable = true)
| |-- policy_details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- cdhid: string (nullable = true)
| | | |-- p2cid: string (nullable = true)
| | | |-- roleDesc: string (nullable = true)
| | | |-- positionnum: string (nullable = true)
EDIT: If you want to use MapType, you can replace the schema with:
schm = T.StructType([
T.StructField(
"policy_details",
T.ArrayType(T.MapType(
T.StringType(),
T.StringType()
)),
True
)
])