Spark Dataframe - Merge Nested Columns into one

I'm trying to merge a dataframe that has multiple nested struct columns into one single struct column. Here's a sample schema.
Before:
|- column_1
|- column_2
|- column_3
|- nested_details_1
|  |- a
|  |- b
|  |- c
|- nested_details_2
|  |- d
|  |- e
|  |- f
Desired after:
|- column_1
|- column_2
|- column_3
|- nested_details
|  |- a
|  |- b
|  |- c
|  |- d
|  |- e
|  |- f
Visually I can see what needs to be done, but is there a function with data frames to merge columns for me? If it makes any difference I am using AWS Glue Dynamic Dataframe but I can convert easily enough to a normal data frame.

pyspark
I managed to merge 2 structs using the pyspark struct function like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, col, expr
spark = SparkSession.builder.getOrCreate()
input_df = spark.sql("""
SELECT
1 as column_1,
2 as column_2,
3 as column_3,
named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
named_struct("d", 4, "e", 5, "f", 6) as nested_details_2
""")
merged_df = input_df.select(
col("column_1"),
col("column_2"),
col("column_3"),
struct(col("nested_details_1.*"),col("nested_details_2.*")).alias("nested_details")
)
merged_df.printSchema()
merged_df.show()
merged_df.select(expr("typeof(nested_details)")).show(truncate=False)
The above prints:
root
|-- column_1: integer (nullable = false)
|-- column_2: integer (nullable = false)
|-- column_3: integer (nullable = false)
|-- nested_details: struct (nullable = false)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
| |-- c: integer (nullable = false)
| |-- d: integer (nullable = false)
| |-- e: integer (nullable = false)
| |-- f: integer (nullable = false)
+--------+--------+--------+------------------+
|column_1|column_2|column_3| nested_details|
+--------+--------+--------+------------------+
| 1| 2| 3|{1, 2, 3, 4, 5, 6}|
+--------+--------+--------+------------------+
+-------------------------------------------+
|typeof(nested_details) |
+-------------------------------------------+
|struct<a:int,b:int,c:int,d:int,e:int,f:int>|
+-------------------------------------------+
Spark SQL
The same can be achieved in Spark SQL using the struct function, like this:
with input as (
SELECT
1 as column_1,
2 as column_2,
3 as column_3,
named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
named_struct("d", 4, "e", 5, "f", 6) as nested_details_2
)
SELECT
column_1,
column_2,
column_3,
struct(nested_details_1.*, nested_details_2.*) as nested_details
FROM input
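If there are more than two struct columns to merge, the struct(...) expression can be generated rather than typed out. The following is only a sketch: merged_struct_expr is a hypothetical helper name (not part of Spark), and it simply builds the same SQL text as the query above.

```python
def merged_struct_expr(struct_cols, alias):
    """Build a Spark SQL expression that merges several struct columns.

    Hypothetical helper: it only produces SQL text of the form
    struct(col_1.*, col_2.*, ...) as alias.
    """
    if not struct_cols:
        raise ValueError("at least one struct column is required")
    # Expand every struct with ".*" so that its fields become top-level
    # fields of the new struct.
    stars = ", ".join(f"{name}.*" for name in struct_cols)
    return f"struct({stars}) as {alias}"


expr_text = merged_struct_expr(
    ["nested_details_1", "nested_details_2"], "nested_details"
)
print(expr_text)
```

The resulting string should be usable wherever a SQL expression is accepted, for example as one of the arguments to input_df.selectExpr(...). Field names are assumed to be unique across the merged structs; I have not checked how duplicates behave.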

Related

How to add the index of the array as a field to an array of structs in pyspark dataframe

I have a dataframe containing an array of structs. I would like to add the index of the array as a field within the struct. Is this possible?
So structure would go from:
|-- my_array_column: array
| |-- element: struct
| | |-- field1: string
| | |-- field2: string
to:
|-- my_array_column: array
| |-- element: struct
| | |-- field1: string
| | |-- field2: string
| | |-- index of element: integer
Many thanks
For Spark 3.1+, you can use the transform function and withField to update each struct element of the array column like this:
from pyspark.sql import functions as F
df = df.withColumn(
"my_array_column",
F.transform("my_array_column", lambda x, i: x.withField("index", i))
)
For older versions, you'll have to recreate the whole struct element in order to add a field:
df = df.withColumn(
"my_array_column",
F.expr("transform(my_array_column, (x, i) -> struct(x.field1 as field1, x.field2 as field2, i as index))")
)
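For the older-version approach, listing every field by hand gets tedious when the struct is wide. As a rough sketch, the expression string can be built from a list of field names; transform_with_index_expr is a made-up helper name, and the field list is assumed to be known (for instance read from the DataFrame schema).

```python
def transform_with_index_expr(array_col, fields, index_name="index"):
    """Build the SQL text for transform(...) that re-creates each struct
    element and appends the element's position as an extra field.

    Hypothetical helper: it only formats a string for F.expr.
    """
    # Re-select every existing field under its own name.
    kept = ", ".join(f"x.{name} as {name}" for name in fields)
    return (
        f"transform({array_col}, (x, i) -> "
        f"struct({kept}, i as {index_name}))"
    )


print(transform_with_index_expr("my_array_column", ["field1", "field2"]))
```

The printed string is the same expression as the one passed to F.expr above, so it could be used as F.expr(transform_with_index_expr(...)).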

Struct data type when creating dataframe with createDataFrame in Scala

In PySpark, we can create a struct data type when using createDataFrame, as with ("b", "c") and ("e", "f") in the following example:
df = spark.createDataFrame([
["a", ("b", "c")],
["d", ("e", "f")]
])
df.printSchema()
# root
# |-- _1: string (nullable = true)
# |-- _2: struct (nullable = true)
# | |-- _1: string (nullable = true)
# | |-- _2: string (nullable = true)
df.show()
# +---+------+
# | _1| _2|
# +---+------+
# | a|{b, c}|
# | d|{e, f}|
# +---+------+
Is there a similar way in Scala - to create struct schema inside createDataFrame, without using org.apache.spark.sql.functions?
For your specific example, you can use tuples and call this flavor of createDataFrame.
val df = spark.createDataFrame(Seq(
("a", ("b", "c")),
("d", ("e", "f"))
))
df.printSchema()
/*
root
|-- _1: string (nullable = true)
|-- _2: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
*/
df.show()
/*
+---+------+
| _1| _2|
+---+------+
| a|[b, c]|
| d|[e, f]|
+---+------+
*/
Instead of ("b", "c") one can also use "b" -> "c" to create a tuple of length 2.
Preferred method
Tuples can become difficult to manage when dealing with many fields, and especially nested fields. You'll likely want to model your data using case classes. This also lets you specify the struct field names and types.
case class Person(name: String, age: Int)
case class Car(manufacturer: String, model: String, mileage: Double, owner: Person)
val df = spark.createDataFrame(Seq(
Car("Toyota", "Camry", 81400.8, Person("John", 37)),
Car("Honda", "Accord", 152090.2, Person("Jane", 25))
))
df.printSchema()
/*
root
|-- manufacturer: string (nullable = true)
|-- model: string (nullable = true)
|-- mileage: double (nullable = false)
|-- owner: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- age: integer (nullable = false)
*/
df.show()
/*
+------------+------+--------+----------+
|manufacturer| model| mileage| owner|
+------------+------+--------+----------+
| Toyota| Camry| 81400.8|[John, 37]|
| Honda|Accord|152090.2|[Jane, 25]|
+------------+------+--------+----------+
*/

Scala Spark Dataframe: how to explode an array of Int and array of struct at the same time

I'm new to Scala/Spark and I'm trying to explode a dataframe that has an array column and an array-of-struct column, so that I end up with no arrays and no structs.
Here's an example
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(4,5,6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
df.printSchema
df.show
df has the following schema
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
and the data looks like
+---+---------+--------------------+
| id| before| after|
+---+---------+--------------------+
| 1|[4, 5, 6]|[[07:00, 07:30, 7...|
+---+---------+--------------------+
How do I explode the dataframe so I get the following schema
|-- id: string (nullable = true)
|-- before: integer (containsNull = false)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- after_area: string (nullable = true)
The resulting data should have 3 rows and 5 columns
+---+---------+--------------------+--------------------+--------+
| id| before| after_start_time| after_end_time| after_area|
+---+---------+--------------------+--------------------+--------+
| 1| 4| 07:00| 07:30| 70|
| 1| 5| 08:00| 08:30| 80|
| 1| 6| 09:00| 09:30| 90|
+---+---------+--------------------+--------------------+--------+
I'm using Spark 2.3.0 (arrays_zip is not available), and the only solutions I can find are for exploding either two arrays of String or one array of struct.
Use arrays_zip to combine the two arrays, then explode to expand the zipped array into rows, and use as to rename the required columns.
As arrays_zip is not available in Spark 2.3, I created a UDF to perform the same operation.
val arrays_zip = udf((before:Seq[Int],after: Seq[Area]) => before.zip(after))
Execution time with built in (spark 2.4.2) arrays_zip - Time taken: 1146 ms
Execution time with arrays_zip UDF - Time taken: 1165 ms
Check below code.
scala> df.show(false)
+---+---------+------------------------------------------------------------+
|id |before |after |
+---+---------+------------------------------------------------------------+
|1 |[4, 5, 6]|[[07:00, 07:30, 70], [08:00, 08:30, 80], [09:00, 09:30, 90]]|
+---+---------+------------------------------------------------------------+
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.printSchema
root
|-- id: string (nullable = true)
|-- before: integer (nullable = true)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- area: string (nullable = true)
Output
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.show(false)
+---+------+----------------+--------------+----+
|id |before|after_start_time|after_end_time|area|
+---+------+----------------+--------------+----+
|1 |4 |07:00 |07:30 |70 |
|1 |5 |08:00 |08:30 |80 |
|1 |6 |09:00 |09:30 |90 |
+---+------+----------------+--------------+----+
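To make the zip-then-explode logic easier to follow, here is a small plain-Python model of what happens to one row. It is only an illustration with ordinary dicts and lists, not Spark code, and zip_and_explode is a made-up name.

```python
def zip_and_explode(row):
    """Model of: zip the two arrays, explode them, flatten the struct.

    `row` is a plain dict standing in for one DataFrame row.
    Python's zip, like Scala's zip used in the UDF above, stops at the
    shorter of the two sequences.
    """
    exploded = []
    for before, after in zip(row["before"], row["after"]):
        flat = {"id": row["id"], "before": before}
        # Prefix the struct fields so they become top-level columns.
        for key, value in after.items():
            flat[f"after_{key}"] = value
        exploded.append(flat)
    return exploded


row = {
    "id": "1",
    "before": [4, 5, 6],
    "after": [
        {"start_time": "07:00", "end_time": "07:30", "area": "70"},
        {"start_time": "08:00", "end_time": "08:30", "area": "80"},
        {"start_time": "09:00", "end_time": "09:30", "area": "90"},
    ],
}
for flat_row in zip_and_explode(row):
    print(flat_row)
```

Each input row with three array elements becomes three flat rows, which is the shape asked for in the question.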
To handle a more complex struct, you can do the following:
1. Declare two beans, Area (input) and Area2 (output).
2. Map each row to the Area2 bean.
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import scala.collection.mutable
object ExplodeTwoArrays {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
val df = Seq((
"1", Seq(4, 5, 6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
val outDf = df.map(row=> {
val id = row.getString(0)
val beforeArray : Seq[Int]= row.getSeq[Int](1)
val afterArray : mutable.WrappedArray[Area2] =
row.getAs[mutable.WrappedArray[GenericRowWithSchema]](2) // Need to map Array(Struct) to the something compatible
.zipWithIndex // Require to iterate with indices
.map{ case(element,i) => {
Area2(element.getAs[String]("start_time"),
element.getAs[String]("end_time"),
element.getAs[String]("area"),
beforeArray(i))
}}
(id,afterArray) // Return row(id,Array(Area2(...)))
}).toDF("id","after")
outDf.printSchema()
outDf.show()
}
}
case class Area(start_time: String, end_time: String, area: String)
case class Area2(start_time: String, end_time: String, area: String, before: Int)

The output of Spark DataFrame replace column names with "0", "1" when arrays_zip data originated nested

I used the Spark SQL function arrays_zip combined with flatten to transform data from an array of structs of inner arrays of the same length into an array of structs. printSchema shows exactly what I want. However, the df output loses the original column names and replaces them with the generic column names "0", "1", "2", etc., whether in Parquet or Avro format. I would like to output the original column names.
To avoid revealing my company's business, the following are similar but much simplified examples.
scala> c2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- month: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- num: array (nullable = true)
| | | |-- element: long (containsNull = true)
scala> c2.show(false)
+----------------------------------------------+
|cal |
+----------------------------------------------+
|[[[Jan, Feb, Mar], [1, 2, 3]], [[April], [4]]]|
+----------------------------------------------+
I would like to transform it to
scala> newC2.show(false)
+------------------------------------------+
|cal |
+------------------------------------------+
|[[Jan, 1], [Feb, 2], [Mar, 3], [April, 4]]|
+------------------------------------------+
with
scala> newC2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
I know arrays_zip only works well on top-level arrays. Therefore, I flatten them to the top level. The following code works in this example:
val newC2 = c2.withColumn("month", flatten(col("cal.month"))).withColumn("num", flatten(col("cal.num"))).withColumn("cal", arrays_zip(col("month"), col("num"))).drop("month", "num")
It generates exactly the data and schema I want. However, it outputs all columns generically as "0", "1", "2", etc.
newC2.write.option("header", false).parquet("c2_parquet")
I tried another example that has the original month array and num array at the top level. There I can use arrays_zip without flatten and get the same schema and data as shown, and the original field names are output correctly.
I tried adding an alias to the flattened data, but that does not work. I even tried manipulating the columns like this (assuming the field that stores the result of arrays_zip is 'zipped'):
val columns: Array[Column] = inner.fields.map(_.name).map{x => col("zipped").getField(x).alias(x)}
val newB3 = newB2.withColumn("b", array(struct(columns:_*))).drop("zipped")
It ends up generating the original schema ("month", an array of string, and "num", an array of long).
To reproduce the problem, you can use the following JSON input:
{"cal":[{"month":["Jan","Feb","Mar"],"num":[1,2,3]},{"month":["April"],"num":[4]}]}
The following JSON is for the top-level arrays_zip case:
{"month":["Jan","Feb","Mar"],"num":[1,2,3]}
How does Spark internally decide which field names to use? How can I get it to work? Please advise.
Since Spark 2.4, the schema transformation can be achieved using Higher Order Functions. In Scala the query can look like this:
import org.apache.spark.sql.functions.{expr, flatten}
val result = df
.withColumn("cal", flatten(expr("TRANSFORM(cal, x -> zip_with(x.month, x.num, (month, num) -> (month,num)))")))
After applying it to your sample data I get this schema:
result.printSchema()
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
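As a sanity check on what the TRANSFORM / zip_with / flatten query computes, the same reshaping can be modelled in plain Python on the sample JSON from the question. This is only an illustration of the logic, not Spark code; flatten_zip is a made-up name, and the inner arrays are assumed to have equal lengths, as stated in the question.

```python
import json


def flatten_zip(cal):
    """Pair month[i] with num[i] inside each element, then flatten.

    Plain-Python model of the query above, assuming the inner arrays of
    each element have the same length.
    """
    return [
        {"month": month, "num": num}
        for entry in cal
        for month, num in zip(entry["month"], entry["num"])
    ]


sample = json.loads(
    '{"cal":[{"month":["Jan","Feb","Mar"],"num":[1,2,3]},'
    '{"month":["April"],"num":[4]}]}'
)
print(flatten_zip(sample["cal"]))
```

The result is one list of month/num pairs, matching the structure of the desired newC2 output.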

How do I add a column to a nested struct in a PySpark dataframe?

I have a dataframe with a schema like
root
|-- state: struct (nullable = true)
| |-- fld: integer (nullable = true)
I'd like to add columns within the state struct, that is, create a dataframe with a schema like
root
|-- state: struct (nullable = true)
| |-- fld: integer (nullable = true)
| |-- a: integer (nullable = true)
I tried
df.withColumn('state.a', val).printSchema()
# root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
# |-- state.a: integer (nullable = true)
Here is a way to do it without using a udf:
# create example dataframe
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, IntegerType
data = [
({'fld': 0},)
]
schema = StructType(
[
StructField('state',
StructType(
[StructField('fld', IntegerType())]
)
)
]
)
df = sqlCtx.createDataFrame(data, schema)
df.printSchema()
#root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
Now use withColumn() and add the new field using lit() and alias().
val = 1
df_new = df.withColumn(
'state',
f.struct(*[f.col('state')['fld'].alias('fld'), f.lit(val).alias('a')])
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
If you have a lot of fields in the nested struct you can use a list comprehension, using df.schema["state"].dataType.names to get the field names. For example:
val = 1
s_fields = df.schema["state"].dataType.names # ['fld']
df_new = df.withColumn(
'state',
f.struct(*([f.col('state')[c].alias(c) for c in s_fields] + [f.lit(val).alias('a')]))
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
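The same rebuild can also be written as a SQL expression string, which may be convenient with f.expr or selectExpr. This is a sketch only: struct_with_extra_field_expr is a hypothetical helper, and the field names are assumed to come from df.schema["state"].dataType.names as shown above.

```python
def struct_with_extra_field_expr(struct_col, field_names, new_field, value_sql):
    """Build SQL text that re-creates a struct with one extra field.

    Hypothetical helper: `value_sql` is the SQL text of the new value
    (for example "1" or "'abc'"), not a Python value.
    """
    # Keep every existing field under its original name.
    parts = [f"{struct_col}.{name} as {name}" for name in field_names]
    parts.append(f"{value_sql} as {new_field}")
    return "struct(" + ", ".join(parts) + ")"


print(struct_with_extra_field_expr("state", ["fld"], "a", "1"))
```

The printed expression could then be passed along the lines of df.withColumn('state', f.expr(...)).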
References
I found a way to get the field names from the Struct without naming them manually from this answer.
Use a transformation such as the following:
import pyspark.sql.functions as f
df = df.withColumn(
"state",
f.struct(
f.col("state.*"),
f.lit(123).alias("a")
)
)
Although this is a late answer, the following is supported for PySpark 2.x.x.
Assuming dfOld already contains state and fld as asked in the question:
from pyspark.sql.functions import col, lit, struct
dfOld = dfOld.withColumn("a", lit("value"))
dfNew = dfOld.select("level1Field1", "level1Field2", struct(col("state.fld").alias("fld"), col("a")).alias("state"))
Reference: https://medium.com/#mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803
Here's a way to do it without a udf.
Initialize example dataframe:
nested_df1 = (spark.read.json(sc.parallelize(["""[
{ "state": {"fld": 1} },
{ "state": {"fld": 2}}
]"""])))
nested_df1.printSchema()
root
|-- state: struct (nullable = true)
| |-- fld: long (nullable = true)
Spark .read.json imports all integers as long by default.
If state.fld has to be an int, you will need to cast it.
from pyspark.sql import functions as F
# cast first, then alias, so the struct field keeps the name fld
nested_df1 = (nested_df1
.select( F.struct(F.col("state.fld").cast('int').alias("fld")).alias("state") ))
nested_df1.printSchema()
root
|-- state: struct (nullable = false)
| |-- fld: integer (nullable = true)
nested_df1.show()
+-----+
|state|
+-----+
| [1]|
| [2]|
+-----+
Finally
Use .select to get the nested columns you want from the existing struct with the "parent.child" notation, create the new column, then re-wrap the old columns together with the new columns in a struct.
val_a = 3
nested_df2 = (nested_df1
.select(
F.struct(
F.col("state.fld"),
F.lit(val_a).alias("a")
).alias("state")
)
)
nested_df2.printSchema()
root
|-- state: struct (nullable = false)
| |-- fld: integer (nullable = true)
| |-- a: integer (nullable = false)
nested_df2.show()
+------+
| state|
+------+
|[1, 3]|
|[2, 3]|
+------+
Flatten if needed with "parent.*".
nested_df2.select("state.*").printSchema()
root
|-- fld: integer (nullable = true)
|-- a: integer (nullable = false)
nested_df2.select("state.*").show()
+---+---+
|fld| a|
+---+---+
| 1| 3|
| 2| 3|
+---+---+
Spark 3.1+
F.col('state').withField('a', F.lit(1))
Example:
from pyspark.sql import functions as F
df = spark.createDataFrame([((1,),)], 'state:struct<fld:int>')
df.printSchema()
# root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
df = df.withColumn('state', F.col('state').withField('a', F.lit(1)))
df.printSchema()
# root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
You can use the struct function
import pyspark.sql.functions as f
df = df.withColumn(
"state",
f.struct(
f.col("state.fld").alias("fld"),
f.lit(1).alias("a")
)
)
from pyspark.sql.functions import *
from pyspark.sql.types import *

def add_field_in_dataframe(nfield, df, dt):
    # nfield is a dotted path such as "a.b.c"; the new field is added as null cast to dt
    fields = nfield.split(".")
    print(fields)
    n = len(fields)
    addField = fields[0]
    if n == 1:
        return df.withColumn(addField, lit(None).cast(dt))
    nestedField = ".".join(fields[:-1])
    sfields = df.select(nestedField).schema[fields[-2]].dataType.names
    print(sfields)
    ac = col(nestedField)
    # rebuild the innermost struct with the new field appended
    if n == 2:
        nc = struct(*( [ac[c].alias(c) for c in sfields] + [lit(None).cast(dt).alias(fields[-1])]))
    else:
        nc = struct(*( [ac[c].alias(c) for c in sfields] + [lit(None).cast(dt).alias(fields[-1])])).alias(fields[-2])
    print(nc)
    n = n - 1
    # walk outwards, rebuilding each parent struct around the updated child
    while n > 1:
        print("n: ", n)
        fields = fields[:-1]
        print("fields: ", fields)
        nestedField = ".".join(fields[:-1])
        print("nestedField: ", nestedField)
        sfields = df.select(nestedField).schema[fields[-2]].dataType.names
        print(fields[-1])
        print("sfields: ", sfields)
        sfields = [s for s in sfields if s != fields[-1]]
        print("sfields: ", sfields)
        ac = col(".".join(fields[:-1]))
        if n > 2:
            print(fields[-2])
            nc = struct(*( [ac[c].alias(c) for c in sfields] + [nc])).alias(fields[-2])
        else:
            nc = struct(*( [ac[c].alias(c) for c in sfields] + [nc]))
        n = n - 1
    return df.withColumn(addField, nc)
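The idea behind the function above is that a nested field cannot be added in place: every struct along the dotted path has to be rebuilt around the updated child. A plain-Python sketch of the same idea on nested dicts may make that easier to see. It is not Spark code, add_nested_field is a made-up name, and unlike the Spark version it creates missing parents instead of failing.

```python
def add_nested_field(record, dotted_path, value=None):
    """Return a copy of `record` with `dotted_path` set to `value`.

    Every dict along the path is rebuilt (copied), mirroring how each
    parent struct is re-created around the updated child.
    Assumption: missing parents are created as empty dicts.
    """
    head, _, rest = dotted_path.partition(".")
    updated = dict(record)  # shallow copy of this level only
    if not rest:
        updated[head] = value
    else:
        child = record.get(head) or {}
        updated[head] = add_nested_field(child, rest, value)
    return updated


original = {"state": {"fld": 1}}
print(add_nested_field(original, "state.a", 3))
print(original)  # the input is left untouched
```

Sibling fields at every level are carried over unchanged, which corresponds to the [ac[c].alias(c) for c in sfields] part of the Spark function.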