How to convert a DataFrame map column to a struct column?

Suppose we have a DataFrame with a column of map type.
val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// +--------------------+
What is the most straightforward way to convert it to a struct (or, equivalently, to define a new column with the same keys and values but as a struct type)? See the following spark-shell (2.4.5) session for an insanely inefficient way of going about it:
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Now, obviously collect() is very inefficient, and this is generally an awful way to do things in Spark. But what is the preferred way to accomplish this conversion? named_struct and struct both take a sequence of parameter values to construct the result, but I can't find any way to "unwrap" the map's keys/values to pass them to these functions.
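For the special case where the keys are known up front, the closest I've gotten is indexing the map key by key (a sketch below, assuming spark.implicits._ and org.apache.spark.sql.functions._ are in scope as in the session above), but that doesn't generalize to arbitrary keys:
// Sketch: build the struct by looking up each known key in the map
val structDf = df.select(
  struct(
    $"mapColumn"("foo").as("foo"),
    $"mapColumn"("bar").as("bar")
  ).as("structColumn")
)
structDf.printSchema()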

I would use the explode function. Given the input:
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
df.select(explode('mapColumn)).select(struct('*).as("struct"))
output:
+--------+
| struct|
+--------+
|[foo, 1]|
|[bar, 2]|
+--------+
root
|-- struct: struct (nullable = false)
| |-- key: string (nullable = false)
| |-- value: integer (nullable = false)
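If the original rows carry other columns, they can be kept alongside the exploded entries before rebuilding the struct. A sketch, using a hypothetical id column that is not in the example df above:
// Sketch: carry a (hypothetical) id column through the explode, then rebuild one struct per map entry
df.select($"id", explode($"mapColumn"))
  .select($"id", struct($"key", $"value").as("struct"))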

I see @chlebek's answer, but in case it should be kept in one row you can use a UDF:
scala> val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>]
scala> df.show
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
scala> case class KeyValue(key: String, value: String)
defined class KeyValue
scala> val toArrayOfStructs = udf((value: Map[String, String]) => value.map {
| case (k, v) => KeyValue(k, v)
| }.toArray )
toArrayOfStructs: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(key,StringType,true), StructField(value,StringType,true)),true),Some(List(MapType(StringType,StringType,true))))
scala> df.withColumn("alfa", toArrayOfStructs(col("mapColumn")))
res4: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>, alfa: array<struct<key:string,value:string>>]
scala> res4.show
+--------------------+--------------------+
| mapColumn| alfa|
+--------------------+--------------------+
|[foo -> 1, bar -> 2]|[[foo, 1], [bar, 2]]|
+--------------------+--------------------+
scala> res4.printSchema
root
|-- mapColumn: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- alfa: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
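On Spark 2.4+ the built-in map_entries function produces the same array-of-structs shape in a single row without a UDF; a sketch:
// Sketch, Spark 2.4+: map_entries turns map<k,v> into array<struct<key:k,value:v>> directly
df.withColumn("alfa", map_entries(col("mapColumn"))).printSchema()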

Your method doesn't seem to work with more rows, especially if the maps have different keys, like this one:
val df = Seq(
(Map("foo"->1, "bar"->2)),
(Map("foo"->3, "baz"->4))
).toDF("mapColumn")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// |{foo -> 3, baz -> 4}|
// +--------------------+
Your script would return only the first row's map:
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Solutions
map to columns:
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.withColumn("_c", from_json(json_col, json_schema)).select("_c.*")
df2.show()
// +----+----+---+
// | bar| baz|foo|
// +----+----+---+
// | 2|null| 1|
// |null| 4| 3|
// +----+----+---+
map to struct (field names: "key", "value"):
val df2 = df.select(explode(map_entries($"mapColumn")).as("struct"))
df2.show()
// +--------+
// | struct|
// +--------+
// |{foo, 1}|
// |{bar, 2}|
// |{foo, 3}|
// |{baz, 4}|
// +--------+
map to struct (field names: "foo", "bar", "baz"):
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.select(from_json(json_col, json_schema).as("struct"))
df2.show()
// +------------+
// | struct|
// +------------+
// |{2, null, 1}|
// |{null, 4, 3}|
// +------------+

Define a case class and an Encoder for it, then map each row to the case class:
import org.apache.spark.sql.Encoders

case class Bean56(foo: Int, bar: Int)
// Define an Encoder for the case class (Encoders.product works for Scala case classes)
val bean56Encoder = Encoders.product[Bean56]
val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
// Map each row to the required case class
val bean56s = df.map(row => {
  val map = row.getMap[String, Int](0)
  Bean56(map.getOrElse("foo", -1), map.getOrElse("bar", -1))
})(bean56Encoder) // Supply the Encoder explicitly
bean56s.foreach(println(_)) // Print each bean
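With spark.implicits._ in scope the explicit encoder is not needed at all, since Spark derives a product encoder for the case class; a sketch under that assumption:
import spark.implicits._
// The encoder for Bean56 is derived implicitly here
val bean56Ds = df.map { row =>
  val m = row.getMap[String, Int](0)
  Bean56(m.getOrElse("foo", -1), m.getOrElse("bar", -1))
}
bean56Ds.show()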

Related

query spark dataframe on max column value

I have a Hive external partitioned table with the following data structure:
hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210105/file1.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210110/file2.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.1/dt=20210201/file3.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.0/dt=20210103/file4.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210210/file5.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210311/file6.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file7.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file8.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file9.parquet
I want to return a dataframe containing data for the foo project, for the max version.
I want to avoid reading records for any other project.
I'm unable to query this table directly due to limitations in our ETL process, so I have tried reading directly from Parquet:
val df_foo = spark.read.parquet("hdfs://my_server/stg/my_table/project=foo")
df_foo.printSchema
root
|-- clientid: string (nullable = true)
|-- some_field_i_care_about: string (nullable = true)
|-- project_version: double (nullable = true)
|-- dt: string (nullable = true)
df_foo.groupBy("project_version", "dt").count().show
+---------------+--------+------+
|project_version| dt| count|
+---------------+--------+------+
| 2.0|20210105|187234|
| 2.0|20210110|188356|
| 2.1|20210201|188820|
+---------------+--------+------+
val max_version = df_foo.groupBy().max("project_version")
max_version.show
+--------------------+
|max(project_version)|
+--------------------+
| 2.1|
+--------------------+
val df_foo_latest = df_foo.filter($"project_version" === max_version).count()
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [max(project_version): double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
The project_version column is a double, and the max_version value is also a double, so why can't I compare these values in the filter?
Thanks for any help.
max_version is of type org.apache.spark.sql.DataFrame, not Double. You have to extract the value from the DataFrame.
Check the code below.
scala> val df_foo = Seq((2.0,20210105,187234),(2.0,20210110,188356),(2.1,20210201,188820)).toDF("project_version","dt","count")
df_foo: org.apache.spark.sql.DataFrame = [project_version: double, dt: int ... 1 more field]
scala> val max_version = df_foo.groupBy().agg(max("project_version").as("version")).as[Double].collect.head
max_version: Double = 2.1
scala> val df_foo_latest = df_foo.filter($"project_version" === max_version).count()
df_foo_latest: Long = 1
scala> val df_foo_latest = df_foo.filter($"project_version" === max_version)
df_foo_latest: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [project_version: double, dt: int ... 1 more field]
scala> df_foo_latest.count
res1: Long = 1
scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt |count |
+---------------+--------+------+
|2.1 |20210201|188820|
+---------------+--------+------+
Instead of extracting the value from the DataFrame, try using an inner join. It's much safer.
scala> val max_version = df_foo.groupBy().max("project_version")
max_version: org.apache.spark.sql.DataFrame = [max(project_version): double]
scala> val max_version = df_foo.groupBy().agg(max("project_version").as("project_version"))
scala> val df_foo_latest = df_foo.join(max_version, Seq("project_version"), "inner")
scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt |count |
+---------------+--------+------+
|2.1 |20210201|188820|
+---------------+--------+------+
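Another option, if moving all rows to a single partition is acceptable for this data size, is a window max, which needs neither a second action nor a join; a sketch:
import org.apache.spark.sql.expressions.Window

// Sketch: an empty partitionBy() puts every row in one window partition (Spark warns about this)
val w = Window.partitionBy()
val df_foo_latest = df_foo
  .withColumn("max_version", max($"project_version").over(w))
  .filter($"project_version" === $"max_version")
  .drop("max_version")
df_foo_latest.show(false)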

Extract columns from a list of lists in pyspark

I am stuck trying to extract columns from a list of lists but can't visualize how to do it. I am fairly new to spark. Running pyspark on Spark 2.4.3.
I have a json organized like this:
{ "meta" : { ... },
"data" :
[[ "a", 0, null, "{ }"],
[ "b", 0, null, "{ }"],
[ "c", 0, null, "{ }"],
] }
I want to get the 'data' portion into columns, like
+------+------+------+------+
| col1 | col2 | col3 | col4 |
+------+------+------+------+
| a    | 0    | None | "{ }"|
| b    | 0    | None | "{ }"|
| c    | 0    | None | "{ }"|
+------+------+------+------+
I have my dataframe read in and printSchema() shows this.
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- approvals: array (nullable = true) ...
My rough shape is 70 columns by 650k rows.
I was able to explode the df to get just the data portion but am stuck there.
Explode the rows first, and then select the array elements using [] in Python.
import pyspark.sql.functions as F

df2 = df.select(F.explode('data').alias('data')) \
    .select(*[F.col('data')[i].alias('col%s' % (i+1)) for i in range(4)])
df2.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| a| 0|null| { }|
| b| 0|null| { }|
| c| 0|null| { }|
+----+----+----+----+
Why don't you just use the SparkSession.createDataFrame() method?
https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
You can provide data and schema parameters to this method and get spark dataframe.
Example:
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.getOrCreate()
df = sparkSession.createDataFrame(data)
If Spark cannot infer the schema from the data, then the schema also needs to be provided:
from pyspark.sql.types import StructType
struct = StructType()
struct.add("col1", "string", True)
struct.add("col2", "integer", True)
struct.add("col3", "string", True)
struct.add("col4", "string", True)
df = sparkSession.createDataFrame(data=data, schema=struct)
In addition, you can use pyspark type classes instead of Python primitive type names.
https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#module-pyspark.sql.types
The module contains both simple types (StringType, IntegerType, ...) and complex types (ArrayType, MapType, ...).
One last note: the data cannot contain null; it should be None in Python. Spark's DataFrame.show() will print None values as null.

Update array of structs - Spark

I have the following spark delta table structure,
+---+------------------------------------------------------+
|id |addresses |
+---+------------------------------------------------------+
|1 |[{"Address":"ABC", "Street": "XXX"}, {"Address":"XYZ", "Street": "YYY"}]|
+---+------------------------------------------------------+
Here the addresses column is an array of structs.
I need to update the Address attribute inside each array element from that element's Street attribute's value, without changing anything else in the elements.
So "ABC" should be updated to "XXX" and "XYZ" should be updated to "YYY".
You can assume I have many more attributes in the struct, like street, zipcode, etc., so I want to leave them untouched and just update the value of Address from the Street attribute.
How can I do this in Spark or Databricks or Sql?
Schema,
|-- id: string (nullable = true)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Address: string (nullable = true)
| | |-- Street: string (nullable = true)
Cheers!
Please check the code below.
scala> vdf.show(false)
+---+--------------+
|id |addresses |
+---+--------------+
|1 |[[ABC], [XYZ]]|
+---+--------------+
scala> vdf.printSchema
root
|-- id: integer (nullable = false)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Address: string (nullable = true)
scala> val new_address = array(struct(lit("AAA").as("Address")))
scala> val except_first = array_except($"addresses",array($"addresses"(0)))
scala> val addresses = array_union(new_address,except_first).as("addresses")
scala> vdf.select($"id",addresses).select($"id",$"addresses",to_json($"addresses").as("json_addresses")).show(false)
+---+--------------+-------------------------------------+
|id |addresses |json_addresses |
+---+--------------+-------------------------------------+
|1 |[[AAA], [XYZ]]|[{"Address":"AAA"},{"Address":"XYZ"}]|
+---+--------------+-------------------------------------+
Updated
scala> vdf
  .withColumn("addresses", explode($"addresses"))
  .groupBy($"id")
  .agg(collect_list(struct($"addresses.Street".as("Address"), $"addresses.Street")).as("addresses"))
  .withColumn("json_data", to_json($"addresses"))
  .show(false)
+---+------------------------+-------------------------------------------------------------------+
|id |addresses |json_data |
+---+------------------------+-------------------------------------------------------------------+
|1 |[[XXX, XXX], [YYY, YYY]]|[{"Address":"XXX","Street":"XXX"},{"Address":"YYY","Street":"YYY"}]|
+---+------------------------+-------------------------------------------------------------------+
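On Spark 2.4+ the same update can also be written with the transform higher-order function, which rewrites each element in place and avoids the explode/groupBy round trip. A sketch, assuming the addresses elements carry both Address and Street fields as in the question's schema (any further fields would need to be listed in named_struct as well):
// Sketch: rewrite Address from Street for every element, keeping the array structure
val updated = vdf.withColumn(
  "addresses",
  expr("transform(addresses, x -> named_struct('Address', x.Street, 'Street', x.Street))")
)
updated.show(false)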

Scala Spark Dataframe: how to explode an array of Int and array of struct at the same time

I'm new to Scala/Spark and I'm trying to explode a dataframe that has an array column and an array-of-struct column, so that I end up with no arrays and no structs.
Here's an example
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(4,5,6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
df.printSchema
df.show
df has the following schema
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
and the data looks like
+---+---------+--------------------+
| id| before| after|
+---+---------+--------------------+
| 1|[4, 5, 6]|[[07:00, 07:30, 7...|
+---+---------+--------------------+
How do I explode the dataframe so I get the following schema
|-- id: string (nullable = true)
|-- before: integer (containsNull = false)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- after_area: string (nullable = true)
The resulting data should have 3 rows and 5 columns
+---+---------+--------------------+--------------------+--------+
| id| before| after_start_time| after_end_time| after_area|
+---+---------+--------------------+--------------------+--------+
| 1| 4| 07:00| 07:30| 70|
| 1| 5| 08:00| 08:30| 80|
| 1| 6| 09:00| 09:30| 90|
+---+---------+--------------------+--------------------+--------+
I'm using Spark 2.3.0 (arrays_zip is not available), and the only solutions I can find are either for exploding two arrays of String or one array of struct.
Use arrays_zip to combine the two arrays, then explode the zipped array and use as to rename the required columns.
Since arrays_zip is not available in Spark 2.3, a UDF can perform the same operation:
val arrays_zip = udf((before: Seq[Int], after: Seq[Area]) => before.zip(after))
Execution time with the built-in (Spark 2.4.2) arrays_zip: 1146 ms
Execution time with the arrays_zip UDF: 1165 ms
Check the code below.
scala> df.show(false)
+---+---------+------------------------------------------------------------+
|id |before |after |
+---+---------+------------------------------------------------------------+
|1 |[4, 5, 6]|[[07:00, 07:30, 70], [08:00, 08:30, 80], [09:00, 09:30, 90]]|
+---+---------+------------------------------------------------------------+
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.printSchema
root
|-- id: string (nullable = true)
|-- before: integer (nullable = true)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- area: string (nullable = true)
Output
scala>
df
.select(
$"id",
explode(
arrays_zip($"before",$"after")
.cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
).as("before_after")
)
.select(
$"id",
$"before_after.before".as("before"),
$"before_after.after.start_time".as("after_start_time"),
$"before_after.after.end_time".as("after_end_time"),
$"before_after.after.area"
)
.show(false)
+---+------+----------------+--------------+----+
|id |before|after_start_time|after_end_time|area|
+---+------+----------------+--------------+----+
|1 |4 |07:00 |07:30 |70 |
|1 |5 |08:00 |08:30 |80 |
|1 |6 |09:00 |09:30 |90 |
+---+------+----------------+--------------+----+
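A UDF-free route that also works on Spark 2.3 is posexplode on each array followed by a join on the position; a sketch, assuming before and after always have the same length:
// Sketch: posexplode both arrays, then line the pieces up by (id, pos)
val beforeDf = df.select($"id", posexplode($"before").as(Seq("pos", "before")))
val afterDf  = df.select($"id", posexplode($"after").as(Seq("pos", "after")))
beforeDf.join(afterDf, Seq("id", "pos"))
  .select(
    $"id",
    $"before",
    $"after.start_time".as("after_start_time"),
    $"after.end_time".as("after_end_time"),
    $"after.area".as("after_area")
  )
  .show(false)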
To handle a more complex struct you can:
Declare two beans, Area (input) and Area2 (output)
Map each row to an Area2 bean
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import scala.collection.mutable
object ExplodeTwoArrays {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    import spark.implicits._
    val df = Seq((
      "1", Seq(4, 5, 6),
      Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
    )).toDF("id", "before", "after")
    val outDf = df.map(row => {
      val id = row.getString(0)
      val beforeArray: Seq[Int] = row.getSeq[Int](1)
      val afterArray: mutable.WrappedArray[Area2] =
        row.getAs[mutable.WrappedArray[GenericRowWithSchema]](2) // Need to map Array(Struct) to something compatible
          .zipWithIndex // Required to iterate with indices
          .map { case (element, i) =>
            Area2(element.getAs[String]("start_time"),
              element.getAs[String]("end_time"),
              element.getAs[String]("area"),
              beforeArray(i))
          }
      (id, afterArray) // Return a row of (id, Array(Area2(...)))
    }).toDF("id", "after")
    outDf.printSchema()
    outDf.show()
  }
}
case class Area(start_time: String, end_time: String, area: String)
case class Area2(start_time: String, end_time: String, area: String, before: Int)

How to join two dataframes on an array column when one's elements are nullable and the other's are not

I have two dataframes, df1 and df2.
df1
df1's Index column was generated by a UDF.
+------+----------+
|other | Index    |
+------+----------+
|     5|    [1, 1]|
|     1|    [0, 7]|
|     2|    [1, 7]|
|     8|    [2, 6]|
|     6|    [2, 7]|
+------+----------+
|-- gridIndex: array (nullable = true)
| |-- element: integer (containsNull = false)
df2
var l =new ListBuffer[(Int, Array[Int])]
l.append((5,Array(1,1)))
l.append((6,Array(0,7)))
l.append((1,Array(1,7)))
l.append((3,Array(2,6)))
l.append((7,Array(2,7)))
val df2 = l.toDF("other2","Index")
+------+----------+
|other2| Index    |
+------+----------+
|     5|    [1, 1]|
|     6|    [0, 7]|
|     1|    [1, 7]|
|     3|    [2, 6]|
|     7|    [2, 7]|
+------+----------+
|-- gridIndex: array (nullable = true)
| |-- element: integer (containsNull = true)
I want to join them by Index.
But as you can see, an error appears:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Join keys from two sides should have same types
df1.Index is different from df2.Index: their containsNull is different.
So how can I solve this problem?
I also want to know why df2's schema has |-- element: integer (containsNull = true).
Thank you very much!
I'm able to do this.
Not sure if this was the desired result as you have not posted the expected output.
val df1 = Seq(
(5,Array(1,1)),
(1,Array(0,7)),
(2,Array(1,7)),
(8,Array(2,6)),
(6,Array(2,7)),
(100,Array(null.asInstanceOf[Int],null.asInstanceOf[Int]))
).toDF("Other","Index")
val df2 = Seq(
(5,Array(1,1)),
(6,Array(0,7)),
(1,Array(1,7)),
(3,Array(2,6)),
(7,Array(null.asInstanceOf[Int],null.asInstanceOf[Int]))
).toDF("Other2","Index")
root
|-- Other: integer (nullable = false)
|-- Index: array (nullable = true)
| |-- element: integer (containsNull = false)
root
|-- Other2: integer (nullable = false)
|-- Index: array (nullable = true)
| |-- element: integer (containsNull = false)
df1.join(df2, df1("Index") <=> df2("Index") ).show()
+-----+------+------+------+
|Other| Index|Other2| Index|
+-----+------+------+------+
| 5|[1, 1]| 5|[1, 1]|
| 1|[0, 7]| 6|[0, 7]|
| 2|[1, 7]| 1|[1, 7]|
| 8|[2, 6]| 3|[2, 6]|
| 100|[0, 0]| 7|[0, 0]|
+-----+------+------+------+
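Alternatively, if the two dataframes disagree on containsNull as in the original question, the join keys can be aligned by casting one Index column instead of rebuilding the data; a sketch:
// Sketch: "array<int>" parses with containsNull = true, so the cast makes both key types match
val df1Aligned = df1.withColumn("Index", $"Index".cast("array<int>"))
df1Aligned.join(df2, df1Aligned("Index") <=> df2("Index")).show()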