Transform Map[String,Any] to a dataframe in Scala

I have the following object of type Map[String,Any]:
d: Map[String,Any] = Map(count -> 1, results -> List(Map(A -> 1, C -> Hello, B -> Map(BA -> 0, BC -> 0)), Map(A -> 2, C -> Hi, B -> Map(BA -> 0, BC -> 0))), parameters -> Map(P1 -> 805, P2 -> 20230101))
I don't need most of this information. I only need to extract the results list of maps into a dataframe (ignoring the B object, as it is a map). So the desired output would be:
+---+-----+
|  A|    C|
+---+-----+
|  1|Hello|
|  2|   Hi|
+---+-----+
I tried:
val df = d
  .map(m => (m.get("A"), m.get("C")))
  .toDF("A", "C")
But I got:
error: value get is not a member of Any

Try casting: .asInstanceOf[Seq[Map[String, Any]]] on the results value, then .asInstanceOf[(Int, String)] on each extracted tuple:
import org.apache.spark.sql.SparkSession

val d: Map[String, Any] = Map(
  "count" -> 1,
  "results" -> List(
    Map("A" -> 1, "C" -> "Hello", "B" -> Map("BA" -> 0, "BC" -> 0)),
    Map("A" -> 2, "C" -> "Hi", "B" -> Map("BA" -> 0, "BC" -> 0))
  ),
  "parameters" -> Map("P1" -> 805, "P2" -> 20230101)
)

val spark = SparkSession.builder
  .master("local")
  .appName("Spark app")
  .getOrCreate()

import spark.implicits._

val df = d("results")
  .asInstanceOf[Seq[Map[String, Any]]]
  .map(m =>
    (m("A"), m("C")).asInstanceOf[(Int, String)]
  )
  .toDF("A", "C")

df.show()
//+---+-----+
//| A| C|
//+---+-----+
//| 1|Hello|
//| 2| Hi|
//+---+-----+
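A small variant of the same idea, continuing from the snippet above (just a sketch, assuming A is always an Int and C always a String): casting each value separately instead of the whole tuple should fail right at the map step, rather than later inside the encoder, if a field ever has an unexpected type.

// Sketch: cast each value individually (assumes A is Int and C is String).
val df2 = d("results")
  .asInstanceOf[Seq[Map[String, Any]]]
  .map(m => (m("A").asInstanceOf[Int], m("C").asInstanceOf[String]))
  .toDF("A", "C")
df2.show()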

Related

Transform a list of maps to a dataframe

I have the following data:
d = Some(List(Map(id -> 1, n -> Hi), Map(id -> 2, n -> Hello)))
I would like to transform it into a dataframe like the following:
+---+-----+
| id|    n|
+---+-----+
|  1|   Hi|
|  2|Hello|
+---+-----+
I tried the following:
import spark.implicits._
val df = d
  .map(m => (m.get("id"), m.get("n")))
  .toDF("id", "n")
But I'm getting:
error: value get is not a member of Any
.map( m => (m.get("id"),m.get("n")))
Your top-level type here is Option, and I think that's why you can't handle it with a single map. I managed to do it with something like this:
import spark.implicits._
val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
val data = d.fold(List.empty[(Option[String], Option[String])])(_.map(m => (m.get("id"), m.get("n"))))
val df = data.toDF("id", "n")
df.show()
Output:
+---+-----+
| id| n|
+---+-----+
| 1| Hi|
| 2|Hello|
+---+-----+
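An alternative sketch (not from the answer above): unwrap the Option with toList.flatten and read the values directly, assuming both keys are always present in every map.

import spark.implicits._
val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
// toList turns the Option into a (possibly empty) one-element list; flatten removes the nesting.
val rows = d.toList.flatten.map(m => (m("id"), m("n")))
rows.toDF("id", "n").show()

This produces the same output as the fold version above.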

Aggregate by struct field in Spark Scala

I have dataframe like this:
+-------+--------------+-----------------------+
|column1|mapColumn |arrayColumn |
+-------+--------------+-----------------------+
|a |[b -> 100.1] |[d] |
|a |[c -> 200.1] |[d, e] |
+-------+--------------+-----------------------+
val groupFunc = udf((x: Seq[Map[String, Double]]) => x.flatten.toMap)

df.groupBy("column1")
  .agg(
    map(col("column1"),
      struct(
        groupFunc(collect_list("mapColumn")),
        flatten(collect_set(col("arrayColumn")))
      )
    ).as("aggregatedMap")
  )
After the aggregation the result is as follows:
+-------+--------------------------------------------+
|column1|aggregatedMap |
+-------+--------------------------------------------+
|a |[a -> [[b -> 100.1, c -> 200.1], [d, d, e]]]|
+-------+--------------------------------------------+
Expected result:
+-------+--------------------------------------------+
|column1|aggregatedMap |
+-------+--------------------------------------------+
|a |[a -> [[b -> 100.1, c -> 200.1], [d, e]]] |
+-------+--------------------------------------------+
Any ideas how to remove duplicates while aggregating?
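One possible fix, sketched below (not part of the question): keep the aggregation as it is, but wrap the flattened array in array_distinct so duplicate elements are dropped. Column names and the groupFunc UDF are the ones defined in the question, and import org.apache.spark.sql.functions._ is assumed.

df.groupBy("column1")
  .agg(
    map(col("column1"),
      struct(
        groupFunc(collect_list("mapColumn")),
        array_distinct(flatten(collect_set(col("arrayColumn")))) // drop the duplicate "d"
      )
    ).as("aggregatedMap")
  )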

PySpark replace Map Key-Value with empty string for null keys

I did a groupBy on my column and for some reason my dataframe looks like this:
ID col
1 [item1 -> 2, -> 3, item3 -> 4, -> 5]
2 [item2 -> 1, -> 7, item3 -> 2, -> 7]
I want to remove the key-value pairs that have no key or a null key.
I want something like this:
ID col
1 [item1 -> 2, item3 -> 4, -> 5]
2 [item2 -> 1, item3 -> 2, -> 7]
My Approach
dsNew = ds.withColumn("col", expr("map_filter(col, (k, v) -> k != '')"))
But the map_filter method is not available in PySpark
map_filter() is available from version 3.1.0. However, your column needs to be of map type, like this for example:
root
|-- id: long (nullable = true)
|-- data: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
Then you could do the following:
from pyspark.sql import SparkSession, functions as F
df = df.withColumn("filtered_data", F.map_filter("data", lambda k, v: k != ""))

Scala Spark Dataframe sum list of json values in the column

I have a Spark data frame as given below:
+---+---------+-------------------+
| id|     col1|               col2|
+---+---------+-------------------+
|  1|[{"a":1}]| [{"d": 3, "e": 4}]|
|  2|[{"a":2}]|[{"d": 5, "e": 10}]|
+---+---------+-------------------+
I want to obtain the following data frame:
+---+--------+
| id|col2_sum|
+---+--------+
|  1|       7|
|  2|      15|
+---+--------+
Datatypes:
id:StringType
col1:StringType
col2:StringType
Thanks in advance
Convert the JSON string into a map type using from_json, then use the aggregate function to sum the map values:
val df = Seq(
  (1, """[{"a":1}]""", """[{"d": 3, "e": 4}]"""),
  (2, """[{"a":2}]""", """[{"d": 5, "e": 10}]""")
).toDF("id", "col1", "col2")

val df1 = (df
  .withColumn("col2", from_json(col("col2"), lit("array<map<string,int>>")))
  .withColumn("col2", flatten(expr("transform(col2, x -> map_values(x))")))
  .withColumn("col2_sum", expr("aggregate(col2, 0, (acc, x) -> acc + x)"))
  .drop("col1", "col2")
)
df1.show
//+---+--------+
//| id|col2_sum|
//+---+--------+
//| 1| 7|
//| 2| 15|
//+---+--------+
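If the intermediate columns are not needed, the same sum can also be written as a single SQL expression with a nested aggregate; a sketch, assuming the same df and imports as above:

// Sketch: parse col2 and sum all map values in one expression (nested higher-order functions).
val df2 = df.withColumn(
  "col2_sum",
  expr("aggregate(from_json(col2, 'array<map<string,int>>'), 0, (acc, m) -> acc + aggregate(map_values(m), 0, (a, v) -> a + v))")
).select("id", "col2_sum")
df2.show()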

How to convert a DataFrame map column to a struct column?

Suppose we have a DataFrame with a column of map type.
val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// +--------------------+
What is the most straightforward way to convert it to a struct (or, equivalently, to define a new column with the same keys and values but as a struct type)? See the following spark-shell (2.4.5) session for an insanely inefficient way of going about it:
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Now, obviously collect() is very inefficient, and this is generally an awful way to do things in Spark. But what is the preferred way to accomplish this conversion? named_struct and struct both take a sequence of parameter values to construct the results, but I can't find any way to "unwrap" the map key/values to pass them to these functions.
I would use the explode function:
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
df.select(explode('mapColumn)).select(struct('*).as("struct"))
output:
+--------+
| struct|
+--------+
|[foo, 1]|
|[bar, 2]|
+--------+
root
|-- struct: struct (nullable = false)
| |-- key: string (nullable = false)
| |-- value: integer (nullable = false)
I see @chlebek's answer, but in case it should be kept in one row you can use a UDF:
scala> val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>]
scala> df.show
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
scala> case class KeyValue(key: String, value: String)
defined class KeyValue
scala> val toArrayOfStructs = udf((value: Map[String, String]) => value.map {
| case (k, v) => KeyValue(k, v)
| }.toArray )
toArrayOfStructs: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(key,StringType,true), StructField(value,StringType,true)),true),Some(List(MapType(StringType,StringType,true))))
scala> df.withColumn("alfa", toArrayOfStructs(col("mapColumn")))
res4: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>, alfa: array<struct<key:string,value:string>>]
scala> res4.show
+--------------------+--------------------+
| mapColumn| alfa|
+--------------------+--------------------+
|[foo -> 1, bar -> 2]|[[foo, 1], [bar, 2]]|
+--------------------+--------------------+
scala> res4.printSchema
root
|-- mapColumn: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- alfa: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
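As a follow-up to the UDF approach above: if your Spark version ships the built-in map_entries function (it also appears in the next answer), the same array-of-structs column can be produced without a UDF; a minimal sketch, reusing the df and imports from the session above:

// Sketch: map_entries turns map<k,v> into array<struct<key,value>> directly, no UDF needed.
df.withColumn("alfa", map_entries(col("mapColumn")))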
Your method doesn't seem to work with more rows, and especially if they have different maps, like this one:
val df = Seq(
  Map("foo" -> 1, "bar" -> 2),
  Map("foo" -> 3, "baz" -> 4)
).toDF("mapColumn")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// |{foo -> 3, baz -> 4}|
// +--------------------+
Your script would return...
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Solutions
map to columns:
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.withColumn("_c", from_json(json_col, json_schema)).select("_c.*")
df2.show()
// +----+----+---+
// | bar| baz|foo|
// +----+----+---+
// | 2|null| 1|
// |null| 4| 3|
// +----+----+---+
map to struct (field names: "key", "value"):
val df2 = df.select(explode(map_entries($"mapColumn")).as("struct"))
df2.show()
// +--------+
// | struct|
// +--------+
// |{foo, 1}|
// |{bar, 2}|
// |{foo, 3}|
// |{baz, 4}|
// +--------+
map to struct (field names: "foo", "bar", "baz"):
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.select(from_json(json_col, json_schema).as("struct"))
df2.show()
// +------------+
// | struct|
// +------------+
// |{2, null, 1}|
// |{null, 4, 3}|
// +------------+
Define a case class:
import org.apache.spark.sql.Encoders

case class Bean56(foo: Int, bar: Int)

// Define a bean Encoder
val personEncoder = Encoders.bean[Bean56](classOf[Bean56])

val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")

// Map the output to the required bean
val Bean56s = df.map(row => {
  val map = row.getMap[String, Int](0)
  Bean56(map.getOrElse("foo", -1), map.getOrElse("bar", -1))
})(personEncoder) // Supply the bean Encoder explicitly

Bean56s.foreach(println(_)) // Print the bean