PySpark replace Map Key-Value with empty string for null keys - dataframe

I did a groupby on my column and for some reason my dataframe looks like this
ID col
1 [item1 -> 2, -> 3, item3 -> 4, -> 5]
2 [item2 -> 1, -> 7, item3 -> 2, -> 7]
I want to remove the key-value pairs that have no key or a null key.
I want something like this
ID col
1 [item1 -> 2, item3 -> 4, -> 5]
2 [item2 -> 1, item3 -> 2, -> 7]
My Approach
dsNew = ds.withColumn("col", expr("map_filter(col, (k, v) -> k != '')"))
But the map_filter method is not available in PySpark

map_filter() is available from version 3.1.0. However, your column needs to be of map type, like this for example:
root
|-- id: long (nullable = true)
|-- data: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
Then you could do the following:
from pyspark.sql import SparkSession, functions as F
df = df.withColumn("filtered_data", F.map_filter("data", lambda k, v: k != ""))
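For illustration, here is a minimal, self-contained sketch that builds a small map column and drops the entries whose key is null or empty; the sample data and column names are assumptions rather than your actual dataframe:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; "" stands in for the missing keys from the question
df = spark.createDataFrame(
    [(1, {"item1": 2.0, "": 3.0, "item3": 4.0}),
     (2, {"item2": 1.0, "": 7.0, "item3": 2.0})],
    "id long, data map<string, double>",
)

# Keep only the entries whose key is neither null nor empty
filtered = df.withColumn("data", F.map_filter("data", lambda k, v: k.isNotNull() & (k != "")))
filtered.show(truncate=False)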

Related

Transform Map[String,Any] to a dataframe in Scala

I have the following object of type Map[String,Any]:
d: Map[String,Any] = Map(count -> 1, results -> List(Map(A -> 1, C -> Hello, B -> Map(BA -> 0, BC -> 0)), Map(A -> 2, C -> Hi, B -> Map(BA -> 0, BC -> 0))), parameters -> Map(P1 -> 805, P2 -> 20230101))
I don't need most of this information. I only need to extract the results list of maps into a dataframe (ignoring the B object, as it is a map). So the desired output would be:
+---+-----+
|  A|    C|
+---+-----+
|  1|Hello|
|  2|   Hi|
+---+-----+
I tried:
val df = d
  .map(m => (m.get("A"), m.get("C")))
  .toDF("A", "C")
But I got
error: value get is not a member of Any
Try casting with .asInstanceOf[Seq[Map[String, Any]]] and .asInstanceOf[(Int, String)]:
import org.apache.spark.sql.SparkSession

val d: Map[String, Any] = Map(
  "count" -> 1,
  "results" -> List(
    Map("A" -> 1, "C" -> "Hello", "B" -> Map("BA" -> 0, "BC" -> 0)),
    Map("A" -> 2, "C" -> "Hi", "B" -> Map("BA" -> 0, "BC" -> 0))
  ),
  "parameters" -> Map("P1" -> 805, "P2" -> 20230101)
)

val spark = SparkSession.builder
  .master("local")
  .appName("Spark app")
  .getOrCreate()

import spark.implicits._

val df = d("results")
  .asInstanceOf[Seq[Map[String, Any]]]
  .map(m => (m("A"), m("C")).asInstanceOf[(Int, String)])
  .toDF("A", "C")

df.show()
//+---+-----+
//| A| C|
//+---+-----+
//| 1|Hello|
//| 2| Hi|
//+---+-----+

Spark Dataframe - Merge Nested Columns into one

I'm trying to merge a dataframe's multiple nested struct columns into one single struct column. Here's a sample schema
Before:
|- column_1
|- column_2
|- column_3
|- nested_details_1
|  |- a
|  |- b
|  |- c
|- nested_details_2
|  |- d
|  |- e
|  |- f
Desired after:
|- column_1
|- column_2
|- column_3
|- nested_details
|  |- a
|  |- b
|  |- c
|  |- d
|  |- e
|  |- f
Visually I can see what needs to be done, but is there a dataframe function that will merge the columns for me? If it makes any difference, I am using an AWS Glue DynamicFrame, but I can easily enough convert it to a normal dataframe.
PySpark
I managed to merge the two structs using the PySpark struct function like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, col, expr

spark = SparkSession.builder.getOrCreate()

input_df = spark.sql("""
    SELECT
        1 as column_1,
        2 as column_2,
        3 as column_3,
        named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
        named_struct("d", 4, "e", 5, "f", 6) as nested_details_2
""")

merged_df = input_df.select(
    col("column_1"),
    col("column_2"),
    col("column_3"),
    struct(col("nested_details_1.*"), col("nested_details_2.*")).alias("nested_details")
)
merged_df.printSchema()
merged_df.show()
merged_df.select(expr("typeof(nested_details)")).show(truncate=False)
The above prints:
root
|-- column_1: integer (nullable = false)
|-- column_2: integer (nullable = false)
|-- column_3: integer (nullable = false)
|-- nested_details: struct (nullable = false)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
| |-- c: integer (nullable = false)
| |-- d: integer (nullable = false)
| |-- e: integer (nullable = false)
| |-- f: integer (nullable = false)
+--------+--------+--------+------------------+
|column_1|column_2|column_3| nested_details|
+--------+--------+--------+------------------+
| 1| 2| 3|{1, 2, 3, 4, 5, 6}|
+--------+--------+--------+------------------+
+-------------------------------------------+
|typeof(nested_details) |
+-------------------------------------------+
|struct<a:int,b:int,c:int,d:int,e:int,f:int>|
+-------------------------------------------+
Spark SQL
The same can be achieved in SQL as well, using the SQL struct function like this:
with input as (
    SELECT
        1 as column_1,
        2 as column_2,
        3 as column_3,
        named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
        named_struct("d", 4, "e", 5, "f", 6) as nested_details_2
)
SELECT
    column_1,
    column_2,
    column_3,
    struct(nested_details_1.*, nested_details_2.*) as nested_details
FROM input
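If there are many nested_details_* columns, you could also build the merged struct programmatically instead of listing each one. The following is a hypothetical sketch (the helper merge_structs and the prefix convention are my own assumptions, not part of the answer above):
from pyspark.sql.functions import col, struct

# Hypothetical helper: merge every column whose name starts with `prefix`
# into a single struct column, keeping the remaining columns unchanged.
def merge_structs(df, prefix="nested_details_", target="nested_details"):
    struct_cols = [c for c in df.columns if c.startswith(prefix)]
    other_cols = [c for c in df.columns if c not in struct_cols]
    merged = struct(*[col(f"{c}.*") for c in struct_cols]).alias(target)
    return df.select(*[col(c) for c in other_cols], merged)

merged_df = merge_structs(input_df)
merged_df.printSchema()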

How to convert a DataFrame map column to a struct column?

Suppose we have a DataFrame with a column of map type.
val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// +--------------------+
What is the most straightforward way to convert it to a struct (or, equivalently, define a new column with the same keys and values but as a struct type)? See the following spark-shell (2.4.5) session, for an insanely inefficient way of going about it:
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Now, obviously collect() is very inefficient, and this is generally an awful way to do things in Spark. But what is the preferred way to accomplish this conversion? named_struct and struct both take a sequence of parameter values to construct the results, but I can't find any way to "unwrap" the map key/values to pass them to these functions.
I would use the explode function:
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
df.select(explode('mapColumn)).select(struct('*).as("struct"))
output:
+--------+
| struct|
+--------+
|[foo, 1]|
|[bar, 2]|
+--------+
root
|-- struct: struct (nullable = false)
| |-- key: string (nullable = false)
| |-- value: integer (nullable = false)
I see @chlebek's answer, but in case it should be kept in one row, you can use a UDF:
scala> val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>]
scala> df.show
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
scala> case class KeyValue(key: String, value: String)
defined class KeyValue
scala> val toArrayOfStructs = udf((value: Map[String, String]) => value.map {
| case (k, v) => KeyValue(k, v)
| }.toArray )
toArrayOfStructs: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(key,StringType,true), StructField(value,StringType,true)),true),Some(List(MapType(StringType,StringType,true))))
scala> df.withColumn("alfa", toArrayOfStructs(col("mapColumn")))
res4: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>, alfa: array<struct<key:string,value:string>>]
scala> res4.show
+--------------------+--------------------+
| mapColumn| alfa|
+--------------------+--------------------+
|[foo -> 1, bar -> 2]|[[foo, 1], [bar, 2]]|
+--------------------+--------------------+
scala> res4.printSchema
root
|-- mapColumn: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- alfa: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
Your method doesn't seem to work with more rows, and especially if they have different maps, like this one:
val df = Seq(
  Map("foo" -> 1, "bar" -> 2),
  Map("foo" -> 3, "baz" -> 4)
).toDF("mapColumn")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// |{foo -> 3, baz -> 4}|
// +--------------------+
Your script would return...
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Solutions
map to columns:
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.withColumn("_c", from_json(json_col, json_schema)).select("_c.*")
df2.show()
// +----+----+---+
// | bar| baz|foo|
// +----+----+---+
// | 2|null| 1|
// |null| 4| 3|
// +----+----+---+
map to struct (field names: "key", "value"):
val df2 = df.select(explode(map_entries($"mapColumn")).as("struct"))
df2.show()
// +--------+
// | struct|
// +--------+
// |{foo, 1}|
// |{bar, 2}|
// |{foo, 3}|
// |{baz, 4}|
// +--------+
map to struct (field names: "foo", "bar", "baz"):
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.select(from_json(json_col, json_schema).as("struct"))
df2.show()
// +------------+
// | struct|
// +------------+
// |{2, null, 1}|
// |{null, 4, 3}|
// +------------+
Define a case class and map the rows onto it with a bean Encoder:
import org.apache.spark.sql.Encoders

// Define the target case class
case class Bean56(foo: Int, bar: Int)
// Define a bean Encoder
val personEncoder = Encoders.bean[Bean56](classOf[Bean56])

val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
// Map the output to the required bean
val Bean56s = df.map(row => {
  val map = row.getMap[String, Int](0)
  Bean56(map.getOrElse("foo", -1), map.getOrElse("bar", -1))
})(personEncoder) // Supply the Encoder of the bean explicitly
Bean56s.foreach(println(_)) // Print the bean

The output of Spark DataFrame replace column names with "0", "1" when arrays_zip data originated nested

I used the Spark SQL functions arrays_zip combined with flatten to transform data from an array of structs of inner arrays of the same length into an array of structs. printSchema shows exactly what I want. However, the df output loses the original column names and replaces them with the generic column names "0", "1", "2", etc., whether written in Parquet or Avro format. I would like to output the original column names.
So as not to reveal the business of my company, the following are similar but much simplified examples.
scala> c2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- month: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- num: array (nullable = true)
| | | |-- element: long (containsNull = true)
scala> c2.show(false)
+----------------------------------------------+
|cal |
+----------------------------------------------+
|[[[Jan, Feb, Mar], [1, 2, 3]], [[April], [4]]]|
+----------------------------------------------+
I would like to transform it to
scala> newC2.show(false)
+------------------------------------------+
|cal |
+------------------------------------------+
|[[Jan, 1], [Feb, 2], [Mar, 3], [April, 4]]|
+------------------------------------------+
with
scala> newC2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
I know arrays_zip only works well on top-level arrays. Therefore, I flatten them to the top level. The following code works in this example:
val newC2 = c2
  .withColumn("month", flatten(col("cal.month")))
  .withColumn("num", flatten(col("cal.num")))
  .withColumn("cal", arrays_zip(col("month"), col("num")))
  .drop("month", "num")
It generates exactly the data and schema I want. However, it outputs all columns generically named "0", "1", "2", etc.:
newC2.write.option("header", false).parquet("c2_parquet")
I tried another example that has the original data of a month array and a num array at the top level. I can arrays_zip without flatten and get the same schema and data as shown. However, it outputs the original field names correctly in this case.
I tried adding an alias to the flattened data. That does not work. I even tried manipulating the columns like this (assume the field storing the result of arrays_zip is 'zipped'):
val columns: Array[Column] = inner.fields.map(_.name).map{x => col("zipped").getField(x).alias(x)}
val newB3 = newB2.withColumn("b", array(struct(columns:_*))).drop("zipped")
It ends up generating the original schema ("month", array of string, and "num", array of long).
To duplicate the problem, you can use this JSON input:
{"cal":[{"month":["Jan","Feb","Mar"],"num":[1,2,3]},{"month":["April"],"num":[4]}]}
The following JSON is for the top-level arrays_zip case:
{"month":["Jan","Feb","Mar"],"num":[1,2,3]}
How does Spark internally decide which field names to use? How can I get it to work? Please advise.
Since Spark 2.4, the schema transformation can be achieved using Higher Order Functions. In Scala the query can look like this:
import org.apache.spark.sql.functions.{expr, flatten}
val result = df
  .withColumn("cal", flatten(expr("TRANSFORM(cal, x -> zip_with(x.month, x.num, (month, num) -> (month,num)))")))
After applying it to your sample data, I get this schema:
result.printSchema()
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
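For reference, the same higher-order-function expression also works from PySpark via expr; the following hypothetical reproduction builds the dataframe from the sample JSON in the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, flatten

spark = SparkSession.builder.getOrCreate()

# Build a dataframe from the sample JSON given in the question
sample = ['{"cal":[{"month":["Jan","Feb","Mar"],"num":[1,2,3]},{"month":["April"],"num":[4]}]}']
df = spark.read.json(spark.sparkContext.parallelize(sample))

# Same transform + zip_with expression as in the Scala answer above
result = df.withColumn(
    "cal",
    flatten(expr("transform(cal, x -> zip_with(x.month, x.num, (month, num) -> (month, num)))"))
)
result.show(truncate=False)
# Expected: [[Jan, 1], [Feb, 2], [Mar, 3], [April, 4]]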

how to do intersection of list columns with pyspark dataframe columns?

I have the below pyspark dataframe, and I need to create a new column (new_col) containing the items common to columns X and Y, excluding the items in Z.
df
id X Y Z new_col
1 [12,23,1,24] [13,412,12,23,24] [12] [23,24]
2 [1,2,3] [2,4,5,6] [] [2]
If your schema is the following:
df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- X: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- Y: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- Z: array (nullable = true)
# | |-- element: long (containsNull = true)
and your pyspark version is 2.4+, you can use array_intersect and array_except:
from pyspark.sql.functions import array_except, array_intersect
df=df.withColumn("new_col", array_except(array_intersect("X", "Y"), "Z"))
df.show()
#+---+---------------+---------------------+----+--------+
#|id |X |Y |Z |new_col |
#+---+---------------+---------------------+----+--------+
#|1 |[12, 23, 1, 24]|[13, 412, 12, 23, 24]|[12]|[23, 24]|
#|2 |[1, 2, 3] |[2, 4, 5, 6] |[] |[2] |
#+---+---------------+---------------------+----+--------+
You can use withColumn + a UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

def intersection_function(list1, list2):
    # Keep only the values from list1 that also appear in list2
    intersection_list = [value for value in list1 if value in list2]
    return intersection_list

udf_intersection = F.udf(intersection_function, ArrayType(IntegerType()))
newdf = df.withColumn("new_col", udf_intersection(df["ListColumn1"], df["ListColumn2"]))
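For completeness, here is a minimal, hypothetical extension of that UDF approach which also excludes the items found in Z, using the question's column names X, Y and Z (the helper name intersect_except is my own):
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

# Hypothetical three-column UDF: items of X that are also in Y but not in Z
def intersect_except(x, y, z):
    return [v for v in x if v in y and v not in z]

udf_intersect_except = F.udf(intersect_except, ArrayType(IntegerType()))
newdf = df.withColumn("new_col", udf_intersect_except(df["X"], df["Y"], df["Z"]))
newdf.show()
That said, the built-in array_intersect/array_except approach shown above is preferable on Spark 2.4+, since Python UDFs add serialization overhead.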