Spark: Multiple filter inside agg and concat not null values - apache-spark-sql

I'm trying to concatenate the non-null values from a List column.
I know this can be done easily with a UDF, but I would like to know how to handle it with multiple filter conditions inside the agg function.
What am I missing here?
val df = sc.parallelize(Seq(("foo", List(null,"bar",null)),
("bar", List("one","two",null)),
("rio", List("Ria","","Kevin")))).toDF("key", "value")
+---+-----------------+
|key| value|
+---+-----------------+
|foo|[null, bar, null]|
|bar| [one, two, null]|
|rio| [Ria, , Kevin]|
+---+-----------------+
df.groupBy("key")
.agg(concat_ws(",",first(when(($"value".isNotNull || $"value" =!= ""),$"value"))).as("RemovedNullSeq"))
.show(false)
+---+--------------+
|key|RemovedNullSeq|
+---+--------------+
|bar|one,two |
|rio|Ria,,Kevin |
|foo|bar |
+---+--------------+
I don't need that blank value in the second record.
Thanks

I'm not immediately sure if using aggregate functions is necessary based on the example provided.
If you're just trying to concatenate the values in an array then the following works:
val df = Seq(List(null,"abc", null),
List(null, null, null),
List(null, "def", "ghi", "kjl"),
List("mno", null, "pqr")).toDF("list")
df.withColumn("concat", concat_ws(",",$"list")).show(false)
+---------------------+-----------+
|list |concat |
+---------------------+-----------+
|[null, abc, null] |abc |
|[null, null, null] | |
|[null, def, ghi, kjl]|def,ghi,kjl|
|[mno, null, pqr] |mno,pqr |
+---------------------+-----------+
If there is a need to group first:
val df2 = Seq((123,List(null,"abc", null)),
(123,List(null,"def", "hij"))).toDF("key","list")
df2.show(false)
+---+-----------------+
|key|list |
+---+-----------------+
|123|[null, abc, null]|
|123|[null, def, hij] |
+---+-----------------+
You might think you could do something like
val grouped = df2.groupBy($"key").agg(collect_list($"list").as("collected"))
And then apply some functions to the array of arrays to obtain your concatenated result. However, I have been unable to find a way to do this without resorting to UDFs.
In this case, exploding before the grouping does the trick:
val grouped = df2.withColumn("listItem", explode($"list"))
  .groupBy($"key").agg(collect_list($"listItem").as("collected"))
  .withColumn("concat", concat_ws(",", $"collected"))
grouped.show(false)
+---+---------------+-----------+
|key|collected |concat |
+---+---------------+-----------+
|123|[abc, def, hij]|abc,def,hij|
+---+---------------+-----------+
Note however that there is no guarantee of the order in which the lists will be collected.
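As an aside, on Spark 2.4 or later (newer than this answer assumes) the SQL higher-order functions filter and array_join can do the per-row cleanup and concatenation without a UDF and without exploding; a minimal sketch against the question's df:
import org.apache.spark.sql.functions.expr
// Sketch assuming Spark 2.4+: drop nulls and empty strings, then join the rest with ","
df.withColumn("RemovedNullSeq",
  expr("array_join(filter(value, x -> x is not null and x != ''), ',')"))
  .show(false)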
Hope this helps

import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(("foo", List(null,"bar",null)),
("bar", List("one","two",null)),
("rio", List("Ria","","Kevin")))).toDF("key", "value")
val filtd = df.select($"key" as "key", explode($"value") as "val").where (length($"val") > 0)
val rsult = filtd.select($"*").groupBy($"key").agg(collect_list("val"))
rsult.show(5)
You can add multiple conditions like this:
val filtd = df.select($"key" as "key", explode($"value") as "val").where (length($"val") > 0 && $"val".isNotNull)
Output
+---+-----------------+
|key|collect_list(val)|
+---+-----------------+
|bar| [one, two]|
|rio| [Ria, Kevin]|
|foo| [bar]|
+---+-----------------+
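If you want the single comma-separated string the question asked for rather than an array, you can wrap the collected list in concat_ws; a small sketch on top of filtd (the rsult2 name is just illustrative):
// Sketch: aggregate straight to the concatenated string instead of an array
val rsult2 = filtd.groupBy($"key")
  .agg(concat_ws(",", collect_list($"val")).as("RemovedNullSeq"))
rsult2.show(false)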

Related

Check repeated values in a dataframe and implement an ignoreNulls parameter

I created a function to check whether there are repeated values in a dataframe, based on a Seq of columns.
I want to implement an "ignoreNulls" option, passed as a Boolean parameter to the function:
If true, it will ignore null values, not grouping and counting them; so for null values the "newColName" column will return false.
If false (default), it will treat null values as a group and return true if there are multiple rows with null for the key I'm checking.
I don't know how I could do this.
Should I use an if or a case?
Is there some expression to ignore nulls in the partitionBy statement?
Could anyone help me?
Here's the current function
def checkRepeatedKey(newColName: String, keys: Seq[String])(dataframe: DataFrame): DataFrame = {
  val repeatedCondition = $"sum" > 1
  val windowCondition = Window.partitionBy(keys.head, keys.tail: _*)
  dataframe
    .withColumn("count", lit(1))
    .withColumn("sum", sum("count").over(windowCondition))
    .withColumn(newColName, repeatedCondition)
    .drop("count", "sum")
}
Some test data
val testDF = Seq(
("1", Some("name-1")),
("2", Some("repeated-name")),
("3", Some("repeated-name")),
("4", Some("name-4")),
("5", None),
("6", None)
).toDF("name_key", "name")
Testing the function
val results = testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name")))
Output (without the ignoreNulls implementation)
+--------+---------------+--------------------+
|name_key| name | has_repeated_name |
+--------+---------------+--------------------+
| 1 | name-1 | false |
+--------+---------------+--------------------+
| 2 | repeated-name | true |
+--------+---------------+--------------------+
| 3 | repeated-name | true |
+--------+---------------+--------------------+
| 4 | name-4 | false |
+--------+---------------+--------------------+
| 5 | null | true |
+--------+---------------+--------------------+
| 6 | null | true |
+--------+---------------+--------------------+
And with the ignoreNulls=true implementation it should be like this:
-- function header with ignoreNulls parameter
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame =
-- using the function, passing true for ignoreNulls
testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"), true))
-- expected output for nulls
+--------+---------------+--------------------+
| 5 | null | false |
+--------+---------------+--------------------+
| 6 | null | false |
+--------+---------------+--------------------+
Firstly, you should properly define the logic for the case where only some of the columns in keys are null: should that count as a null value, or is a row considered null only if all the columns in keys are null?
For the sake of simplicity, let's assume that there is only one column in keys (you can easily extend the logic to multiple columns). You can just add a simple if to your checkRepeatedKey function:
def checkIfNullValue(keys: Seq[String]): Column = {
  // for the sake of simplicity, checking only the first key
  col(keys.head).isNull
}
def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean)(dataframe: DataFrame): DataFrame = {
  ...
  ...
  val df = dataframe
    .withColumn("count", lit(1))
    .withColumn("sum", sum("count").over(windowCondition))
    .withColumn(newColName, repeatedCondition)
    .drop("count", "sum")
  if (ignoreNulls)
    df.withColumn(newColName, when(checkIfNullValue(keys), lit(false)).otherwise(df(newColName)))
  else df
}
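For completeness, here is a minimal runnable sketch that stitches the question's window logic together with the if branch above; the single-key simplification and the default value for ignoreNulls are assumptions on my part, not the asker's exact code:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

// Sketch: single-key null check, as in the simplification above
def checkIfNullValue(keys: Seq[String]): Column =
  col(keys.head).isNull

def checkRepeatedKey(newColName: String, keys: Seq[String], ignoreNulls: Boolean = false)
                    (dataframe: DataFrame): DataFrame = {
  val repeatedCondition = col("sum") > 1
  val windowCondition = Window.partitionBy(keys.head, keys.tail: _*)
  val df = dataframe
    .withColumn("count", lit(1))
    .withColumn("sum", sum("count").over(windowCondition))
    .withColumn(newColName, repeatedCondition)
    .drop("count", "sum")
  // When ignoring nulls, force the flag to false on rows whose key is null
  if (ignoreNulls)
    df.withColumn(newColName, when(checkIfNullValue(keys), lit(false)).otherwise(col(newColName)))
  else df
}

// Usage, matching the expected output in the question
testDF.transform(checkRepeatedKey("has_repeated_name", Seq("name"), ignoreNulls = true)).show()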

size function applied to empty array column in dataframe returns 1 after split

I noticed the following behavior of the size function on an array column in a dataframe, using the code below, which includes a split:
import org.apache.spark.sql.functions.{trim, explode, split, size}
val df1 = Seq(
(1, "[{a},{b},{c}]"),
(2, "[]"),
(3, "[{d},{e},{f}]")
).toDF("col1", "col2")
df1.show(false)
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", size($"cola"))
df2.show(false)
we get:
+----+-------------+---------------+---+
|col1|col2 |cola |s |
+----+-------------+---------------+---+
|1 |[{a},{b},{c}]|[{a}, {b}, {c}]|3 |
|2 |[] |[] |1 |
|3 |[{d},{e},{f}]|[{d}, {e}, {f}]|3 |
+----+-------------+---------------+---+
I was hoping for a zero, so as to be able to distinguish between 0 and 1 entries.
There are a few hints here and there on SO, but none that helped.
If I have the following entry: (2, null), then I get size -1, which is more helpful, I guess.
On the other hand, this borrowed sample from the internet:
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
val df3 = df2.withColumn("s", size($"numbers"))
df3.show()
does return 0 - as expected.
I'm looking for the correct approach here to get size = 0.
This behavior is inherited from the Java function split which is used in the same way in Scala and Spark. The empty input is a special case, and this is well discussed in this SO post.
Spark sets the default value for the second parameter (limit) of the split function to -1. As of Spark 3, we can now pass a limit parameter to the split function ourselves.
You can see this in Scala split function vs Spark SQL split function:
"".split(",").length
//res31: Int = 1
spark.sql("""select size(split("", '[,]'))""").show
//+----------------------+
//|size(split(, [,], -1))|
//+----------------------+
//| 1|
//+----------------------+
And
",".split(",").length // without setting limit=-1 this gives empty array
//res33: Int = 0
",".split(",", -1).length
//res34: Int = 2
spark.sql("""select size(split(",", '[,]'))""").show
//+-----------------------+
//|size(split(,, [,], -1))|
//+-----------------------+
//| 2|
//+-----------------------+
I suppose the root cause is that split returns an array containing an empty string, instead of a null.
scala> df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", $"cola"(0)).select("s").collect()(1)(0)
res53: Any = ""
And the size of an array containing an empty string is, of course, 1.
To get around this, perhaps you could do
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
.withColumn("s", when(length($"cola"(0)) =!= 0, size($"cola"))
.otherwise(lit(0)))
df2.show(false)
+----+-------------+---------------+---+
|col1|col2 |cola |s |
+----+-------------+---------------+---+
|1 |[{a},{b},{c}]|[{a}, {b}, {c}]|3 |
|2 |[] |[] |0 |
|3 |[{d},{e},{f}]|[{d}, {e}, {f}]|3 |
+----+-------------+---------------+---+
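Another option, assuming Spark 2.4+, is to strip the empty strings out of the split result with array_remove before taking the size; a quick sketch (dfFixed is just an illustrative name):
import org.apache.spark.sql.functions.array_remove
// Sketch: drop the empty-string element produced by splitting "[]", so size() returns 0
val dfFixed = df1.withColumn("cola", array_remove(split(trim($"col2", "[]"), ","), ""))
  .withColumn("s", size($"cola"))
dfFixed.show(false)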

How to convert dataframe value into Map[String,List[String]]?

I want to convert the dataframe below into a Map[String,List[String]]. I have already changed the initial dataframe to get the Name column in List format (using collect_list), but I am not able to convert it into a Map[String,List[String]].
DataFrame
+---------+-------+
|City | Name |
+---------+-------+
|Mumbai |[A,B] |
|Pune |[C,D] |
|Delhi |[A,D] |
+---------+-------+
Expected Output:
Map(Mumbai -> List(A,B), Pune -> List(C,D), Delhi-> List(A,D))
You can convert to an RDD and collect as a Map as below:
val df = Seq(
("Mumbai", List("A", "B")),
("Pune", List("C", "D")),
("Delhi", List("A", "D"))
).toDF("city", "name")
val map: collection.Map[String, List[String]] = df.rdd
  .map(row => (row.getAs[String]("city"), row.getAs[Seq[String]]("name").toList))
  .collectAsMap()
Hope this helps!
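If you'd rather avoid the RDD conversion, collecting the rows and building the map directly also works; a small sketch (the positional getters assume the city/name column order above, and cityMap is just an illustrative name):
// Sketch: collect to the driver and build the Map without going through an RDD
val cityMap: Map[String, List[String]] = df.collect()
  .map(row => row.getString(0) -> row.getSeq[String](1).toList)
  .toMap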

to_json with static name value spark

I have a dataframe with two array columns:
+---------+-----------------------+
|itemval |fruit |
+---------+-----------------------+
|[1, 2, 3]|[apple, banana, orange]|
+---------+-----------------------+
I am trying to zip them and create name-value pairs:
+---------+-----------------------+--------------------------------------+
|itemval |fruit |ziped |
+---------+-----------------------+--------------------------------------+
|[1, 2, 3]|[apple, banana, orange]|[[1, apple], [2, banana], [3, orange]]|
+---------+-----------------------+--------------------------------------+
and then convert it to JSON. The to_json output is formatted like this:
+---------------------------------------------------------------------------+
|ziped |
+---------------------------------------------------------------------------+
|[{"_1":"1","_2":"apple"},{"_1":"2","_2":"banana"},{"_1":"3","_2":"orange"}]|
+---------------------------------------------------------------------------+
The format I am expecting is like this:
+---------------------------------------------------------------------------+
|ziped |
+---------------------------------------------------------------------------+
|[{"itemval":"1","name":"apple"},{"itemval":"2","name":"banana"},{"itemval":"3","name":"orange"}]|
+---------------------------------------------------------------------------+
Here is my implementation:
val df1 = Seq((Array(1,2,3),Array("apple","banana","orange"))).toDF("itemval","fruit")
df1.show(false)
def zipper = udf((list1: Seq[String], list2: Seq[String]) => {
  val zipList = list2 zip list1
  zipList
})
df1.withColumn("ziped",to_json(zipper($"fruit",$"itemval"))).drop("itemval","fruit").show(false)
This is the solution which worked for me: create a schema with the new field names and cast the zipped column to it.
val schema = ArrayType(
StructType(
Array(
StructField("itemval",StringType),
StructField("name",StringType)
)
)
)
val zival = df1.withColumn("ziped", zipper($"fruit", $"itemval"))
val casted = zival.withColumn("result", $"ziped".cast(schema))
casted.show(false)
casted.select(to_json($"result")).show(false)
and the output will be:
casted: org.apache.spark.sql.DataFrame
  ziped: array
    element: struct
      _1: string
      _2: string
  result: array
    element: struct
      itemval: string
      name: string
+-----------------------------------------------------------------+
|structstojson(result) |
+-----------------------------------------------------------------+
|[{"itemval":"3","name":"orange"},{"itemval":"2","name":"banana"}]|
+-----------------------------------------------------------------+
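An alternative worth noting, assuming Spark 2.4+ where arrays_zip is available and names its struct fields after the input columns: rename fruit to name up front and zip with arrays_zip, so no UDF or schema cast is needed (note that itemval stays numeric here unless you cast it to string first):
import org.apache.spark.sql.functions.{arrays_zip, to_json}
// Sketch assuming Spark 2.4+: arrays_zip picks up the column names as struct field names
val zipped = df1.withColumnRenamed("fruit", "name")
  .withColumn("ziped", arrays_zip($"itemval", $"name"))
zipped.select(to_json($"ziped").as("ziped")).show(false)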

Including null values in an Apache Spark Join

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null join keys by default.
Here is the default Spark behavior.
val numbersDf = Seq(
("123"),
("456"),
(null),
("")
).toDF("numbers")
val lettersDf = Seq(
("123", "abc"),
("456", "def"),
(null, "zzz"),
("", "hhh")
).toDF("numbers", "letters")
val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
Here is the output of joinedDf.show():
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
+-------+-------+
This is the output I would like:
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
| null| zzz|
+-------+-------+
Spark provides a special NULL safe equality operator:
numbersDf
.join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
.drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
numbers_df = sc.parallelize([
("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])
letters_df = sc.parallelize([
("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])
numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
and %<=>% in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
numbers = c("123", "456", NA, ""),
letters = c("abc", "def", "zzz", "hhh")
))
head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
  numbers numbers letters
1     456     456     def
2    <NA>    <NA>     zzz
3                     hhh
4     123     123     abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
This can be used with the DataFrame API as well:
numbersDf.alias("numbers")
.join(lettersDf.alias("letters"))
.where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
val numbers2 = numbersDf.withColumnRenamed("numbers","num1") //rename columns so that we can disambiguate them in the join
val letters2 = lettersDf.withColumnRenamed("numbers","num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull) ,"outer")
joinedDf.select("num1","letters").withColumnRenamed("num1","numbers").show //rename the columns back to the original names
Based on K L's idea, you could use foldLeft to generate the join column expression:
def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame =
{
val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
val fullExpr = columns.tail.foldLeft(colExpr) {
(colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
}
leftDF.join(rightDF, fullExpr, joinType)
}
then, you could call this function just like:
aDF.transform(nullSafeJoin(bDF, columns, joinType))
Complementing the other answers: for PySpark < 2.3.0 you would have neither Column.eqNullSafe nor IS NOT DISTINCT FROM.
You can still build the <=> operator with a SQL expression and include it in the join, as long as you define aliases for the joined DataFrames:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
numbers_df = spark.createDataFrame (["123","456",None,""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame ([("123", "abc"),("456", "def"),(None, "zzz"),("", "hhh") ]).\
toDF("numbers", "letters")
joined_df = numbers_df.alias("numbers").join(letters_df.alias("letters"),
F.expr('numbers.numbers <=> letters.numbers')).\
select('letters.*')
joined_df.show()
+-------+-------+
|numbers|letters|
+-------+-------+
| 456| def|
| null| zzz|
| | hhh|
| 123| abc|
+-------+-------+
Based on timothyzhang's idea one can further improve it by removing duplicate columns:
def dropDuplicateColumns(df: DataFrame, rightDf: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df)((df, c) => df.drop(rightDf(c)))
def joinTablesWithSafeNulls(rightDF: DataFrame, leftDF: DataFrame, columns: Seq[String], joinType: String): DataFrame =
{
val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
val fullExpr = columns.tail.foldLeft(colExpr) {
(colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
}
val finalDF = leftDF.join(rightDF, fullExpr, joinType)
val filteredDF = dropDuplicateColumns(finalDF, rightDF, columns)
filteredDF
}
Try the following method to include the null rows in the result of the JOIN operator:
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
var columnsExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
columns.drop(1).foreach(column => {
columnsExpr = columnsExpr && (leftDF(column) <=> rightDF(column))
})
var joinedDF: DataFrame = leftDF.join(rightDF, columnsExpr, joinType)
columns.foreach(column => {
joinedDF = joinedDF.drop(leftDF(column))
})
joinedDF
}
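A quick usage sketch with the numbersDf and lettersDf from the question (the outer join type is just an example choice):
// Sketch: null-safe join on the single "numbers" column, keeping rows from both sides
val joined = nullSafeJoin(numbersDf, lettersDf, Seq("numbers"), "outer")
joined.show()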