In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

If I have a DataFrame called df that looks like:
+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| N/A| baz|
|null| etc|
+----+----+
I can selectively replace values like so:
val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2"))
so that df2 looks like:
+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
|null| etc|
+----+----+
but why can't I check if it's null, like:
val df3 = df2.withColumn("a1", when($"a1" === null, $"a2"))
so that I get:
+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
| etc| etc|
+----+----+
Edit: $"a1".isNull doesn't seem to work. Could it be because of how I'm constructing the dataframe I'm using to test, which is as follows?
val schema = StructType(
StructField("a1", StringType, false) ::
StructField("a2", StringType, false) :: Nil
)
val data = sc.parallelize(Array(
Row("foo","bar"),
Row("N/A","baz"),
Row(null,"etc"))
)
val df = sqlContext.createDataFrame(data, schema)
I also can't use coalesce, as far as I know, because sometimes I need to use a static value instead of another column's value.
Edit again: Setting my test columns to nullable = false doesn't help.

Because null means no value, a comparison such as $"a1" === null evaluates to null rather than true, so the when condition never matches.
Use the isNull function instead:
val df3 = df2.withColumn("a1", when($"a1".isNull, $"a2").otherwise($"a1")) // otherwise keeps the existing non-null values
or coalesce, which returns the first non-null value:
val df3 = df2.withColumn("a1", coalesce($"a1", $"a2"))
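To see why a comparison against null never fires while an explicit null test does, here is a minimal plain-Python sketch of SQL's three-valued comparison. It is only a model and not Spark code: None stands in for SQL null, and the helper names sql_eq, is_null and when_otherwise are hypothetical.

```python
# Plain-Python model of SQL null semantics (not the Spark API).
# None stands in for SQL NULL; all helper names are hypothetical.

def sql_eq(left, right):
    """Model of SQL '=': the result is unknown (None) if either side is NULL."""
    if left is None or right is None:
        return None
    return left == right

def is_null(value):
    """Model of IS NULL: always yields a real boolean."""
    return value is None

def when_otherwise(condition, then_value, else_value):
    """Model of when(...).otherwise(...): only a condition that is exactly True matches."""
    return then_value if condition is True else else_value

rows = [("foo", "bar"), ("baz", "baz"), (None, "etc")]

# Comparing with NULL is never True, so no row is replaced.
with_eq = [when_otherwise(sql_eq(a1, None), a2, a1) for a1, a2 in rows]

# An explicit null test matches the third row.
with_is_null = [when_otherwise(is_null(a1), a2, a1) for a1, a2 in rows]

print(with_eq)       # ['foo', 'baz', None]
print(with_is_null)  # ['foo', 'baz', 'etc']
```

The static-value case from the question fits the same shape: the second argument of when can be a literal instead of another column.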

Related

size function applied to empty array column in dataframe returns 1 after split

I noticed the following when applying the size function to an array column in a dataframe, where the array is produced by a split:
import org.apache.spark.sql.functions.{trim, explode, split, size}
val df1 = Seq(
(1, "[{a},{b},{c}]"),
(2, "[]"),
(3, "[{d},{e},{f}]")
).toDF("col1", "col2")
df1.show(false)
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", size($"cola"))
df2.show(false)
we get:
+----+-------------+---------------+---+
|col1|col2 |cola |s |
+----+-------------+---------------+---+
|1 |[{a},{b},{c}]|[{a}, {b}, {c}]|3 |
|2 |[] |[] |1 |
|3 |[{d},{e},{f}]|[{d}, {e}, {f}]|3 |
+----+-------------+---------------+---+
I was hoping for a zero, so as to be able to distinguish between 0 and 1 entries.
A few hints here and there on SO, but none that helped.
If I have the following entry: (2, null), then I get size -1, which is more helpful I guess.
On the other hand, this borrowed sample from the internet:
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
val df3 = df2.withColumn("s", size($"numbers"))
df3.show()
does return 0 - as expected.
Looking for the correct approach here so as to get size = 0.
This behavior is inherited from the Java split function, which Scala and Spark use in the same way. The empty input is a special case, and this is well discussed in this SO post.
Spark sets the default value of the second parameter (limit) of the split function to -1. As of Spark 3, a limit argument can also be passed to split explicitly.
You can see this in Scala split function vs Spark SQL split function:
"".split(",").length
//res31: Int = 1
spark.sql("""select size(split("", '[,]'))""").show
//+----------------------+
//|size(split(, [,], -1))|
//+----------------------+
//| 1|
//+----------------------+
And
",".split(",").length // without setting limit=-1 this gives empty array
//res33: Int = 0
",".split(",", -1).length
//res34: Int = 2
spark.sql("""select size(split(",", '[,]'))""").show
//+-----------------------+
//|size(split(,, [,], -1))|
//+-----------------------+
//| 2|
//+-----------------------+
I suppose the root cause is that split returns an empty string, instead of a null.
scala> df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", $"cola"(0)).select("s").collect()(1)(0)
res53: Any = ""
And the size of an array containing an empty string is, of course, 1.
To get around this, perhaps you could do
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
.withColumn("s", when(length($"cola"(0)) =!= 0, size($"cola"))
.otherwise(lit(0)))
df2.show(false)
+----+-------------+---------------+---+
|col1|col2 |cola |s |
+----+-------------+---------------+---+
|1 |[{a},{b},{c}]|[{a}, {b}, {c}]|3 |
|2 |[] |[] |0 |
|3 |[{d},{e},{f}]|[{d}, {e}, {f}]|3 |
+----+-------------+---------------+---+
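The empty-string effect described above is not specific to Spark. As a rough illustration in plain Python (whose str.split behaves the same way for this particular input), splitting an empty string yields a list with one empty element. The helper count_items is a hypothetical name that mirrors the when/otherwise workaround; it is a sketch, not Spark code.

```python
# Splitting an empty string produces a list holding one empty string,
# which is why a size of 1 is reported for the "[]" row.
inner = "[]".strip("[]")   # the brackets are trimmed away, leaving ""
parts = inner.split(",")   # a one-element list containing ""

def count_items(text):
    """Hypothetical helper: number of comma-separated items, 0 for empty input."""
    stripped = text.strip("[]")
    return len(stripped.split(",")) if stripped else 0

sizes = [count_items(s) for s in ["[{a},{b},{c}]", "[]", "[{d},{e},{f}]"]]
print(len(parts), sizes)
```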

Split a dataframe string column by two different delimiters

The following is my dataset:
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
The following is the code I have been trying to use
df.withColumn("item1", split(col("Itemcode"), "/").getItem(0)).withColumn("item2", split(col("Itemcode"), "/").getItem(1)).withColumn("item3", split(col("Itemcode"), "//").getItem(0))
But it fails when there is a double slash in between first and second item and also fails when there is a double slash between 2nd and 3rd item
Desired output is:
item1 item2 item3
DB9450 DB9450 AD9066
DA0002 DE2396 DF2345
HWC72
GG7183 EB6693
TA444 B9X8X4
You can first replace the // with / and then split. Please try the code below and let us know whether it works.
Input
df_b = spark.createDataFrame([('DB9450//DB9450/AD9066',"a"),('DA0002/DE2396//DF2345',"a"),('HWC72',"a"),('GG7183/EB6693',"a"),('TA444/B9X8X4:7-2-',"a")],[ "reg","postime"])
+--------------------+-------+
| reg|postime|
+--------------------+-------+
|DB9450//DB9450/AD...| a|
|DA0002/DE2396//DF...| a|
| HWC72| a|
| GG7183/EB6693| a|
| TA444/B9X8X4:7-2-| a|
+--------------------+-------+
Logic
from pyspark.sql import functions as F  # F is used by the calls below
df_b = df_b.withColumn('split_col', F.regexp_replace(F.col('reg'), "//", "/"))
df_b = df_b.withColumn('split_col', F.split(df_b['split_col'], '/'))
df_b = df_b.withColumn('col1' , F.col('split_col').getItem(0))
df_b = df_b.withColumn('col2' , F.col('split_col').getItem(1))
df_b = df_b.withColumn('col2', F.regexp_replace(F.col('col2'), ":7-2-", ""))
df_b = df_b.withColumn('col3' , F.col('split_col').getItem(2))
Output
+--------------------+-------+--------------------+------+------+------+
| reg|postime| split_col| col1| col2| col3|
+--------------------+-------+--------------------+------+------+------+
|DB9450//DB9450/AD...| a|[DB9450, DB9450, ...|DB9450|DB9450|AD9066|
|DA0002/DE2396//DF...| a|[DA0002, DE2396, ...|DA0002|DE2396|DF2345|
| HWC72| a| [HWC72]| HWC72| null| null|
| GG7183/EB6693| a| [GG7183, EB6693]|GG7183|EB6693| null|
| TA444/B9X8X4:7-2-| a|[TA444, B9X8X4:7-2-]| TA444|B9X8X4| null|
+--------------------+-------+--------------------+------+------+------+
Processing the text as csv works well for this.
First, let's read in the text, replacing double slashes along the way
Edit: Also removing everything after a colon
val items = """
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
""".replaceAll("//", "/").split(":")(0)
Get the max number of items in a row to create an appropriate header
val numItems = items.split("\n").map(_.split("/").size).reduce(_ max _)
val header = (1 to numItems).map("Itemcode" + _).mkString("/")
Then we're ready to create a Data Frame
val df = spark.read
.option("ignoreTrailingWhiteSpace", "true")
.option("delimiter", "/")
.option("header", "true")
.csv(spark.sparkContext.parallelize((header + items).split("\n")).toDS)
.filter("Itemcode1 <> 'Itemcode'")
df.show(false)
+---------+-----------+---------+
|Itemcode1|Itemcode2 |Itemcode3|
+---------+-----------+---------+
|DB9450 |DB9450 |AD9066 |
|DA0002 |DE2396 |DF2345 |
|HWC72 |null |null |
|GG7183 |EB6693 |null |
|TA444 |B9X8X4 |null |
+---------+-----------+---------+
Perhaps this is useful (Spark >= 2.4).
The split and TRANSFORM Spark SQL functions do the work, as shown below.
Load the provided test data
val data =
"""
|Itemcode
|
|DB9450//DB9450/AD9066
|
|DA0002/DE2396//DF2345
|
|HWC72
|
|GG7183/EB6693
|
|TA444/B9X8X4:7-2-
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---------------------+
* |Itemcode |
* +---------------------+
* |DB9450//DB9450/AD9066|
* |DA0002/DE2396//DF2345|
* |HWC72 |
* |GG7183/EB6693 |
* |TA444/B9X8X4:7-2- |
* +---------------------+
*
* root
* |-- Itemcode: string (nullable = true)
*/
Use split and TRANSFORM (you can run this query directly in pyspark)
df.withColumn("item_code", expr("TRANSFORM(split(Itemcode, '/+'), x -> split(x, ':')[0])"))
.selectExpr("item_code[0] item1", "item_code[1] item2", "item_code[2] item3")
.show(false)
/**
* +------+------+------+
* |item1 |item2 |item3 |
* +------+------+------+
* |DB9450|DB9450|AD9066|
* |DA0002|DE2396|DF2345|
* |HWC72 |null |null |
* |GG7183|EB6693|null |
* |TA444 |B9X8X4|null |
* +------+------+------+
*/
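The idea shared by these answers (treat any run of slashes as one delimiter, then drop whatever follows a colon) can be checked outside Spark. Below is a minimal plain-Python sketch using the standard re module; the function name split_item_codes and the fixed width of three items are assumptions made for illustration.

```python
import re

def split_item_codes(item_code, width=3):
    """Split on runs of '/', keep the text before any ':', pad to `width` items."""
    parts = [part.split(":")[0] for part in re.split(r"/+", item_code)]
    return (parts + [None] * width)[:width]

codes = [
    "DB9450//DB9450/AD9066",
    "DA0002/DE2396//DF2345",
    "HWC72",
    "GG7183/EB6693",
    "TA444/B9X8X4:7-2-",
]
rows = [split_item_codes(code) for code in codes]
for row in rows:
    print(row)
```

The same pattern '/+' is what the split call in the Spark SQL expression above relies on.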

Create empty sparse vectors in PySpark

I have a dataframe DF1 that looks like this:
+-------+------+
|user_id|meta |
+-------+------+
| 1| null|
| 11| null|
| 15| null|
+-------+------+
Schema:
root
|-- user_id: string (nullable = true)
|-- meta: string (nullable = true)
and I have another dataframe DF2 that looks like this
+-------+------------------------------------+
|user_id| Vectorz |
+-------+------------------------------------+
| 10| (2,[1],[1.0])|
| 12| (2,[1],[1.0])|
| 13| (2,[0],[1.0])|
| 14| (2,[1],[1.0])|
---------------------------------------------
Schema is:
[user_id: string, Vectorz: vector]
I want to inject all the user_ids from DF1 into DF2, but create empty sparse vectors for them since their "meta" column is all NULLs.
So, I want DF2 to finally be:
+-------+------------------------------------+
|user_id| Vectorz |
+-------+------------------------------------+
| 1| (,[],[])|
| 10| (2,[1],[1.0])|
| 11| (,[],[])|
| 12| (2,[1],[1.0])|
| 13| (2,[0],[1.0])|
| 14| (2,[1],[1.0])|
| 15| (,[],[])|
---------------------------------------------
Can somebody please help?
I am new to PySpark. So, sorry if I don't sound informed enough.
You can create empty vectors for all the user_ids whose meta is null.
You still need to decide what to do when the meta column is not null.
Sample code
DF1
val spark = sqlContext.sparkSession
val implicits = sqlContext.sparkSession.implicits
import implicits._
val df1 = sqlContext.range(1,4)
.withColumnRenamed("id", "user_id")
.withColumn("meta", lit(null).cast(DataTypes.StringType))
df1.show(false)
df1.printSchema()
+-------+----+
|user_id|meta|
+-------+----+
|1 |null|
|2 |null|
|3 |null|
+-------+----+
root
|-- user_id: long (nullable = false)
|-- meta: string (nullable = true)
DF2
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
val staticVector = udf(() => Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))), SQLDataTypes.VectorType)
val df2 = sqlContext.range(5,8)
.withColumnRenamed("id", "user_id")
.withColumn("Vectorz", staticVector())
df2.show(false)
df2.printSchema()
+-------+-------------------+
|user_id|Vectorz |
+-------+-------------------+
|5 |(5,[1,3],[1.0,7.0])|
|6 |(5,[1,3],[1.0,7.0])|
|7 |(5,[1,3],[1.0,7.0])|
+-------+-------------------+
root
|-- user_id: long (nullable = false)
|-- Vectorz: vector (nullable = true)
Processed DF
val emptyVector = udf(() => Vectors.sparse(0, Array.empty[Int], Array.empty[Double]), SQLDataTypes.VectorType)
val processedDF =
// meta column shouldn't have any value
// for the safer side adding filter as meta is null
// need to decide what if meta is not null
// I'm assigning empty vector to that also
df1.where(col("meta").isNull)
.withColumn("Vectorz", when(col("meta").isNull, emptyVector()).otherwise(emptyVector()))
.drop("meta")
.unionByName(df2)
processedDF.show(false)
processedDF.printSchema()
+-------+-------------------+
|user_id|Vectorz |
+-------+-------------------+
|1 |(0,[],[]) |
|2 |(0,[],[]) |
|3 |(0,[],[]) |
|5 |(5,[1,3],[1.0,7.0])|
|6 |(5,[1,3],[1.0,7.0])|
|7 |(5,[1,3],[1.0,7.0])|
+-------+-------------------+
root
|-- user_id: long (nullable = false)
|-- Vectorz: vector (nullable = true)

Using Spark SQL and a Spark DataFrame, how can we find the column name based on the minimum value in a row?

I have a dataframe df with 4 columns:
+-------+-------+-------+-------+
| dist1 | dist2 | dist3 | dist4 |
+-------+-------+-------+-------+
| 42 | 53 | 24 | 17 |
+-------+-------+-------+-------+
The output I want is
dist4
It seems easy, but I did not find a proper solution using a dataframe or a Spark SQL query.
You may use the least function as follows:
select least(dist1,dist2,dist3,dist4) as min_dist
from yourTable;
For the opposite case, greatest may be used.
EDIT:
To find the column name, the following may be used to turn the values into rows
select inline(array(struct(42, 'dist1'), struct(53, 'dist2'),
struct(24, 'dist3'), struct(17, 'dist4') ))
42 dist1
53 dist2
24 dist3
17 dist4
and then the min function may be applied to get dist4
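The pairing trick above (put each value next to its column name, then take the minimum) can be sketched in plain Python. This only models the logic and is not Spark code; the dictionary row and the variable names are assumptions for illustration.

```python
# One row of the example table, keyed by column name.
row = {"dist1": 42, "dist2": 53, "dist3": 24, "dist4": 17}

# Pair every value with its column name, value first, so that
# tuple comparison orders the pairs by value.
pairs = [(value, name) for name, value in row.items()]

min_value, min_column = min(pairs)
print(min_column, min_value)
```

If two columns tie on the value, this sketch picks the name that sorts first, which may or may not be the desired tie-break.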
Try this,
df.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
val temp_df = df.columns.foldLeft(df) { (acc: DataFrame, colName: String) => acc.withColumn(colName, concat(col(colName), lit(","+colName)))}
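// compare the numeric prefix (assumes integer values); a plain ar.min would compare the strings lexicographically
val minval = udf((ar: Seq[String]) => ar.minBy(_.split(",")(0).toInt).split(",")(1))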
val result = temp_df.withColumn("least", split(concat_ws(":",temp_df.columns.map(col(_)):_*),":")).withColumn("least_col", minval(col("least")))
result.show
+---+---+---+---+--------------------+---------+
| A| B| C| D| least|least_col|
+---+---+---+---+--------------------+---------+
|1,A|2,B|3,C|4,D|[1,A, 2,B, 3,C, 4,D]| A|
|5,A|4,B|3,C|1,D|[5,A, 4,B, 3,C, 1,D]| D|
+---+---+---+---+--------------------+---------+
RDD way and without udf()s.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.withColumn("arr", array(df.columns.map(col(_)):_*))
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 3 more fields]
scala> val rowarr = df.columns
rowarr: Array[String] = Array(A, B, C, D)
scala> val rdd1 = df2.rdd.map( x=> {val p = x.getAs[WrappedArray[Int]]("arr").toArray; val q=rowarr(p.indexWhere(_==p.min));Row.merge(x,Row(q)) })
rdd1: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[83] at map at <console>:47
scala> spark.createDataFrame(rdd1,df2.schema.add(StructField("mincol",StringType))).show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
you can do something like,
import org.apache.spark.sql.functions._
val cols = df.columns
val u1 = udf((s: Seq[Int]) => cols(s.zipWithIndex.min._2))
df.withColumn("res", u1(array("*")))
You could access the row's schema, retrieve the list of field names from it, look up each value by name, and work out the minimum that way.
See: https://spark.apache.org/docs/2.3.2/api/scala/index.html#org.apache.spark.sql.Row
It would look roughly like this
import spark.implicits._ // provides the Encoder[String] needed by map
dataframe.map(
row => {
// extract the field names from the row's schema
val fieldNames: List[String] = row.schema.fieldNames.toList
// look up each value by name and keep the name holding the minimum (assumes Int columns)
fieldNames.minBy(name => row.getAs[Int](name))
}
)
This would yield a Dataset[String]

Including null values in an Apache Spark Join

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default.
Here is the default Spark behavior.
val numbersDf = Seq(
("123"),
("456"),
(null),
("")
).toDF("numbers")
val lettersDf = Seq(
("123", "abc"),
("456", "def"),
(null, "zzz"),
("", "hhh")
).toDF("numbers", "letters")
val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
Here is the output of joinedDf.show():
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
+-------+-------+
This is the output I would like:
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
| null| zzz|
+-------+-------+
Spark provides a special NULL safe equality operator:
numbersDf
.join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
.drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
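As a rough model of what null-safe equality changes in a join, here is a plain-Python sketch. It is not Spark code: None stands in for null, the helper null_safe_eq is a hypothetical name, and the nested loop is only a stand-in for the join.

```python
def null_safe_eq(left, right):
    """Model of <=>: two NULLs compare equal; NULL against a value is False, never unknown."""
    if left is None or right is None:
        return left is None and right is None
    return left == right

numbers = ["123", "456", None, ""]
letters = [("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")]

# Naive nested-loop join that keeps pairs whose keys are null-safe equal.
joined = [
    (number, letter)
    for number in numbers
    for key, letter in letters
    if null_safe_eq(number, key)
]
for pair in joined:
    print(pair)
```

With an ordinary equality test the (None, "zzz") pair would be dropped, which matches the default join output shown in the question.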
In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
numbers_df = sc.parallelize([
("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])
letters_df = sc.parallelize([
("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])
numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
and %<=>% in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
numbers = c("123", "456", NA, ""),
letters = c("abc", "def", "zzz", "hhh")
))
head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
numbers numbers letters
1 456 456 def
2 <NA> <NA> zzz
3 hhh
4 123 123 abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
This can be used with the DataFrame API as well:
numbersDf.alias("numbers")
.join(lettersDf.alias("letters"))
.where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
val numbers2 = numbersDf.withColumnRenamed("numbers","num1") //rename columns so that we can disambiguate them in the join
val letters2 = lettersDf.withColumnRenamed("numbers","num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull) ,"outer")
joinedDf.select("num1","letters").withColumnRenamed("num1","numbers").show //rename the columns back to the original names
Based on K L's idea, you could use foldLeft to generate join column expression:
def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame =
{
val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
val fullExpr = columns.tail.foldLeft(colExpr) {
(colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
}
leftDF.join(rightDF, fullExpr, joinType)
}
then, you could call this function just like:
aDF.transform(nullSafeJoin(bDF, columns, joinType))
Complementing the other answers: in PySpark < 2.3.0 you have neither Column.eqNullSafe nor IS NOT DISTINCT FROM.
You can still use the <=> operator through a SQL expression in the join condition, as long as you define aliases for the joined dataframes:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
numbers_df = spark.createDataFrame (["123","456",None,""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame ([("123", "abc"),("456", "def"),(None, "zzz"),("", "hhh") ]).\
toDF("numbers", "letters")
joined_df = numbers_df.alias("numbers").join(letters_df.alias("letters"),
F.expr('numbers.numbers <=> letters.numbers')).\
select('letters.*')
joined_df.show()
+-------+-------+
|numbers|letters|
+-------+-------+
| 456| def|
| null| zzz|
| | hhh|
| 123| abc|
+-------+-------+
Based on timothyzhang's idea one can further improve it by removing duplicate columns:
def dropDuplicateColumns(df: DataFrame, rightDf: DataFrame, cols: Seq[String]): DataFrame
= cols.foldLeft(df)((df, c) => df.drop(rightDf(c)))
def joinTablesWithSafeNulls(rightDF: DataFrame, leftDF: DataFrame, columns: Seq[String], joinType: String): DataFrame =
{
val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
val fullExpr = columns.tail.foldLeft(colExpr) {
(colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
}
val finalDF = leftDF.join(rightDF, fullExpr, joinType)
val filteredDF = dropDuplicateColumns(finalDF, rightDF, columns)
filteredDF
}
Try the following method to include the null rows in the result of the JOIN operator:
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
var columnsExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
columns.drop(1).foreach(column => {
columnsExpr = columnsExpr && (leftDF(column) <=> rightDF(column))
})
var joinedDF: DataFrame = leftDF.join(rightDF, columnsExpr, joinType)
columns.foreach(column => {
joinedDF = joinedDF.drop(leftDF(column))
})
joinedDF
}