size function applied to empty array column in dataframe returns 1 after split

I noticed the following behavior of the size function on an array column in a dataframe, using the code below, which includes a split:
import org.apache.spark.sql.functions.{trim, explode, split, size}
val df1 = Seq(
(1, "[{a},{b},{c}]"),
(2, "[]"),
(3, "[{d},{e},{f}]")
).toDF("col1", "col2")
df1.show(false)
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", size($"cola"))
df2.show(false)
we get:
+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |1  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+
I was hoping for a zero so as to be able to distinguish between 0 and 1 entries.
A few hints here and there on SO, but none that helped.
If I have the following entry: (2, null), then I get size -1, which is more helpful I guess.
On the other hand, this borrowed sample from the internet:
import org.apache.spark.sql.functions.{array, coalesce, size}
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
val df3 = df2.withColumn("s", size($"numbers"))
df3.show()
does return 0 - as expected.
Looking for the correct approach here so as to get size = 0.

This behavior is inherited from Java's String.split, which both Scala and Spark use under the hood. An empty input is a special case, and this is well discussed in this SO post.
Spark sets the default value of split's second parameter (limit) to -1, and as of Spark 3 you can pass the limit parameter explicitly.
You can see this by comparing Scala's split with Spark SQL's split:
"".split(",").length
//res31: Int = 1
spark.sql("""select size(split("", '[,]'))""").show
//+----------------------+
//|size(split(, [,], -1))|
//+----------------------+
//|                     1|
//+----------------------+
And
",".split(",").length // without setting limit=-1 this gives empty array
//res33: Int = 0
",".split(",", -1).length
//res34: Int = 2
spark.sql("""select size(split(",", '[,]'))""").show
//+-----------------------+
//|size(split(,, [,], -1))|
//+-----------------------+
//|                      2|
//+-----------------------+
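For completeness, the three-argument form added in Spark 3 is also available from the DataFrame API. A minimal sketch (note that the limit only controls trailing empty strings; splitting an empty string still yields one empty element, so this alone does not give size = 0):
import org.apache.spark.sql.functions.{col, size, split}
import spark.implicits._
// Spark 3+: split takes an explicit limit. -1 keeps trailing empty strings
// (Spark's default), 0 drops them (Java's default).
val demo = Seq("", ",", "a,b").toDF("s")
demo.select(
  col("s"),
  size(split(col("s"), ",", -1)).as("size_limit_minus_1"),
  size(split(col("s"), ",", 0)).as("size_limit_0")
).show()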

I suppose the root cause is that split returns an array containing an empty string, rather than an empty array (or null).
scala> df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", $"cola"(0)).select("s").collect()(1)(0)
res53: Any = ""
And the size of an array containing an empty string is, of course, 1.
To get around this, perhaps you could do
import org.apache.spark.sql.functions.{length, lit, when}
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
  .withColumn("s", when(length($"cola"(0)) =!= 0, size($"cola"))
  .otherwise(lit(0)))
df2.show(false)
+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |0  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+
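If you are on Spark 2.4 or later, another option (a sketch along the same lines, not a definitive fix) is to drop the empty-string element from the array itself, so that size works directly and downstream code never sees a spurious empty entry:
import org.apache.spark.sql.functions.{array_remove, size, split, trim}
// array_remove (Spark 2.4+) strips the "" element that splitting an empty
// string produces, so the empty row ends up with a truly empty array and size 0.
val df3 = df1
  .withColumn("cola", array_remove(split(trim($"col2", "[]"), ","), ""))
  .withColumn("s", size($"cola"))
df3.show(false)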

Related

Transform list of map to dataframe

I have the following data:
d = Some(List(Map(id -> 1, n -> Hi), Map(id -> 2, n -> Hello)))
I would like to transform it into a dataframe like the following:
+---+-----+
| id|    n|
+---+-----+
|  1|   Hi|
|  2|Hello|
+---+-----+
I tried the following:
import spark.implicits._
val df = d
.map( m => (m.get("id"),m.get("n")))
.toDF("id", "n")
But I'm getting:
error: value get is not a member of Any
.map( m => (m.get("id"),m.get("n")))
Your top level here is an Option, and I think that's the reason why you can't handle it with a single map. I managed to do it with something like this:
import spark.implicits._
val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
val data = d.fold(List.empty[(Option[String], Option[String])])(_.map(m => (m.get("id"), m.get("n"))))
val df = data.toDF("id", "n")
df.show()
Output:
+---+-----+
| id|    n|
+---+-----+
|  1|   Hi|
|  2|Hello|
+---+-----+
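An arguably simpler variant of the same idea (a sketch under the same assumptions about d) is to unwrap the Option with getOrElse before mapping:
import spark.implicits._
val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
// getOrElse turns a None into an empty list, so the map always operates on a List of Maps
val df2 = d.getOrElse(List.empty)
  .map(m => (m.get("id"), m.get("n")))
  .toDF("id", "n")
df2.show()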

Split the column in pyspark

I am trying to split a column in pyspark on a bunch of delimiters: "_", "-", "|", "\", "/" etc. So if the data frame is as follows:
df = spark.createDataFrame([(1, "foo-23.aBD"), (2, "bar12/bg_$"),(3,"iO9_5Gh"),(4,"fg4555(dfd")],["id", "label"] )
I would like to have the values "foo", "bar12", "iO9" and "fg4555" in the column "label".
I can do this separately, for each delimiter:
from pyspark.sql import functions
split_col = functions.split(df['label'], '-')
df = df.withColumn('label', split_col.getItem(0))
split_col = functions.split(df['label'], '_')
df = df.withColumn('label', split_col.getItem(0))
split_col = functions.split(df['label'], '/')
df = df.withColumn('label', split_col.getItem(0))
But it doesn't look nice. Is there any possibility to do this in a more compact way? Thanks in advance.
You probably want to split on word boundaries (\b). Since _ counts as a word character, there is no word boundary before it, so you need to add _ to the pattern explicitly.
import pyspark.sql.functions as F
df2 = df.withColumn('label', F.split('label', '(\\b|_)')[0])
df2.show()
+---+------+
|id |label |
+---+------+
|1  |foo   |
|2  |bar12 |
|3  |iO9   |
|4  |fg4555|
+---+------+
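If you would rather not enumerate delimiters at all, another option is to extract the leading run of alphanumeric characters instead of splitting. A sketch in Scala (the same regexp_extract function exists in pyspark, so the call is analogous there):
import org.apache.spark.sql.functions.{col, regexp_extract}
// Keep only the leading run of letters/digits; everything from the first
// non-alphanumeric character (-, _, /, |, \, ., ( ...) onwards is dropped.
val df3 = df.withColumn("label", regexp_extract(col("label"), "^[A-Za-z0-9]+", 0))
df3.show()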

Split a dataframe string column by two different delimiters

The following is my dataset:
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
The following is the code I have been trying to use
df.withColumn("item1", split(col("Itemcode"), "/").getItem(0)).withColumn("item2", split(col("Itemcode"), "/").getItem(1)).withColumn("item3", split(col("Itemcode"), "//").getItem(0))
But it fails when there is a double slash between the first and second items, and also when the double slash is between the second and third items.
Desired output is:
item1 item2 item3
DB9450 DB9450 AD9066
DA0002 DE2396 DF2345
HWC72
GG7183 EB6693
TA444 B9X8X4
You can first replace the // with /, then split. Please try the below and let us know if it worked.
Input
df_b = spark.createDataFrame([('DB9450//DB9450/AD9066',"a"),('DA0002/DE2396//DF2345',"a"),('HWC72',"a"),('GG7183/EB6693',"a"),('TA444/B9X8X4:7-2-',"a")],[ "reg","postime"])
+--------------------+-------+
|                 reg|postime|
+--------------------+-------+
|DB9450//DB9450/AD...|      a|
|DA0002/DE2396//DF...|      a|
|               HWC72|      a|
|       GG7183/EB6693|      a|
|   TA444/B9X8X4:7-2-|      a|
+--------------------+-------+
Logic
import pyspark.sql.functions as F
df_b = df_b.withColumn('split_col', F.regexp_replace(F.col('reg'), "//", "/"))
df_b = df_b.withColumn('split_col', F.split(df_b['split_col'], '/'))
df_b = df_b.withColumn('col1' , F.col('split_col').getItem(0))
df_b = df_b.withColumn('col2' , F.col('split_col').getItem(1))
df_b = df_b.withColumn('col2', F.regexp_replace(F.col('col2'), ":7-2-", ""))
df_b = df_b.withColumn('col3' , F.col('split_col').getItem(2))
Output
+--------------------+-------+--------------------+------+------+------+
|                 reg|postime|           split_col|  col1|  col2|  col3|
+--------------------+-------+--------------------+------+------+------+
|DB9450//DB9450/AD...|      a|[DB9450, DB9450, ...|DB9450|DB9450|AD9066|
|DA0002/DE2396//DF...|      a|[DA0002, DE2396, ...|DA0002|DE2396|DF2345|
|               HWC72|      a|             [HWC72]| HWC72|  null|  null|
|       GG7183/EB6693|      a|    [GG7183, EB6693]|GG7183|EB6693|  null|
|   TA444/B9X8X4:7-2-|      a|[TA444, B9X8X4:7-2-]| TA444|B9X8X4|  null|
+--------------------+-------+--------------------+------+------+------+
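The same logic can also be collapsed into a single chained expression. A sketch in Scala, assuming the same input DataFrame (called df_b above); the pyspark version is analogous:
import org.apache.spark.sql.functions.{col, regexp_replace, split}
// Collapse // to /, strip any ":..." suffix, split once, then pick the elements.
val parts = split(regexp_replace(regexp_replace(col("reg"), "//", "/"), ":.*$", ""), "/")
val result = df_b.select(
  col("reg"),
  parts.getItem(0).as("col1"),
  parts.getItem(1).as("col2"),
  parts.getItem(2).as("col3")
)
result.show(false)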
Processing the text as csv works well for this.
First, let's read in the text, replacing double slashes along the way
Edit: Also removing everything after a colon
val items = """
Itemcode
DB9450//DB9450/AD9066
DA0002/DE2396//DF2345
HWC72
GG7183/EB6693
TA444/B9X8X4:7-2-
""".replaceAll("//", "/").split(":")(0)
Get the max number of items in a row to create an appropriate header
val numItems = items.split("\n").map(_.split("/").size).reduce(_ max _)
val header = (1 to numItems).map("Itemcode" + _).mkString("/")
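For reference, a quick check of what these evaluate to for the sample text above (assuming the definitions just shown):
println(numItems) // 3
println(header)   // Itemcode1/Itemcode2/Itemcode3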
Then we're ready to create a DataFrame
val df = spark.read
.option("ignoreTrailingWhiteSpace", "true")
.option("delimiter", "/")
.option("header", "true")
.csv(spark.sparkContext.parallelize((header + items).split("\n")).toDS)
.filter("Itemcode1 <> 'Itemcode'")
df.show(false)
+---------+-----------+---------+
|Itemcode1|Itemcode2  |Itemcode3|
+---------+-----------+---------+
|DB9450   |DB9450     |AD9066   |
|DA0002   |DE2396     |DF2345   |
|HWC72    |null       |null     |
|GG7183   |EB6693     |null     |
|TA444    |B9X8X4     |null     |
+---------+-----------+---------+
Perhaps this is useful (Spark >= 2.4).
The split and TRANSFORM Spark SQL functions will do the magic, as below.
Load the provided test data
val data =
"""
|Itemcode
|
|DB9450//DB9450/AD9066
|
|DA0002/DE2396//DF2345
|
|HWC72
|
|GG7183/EB6693
|
|TA444/B9X8X4:7-2-
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---------------------+
* |Itemcode             |
* +---------------------+
* |DB9450//DB9450/AD9066|
* |DA0002/DE2396//DF2345|
* |HWC72                |
* |GG7183/EB6693        |
* |TA444/B9X8X4:7-2-    |
* +---------------------+
*
* root
* |-- Itemcode: string (nullable = true)
*/
Use split and TRANSFORM (you can run this query directly in pyspark)
import org.apache.spark.sql.functions.expr
df.withColumn("item_code", expr("TRANSFORM(split(Itemcode, '/+'), x -> split(x, ':')[0])"))
.selectExpr("item_code[0] item1", "item_code[1] item2", "item_code[2] item3")
.show(false)
/**
* +------+------+------+
* |item1 |item2 |item3 |
* +------+------+------+
* |DB9450|DB9450|AD9066|
* |DA0002|DE2396|DF2345|
* |HWC72 |null  |null  |
* |GG7183|EB6693|null  |
* |TA444 |B9X8X4|null  |
* +------+------+------+
*/
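On Spark 3 the same thing can also be expressed with the typed DSL instead of expr, since a transform function was added to org.apache.spark.sql.functions. A sketch, not required if you prefer the SQL expression:
import org.apache.spark.sql.functions.{col, split, transform}
// Same idea as the SQL TRANSFORM above: split on one or more slashes,
// then drop any ":..." suffix from each element.
val codes = transform(split(col("Itemcode"), "/+"), x => split(x, ":").getItem(0))
df.select(
  codes.getItem(0).as("item1"),
  codes.getItem(1).as("item2"),
  codes.getItem(2).as("item3")
).show(false)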

Spark: Multiple filter inside agg and concat not null values

I'm trying to concatenate the non-null values from a List column.
I know this can be done easily by using a UDF, but I would like to know how to handle it with multiple filter conditions inside the agg function.
What's missing here?
val df = sc.parallelize(Seq(("foo", List(null,"bar",null)),
("bar", List("one","two",null)),
("rio", List("Ria","","Kevin")))).toDF("key", "value")
+---+-----------------+
|key|            value|
+---+-----------------+
|foo|[null, bar, null]|
|bar| [one, two, null]|
|rio|   [Ria, , Kevin]|
+---+-----------------+
df.groupBy("key")
.agg(concat_ws(",",first(when(($"value".isNotNull || $"value" =!= ""),$"value"))).as("RemovedNullSeq"))
.show(false)
+---+--------------+
|key|RemovedNullSeq|
+---+--------------+
|bar|one,two       |
|rio|Ria,,Kevin    |
|foo|bar           |
+---+--------------+
I don't need that blank value in the second record.
Thanks
I'm not immediately sure if using aggregate functions is necessary based on the example provided.
If you're just trying to concatenate the values in an array then the following works:
val df = Seq(List(null,"abc", null),
List(null, null, null),
List(null, "def", "ghi", "kjl"),
List("mno", null, "pqr")).toDF("list")
df.withColumn("concat", concat_ws(",",$"list")).show(false)
+---------------------+-----------+
|list                 |concat     |
+---------------------+-----------+
|[null, abc, null]    |abc        |
|[null, null, null]   |           |
|[null, def, ghi, kjl]|def,ghi,kjl|
|[mno, null, pqr]     |mno,pqr    |
+---------------------+-----------+
If there is a need to group first:
val df2 = Seq((123,List(null,"abc", null)),
(123,List(null,"def", "hij"))).toDF("key","list")
df2.show(false)
+---+-----------------+
|key|list             |
+---+-----------------+
|123|[null, abc, null]|
|123|[null, def, hij] |
+---+-----------------+
You might think you could do something like
val grouped = df2.groupBy($"key").agg(collect_list($"list").as("collected"))
And then apply some functions to the array of arrays to obtain your concatenated result. However, I have been unable to find a way to do this without resorting to UDFs.
In this case, exploding before the grouping does the trick:
val grouped = df2.withColumn("listItem", explode($"list"))
  .groupBy($"key").agg(collect_list($"listItem").as("collected"))
  .withColumn("concat", concat_ws(",",$"collected"))
grouped.show(false)
+---+---------------+-----------+
|key|collected      |concat     |
+---+---------------+-----------+
|123|[abc, def, hij]|abc,def,hij|
+---+---------------+-----------+
Note however that there is no guarantee of the order in which the lists will be collected.
Hope this helps
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(("foo", List(null,"bar",null)),
("bar", List("one","two",null)),
("rio", List("Ria","","Kevin")))).toDF("key", "value")
val filtd = df.select($"key" as "key", explode($"value") as "val").where (length($"val") > 0)
val rsult = filtd.select($"*").groupBy($"key").agg(collect_list("val"))
rsult.show(5)
You can add multiple conditions like this:
val filtd = df.select($"key" as "key", explode($"value") as "val").where (length($"val") > 0 && $"val".isNotNull)
Output
+---+-----------------+
|key|collect_list(val)|
+---+-----------------+
|bar|       [one, two]|
|rio|     [Ria, Kevin]|
|foo|            [bar]|
+---+-----------------+
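As an aside, if you are on Spark 2.4 or later you can skip the explode/groupBy round trip entirely and filter the array in place with a higher-order function. A sketch against the same df from the question:
import org.apache.spark.sql.functions.{concat_ws, expr}
// filter() keeps only the non-null, non-empty elements; concat_ws then joins them.
val cleaned = df.withColumn(
  "RemovedNullSeq",
  concat_ws(",", expr("filter(value, x -> x is not null and x != '')"))
)
cleaned.show(false)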

In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

If I have a DataFrame called df that looks like:
+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| N/A| baz|
|null| etc|
+----+----+
I can selectively replace values like so:
val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2"))
so that df2 looks like:
+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
|null| etc|
+----+----+
but why can't I check if it's null, like:
val df3 = df2.withColumn("a1", when($"a1" === null, $"a2"))
so that I get:
+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
| etc| etc|
+----+----+
Edit: $"a1".isNull doesn't seem to work. Could it be because of how I'm constructing the dataframe I'm using to test, which is as follows?
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val schema = StructType(
StructField("a1", StringType, false) ::
StructField("a2", StringType, false) :: Nil
)
val data = sc.parallelize(Array(
Row("foo","bar"),
Row("N/A","baz"),
Row(null,"etc"))
)
val df = sqlContext.createDataFrame(data, schema)
I also can't use coalesce, as far as I know, because sometimes I need to use a static value instead of another column's value.
Edit again: Setting my test columns to nullable = false doesn't help.
Because null means no value, it can't be compared with ===: the comparison itself evaluates to null, which is treated as false.
Use the isNull function instead:
val df3 = df2.withColumn("a1", when($"a1".isNull, $"a2"))
or coalesce, which returns the first non-null value:
val df3 = df2.withColumn("a1", coalesce($"a1", $"a2"))
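Putting the two together for the case mentioned in the question, where a static fallback is sometimes needed (a sketch; the literal "unknown" is just an example value):
import org.apache.spark.sql.functions.{coalesce, lit, when}
// Replace both nulls and the "N/A" marker with a2, falling back to a static
// value when a2 is null as well.
val df4 = df.withColumn(
  "a1",
  when($"a1".isNull || $"a1" === "N/A", coalesce($"a2", lit("unknown")))
    .otherwise($"a1")
)
df4.show()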