Transform column with int flags to array of strings in pyspark - apache-spark-sql

I have a dataframe with a column called "traits" which is an integer composed of multiple flags.
I need to convert this column to a list of strings (for Elasticsearch indexing). The conversion looks like this:
TRAIT_0 = 0
TRAIT_1 = 1
TRAIT_2 = 2

def flag_to_list(flag: int) -> List[str]:
    trait_list = []
    if flag & (1 << TRAIT_0):
        trait_list.append("TRAIT_0")
    if flag & (1 << TRAIT_1):
        trait_list.append("TRAIT_1")
    if flag & (1 << TRAIT_2):
        trait_list.append("TRAIT_2")
    return trait_list
What is the most efficient way of doing this transformation in pyspark? I saw lots of examples on how to do concatenation and splitting of strings, but not an operation like this.
Using pyspark version 2.4.5.
Input json looks like this:
{ "name": "John Doe", "traits": 5 }
Output json should look like this:
{ "name": "John Doe", "traits": ["TRAIT_0", "TRAIT_2"] }

IIUC, you can try Spark SQL built-in functions: (1) use conv + split to convert the integer (base-10) to a binary string (base-2) and then to a reversed array of strings; (2) based on the 0/1 values and their array indices, transform and filter that array into the corresponding array of named traits:
from pyspark.sql.functions import expr
df = spark.createDataFrame([("name1", 5),("name2", 1),("name3", 0),("name4", 12)], ['name', 'traits'])
#DataFrame[name: string, traits: bigint]
traits = [ "Traits_{}".format(i) for i in range(8) ]
traits_array = "array({})".format(",".join("'{}'".format(e) for e in traits))
# array('Traits_0','Traits_1','Traits_2','Traits_3','Traits_4','Traits_5','Traits_6','Traits_7')
sql_expr = """
filter(
transform(
/* convert int -> binary -> string -> array of strings, and then reverse the array */
reverse(split(string(conv(traits,10,2)),'(?!$)')),
/* take the corresponding items from the traits_array when value > 0, else NULL */
(x,i) -> {}[IF(x='1',i,NULL)]
),
/* filter out NULL items from the array */
y -> y is not NULL
) AS trait_list
""".format(traits_array)
# filter(
#   transform(
#     reverse(split(string(conv(traits,10,2)),'(?!$)')),
#     (x,i) -> array('Traits_0','Traits_1','Traits_2','Traits_3','Traits_4','Traits_5','Traits_6','Traits_7')[IF(x='1',i,NULL)]
#   ),
#   y -> y is not NULL
# )
df.withColumn("traits_list", expr(sql_expr)).show(truncate=False)
+-----+------+--------------------+
|name |traits|traits_list |
+-----+------+--------------------+
|name1|5 |[Traits_0, Traits_2]|
|name2|1 |[Traits_0] |
|name3|0 |[] |
|name4|12 |[Traits_2, Traits_3]|
+-----+------+--------------------+
Below is the result of running reverse(split(string(conv(traits,10,2)),'(?!$)')); notice that the split pattern (?!$) is used so that the split does not add an empty trailing element to the array.
df.selectExpr("*", "reverse(split(string(conv(traits,10,2)),'(?!$)')) as t1").show()
+-----+------+------------+
| name|traits| t1|
+-----+------+------------+
|name1| 5| [1, 0, 1]|
|name2| 1| [1]|
|name3| 0| [0]|
|name4| 12|[0, 0, 1, 1]|
+-----+------+------------+

We can define a UDF to wrap your function and then call it. This is some sample code:
from typing import List

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

TRAIT_0 = 0
TRAIT_1 = 1
TRAIT_2 = 2

def flag_to_list(flag: int) -> List[str]:
    trait_list = []
    if flag & (1 << TRAIT_0):
        trait_list.append("TRAIT_0")
    if flag & (1 << TRAIT_1):
        trait_list.append("TRAIT_1")
    if flag & (1 << TRAIT_2):
        trait_list.append("TRAIT_2")
    return trait_list

flag_to_list_udf = udf(lambda x: None if x is None else flag_to_list(x),
                       ArrayType(StringType()))
# Create dummy data to test
data = [
    {"name": "John Doe", "traits": 5},
    {"name": "Jane Doe", "traits": 2},
    {"name": "Jane Roe", "traits": 0},
    {"name": "John Roe", "traits": 6},
]
df = spark.createDataFrame(data, 'name STRING, traits INT')
df.show()
# +--------+------+
# | name|traits|
# +--------+------+
# |John Doe| 5|
# |Jane Doe| 2|
# |Jane Roe| 0|
# |John Roe| 6|
# +--------+------+
df = df.withColumn('traits_processed', flag_to_list_udf(df['traits']))
df.show()
# +--------+------+------------------+
# |    name|traits|  traits_processed|
# +--------+------+------------------+
# |John Doe|     5|[TRAIT_0, TRAIT_2]|
# |Jane Doe|     2|         [TRAIT_1]|
# |Jane Roe|     0|                []|
# |John Roe|     6|[TRAIT_1, TRAIT_2]|
# +--------+------+------------------+
If you don't want to create a new column, you can replace traits_processed with traits.
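For example, a minimal sketch reusing the same UDF to overwrite the column in place:
# Overwrite the existing traits column instead of adding traits_processed
df = df.withColumn('traits', flag_to_list_udf(df['traits']))
df.show()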

Related

Transform list of map to dataframe

I have the following data:
d = Some(List(Map(id -> 1, n -> Hi), Map(id -> 2, n -> Hello)))
I would like to transform it into a dataframe like the following:
+---+-----+
| id|    n|
+---+-----+
|  1|   Hi|
|  2|Hello|
+---+-----+
I tried the following:
import spark.implicits._
val df = d
.map( m => (m.get("id"),m.get("n")))
.toDF("id", "n")
But I'm getting:
error: value get is not a member of Any
.map( m => (m.get("id"),m.get("n")))
Your top level here is an Option, and I think that's the reason why you can't handle it with a single map. I managed to do it with something like this:
import spark.implicits._

val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
val data = d.fold(List.empty[(Option[String], Option[String])])(_.map(m => (m.get("id"), m.get("n"))))
data.toDF("id", "n").show()
Output:
+---+-----+
| id| n|
+---+-----+
| 1| Hi|
| 2|Hello|
+---+-----+

N conditions in a dynamic 'when' clause

I have the following code
from pyspark.sql.functions import col, count, when
from functools import reduce
df = spark.createDataFrame([ (1,""), (2,None),(3,"c"),(4,"d") ], ['id','name'])
filter1 = col("name").isNull()
filter2 = col("name") == ""
dfresult = df.filter(filter1 | filter2).select(col("id"), when(filter1, "name is null").when(filter2, "name is empty").alias("new_col"))
dfresult.show()
+---+-------------+
| id| new_col|
+---+-------------+
| 1|name is empty|
| 2| name is null|
+---+-------------+
In a scenario with N filters, I'm thinking about:
filters = []
filters.append({"item": filter1, "msg": "name is null"})
filters.append({"item": filter2, "msg": "name is empty"})

dynamic_filter = reduce(
    lambda x, y: x | y,
    [s['item'] for s in filters]
)

df2 = df.filter(dynamic_filter).select(col("id"), when(filter1, "name is null").when(filter2, "name is empty").alias("new_col"))
df2.show()
How can I build the new_col column with a dynamic when instead of hard-coding each condition?
Simply use functools.reduce as you already did for the filter expression:
from functools import reduce
from pyspark.sql import functions as F

new_col = reduce(
    lambda acc, x: acc.when(x["item"], F.lit(x["msg"])),
    filters,
    F
)
df2 = df.filter(dynamic_filter).select(col("id"), new_col.alias("new_col"))
df2.show()
#+---+-------------+
#| id| new_col|
#+---+-------------+
#| 1|name is empty|
#| 2| name is null|
#+---+-------------+
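Note that rows matching none of the filters end up with NULL in new_col. If you prefer a default label, you can chain an otherwise onto the reduced column; a minimal sketch (the "ok" label is hypothetical, not from the original answer):
new_col_with_default = reduce(
    lambda acc, x: acc.when(x["item"], F.lit(x["msg"])),
    filters,
    F
).otherwise(F.lit("ok"))  # hypothetical default label for rows matching no filter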

size function applied to empty array column in dataframe returns 1 after split

I noticed the following behavior of the size function on an array column in a dataframe, using this code, which includes a split:
import org.apache.spark.sql.functions.{trim, explode, split, size}

val df1 = Seq(
  (1, "[{a},{b},{c}]"),
  (2, "[]"),
  (3, "[{d},{e},{f}]")
).toDF("col1", "col2")
df1.show(false)

val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", size($"cola"))
df2.show(false)
we get:
+----+-------------+---------------+---+
|col1|col2 |cola |s |
+----+-------------+---------------+---+
|1 |[{a},{b},{c}]|[{a}, {b}, {c}]|3 |
|2 |[] |[] |1 |
|3 |[{d},{e},{f}]|[{d}, {e}, {f}]|3 |
+----+-------------+---------------+---+
I was hoping for a zero so as to be able to distinguish between 0 and 1 entries.
A few hints here and there on SO, but none that helped.
If I have the following entry: (2, null), then I get size -1, which is more helpful I guess.
On the other hand, this borrowed sample from the internet:
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
val df3 = df2.withColumn("s", size($"numbers"))
df3.show()
does return 0 - as expected.
Looking for the correct approach here so as to get size = 0.
This behavior is inherited from the Java split function, which is used in the same way in Scala and Spark. The empty input is a special case, and this is well discussed in this SO post.
Spark sets the default value for the second parameter (limit) of the split function to -1. As of Spark 3, we can pass the limit parameter to split explicitly (see the sketch after the examples below).
You can see this by comparing the Scala split function with the Spark SQL split function:
"".split(",").length
//res31: Int = 1
spark.sql("""select size(split("", '[,]'))""").show
//+----------------------+
//|size(split(, [,], -1))|
//+----------------------+
//| 1|
//+----------------------+
And
",".split(",").length // without setting limit=-1 this gives empty array
//res33: Int = 0
",".split(",", -1).length
//res34: Int = 2
spark.sql("""select size(split(",", '[,]'))""").show
//+-----------------------+
//|size(split(,, [,], -1))|
//+-----------------------+
//| 2|
//+-----------------------+
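For completeness, a minimal PySpark sketch of the explicit limit, assuming Spark 3.x where pyspark.sql.functions.split accepts a limit argument (not available in the Spark 2.x versions discussed above):
from pyspark.sql import functions as F

df = spark.createDataFrame([("a,b,c",)], ["s"])
# With limit=2 the pattern is applied at most once, so the remainder stays in the last element
df.select(F.split("s", "[,]", 2).alias("parts")).show()
# +--------+
# |   parts|
# +--------+
# |[a, b,c]|
# +--------+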
I suppose the root cause is that split returns an empty string, instead of a null.
scala> df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", $"cola"(0)).select("s").collect()(1)(0)
res53: Any = ""
And the size of an array containing an empty string is, of course, 1.
To get around this, perhaps you could do
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
             .withColumn("s", when(length($"cola"(0)) =!= 0, size($"cola"))
                              .otherwise(lit(0)))
df2.show(false)
+----+-------------+---------------+---+
|col1|col2 |cola |s |
+----+-------------+---------------+---+
|1 |[{a},{b},{c}]|[{a}, {b}, {c}]|3 |
|2 |[] |[] |0 |
|3 |[{d},{e},{f}]|[{d}, {e}, {f}]|3 |
+----+-------------+---------------+---+

Convert a URI query string to an Array of Struct key-value in PySpark

I have a DataFrame in PySpark with a column of URI query-string (StringType) like this:
+--------------+
| cs_uri_query |
+--------------+
| a=1&b=2&c=3  |
| d&e=&f=4     |
+--------------+
I need to convert this column into an ArrayType of struct elements with the following structure:
ArrayType(StructType([StructField('key', StringType(), nullable=False),
                      StructField('value', StringType(), nullable=True)]))
My expected column is like this:
+------------------------------------------------------------+
| cs_uri_query                                                |
+------------------------------------------------------------+
| [{key=a, value=1},{key=b, value=2},{key=c, value=3}]        |
| [{key=d, value=null},{key=e, value=null},{key=f, value=4}]  |
+------------------------------------------------------------+
A UDF is the only way I found to achieve this. If possible, I would like to stick to pure Spark functions and avoid UDFs, since UDFs have very bad performance in PySpark, unlike Spark in Scala.
This is my code using UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

def parse_query(query):
    args = None
    if query:
        args = []
        for arg in query.split("&"):
            if arg:
                if "=" in arg:
                    a = arg.split("=")
                    if a[0]:
                        v = a[1] if a[1] else None
                        args.append({"key": a[0], "value": v})
                else:
                    args.append({"key": arg, "value": None})
    return args

uri_query = ArrayType(StructType([StructField('key', StringType(), nullable=True),
                                  StructField('value', StringType(), nullable=True)]))

udf_parse_query = udf(lambda args: parse_query(args), uri_query)

df = df.withColumn("cs_uri_query", udf_parse_query(df["cs_uri_query"]))
Is someone able to open my eyes with an amazing solution?
For Spark 2.4+, you can split by & and then use transform function to convert each element key=value to a struct(key, value):
from pyspark.sql.functions import expr
df = spark.createDataFrame([("a=1&b=2&c=3",), ("d&e=&f=4",)], ["cs_uri_query"])
transform_expr = """transform(split(cs_uri_query, '&'),
x -> struct(split(x, '=')[0] as key, split(x, '=')[1] as value)
)
"""
df.withColumn("cs_uri_query", expr(transform_expr)).show(truncate=False)
#+------------------------+
#|cs_uri_query |
#+------------------------+
#|[[a, 1], [b, 2], [c, 3]]|
#|[[d,], [e, ], [f, 4]] |
#+------------------------+
EDIT
If you want to filter out keys that are null or empty then you can use filter along with the above transform expression:
transform_expr = """filter(transform(split(cs_uri_query, '&'),
x -> struct(split(x, '=')[0] as key, split(x, '=')[1] as value)
),
x -> ifnull(x.key, '') <> ''
)
"""

Including null values in an Apache Spark Join

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default.
Here is the default Spark behavior.
val numbersDf = Seq(
  ("123"),
  ("456"),
  (null),
  ("")
).toDF("numbers")

val lettersDf = Seq(
  ("123", "abc"),
  ("456", "def"),
  (null, "zzz"),
  ("", "hhh")
).toDF("numbers", "letters")

val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
Here is the output of joinedDf.show():
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
+-------+-------+
This is the output I would like:
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
| null| zzz|
+-------+-------+
Spark provides a special NULL safe equality operator:
numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
numbers_df = sc.parallelize([
    ("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])

letters_df = sc.parallelize([
    ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])

numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
and %<=>% in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
  numbers = c("123", "456", NA, ""),
  letters = c("abc", "def", "zzz", "hhh")
))
head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
  numbers numbers letters
1     456     456     def
2    <NA>    <NA>     zzz
3                     hhh
4     123     123     abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
This can be used with the DataFrame API as well:
numbersDf.alias("numbers")
  .join(lettersDf.alias("letters"))
  .where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
val numbers2 = numbersDf.withColumnRenamed("numbers","num1") //rename columns so that we can disambiguate them in the join
val letters2 = lettersDf.withColumnRenamed("numbers","num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull) ,"outer")
joinedDf.select("num1","letters").withColumnRenamed("num1","numbers").show //rename the columns back to the original names
Based on K L's idea, you could use foldLeft to generate the join column expression:
def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  leftDF.join(rightDF, fullExpr, joinType)
}
then, you could call this function just like:
aDF.transform(nullSafeJoin(bDF, columns, joinType))
Complementing the other answers, for PySpark < 2.3.0 you would have neither Column.eqNullSafe nor IS NOT DISTINCT FROM.
You can still build the <=> operator with an SQL expression and include it in the join, as long as you define aliases for the join queries:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F

numbers_df = spark.createDataFrame(["123", "456", None, ""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame(
    [("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")]
).toDF("numbers", "letters")

joined_df = numbers_df.alias("numbers").join(
    letters_df.alias("letters"),
    F.expr('numbers.numbers <=> letters.numbers')
).select('letters.*')
joined_df.show()
+-------+-------+
|numbers|letters|
+-------+-------+
| 456| def|
| null| zzz|
| | hhh|
| 123| abc|
+-------+-------+
Based on timothyzhang's idea one can further improve it by removing duplicate columns:
def dropDuplicateColumns(df: DataFrame, rightDf: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df)((df, c) => df.drop(rightDf(c)))

def joinTablesWithSafeNulls(rightDF: DataFrame, leftDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  val finalDF = leftDF.join(rightDF, fullExpr, joinType)
  val filteredDF = dropDuplicateColumns(finalDF, rightDF, columns)
  filteredDF
}
Try the following method to include the null rows in the result of the JOIN operator:
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  var columnsExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  columns.drop(1).foreach(column => {
    columnsExpr = columnsExpr && (leftDF(column) <=> rightDF(column))
  })
  var joinedDF: DataFrame = leftDF.join(rightDF, columnsExpr, joinType)
  columns.foreach(column => {
    joinedDF = joinedDF.drop(leftDF(column))
  })
  joinedDF
}
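For PySpark users, a minimal sketch of the same idea (not from the original answers), assuming Spark 2.3+ so that Column.eqNullSafe is available:
from functools import reduce
from pyspark.sql import DataFrame

def null_safe_join(left: DataFrame, right: DataFrame, columns, join_type: str) -> DataFrame:
    # Build a null-safe equality condition over all join columns
    cond = reduce(
        lambda acc, c: acc & left[c].eqNullSafe(right[c]),
        columns[1:],
        left[columns[0]].eqNullSafe(right[columns[0]]),
    )
    joined = left.join(right, cond, join_type)
    # Drop the duplicated join columns coming from the right-hand side
    return reduce(lambda df, c: df.drop(right[c]), columns, joined)
Usage would look like, for example, null_safe_join(numbers_df, letters_df, ["numbers"], "outer").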