How to replace characters in a Spark DataFrame like in a pandas DataFrame? - apache-spark-sql

I have a DataFrame in Spark like this:
+--------------------+--------------------+
| tag_1| tag_2|
+--------------------+--------------------+
|[music,rhythm,3ye...|[{"tag_id":1400,"...|
| [""]|[{"tag_id":1464,"...|
|[picture,language...| [""]|
+--------------------+--------------------+
Question 1:
Is there any way to display the full contents instead of the truncated "..."?
Question 2:
How to replace [""] with "Na"?
Question 3:
How to remove "[" and "]" in column tag_1 and column tag_2, and remove [{"tag_id": and "tag_name":" in column tag_2?
How can I get the same results in Spark? How can I deal with DataFrame columns the way I do in pandas?
In pandas I would use regular expressions like this:
import re

def replace_1(x):
    pa_1 = re.compile(r"\[\"\"\]")
    x = pa_1.sub("Na", x)
    return x.strip()

def replace_2(x):
    pa_1 = re.compile(r"[\[\]\"{}]")
    pa_2 = re.compile(r'tag_id:')
    pa_3 = re.compile(r'tag_name:')
    x = pa_1.sub("", x)
    x = pa_2.sub("", x)
    x = pa_3.sub("", x)
    return x.strip()

df_1.tag_1 = df_1.tag_1.apply(replace_1).apply(replace_2)
df_1.tag_2 = df_1.tag_2.apply(replace_1).apply(replace_2)
I tried:
def stemmize_loc(word: String): String = {
  word.replaceAll("[", "")
}
def udfstemmize_loc = udf(stemmize_loc _)
val df1 = df.withColumn("tag_1", udfstemmize_loc(col("tag_1")))
df1.show()
This didn't work~
So, how can I remove these characters and get the final result in Spark like this:
+------------------------------+--------------------+
| tag_1| tag_2|
+------------------------------+--------------------+
| music,rhythm,3years,boy| 1400,music|
| Na| 1464,comunication|
|picture,language,4years,myself| Na|
+------------------------------+--------------------+
Thank you very much~~
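A minimal sketch of one possible approach, assuming tag_1 and tag_2 are plain string columns: replaceAll treats its first argument as a regular expression, so a bare "[" is an invalid pattern and would need escaping (e.g. "\\["); Spark's built-in regexp_replace and when can do the whole cleanup, and show(truncate = false) covers Question 1. The cleanTags helper name below is made up for illustration.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace, when}

// Hypothetical helper, assuming tag_1 and tag_2 are plain string columns.
def cleanTags(df: DataFrame, colName: String): DataFrame = {
  // First drop the "tag_id": / "tag_name": labels, then strip brackets, braces and quotes.
  val noLabels = regexp_replace(col(colName), "\"tag_id\":|\"tag_name\":", "")
  val stripped = regexp_replace(noLabels, "[\\[\\]{}\"]", "")
  // Map the empty marker [""] to "Na", otherwise keep the cleaned value.
  df.withColumn(colName, when(col(colName) === "[\"\"]", "Na").otherwise(stripped))
}

val cleaned = cleanTags(cleanTags(df, "tag_1"), "tag_2")
cleaned.show(truncate = false)  // truncate = false prints full cell contents (Question 1)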

Related

Transform list of map to dataframe

I have the following data:
d = Some(List(Map(id -> 1, n -> Hi), Map(id -> 2, n -> Hello)))
I would like to transform it into a dataframe like the following:
+---+-----+
| id|    n|
+---+-----+
|  1|   Hi|
|  2|Hello|
+---+-----+
I tried the following:
import spark.implicits._
val df = d
  .map(m => (m.get("id"), m.get("n")))
  .toDF("id", "n")
But I'm getting:
error: value get is not a member of Any
.map( m => (m.get("id"),m.get("n")))
Your top level here is Option, and I think that's the reason why you can't handle it with a single map. I managed to do it with something like this:
import spark.implicits._
val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
val data = d.fold(List.empty[(Option[String], Option[String])])(_.map(m => (m.get("id"), m.get("n"))))
val df = data.toDF("id", "n")
df.show()
Output:
+---+-----+
| id| n|
+---+-----+
| 1| Hi|
| 2|Hello|
+---+-----+
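For comparison, a sketch under the same assumptions that unwraps the Option explicitly before mapping:
import spark.implicits._

// Unwrap the Option (falling back to an empty list when None), then build the rows.
val rows = d.getOrElse(Nil).map(m => (m.get("id"), m.get("n")))
rows.toDF("id", "n").show()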

How to use separate key lists to perform a join between two DataFrames?

I want to join two different DataFrames (dfA and dfB) built as follows:
dfA.show()
+-----+-------+-------+
| id_A| name_A|address|
+-----+-------+-------+
| 1| AAAA| Paris|
| 4| DDDD| Sydney|
+-----+-------+-------+
dfB.show()
+-----+-------+---------+
| id_B| name_B| job|
+-----+-------+---------+
| 1| AAAA| Analyst|
| 2| AERF| Engineer|
| 3| UOPY| Gardener|
| 4| DDDD| Insurer|
+-----+-------+---------+
I need to use the following lists in order to do the join:
val keyListA = List("id_A", "name_A")
val keyListB = List("id_B", "name_B")
A simple solution would be:
val join = dfA.join(
  dfB,
  dfA("id_A") === dfB("id_B") &&
    dfA("name_A") === dfB("name_B"),
  "left_outer")
Is there a syntax that would allow you to do this join by using the keyListA and keyListB lists?
If you really want to build your join expression from lists of column names:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

val dfA: DataFrame = ???
val dfB: DataFrame = ???

val keyListA = List("id_A", "name_A", "property1_A", "property2_A", "property3_A")
val keyListB = List("id_B", "name_B", "property1_B", "property2_B", "property3_B")

def joinExprsFrom(keyListA: List[String], keyListB: List[String]): Column =
  keyListA
    .zip(keyListB)
    .map { case (fromA, fromB) => col(fromA) === col(fromB) }
    .reduce((acc, expr) => acc && expr)

dfA.join(
  dfB,
  joinExprsFrom(keyListA, keyListB),
  "left_outer")
You need to make sure keyListA and keyListB are the same size and non-empty.
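A minimal sketch of that guard, reusing the joinExprsFrom helper above (joinExprsFromChecked is a made-up name):
def joinExprsFromChecked(keyListA: List[String], keyListB: List[String]): Column = {
  // Fail fast on mismatched or empty key lists before building the join expression.
  require(keyListA.nonEmpty && keyListA.size == keyListB.size,
    s"Key lists must be non-empty and equally sized, got ${keyListA.size} and ${keyListB.size}")
  joinExprsFrom(keyListA, keyListB)
}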

Convert a URI query string to an Array of Struct key-value in PySpark

I have a DataFrame in PySpark with a column of URI query strings (StringType) like this:
+--------------+
| cs_uri_query |
+--------------+
| a=1&b=2&c=3 |
+--------------+
| d&e=&f=4 |
+--------------+
I need to convert this column into an ArrayType of structs with the following structure:
ArrayType(StructType([StructField('key', StringType(), nullable=False),
                      StructField('value', StringType(), nullable=True)]))
My expected column is like this:
+------------------------------------------------------------+
| cs_uri_query |
+------------------------------------------------------------+
| [{key=a, value=1},{key=b, value=2},{key=c, value=3}] |
+------------------------------------------------------------+
| [{key=d, value=null},{key=e, value=null},{key=f, value=4}] |
+------------------------------------------------------------+
A UDF is the only way I found to achieve this.
I want to use pure Spark functions and, if possible, avoid UDFs...
UDFs have very bad performance in PySpark, unlike in Spark with Scala.
This is my code using a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

def parse_query(query):
    args = None
    if query:
        args = []
        for arg in query.split("&"):
            if arg:
                if "=" in arg:
                    a = arg.split("=")
                    if a[0]:
                        v = a[1] if a[1] else None
                        args.append({"key": a[0], "value": v})
                else:
                    args.append({"key": arg, "value": None})
    return args

uri_query = ArrayType(StructType([StructField('key', StringType(), nullable=True),
                                  StructField('value', StringType(), nullable=True)]))

udf_parse_query = udf(lambda args: parse_query(args), uri_query)
df = df.withColumn("cs_uri_query", udf_parse_query(df["cs_uri_query"]))
Can someone open my eyes with an amazing solution?
For Spark 2.4+, you can split by & and then use the transform function to convert each key=value element into a struct(key, value):
from pyspark.sql.functions import expr

df = spark.createDataFrame([("a=1&b=2&c=3",), ("d&e=&f=4",)], ["cs_uri_query"])

transform_expr = """transform(split(cs_uri_query, '&'),
                              x -> struct(split(x, '=')[0] as key, split(x, '=')[1] as value)
                             )
                 """

df.withColumn("cs_uri_query", expr(transform_expr)).show(truncate=False)
#+------------------------+
#|cs_uri_query |
#+------------------------+
#|[[a, 1], [b, 2], [c, 3]]|
#|[[d,], [e, ], [f, 4]] |
#+------------------------+
EDIT
If you want to filter out keys that are null or empty then you can use filter along with the above transform expression:
transform_expr = """filter(transform(split(cs_uri_query, '&'),
                                     x -> struct(split(x, '=')[0] as key, split(x, '=')[1] as value)
                                    ),
                           x -> ifnull(x.key, '') <> ''
                          )
                 """

Including null values in an Apache Spark Join

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default.
Here is the default Spark behavior.
val numbersDf = Seq(
  ("123"),
  ("456"),
  (null),
  ("")
).toDF("numbers")

val lettersDf = Seq(
  ("123", "abc"),
  ("456", "def"),
  (null, "zzz"),
  ("", "hhh")
).toDF("numbers", "letters")

val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
Here is the output of joinedDf.show():
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
+-------+-------+
This is the output I would like:
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
| null| zzz|
+-------+-------+
Spark provides a special NULL safe equality operator:
numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
numbers_df = sc.parallelize([
    ("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])

letters_df = sc.parallelize([
    ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])

numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
and %<=>% in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
  numbers = c("123", "456", NA, ""),
  letters = c("abc", "def", "zzz", "hhh")
))

head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
  numbers numbers letters
1     456     456     def
2    <NA>    <NA>     zzz
3                     hhh
4     123     123     abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
This can be used with the DataFrame API as well:
numbersDf.alias("numbers")
  .join(lettersDf.alias("letters"))
  .where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
// rename columns so that we can disambiguate them in the join
val numbers2 = numbersDf.withColumnRenamed("numbers", "num1")
val letters2 = lettersDf.withColumnRenamed("numbers", "num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull), "outer")
// rename the columns back to the original names
joinedDf.select("num1", "letters").withColumnRenamed("num1", "numbers").show
Based on K L's idea, you could use foldLeft to generate the join column expression:
def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  leftDF.join(rightDF, fullExpr, joinType)
}
then, you could call this function just like:
aDF.transform(nullSafeJoin(bDF, columns, joinType))
Complementing the other answers: for PySpark < 2.3.0 you have neither Column.eqNullSafe nor IS NOT DISTINCT FROM.
You can still build the <=> operator with a SQL expression to include it in the join, as long as you define aliases for the join queries:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F

numbers_df = spark.createDataFrame(["123", "456", None, ""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame([("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")]) \
    .toDF("numbers", "letters")

joined_df = numbers_df.alias("numbers").join(
    letters_df.alias("letters"),
    F.expr('numbers.numbers <=> letters.numbers')
).select('letters.*')

joined_df.show()
+-------+-------+
|numbers|letters|
+-------+-------+
| 456| def|
| null| zzz|
| | hhh|
| 123| abc|
+-------+-------+
Based on timothyzhang's idea one can further improve it by removing duplicate columns:
def dropDuplicateColumns(df: DataFrame, rightDf: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df)((df, c) => df.drop(rightDf(c)))

def joinTablesWithSafeNulls(rightDF: DataFrame, leftDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  val finalDF = leftDF.join(rightDF, fullExpr, joinType)
  val filteredDF = dropDuplicateColumns(finalDF, rightDF, columns)
  filteredDF
}
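A hypothetical call, reusing numbersDf and lettersDf from the question; note that in this signature the right-hand DataFrame comes first:
// Hypothetical usage: the right-hand DataFrame is the first parameter here.
val joined = joinTablesWithSafeNulls(lettersDf, numbersDf, Seq("numbers"), "inner")
joined.show()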
Try the following method to include the null rows in the result of the JOIN operator:
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  var columnsExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  columns.drop(1).foreach(column => {
    columnsExpr = columnsExpr && (leftDF(column) <=> rightDF(column))
  })

  var joinedDF: DataFrame = leftDF.join(rightDF, columnsExpr, joinType)
  columns.foreach(column => {
    joinedDF = joinedDF.drop(leftDF(column))
  })

  joinedDF
}
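A hypothetical call to this variant, again with the question's numbersDf and lettersDf:
// Hypothetical usage; the helper drops the left DataFrame's join columns after joining.
val joined = nullSafeJoin(numbersDf, lettersDf, Seq("numbers"), "inner")
joined.show()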

Django: "Expression can be simplified" (square brackets)

def _by_list(request, names, every_name):
    every_names = request.GET.getlist(every_name, '')
    names_list = map(lambda n: Q(**{names: n}), every_names)
    if names_list == []:  # Expression can be simplified
        return Q(**{names: '_'})
    else:
        return reduce(lambda a, b: a | b, names_list)
If I use 'if names_list', it doesn't work as expected.