I have a table with three columns [s, p, o]. I would like to remove the rows where, for a given entry in s, the p column does not contain both of the values [P625, P36]. For example:
+---+----+----+
|  s|   p|   o|
+---+----+----+
|Q31| P36|Q239|
|Q31|P625|  51|
|Q45| P36|Q597|
|Q45|P625| 123|
|Q51|P625|  22|
|Q24|P625|  56|
+---+----+----+
The end result should be
+---+----+----+
|  s|   p|   o|
+---+----+----+
|Q31| P36|Q239|
|Q31|P625|  51|
|Q45| P36|Q597|
|Q45|P625| 123|
+---+----+----+
Using a join operation, the above task is easy:
df.filter(df.p=='P625').join(df.filter(df.p=='P36'),'s')
But is there a more elegant way to do this?
You need a window
from pyspark.sql import Window
from pyspark.sql.functions import *
winSpec = Window.partitionBy('s')
df.withColumn("p_list", collect_list("p").over(winSpec)) \
  .filter(array_contains(col("p_list"), "P625") & array_contains(col("p_list"), "P36")) \
  .drop("p_list")
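Another way to express the same window idea, if it helps (a minimal sketch, untested against your exact data): collect only the two predicates of interest into a set per s and keep the rows of any s that saw both.
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("s")

# when() without otherwise() yields null for other predicates, and collect_set
# ignores nulls, so the set per s holds at most {P36, P625}
result = (df.withColumn(
              "hits",
              F.size(F.collect_set(F.when(F.col("p").isin("P36", "P625"), F.col("p"))).over(w)))
            .filter(F.col("hits") == 2)
            .drop("hits"))
result.show()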
Forgive me, as I'm much more familiar with the Scala API, but perhaps you can easily convert it:
scala> val df = spark.createDataset(Seq(
| ("Q31", "P36", "Q239"),
| ("Q31", "P625", "51"),
| ("Q45", "P36", "Q597"),
| ("Q45", "P625", "123"),
| ("Q51", "P625", "22"),
| ("Q24", "P625", "56")
| )).toDF("s", "p", "o")
df: org.apache.spark.sql.DataFrame = [s: string, p: string ... 1 more field]
scala> (df.select($"s", struct($"p", $"o").as("po"))
| .groupBy("s")
| .agg(collect_list($"po").as("polist"))
| .as[(String, Array[(String, String)])]
| .flatMap(r => {
| val ps = r._2.map(_._1).toSet
| if(ps("P625") && ps("P36")) {
| r._2.flatMap(po => Some(r._1, po._1, po._2))
| } else {
| None
| }
| }).toDF("s", "p", "o")
| .show())
+---+----+----+
| s| p| o|
+---+----+----+
|Q31| P36|Q239|
|Q31|P625| 51|
|Q45| P36|Q597|
|Q45|P625| 123|
+---+----+----+
For reference, your join() command above would have returned:
scala> df.filter($"p" === "P625").join(df.filter($"p" === "P36"), "s").show
+---+----+---+---+----+
| s| p| o| p| o|
+---+----+---+---+----+
|Q31|P625| 51|P36|Q239|
|Q45|P625|123|P36|Q597|
+---+----+---+---+----+
That result can be worked into your final solution as well, perhaps with less code, but I'm not sure which method would be more efficient, as that's largely data-dependent.
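If you want to stay in PySpark, a rough translation of the same group-and-filter idea is sketched below (an approximation, not a verbatim port of the Scala above): collect the distinct p values per s, keep the keys that contain both predicates, and semi-join back so the output keeps the original three columns.
from pyspark.sql import functions as F

keys = (df.groupBy("s")
          .agg(F.collect_set("p").alias("ps"))
          .filter(F.array_contains("ps", "P625") & F.array_contains("ps", "P36"))
          .select("s"))

# left_semi keeps every original (s, p, o) row whose key survived the filter
result = df.join(keys, "s", "left_semi")
result.show()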
I would like to replace null values of a PySpark DataFrame column with another string column converted to an array.
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
new_customers.printSchema()
new_customers.show(5, False)
new_customers = new_customers.withColumn("new_val", F.coalesce(F.col("val"), F.array(F.col("name"))))
new_customers.show(10, truncate=False)
But the result is:
root
|-- name: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: string (containsNull = true)
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[] |
|Cosimo|[d]|[d] |
+------+---+-------+
What I expect:
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
Did I miss something? Thanks!
The problem is that you have an array with a null element in it; it will not test positive for an isNull check.
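You can check this directly: the array itself is non-null even though its only element is (a minimal sketch, reusing the new_customers frame from the question):
import pyspark.sql.functions as F

# the column holds an array like [null]; the array is not null, only its element is,
# so coalesce never falls back to the name
new_customers.select("name", "val", F.col("val").isNull().alias("val_is_null")).show()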
First clean up single-null-element arrays:
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
+------+------+
|name |val |
+------+------+
|Karen |[a] |
|Penny |[b] |
|John |[null]|
|Cosimo|[d] |
+------+------+
new_customers = new_customers.withColumn("val", F.filter(F.col("val"), lambda x: x.isNotNull()))
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
Then change your expression to check for an empty array instead of a null:
new_customers = new_customers.withColumn("new_val", F.when(F.size("val")>0, F.col("val")).otherwise(F.array(F.col("name"))))
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
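If you would rather not overwrite val, the cleanup and the fallback can also be folded into the new_val expression directly (a sketch under the same assumptions; F.filter needs Spark 3.1+):
import pyspark.sql.functions as F

cleaned = F.filter("val", lambda x: x.isNotNull())

new_customers = new_customers.withColumn(
    "new_val",
    F.when(F.size(cleaned) > 0, cleaned).otherwise(F.array(F.col("name")))
)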
For example, if I have a column as given below, obtained by reading and showing the CSV in PySpark:
+--------+
| Names|
+--------+
|Rahul |
|Ravi |
|Raghu |
|Romeo |
+--------+
and if I specify in my function
Length = 2
Maxsplit = 3
then I should get the results as
+----------+-----------+----------+
|Col_1 |Col_2 |Col_3 |
+----------+-----------+----------+
| Ra | hu | l |
| Ra | vi | Null |
| Ra | gh | u |
| Ro | me | o |
+----------+-----------+----------+
Similarly, in PySpark, with
Length = 3
Maxsplit = 2
it should give me output such as
+----------+-----------+
|Col_1 |Col_2 |
+----------+-----------+
| Rah | ul |
| Rav | i |
| Rag | hu |
| Rom | eo |
+----------+-----------+
This is how it should look. Thank you!
Another way to go about this; it should be faster than any looping or UDF solution.
from pyspark.sql import functions as F
def split(df, length, maxsplit):
    return df.withColumn('Names', F.split("Names", "(?<=\\G{})".format('.' * length)))\
             .select(*((F.col("Names")[x]).alias("Col_" + str(x + 1)) for x in range(0, maxsplit)))
split(df,3,2).show()
#+-----+-----+
#|Col_1|Col_2|
#+-----+-----+
#| Rah| ul|
#| Rav| i|
#| Rag| hu|
#| Rom| eo|
#+-----+-----+
split(df,2,3).show()
#+-----+-----+-----+
#|Col_1|Col_2|Col_3|
#+-----+-----+-----+
#| Ra| hu| l|
#| Ra| vi| |
#| Ra| gh| u|
#| Ro| me| o|
#+-----+-----+-----+
Try this:
import pyspark.sql.functions as F
tst = sqlContext.createDataFrame([("Raghu", 1), ("Ravi", 2), ("Rahul", 3)], schema=["Name", "val"])

def fn(split, max_n, tst):
    for i in range(max_n):
        tst_loop = tst.withColumn("coln" + str(i), F.substring(F.col("Name"), (i * split) + 1, split))
        tst = tst_loop
    return tst

tst_res = fn(3, 2, tst)
The for loop could also be replaced by a list comprehension or reduce, but I felt a for loop looked neater in your case; they have the same physical plan anyway (a sketch of the list-comprehension variant follows the results below).
The results
+-----+---+-----+-----+
| Name|val|coln0|coln1|
+-----+---+-----+-----+
|Raghu| 1| Rag| hu|
| Ravi| 2| Rav| i|
|Rahul| 3| Rah| ul|
+-----+---+-----+-----+
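For reference, the list-comprehension variant mentioned above might look like this (a rough sketch, assuming the same tst DataFrame; fn_select is just an illustrative name):
import pyspark.sql.functions as F

def fn_select(split, max_n, tst):
    # build all substring columns in a single select instead of chained withColumn calls
    cols = [F.substring(F.col("Name"), i * split + 1, split).alias("coln" + str(i))
            for i in range(max_n)]
    return tst.select("*", *cols)

tst_res = fn_select(3, 2, tst)
tst_res.show()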
Try this:
import pyspark.sql.functions as f

def split(data, length, maxSplit):
    start = 1
    for i in range(0, maxSplit):
        data = data.withColumn(f'col_{start}-{start+length-1}', f.substring('channel', start, length))
        start = start + length
    return data
df = split(data,3,2)
df.show()
+--------+----+-------+-------+
| channel|type|col_1-3|col_4-6|
+--------+----+-------+-------+
| web| 0| web| |
| web| 1| web| |
| web| 2| web| |
| twitter| 0| twi| tte|
| twitter| 1| twi| tte|
|facebook| 0| fac| ebo|
|facebook| 1| fac| ebo|
|facebook| 2| fac| ebo|
+--------+----+-------+-------+
Perhaps this is useful.
Load the test data.
Note: written in Scala.
val Length = 2
val Maxsplit = 3
val df = Seq("Rahul", "Ravi", "Raghu", "Romeo").toDF("Names")
df.show(false)
/**
* +-----+
* |Names|
* +-----+
* |Rahul|
* |Ravi |
* |Raghu|
* |Romeo|
* +-----+
*/
split the string col as per the length and offset
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = StructType(Range(1, Maxsplit + 1).map(f => StructField(s"Col_$f", StringType)))

val split = udf((str: String, length: Int, maxSplit: Int) => {
  val splits = str.toCharArray.grouped(length).map(_.mkString).toArray
  RowFactory.create(splits ++ Array.fill(maxSplit - splits.length)(null): _*)
}, schema)
val p = df
  .withColumn("x", split($"Names", lit(Length), lit(Maxsplit)))
  .selectExpr("x.*")
p.show(false)
p.printSchema()
/**
* +-----+-----+-----+
* |Col_1|Col_2|Col_3|
* +-----+-----+-----+
* |Ra |hu |l |
* |Ra |vi |null |
* |Ra |gh |u |
* |Ro |me |o |
* +-----+-----+-----+
*
* root
* |-- Col_1: string (nullable = true)
* |-- Col_2: string (nullable = true)
* |-- Col_3: string (nullable = true)
*/
Dataset[Row] -> Dataset[Array[String]]
val x = df.map(r => {
  val splits = r.getString(0).toCharArray.grouped(Length).map(_.mkString).toArray
  splits ++ Array.fill(Maxsplit - splits.length)(null)
})
x.show(false)
x.printSchema()
/**
* +-----------+
* |value |
* +-----------+
* |[Ra, hu, l]|
* |[Ra, vi,] |
* |[Ra, gh, u]|
* |[Ro, me, o]|
* +-----------+
*
* root
* |-- value: array (nullable = true)
* | |-- element: string (containsNull = true)
*/
I have two tables: a p_to_v mapping and a g_to_v mapping.
scala> val p_to_v = Seq(("p1", "v1"), ("p1", "v2"), ("p2", "v1")).toDF("p", "v")
scala> p_to_v.show
+---+---+
| p| v|
+---+---+
| p1| v1|
| p1| v2|
| p2| v1|
+---+---+
"p1" is mapped to [v1, v2]
"p2" is mapped to [v1]
scala> val g_to_v = Seq(("g1", "v1"), ("g2", "v1"), ("g2", "v2"), ("g3", "v2")).toDF("g", "v")
scala> g_to_v.show
+---+---+
| g| v|
+---+---+
| g1| v1|
| g2| v1|
| g2| v2|
| g3| v2|
+---+---+
"g1" is mapped to [v1]
"g2" is mapped to [v1,v2]
"g3" is mapped to [v2]
I want to get all p and g combinations for which the v mapping of p is a subset of the v mapping of g:
p1 [v1, v2] - g2 [v1, v2]
p2 [v1] - g1 [v1]
p2 [v1] - g2 [v1, v2]
How can I achieve this?
This is pretty straightforward. You need to use groupBy and then a simple inner join.
scala> val p_to_v = Seq(("p1", "v1"), ("p1", "v2"), ("p2", "v1")).toDF("p", "v")
p_to_v: org.apache.spark.sql.DataFrame = [p: string, v: string]
scala> val g_to_v = Seq(("g1", "v1"), ("g2", "v1"), ("g2", "v2"), ("g3", "v2")).toDF("g", "v")
g_to_v: org.apache.spark.sql.DataFrame = [g: string, v: string]
Now do the groupBy operation:
scala> val pv = p_to_v.groupBy($"p").agg(collect_list("v").as("pv"))
pv: org.apache.spark.sql.DataFrame = [p: string, pv: array<string>]
scala> val gv = g_to_v.groupBy($"g").agg(collect_list("v").as("gv"))
gv: org.apache.spark.sql.DataFrame = [g: string, gv: array<string>]
scala> pv.show
+---+--------+
| p| pv|
+---+--------+
| p2| [v1]|
| p1|[v1, v2]|
+---+--------+
scala> gv.show
+---+--------+
| g| gv|
+---+--------+
| g2|[v2, v1]|
| g3| [v2]|
| g1| [v1]|
+---+--------+
Create a UDF for finding the subset:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
def subLis(ar1: Seq[Row], ar2: Seq[Row]) = ar1.toSet.subsetOf(ar2.toSet)
subLis: (ar1: Seq[org.apache.spark.sql.Row], ar2: Seq[org.apache.spark.sql.Row])Boolean
val subLisUDF = udf(subLis _)
UserDefinedFunction(<function2>,BooleanType,None)
Now you can perform a cross join & apply the UDF
spark.conf.set("spark.sql.crossJoin.enabled", "true")
pv.join(gv).withColumn("newdsa", subLisUDF($"pv", $"gv")).filter($"newdsa").show
+---+--------+---+--------+------+
| p| pv| g| gv|newdsa|
+---+--------+---+--------+------+
| p2| [v1]| g2|[v2, v1]| true|
| p1|[v1, v2]| g2|[v2, v1]| true|
| p2| [v1]| g1| [v1]| true|
+---+--------+---+--------+------+
Or join with a condition:
pv.join(gv, pv("pv") === gv("gv") || subLisUDF($"pv", $"gv")).show
+---+--------+---+--------+
| p| pv| g| gv|
+---+--------+---+--------+
| p2| [v1]| g2|[v1, v2]|
| p1|[v1, v2]| g2|[v1, v2]|
| p2| [v1]| g1| [v1]|
+---+--------+---+--------+
Try both and take the best performing one.
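For what it's worth, on Spark 2.4+ the UDF can be avoided entirely with the built-in array functions. A rough PySpark sketch of the same subset check (assuming the same two tables are loaded as PySpark DataFrames p_to_v and g_to_v):
from pyspark.sql import functions as F

pv = p_to_v.groupBy("p").agg(F.collect_set("v").alias("pv"))
gv = g_to_v.groupBy("g").agg(F.collect_set("v").alias("gv"))

# pv is a subset of gv exactly when removing gv's elements from pv leaves nothing
result = pv.crossJoin(gv).filter(F.size(F.array_except("pv", "gv")) == 0)
result.show()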
I have a PySpark DataFrame with this structure:
root
|-- Id: string (nullable = true)
|-- Q: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- pr: string (nullable = true)
| | |-- qt: double (nullable = true)
Something similar to:
+----+--------------------- ... --+
| Id | Q |
+----+---------------------- ... -+
| 001| [ [pr1,1.9], [pr3,2.0]...] |
| 002| [ [pr2,1.0], [pr9,3.9]...] |
| 003| [ [pr2,9.0], ... ] |
...
I would like to convert the Q array into columns (with pr as the column name and qt as the value).
I would also like to avoid duplicated columns by merging (adding) the same columns.
+----+-----+-----+------+ ... ----+
| Id | pr1 | pr2 | pr3 | ... prn |
+----+-----+-----+------+ ... ----+
| 001| 1.9 | 0.0 | 2.0 | ... |
| 002| 0.0 | 1.0 | 0 | ... |
| 003| 0.0 | 9.0 | ... | ... |
...
How can I perform this transformation?
Thank you in advance!
Julián.
You can do this with a combination of explode and pivot:
import pyspark.sql.functions as F
# explode to get "long" format
df=df.withColumn('exploded', F.explode('Q'))
# get the name and the value in separate columns
df=df.withColumn('name', F.col('exploded').getField('pr'))
df=df.withColumn('value', F.col('exploded').getField('qt'))
# now pivot
df.groupby('Id').pivot('name').agg(F.max('value')).na.fill(0)
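One optional refinement: if you already know the full list of pr names, passing them to pivot saves Spark an extra pass to compute the distinct pivot values (a sketch; the pr_values list below is a made-up placeholder):
import pyspark.sql.functions as F

# hypothetical list of pivot values; replace with your real pr names
pr_values = ["pr1", "pr2", "pr3", "pr9"]

result = (df.withColumn("exploded", F.explode("Q"))
            .withColumn("name", F.col("exploded").getField("pr"))
            .withColumn("value", F.col("exploded").getField("qt"))
            .groupby("Id")
            .pivot("name", pr_values)
            .agg(F.max("value"))
            .na.fill(0))
result.show()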
Very interesting question. This is how I approached it.
test.csv
001,pr1:0.9,pr3:1.2,pr2:2.0
002,pr3:5.2,pr4:0.99
Pyspark
file = sc.textFile("file:///test2.csv")
# get it in (key, value) pairs
# [(u'001', u'pr1:0.9') ...]
rdd1 = file.map(lambda r: r.split(",", 1)).map(lambda r: (r[0], r[1])).flatMapValues(lambda r: r.split(','))
# create a DF with 3 columns
# [(u'001', u'pr1', u'0.9') ...]
+---+---+----+
| _1| _2| _3|
+---+---+----+
|001|pr1| 0.9|
|001|pr3| 1.2|
|001|pr2| 2.0|
|002|pr3| 5.2|
|002|pr4|0.99|
+---+---+----+
rdd2 = rdd1.map(lambda r: (r[0],r[1].split(":"))).map(lambda r: (r[0],r[1][0],r[1][1]))
df = rdd2.toDF()
# perform the magic
from pyspark.sql.functions import expr
df.groupBy("_1").pivot("_2").agg(expr("coalesce(first(_3),0)"))
+---+---+---+---+----+
| _1|pr1|pr2|pr3| pr4|
+---+---+---+---+----+
|001|0.9|2.0|1.2| 0|
|002| 0| 0|5.2|0.99|
+---+---+---+---+----+
I would like to include null values in an Apache Spark join. By default, Spark does not include rows with null join keys.
Here is the default Spark behavior.
val numbersDf = Seq(
("123"),
("456"),
(null),
("")
).toDF("numbers")
val lettersDf = Seq(
("123", "abc"),
("456", "def"),
(null, "zzz"),
("", "hhh")
).toDF("numbers", "letters")
val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
Here is the output of joinedDf.show():
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
+-------+-------+
This is the output I would like:
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| | hhh|
| null| zzz|
+-------+-------+
Spark provides a special NULL safe equality operator:
numbersDf
.join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
.drop(lettersDf("numbers"))
+-------+-------+
|numbers|letters|
+-------+-------+
| 123| abc|
| 456| def|
| null| zzz|
| | hhh|
+-------+-------+
Be careful not to use it with Spark 1.5 or earlier. Prior to Spark 1.6 it required a Cartesian product (SPARK-11111 - Fast null-safe join).
In Spark 2.3.0 or later you can use Column.eqNullSafe in PySpark:
numbers_df = sc.parallelize([
("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])
letters_df = sc.parallelize([
("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])
numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))
+-------+-------+-------+
|numbers|numbers|letters|
+-------+-------+-------+
| 456| 456| def|
| null| null| zzz|
| | | hhh|
| 123| 123| abc|
+-------+-------+-------+
and %<=>% in SparkR:
numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
numbers = c("123", "456", NA, ""),
letters = c("abc", "def", "zzz", "hhh")
))
head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))
numbers numbers letters
1 456 456 def
2 <NA> <NA> zzz
3 hhh
4 123 123 abc
With SQL (Spark 2.2.0+) you can use IS NOT DISTINCT FROM:
SELECT * FROM numbers JOIN letters
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
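To run that SQL from PySpark, the two DataFrames first have to be registered as temp views; a minimal sketch reusing numbers_df and letters_df from the PySpark example above:
numbers_df.createOrReplaceTempView("numbers")
letters_df.createOrReplaceTempView("letters")

spark.sql("""
    SELECT * FROM numbers JOIN letters
    ON numbers.numbers IS NOT DISTINCT FROM letters.numbers
""").show()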
This can be used with the DataFrame API as well:
numbersDf.alias("numbers")
.join(lettersDf.alias("letters"))
.where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")
val numbers2 = numbersDf.withColumnRenamed("numbers","num1") //rename columns so that we can disambiguate them in the join
val letters2 = lettersDf.withColumnRenamed("numbers","num2")
val joinedDf = numbers2.join(letters2, $"num1" === $"num2" || ($"num1".isNull && $"num2".isNull) ,"outer")
joinedDf.select("num1","letters").withColumnRenamed("num1","numbers").show //rename the columns back to the original names
Based on K L's idea, you could use foldLeft to generate join column expression:
def nullSafeJoin(rightDF: DataFrame, columns: Seq[String], joinType: String)(leftDF: DataFrame): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  leftDF.join(rightDF, fullExpr, joinType)
}
Then you can call this function like:
aDF.transform(nullSafeJoin(bDF, columns, joinType))
Complementing the other answers: for PySpark < 2.3.0 you have neither Column.eqNullSafe nor IS NOT DISTINCT FROM.
You can still build the <=> operator with a SQL expression and include it in the join, as long as you define aliases for the joined DataFrames:
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
numbers_df = spark.createDataFrame(["123", "456", None, ""], StringType()).toDF("numbers")
letters_df = spark.createDataFrame([("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")]) \
    .toDF("numbers", "letters")

joined_df = numbers_df.alias("numbers") \
    .join(letters_df.alias("letters"), F.expr('numbers.numbers <=> letters.numbers')) \
    .select('letters.*')

joined_df.show()
joined_df.show()
+-------+-------+
|numbers|letters|
+-------+-------+
| 456| def|
| null| zzz|
| | hhh|
| 123| abc|
+-------+-------+
Based on timothyzhang's idea one can further improve it by removing duplicate columns:
def dropDuplicateColumns(df: DataFrame, rightDf: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df)((df, c) => df.drop(rightDf(c)))

def joinTablesWithSafeNulls(rightDF: DataFrame, leftDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  val colExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  val fullExpr = columns.tail.foldLeft(colExpr) {
    (colExpr, p) => colExpr && leftDF(p) <=> rightDF(p)
  }
  val finalDF = leftDF.join(rightDF, fullExpr, joinType)
  val filteredDF = dropDuplicateColumns(finalDF, rightDF, columns)
  filteredDF
}
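A comparable helper on the PySpark side might look like this (a rough sketch, assuming Spark 2.3+ for eqNullSafe; null_safe_join is just an illustrative name):
from functools import reduce

def null_safe_join(left_df, right_df, columns, join_type="inner"):
    # AND together one null-safe equality test per join column
    cond = reduce(lambda acc, c: acc & left_df[c].eqNullSafe(right_df[c]),
                  columns[1:],
                  left_df[columns[0]].eqNullSafe(right_df[columns[0]]))
    joined = left_df.join(right_df, cond, join_type)
    # drop the duplicated right-hand copies of the join columns
    for c in columns:
        joined = joined.drop(right_df[c])
    return joined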
Try the following method to include the null rows in the result of the JOIN operator:
def nullSafeJoin(leftDF: DataFrame, rightDF: DataFrame, columns: Seq[String], joinType: String): DataFrame = {
  var columnsExpr: Column = leftDF(columns.head) <=> rightDF(columns.head)
  columns.drop(1).foreach(column => {
    columnsExpr = columnsExpr && (leftDF(column) <=> rightDF(column))
  })
  var joinedDF: DataFrame = leftDF.join(rightDF, columnsExpr, joinType)
  columns.foreach(column => {
    joinedDF = joinedDF.drop(leftDF(column))
  })
  joinedDF
}