Divide function in pyspark - numpy

Suppose I have this dataframe on PySpark:
df = spark.createDataFrame([
['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
['red', 'banana', 7, 70], ['red', 'grape', 8, 80]], schema=['color', 'fruit', 'v1', 'v2'])
I want to create a function that takes column v2 divided by column v1, with the condition:
import numpy as np
from pyspark.sql.functions import pandas_udf
#pandas_udf('long', PandasUDFType.SCALAR)
def pandas_div(a,b):
    if b == 0:
        return np.nan
    else:
        return (a/b)
However, the result turns out to be this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The output that I want should be like this:
+---------+---------+---+
|color_new|fruit_new|div|
+---------+---------+---+
| red| banana|10 |
| blue| banana|20 |
| red| carrot|30 |
| blue| grape|40 |
| red| carrot|50 |
| black| carrot|60 |
| red| banana|70 |
| red| grape|80 |
+---------+---------+---+

All you need is when and otherwise; no UDF is required. See the example below.
# Create data frame
df = spark.createDataFrame([
['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
['red', 'banana', 7, 70], ['red', 'grape', 8, 80], ['orange', 'grapefruit', 0, 100]], schema=['color', 'fruit', 'v1', 'v2'])
# display result
df.show()
+------+----------+---+---+
| color| fruit| v1| v2|
+------+----------+---+---+
| red| banana| 1| 10|
| blue| banana| 2| 20|
| red| carrot| 3| 30|
| blue| grape| 4| 40|
| red| carrot| 5| 50|
| black| carrot| 6| 60|
| red| banana| 7| 70|
| red| grape| 8| 80|
|orange|grapefruit| 0|100|
+------+----------+---+---+
# Import functions
import pyspark.sql.functions as f
# apply case when
df1 = df.withColumn("divide", f.when(f.col("v1") == 0, None).otherwise(f.col("v2") / f.col("v1")))
# display result
df1.show()
+------+----------+---+---+------+
| color| fruit| v1| v2|divide|
+------+----------+---+---+------+
| red| banana| 1| 10| 10.0|
| blue| banana| 2| 20| 10.0|
| red| carrot| 3| 30| 10.0|
| blue| grape| 4| 40| 10.0|
| red| carrot| 5| 50| 10.0|
| black| carrot| 6| 60| 10.0|
| red| banana| 7| 70| 10.0|
| red| grape| 8| 80| 10.0|
|orange|grapefruit| 0|100| null|
+------+----------+---+---+------+
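For reference, the rule that the when/otherwise expression applies to each row can be sketched in plain Python. This is only an illustration that runs without Spark: `safe_div` is a hypothetical helper name, and `None` stands in for Spark's null.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


# (v1, v2) pairs taken from a few rows of the sample data frame above.
rows = [(1, 10), (2, 20), (3, 30), (0, 100)]

# Divide v2 by v1 row by row, as the "divide" column does.
divide = [safe_div(v2, v1) for v1, v2 in rows]
print(divide)
```

The original pandas_div fails because `b == 0` is evaluated on a whole Series at once, so an `if` on it has no single truth value; the column expression avoids this by evaluating the condition per row.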

Related

Pyspark: Drop/Filter rows based on Summing of columns and Rank

I have a dataframe like this:
df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11"],
"Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
"Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
"Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
"Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})
df = spark.createDataFrame(df)
+----------+-----------+-----------+-----------+----+
| Date|Slot_Length|Total_Space|Amount_Over|Rank|
+----------+-----------+-----------+-----------+----+
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 2|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 2|
|2020-05-11| 30| 120| -60| 2|
+----------+-----------+-----------+-----------+----+
For each Date I have a Total_Space that can be filled. So for 2020-05-10, I have 60 seconds, and for 2020-05-11 I have 120 seconds.
Each Date also already has assigned slots with a certain Slot_Length.
For each Date I have already calculated the amount of space that Date is over in the Amount_Over column and have ranked them appropriately based on a priority column not shown here.
What I would like to do is to drop the rows with lowest Rank for a Date until the Slot_Lengths add up to the Total_Space for a Date.
+----------+-----------+-----------+-----------+----+
| Date|Slot_Length|Total_Space|Amount_Over|Rank|
+----------+-----------+-----------+-----------+----+
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
+----------+-----------+-----------+-----------+----+
In this example, it is as easy as dropping all Rank equal to 2, but there will be examples where there is a tie between ranks, so first take the highest ranks, and then take a random one if there is a tie.
What is the best way to do this? I already understand it will need a Window function over the Date to do each calculation over the Slot_Length, Total_Space, and Amount_Over columns correctly.
import pandas as pd
from pyspark.sql import Window, functions as F
df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11",
"2020-05-11", "2020-05-11", "2020-05-11"],
"Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
"Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
"Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
"Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})
df = spark.createDataFrame(df)
w = Window.partitionBy("Date").orderBy("Rank").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn(
"Cumulative_Sum", F.sum("Slot_Length").over(w)
).filter(
F.col("Cumulative_Sum") <= F.col("Total_Space")
).orderBy("Date","Rank","Cumulative_Sum").show()
which results in:
+----------+-----------+-----------+-----------+----+--------------+
| Date|Slot_Length|Total_Space|Amount_Over|Rank|Cumulative_Sum|
+----------+-----------+-----------+-----------+----+--------------+
|2020-05-10| 30| 60| -30| 1| 30|
|2020-05-10| 30| 60| -30| 1| 60|
|2020-05-11| 30| 120| -60| 1| 30|
|2020-05-11| 30| 120| -60| 1| 60|
|2020-05-11| 30| 120| -60| 1| 90|
|2020-05-11| 30| 120| -60| 1| 120|
+----------+-----------+-----------+-----------+----+--------------+
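The window logic above (running sum of Slot_Length per Date, ordered by Rank, keeping rows whose running sum fits in Total_Space) can be sketched in plain Python to check the expected rows without a Spark session. `keep_within_space` is a hypothetical name used only for this illustration.

```python
from itertools import accumulate, groupby

# (Date, Slot_Length, Total_Space, Rank) tuples from the example above.
rows = [
    ("2020-05-10", 30, 60, 1), ("2020-05-10", 30, 60, 1), ("2020-05-10", 30, 60, 2),
    ("2020-05-11", 30, 120, 1), ("2020-05-11", 30, 120, 1), ("2020-05-11", 30, 120, 1),
    ("2020-05-11", 30, 120, 1), ("2020-05-11", 30, 120, 2), ("2020-05-11", 30, 120, 2),
]


def keep_within_space(rows):
    """Keep rows whose running Slot_Length total per Date stays within Total_Space."""
    kept = []
    # Mirror partitionBy("Date").orderBy("Rank").
    ordered = sorted(rows, key=lambda r: (r[0], r[3]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        running = accumulate(r[1] for r in group)  # cumulative Slot_Length
        kept.extend(r for r, total in zip(group, running) if total <= r[2])
    return kept


kept = keep_within_space(rows)
print(len(kept))
```

Note that Python's sort is stable, so tied ranks keep their input order here; in Spark the order among tied ranks is not guaranteed unless another column is added to the ordering.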

How to compute a cumulative sum under a limit with Spark?

After several tries and some research, I'm stuck on trying to solve the following problem with Spark.
I have a Dataframe of elements with a priority and a quantity.
+------+-------+--------+---+
|family|element|priority|qty|
+------+-------+--------+---+
| f1| elmt 1| 1| 20|
| f1| elmt 2| 2| 40|
| f1| elmt 3| 3| 10|
| f1| elmt 4| 4| 50|
| f1| elmt 5| 5| 40|
| f1| elmt 6| 6| 10|
| f1| elmt 7| 7| 20|
| f1| elmt 8| 8| 10|
+------+-------+--------+---+
I have a fixed limit quantity :
+------+--------+
|family|limitQty|
+------+--------+
| f1| 100|
+------+--------+
I want to mark as "ok" the elements whose cumulative sum stays within the limit. Here is the expected result:
+------+-------+--------+---+---+
|family|element|priority|qty| ok|
+------+-------+--------+---+---+
| f1| elmt 1| 1| 20| 1| -> 20 < 100 => ok
| f1| elmt 2| 2| 40| 1| -> 20 + 40 < 100 => ok
| f1| elmt 3| 3| 10| 1| -> 20 + 40 + 10 < 100 => ok
| f1| elmt 4| 4| 50| 0| -> 20 + 40 + 10 + 50 > 100 => ko
| f1| elmt 5| 5| 40| 0| -> 20 + 40 + 10 + 40 > 100 => ko
| f1| elmt 6| 6| 10| 1| -> 20 + 40 + 10 + 10 < 100 => ok
| f1| elmt 7| 7| 20| 1| -> 20 + 40 + 10 + 10 + 20 <= 100 => ok
| f1| elmt 8| 8| 10| 0| -> 20 + 40 + 10 + 10 + 20 + 10 > 100 => ko
+------+-------+--------+---+---+
I tried to solve it with a cumulative sum:
initDF
  .join(limitQtyDF, Seq("family"), "left_outer")
  .withColumn("cumulSum", sum($"qty").over(Window.partitionBy("family").orderBy("priority")))
  .withColumn("ok", when($"cumulSum" <= $"limitQty", 1).otherwise(0))
  .drop("cumulSum", "limitQty")
But this is not enough: a rejected element still counts toward the running sum, so the elements after the one that exceeds the limit are not taken into account even when they would fit.
I can't find a way to solve this with Spark. Do you have an idea?
Here is the corresponding Scala code :
val sparkSession = SparkSession.builder()
.master("local[*]")
.getOrCreate()
import sparkSession.implicits._
val initDF = Seq(
("f1", "elmt 1", 1, 20),
("f1", "elmt 2", 2, 40),
("f1", "elmt 3", 3, 10),
("f1", "elmt 4", 4, 50),
("f1", "elmt 5", 5, 40),
("f1", "elmt 6", 6, 10),
("f1", "elmt 7", 7, 20),
("f1", "elmt 8", 8, 10)
).toDF("family", "element", "priority", "qty")
val limitQtyDF = Seq(("f1", 100)).toDF("family", "limitQty")
val expectedDF = Seq(
("f1", "elmt 1", 1, 20, 1),
("f1", "elmt 2", 2, 40, 1),
("f1", "elmt 3", 3, 10, 1),
("f1", "elmt 4", 4, 50, 0),
("f1", "elmt 5", 5, 40, 0),
("f1", "elmt 6", 6, 10, 1),
("f1", "elmt 7", 7, 20, 1),
("f1", "elmt 8", 8, 10, 0)
).toDF("family", "element", "priority", "qty", "ok")
expectedDF.show()
Thank you for your help !
The solution is shown below:
scala> initDF.show
+------+-------+--------+---+
|family|element|priority|qty|
+------+-------+--------+---+
| f1| elmt 1| 1| 20|
| f1| elmt 2| 2| 40|
| f1| elmt 3| 3| 10|
| f1| elmt 4| 4| 50|
| f1| elmt 5| 5| 40|
| f1| elmt 6| 6| 10|
| f1| elmt 7| 7| 20|
| f1| elmt 8| 8| 10|
+------+-------+--------+---+
scala> val df1 = initDF.groupBy("family").agg(collect_list("qty").as("comb_qty"), collect_list("priority").as("comb_prior"), collect_list("element").as("comb_elem"))
df1: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 2 more fields]
scala> df1.show
+------+--------------------+--------------------+--------------------+
|family| comb_qty| comb_prior| comb_elem|
+------+--------------------+--------------------+--------------------+
| f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...|
+------+--------------------+--------------------+--------------------+
scala> val df2 = df1.join(limitQtyDF, df1("family") === limitQtyDF("family")).drop(limitQtyDF("family"))
df2: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 3 more fields]
scala> df2.show
+------+--------------------+--------------------+--------------------+--------+
|family| comb_qty| comb_prior| comb_elem|limitQty|
+------+--------------------+--------------------+--------------------+--------+
| f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...| 100|
+------+--------------------+--------------------+--------------------+--------+
scala> def validCheck = (qty: Seq[Int], limit: Int) => {
| var sum = 0
| qty.map(elem => {
| if (elem + sum <= limit) {
| sum = sum + elem
| 1}else{
| 0
| }})}
validCheck: (scala.collection.mutable.Seq[Int], Int) => scala.collection.mutable.Seq[Int]
scala> val newUdf = udf(validCheck)
newUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(IntegerType,false),Some(List(ArrayType(IntegerType,false), IntegerType)))
scala> val df3 = df2.withColumn("valid", newUdf(col("comb_qty"),col("limitQty"))).drop("limitQty")
df3: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 3 more fields]
scala> df3.show
+------+--------------------+--------------------+--------------------+--------------------+
|family| comb_qty| comb_prior| comb_elem| valid|
+------+--------------------+--------------------+--------------------+--------------------+
| f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...|[1, 1, 1, 0, 0, 1...|
+------+--------------------+--------------------+--------------------+--------------------+
scala> val myUdf = udf((qty: Seq[Int], prior: Seq[Int], elem: Seq[String], valid: Seq[Int]) => {
| elem zip prior zip qty zip valid map{
| case (((a,b),c),d) => (a,b,c,d)}
| }
| )
scala> val df4 = df3.withColumn("combined", myUdf(col("comb_qty"),col("comb_prior"),col("comb_elem"),col("valid")))
df4: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 4 more fields]
scala> val df5 = df4.drop("comb_qty","comb_prior","comb_elem","valid")
df5: org.apache.spark.sql.DataFrame = [family: string, combined: array<struct<_1:string,_2:int,_3:int,_4:int>>]
scala> df5.show(false)
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|family|combined |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|f1 |[[elmt 1, 1, 20, 1], [elmt 2, 2, 40, 1], [elmt 3, 3, 10, 1], [elmt 4, 4, 50, 0], [elmt 5, 5, 40, 0], [elmt 6, 6, 10, 1], [elmt 7, 7, 20, 1], [elmt 8, 8, 10, 0]]|
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
scala> val df6 = df5.withColumn("combined",explode(col("combined")))
df6: org.apache.spark.sql.DataFrame = [family: string, combined: struct<_1: string, _2: int ... 2 more fields>]
scala> df6.show
+------+------------------+
|family| combined|
+------+------------------+
| f1|[elmt 1, 1, 20, 1]|
| f1|[elmt 2, 2, 40, 1]|
| f1|[elmt 3, 3, 10, 1]|
| f1|[elmt 4, 4, 50, 0]|
| f1|[elmt 5, 5, 40, 0]|
| f1|[elmt 6, 6, 10, 1]|
| f1|[elmt 7, 7, 20, 1]|
| f1|[elmt 8, 8, 10, 0]|
+------+------------------+
scala> val df7 = df6.select("family", "combined._1", "combined._2", "combined._3", "combined._4").withColumnRenamed("_1","element").withColumnRenamed("_2","priority").withColumnRenamed("_3", "qty").withColumnRenamed("_4","ok")
df7: org.apache.spark.sql.DataFrame = [family: string, element: string ... 3 more fields]
scala> df7.show
+------+-------+--------+---+---+
|family|element|priority|qty| ok|
+------+-------+--------+---+---+
| f1| elmt 1| 1| 20| 1|
| f1| elmt 2| 2| 40| 1|
| f1| elmt 3| 3| 10| 1|
| f1| elmt 4| 4| 50| 0|
| f1| elmt 5| 5| 40| 0|
| f1| elmt 6| 6| 10| 1|
| f1| elmt 7| 7| 20| 1|
| f1| elmt 8| 8| 10| 0|
+------+-------+--------+---+---+
Let me know if it helps!!
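The difference between the plain cumulative sum from the question and the check done inside validCheck can be sketched in plain Python. Both function names below are hypothetical and exist only for this illustration; the data is the qty column from the example.

```python
def naive_flags(quantities, limit):
    """Plain cumulative sum: rejected elements still count toward the total."""
    flags, total = [], 0
    for qty in quantities:
        total += qty
        flags.append(1 if total <= limit else 0)
    return flags


def greedy_flags(quantities, limit):
    """Add an element to the total only when it still fits under the limit."""
    flags, total = [], 0
    for qty in quantities:
        if total + qty <= limit:
            total += qty
            flags.append(1)
        else:
            flags.append(0)
    return flags


qty = [20, 40, 10, 50, 40, 10, 20, 10]  # ordered by priority
print(naive_flags(qty, 100))
print(greedy_flags(qty, 100))
```

The second function carries state (the accepted total) from one element to the next, which is why it cannot be written as a simple windowed sum and the answer falls back to a UDF over the collected list.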
Another way is to iterate row by row on the driver. Note that df.collect pulls all rows into driver memory.
var bufferRow: collection.mutable.Buffer[Row] = collection.mutable.Buffer.empty[Row]
var tempSum: Double = 0
val iterator = df.collect.iterator
while(iterator.hasNext){
val record = iterator.next()
val y = record.getAs[Integer]("qty")
tempSum = tempSum + y
print(record)
if (tempSum <= 100.0 ) {
bufferRow = bufferRow ++ Seq(transformRow(record,1))
}
else{
bufferRow = bufferRow ++ Seq(transformRow(record,0))
tempSum = tempSum - y
}
}
Define the transformRow function, which appends a column to a row.
def transformRow(row: Row,flag : Int): Row = Row.fromSeq(row.toSeq ++ Array[Integer](flag))
Next, add the extra column to the schema.
val newSchema = StructType(df.schema.fields ++ Array(StructField("C_Sum", IntegerType, false)))
Then create a new dataframe.
val outputdf = spark.createDataFrame(spark.sparkContext.parallelize(bufferRow.toSeq),newSchema)
Output Dataframe :
+------+-------+--------+---+-----+
|family|element|priority|qty|C_Sum|
+------+-------+--------+---+-----+
| f1| elmt1| 1| 20| 1|
| f1| elmt2| 2| 40| 1|
| f1| elmt3| 3| 10| 1|
| f1| elmt4| 4| 50| 0|
| f1| elmt5| 5| 40| 0|
| f1| elmt6| 6| 10| 1|
| f1| elmt7| 7| 20| 1|
| f1| elmt8| 8| 10| 0|
+------+-------+--------+---+-----+
I am new to Spark so this solution may not be optimal. I am assuming the value of 100 is an input to the program here. In that case:
case class Frame(family:String, element : String, priority : Int, qty :Int)
import scala.collection.JavaConverters._
val ans = df.as[Frame].toLocalIterator
.asScala
.foldLeft((Seq.empty[Int],0))((acc,a) =>
if(acc._2 + a.qty <= 100) (acc._1 :+ a.priority, acc._2 + a.qty) else acc)._1
df.withColumn("OK" , when($"priority".isin(ans :_*), 1).otherwise(0)).show
results in:
+------+-------+--------+---+--------+
|family|element|priority|qty|OK |
+------+-------+--------+---+--------+
| f1| elmt 1| 1| 20| 1|
| f1| elmt 2| 2| 40| 1|
| f1| elmt 3| 3| 10| 1|
| f1| elmt 4| 4| 50| 0|
| f1| elmt 5| 5| 40| 0|
| f1| elmt 6| 6| 10| 1|
| f1| elmt 7| 7| 20| 1|
| f1| elmt 8| 8| 10| 0|
+------+-------+--------+---+--------+
The idea is simply to get a Scala iterator, extract the participating priority values from it, and then use those values to flag the participating rows. Because this solution gathers all the data in memory on one machine, it can run into memory problems if the dataframe is too large to fit in memory.
Cumulative sum for each group
import pyspark.sql.functions as f
from pyspark.sql.window import Window as window
from pyspark.sql.types import IntegerType,StringType,FloatType,StructType,StructField,DateType
schema = StructType() \
.add(StructField("empno",IntegerType(),True)) \
.add(StructField("ename",StringType(),True)) \
.add(StructField("job",StringType(),True)) \
.add(StructField("mgr",StringType(),True)) \
.add(StructField("hiredate",DateType(),True)) \
.add(StructField("sal",FloatType(),True)) \
.add(StructField("comm",StringType(),True)) \
.add(StructField("deptno",IntegerType(),True))
emp = spark.read.csv('data/emp.csv',schema)
dept_partition = window.partitionBy(emp.deptno).orderBy(emp.sal)
emp_win = emp.withColumn("dept_cum_sal",
f.sum(emp.sal).over(dept_partition.rowsBetween(window.unboundedPreceding, window.currentRow)))
emp_win.show()
The results look like this:
+-----+------+---------+----+----------+------+-------+------+------------+
|empno| ename| job| mgr| hiredate| sal| comm|deptno|dept_cum_sal|
+-----+------+---------+----+----------+------+-------+------+------------+
| 7369| SMITH| CLERK|7902|1980-12-17| 800.0| null| 20| 800.0|
| 7876| ADAMS| CLERK|7788|1983-01-12|1100.0| null| 20| 1900.0|
| 7566| JONES| MANAGER|7839|1981-04-02|2975.0| null| 20| 4875.0|
| 7788| SCOTT| ANALYST|7566|1982-12-09|3000.0| null| 20| 7875.0|
| 7902| FORD| ANALYST|7566|1981-12-03|3000.0| null| 20| 10875.0|
| 7934|MILLER| CLERK|7782|1982-01-23|1300.0| null| 10| 1300.0|
| 7782| CLARK| MANAGER|7839|1981-06-09|2450.0| null| 10| 3750.0|
| 7839| KING|PRESIDENT|null|1981-11-17|5000.0| null| 10| 8750.0|
| 7900| JAMES| CLERK|7698|1981-12-03| 950.0| null| 30| 950.0|
| 7521| WARD| SALESMAN|7698|1981-02-22|1250.0| 500.00| 30| 2200.0|
| 7654|MARTIN| SALESMAN|7698|1981-09-28|1250.0|1400.00| 30| 3450.0|
| 7844|TURNER| SALESMAN|7698|1981-09-08|1500.0| 0.00| 30| 4950.0|
| 7499| ALLEN| SALESMAN|7698|1981-02-20|1600.0| 300.00| 30| 6550.0|
| 7698| BLAKE| MANAGER|7839|1981-05-01|2850.0| null| 30| 9400.0|
+-----+------+---------+----+----------+------+-------+------+------------+
Please find the answer below.
val initDF = Seq(("f1", "elmt 1", 1, 20),("f1", "elmt 2", 2, 40),("f1", "elmt 3", 3, 10),
("f1", "elmt 4", 4, 50),
("f1", "elmt 5", 5, 40),
("f1", "elmt 6", 6, 10),
("f1", "elmt 7", 7, 20),
("f1", "elmt 8", 8, 10)
).toDF("family", "element", "priority", "qty")
val limitQtyDF = Seq(("f1", 100)).toDF("family", "limitQty")
sc.broadcast(limitQtyDF)
val joinedInitDF=initDF.join(limitQtyDF,Seq("family"),"left")
case class dataResult(family:String,element:String,priority:Int, qty:Int, comutedValue:Int, limitQty:Int,controlOut:String)
val familyIDs=initDF.select("family").distinct.collect.map(_(0).toString).toList
def checkingUDF(inputRows:List[Row])={
var controlVarQty=0
val outputArrayBuffer=collection.mutable.ArrayBuffer[dataResult]()
val setLimit=inputRows.head.getInt(4)
for(inputRow <- inputRows)
{
val currQty=inputRow.getInt(3)
//val outpurForRec=
controlVarQty + currQty match {
case value if value <= setLimit =>
controlVarQty+=currQty
outputArrayBuffer+=dataResult(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),value,setLimit,"ok")
case value =>
outputArrayBuffer+=dataResult(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),value,setLimit,"ko")
}
//outputArrayBuffer+=Row(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),controlVarQty+currQty,setLimit,outpurForRec)
}
outputArrayBuffer.toList
}
val tmpAB=collection.mutable.ArrayBuffer[List[dataResult]]()
for (familyID <- familyIDs) // val familyID="f1"
{
val currentFamily=joinedInitDF.filter(s"family = '${familyID}'").orderBy("element", "priority").collect.toList
tmpAB+=checkingUDF(currentFamily)
}
tmpAB.toSeq.flatMap(x => x).toDF.show(false)
This works for me.
+------+-------+--------+---+------------+--------+----------+
|family|element|priority|qty|comutedValue|limitQty|controlOut|
+------+-------+--------+---+------------+--------+----------+
|f1 |elmt 1 |1 |20 |20 |100 |ok |
|f1 |elmt 2 |2 |40 |60 |100 |ok |
|f1 |elmt 3 |3 |10 |70 |100 |ok |
|f1 |elmt 4 |4 |50 |120 |100 |ko |
|f1 |elmt 5 |5 |40 |110 |100 |ko |
|f1 |elmt 6 |6 |10 |80 |100 |ok |
|f1 |elmt 7 |7 |20 |100 |100 |ok |
|f1 |elmt 8 |8 |10 |110 |100 |ko |
+------+-------+--------+---+------------+--------+----------+
Please drop any unnecessary columns from the output.

Writing select queries on dataframe based on where condition from other dataframe, scala

I have two dataframes with the following columns..
DF1 - partitionNum, lowerBound, upperBound
DF2- ID, cumulativeCount
I want a resulting Frame which has - ID, partitionNum
I have done a cross join, shown below, which performs badly:
DF2.crossJoin(DF1).where(col("cumulativeCount").between(col("lowerBound"), col("upperBound"))).orderBy("cumulativeCount")
.select("ID", "partitionNum")
Since DF2 has 5 million rows and DF1 has 50 rows, this cross join yields 250 million rows and the task dies. How can I write this as a select where the resulting frame has ID from DF2 and partitionNum from DF1, selecting the partitionNum from DF1 where the cumulativeCount of DF2 is between the lowerBound and upperBound of DF1?
I am looking for something like the query below. Will this work?
sparkSession.sqlContext.sql("SELECT ID, cumulativeCount, A.partitionNum FROM CumulativeCountViewById WHERE cumulativeCount IN " +
"(SELECT partitionNum FROM CumulativeRangeView WHERE cumulativeCount BETWEEN lowerBound and upperBound) AS A")
Try this.
You don't need a cross join. Since your DF1 has only 50 rows, convert it to a map with key partitionNum and value Tuple2(lowerBound, upperBound).
Create a UDF that takes a number (your cumulativeCount) and checks it against the map, returning the entries whose keys (i.e., partitionNums) satisfy lowerBound < cumulativeCount < upperBound.
You may edit the UDF to return only the partition numbers, and explode the "partNums" array column at the end if you choose to.
scala> DF1.show
+------------+----------+----------+
|partitionNum|lowerBound|upperBound|
+------------+----------+----------+
| 1| 10| 20|
| 2| 5| 10|
| 3| 6| 15|
| 4| 8| 20|
+------------+----------+----------+
scala> DF2.show
+---+---------------+
| ID|cumulativeCount|
+---+---------------+
|100| 5|
|100| 10|
|100| 15|
|100| 20|
|100| 25|
|100| 30|
|100| 6|
|100| 12|
|100| 18|
|100| 24|
|101| 1|
|101| 2|
|101| 3|
|101| 4|
|101| 5|
|101| 6|
|101| 7|
|101| 8|
|101| 9|
|101| 10|
+---+---------------+
scala> val smallData = DF1.collect.map(row => row.getInt(0) -> (row.getInt(1), row.getInt(2))).toMap
smallData: scala.collection.immutable.Map[Int,(Int, Int)] = Map(1 -> (10,20), 2 -> (5,10), 3 -> (6,15), 4 -> (8,20))
scala> val myUdf = udf((num:Int) => smallData.filter((v) => v._2._2 > num && num > v._2._1))
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(IntegerType,StructType(StructField(_1,IntegerType,false), StructField(_2,IntegerType,false)),true),Some(List(IntegerType)))
scala> DF2.withColumn("partNums", myUdf($"cumulativeCount")).show(false)
+---+---------------+-------------------------------------------+
|ID |cumulativeCount|partNums |
+---+---------------+-------------------------------------------+
|100|5 |[] |
|100|10 |[3 -> [6, 15], 4 -> [8, 20]] |
|100|15 |[1 -> [10, 20], 4 -> [8, 20]] |
|100|20 |[] |
|100|25 |[] |
|100|30 |[] |
|100|6 |[2 -> [5, 10]] |
|100|12 |[1 -> [10, 20], 3 -> [6, 15], 4 -> [8, 20]]|
|100|18 |[1 -> [10, 20], 4 -> [8, 20]] |
|100|24 |[] |
|101|1 |[] |
|101|2 |[] |
|101|3 |[] |
|101|4 |[] |
|101|5 |[] |
|101|6 |[2 -> [5, 10]] |
|101|7 |[2 -> [5, 10], 3 -> [6, 15]] |
|101|8 |[2 -> [5, 10], 3 -> [6, 15]] |
|101|9 |[2 -> [5, 10], 3 -> [6, 15], 4 -> [8, 20]] |
|101|10 |[3 -> [6, 15], 4 -> [8, 20]] |
+---+---------------+-------------------------------------------+
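The lookup that the UDF performs against the small in-memory map can be sketched in plain Python as follows. `partitions_for` is a hypothetical name, and the ranges are the four rows of DF1 shown above; the bounds are strict, as in the answer's UDF.

```python
# partitionNum -> (lowerBound, upperBound), as in smallData above.
ranges = {1: (10, 20), 2: (5, 10), 3: (6, 15), 4: (8, 20)}


def partitions_for(count, ranges):
    """Return the partition numbers whose open interval contains count."""
    return sorted(p for p, (lower, upper) in ranges.items() if lower < count < upper)


for count in (5, 7, 10, 12):
    print(count, partitions_for(count, ranges))
```

Each lookup scans only the 50 or so ranges, so 5 million rows cost about 250 million cheap in-memory comparisons instead of 250 million materialized join rows. Since Spark's between is inclusive on both ends, the comparison would need `<=` to reproduce the original cross join exactly.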

Spark, SQL aggregation based on a second data set

I have two datasets (dataframes)
idPeersDS - which has an id column and its peers' ids.
infoDS - which has two type columns (type1, type2) and a metric column.
--
idPeersDS
+---+---------+
| id| peers|
+---+---------+
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
+---+---------+
infoDS
+---+-----+-----+------+
| id|type1|type2|metric|
+---+-----+-----+------+
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
+---+-----+-----+------+
I need to calculate the z-score of the metric for each id, grouped by type1 and type2. However, it is not the z-score over all metrics in the group; it is the z-score over the metrics of the id's peers within the group. If a peerId does not have a metric in the group, that peerId's metric is treated as 0.
example:
for group ("A", "X") and for id = 1, the peers are (1,2,3), the metrics for zscore will be (10, 0, 40); since id = 2 doesn't exist in group ("A","X") it is 0. id=5 is not a peer of id=1 so it is not part of zscore calculation.
+---+------+---------+-----+-----+
| id|metric| peers|type1|type2|
+---+------+---------+-----+-----+
| 1| 10.0|[1, 2, 3]| A| X|
| 3| 40.0|[3, 1, 2]| A| X|
| 5| 20.0|[5, 4, 6]| A| X|
+---+------+---------+-----+-----+
Z = (X - μ) / σ
Z = (10 - 16.66666) / 16.99673
Z = -0.39223
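The arithmetic above can be reproduced with a small plain-Python sketch. `peer_zscore` is a hypothetical helper name; it uses the population standard deviation and returns 0.0 when the standard deviation is 0, which is an assumption taken from the code further down.

```python
from math import sqrt


def peer_zscore(current_id, peers, metrics):
    """Z-score of current_id's metric against its peers' metrics in one group.

    Peers without a metric in the group count as 0.0.
    """
    values = [metrics.get(p, 0.0) for p in peers]
    mean = sum(values) / len(values)
    # Population standard deviation (divide by N, not N - 1).
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0.0:
        return 0.0
    return (metrics.get(current_id, 0.0) - mean) / sd


# id -> metric for group ("A", "X").
group_a_x = {1: 10.0, 3: 40.0, 5: 20.0}

z = peer_zscore(1, [1, 2, 3], group_a_x)
print(round(z, 5))
```

For id = 1 the peer metrics are (10, 0, 40), which gives the mean 16.66666 and standard deviation 16.99673 used above.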
The output should be the following table. A `peersmetrics` column (shown to the right of the table) would be enough instead of the `zScoreValue` column, since I can compute the z-score from it as my code does.
+---+------+---------+-----------+-----+-----+
| id|metric| peers|zScoreValue|type1|type2| peersmetrics
+---+------+---------+-----------+-----+-----+
| 1| 10.0|[1, 2, 3]| -0.39| A| X| [10, 0, 40]
| 3| 40.0|[3, 1, 2]| 1.37| A| X| [40, 10, 0]
| 5| 20.0|[5, 4, 6]| 1.41| A| X| [20, 0 , 0]
| 1| 40.0|[1, 2, 3]| 0.98| B| Y| [40, 30, 0]
| 2| 30.0|[2, 1, 6]| 0.27| B| Y| [30, 40, 10]
| 4| 10.0|[4, 5, 6]| 0.71| B| Y|
| 6| 10.0|[6, 1, 2]| -1.34| B| Y|
| 1| 30.0|[1, 2, 3]| 1.07| B| X|
| 2| 20.0|[2, 1, 6]| 0.27| B| X|
| 5| 30.0|[5, 4, 6]| 1.41| B| X|
| 1| 20.0|[1, 2, 3]| 1.22| A| Y|
| 2| 10.0|[2, 1, 6]| -1.07| A| Y|
| 6| 40.0|[6, 1, 2]| 1.34| A| Y|
+---+------+---------+-----------+-----+-----+
Edit 1: A SQL solution is equally appreciated; I can translate SQL to Scala code in my Spark job.
The following is my solution, but the computation takes longer than I would like.
The sizes of the real datasets:
idPeersDS has 17000 rows and infoDS has 17000 * 6 * 15 rows.
Any other solution is greatly appreciated. Feel free to edit or recommend a title and to correct the grammar. English is not my first language. Thanks.
Here is my code.
val idPeersDS = Seq(
(1, Seq(1,2,3)),
(2, Seq(2,1,6)),
(3, Seq(3,1,2)),
(4, Seq(4,5,6)),
(5, Seq(5,4,6)),
(6, Seq(6,1,2))
).toDS.select($"_1" as "id", $"_2" as "peers")
val infoDS = Seq(
(1, "A", "X", 10),
(1, "A", "Y", 20),
(1, "B", "X", 30),
(1, "B", "Y", 40),
(2, "A", "Y", 10),
(2, "B", "X", 20),
(2, "B", "Y", 30),
(3, "A", "X", 40),
(4, "B", "Y", 10),
(5, "A", "X", 20),
(5, "B", "X", 30),
(6, "A", "Y", 40),
(6, "B", "Y", 10)
).toDS.select($"_1" as "id", $"_2" as "type1", $"_3" as "type2", $"_4" cast "double" as "metric")
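import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{lit, round, udf}
import scala.collection.mutable.WrappedArray
def calculateZScoreGivenPeers(idMetricDS: DataFrame, irPeersDS: DataFrame, roundTo: Int = 2)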
(implicit spark: SparkSession): DataFrame = {
import spark.implicits._
// for every id in the idMetricDS, get the peers and their metric for zscore, calculate zscore
val fir = idMetricDS.join(irPeersDS, "id")
val fsMapBroadcast = spark.sparkContext.broadcast(
idMetricDS.toDF.map((r: Row) => {r.getInt(0) -> r.getDouble(1)}).rdd.collectAsMap)
val fsMap = fsMapBroadcast.value
val funUdf = udf((currId: Int, xs: WrappedArray[Int]) => {
val zScoreMetrics: Array[Double] = xs.toArray.map(x => fsMap.getOrElse(x, 0.0))
val ds = new DescriptiveStatistics(zScoreMetrics)
val mean = ds.getMean()
val sd = Math.sqrt(ds.getPopulationVariance())
val zScore = if (sd == 0.0) {0.0} else {(fsMap.getOrElse(currId, 0.0)- mean) / sd}
zScore
})
val idStatsWithZscoreDS =
fir.withColumn("zScoreValue", round(funUdf(fir("id"), fir("peers")), roundTo))
fsMapBroadcast.unpersist
fsMapBroadcast.destroy
return idStatsWithZscoreDS
}
val typesComb = infoDS.select("type1", "type2").dropDuplicates.collect
val zScoreDS = typesComb.map(
ept => {
val et = ept.getString(0)
val pt = ept.getString(1)
val idMetricDS = infoDS.where($"type1" === lit(et) && $"type2" === lit(pt)).select($"id", $"metric")
val zScoreDS = calculateZScoreGivenPeers(idMetricDS, idPeersDS)(spark)
zScoreDS.select($"id", $"metric", $"peers", $"zScoreValue").withColumn("type1", lit(et)).withColumn("type2", lit(pt))
}
).reduce(_.union(_))
scala> idPeersDS.show(100)
+---+---------+
| id| peers|
+---+---------+
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
+---+---------+
scala> infoDS.show(100)
+---+-----+-----+------+
| id|type1|type2|metric|
+---+-----+-----+------+
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
+---+-----+-----+------+
scala> typesComb
res3: Array[org.apache.spark.sql.Row] = Array([A,X], [B,Y], [B,X], [A,Y])
scala> zScoreDS.show(100)
+---+------+---------+-----------+-----+-----+
| id|metric| peers|zScoreValue|type1|type2|
+---+------+---------+-----------+-----+-----+
| 1| 10.0|[1, 2, 3]| -0.39| A| X|
| 3| 40.0|[3, 1, 2]| 1.37| A| X|
| 5| 20.0|[5, 4, 6]| 1.41| A| X|
| 1| 40.0|[1, 2, 3]| 0.98| B| Y|
| 2| 30.0|[2, 1, 6]| 0.27| B| Y|
| 4| 10.0|[4, 5, 6]| 0.71| B| Y|
| 6| 10.0|[6, 1, 2]| -1.34| B| Y|
| 1| 30.0|[1, 2, 3]| 1.07| B| X|
| 2| 20.0|[2, 1, 6]| 0.27| B| X|
| 5| 30.0|[5, 4, 6]| 1.41| B| X|
| 1| 20.0|[1, 2, 3]| 1.22| A| Y|
| 2| 10.0|[2, 1, 6]| -1.07| A| Y|
| 6| 40.0|[6, 1, 2]| 1.34| A| Y|
+---+------+---------+-----------+-----+-----+
I solved it. Here is my answer. On my real datasets this solution runs significantly faster (in less than 1/10th of the time) than the solution in my question.
I avoided collecting to the driver, and I avoided mapping over the type combinations and unioning the resulting datasets in the reduce.
val idPeersDS = Seq(
(1, Seq(1,2,3)),
(2, Seq(2,1,6)),
(3, Seq(3,1,2)),
(4, Seq(4,5,6)),
(5, Seq(5,4,6)),
(6, Seq(6,1,2))
).toDS.select($"_1" as "id", $"_2" as "peers")
val infoDS = Seq(
(1, "A", "X", 10),
(1, "A", "Y", 20),
(1, "B", "X", 30),
(1, "B", "Y", 40),
(2, "A", "Y", 10),
(2, "B", "X", 20),
(2, "B", "Y", 30),
(3, "A", "X", 40),
(4, "B", "Y", 10),
(5, "A", "X", 20),
(5, "B", "X", 30),
(6, "A", "Y", 40),
(6, "B", "Y", 10)
).toDS.select($"_1" as "id", $"_2" as "type1", $"_3" as "type2", $"_4" cast "double" as "metric")
// Exiting paste mode, now interpreting.
idPeersDS: org.apache.spark.sql.DataFrame = [id: int, peers: array<int>]
infoDS: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 2 more fields]
scala> idPeersDS.show
+---+---------+
| id| peers|
+---+---------+
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
+---+---------+
scala> infoDS.show
+---+-----+-----+------+
| id|type1|type2|metric|
+---+-----+-----+------+
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
+---+-----+-----+------+
scala> val infowithpeers = infoDS.join(idPeersDS, "id")
infowithpeers: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 3 more fields]
scala> infowithpeers.show
+---+-----+-----+------+---------+
| id|type1|type2|metric| peers|
+---+-----+-----+------+---------+
| 1| A| X| 10.0|[1, 2, 3]|
| 1| A| Y| 20.0|[1, 2, 3]|
| 1| B| X| 30.0|[1, 2, 3]|
| 1| B| Y| 40.0|[1, 2, 3]|
| 2| A| Y| 10.0|[2, 1, 6]|
| 2| B| X| 20.0|[2, 1, 6]|
| 2| B| Y| 30.0|[2, 1, 6]|
| 3| A| X| 40.0|[3, 1, 2]|
| 4| B| Y| 10.0|[4, 5, 6]|
| 5| A| X| 20.0|[5, 4, 6]|
| 5| B| X| 30.0|[5, 4, 6]|
| 6| A| Y| 40.0|[6, 1, 2]|
| 6| B| Y| 10.0|[6, 1, 2]|
+---+-----+-----+------+---------+
scala> val joinMap = udf { values: Seq[Map[Int,Double]] => values.flatten.toMap }
joinMap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(IntegerType,DoubleType,false),Some(List(ArrayType(MapType(IntegerType,DoubleType,false),true))))
scala> val zScoreCal = udf { (metric: Double, zScoreMetrics: WrappedArray[Double]) =>
| val ds = new DescriptiveStatistics(zScoreMetrics.toArray)
| val mean = ds.getMean()
| val sd = Math.sqrt(ds.getPopulationVariance())
| val zScore = if (sd == 0.0) {0.0} else {(metric - mean) / sd}
| zScore
| }
zScoreCal: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(DoubleType, ArrayType(DoubleType,false))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
val infowithpeersidmetric = infowithpeers.withColumn("idmetric", map($"id",$"metric"))
val idsingrpdf = infowithpeersidmetric.groupBy("type1","type2").agg(joinMap(collect_list(map($"id", $"metric"))) as "idsingrp")
val metricsMap = udf { (peers: Seq[Int], values: Map[Int,Double]) => {
peers.map(p => values.getOrElse(p,0.0))
}
}
// Exiting paste mode, now interpreting.
infowithpeersidmetric: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 4 more fields]
idsingrpdf: org.apache.spark.sql.DataFrame = [type1: string, type2: string ... 1 more field]
metricsMap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(DoubleType,false),Some(List(ArrayType(IntegerType,false), MapType(IntegerType,DoubleType,false))))
scala> val infoWithMap = infowithpeers.join(idsingrpdf, Seq("type1","type2")).withColumn("zScoreMetrics", metricsMap($"peers", $"idsingrp")).withColumn("zscore", round(zScoreCal($"metric",$"zScoreMetrics"),2))
infoWithMap: org.apache.spark.sql.DataFrame = [type1: string, type2: string ... 6 more fields]
scala> infoWithMap.show
+-----+-----+---+------+---------+--------------------+------------------+------+
|type1|type2| id|metric| peers| idsingrp| zScoreMetrics|zscore|
+-----+-----+---+------+---------+--------------------+------------------+------+
| A| X| 1| 10.0|[1, 2, 3]|[3 -> 40.0, 5 -> ...| [10.0, 0.0, 40.0]| -0.39|
| A| Y| 1| 20.0|[1, 2, 3]|[2 -> 10.0, 6 -> ...| [20.0, 10.0, 0.0]| 1.22|
| B| X| 1| 30.0|[1, 2, 3]|[1 -> 30.0, 2 -> ...| [30.0, 20.0, 0.0]| 1.07|
| B| Y| 1| 40.0|[1, 2, 3]|[4 -> 10.0, 1 -> ...| [40.0, 30.0, 0.0]| 0.98|
| A| Y| 2| 10.0|[2, 1, 6]|[2 -> 10.0, 6 -> ...|[10.0, 20.0, 40.0]| -1.07|
| B| X| 2| 20.0|[2, 1, 6]|[1 -> 30.0, 2 -> ...| [20.0, 30.0, 0.0]| 0.27|
| B| Y| 2| 30.0|[2, 1, 6]|[4 -> 10.0, 1 -> ...|[30.0, 40.0, 10.0]| 0.27|
| A| X| 3| 40.0|[3, 1, 2]|[3 -> 40.0, 5 -> ...| [40.0, 10.0, 0.0]| 1.37|
| B| Y| 4| 10.0|[4, 5, 6]|[4 -> 10.0, 1 -> ...| [10.0, 0.0, 10.0]| 0.71|
| A| X| 5| 20.0|[5, 4, 6]|[3 -> 40.0, 5 -> ...| [20.0, 0.0, 0.0]| 1.41|
| B| X| 5| 30.0|[5, 4, 6]|[1 -> 30.0, 2 -> ...| [30.0, 0.0, 0.0]| 1.41|
| A| Y| 6| 40.0|[6, 1, 2]|[2 -> 10.0, 6 -> ...|[40.0, 20.0, 10.0]| 1.34|
| B| Y| 6| 10.0|[6, 1, 2]|[4 -> 10.0, 1 -> ...|[10.0, 40.0, 30.0]| -1.34|
+-----+-----+---+------+---------+--------------------+------------------+------+
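To make the arithmetic behind the zscore column easier to check, here is a minimal plain-Python sketch of what the metricsMap and zScoreCal UDFs appear to compute. It is not Spark code; the names peer_metrics and z_score are hypothetical, and it assumes a population standard deviation (as getPopulationVariance suggests) with missing peers defaulting to 0.0.

```python
import math

def peer_metrics(peers, metrics_by_id, default=0.0):
    # Look up each peer's metric within one (type1, type2) group.
    # Peers that are absent from the group get the default, like getOrElse(p, 0.0).
    return [metrics_by_id.get(p, default) for p in peers]

def z_score(metric, values):
    # Population mean and standard deviation (divide by n, not n - 1).
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    sd = math.sqrt(variance)
    # Guard against a zero standard deviation, as the UDF does.
    return 0.0 if sd == 0.0 else (metric - mean) / sd

# The type1 = A, type2 = X group from infoDS, as id -> metric.
a_x = {1: 10.0, 3: 40.0, 5: 20.0}
values = peer_metrics([1, 2, 3], a_x)   # [10.0, 0.0, 40.0]
print(round(z_score(10.0, values), 2))  # -0.39
```

Note that defaulting a missing peer to 0.0 pulls the mean toward zero. If absent peers should be ignored instead, dropping them before computing the statistics would be one option, but that would change the numbers shown above.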

Grab last different data on Spark Dataframe?

I have this data in a Spark DataFrame:
+------+-------+-----+------------+----------+---------+
|sernum|product|state|testDateTime|testResult| msg|
+------+-------+-----+------------+----------+---------+
| 8| PA1| 1.0| 1.18| pass|testlog18|
| 7| PA1| 1.0| 1.17| fail|testlog17|
| 6| PA1| 1.0| 1.16| pass|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|
| 4| PA1| 2.0| 1.14| fail|testlog14|
| 3| PA1| 1.0| 1.13| pass|testlog13|
| 2| PA1| 2.0| 1.12| pass|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11|
+------+-------+-----+------------+----------+---------+
What I care about are the rows where testResult == "fail". The hard part is that I need to get the last "pass" message as an extra column, grouped by product and state:
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult| msg| passMsg|
+------+-------+-----+------------+----------+---------+---------+
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
+------+-------+-----+------------+----------+---------+---------+
How can I achieve this using DataFrame or SQL?
The trick is to define groups so that each group starts with a passed test. Then use a window function again, with the group as an additional partition column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq(
(8, "PA1", 1.0, 1.18, "pass", "testlog18"),
(7, "PA1", 1.0, 1.17, "fail", "testlog17"),
(6, "PA1", 1.0, 1.16, "pass", "testlog16"),
(5, "PA1", 1.0, 1.15, "fail", "testlog15"),
(4, "PA1", 2.0, 1.14, "fail", "testlog14"),
(3, "PA1", 1.0, 1.13, "pass", "testlog13"),
(2, "PA1", 2.0, 1.12, "pass", "testlog12"),
(1, "PA1", 1.0, 1.11, "fail", "testlog11")
).toDF("sernum", "product", "state", "testDateTime", "testResult", "msg")
df
.withColumn("group", sum(when($"testResult" === "pass", 1)).over(Window.partitionBy($"product", $"state").orderBy($"testDateTime")))
.withColumn("passMsg", when($"group".isNotNull,first($"msg").over(Window.partitionBy($"product", $"state", $"group").orderBy($"testDateTime"))))
.drop($"group")
.where($"testResult"==="fail")
.orderBy($"product", $"state", $"testDateTime")
.show()
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult| msg| passMsg|
+------+-------+-----+------------+----------+---------+---------+
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
+------+-------+-----+------------+----------+---------+---------+
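The grouping idea can be traced by hand with a minimal plain-Python sketch. It is not Spark code, and the function name last_pass_messages is hypothetical. It mirrors the two window steps: a running count of passes per (product, state) defines the group, and the first message of each group is the pass that opened it.

```python
from collections import defaultdict

rows = [
    (8, "PA1", 1.0, 1.18, "pass", "testlog18"),
    (7, "PA1", 1.0, 1.17, "fail", "testlog17"),
    (6, "PA1", 1.0, 1.16, "pass", "testlog16"),
    (5, "PA1", 1.0, 1.15, "fail", "testlog15"),
    (4, "PA1", 2.0, 1.14, "fail", "testlog14"),
    (3, "PA1", 1.0, 1.13, "pass", "testlog13"),
    (2, "PA1", 2.0, 1.12, "pass", "testlog12"),
    (1, "PA1", 1.0, 1.11, "fail", "testlog11"),
]

def last_pass_messages(rows):
    # Partition by (product, state), like Window.partitionBy.
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row[1], row[2])].append(row)

    result = []
    for part in partitions.values():
        part.sort(key=lambda r: r[3])  # order by testDateTime
        group = None    # running count of passes; None before the first pass
        first_msg = {}  # group number -> msg of the pass row that opened it
        for sernum, _product, _state, _ts, test_result, msg in part:
            if test_result == "pass":
                group = 1 if group is None else group + 1
                first_msg[group] = msg
            else:
                # A fail row inherits the message of its group's opening pass.
                result.append((sernum, msg, first_msg.get(group)))
    return sorted(result)

for sernum, msg, pass_msg in last_pass_messages(rows):
    print(sernum, msg, pass_msg)
```

Rows that fail before any pass has been seen have no group yet, which is why sernum 1 ends up with no pass message.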
An alternative approach is to join each failed log with the passed logs from earlier times, then keep only the latest "pass" message.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
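// Rank the candidate pass rows for each failed log, latest first (assumes msg is unique per failed row)
val window = Window.partitionBy($"msg").orderBy($"p_testDateTime".desc)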
val fDf = df.filter($"testResult" === "fail")
var pDf = df.filter($"testResult" === "pass")
pDf.columns.foreach(x => pDf = pDf.withColumnRenamed(x, "p_"+x))
val jDf = fDf.join(
pDf,
pDf("p_product") === fDf("product") &&
pDf("p_state") === fDf("state") &&
fDf("testDateTime") > pDf("p_testDateTime") ,
"left").
select(fDf("*"),
pDf("p_testResult"),
pDf("p_testDateTime"),
pDf("p_msg")
)
jDf.withColumn(
"rnk",
row_number().
over(window)
).
filter($"rnk" === 1).
drop("rnk","p_testResult","p_testDateTime").
show()
+---------+-------+------+-----+------------+----------+---------+
| msg|product|sernum|state|testDateTime|testResult| p_msg|
+---------+-------+------+-----+------------+----------+---------+
|testlog14| PA1| 4| 2| 1.14| fail|testlog12|
|testlog11| PA1| 1| 1| 1.11| fail| null|
|testlog15| PA1| 5| 1| 1.15| fail|testlog13|
|testlog17| PA1| 7| 1| 1.17| fail|testlog16|
+---------+-------+------+-----+------------+----------+---------+
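The join-and-rank logic can also be sketched in plain Python as a rough check. This is not Spark code, and latest_earlier_pass is a hypothetical name. For each failed row it collects the passes with the same product and state and an earlier testDateTime, then keeps the latest one, which is what row_number over a descending order followed by rnk === 1 selects.

```python
rows = [
    (8, "PA1", 1.0, 1.18, "pass", "testlog18"),
    (7, "PA1", 1.0, 1.17, "fail", "testlog17"),
    (6, "PA1", 1.0, 1.16, "pass", "testlog16"),
    (5, "PA1", 1.0, 1.15, "fail", "testlog15"),
    (4, "PA1", 2.0, 1.14, "fail", "testlog14"),
    (3, "PA1", 1.0, 1.13, "pass", "testlog13"),
    (2, "PA1", 2.0, 1.12, "pass", "testlog12"),
    (1, "PA1", 1.0, 1.11, "fail", "testlog11"),
]

def latest_earlier_pass(rows):
    fails = [r for r in rows if r[4] == "fail"]
    passes = [r for r in rows if r[4] == "pass"]
    out = {}
    for f in fails:
        # Join condition: same product, same state, pass strictly earlier than the fail.
        candidates = [
            p for p in passes
            if p[1] == f[1] and p[2] == f[2] and p[3] < f[3]
        ]
        # Left-join semantics: a fail with no earlier pass keeps a None message.
        best = max(candidates, key=lambda p: p[3], default=None)
        out[f[5]] = best[5] if best is not None else None
    return out

print(latest_earlier_pass(rows))
```

Unlike the window version, this compares every fail against every pass, so on large data the join can produce many intermediate rows before the rank filter trims them.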