Suppose I have this dataframe in PySpark:
df = spark.createDataFrame([
['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
['red', 'banana', 7, 70], ['red', 'grape', 8, 80]], schema=['color', 'fruit', 'v1', 'v2'])
I want to create a function that divides one column by the other (v2 by v1) and returns NaN when the divisor is 0:
import numpy as np
from pyspark.sql.functions import pandas_udf
#pandas_udf('long', PandasUDFType.SCALAR)
def pandas_div(a, b):
    if b == 0:
        return np.nan
    return (a / b)
However, the result turns out to be this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The output that I want should be like this:
| red| banana|10 |
| blue| banana|20 |
| red| carrot|30 |
| blue| grape|40 |
| red| carrot|50 |
| black| carrot|60 |
| red| banana|70 |
| red| grape|80 |
All you need is when and otherwise. See the example below:
# Create data frame
df = spark.createDataFrame([
['red', 'banana', 1, 10], ['blue', 'banana', 2, 20], ['red', 'carrot', 3, 30],
['blue', 'grape', 4, 40], ['red', 'carrot', 5, 50], ['black', 'carrot', 6, 60],
['red', 'banana', 7, 70], ['red', 'grape', 8, 80], ['orange', 'grapefruit', 0, 100]], schema=['color', 'fruit', 'v1', 'v2'])
# display result
| color| fruit| v1| v2|
| red| banana| 1| 10|
| blue| banana| 2| 20|
| red| carrot| 3| 30|
| blue| grape| 4| 40|
| red| carrot| 5| 50|
| black| carrot| 6| 60|
| red| banana| 7| 70|
| red| grape| 8| 80|
|orange|grapefruit| 0|100|
# Import functions
import pyspark.sql.functions as f
# apply case when
df1 = df.withColumn("divide", f.when(f.col("v1") == 0, None).otherwise(f.lit(f.col("v2")/f.col("v1"))))
# display result
| color| fruit| v1| v2|divide|
| red| banana| 1| 10| 10.0|
| blue| banana| 2| 20| 10.0|
| red| carrot| 3| 30| 10.0|
| blue| grape| 4| 40| 10.0|
| red| carrot| 5| 50| 10.0|
| black| carrot| 6| 60| 10.0|
| red| banana| 7| 70| 10.0|
| red| grape| 8| 80| 10.0|
|orange|grapefruit| 0|100| null|
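To make the row-level semantics of the when/otherwise expression above concrete, here is a minimal sketch in plain Python (no Spark). The helper name `safe_divide` is hypothetical and only models what happens to a single row; it is not part of the PySpark API.

```python
from typing import Optional


def safe_divide(numerator: float, denominator: float) -> Optional[float]:
    """Model when(v1 == 0, None).otherwise(v2 / v1) for a single row."""
    if denominator == 0:
        return None  # the `when` branch: no division, the result is null
    return numerator / denominator  # the `otherwise` branch


# (color, fruit, v1, v2) rows taken from the example data above
rows = [
    ("red", "banana", 1, 10),
    ("blue", "banana", 2, 20),
    ("orange", "grapefruit", 0, 100),
]

divide = [safe_divide(v2, v1) for _color, _fruit, v1, v2 in rows]
print(divide)  # [10.0, 10.0, None]
```

In Spark the same decision is made per row inside the column expression, which is why no Python-level `if` on a whole Series is needed.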
I have two data frames to work on. The first one, df1, looks like this:
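from pyspark.sql import functions as F
from pyspark.sql.functions import col, expr, to_date
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

df1_schema = StructType([StructField("Date", StringType(), True),
                         StructField("store_id", StringType(), True),
                         StructField("warehouse_id", StringType(), True),
                         StructField("class_id", StringType(), True),
                         StructField("total_time", IntegerType(), True)])
# rows as shown in the table below
df_data = [('2020-08-01','110','1','11010',3), ('2020-08-02','110','1','11010',2),
           ('2020-08-03','110','1','11010',3), ('2020-08-04','110','1','11010',3),
           ('2020-08-05','111','1','11010',1), ('2020-08-06','111','1','11010',-1)]
rdd = sc.parallelize(df_data)
df1 = sqlContext.createDataFrame(df_data, df1_schema)
df1 = df1.withColumn("Date", to_date("Date", 'yyyy-MM-dd'))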
| Date|store_id|warehouse_id|class_id|total_time|
|2020-08-01| 110| 1| 11010| 3|
|2020-08-02| 110| 1| 11010| 2|
|2020-08-03| 110| 1| 11010| 3|
|2020-08-04| 110| 1| 11010| 3|
|2020-08-05| 111| 1| 11010| 1|
|2020-08-06| 111| 1| 11010| -1|
I then calculated a column called arrival_date:
#To calculate the arrival_date
#logic : add the Date + total_time so in first row, 2020-08-01 +3 would give me 2020-08-04
#if total_time is -1 then return blank
df1 = df1.withColumn('arrival_date', F.when(col('total_time') != -1, expr("date_add(date, total_time)")).otherwise(''))
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|
|2020-08-01| 110| 1| 11010| 3| 2020-08-04|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06|
|2020-08-06| 111| 1| 11010| -1| |
What I want to calculate next is transit_date:
#to calculate the transit_date
#if the same arrival_date occurs 2 or more times (e.g. 2020-08-04), take min("Date"), which is 2020-08-01;
#otherwise just return the Date (e.g. arrival_date 2020-08-07 just returns 2020-08-04)
#store_id and warehouse_id matter too: arrival_date = 2020-08-06 also occurs 2 times, but
#rows are treated separately when either store_id or warehouse_id differs, so for arrival_date = 2020-08-06 and Date = 2020-08-03
#we must return 2020-08-03
#*Note* class_id does not matter; it has no effect.
#if arrival_date is blank, leave transit_date blank as well
#so our df would look something like this:
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|transit_date|
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| 2020-08-01|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| 2020-08-01|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| 2020-08-03|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| 2020-08-04|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| 2020-08-05|
|2020-08-06| 111| 1| 11010| -1| | |
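The transit_date rule described above can be sketched in plain Python (no Spark) to check the expected values. This is only an illustration under the stated rules: the names `arrival_date`, `earliest` and `transit_dates` are made up for the sketch, and class_id is left out because the question says it has no effect.

```python
from datetime import date, timedelta

# (Date, store_id, warehouse_id, total_time) rows from df1 above
rows = [
    (date(2020, 8, 1), "110", "1", 3),
    (date(2020, 8, 2), "110", "1", 2),
    (date(2020, 8, 3), "110", "1", 3),
    (date(2020, 8, 4), "110", "1", 3),
    (date(2020, 8, 5), "111", "1", 1),
    (date(2020, 8, 6), "111", "1", -1),
]


def arrival_date(day, total_time):
    """Date + total_time days, or None (blank) when total_time is -1."""
    return None if total_time == -1 else day + timedelta(days=total_time)


# Earliest Date per (arrival_date, store_id, warehouse_id) group
earliest = {}
for day, store_id, warehouse_id, total_time in rows:
    arrival = arrival_date(day, total_time)
    if arrival is None:
        continue  # blank arrival_date: no transit_date
    key = (arrival, store_id, warehouse_id)
    earliest[key] = min(earliest.get(key, day), day)

# Look the group minimum up again for every row; blank rows stay None
transit_dates = [
    earliest.get((arrival_date(day, total_time), store_id, warehouse_id))
    for day, store_id, warehouse_id, total_time in rows
]
print([d.isoformat() if d else "" for d in transit_dates])
```

The first two rows share arrival_date 2020-08-04 and therefore both get 2020-08-01, while the two rows arriving on 2020-08-06 stay separate because their store_id differs.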
Next, I have df2, which looks like the following:
#we have another dataframe call it df2
df2_schema = StructType([StructField("Date", StringType(), True),
                         StructField("store_id", StringType(), True),
                         StructField("warehouse_id", StringType(), True),
                         StructField("cloth_id", StringType(), True),
                         StructField("class_id", StringType(), True),
                         StructField("type", StringType(), True),
                         StructField("quantity", IntegerType(), True)])
# rows as shown in the table below
df_data = [('2020-08-01','110','1','M_1','11010','R',5), ('2020-08-01','110','1','M_1','11010','R',2),
           ('2020-08-02','110','1','M_1','11010','C',3), ('2020-08-03','110','1','M_1','11010','R',1),
           ('2020-08-04','110','1','M_1','11010','R',3), ('2020-08-05','111','1','M_2','11010','R',5)]
rdd = sc.parallelize(df_data)
df2 = sqlContext.createDataFrame(df_data, df2_schema)
df2 = df2.withColumn("Date", to_date("Date", 'yyyy-MM-dd'))
| Date|store_id|warehouse_id|cloth_id|class_id|type|quantity|
|2020-08-01| 110| 1| M_1| 11010| R| 5|
|2020-08-01| 110| 1| M_1| 11010| R| 2|
|2020-08-02| 110| 1| M_1| 11010| C| 3|
|2020-08-03| 110| 1| M_1| 11010| R| 1|
|2020-08-04| 110| 1| M_1| 11010| R| 3|
|2020-08-05| 111| 1| M_2| 11010| R| 5|
Then I calculated quantity2, which is just the sum of quantity where type = R:
df2 = df2.groupBy('Date','store_id','warehouse_id','cloth_id','class_id')\
    .agg(F.sum(F.when(col('type')=='R', col('quantity'))).alias('quantity2'))
| Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
|2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| M_1| 11010| 3|
|2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| M_2| 11010| 5|
Now I have df1 and df2, and I want to join them.
I tried something like this:
df4 = df1.select('store_id','warehouse_id','class_id','arrival_date','transit_date')
df4= df4.filter(" transit_date != '' ")
df4=df4.withColumnRenamed('arrival_date', 'date')
df3 = df2.join(df1, on=['Date','store_id','warehouse_id','class_id'],how='inner').orderBy('Date')
df5 = df3.join(df4, on=['Date','store_id','warehouse_id','class_id'], how='left').orderBy('Date')
but I don't think this is the correct approach. The resulting df should look like this:
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null|
Note that transit_date moves to the row where Date equals the arrival_date; the nulls would of course be replaced by blanks.
Lastly, if today is 2020-08-04, look at the rows where arrival_date == 2020-08-04, sum up their quantity, and place the sum on today's row. Rows with store_id = 111 get their own dates (not shown here), so the logic needs to work for store_id = 111 as well; I have only shown the example for store_id = 110.
Going by my understanding of your question, and starting from the df1 and df2 you already have:
df1.orderBy('Date').show() df2.orderBy('Date').show()
+----------+--------+------------+--------+----------+------------+ +----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date| | Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+----------+------------+ +----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| |2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| |2020-08-02| 110| 1| M_1| 11010| 3|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| |2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| |2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| |2020-08-05| 111| 1| M_2| 11010| 5|
|2020-08-06| 111| 1| 11010| -1| | +----------+--------+------------+--------+--------+---------+
you can try the following 5 steps:
Step-1: Set up the list of column names grp_cols for join:
from pyspark.sql import functions as F
grp_cols = ["Date", "store_id", "warehouse_id", "class_id"]
Step-2: create df3 containing transit_date which is the min Date on each combination of arrival_date, store_id, warehouse_id and class_id:
df3 = df1.filter('total_time != -1') \
.groupby("arrival_date", "store_id", "warehouse_id", "class_id") \
.agg(F.min('Date').alias('transit_date')) \
.withColumnRenamed("arrival_date", "Date")
| Date|store_id|warehouse_id|class_id|transit_date|
|2020-08-04| 110| 1| 11010| 2020-08-01|
|2020-08-06| 111| 1| 11010| 2020-08-05|
|2020-08-06| 110| 1| 11010| 2020-08-03|
|2020-08-07| 110| 1| 11010| 2020-08-04|
Step-3: set up df4 by joining df2 with df1 and left-joining df3 on grp_cols, then persist df4:
df4 = df2.join(df1, grp_cols).join(df3, grp_cols, "left") \
    .withColumn('transit_date', F.when(F.col('total_time') != -1, F.col("transit_date")).otherwise('')) \
    .persist()
_ = df4.count()
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null|
Step-4: calculate sum(quantity2) as want from df4 for each arrival_date + store_id + warehouse_id + class_id + cloth_id
df5 = df4 \
.groupby("arrival_date", "store_id", "warehouse_id", "class_id", "cloth_id") \
.agg(F.sum("quantity2").alias("want")) \
.withColumnRenamed("arrival_date", "Date")
| Date|store_id|warehouse_id|class_id|cloth_id|want|
|2020-08-04| 110| 1| 11010| M_1| 10|
|2020-08-06| 111| 1| 11010| M_2| 5|
|2020-08-06| 110| 1| 11010| M_1| 1|
|2020-08-07| 110| 1| 11010| M_1| 3|
Step-5: create the final dataframe by left-joining df4 with df5:
df_new = df4.join(df5, grp_cols+["cloth_id"], "left").fillna(0, subset=['want'])
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|want|
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null| 0|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null| 0|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null| 0|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01| 10|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null| 0|
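As a cross-check of Step-4 and Step-5, the same grouping and lookup can be sketched in plain Python (no Spark). The names `arriving` and `want` are illustrative only; the rows are copied from the df4 table in Step-3.

```python
from collections import defaultdict

# (Date, store_id, warehouse_id, class_id, cloth_id, quantity2, arrival_date) from df4
rows = [
    ("2020-08-01", "110", "1", "11010", "M_1", 7, "2020-08-04"),
    ("2020-08-02", "110", "1", "11010", "M_1", 3, "2020-08-04"),
    ("2020-08-03", "110", "1", "11010", "M_1", 1, "2020-08-06"),
    ("2020-08-04", "110", "1", "11010", "M_1", 3, "2020-08-07"),
    ("2020-08-05", "111", "1", "11010", "M_2", 5, "2020-08-06"),
]

# Step-4: total quantity2 per arrival_date + store + warehouse + class + cloth
arriving = defaultdict(int)
for _day, store, wh, cls, cloth, qty, arrival in rows:
    arriving[(arrival, store, wh, cls, cloth)] += qty

# Step-5: left join back on Date, filling missing sums with 0
want = [
    arriving.get((day, store, wh, cls, cloth), 0)
    for day, store, wh, cls, cloth, _qty, _arrival in rows
]
print(want)  # [0, 0, 0, 10, 0]
```

Only 2020-08-04 has shipments arriving on a Date that also exists as a row for the same store, so it is the only row with a non-zero want.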
Here is the code for df1:
from pyspark.sql import Window
from pyspark.sql.functions import *
from pyspark.sql.types import *
import builtins as p
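df1_schema = StructType([
    StructField('Date', StringType(), True),
    StructField('store_id', StringType(), True),
    StructField('warehouse_id', StringType(), True),
    StructField('class_id', StringType(), True),
    StructField('total_time', IntegerType(), True)
])
df1_data = [  # rows as shown in the table below
    ('2020-08-01', '110', '1', '11010', 3), ('2020-08-02', '110', '1', '11010', 2),
    ('2020-08-03', '110', '1', '11010', 3), ('2020-08-04', '110', '1', '11010', 3),
    ('2020-08-05', '111', '1', '11010', 1), ('2020-08-06', '111', '1', '11010', -1)
]
df1 = spark.createDataFrame(df1_data, df1_schema)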
df1 = df1.withColumn('Date', to_date('Date'))
df1 = df1.withColumn('arrival_date', when(col('total_time') != -1, expr("date_add(date, total_time)")).otherwise(''))
w = Window.partitionBy('arrival_date', 'store_id', 'warehouse_id').orderBy('Date')
df1 = df1.withColumn('transit_date', when(col('total_time') != -1, first('Date').over(w)).otherwise('')).orderBy('Date')
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|transit_date|
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| 2020-08-01|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| 2020-08-01|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| 2020-08-03|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| 2020-08-04|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| 2020-08-05|
|2020-08-06| 111| 1| 11010| -1| | |
and df2 as you did,
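df2_schema = StructType([
    StructField('Date', StringType(), True),
    StructField('store_id', StringType(), True),
    StructField('warehouse_id', StringType(), True),
    StructField('cloth_id', StringType(), True),
    StructField('class_id', StringType(), True),
    StructField('type', StringType(), True),
    StructField('quantity', IntegerType(), True)
])
df2_data = [  # rows as shown in the question
    ('2020-08-01', '110', '1', 'M_1', '11010', 'R', 5), ('2020-08-01', '110', '1', 'M_1', '11010', 'R', 2),
    ('2020-08-02', '110', '1', 'M_1', '11010', 'C', 3), ('2020-08-03', '110', '1', 'M_1', '11010', 'R', 1),
    ('2020-08-04', '110', '1', 'M_1', '11010', 'R', 3), ('2020-08-05', '111', '1', 'M_2', '11010', 'R', 5)
]
df2 = spark.createDataFrame(df2_data, df2_schema)
df2 = df2.withColumn('Date', to_date('Date'))
df2 = df2.groupBy('Date', 'store_id', 'warehouse_id', 'cloth_id', 'class_id') \
    .agg(sum(when(col('type') == 'R', col('quantity')).otherwise(0)).alias('quantity2'))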
| Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
|2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| M_1| 11010| 0|
|2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| M_2| 11010| 5|
and finally the join result.
df3 = df1.filter('total_time != -1') \
.join(df2, on=['Date', 'store_id', 'warehouse_id', 'class_id'], how='left') \
.drop('Date', 'total_time', 'cloth_id') \
.withColumnRenamed('arrival_date', 'Date')
df4 = df1.drop('transit_date') \
    .join(df3, on=['Date', 'store_id', 'warehouse_id', 'class_id'], how='left') \
    .groupBy('Date', 'store_id', 'warehouse_id', 'class_id', 'arrival_date', 'transit_date') \
    .agg(sum('quantity2').alias('want')) \
    .orderBy('Date')
| Date|store_id|warehouse_id|class_id|arrival_date|transit_date|want|
|2020-08-01| 110| 1| 11010| 2020-08-04| null|null|
|2020-08-02| 110| 1| 11010| 2020-08-04| null|null|
|2020-08-03| 110| 1| 11010| 2020-08-06| null|null|
|2020-08-04| 110| 1| 11010| 2020-08-07| 2020-08-01| 7|
|2020-08-05| 111| 1| 11010| 2020-08-06| null|null|
|2020-08-06| 111| 1| 11010| | 2020-08-05| 5|
I have a dataframe like this:
df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11"],
"Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
"Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
"Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
"Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})
df = spark.createDataFrame(df)
| Date|Slot_Length|Total_Space|Amount_Over|Rank|
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 2|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 2|
|2020-05-11| 30| 120| -60| 2|
For each Date I have a Total_Space that can be filled. So for 2020-05-10, I have 60 seconds, and for 2020-05-11 I have 120 seconds.
Each Date also already has assigned slots with a certain Slot_Length.
For each Date I have already calculated the amount of space that Date is over in the Amount_Over column and have ranked them appropriately based on a priority column not shown here.
What I would like to do is to drop the rows with lowest Rank for a Date until the Slot_Lengths add up to the Total_Space for a Date.
| Date|Slot_Length|Total_Space|Amount_Over|Rank|
|2020-05-10| 30| 60| -30| 1|
|2020-05-10| 30| 60| -30| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
|2020-05-11| 30| 120| -60| 1|
In this example, it is as easy as dropping all Rank equal to 2, but there will be examples where there is a tie between ranks, so first take the highest ranks, and then take a random one if there is a tie.
What is the best way to do this? I already understand it will need a Window function over the Date to do each calculation over the Slot_Length, Total_Space, and Amount_Over columns correctly.
import pandas as pd
from pyspark.sql import Window
from pyspark.sql import functions as F

df = pd.DataFrame({"Date": ["2020-05-10", "2020-05-10", "2020-05-10", "2020-05-11", "2020-05-11", "2020-05-11",
                            "2020-05-11", "2020-05-11", "2020-05-11"],
                   "Slot_Length": [30, 30, 30, 30, 30, 30, 30, 30, 30],
                   "Total_Space": [60, 60, 60, 120, 120, 120, 120, 120, 120],
                   "Amount_Over": [-30, -30, -30, -60, -60, -60, -60, -60, -60],
                   "Rank": [1, 1, 2, 1, 1, 1, 1, 2, 2]})
df = spark.createDataFrame(df)

# running total of Slot_Length per Date, best Rank first
w = Window.partitionBy("Date").orderBy("Rank").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn(
    "Cumulative_Sum", F.sum("Slot_Length").over(w)
).filter(
    # keep rows only while the running total still fits into Total_Space
    F.col("Cumulative_Sum") <= F.col("Total_Space")
)
which results in:
| Date|Slot_Length|Total_Space|Amount_Over|Rank|Cumulative_Sum|
|2020-05-10| 30| 60| -30| 1| 30|
|2020-05-10| 30| 60| -30| 1| 60|
|2020-05-11| 30| 120| -60| 1| 30|
|2020-05-11| 30| 120| -60| 1| 60|
|2020-05-11| 30| 120| -60| 1| 90|
|2020-05-11| 30| 120| -60| 1| 120|
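The running-total filter can also be traced in plain Python (no Spark), which makes it easy to see which rows survive. This is a sketch, not the Spark implementation: ties within a Rank are kept in input order here because Python's sort is stable, whereas the question asks for a random pick among ties, which would need an extra random sort key.

```python
from itertools import groupby

# (Date, Slot_Length, Total_Space, Rank) rows from the example data
rows = (
    [("2020-05-10", 30, 60, 1), ("2020-05-10", 30, 60, 1), ("2020-05-10", 30, 60, 2)]
    + [("2020-05-11", 30, 120, 1)] * 4
    + [("2020-05-11", 30, 120, 2)] * 2
)

kept = []
ordered = sorted(rows, key=lambda r: (r[0], r[3]))  # by Date, then Rank
for day, group in groupby(ordered, key=lambda r: r[0]):
    running = 0  # cumulative Slot_Length within this Date
    for _day, slot_length, total_space, rank in group:
        running += slot_length
        if running <= total_space:  # same condition as the filter above
            kept.append((day, slot_length, total_space, rank, running))

for row in kept:
    print(row)
```

The running total keeps growing for rows that are dropped as well, which matches computing the cumulative sum first and filtering afterwards.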
I have two datasets (dataframes)
idPeersDS - which has an id column and its peers' ids.
infoDS - which has two type columns (type1, type2) and a metric column.
| id| peers|
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
| id|type1|type2|metric|
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
I need to calculate the z-score of the metric for each id, grouped by type1 and type2. However, it is not the z-score over all metrics in the group; it is the z-score over the metrics of the id's peers within the group. If a peerId does not have a metric in the group, its metric is treated as 0.
For example, for group ("A", "X") and id = 1, the peers are (1,2,3), so the metrics for the z-score are (10, 0, 40); since id = 2 doesn't exist in group ("A","X"), its metric is 0. id = 5 is not a peer of id = 1, so it is not part of the z-score calculation.
| id|metric| peers|type1|type2|
| 1| 10.0|[1, 2, 3]| A| X|
| 3| 40.0|[3, 1, 2]| A| X|
| 5| 20.0|[5, 4, 6]| A| X|
Z = (X - μ) / σ
Z = (10 - 16.66666) / 16.99673
Z = -0.39223
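The worked numbers above can be reproduced with a few lines of plain Python. This is only a check of the arithmetic, assuming the population standard deviation is intended, which is what the value 16.99673 corresponds to.

```python
from statistics import mean, pstdev

# Metrics of the peers (1, 2, 3) of id = 1 in group ("A", "X"); id = 2 is missing, so 0
peer_metrics = [10.0, 0.0, 40.0]

mu = mean(peer_metrics)       # arithmetic mean of the peer metrics
sigma = pstdev(peer_metrics)  # population standard deviation
z = (10.0 - mu) / sigma       # z-score of id = 1's own metric

print(round(mu, 5), round(sigma, 5), round(z, 5))
```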
The output should be the following table. I could also compute the z-score myself if I had a `peersmetrics` column instead of the `zScoreValue` column that my code produces.
| id|metric| peers|zScoreValue|type1|type2| peersmetrics
| 1| 10.0|[1, 2, 3]| -0.39| A| X| [10, 0, 40]
| 3| 40.0|[3, 1, 2]| 1.37| A| X| [40, 10, 0]
| 5| 20.0|[5, 4, 6]| 1.41| A| X| [20, 0 , 0]
| 1| 40.0|[1, 2, 3]| 0.98| B| Y| [40, 30, 0]
| 2| 30.0|[2, 1, 6]| 0.27| B| Y| [30, 40, 10]
| 4| 10.0|[4, 5, 6]| 0.71| B| Y|
| 6| 10.0|[6, 1, 2]| -1.34| B| Y|
| 1| 30.0|[1, 2, 3]| 1.07| B| X|
| 2| 20.0|[2, 1, 6]| 0.27| B| X|
| 5| 30.0|[5, 4, 6]| 1.41| B| X|
| 1| 20.0|[1, 2, 3]| 1.22| A| Y|
| 2| 10.0|[2, 1, 6]| -1.07| A| Y|
| 6| 40.0|[6, 1, 2]| 1.34| A| Y|
Edit1: SQL solution is equally appreciated. I can transform SQL to Scala code in my spark job.
The following is my solution, but the computation takes longer than I would like.
The sizes of the real datasets are:
idPeersDS has 17000 rows and infoDS has 17000 * 6 * 15 rows.
Any other solution is greatly appreciated. Feel free to edit or recommend a title and to correct the grammar; English is not my first language. Thanks.
Here is my code:
val idPeersDS = Seq(
(1, Seq(1,2,3)),
(2, Seq(2,1,6)),
(3, Seq(3,1,2)),
(4, Seq(4,5,6)),
(5, Seq(5,4,6)),
(6, Seq(6,1,2))
).toDF.select($"_1" as "id", $"_2" as "peers")
val infoDS = Seq(
(1, "A", "X", 10),
(1, "A", "Y", 20),
(1, "B", "X", 30),
(1, "B", "Y", 40),
(2, "A", "Y", 10),
(2, "B", "X", 20),
(2, "B", "Y", 30),
(3, "A", "X", 40),
(4, "B", "Y", 10),
(5, "A", "X", 20),
(5, "B", "X", 30),
(6, "A", "Y", 40),
(6, "B", "Y", 10)
).toDF.select($"_1" as "id", $"_2" as "type1", $"_3" as "type2", $"_4" cast "double" as "metric")
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray

def calculateZScoreGivenPeers(idMetricDS: DataFrame, irPeersDS: DataFrame, roundTo: Int = 2)
                             (implicit spark: SparkSession): DataFrame = {
  import spark.implicits._
  // for every id in idMetricDS, get the peers and their metrics, then calculate the z-score
  val fir = idMetricDS.join(irPeersDS, "id")
  // id -> metric lookup, collected to the driver and broadcast
  val fsMapBroadcast = spark.sparkContext.broadcast(
    idMetricDS.map((r: Row) => {r.getInt(0) -> r.getDouble(1)}).rdd.collectAsMap)
  val fsMap = fsMapBroadcast.value
  val funUdf = udf((currId: Int, xs: WrappedArray[Int]) => {
    // a peer without a metric in this group counts as 0.0
    val zScoreMetrics: Array[Double] = xs.toArray.map(x => fsMap.getOrElse(x, 0.0))
    val ds = new DescriptiveStatistics(zScoreMetrics)
    val mean = ds.getMean()
    val sd = Math.sqrt(ds.getPopulationVariance())
    val zScore = if (sd == 0.0) {0.0} else {(fsMap.getOrElse(currId, 0.0) - mean) / sd}
    zScore
  })
  val idStatsWithZscoreDS =
    fir.withColumn("zScoreValue", round(funUdf(fir("id"), fir("peers")), roundTo))
  return idStatsWithZscoreDS
}

val typesComb = infoDS.select("type1", "type2").dropDuplicates.collect
// one z-score DataFrame per (type1, type2) combination, unioned at the end
val zScoreDS =
  typesComb.map(ept => {
    val et = ept.getString(0)
    val pt = ept.getString(1)
    val idMetricDS = infoDS.where($"type1" === lit(et) && $"type2" === lit(pt)).select($"id", $"metric")
    val zScoreDS = calculateZScoreGivenPeers(idMetricDS, idPeersDS)(spark).select($"id", $"metric", $"peers", $"zScoreValue").withColumn("type1", lit(et)).withColumn("type2", lit(pt))
    zScoreDS
  }).reduce(_.union(_))
| id| peers|
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
| id|type1|type2|metric|
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
scala> typesComb
res3: Array[org.apache.spark.sql.Row] = Array([A,X], [B,Y], [B,X], [A,Y])
| id|metric| peers|zScoreValue|type1|type2|
| 1| 10.0|[1, 2, 3]| -0.39| A| X|
| 3| 40.0|[3, 1, 2]| 1.37| A| X|
| 5| 20.0|[5, 4, 6]| 1.41| A| X|
| 1| 40.0|[1, 2, 3]| 0.98| B| Y|
| 2| 30.0|[2, 1, 6]| 0.27| B| Y|
| 4| 10.0|[4, 5, 6]| 0.71| B| Y|
| 6| 10.0|[6, 1, 2]| -1.34| B| Y|
| 1| 30.0|[1, 2, 3]| 1.07| B| X|
| 2| 20.0|[2, 1, 6]| 0.27| B| X|
| 5| 30.0|[5, 4, 6]| 1.41| B| X|
| 1| 20.0|[1, 2, 3]| 1.22| A| Y|
| 2| 10.0|[2, 1, 6]| -1.07| A| Y|
| 6| 40.0|[6, 1, 2]| 1.34| A| Y|
I solved it; here is my answer. On my real datasets this solution ran significantly faster (in less than 1/10th of the time) than the solution in my question.
It avoids the collect to the driver, as well as the map over the type combinations and the union of datasets in the reduce.
val idPeersDS = Seq(
(1, Seq(1,2,3)),
(2, Seq(2,1,6)),
(3, Seq(3,1,2)),
(4, Seq(4,5,6)),
(5, Seq(5,4,6)),
(6, Seq(6,1,2))
).toDF.select($"_1" as "id", $"_2" as "peers")
val infoDS = Seq(
(1, "A", "X", 10),
(1, "A", "Y", 20),
(1, "B", "X", 30),
(1, "B", "Y", 40),
(2, "A", "Y", 10),
(2, "B", "X", 20),
(2, "B", "Y", 30),
(3, "A", "X", 40),
(4, "B", "Y", 10),
(5, "A", "X", 20),
(5, "B", "X", 30),
(6, "A", "Y", 40),
(6, "B", "Y", 10)
).toDF.select($"_1" as "id", $"_2" as "type1", $"_3" as "type2", $"_4" cast "double" as "metric")
// Exiting paste mode, now interpreting.
idPeersDS: org.apache.spark.sql.DataFrame = [id: int, peers: array<int>]
infoDS: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 2 more fields]
| id| peers|
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
| id|type1|type2|metric|
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
scala> val infowithpeers = infoDS.join(idPeersDS, "id")
infowithpeers: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 3 more fields]
| id|type1|type2|metric| peers|
| 1| A| X| 10.0|[1, 2, 3]|
| 1| A| Y| 20.0|[1, 2, 3]|
| 1| B| X| 30.0|[1, 2, 3]|
| 1| B| Y| 40.0|[1, 2, 3]|
| 2| A| Y| 10.0|[2, 1, 6]|
| 2| B| X| 20.0|[2, 1, 6]|
| 2| B| Y| 30.0|[2, 1, 6]|
| 3| A| X| 40.0|[3, 1, 2]|
| 4| B| Y| 10.0|[4, 5, 6]|
| 5| A| X| 20.0|[5, 4, 6]|
| 5| B| X| 30.0|[5, 4, 6]|
| 6| A| Y| 40.0|[6, 1, 2]|
| 6| B| Y| 10.0|[6, 1, 2]|
scala> val joinMap = udf { values: Seq[Map[Int,Double]] => values.flatten.toMap }
joinMap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(IntegerType,DoubleType,false),Some(List(ArrayType(MapType(IntegerType,DoubleType,false),true))))
scala> val zScoreCal = udf { (metric: Double, zScoreMetrics: WrappedArray[Double]) =>
| val ds = new DescriptiveStatistics(zScoreMetrics.toArray)
| val mean = ds.getMean()
| val sd = Math.sqrt(ds.getPopulationVariance())
| val zScore = if (sd == 0.0) {0.0} else {(metric - mean) / sd}
| zScore
| }
zScoreCal: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(DoubleType, ArrayType(DoubleType,false))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
val infowithpeersidmetric = infowithpeers.withColumn("idmetric", map($"id",$"metric"))
val idsingrpdf = infowithpeersidmetric.groupBy("type1","type2").agg(joinMap(collect_list(map($"id", $"metric"))) as "idsingrp")
val metricsMap = udf { (peers: Seq[Int], values: Map[Int,Double]) => { peers.map(p => values.getOrElse(p, 0.0)) } }
// Exiting paste mode, now interpreting.
infowithpeersidmetric: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 4 more fields]
idsingrpdf: org.apache.spark.sql.DataFrame = [type1: string, type2: string ... 1 more field]
metricsMap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(DoubleType,false),Some(List(ArrayType(IntegerType,false), MapType(IntegerType,DoubleType,false))))
scala> val infoWithMap = infowithpeers.join(idsingrpdf, Seq("type1","type2")).withColumn("zScoreMetrics", metricsMap($"peers", $"idsingrp")).withColumn("zscore", round(zScoreCal($"metric",$"zScoreMetrics"),2))
infoWithMap: org.apache.spark.sql.DataFrame = [type1: string, type2: string ... 6 more fields]
|type1|type2| id|metric| peers| idsingrp| zScoreMetrics|zscore|
| A| X| 1| 10.0|[1, 2, 3]|[3 -> 40.0, 5 -> ...| [10.0, 0.0, 40.0]| -0.39|
| A| Y| 1| 20.0|[1, 2, 3]|[2 -> 10.0, 6 -> ...| [20.0, 10.0, 0.0]| 1.22|
| B| X| 1| 30.0|[1, 2, 3]|[1 -> 30.0, 2 -> ...| [30.0, 20.0, 0.0]| 1.07|
| B| Y| 1| 40.0|[1, 2, 3]|[4 -> 10.0, 1 -> ...| [40.0, 30.0, 0.0]| 0.98|
| A| Y| 2| 10.0|[2, 1, 6]|[2 -> 10.0, 6 -> ...|[10.0, 20.0, 40.0]| -1.07|
| B| X| 2| 20.0|[2, 1, 6]|[1 -> 30.0, 2 -> ...| [20.0, 30.0, 0.0]| 0.27|
| B| Y| 2| 30.0|[2, 1, 6]|[4 -> 10.0, 1 -> ...|[30.0, 40.0, 10.0]| 0.27|
| A| X| 3| 40.0|[3, 1, 2]|[3 -> 40.0, 5 -> ...| [40.0, 10.0, 0.0]| 1.37|
| B| Y| 4| 10.0|[4, 5, 6]|[4 -> 10.0, 1 -> ...| [10.0, 0.0, 10.0]| 0.71|
| A| X| 5| 20.0|[5, 4, 6]|[3 -> 40.0, 5 -> ...| [20.0, 0.0, 0.0]| 1.41|
| B| X| 5| 30.0|[5, 4, 6]|[1 -> 30.0, 2 -> ...| [30.0, 0.0, 0.0]| 1.41|
| A| Y| 6| 40.0|[6, 1, 2]|[2 -> 10.0, 6 -> ...|[40.0, 20.0, 10.0]| 1.34|
| B| Y| 6| 10.0|[6, 1, 2]|[4 -> 10.0, 1 -> ...|[10.0, 40.0, 30.0]| -1.34|
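The idea behind this answer (one id-to-metric map per (type1, type2) group, then a lookup of each row's peers with a default of 0) can be sketched in plain Python to check the numbers. The sketch uses no Spark, and the variable names `group_maps` and `zscores` are illustrative only.

```python
from collections import defaultdict
from statistics import mean, pstdev

peers = {1: [1, 2, 3], 2: [2, 1, 6], 3: [3, 1, 2],
         4: [4, 5, 6], 5: [5, 4, 6], 6: [6, 1, 2]}

# (id, type1, type2, metric) rows of infoDS
info = [
    (1, "A", "X", 10.0), (1, "A", "Y", 20.0), (1, "B", "X", 30.0), (1, "B", "Y", 40.0),
    (2, "A", "Y", 10.0), (2, "B", "X", 20.0), (2, "B", "Y", 30.0),
    (3, "A", "X", 40.0), (4, "B", "Y", 10.0),
    (5, "A", "X", 20.0), (5, "B", "X", 30.0),
    (6, "A", "Y", 40.0), (6, "B", "Y", 10.0),
]

# One id -> metric map per (type1, type2) group, like the idsingrp column
group_maps = defaultdict(dict)
for id_, type1, type2, metric in info:
    group_maps[(type1, type2)][id_] = metric

zscores = {}
for id_, type1, type2, metric in info:
    # Metrics of this id's peers in the same group; missing peers count as 0.0
    values = [group_maps[(type1, type2)].get(p, 0.0) for p in peers[id_]]
    sd = pstdev(values)  # population standard deviation
    zscores[(id_, type1, type2)] = 0.0 if sd == 0.0 else round((metric - mean(values)) / sd, 2)

print(zscores[(1, "A", "X")], zscores[(6, "B", "Y")])
```

Because the per-group map is built once and then only looked up, nothing has to be collected per type combination, which is the same shape as the groupBy plus join above.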
I have this data in a Spark DataFrame:
|sernum|product|state|testDateTime|testResult| msg|
| 8| PA1| 1.0| 1.18| pass|testlog18|
| 7| PA1| 1.0| 1.17| fail|testlog17|
| 6| PA1| 1.0| 1.16| pass|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|
| 4| PA1| 2.0| 1.14| fail|testlog14|
| 3| PA1| 1.0| 1.13| pass|testlog13|
| 2| PA1| 2.0| 1.12| pass|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11|
What I care about are the rows with testResult == "fail"; the hard part is that I need to get the last "pass" message as an extra column, grouped by product + state:
|sernum|product|state|testDateTime|testResult| msg| passMsg|
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
How can I achieve this using DataFrame or SQL?
The trick is to define groups where each group starts with a passed test. Then use window functions again, with group as an additional partition column:
val df = Seq(
(8, "PA1", 1.0, 1.18, "pass", "testlog18"),
(7, "PA1", 1.0, 1.17, "fail", "testlog17"),
(6, "PA1", 1.0, 1.16, "pass", "testlog16"),
(5, "PA1", 1.0, 1.15, "fail", "testlog15"),
(4, "PA1", 2.0, 1.14, "fail", "testlog14"),
(3, "PA1", 1.0, 1.13, "pass", "testlog13"),
(2, "PA1", 2.0, 1.12, "pass", "testlog12"),
(1, "PA1", 1.0, 1.11, "fail", "testlog11")
).toDF("sernum", "product", "state", "testDateTime", "testResult", "msg")
df
  .withColumn("group", sum(when($"testResult" === "pass", 1)).over(Window.partitionBy($"product", $"state").orderBy($"testDateTime")))
  .withColumn("passMsg", when($"group".isNotNull, first($"msg").over(Window.partitionBy($"product", $"state", $"group").orderBy($"testDateTime"))))
  .where($"testResult" === "fail") // keep only the failed tests, as in the output below
  .orderBy($"product", $"state", $"testDateTime")
  .show()
|sernum|product|state|testDateTime|testResult| msg| passMsg|
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
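The grouping idea can be traced without Spark: walking through the rows in testDateTime order and remembering the most recent "pass" message per (product, state) gives the same result. This plain-Python sketch is only meant to illustrate the logic; the names `last_pass` and `failed` are made up for it.

```python
# (sernum, product, state, testDateTime, testResult, msg) rows from the question
rows = [
    (8, "PA1", 1.0, 1.18, "pass", "testlog18"),
    (7, "PA1", 1.0, 1.17, "fail", "testlog17"),
    (6, "PA1", 1.0, 1.16, "pass", "testlog16"),
    (5, "PA1", 1.0, 1.15, "fail", "testlog15"),
    (4, "PA1", 2.0, 1.14, "fail", "testlog14"),
    (3, "PA1", 1.0, 1.13, "pass", "testlog13"),
    (2, "PA1", 2.0, 1.12, "pass", "testlog12"),
    (1, "PA1", 1.0, 1.11, "fail", "testlog11"),
]

last_pass = {}  # (product, state) -> msg of the most recent passed test
failed = []     # (sernum, msg, passMsg) for every failed test

for sernum, product, state, _ts, result, msg in sorted(rows, key=lambda r: r[3]):
    key = (product, state)
    if result == "pass":
        last_pass[key] = msg  # a pass starts a new group
    else:
        failed.append((sernum, msg, last_pass.get(key)))  # None if no earlier pass

failed.sort(key=lambda r: r[0], reverse=True)  # newest first, as in the table
for row in failed:
    print(row)
```

A failed test that has no earlier pass for its product and state, such as sernum 1, ends up with no passMsg, which corresponds to the null in the table.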
An alternative approach is to join each failed log with the passed logs from earlier times and then take the latest "pass" message.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val fDf = df.filter($"testResult" === "fail")
var pDf = df.filter($"testResult" === "pass")
pDf.columns.foreach(x => pDf = pDf.withColumnRenamed(x, "p_"+x))
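val jDf = fDf.join(
    pDf,
    pDf("p_product") === fDf("product") &&
    pDf("p_state") === fDf("state") &&
    fDf("testDateTime") > pDf("p_testDateTime"),
    "left").
  // assumed: rank the matching passes of each failed test, newest first
  withColumn("rnk", row_number().over(Window.partitionBy($"product", $"state", $"testDateTime").orderBy($"p_testDateTime".desc))).
  filter($"rnk" === 1).
  select("msg", "product", "sernum", "state", "testDateTime", "testResult", "p_msg")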
| msg|product|sernum|state|testDateTime|testResult| p_msg|
|testlog14| PA1| 4| 2| 1.14| fail|testlog12|
|testlog11| PA1| 1| 1| 1.11| fail| null|
|testlog15| PA1| 5| 1| 1.15| fail|testlog13|
|testlog17| PA1| 7| 1| 1.17| fail|testlog16|