How do I pass parameters to selectExpr? SparkSQL-Scala

How do I pass parameters to selectExpr? SparkSQL-Scala - apache-spark-sql

:)
When you have a data frame, you can add columns and fill their rows with the method selectExprt
Something like this:
scala> table.show
+------+--------+---------+--------+--------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|
+------+--------+---------+--------+--------+
| OlcM| h|999999999| J| 0|
| zOcQ| r|777777777| J| 1|
| kyGp| t|333333333| J| 2|
| BEuX| A|999999999| F| 3|
scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo")
tabla: org.apache.spark.sql.DataFrame = [idempr: string, tipperrd: string, codperrd: decimal(9,0), tipperrt: string, codperrt: decimal(9,0), Saludo: string]
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hola|
| zOcQ| r|777777777| J| 1| hola|
| kyGp| t|333333333| J| 2| hola|
| BEuX| A|999999999| F| 3| hola|
My point is:
I define strings and call a method which use this String parameter to fill a column in the data frame. But I am not able to do the select expresion get the string (I tried $, +, etc..) . To achieve something like this:
scala> var english = "hello"
scala> def generar_informe(df: DataFrame, tabla: String) {
var selectExpr_df = df.selectExpr(
"TIPPERSCON_BAS as TIP.PERSONA CONTACTABILIDAD",
"CODPERSCON_BAS as COD.PERSONA CONTACTABILIDAD",
"'tabla' as PUNTO DEL FLUJO" )
}
scala> generar_informe(df,english)
.....
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hello|
| zOcQ| r|777777777| J| 1| hello|
| kyGp| t|333333333| J| 2| hello|
| BEuX| A|999999999| F| 3| hello|
I tried:
scala> var result = tabl.selectExpr("A", "B", "$tabla as C")
scala> var abc = tabl.selectExpr("A", "B", ${tabla} as C)
<console>:31: error: not found: value $
var abc = tabl.selectExpr("A", "B", ${tabla} as C)
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
scala> sqlContext.sql("set tabla='hello'")
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
SAME ERROR:
java.lang.RuntimeException: [1.1] failure: identifier expected
${tabla} as C
^
at scala.sys.package$.error(package.scala:27)
Thanks in advance!

Can you try this.
val english = "hello"
generar_informe(data,english).show()
}
def generar_informe(df: DataFrame , english : String)={
df.selectExpr(
"transactionId" , "customerId" , "itemId","amountPaid" , s"""'${english}' as saludo """)
}
This is the output I got.
17/11/02 23:56:44 INFO CodeGenerator: Code generated in 13.857987 ms
+-------------+----------+------+----------+------+
|transactionId|customerId|itemId|amountPaid|saludo|
+-------------+----------+------+----------+------+
| 111| 1| 1| 100.0| hello|
| 112| 2| 2| 505.0| hello|
| 113| 3| 3| 510.0| hello|
| 114| 4| 4| 600.0| hello|
| 115| 1| 2| 500.0| hello|
| 116| 1| 2| 500.0| hello|
| 117| 1| 2| 500.0| hello|
| 118| 1| 2| 500.0| hello|
| 119| 2| 3| 500.0| hello|
| 120| 1| 2| 500.0| hello|
| 121| 1| 4| 500.0| hello|
| 122| 1| 2| 500.0| hello|
| 123| 1| 4| 500.0| hello|
| 124| 1| 2| 500.0| hello|
+-------------+----------+------+----------+------+
17/11/02 23:56:44 INFO SparkContext: Invoking stop() from shutdown hook

Related

Pyspark: how to solve complicated dataframe logic plus join

I have two data frames to work on, the first one looks like this the following df1
df1_schema = StructType([StructField("Date", StringType(), True),\
StructField("store_id", StringType(), True),\
StructField("warehouse_id", StringType(), True),\
StructField("class_id", StringType(), True) ,\
StructField("total_time", IntegerType(), True) ])
df_data = [('2020-08-01','110','1','11010',3),('2020-08-02','110','1','11010',2),\
('2020-08-03','110','1','11010',3),('2020-08-04','110','1','11010',3),\
('2020-08-05','111','1','11010',1),('2020-08-06','111','1','11010',-1)]
rdd = sc.parallelize(df_data)
df1 = sqlContext.createDataFrame(df_data, df1_schema)
df1 = df1.withColumn("Date",to_date("Date", 'yyyy-MM-dd'))
df1.show()
+----------+--------+------------+--------+----------+
| Date|store_id|warehouse_id|class_id|total_time|
+----------+--------+------------+--------+----------+
|2020-08-01| 110| 1| 11010| 3|
|2020-08-02| 110| 1| 11010| 2|
|2020-08-03| 110| 1| 11010| 3|
|2020-08-04| 110| 1| 11010| 3|
|2020-08-05| 111| 1| 11010| 1|
|2020-08-06| 111| 1| 11010| -1|
+----------+--------+------------+--------+----------+
I calculated something called arrival_date
#To calculate the arrival_date
#logic : add the Date + total_time so in first row, 2020-08-01 +3 would give me 2020-08-04
#if total_time is -1 then return blank
df1= df1.withColumn('arrival_date', F.when(col('total_time') != -1, expr("date_add(date, total_time)"))
.otherwise(''))
+----------+--------+------------+--------+----------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|
+----------+--------+------------+--------+----------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06|
|2020-08-06| 111| 1| 11010| -1| |
+----------+--------+------------+--------+----------+------------+
and what I want to calculate is this..
#to calculate the transit_date
#if arrival_date is same, ex) 2020-08-04 is repeated 2 or more times, then take min("Date")
#which will be 2020-08-01 otherwise just return the Date ex) 2020-08-07 would just return 2020-08-04
#we need to care about cloth_id too, we have arrival_date = 2020-08-06 repeated 2 times as well but since
#if one of store_id or warehouse_id is different we treat them separately. so at arrival_date = 2020-08-06 at date = 2020-08-03,
##we must return 2020-08-03
#so we treat them separately when one of (store_id, warehouse_id ) is different.
#*Note* we dont care about class_id, its not effective.
#if arrival_date = blank then leave it as blank..
#so our df would look something like this.
+----------+--------+------------+--------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| 2020-08-01|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| 2020-08-01|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| 2020-08-03|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| 2020-08-04|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| 2020-08-05|
|2020-08-06| 111| 1| 11010| -1| | |
+----------+--------+------------+--------+----------+------------+------------+
Next, I have df2 looks like the following..
#we have another dataframe call it df2
df2_schema = StructType([StructField("Date", StringType(), True),\
StructField("store_id", StringType(), True),\
StructField("warehouse_id", StringType(), True),\
StructField("cloth_id", StringType(), True),\
StructField("class_id", StringType(), True) ,\
StructField("type", StringType(), True),\
StructField("quantity", IntegerType(), True)])
df_data = [('2020-08-01','110','1','M_1','11010','R',5),('2020-08-01','110','1','M_1','11010','R',2),\
('2020-08-02','110','1','M_1','11010','C',3),('2020-08-03','110','1','M_1','11010','R',1),\
('2020-08-04','110','1','M_1','11010','R',3),('2020-08-05','111','1','M_2','11010','R',5)]
rdd = sc.parallelize(df_data)
df2 = sqlContext.createDataFrame(df_data, df2_schema)
df2 = df2.withColumn("Date",to_date("Date", 'yyyy-MM-dd'))
df2.show()
+----------+--------+------------+--------+--------+----+--------+
| Date|store_id|warehouse_id|cloth_id|class_id|type|quantity|
+----------+--------+------------+--------+--------+----+--------+
|2020-08-01| 110| 1| M_1| 11010| R| 5|
|2020-08-01| 110| 1| M_1| 11010| R| 2|
|2020-08-02| 110| 1| M_1| 11010| C| 3|
|2020-08-03| 110| 1| M_1| 11010| R| 1|
|2020-08-04| 110| 1| M_1| 11010| R| 3|
|2020-08-05| 111| 1| M_2| 11010| R| 5|
+----------+--------+------------+--------+--------+----+--------+
and I calculated quantity2, this is just sum of quantity where type=R
df2 =df2.groupBy('Date','store_id','warehouse_id','cloth_id','class_id')\
.agg( F.sum(F.when(col('type')=='R', col('quantity'))\
.otherwise(col('quantity'))).alias('quantity2')).orderBy('Date')
+----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| M_1| 11010| 3|
|2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| M_2| 11010| 5|
+----------+--------+------------+--------+--------+---------+
Now I have df1, and df2. I want to join such that It will look something like this...
I tried something like this
df4 = df1.select('store_id','warehouse_id','class_id','arrival_date','transit_date')
df4= df4.filter(" transit_date != '' ")
df4=df4.withColumnRenamed('arrival_date', 'date')
df3 = df2.join(df1, on=['Date','store_id','warehouse_id','class_id'],how='inner').orderBy('Date')
df5 = df3.join(df4, on=['Date','store_id','warehouse_id','class_id'], how='left').orderBy('Date')
but I dont think this is the correct approach.... the result df should look like below..
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
note that the transit_date went to where Date = arrival_date of course the null is replaced by blank.
LASTLY, if today is 2020-08-04, then look at where arrival_date == 2020-08-04 and sum up the quantity and place it at today. so.... It will look like this... where the store_id = 111, it will have separate date. not shown here.. so logic needs to make sense when store_id = 111 as well.. i've just shown the example where store_id = 110

From my understanding about your question and where you already have with the following df1 and df2:
df1.orderBy('Date').show() df2.orderBy('Date').show()
+----------+--------+------------+--------+----------+------------+ +----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date| | Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+----------+------------+ +----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| |2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| |2020-08-02| 110| 1| M_1| 11010| 3|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| |2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| |2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| |2020-08-05| 111| 1| M_2| 11010| 5|
|2020-08-06| 111| 1| 11010| -1| | +----------+--------+------------+--------+--------+---------+
+----------+--------+------------+--------+----------+------------+
you can try the following 5 steps:
Step-1: Set up the list of column names grp_cols for join:
from pyspark.sql import functions as F
grp_cols = ["Date", "store_id", "warehouse_id", "class_id"]
Step-2: create df3 containing transit_date which is the min Date on each combination of arrival_date, store_id, warehouse_id and class_id:
df3 = df1.filter('total_time != -1') \
.groupby("arrival_date", "store_id", "warehouse_id", "class_id") \
.agg(F.min('Date').alias('transit_date')) \
.withColumnRenamed("arrival_date", "Date")
df3.orderBy('Date').show()
+----------+--------+------------+--------+------------+
| Date|store_id|warehouse_id|class_id|transit_date|
+----------+--------+------------+--------+------------+
|2020-08-04| 110| 1| 11010| 2020-08-01|
|2020-08-06| 111| 1| 11010| 2020-08-05|
|2020-08-06| 110| 1| 11010| 2020-08-03|
|2020-08-07| 110| 1| 11010| 2020-08-04|
+----------+--------+------------+--------+------------+
Step-3: set up df4 by join df2 with df1 and left join df3 using grp_cols, persist df4
df4 = df2.join(df1, grp_cols).join(df3, grp_cols, "left") \
.withColumn('transit_date', F.when(F.col('total_time') != -1, F.col("transit_date")).otherwise('')) \
.persist()
_ = df4.count()
df4.orderBy('Date').show()
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+
Step-4: calculate sum(quantity2) as want from df4 for each arrival_date + store_id + warehouse_id + class_id + cloth_id
df5 = df4 \
.groupby("arrival_date", "store_id", "warehouse_id", "class_id", "cloth_id") \
.agg(F.sum("quantity2").alias("want")) \
.withColumnRenamed("arrival_date", "Date")
df5.orderBy('Date').show()
+----------+--------+------------+--------+--------+----+
| Date|store_id|warehouse_id|class_id|cloth_id|want|
+----------+--------+------------+--------+--------+----+
|2020-08-04| 110| 1| 11010| M_1| 10|
|2020-08-06| 111| 1| 11010| M_2| 5|
|2020-08-06| 110| 1| 11010| M_1| 1|
|2020-08-07| 110| 1| 11010| M_1| 3|
+----------+--------+------------+--------+--------+----+
Step-5: create the final dataframe by left join df4 with df5
df_new = df4.join(df5, grp_cols+["cloth_id"], "left").fillna(0, subset=['want'])
df_new.orderBy("Date").show()
+----------+--------+------------+--------+--------+---------+----------+------------+------------+----+
| Date|store_id|warehouse_id|class_id|cloth_id|quantity2|total_time|arrival_date|transit_date|want|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+----+
|2020-08-01| 110| 1| 11010| M_1| 7| 3| 2020-08-04| null| 0|
|2020-08-02| 110| 1| 11010| M_1| 3| 2| 2020-08-04| null| 0|
|2020-08-03| 110| 1| 11010| M_1| 1| 3| 2020-08-06| null| 0|
|2020-08-04| 110| 1| 11010| M_1| 3| 3| 2020-08-07| 2020-08-01| 10|
|2020-08-05| 111| 1| 11010| M_2| 5| 1| 2020-08-06| null| 0|
+----------+--------+------------+--------+--------+---------+----------+------------+------------+----+
df4.unpersist()

Here is for the df1,
from pyspark.sql import Window
from pyspark.sql.functions import *
from pyspark.sql.types import *
import builtins as p
df1_schema = StructType(
[
StructField('Date', StringType(), True),
StructField('store_id', StringType(), True),
StructField('warehouse_id', StringType(), True),
StructField('class_id', StringType(), True),
StructField('total_time', IntegerType(), True)
]
)
df1_data = [
('2020-08-01','110','1','11010',3),
('2020-08-02','110','1','11010',2),
('2020-08-03','110','1','11010',3),
('2020-08-04','110','1','11010',3),
('2020-08-05','111','1','11010',1),
('2020-08-06','111','1','11010',-1)
]
df1 = spark.createDataFrame(df1_data, df1_schema)
df1 = df1.withColumn('Date', to_date('Date'))
df1 = df1.withColumn('arrival_date', when(col('total_time') != -1, expr("date_add(date, total_time)")).otherwise(''))
w = Window.partitionBy('arrival_date', 'store_id', 'warehouse_id').orderBy('Date')
df1 = df1.withColumn('transit_date', when(col('total_time') != -1, first('Date').over(w)).otherwise('')).orderBy('Date')
df1.show()
+----------+--------+------------+--------+----------+------------+------------+
| Date|store_id|warehouse_id|class_id|total_time|arrival_date|transit_date|
+----------+--------+------------+--------+----------+------------+------------+
|2020-08-01| 110| 1| 11010| 3| 2020-08-04| 2020-08-01|
|2020-08-02| 110| 1| 11010| 2| 2020-08-04| 2020-08-01|
|2020-08-03| 110| 1| 11010| 3| 2020-08-06| 2020-08-03|
|2020-08-04| 110| 1| 11010| 3| 2020-08-07| 2020-08-04|
|2020-08-05| 111| 1| 11010| 1| 2020-08-06| 2020-08-05|
|2020-08-06| 111| 1| 11010| -1| | |
+----------+--------+------------+--------+----------+------------+------------+
and df2 as you did,
df2_schema = StructType(
[
StructField('Date', StringType(), True),
StructField('store_id', StringType(), True),
StructField('warehouse_id', StringType(), True),
StructField('cloth_id', StringType(), True),
StructField('class_id', StringType(), True),
StructField('type', StringType(), True),
StructField('quantity', IntegerType(), True)
]
)
df2_data = [
('2020-08-01','110','1','M_1','11010','R',5),
('2020-08-01','110','1','M_1','11010','R',2),
('2020-08-02','110','1','M_1','11010','C',3),
('2020-08-03','110','1','M_1','11010','R',1),
('2020-08-04','110','1','M_1','11010','R',3),
('2020-08-05','111','1','M_2','11010','R',5)
]
df2 = spark.createDataFrame(df2_data, df2_schema)
df2 = df2.withColumn('Date', to_date('Date'))
df2 = df2.groupBy('Date', 'store_id', 'warehouse_id', 'cloth_id', 'class_id') \
.agg(
sum(
when(col('type') == 'R', col('quantity')).otherwise(0)
).alias('quantity2')
).orderBy('Date')
df2.show()
+----------+--------+------------+--------+--------+---------+
| Date|store_id|warehouse_id|cloth_id|class_id|quantity2|
+----------+--------+------------+--------+--------+---------+
|2020-08-01| 110| 1| M_1| 11010| 7|
|2020-08-02| 110| 1| M_1| 11010| 0|
|2020-08-03| 110| 1| M_1| 11010| 1|
|2020-08-04| 110| 1| M_1| 11010| 3|
|2020-08-05| 111| 1| M_2| 11010| 5|
+----------+--------+------------+--------+--------+---------+
and finally the join result.
df3 = df1.filter('total_time != -1') \
.join(df2, on=['Date', 'store_id', 'warehouse_id', 'class_id'], how='left') \
.drop('Date', 'total_time', 'cloth_id') \
.withColumnRenamed('arrival_date', 'Date')
df4 = df1.drop('transit_date') \
.join(df3, on=['Date', 'store_id', 'warehouse_id', 'class_id'], how='left') \
.groupBy('Date', 'store_id', 'warehouse_id', 'class_id', 'arrival_date', 'transit_date') \
.agg(sum('quantity2').alias('want')) \
.orderBy('Date')
df4.show()
+----------+--------+------------+--------+------------+------------+----+
| Date|store_id|warehouse_id|class_id|arrival_date|transit_date|want|
+----------+--------+------------+--------+------------+------------+----+
|2020-08-01| 110| 1| 11010| 2020-08-04| null|null|
|2020-08-02| 110| 1| 11010| 2020-08-04| null|null|
|2020-08-03| 110| 1| 11010| 2020-08-06| null|null|
|2020-08-04| 110| 1| 11010| 2020-08-07| 2020-08-01| 7|
|2020-08-05| 111| 1| 11010| 2020-08-06| null|null|
|2020-08-06| 111| 1| 11010| | 2020-08-05| 5|
+----------+--------+------------+--------+------------+------------+----+

How to compute a cumulative sum under a limit with Spark?

After several tries and some research, I'm stuck on trying to solve the following problem with Spark.
I have a Dataframe of elements with a priority and a quantity.
+------+-------+--------+---+
|family|element|priority|qty|
+------+-------+--------+---+
| f1| elmt 1| 1| 20|
| f1| elmt 2| 2| 40|
| f1| elmt 3| 3| 10|
| f1| elmt 4| 4| 50|
| f1| elmt 5| 5| 40|
| f1| elmt 6| 6| 10|
| f1| elmt 7| 7| 20|
| f1| elmt 8| 8| 10|
+------+-------+--------+---+
I have a fixed limit quantity :
+------+--------+
|family|limitQty|
+------+--------+
| f1| 100|
+------+--------+
I want to mark as "ok" the elements whose the cumulative sum is under the limit. Here is the expected result :
+------+-------+--------+---+---+
|family|element|priority|qty| ok|
+------+-------+--------+---+---+
| f1| elmt 1| 1| 20| 1| -> 20 < 100 => ok
| f1| elmt 2| 2| 40| 1| -> 20 + 40 < 100 => ok
| f1| elmt 3| 3| 10| 1| -> 20 + 40 + 10 < 100 => ok
| f1| elmt 4| 4| 50| 0| -> 20 + 40 + 10 + 50 > 100 => ko
| f1| elmt 5| 5| 40| 0| -> 20 + 40 + 10 + 40 > 100 => ko
| f1| elmt 6| 6| 10| 1| -> 20 + 40 + 10 + 10 < 100 => ok
| f1| elmt 7| 7| 20| 1| -> 20 + 40 + 10 + 10 + 20 < 100 => ok
| f1| elmt 8| 8| 10| 0| -> 20 + 40 + 10 + 10 + 20 + 10 > 100 => ko
+------+-------+--------+---+---+
I try to solve if with a cumulative sum :
initDF
.join(limitQtyDF, Seq("family"), "left_outer")
.withColumn("cumulSum", sum($"qty").over(Window.partitionBy("family").orderBy("priority")))
.withColumn("ok", when($"cumulSum" <= $"limitQty", 1).otherwise(0))
.drop("cumulSum", "limitQty")
But it's not enough because the elements after the element that is up to the limit are not take into account.
I can't find a way to solve it with Spark. Do you have an idea ?
Here is the corresponding Scala code :
val sparkSession = SparkSession.builder()
.master("local[*]")
.getOrCreate()
import sparkSession.implicits._
val initDF = Seq(
("f1", "elmt 1", 1, 20),
("f1", "elmt 2", 2, 40),
("f1", "elmt 3", 3, 10),
("f1", "elmt 4", 4, 50),
("f1", "elmt 5", 5, 40),
("f1", "elmt 6", 6, 10),
("f1", "elmt 7", 7, 20),
("f1", "elmt 8", 8, 10)
).toDF("family", "element", "priority", "qty")
val limitQtyDF = Seq(("f1", 100)).toDF("family", "limitQty")
val expectedDF = Seq(
("f1", "elmt 1", 1, 20, 1),
("f1", "elmt 2", 2, 40, 1),
("f1", "elmt 3", 3, 10, 1),
("f1", "elmt 4", 4, 50, 0),
("f1", "elmt 5", 5, 40, 0),
("f1", "elmt 6", 6, 10, 1),
("f1", "elmt 7", 7, 20, 1),
("f1", "elmt 8", 8, 10, 0)
).toDF("family", "element", "priority", "qty", "ok").show()
Thank you for your help !

The solution is shown below:
scala> initDF.show
+------+-------+--------+---+
|family|element|priority|qty|
+------+-------+--------+---+
| f1| elmt 1| 1| 20|
| f1| elmt 2| 2| 40|
| f1| elmt 3| 3| 10|
| f1| elmt 4| 4| 50|
| f1| elmt 5| 5| 40|
| f1| elmt 6| 6| 10|
| f1| elmt 7| 7| 20|
| f1| elmt 8| 8| 10|
+------+-------+--------+---+
scala> val df1 = initDF.groupBy("family").agg(collect_list("qty").as("comb_qty"), collect_list("priority").as("comb_prior"), collect_list("element").as("comb_elem"))
df1: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 2 more fields]
scala> df1.show
+------+--------------------+--------------------+--------------------+
|family| comb_qty| comb_prior| comb_elem|
+------+--------------------+--------------------+--------------------+
| f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...|
+------+--------------------+--------------------+--------------------+
scala> val df2 = df1.join(limitQtyDF, df1("family") === limitQtyDF("family")).drop(limitQtyDF("family"))
df2: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 3 more fields]
scala> df2.show
+------+--------------------+--------------------+--------------------+--------+
|family| comb_qty| comb_prior| comb_elem|limitQty|
+------+--------------------+--------------------+--------------------+--------+
| f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...| 100|
+------+--------------------+--------------------+--------------------+--------+
scala> def validCheck = (qty: Seq[Int], limit: Int) => {
| var sum = 0
| qty.map(elem => {
| if (elem + sum <= limit) {
| sum = sum + elem
| 1}else{
| 0
| }})}
validCheck: (scala.collection.mutable.Seq[Int], Int) => scala.collection.mutable.Seq[Int]
scala> val newUdf = udf(validCheck)
newUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(IntegerType,false),Some(List(ArrayType(IntegerType,false), IntegerType)))
val df3 = df2.withColumn("valid", newUdf(col("comb_qty"),col("limitQty"))).drop("limitQty")
df3: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 3 more fields]
scala> df3.show
+------+--------------------+--------------------+--------------------+--------------------+
|family| comb_qty| comb_prior| comb_elem| valid|
+------+--------------------+--------------------+--------------------+--------------------+
| f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...|[1, 1, 1, 0, 0, 1...|
+------+--------------------+--------------------+--------------------+--------------------+
scala> val myUdf = udf((qty: Seq[Int], prior: Seq[Int], elem: Seq[String], valid: Seq[Int]) => {
| elem zip prior zip qty zip valid map{
| case (((a,b),c),d) => (a,b,c,d)}
| }
| )
scala> val df4 = df3.withColumn("combined", myUdf(col("comb_qty"),col("comb_prior"),col("comb_elem"),col("valid")))
df4: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 4 more fields]
scala> val df5 = df4.drop("comb_qty","comb_prior","comb_elem","valid")
df5: org.apache.spark.sql.DataFrame = [family: string, combined: array<struct<_1:string,_2:int,_3:int,_4:int>>]
scala> df5.show(false)
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|family|combined |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|f1 |[[elmt 1, 1, 20, 1], [elmt 2, 2, 40, 1], [elmt 3, 3, 10, 1], [elmt 4, 4, 50, 0], [elmt 5, 5, 40, 0], [elmt 6, 6, 10, 1], [elmt 7, 7, 20, 1], [elmt 8, 8, 10, 0]]|
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
scala> val df6 = df5.withColumn("combined",explode(col("combined")))
df6: org.apache.spark.sql.DataFrame = [family: string, combined: struct<_1: string, _2: int ... 2 more fields>]
scala> df6.show
+------+------------------+
|family| combined|
+------+------------------+
| f1|[elmt 1, 1, 20, 1]|
| f1|[elmt 2, 2, 40, 1]|
| f1|[elmt 3, 3, 10, 1]|
| f1|[elmt 4, 4, 50, 0]|
| f1|[elmt 5, 5, 40, 0]|
| f1|[elmt 6, 6, 10, 1]|
| f1|[elmt 7, 7, 20, 1]|
| f1|[elmt 8, 8, 10, 0]|
+------+------------------+
scala> val df7 = df6.select("family", "combined._1", "combined._2", "combined._3", "combined._4").withColumnRenamed("_1","element").withColumnRenamed("_2","priority").withColumnRenamed("_3", "qty").withColumnRenamed("_4","ok")
df7: org.apache.spark.sql.DataFrame = [family: string, element: string ... 3 more fields]
scala> df7.show
+------+-------+--------+---+---+
|family|element|priority|qty| ok|
+------+-------+--------+---+---+
| f1| elmt 1| 1| 20| 1|
| f1| elmt 2| 2| 40| 1|
| f1| elmt 3| 3| 10| 1|
| f1| elmt 4| 4| 50| 0|
| f1| elmt 5| 5| 40| 0|
| f1| elmt 6| 6| 10| 1|
| f1| elmt 7| 7| 20| 1|
| f1| elmt 8| 8| 10| 0|
+------+-------+--------+---+---+
Let me know if it helps!!

Another way to do it will be an RDD based approach by iterating row by row.
var bufferRow: collection.mutable.Buffer[Row] = collection.mutable.Buffer.empty[Row]
var tempSum: Double = 0
val iterator = df.collect.iterator
while(iterator.hasNext){
val record = iterator.next()
val y = record.getAs[Integer]("qty")
tempSum = tempSum + y
print(record)
if (tempSum <= 100.0 ) {
bufferRow = bufferRow ++ Seq(transformRow(record,1))
}
else{
bufferRow = bufferRow ++ Seq(transformRow(record,0))
tempSum = tempSum - y
}
}
Defining transformRow function which is used to add a column to a row.
def transformRow(row: Row,flag : Int): Row = Row.fromSeq(row.toSeq ++ Array[Integer](flag))
Next thing to do will be adding an additional column to the schema.
val newSchema = StructType(df.schema.fields ++ Array(StructField("C_Sum", IntegerType, false))
Followed by creating a new dataframe.
val outputdf = spark.createDataFrame(spark.sparkContext.parallelize(bufferRow.toSeq),newSchema)
Output Dataframe :
+------+-------+--------+---+-----+
|family|element|priority|qty|C_Sum|
+------+-------+--------+---+-----+
| f1| elmt1| 1| 20| 1|
| f1| elmt2| 2| 40| 1|
| f1| elmt3| 3| 10| 1|
| f1| elmt4| 4| 50| 0|
| f1| elmt5| 5| 40| 0|
| f1| elmt6| 6| 10| 1|
| f1| elmt7| 7| 20| 1|
| f1| elmt8| 8| 10| 0|
+------+-------+--------+---+-----+

I am new to Spark so this solution may not be optimal. I am assuming the value of 100 is an input to the program here. In that case:
case class Frame(family:String, element : String, priority : Int, qty :Int)
import scala.collection.JavaConverters._
val ans = df.as[Frame].toLocalIterator
.asScala
.foldLeft((Seq.empty[Int],0))((acc,a) =>
if(acc._2 + a.qty <= 100) (acc._1 :+ a.priority, acc._2 + a.qty) else acc)._1
df.withColumn("OK" , when($"priority".isin(ans :_*), 1).otherwise(0)).show
results in:
+------+-------+--------+---+--------+
|family|element|priority|qty|OK |
+------+-------+--------+---+--------+
| f1| elmt 1| 1| 20| 1|
| f1| elmt 2| 2| 40| 1|
| f1| elmt 3| 3| 10| 1|
| f1| elmt 4| 4| 50| 0|
| f1| elmt 5| 5| 40| 0|
| f1| elmt 6| 6| 10| 1|
| f1| elmt 7| 7| 20| 1|
| f1| elmt 8| 8| 10| 0|
+------+-------+--------+---+--------+
The idea is simply to get a Scala iterator and extract the participating priority values from it and then use those values to filter out the participating rows. Given this solution gathers all the data in memory on one machine, it could run into memory problems if the dataframe size is too large to fit in memory.

Cumulative sum for each group
from pyspark.sql.window import Window as window
from pyspark.sql.types import IntegerType,StringType,FloatType,StructType,StructField,DateType
schema = StructType() \
.add(StructField("empno",IntegerType(),True)) \
.add(StructField("ename",StringType(),True)) \
.add(StructField("job",StringType(),True)) \
.add(StructField("mgr",StringType(),True)) \
.add(StructField("hiredate",DateType(),True)) \
.add(StructField("sal",FloatType(),True)) \
.add(StructField("comm",StringType(),True)) \
.add(StructField("deptno",IntegerType(),True))
emp = spark.read.csv('data/emp.csv',schema)
dept_partition = window.partitionBy(emp.deptno).orderBy(emp.sal)
emp_win = emp.withColumn("dept_cum_sal",
f.sum(emp.sal).over(dept_partition.rowsBetween(window.unboundedPreceding, window.currentRow)))
emp_win.show()
Results appear like below:
+-----+------+---------+----+----------+------+-------+------+------------
+
|empno| ename| job| mgr| hiredate| sal| comm|deptno|dept_cum_sal|
+-----+------+---------+----+----------+------+-------+------+------------
+
| 7369| SMITH| CLERK|7902|1980-12-17| 800.0| null| 20| 800.0|
| 7876| ADAMS| CLERK|7788|1983-01-12|1100.0| null| 20| 1900.0|
| 7566| JONES| MANAGER|7839|1981-04-02|2975.0| null| 20| 4875.0|
| 7788| SCOTT| ANALYST|7566|1982-12-09|3000.0| null| 20| 7875.0|
| 7902| FORD| ANALYST|7566|1981-12-03|3000.0| null| 20| 10875.0|
| 7934|MILLER| CLERK|7782|1982-01-23|1300.0| null| 10| 1300.0|
| 7782| CLARK| MANAGER|7839|1981-06-09|2450.0| null| 10| 3750.0|
| 7839| KING|PRESIDENT|null|1981-11-17|5000.0| null| 10| 8750.0|
| 7900| JAMES| CLERK|7698|1981-12-03| 950.0| null| 30| 950.0|
| 7521| WARD| SALESMAN|7698|1981-02-22|1250.0| 500.00| 30| 2200.0|
| 7654|MARTIN| SALESMAN|7698|1981-09-28|1250.0|1400.00| 30| 3450.0|
| 7844|TURNER| SALESMAN|7698|1981-09-08|1500.0| 0.00| 30| 4950.0|
| 7499| ALLEN| SALESMAN|7698|1981-02-20|1600.0| 300.00| 30| 6550.0|
| 7698| BLAKE| MANAGER|7839|1981-05-01|2850.0| null| 30| 9400.0|
+-----+------+---------+----+----------+------+-------+------+------------+

PFA the answer
val initDF = Seq(("f1", "elmt 1", 1, 20),("f1", "elmt 2", 2, 40),("f1", "elmt 3", 3, 10),
("f1", "elmt 4", 4, 50),
("f1", "elmt 5", 5, 40),
("f1", "elmt 6", 6, 10),
("f1", "elmt 7", 7, 20),
("f1", "elmt 8", 8, 10)
).toDF("family", "element", "priority", "qty")
val limitQtyDF = Seq(("f1", 100)).toDF("family", "limitQty")
sc.broadcast(limitQtyDF)
val joinedInitDF=initDF.join(limitQtyDF,Seq("family"),"left")
case class dataResult(family:String,element:String,priority:Int, qty:Int, comutedValue:Int, limitQty:Int,controlOut:String)
val familyIDs=initDF.select("family").distinct.collect.map(_(0).toString).toList
def checkingUDF(inputRows:List[Row])={
var controlVarQty=0
val outputArrayBuffer=collection.mutable.ArrayBuffer[dataResult]()
val setLimit=inputRows.head.getInt(4)
for(inputRow <- inputRows)
{
val currQty=inputRow.getInt(3)
//val outpurForRec=
controlVarQty + currQty match {
case value if value <= setLimit =>
controlVarQty+=currQty
outputArrayBuffer+=dataResult(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),value,setLimit,"ok")
case value =>
outputArrayBuffer+=dataResult(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),value,setLimit,"ko")
}
//outputArrayBuffer+=Row(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),controlVarQty+currQty,setLimit,outpurForRec)
}
outputArrayBuffer.toList
}
val tmpAB=collection.mutable.ArrayBuffer[List[dataResult]]()
for (familyID <- familyIDs) // val familyID="f1"
{
val currentFamily=joinedInitDF.filter(s"family = '${familyID}'").orderBy("element", "priority").collect.toList
tmpAB+=checkingUDF(currentFamily)
}
tmpAB.toSeq.flatMap(x => x).toDF.show(false)
This works for me .
+------+-------+--------+---+------------+--------+----------+
|family|element|priority|qty|comutedValue|limitQty|controlOut|
+------+-------+--------+---+------------+--------+----------+
|f1 |elmt 1 |1 |20 |20 |100 |ok |
|f1 |elmt 2 |2 |40 |60 |100 |ok |
|f1 |elmt 3 |3 |10 |70 |100 |ok |
|f1 |elmt 4 |4 |50 |120 |100 |ko |
|f1 |elmt 5 |5 |40 |110 |100 |ko |
|f1 |elmt 6 |6 |10 |80 |100 |ok |
|f1 |elmt 7 |7 |20 |100 |100 |ok |
|f1 |elmt 8 |8 |10 |110 |100 |ko |
+------+-------+--------+---+------------+--------+----------+
Please do drop unnecessary columns from the output

Window Function Tie breaker on other field to get the Latest Record

I have following data, where the data is partitioned by the stores and month id and ordered by amount in order to get the primary vendor for the store.
I need a tie breaker if the amount is equal between two vendors,
then if one of the tied vendor was the previous months most sales vendor, make that vendor as the most sales vendor for the month.
The look back will increase if there is a tie again. Lag of 1 month will not work if there is tie again. Worst case scenario we will have more duplicates in previous month also.
sample data
val data = Seq((201801, 10941, 115, 80890.44900, 135799.66400),
(201801, 10941, 3, 80890.44900, 135799.66400) ,
(201712, 10941, 3, 517440.74500, 975893.79000),
(201712, 10941, 115, 517440.74500, 975893.79000),
(201711, 10941, 3 , 371501.92100, 574223.52300),
(201710, 10941, 115, 552435.57800, 746912.06700),
(201709, 10941, 115,1523492.60700,1871480.06800),
(201708, 10941, 115,1027698.93600,1236544.50900),
(201707, 10941, 33 ,1469219.86900,1622949.53000)
).toDF("MTH_ID", "store_id" ,"brand" ,"brndSales","TotalSales")
Code:
val window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales")
val res = data.withColumn("rank",rank over window)
Output:
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 1|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 115| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
My rank is 1 for both 1 and 2 records, but my rank should be 1 for second record based on previous month max dollars
I am expecting the following output.
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 2|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 3| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
Should I write a UDAF? Any suggestions would help.

You can do this with 2 windows. First, you will need to use the lag() function to carry over the previous month's sales values so that you can use that in your rank window. here's that part in pyspark:
lag_window = Window.partitionBy("store_id", "brand").orderBy("MTH_ID")
lag_df = data.withColumn("last_month_sales", lag("brndSales").over(lag_window))
Then edit your window to include that new column:
window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales", "last_month_sales")
lag_df.withColumn("rank",rank().over(window)).show()
+------+--------+-----+-----------+-----------+----------------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|last_month_sales|rank|
+------+--------+-----+-----------+-----------+----------------+----+
|201711| 10941| 99| 371501.921| 574223.523| null| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1027698.936| 1|
|201707| 10941| 33|1469219.869| 1622949.53| null| 1|
|201708| 10941| 115|1027698.936|1236544.509| null| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1523492.607| 1|
|201712| 10941| 3| 517440.745| 975893.79| null| 1|
|201801| 10941| 3| 80890.449| 135799.664| 517440.745| 1|
|201801| 10941| 115| 80890.449| 135799.664| 552435.578| 2|
+------+--------+-----+-----------+-----------+----------------+----+

For each row, collect an array of that brands previous sales, in a (Month, Sales) struct.
val storeAndBrandWindow = Window.partitionBy("store_id", "brand").orderBy($"MTH_ID")
val df1 = data.withColumn("brndSales_list", collect_list(struct($"MTH_ID", $"brndSales")).over(storeAndBrandWindow))
Reverse that array with a UDF.
val returnType = ArrayType(StructType(Array(StructField("month", IntegerType), StructField("sales", DoubleType))))
val reverseUdf = udf((list: Seq[Row]) => list.reverse, returnType)
val df2 = df1.withColumn("brndSales_list", reverseUdf($"brndSales_list"))
And then sort by the array.
val window = Window.partitionBy("store_id", "MTH_ID").orderBy($"brndSales_list".desc)
val df3 = df2.withColumn("rank", rank over window).orderBy("MTH_ID", "brand")
df3.show(false)
Result
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|MTH_ID|store_id|brand|brndSales |TotalSales |brndSales_list |rank|
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|201707|10941 |33 |1469219.869|1622949.53 |[[201707, 1469219.869]] |1 |
|201708|10941 |115 |1027698.936|1236544.509|[[201708, 1027698.936]] |1 |
|201709|10941 |115 |1523492.607|1871480.068|[[201709, 1523492.607], [201708, 1027698.936]] |1 |
|201710|10941 |115 |552435.578 |746912.067 |[[201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]] |1 |
|201711|10941 |99 |371501.921 |574223.523 |[[201711, 371501.921]] |1 |
|201712|10941 |3 |517440.745 |975893.79 |[[201712, 517440.745]] |1 |
|201801|10941 |3 |80890.449 |135799.664 |[[201801, 80890.449], [201712, 517440.745]] |1 |
|201801|10941 |115 |80890.449 |135799.664 |[[201801, 80890.449], [201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]]|2 |
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+

Grab last different data on Spark Dataframe?

I have this data on Spark Dataframe
+------+-------+-----+------------+----------+---------+
|sernum|product|state|testDateTime|testResult| msg|
+------+-------+-----+------------+----------+---------+
| 8| PA1| 1.0| 1.18| pass|testlog18|
| 7| PA1| 1.0| 1.17| fail|testlog17|
| 6| PA1| 1.0| 1.16| pass|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|
| 4| PA1| 2.0| 1.14| fail|testlog14|
| 3| PA1| 1.0| 1.13| pass|testlog13|
| 2| PA1| 2.0| 1.12| pass|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11|
+------+-------+-----+------------+----------+---------+
What I care about is the testResult == "fail", and the hard part is that I need the to get the last "pass" message as an extra column GROUP BY product+state:
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult| msg| passMsg|
+------+-------+-----+------------+----------+---------+---------+
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
+------+-------+-----+------------+----------+---------+---------+
How can I achieve this using DataFrame or SQL?

The trick is to define groups where each group starts with a passed test. Then, use again window-functions with group as an additional partition-column:
val df = Seq(
(8, "PA1", 1.0, 1.18, "pass", "testlog18"),
(7, "PA1", 1.0, 1.17, "fail", "testlog17"),
(6, "PA1", 1.0, 1.16, "pass", "testlog16"),
(5, "PA1", 1.0, 1.15, "fail", "testlog15"),
(4, "PA1", 2.0, 1.14, "fail", "testlog14"),
(3, "PA1", 1.0, 1.13, "pass", "testlog13"),
(2, "PA1", 2.0, 1.12, "pass", "testlog12"),
(1, "PA1", 1.0, 1.11, "fail", "testlog11")
).toDF("sernum", "product", "state", "testDateTime", "testResult", "msg")
df
.withColumn("group", sum(when($"testResult" === "pass", 1)).over(Window.partitionBy($"product", $"state").orderBy($"testDateTime")))
.withColumn("passMsg", when($"group".isNotNull,first($"msg").over(Window.partitionBy($"product", $"state", $"group").orderBy($"testDateTime"))))
.drop($"group")
.where($"testResult"==="fail")
.orderBy($"product", $"state", $"testDateTime")
.show()
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult| msg| passMsg|
+------+-------+-----+------------+----------+---------+---------+
| 7| PA1| 1.0| 1.17| fail|testlog17|testlog16|
| 5| PA1| 1.0| 1.15| fail|testlog15|testlog13|
| 4| PA1| 2.0| 1.14| fail|testlog14|testlog12|
| 1| PA1| 1.0| 1.11| fail|testlog11| null|
+------+-------+-----+------------+----------+---------+---------+

This is an alternate approach, by joining the passed logs with failed ones for previous times, and taking the latest "pass" message log.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
Window.partitionBy($"msg").orderBy($"p_testDateTime".desc)
val fDf = df.filter($"testResult" === "fail")
var pDf = df.filter($"testResult" === "pass")
pDf.columns.foreach(x => pDf = pDf.withColumnRenamed(x, "p_"+x))
val jDf = fDf.join(
pDf,
pDf("p_product") === fDf("product") &&
pDf("p_state") === fDf("state") &&
fDf("testDateTime") > pDf("p_testDateTime") ,
"left").
select(fDf("*"),
pDf("p_testResult"),
pDf("p_testDateTime"),
pDf("p_msg")
)
jDf.withColumn(
"rnk",
row_number().
over(window)
).
filter($"rnk" === 1).
drop("rnk","p_testResult","p_testDateTime").
show()
+---------+-------+------+-----+------------+----------+---------+
| msg|product|sernum|state|testDateTime|testResult| p_msg|
+---------+-------+------+-----+------------+----------+---------+
|testlog14| PA1| 4| 2| 1.14| fail|testlog12|
|testlog11| PA1| 1| 1| 1.11| fail| null|
|testlog15| PA1| 5| 1| 1.15| fail|testlog13|
|testlog17| PA1| 7| 1| 1.17| fail|testlog16|
+---------+-------+------+-----+------------+----------+---------+

Find date of each week in from week in Spark Dataframe

I want to add a column with date of each corresponding week in Dataframe (appending friday in each date)
My Dataframe looks like this
+----+------+---------+
|Week| City|sum(Sale)|
+----+------+---------+
| 29|City 2| 72|
| 28|City 3| 48|
| 28|City 2| 19|
| 27|City 2| 16|
| 28|City 1| 84|
| 28|City 4| 72|
| 29|City 4| 39|
| 27|City 3| 42|
| 26|City 3| 68|
| 27|City 1| 89|
| 27|City 4| 104|
| 26|City 2| 19|
| 29|City 3| 27|
+----+------+---------+
I need to convert it as below dataframe
----+------+---------+--------------- |
|Week| City|sum(Sale)|perticular day(dd/mm/yyyy) |
+----+------+---------+---------------|
| 29|City 2| 72|Friday(07/21/2017)|
| 28|City 3| 48|Friday(07/14/2017)|
| 28|City 2| 19|Friday(07/14/2017)|
| 27|City 2| 16|Friday(07/07/2017)|
| 28|City 1| 84|Friday(07/14/2017)|
| 28|City 4| 72|Friday(07/14/2017)|
| 29|City 4| 39|Friday(07/21/2017)|
| 27|City 3| 42|Friday(07/07/2017)|
| 26|City 3| 68|Friday(06/30/2017)|
| 27|City 1| 89|Friday(07/07/2017)|
| 27|City 4| 104|Friday(07/07/2017)|
| 26|City 2| 19|Friday(06/30/2017)|
| 29|City 3| 27|Friday(07/21/2017)|
+----+------+---------+
please help me

You can write a simple UDF and get the date from adding week in it.
Here is the simple example
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(29,"City 2", 72),
(28,"City 3", 48),
(28,"City 2", 19),
(27,"City 2", 16),
(28,"City 1", 84),
(28,"City 4", 72),
(29,"City 4", 39),
(27,"City 3", 42),
(26,"City 3", 68),
(27,"City 1", 89),
(27,"City 4", 104),
(26,"City 2", 19),
(29,"City 3", 27)
)).toDF("week", "city", "sale")
val getDateFromWeek = udf((week : Int) => {
//create a default date for week 1
val week1 = LocalDate.of(2016, 12, 30)
val day = "Friday"
//add week from the week column
val result = week1.plusWeeks(week).format(DateTimeFormatter.ofPattern("MM/dd/yyyy"))
//return result as Friday (date)
s"${day} (${result})"
})
//use the udf and create a new column named day
data.withColumn("day", getDateFromWeek($"week")).show
can anyone convert this to Pyspark?

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How do I pass parameters to selectExpr? SparkSQL-Scala - apache-spark-sql

Related

Pyspark: how to solve complicated dataframe logic plus join

How to compute a cumulative sum under a limit with Spark?

Window Function Tie breaker on other field to get the Latest Record

Grab last different data on Spark Dataframe?

Find date of each week in from week in Spark Dataframe

Categories

Resources