After several tries and some research, I'm stuck on trying to solve the following problem with Spark.
I have a Dataframe of elements with a priority and a quantity.
+------+-------+--------+---+
|family|element|priority|qty|
+------+-------+--------+---+
| f1| elmt 1| 1| 20|
| f1| elmt 2| 2| 40|
| f1| elmt 3| 3| 10|
| f1| elmt 4| 4| 50|
| f1| elmt 5| 5| 40|
| f1| elmt 6| 6| 10|
| f1| elmt 7| 7| 20|
| f1| elmt 8| 8| 10|
+------+-------+--------+---+
I also have a fixed limit quantity per family:
+------+--------+
|family|limitQty|
+------+--------+
| f1| 100|
+------+--------+
I want to mark as "ok" the elements whose cumulative sum stays within the limit, skipping any element that would push the sum over it. Here is the expected result:
+------+-------+--------+---+---+
|family|element|priority|qty| ok|
+------+-------+--------+---+---+
| f1| elmt 1| 1| 20| 1| -> 20 < 100 => ok
| f1| elmt 2| 2| 40| 1| -> 20 + 40 < 100 => ok
| f1| elmt 3| 3| 10| 1| -> 20 + 40 + 10 < 100 => ok
| f1| elmt 4| 4| 50| 0| -> 20 + 40 + 10 + 50 > 100 => ko
| f1| elmt 5| 5| 40| 0| -> 20 + 40 + 10 + 40 > 100 => ko
| f1| elmt 6| 6| 10| 1| -> 20 + 40 + 10 + 10 < 100 => ok
| f1| elmt 7| 7| 20| 1| -> 20 + 40 + 10 + 10 + 20 < 100 => ok
| f1| elmt 8| 8| 10| 0| -> 20 + 40 + 10 + 10 + 20 + 10 > 100 => ko
+------+-------+--------+---+---+
I tried to solve it with a cumulative sum:
initDF
.join(limitQtyDF, Seq("family"), "left_outer")
.withColumn("cumulSum", sum($"qty").over(Window.partitionBy("family").orderBy("priority")))
.withColumn("ok", when($"cumulSum" <= $"limitQty", 1).otherwise(0))
.drop("cumulSum", "limitQty")
But that is not enough, because the elements that come after the one exceeding the limit are not taken into account: a rejected element still counts towards the cumulative sum.
I can't find a way to solve it with Spark. Do you have an idea?
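To make the difference concrete, here is a minimal plain-Scala sketch (no Spark involved; the object and method names are illustrative and not part of the question) that compares the window-style cumulative sum with the "skip what does not fit" rule on the sample quantities.

```scala
object CumulativeVsGreedy {
  // Quantities of elmt 1..8 in priority order, and the limit for family f1.
  val qty: Seq[Int] = Seq(20, 40, 10, 50, 40, 10, 20, 10)
  val limit: Int = 100

  // Window-style cumulative sum: every row adds to the total, accepted or not.
  def naiveFlags(qs: Seq[Int], limit: Int): Seq[Int] =
    qs.scanLeft(0)(_ + _).tail.map(total => if (total <= limit) 1 else 0)

  // Desired rule: a row only adds to the total when it still fits.
  def greedyFlags(qs: Seq[Int], limit: Int): Seq[Int] =
    qs.foldLeft((0, Vector.empty[Int])) { case ((total, flags), q) =>
      if (total + q <= limit) (total + q, flags :+ 1) else (total, flags :+ 0)
    }._2

  def main(args: Array[String]): Unit = {
    println(naiveFlags(qty, limit).mkString(", "))  // 1, 1, 1, 0, 0, 0, 0, 0
    println(greedyFlags(qty, limit).mkString(", ")) // 1, 1, 1, 0, 0, 1, 1, 0
  }
}
```

The second rule depends on the decision made for every earlier row, which a fixed-frame window aggregate cannot express on its own; that is why a sequential pass per family is needed.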
Here is the corresponding Scala code :
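import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}
val sparkSession = SparkSession.builder()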
.master("local[*]")
.getOrCreate()
import sparkSession.implicits._
val initDF = Seq(
("f1", "elmt 1", 1, 20),
("f1", "elmt 2", 2, 40),
("f1", "elmt 3", 3, 10),
("f1", "elmt 4", 4, 50),
("f1", "elmt 5", 5, 40),
("f1", "elmt 6", 6, 10),
("f1", "elmt 7", 7, 20),
("f1", "elmt 8", 8, 10)
).toDF("family", "element", "priority", "qty")
val limitQtyDF = Seq(("f1", 100)).toDF("family", "limitQty")
val expectedDF = Seq(
("f1", "elmt 1", 1, 20, 1),
("f1", "elmt 2", 2, 40, 1),
("f1", "elmt 3", 3, 10, 1),
("f1", "elmt 4", 4, 50, 0),
("f1", "elmt 5", 5, 40, 0),
("f1", "elmt 6", 6, 10, 1),
("f1", "elmt 7", 7, 20, 1),
("f1", "elmt 8", 8, 10, 0)
).toDF("family", "element", "priority", "qty", "ok")
expectedDF.show()
Thank you for your help !
The solution is shown below:
scala> initDF.show
+------+-------+--------+---+
|family|element|priority|qty|
+------+-------+--------+---+
| f1| elmt 1| 1| 20|
| f1| elmt 2| 2| 40|
| f1| elmt 3| 3| 10|
| f1| elmt 4| 4| 50|
| f1| elmt 5| 5| 40|
| f1| elmt 6| 6| 10|
| f1| elmt 7| 7| 20|
| f1| elmt 8| 8| 10|
+------+-------+--------+---+
scala> val df1 = initDF.groupBy("family").agg(collect_list("qty").as("comb_qty"), collect_list("priority").as("comb_prior"), collect_list("element").as("comb_elem"))
df1: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 2 more fields]
scala> df1.show
+------+--------------------+--------------------+--------------------+
|family| comb_qty| comb_prior| comb_elem|
+------+--------------------+--------------------+--------------------+
| f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...|
+------+--------------------+--------------------+--------------------+
scala> val df2 = df1.join(limitQtyDF, df1("family") === limitQtyDF("family")).drop(limitQtyDF("family"))
df2: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 3 more fields]
scala> df2.show
+------+--------------------+--------------------+--------------------+--------+
|family| comb_qty| comb_prior| comb_elem|limitQty|
+------+--------------------+--------------------+--------------------+--------+
| f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...| 100|
+------+--------------------+--------------------+--------------------+--------+
scala> def validCheck = (qty: Seq[Int], limit: Int) => {
| var sum = 0
| qty.map(elem => {
| if (elem + sum <= limit) {
| sum = sum + elem
| 1}else{
| 0
| }})}
validCheck: (scala.collection.mutable.Seq[Int], Int) => scala.collection.mutable.Seq[Int]
scala> val newUdf = udf(validCheck)
newUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(IntegerType,false),Some(List(ArrayType(IntegerType,false), IntegerType)))
scala> val df3 = df2.withColumn("valid", newUdf(col("comb_qty"),col("limitQty"))).drop("limitQty")
df3: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 3 more fields]
scala> df3.show
+------+--------------------+--------------------+--------------------+--------------------+
|family| comb_qty| comb_prior| comb_elem| valid|
+------+--------------------+--------------------+--------------------+--------------------+
| f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...|[1, 1, 1, 0, 0, 1...|
+------+--------------------+--------------------+--------------------+--------------------+
scala> val myUdf = udf((qty: Seq[Int], prior: Seq[Int], elem: Seq[String], valid: Seq[Int]) => {
| elem zip prior zip qty zip valid map{
| case (((a,b),c),d) => (a,b,c,d)}
| }
| )
scala> val df4 = df3.withColumn("combined", myUdf(col("comb_qty"),col("comb_prior"),col("comb_elem"),col("valid")))
df4: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 4 more fields]
scala> val df5 = df4.drop("comb_qty","comb_prior","comb_elem","valid")
df5: org.apache.spark.sql.DataFrame = [family: string, combined: array<struct<_1:string,_2:int,_3:int,_4:int>>]
scala> df5.show(false)
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|family|combined |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|f1 |[[elmt 1, 1, 20, 1], [elmt 2, 2, 40, 1], [elmt 3, 3, 10, 1], [elmt 4, 4, 50, 0], [elmt 5, 5, 40, 0], [elmt 6, 6, 10, 1], [elmt 7, 7, 20, 1], [elmt 8, 8, 10, 0]]|
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
scala> val df6 = df5.withColumn("combined",explode(col("combined")))
df6: org.apache.spark.sql.DataFrame = [family: string, combined: struct<_1: string, _2: int ... 2 more fields>]
scala> df6.show
+------+------------------+
|family| combined|
+------+------------------+
| f1|[elmt 1, 1, 20, 1]|
| f1|[elmt 2, 2, 40, 1]|
| f1|[elmt 3, 3, 10, 1]|
| f1|[elmt 4, 4, 50, 0]|
| f1|[elmt 5, 5, 40, 0]|
| f1|[elmt 6, 6, 10, 1]|
| f1|[elmt 7, 7, 20, 1]|
| f1|[elmt 8, 8, 10, 0]|
+------+------------------+
scala> val df7 = df6.select("family", "combined._1", "combined._2", "combined._3", "combined._4").withColumnRenamed("_1","element").withColumnRenamed("_2","priority").withColumnRenamed("_3", "qty").withColumnRenamed("_4","ok")
df7: org.apache.spark.sql.DataFrame = [family: string, element: string ... 3 more fields]
scala> df7.show
+------+-------+--------+---+---+
|family|element|priority|qty| ok|
+------+-------+--------+---+---+
| f1| elmt 1| 1| 20| 1|
| f1| elmt 2| 2| 40| 1|
| f1| elmt 3| 3| 10| 1|
| f1| elmt 4| 4| 50| 0|
| f1| elmt 5| 5| 40| 0|
| f1| elmt 6| 6| 10| 1|
| f1| elmt 7| 7| 20| 1|
| f1| elmt 8| 8| 10| 0|
+------+-------+--------+---+---+
Let me know if it helps!!
Another way to do it is a row-by-row approach: collect the rows to the driver and iterate over them.
// Rows are collected to the driver and scanned sequentially, so this only suits small data.
val bufferRow = collection.mutable.Buffer.empty[Row]
var tempSum = 0
// Sort by priority first: collect() gives no ordering guarantee on its own.
val iterator = df.orderBy("priority").collect.iterator
while (iterator.hasNext) {
  val record = iterator.next()
  val y = record.getAs[Int]("qty")
  if (tempSum + y <= 100) {
    // The element fits: add it to the running total and flag it 1.
    tempSum += y
    bufferRow += transformRow(record, 1)
  } else {
    // The element does not fit: flag it 0 and leave the total unchanged.
    bufferRow += transformRow(record, 0)
  }
}
The transformRow function, which appends the flag to a row as an extra column, must be defined before the loop runs:
def transformRow(row: Row,flag : Int): Row = Row.fromSeq(row.toSeq ++ Array[Integer](flag))
Next thing to do will be adding an additional column to the schema.
val newSchema = StructType(df.schema.fields ++ Array(StructField("C_Sum", IntegerType, false)))
Followed by creating a new dataframe.
val outputdf = spark.createDataFrame(spark.sparkContext.parallelize(bufferRow.toSeq),newSchema)
Output Dataframe :
+------+-------+--------+---+-----+
|family|element|priority|qty|C_Sum|
+------+-------+--------+---+-----+
| f1| elmt1| 1| 20| 1|
| f1| elmt2| 2| 40| 1|
| f1| elmt3| 3| 10| 1|
| f1| elmt4| 4| 50| 0|
| f1| elmt5| 5| 40| 0|
| f1| elmt6| 6| 10| 1|
| f1| elmt7| 7| 20| 1|
| f1| elmt8| 8| 10| 0|
+------+-------+--------+---+-----+
I am new to Spark so this solution may not be optimal. I am assuming the value of 100 is an input to the program here. In that case:
case class Frame(family:String, element : String, priority : Int, qty :Int)
import scala.collection.JavaConverters._
val ans = df.as[Frame].toLocalIterator
.asScala
.foldLeft((Seq.empty[Int],0))((acc,a) =>
if(acc._2 + a.qty <= 100) (acc._1 :+ a.priority, acc._2 + a.qty) else acc)._1
df.withColumn("OK" , when($"priority".isin(ans :_*), 1).otherwise(0)).show
results in:
+------+-------+--------+---+--------+
|family|element|priority|qty|OK |
+------+-------+--------+---+--------+
| f1| elmt 1| 1| 20| 1|
| f1| elmt 2| 2| 40| 1|
| f1| elmt 3| 3| 10| 1|
| f1| elmt 4| 4| 50| 0|
| f1| elmt 5| 5| 40| 0|
| f1| elmt 6| 6| 10| 1|
| f1| elmt 7| 7| 20| 1|
| f1| elmt 8| 8| 10| 0|
+------+-------+--------+---+--------+
The idea is simply to get a Scala iterator, extract the priority values of the participating rows from it, and then use those values to flag the participating rows. Since this solution gathers all the data in memory on one machine, it can run into memory problems if the dataframe is too large to fit in memory.
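The snippet above hard-codes the limit of 100 and ignores the family column. A minimal plain-Scala sketch of the same pass applied per family follows. It uses no Spark, the names Element, PerFamilyGreedy and flag are hypothetical, and treating a family without a configured limit as having limit 0 is an assumption.

```scala
object PerFamilyGreedy {
  final case class Element(family: String, element: String, priority: Int, qty: Int)

  // Flags each element with 1 if it fits under its family's limit, else 0.
  // Assumption: a family with no entry in `limits` is treated as limit 0.
  def flag(rows: Seq[Element], limits: Map[String, Int]): Seq[(Element, Int)] =
    rows.groupBy(_.family).toSeq.flatMap { case (family, members) =>
      val limit = limits.getOrElse(family, 0)
      var total = 0 // running total of accepted quantities for this family
      members.sortBy(_.priority).map { e =>
        if (total + e.qty <= limit) { total += e.qty; (e, 1) } else (e, 0)
      }
    }
}
```

Each family is processed independently, so in Spark the same logic could run per group instead of on the driver; that is the shape of the collect_list plus UDF answer earlier in this thread.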
Cumulative sum for each group
from pyspark.sql import functions as f
from pyspark.sql.window import Window as window
from pyspark.sql.types import IntegerType,StringType,FloatType,StructType,StructField,DateType
schema = StructType() \
.add(StructField("empno",IntegerType(),True)) \
.add(StructField("ename",StringType(),True)) \
.add(StructField("job",StringType(),True)) \
.add(StructField("mgr",StringType(),True)) \
.add(StructField("hiredate",DateType(),True)) \
.add(StructField("sal",FloatType(),True)) \
.add(StructField("comm",StringType(),True)) \
.add(StructField("deptno",IntegerType(),True))
emp = spark.read.csv('data/emp.csv',schema)
dept_partition = window.partitionBy(emp.deptno).orderBy(emp.sal)
emp_win = emp.withColumn("dept_cum_sal",
f.sum(emp.sal).over(dept_partition.rowsBetween(window.unboundedPreceding, window.currentRow)))
emp_win.show()
Results appear like below:
+-----+------+---------+----+----------+------+-------+------+------------+
|empno| ename| job| mgr| hiredate| sal| comm|deptno|dept_cum_sal|
+-----+------+---------+----+----------+------+-------+------+------------+
| 7369| SMITH| CLERK|7902|1980-12-17| 800.0| null| 20| 800.0|
| 7876| ADAMS| CLERK|7788|1983-01-12|1100.0| null| 20| 1900.0|
| 7566| JONES| MANAGER|7839|1981-04-02|2975.0| null| 20| 4875.0|
| 7788| SCOTT| ANALYST|7566|1982-12-09|3000.0| null| 20| 7875.0|
| 7902| FORD| ANALYST|7566|1981-12-03|3000.0| null| 20| 10875.0|
| 7934|MILLER| CLERK|7782|1982-01-23|1300.0| null| 10| 1300.0|
| 7782| CLARK| MANAGER|7839|1981-06-09|2450.0| null| 10| 3750.0|
| 7839| KING|PRESIDENT|null|1981-11-17|5000.0| null| 10| 8750.0|
| 7900| JAMES| CLERK|7698|1981-12-03| 950.0| null| 30| 950.0|
| 7521| WARD| SALESMAN|7698|1981-02-22|1250.0| 500.00| 30| 2200.0|
| 7654|MARTIN| SALESMAN|7698|1981-09-28|1250.0|1400.00| 30| 3450.0|
| 7844|TURNER| SALESMAN|7698|1981-09-08|1500.0| 0.00| 30| 4950.0|
| 7499| ALLEN| SALESMAN|7698|1981-02-20|1600.0| 300.00| 30| 6550.0|
| 7698| BLAKE| MANAGER|7839|1981-05-01|2850.0| null| 30| 9400.0|
+-----+------+---------+----+----------+------+-------+------+------------+
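The explicit rowsBetween(unboundedPreceding, currentRow) frame matters when the ordering column has ties. Without it, a window that has an orderBy uses a range frame by default, in which tied rows receive the same total. Here is a minimal plain-Scala sketch of the two behaviours on the department 20 salaries (no Spark; the object and function names are illustrative):

```scala
object FrameSemantics {
  // Department 20 salaries from the output above, already sorted by sal.
  val sal: Seq[Double] = Seq(800.0, 1100.0, 2975.0, 3000.0, 3000.0)

  // ROWS frame: each row sums itself and the rows physically before it.
  def rowsFrame(xs: Seq[Double]): Seq[Double] =
    xs.scanLeft(0.0)(_ + _).tail

  // RANGE frame: each row sums every row whose ordering value is <= its own,
  // so rows with equal values share one total.
  def rangeFrame(xs: Seq[Double]): Seq[Double] =
    xs.map(x => xs.filter(_ <= x).sum)
}
```

With the rows frame the two 3000.0 salaries get 7875.0 and 10875.0, as in the table above; with the range frame both would get 10875.0.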
Please find the answer below.
val initDF = Seq(("f1", "elmt 1", 1, 20),("f1", "elmt 2", 2, 40),("f1", "elmt 3", 3, 10),
("f1", "elmt 4", 4, 50),
("f1", "elmt 5", 5, 40),
("f1", "elmt 6", 6, 10),
("f1", "elmt 7", 7, 20),
("f1", "elmt 8", 8, 10)
).toDF("family", "element", "priority", "qty")
val limitQtyDF = Seq(("f1", 100)).toDF("family", "limitQty")
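// sc.broadcast(limitQtyDF) has no effect when its result is not used; the broadcast() join hint is used instead.
import org.apache.spark.sql.functions.broadcast
val joinedInitDF=initDF.join(broadcast(limitQtyDF),Seq("family"),"left")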
case class dataResult(family:String,element:String,priority:Int, qty:Int, comutedValue:Int, limitQty:Int,controlOut:String)
val familyIDs=initDF.select("family").distinct.collect.map(_(0).toString).toList
def checkingUDF(inputRows:List[Row])={
var controlVarQty=0
val outputArrayBuffer=collection.mutable.ArrayBuffer[dataResult]()
val setLimit=inputRows.head.getInt(4)
for(inputRow <- inputRows)
{
val currQty=inputRow.getInt(3)
//val outpurForRec=
controlVarQty + currQty match {
case value if value <= setLimit =>
controlVarQty+=currQty
outputArrayBuffer+=dataResult(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),value,setLimit,"ok")
case value =>
outputArrayBuffer+=dataResult(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),value,setLimit,"ko")
}
//outputArrayBuffer+=Row(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),controlVarQty+currQty,setLimit,outpurForRec)
}
outputArrayBuffer.toList
}
val tmpAB=collection.mutable.ArrayBuffer[List[dataResult]]()
for (familyID <- familyIDs) // val familyID="f1"
{
val currentFamily=joinedInitDF.filter(s"family = '${familyID}'").orderBy("priority").collect.toList
tmpAB+=checkingUDF(currentFamily)
}
tmpAB.toSeq.flatMap(x => x).toDF.show(false)
This works for me:
+------+-------+--------+---+------------+--------+----------+
|family|element|priority|qty|comutedValue|limitQty|controlOut|
+------+-------+--------+---+------------+--------+----------+
|f1 |elmt 1 |1 |20 |20 |100 |ok |
|f1 |elmt 2 |2 |40 |60 |100 |ok |
|f1 |elmt 3 |3 |10 |70 |100 |ok |
|f1 |elmt 4 |4 |50 |120 |100 |ko |
|f1 |elmt 5 |5 |40 |110 |100 |ko |
|f1 |elmt 6 |6 |10 |80 |100 |ok |
|f1 |elmt 7 |7 |20 |100 |100 |ok |
|f1 |elmt 8 |8 |10 |110 |100 |ko |
+------+-------+--------+---+------------+--------+----------+
Drop any unnecessary columns from the output.
I have the following data, partitioned by store and month id and ordered by amount, in order to get the primary vendor for each store.
I need a tie-breaker when the amount is equal between two vendors:
if one of the tied vendors was the previous month's top-selling vendor, make that vendor the top-selling vendor for the month.
The look-back must extend further if there is a tie again, so a lag of 1 month will not work in that case. In the worst case, the previous months contain ties as well.
sample data
val data = Seq((201801, 10941, 115, 80890.44900, 135799.66400),
(201801, 10941, 3, 80890.44900, 135799.66400) ,
(201712, 10941, 3, 517440.74500, 975893.79000),
(201712, 10941, 115, 517440.74500, 975893.79000),
(201711, 10941, 3 , 371501.92100, 574223.52300),
(201710, 10941, 115, 552435.57800, 746912.06700),
(201709, 10941, 115,1523492.60700,1871480.06800),
(201708, 10941, 115,1027698.93600,1236544.50900),
(201707, 10941, 33 ,1469219.86900,1622949.53000)
).toDF("MTH_ID", "store_id" ,"brand" ,"brndSales","TotalSales")
Code:
val window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales")
val res = data.withColumn("rank",rank over window)
Output:
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 1|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 115| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
Both the first and the second record get rank 1, but only the second record should have rank 1 (and the first rank 2), based on the previous month's top sales.
I am expecting the following output.
+------+--------+-----+-----------+-----------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|rank|
+------+--------+-----+-----------+-----------+----+
|201801| 10941| 115| 80890.449| 135799.664| 2|
|201801| 10941| 3| 80890.449| 135799.664| 1|
|201712| 10941| 3| 517440.745| 975893.79| 1|
|201712| 10941| 115| 517440.745| 975893.79| 1|
|201711| 10941| 3| 371501.921| 574223.523| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1|
|201708| 10941| 115|1027698.936|1236544.509| 1|
|201707| 10941| 33|1469219.869| 1622949.53| 1|
+------+--------+-----+-----------+-----------+----+
Should I write a UDAF? Any suggestions would help.
You can do this with two windows. First, use the lag() function to carry over the previous month's sales values so that you can use them in your rank window. Here's that part in PySpark:
lag_window = Window.partitionBy("store_id", "brand").orderBy("MTH_ID")
lag_df = data.withColumn("last_month_sales", lag("brndSales").over(lag_window))
Then edit your window to include that new column:
window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales", "last_month_sales")
lag_df.withColumn("rank",rank().over(window)).show()
+------+--------+-----+-----------+-----------+----------------+----+
|MTH_ID|store_id|brand| brndSales| TotalSales|last_month_sales|rank|
+------+--------+-----+-----------+-----------+----------------+----+
|201711| 10941| 99| 371501.921| 574223.523| null| 1|
|201709| 10941| 115|1523492.607|1871480.068| 1027698.936| 1|
|201707| 10941| 33|1469219.869| 1622949.53| null| 1|
|201708| 10941| 115|1027698.936|1236544.509| null| 1|
|201710| 10941| 115| 552435.578| 746912.067| 1523492.607| 1|
|201712| 10941| 3| 517440.745| 975893.79| null| 1|
|201801| 10941| 3| 80890.449| 135799.664| 517440.745| 1|
|201801| 10941| 115| 80890.449| 135799.664| 552435.578| 2|
+------+--------+-----+-----------+-----------+----------------+----+
For each row, collect an array of that brand's previous sales as (Month, Sales) structs.
val storeAndBrandWindow = Window.partitionBy("store_id", "brand").orderBy($"MTH_ID")
val df1 = data.withColumn("brndSales_list", collect_list(struct($"MTH_ID", $"brndSales")).over(storeAndBrandWindow))
Reverse that array with a UDF.
val returnType = ArrayType(StructType(Array(StructField("month", IntegerType), StructField("sales", DoubleType))))
val reverseUdf = udf((list: Seq[Row]) => list.reverse, returnType)
val df2 = df1.withColumn("brndSales_list", reverseUdf($"brndSales_list"))
And then sort by the array.
val window = Window.partitionBy("store_id", "MTH_ID").orderBy($"brndSales_list".desc)
val df3 = df2.withColumn("rank", rank over window).orderBy("MTH_ID", "brand")
df3.show(false)
Result
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|MTH_ID|store_id|brand|brndSales |TotalSales |brndSales_list |rank|
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|201707|10941 |33 |1469219.869|1622949.53 |[[201707, 1469219.869]] |1 |
|201708|10941 |115 |1027698.936|1236544.509|[[201708, 1027698.936]] |1 |
|201709|10941 |115 |1523492.607|1871480.068|[[201709, 1523492.607], [201708, 1027698.936]] |1 |
|201710|10941 |115 |552435.578 |746912.067 |[[201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]] |1 |
|201711|10941 |99 |371501.921 |574223.523 |[[201711, 371501.921]] |1 |
|201712|10941 |3 |517440.745 |975893.79 |[[201712, 517440.745]] |1 |
|201801|10941 |3 |80890.449 |135799.664 |[[201801, 80890.449], [201712, 517440.745]] |1 |
|201801|10941 |115 |80890.449 |135799.664 |[[201801, 80890.449], [201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]]|2 |
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
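Sorting by the reversed array works because arrays of structs are compared element by element, so the first position at which two histories differ decides the order. Below is a minimal plain-Scala sketch of that comparison, assuming this element-wise (lexicographic) ordering; compareHistories is a hypothetical helper written for illustration, not a Spark API.

```scala
object HistoryOrdering {
  // (month, sales) pairs, most recent month first.
  type History = Seq[(Int, Double)]

  // Lexicographic comparison: the first differing entry decides;
  // if one history is a prefix of the other, the shorter one sorts first.
  def compareHistories(a: History, b: History): Int =
    a.zip(b).iterator
      .map { case ((m1, s1), (m2, s2)) =>
        val byMonth = Integer.compare(m1, m2)
        if (byMonth != 0) byMonth else java.lang.Double.compare(s1, s2)
      }
      .find(_ != 0)
      .getOrElse(Integer.compare(a.length, b.length))

  // Histories of the two tied brands for 201801, taken from the result above.
  val brand3: History = Seq((201801, 80890.449), (201712, 517440.745))
  val brand115: History = Seq((201801, 80890.449), (201710, 552435.578),
    (201709, 1523492.607), (201708, 1027698.936))
}
```

Here compareHistories(brand3, brand115) is positive, so brand 3 comes first in descending order. Note that the month is the first struct field, so the tie is broken by brand 3 having the more recent earlier entry (201712 versus 201710) before any sales figures are compared.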
I have two datasets (dataframes)
idPeersDS - which has an id column and its peers' ids.
infoDS - which has two type columns (type1, type2) and a metric column.
--
idPeersDS
+---+---------+
| id| peers|
+---+---------+
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
+---+---------+
infoDS
+---+-----+-----+------+
| id|type1|type2|metric|
+---+-----+-----+------+
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
+---+-----+-----+------+
I need to calculate the z-score of the metric for each id, grouped by type1 and type2. It is not the z-score over all metrics in the group, though: it is the z-score over the metrics of the id's peers within the group. If a peer id does not have a metric in the group, its metric is treated as 0.
example:
for group ("A", "X") and for id = 1, the peers are (1,2,3), the metrics for zscore will be (10, 0, 40); since id = 2 doesn't exist in group ("A","X") it is 0. id=5 is not a peer of id=1 so it is not part of zscore calculation.
+---+------+---------+-----+-----+
| id|metric| peers|type1|type2|
+---+------+---------+-----+-----+
| 1| 10.0|[1, 2, 3]| A| X|
| 3| 40.0|[3, 1, 2]| A| X|
| 5| 20.0|[5, 4, 6]| A| X|
+---+------+---------+-----+-----+
Z = (X - μ) / σ
Z = (10 - 16.66666) / 16.99673
Z = -0.39223
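The arithmetic above uses the population standard deviation (dividing by the number of peers, not by one less). A minimal plain-Scala sketch of that calculation follows; the object and function names are illustrative, and a non-empty list of peer metrics is assumed.

```scala
object ZScore {
  // Population z-score of x within xs.
  // Returns 0.0 when all values are equal (standard deviation of 0),
  // mirroring the guard in the code of this question.
  def zScore(x: Double, xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    val variance = xs.map(v => math.pow(v - mean, 2)).sum / xs.size
    val sd = math.sqrt(variance)
    if (sd == 0.0) 0.0 else (x - mean) / sd
  }
}
```

For id = 1 in group ("A", "X"), ZScore.zScore(10.0, Seq(10.0, 0.0, 40.0)) gives approximately -0.39223, matching the hand calculation.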
The output should be the following table. If I could get the `peersmetrics` column, I could compute the z-score from it directly instead of producing `zScoreValue` the way my code does.
+---+------+---------+-----------+-----+-----+------------+
| id|metric| peers|zScoreValue|type1|type2|peersmetrics|
+---+------+---------+-----------+-----+-----+------------+
| 1| 10.0|[1, 2, 3]| -0.39| A| X| [10, 0, 40]|
| 3| 40.0|[3, 1, 2]| 1.37| A| X| [40, 10, 0]|
| 5| 20.0|[5, 4, 6]| 1.41| A| X| [20, 0, 0]|
| 1| 40.0|[1, 2, 3]| 0.98| B| Y| [40, 30, 0]|
| 2| 30.0|[2, 1, 6]| 0.27| B| Y|[30, 40, 10]|
| 4| 10.0|[4, 5, 6]| 0.71| B| Y| |
| 6| 10.0|[6, 1, 2]| -1.34| B| Y| |
| 1| 30.0|[1, 2, 3]| 1.07| B| X| |
| 2| 20.0|[2, 1, 6]| 0.27| B| X| |
| 5| 30.0|[5, 4, 6]| 1.41| B| X| |
| 1| 20.0|[1, 2, 3]| 1.22| A| Y| |
| 2| 10.0|[2, 1, 6]| -1.07| A| Y| |
| 6| 40.0|[6, 1, 2]| 1.34| A| Y| |
+---+------+---------+-----------+-----+-----+------------+
Edit1: SQL solution is equally appreciated. I can transform SQL to Scala code in my spark job.
Below is my solution, but the computation takes longer than I would like.
The sizes of the real datasets:
idPeersDS has 17000 rows and infoDS has 17000 * 6 * 15 rows.
Any other solution is greatly appreciated. Feel free to edit/recommend title and correct grammar. English is not my first language. Thanks.
Here is my code.
val idPeersDS = Seq(
(1, Seq(1,2,3)),
(2, Seq(2,1,6)),
(3, Seq(3,1,2)),
(4, Seq(4,5,6)),
(5, Seq(5,4,6)),
(6, Seq(6,1,2))
).toDS.select($"_1" as "id", $"_2" as "peers")
val infoDS = Seq(
(1, "A", "X", 10),
(1, "A", "Y", 20),
(1, "B", "X", 30),
(1, "B", "Y", 40),
(2, "A", "Y", 10),
(2, "B", "X", 20),
(2, "B", "Y", 30),
(3, "A", "X", 40),
(4, "B", "Y", 10),
(5, "A", "X", 20),
(5, "B", "X", 30),
(6, "A", "Y", 40),
(6, "B", "Y", 10)
).toDS.select($"_1" as "id", $"_2" as "type1", $"_3" as "type2", $"_4" cast "double" as "metric")
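import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{lit, round, udf}
import scala.collection.mutable.WrappedArray
def calculateZScoreGivenPeers(idMetricDS: DataFrame, irPeersDS: DataFrame, roundTo: Int = 2)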
(implicit spark: SparkSession): DataFrame = {
import spark.implicits._
// for every id in the idMetricDS, get the peers and their metric for zscore, calculate zscore
val fir = idMetricDS.join(irPeersDS, "id")
val fsMapBroadcast = spark.sparkContext.broadcast(
idMetricDS.toDF.map((r: Row) => {r.getInt(0) -> r.getDouble(1)}).rdd.collectAsMap)
val fsMap = fsMapBroadcast.value
val funUdf = udf((currId: Int, xs: WrappedArray[Int]) => {
val zScoreMetrics: Array[Double] = xs.toArray.map(x => fsMap.getOrElse(x, 0.0))
val ds = new DescriptiveStatistics(zScoreMetrics)
val mean = ds.getMean()
val sd = Math.sqrt(ds.getPopulationVariance())
val zScore = if (sd == 0.0) {0.0} else {(fsMap.getOrElse(currId, 0.0)- mean) / sd}
zScore
})
val idStatsWithZscoreDS =
fir.withColumn("zScoreValue", round(funUdf(fir("id"), fir("peers")), roundTo))
fsMapBroadcast.unpersist
fsMapBroadcast.destroy
return idStatsWithZscoreDS
}
val typesComb = infoDS.select("type1", "type2").dropDuplicates.collect
val zScoreDS = typesComb.map(
ept => {
val et = ept.getString(0)
val pt = ept.getString(1)
val idMetricDS = infoDS.where($"type1" === lit(et) && $"type2" === lit(pt)).select($"id", $"metric")
val zScoreDS = calculateZScoreGivenPeers(idMetricDS, idPeersDS)(spark)
zScoreDS.select($"id", $"metric", $"peers", $"zScoreValue").withColumn("type1", lit(et)).withColumn("type2", lit(pt))
}
).reduce(_.union(_))
scala> idPeersDS.show(100)
+---+---------+
| id| peers|
+---+---------+
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
+---+---------+
scala> infoDS.show(100)
+---+-----+-----+------+
| id|type1|type2|metric|
+---+-----+-----+------+
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
+---+-----+-----+------+
scala> typesComb
res3: Array[org.apache.spark.sql.Row] = Array([A,X], [B,Y], [B,X], [A,Y])
scala> zScoreDS.show(100)
+---+------+---------+-----------+-----+-----+
| id|metric| peers|zScoreValue|type1|type2|
+---+------+---------+-----------+-----+-----+
| 1| 10.0|[1, 2, 3]| -0.39| A| X|
| 3| 40.0|[3, 1, 2]| 1.37| A| X|
| 5| 20.0|[5, 4, 6]| 1.41| A| X|
| 1| 40.0|[1, 2, 3]| 0.98| B| Y|
| 2| 30.0|[2, 1, 6]| 0.27| B| Y|
| 4| 10.0|[4, 5, 6]| 0.71| B| Y|
| 6| 10.0|[6, 1, 2]| -1.34| B| Y|
| 1| 30.0|[1, 2, 3]| 1.07| B| X|
| 2| 20.0|[2, 1, 6]| 0.27| B| X|
| 5| 30.0|[5, 4, 6]| 1.41| B| X|
| 1| 20.0|[1, 2, 3]| 1.22| A| Y|
| 2| 10.0|[2, 1, 6]| -1.07| A| Y|
| 6| 40.0|[6, 1, 2]| 1.34| A| Y|
+---+------+---------+-----------+-----+-----+
I solved it; here is my answer. On my real datasets this solution ran significantly faster (in less than 1/10th of the time) than the solution in my question.
I avoided collecting to the driver, as well as the map over type combinations and the union of datasets in the reduce.
val idPeersDS = Seq(
(1, Seq(1,2,3)),
(2, Seq(2,1,6)),
(3, Seq(3,1,2)),
(4, Seq(4,5,6)),
(5, Seq(5,4,6)),
(6, Seq(6,1,2))
).toDS.select($"_1" as "id", $"_2" as "peers")
val infoDS = Seq(
(1, "A", "X", 10),
(1, "A", "Y", 20),
(1, "B", "X", 30),
(1, "B", "Y", 40),
(2, "A", "Y", 10),
(2, "B", "X", 20),
(2, "B", "Y", 30),
(3, "A", "X", 40),
(4, "B", "Y", 10),
(5, "A", "X", 20),
(5, "B", "X", 30),
(6, "A", "Y", 40),
(6, "B", "Y", 10)
).toDS.select($"_1" as "id", $"_2" as "type1", $"_3" as "type2", $"_4" cast "double" as "metric")
// Exiting paste mode, now interpreting.
idPeersDS: org.apache.spark.sql.DataFrame = [id: int, peers: array<int>]
infoDS: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 2 more fields]
scala> idPeersDS.show
+---+---------+
| id| peers|
+---+---------+
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
+---+---------+
scala> infoDS.show
+---+-----+-----+------+
| id|type1|type2|metric|
+---+-----+-----+------+
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
+---+-----+-----+------+
scala> val infowithpeers = infoDS.join(idPeersDS, "id")
infowithpeers: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 3 more fields]
scala> infowithpeers.show
+---+-----+-----+------+---------+
| id|type1|type2|metric| peers|
+---+-----+-----+------+---------+
| 1| A| X| 10.0|[1, 2, 3]|
| 1| A| Y| 20.0|[1, 2, 3]|
| 1| B| X| 30.0|[1, 2, 3]|
| 1| B| Y| 40.0|[1, 2, 3]|
| 2| A| Y| 10.0|[2, 1, 6]|
| 2| B| X| 20.0|[2, 1, 6]|
| 2| B| Y| 30.0|[2, 1, 6]|
| 3| A| X| 40.0|[3, 1, 2]|
| 4| B| Y| 10.0|[4, 5, 6]|
| 5| A| X| 20.0|[5, 4, 6]|
| 5| B| X| 30.0|[5, 4, 6]|
| 6| A| Y| 40.0|[6, 1, 2]|
| 6| B| Y| 10.0|[6, 1, 2]|
+---+-----+-----+------+---------+
scala> val joinMap = udf { values: Seq[Map[Int,Double]] => values.flatten.toMap }
joinMap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(IntegerType,DoubleType,false),Some(List(ArrayType(MapType(IntegerType,DoubleType,false),true))))
scala> val zScoreCal = udf { (metric: Double, zScoreMetrics: WrappedArray[Double]) =>
| val ds = new DescriptiveStatistics(zScoreMetrics.toArray)
| val mean = ds.getMean()
| val sd = Math.sqrt(ds.getPopulationVariance())
| val zScore = if (sd == 0.0) {0.0} else {(metric - mean) / sd}
| zScore
| }
zScoreCal: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(DoubleType, ArrayType(DoubleType,false))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
val infowithpeersidmetric = infowithpeers.withColumn("idmetric", map($"id",$"metric"))
val idsingrpdf = infowithpeersidmetric.groupBy("type1","type2").agg(joinMap(collect_list(map($"id", $"metric"))) as "idsingrp")
val metricsMap = udf { (peers: Seq[Int], values: Map[Int,Double]) => {
peers.map(p => values.getOrElse(p,0.0))
}
}
// Exiting paste mode, now interpreting.
infowithpeersidmetric: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 4 more fields]
idsingrpdf: org.apache.spark.sql.DataFrame = [type1: string, type2: string ... 1 more field]
metricsMap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(DoubleType,false),Some(List(ArrayType(IntegerType,false), MapType(IntegerType,DoubleType,false))))
scala> val infoWithMap = infowithpeers.join(idsingrpdf, Seq("type1","type2")).withColumn("zScoreMetrics", metricsMap($"peers", $"idsingrp")).withColumn("zscore", round(zScoreCal($"metric",$"zScoreMetrics"),2))
infoWithMap: org.apache.spark.sql.DataFrame = [type1: string, type2: string ... 6 more fields]
scala> infoWithMap.show
+-----+-----+---+------+---------+--------------------+------------------+------+
|type1|type2| id|metric| peers| idsingrp| zScoreMetrics|zscore|
+-----+-----+---+------+---------+--------------------+------------------+------+
| A| X| 1| 10.0|[1, 2, 3]|[3 -> 40.0, 5 -> ...| [10.0, 0.0, 40.0]| -0.39|
| A| Y| 1| 20.0|[1, 2, 3]|[2 -> 10.0, 6 -> ...| [20.0, 10.0, 0.0]| 1.22|
| B| X| 1| 30.0|[1, 2, 3]|[1 -> 30.0, 2 -> ...| [30.0, 20.0, 0.0]| 1.07|
| B| Y| 1| 40.0|[1, 2, 3]|[4 -> 10.0, 1 -> ...| [40.0, 30.0, 0.0]| 0.98|
| A| Y| 2| 10.0|[2, 1, 6]|[2 -> 10.0, 6 -> ...|[10.0, 20.0, 40.0]| -1.07|
| B| X| 2| 20.0|[2, 1, 6]|[1 -> 30.0, 2 -> ...| [20.0, 30.0, 0.0]| 0.27|
| B| Y| 2| 30.0|[2, 1, 6]|[4 -> 10.0, 1 -> ...|[30.0, 40.0, 10.0]| 0.27|
| A| X| 3| 40.0|[3, 1, 2]|[3 -> 40.0, 5 -> ...| [40.0, 10.0, 0.0]| 1.37|
| B| Y| 4| 10.0|[4, 5, 6]|[4 -> 10.0, 1 -> ...| [10.0, 0.0, 10.0]| 0.71|
| A| X| 5| 20.0|[5, 4, 6]|[3 -> 40.0, 5 -> ...| [20.0, 0.0, 0.0]| 1.41|
| B| X| 5| 30.0|[5, 4, 6]|[1 -> 30.0, 2 -> ...| [30.0, 0.0, 0.0]| 1.41|
| A| Y| 6| 40.0|[6, 1, 2]|[2 -> 10.0, 6 -> ...|[40.0, 20.0, 10.0]| 1.34|
| B| Y| 6| 10.0|[6, 1, 2]|[4 -> 10.0, 1 -> ...|[10.0, 40.0, 30.0]| -1.34|
+-----+-----+---+------+---------+--------------------+------------------+------+
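The core of the faster version is the metricsMap step: each group carries one id-to-metric map, and every row looks up its peers in that map with a default of 0.0. A minimal plain-Scala sketch of that lookup for group ("A", "X") is shown here (no Spark; the object and value names are illustrative):

```scala
object PeerMetrics {
  // Metrics present in group ("A", "X"), keyed by id (values from infoDS above).
  val groupAX: Map[Int, Double] = Map(1 -> 10.0, 3 -> 40.0, 5 -> 20.0)

  // Look up each peer's metric, defaulting to 0.0 for peers absent from the group.
  def peerMetrics(peers: Seq[Int], metricsById: Map[Int, Double]): Seq[Double] =
    peers.map(p => metricsById.getOrElse(p, 0.0))
}
```

PeerMetrics.peerMetrics(Seq(1, 2, 3), PeerMetrics.groupAX) yields 10.0, 0.0, 40.0, which is the zScoreMetrics value shown for id 1 in the first row of the table above.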
:)
When you have a data frame, you can add columns and fill their rows with the selectExpr method.
Something like this:
scala> table.show
+------+--------+---------+--------+--------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|
+------+--------+---------+--------+--------+
| OlcM| h|999999999| J| 0|
| zOcQ| r|777777777| J| 1|
| kyGp| t|333333333| J| 2|
| BEuX| A|999999999| F| 3|
scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo")
table2: org.apache.spark.sql.DataFrame = [idempr: string, tipperrd: string, codperrd: decimal(9,0), tipperrt: string, codperrt: decimal(9,0), Saludo: string]
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hola|
| zOcQ| r|777777777| J| 1| hola|
| kyGp| t|333333333| J| 2| hola|
| BEuX| A|999999999| F| 3| hola|
My problem is this:
I define a string and call a method that should use this String parameter to fill a column in the DataFrame. But I cannot get the select expression to pick up the string's value (I tried $, +, etc.). I want to achieve something like this:
scala> var english = "hello"
scala> def generar_informe(df: DataFrame, tabla: String) {
var selectExpr_df = df.selectExpr(
"TIPPERSCON_BAS as TIP.PERSONA CONTACTABILIDAD",
"CODPERSCON_BAS as COD.PERSONA CONTACTABILIDAD",
"'tabla' as PUNTO DEL FLUJO" )
}
scala> generar_informe(df,english)
.....
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
| OlcM| h|999999999| J| 0| hello|
| zOcQ| r|777777777| J| 1| hello|
| kyGp| t|333333333| J| 2| hello|
| BEuX| A|999999999| F| 3| hello|
I tried:
scala> var result = tabl.selectExpr("A", "B", "$tabla as C")
scala> var abc = tabl.selectExpr("A", "B", ${tabla} as C)
<console>:31: error: not found: value $
var abc = tabl.selectExpr("A", "B", ${tabla} as C)
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
scala> sqlContext.sql("set tabla='hello'")
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
SAME ERROR:
java.lang.RuntimeException: [1.1] failure: identifier expected
${tabla} as C
^
at scala.sys.package$.error(package.scala:27)
Thanks in advance!
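Can you try this?
import org.apache.spark.sql.DataFrame
// The s prefix makes Scala substitute the value of english into the string.
// The single quotes turn it into a SQL string literal instead of a column name.
def generar_informe(df: DataFrame, english: String): DataFrame = {
df.selectExpr(
"transactionId", "customerId", "itemId", "amountPaid", s"'${english}' as saludo")
}
val english = "hello"
generar_informe(data, english).show()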
This is the output I got.
17/11/02 23:56:44 INFO CodeGenerator: Code generated in 13.857987 ms
+-------------+----------+------+----------+------+
|transactionId|customerId|itemId|amountPaid|saludo|
+-------------+----------+------+----------+------+
| 111| 1| 1| 100.0| hello|
| 112| 2| 2| 505.0| hello|
| 113| 3| 3| 510.0| hello|
| 114| 4| 4| 600.0| hello|
| 115| 1| 2| 500.0| hello|
| 116| 1| 2| 500.0| hello|
| 117| 1| 2| 500.0| hello|
| 118| 1| 2| 500.0| hello|
| 119| 2| 3| 500.0| hello|
| 120| 1| 2| 500.0| hello|
| 121| 1| 4| 500.0| hello|
| 122| 1| 2| 500.0| hello|
| 123| 1| 4| 500.0| hello|
| 124| 1| 2| 500.0| hello|
+-------------+----------+------+----------+------+
17/11/02 23:56:44 INFO SparkContext: Invoking stop() from shutdown hook
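The difference between the failed attempts in the question and the working answer comes down to what string actually reaches selectExpr. The following is a minimal sketch in plain Scala (no Spark needed) that only builds the three candidate expression strings; the variable names are made up for illustration.

```scala
val tabla = "hello"

// No `s` prefix: Scala leaves the text untouched, so Spark's SQL parser
// would receive the literal characters ${tabla} and reject them.
val notInterpolated = "${tabla} as C"

// `s` prefix only: the value is substituted, but without quotes the
// parser would read it as a reference to a column named hello.
val asColumnName = s"${tabla} as C"

// `s` prefix plus single quotes: a SQL string literal, which is the
// form the answer above passes to selectExpr.
val asLiteral = s"'${tabla}' as C"

println(notInterpolated) // ${tabla} as C
println(asColumnName)    // hello as C
println(asLiteral)       // 'hello' as C
```

If the value may itself contain quote characters, building the column with withColumn and lit from org.apache.spark.sql.functions avoids the SQL quoting question entirely.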
I want to add a column to my DataFrame containing the date of the Friday of each corresponding week.
My DataFrame looks like this:
+----+------+---------+
|Week| City|sum(Sale)|
+----+------+---------+
| 29|City 2| 72|
| 28|City 3| 48|
| 28|City 2| 19|
| 27|City 2| 16|
| 28|City 1| 84|
| 28|City 4| 72|
| 29|City 4| 39|
| 27|City 3| 42|
| 26|City 3| 68|
| 27|City 1| 89|
| 27|City 4| 104|
| 26|City 2| 19|
| 29|City 3| 27|
+----+------+---------+
I need to convert it to the DataFrame below:
+----+------+---------+---------------------------+
|Week| City|sum(Sale)|particular day (MM/dd/yyyy)|
+----+------+---------+---------------------------+
| 29|City 2| 72|Friday(07/21/2017)|
| 28|City 3| 48|Friday(07/14/2017)|
| 28|City 2| 19|Friday(07/14/2017)|
| 27|City 2| 16|Friday(07/07/2017)|
| 28|City 1| 84|Friday(07/14/2017)|
| 28|City 4| 72|Friday(07/14/2017)|
| 29|City 4| 39|Friday(07/21/2017)|
| 27|City 3| 42|Friday(07/07/2017)|
| 26|City 3| 68|Friday(06/30/2017)|
| 27|City 1| 89|Friday(07/07/2017)|
| 27|City 4| 104|Friday(07/07/2017)|
| 26|City 2| 19|Friday(06/30/2017)|
| 29|City 3| 27|Friday(07/21/2017)|
+----+------+---------+---------------------------+
Please help me.
You can write a simple UDF that computes the date by adding the week number, in weeks, to a fixed starting date.
Here is a simple example:
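import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf
import spark.implicits._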
val data = spark.sparkContext.parallelize(Seq(
(29,"City 2", 72),
(28,"City 3", 48),
(28,"City 2", 19),
(27,"City 2", 16),
(28,"City 1", 84),
(28,"City 4", 72),
(29,"City 4", 39),
(27,"City 3", 42),
(26,"City 3", 68),
(27,"City 1", 89),
(27,"City 4", 104),
(26,"City 2", 19),
(29,"City 3", 27)
)).toDF("week", "city", "sale")
val getDateFromWeek = udf((week : Int) => {
// anchor date: Friday 2016-12-30, one week before the Friday of week 1 of 2017
val week0 = LocalDate.of(2016, 12, 30)
val day = "Friday"
// add the number of weeks from the week column
val result = week0.plusWeeks(week).format(DateTimeFormatter.ofPattern("MM/dd/yyyy"))
// return the result as Friday(date), matching the format in the question
s"${day}(${result})"
})
//use the udf and create a new column named day
data.withColumn("day", getDateFromWeek($"week")).show
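The week-to-date arithmetic inside the UDF above can be checked without Spark. Here is a minimal sketch in plain Scala, assuming the same anchor date of 2016-12-30; the helper name fridayOfWeek is hypothetical and not part of the answer.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.format.DateTimeFormatter

// Same anchor as in the UDF: the Friday one week before week 1 of 2017.
val anchor = LocalDate.of(2016, 12, 30)
val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")

// Hypothetical helper mirroring the body of getDateFromWeek.
def fridayOfWeek(week: Int): String = {
  val date = anchor.plusWeeks(week.toLong)
  s"Friday(${date.format(fmt)})"
}

// The anchor really is a Friday, so every result is a Friday too.
println(anchor.getDayOfWeek == DayOfWeek.FRIDAY)

// Weeks that appear in the question's data.
Seq(26, 27, 28, 29).foreach(w => println(s"$w -> ${fridayOfWeek(w)}"))
```

Because the anchor is hard-coded, this mapping only holds for week numbers of 2017; data spanning several years would need the year as an additional input.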
Can anyone convert this to PySpark?