Spark, SQL aggregation based on a second data set - sql
I have two datasets (DataFrames):
idPeersDS - which has an id column and its peers' ids.
infoDS - which has two type columns (type1, type2) and a metric column.
--
idPeersDS
+---+---------+
| id| peers|
+---+---------+
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
+---+---------+
infoDS
+---+-----+-----+------+
| id|type1|type2|metric|
+---+-----+-----+------+
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
+---+-----+-----+------+
I need to calculate the z-score of the metric for each id, grouped by type1 and type2. However, it is not the z-score over all metrics in the group; it is the z-score over the metrics of the id's peers within the group. If a peer id does not have a metric in the group, that peer's metric is treated as 0.
Example:
For group ("A", "X") and id = 1, the peers are (1, 2, 3), so the metrics used for the z-score are (10, 0, 40). Since id = 2 does not exist in group ("A", "X"), its metric is 0. id = 5 is not a peer of id = 1, so it is not part of the z-score calculation.
+---+------+---------+-----+-----+
| id|metric|    peers|type1|type2|
+---+------+---------+-----+-----+
|  1|  10.0|[1, 2, 3]|    A|    X|
|  3|  40.0|[3, 1, 2]|    A|    X|
|  5|  20.0|[5, 4, 6]|    A|    X|
+---+------+---------+-----+-----+
Z = (X - μ) / σ
Z = (10 - 16.66666) / 16.99673
Z = -0.39223
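To make the arithmetic above easy to check, here is a minimal plain-Scala sketch (no Spark) of the population z-score with peers' metrics. The function name `zScore` is only illustrative; the zero-standard-deviation guard mirrors the one used in the code further down.

```scala
// Population z-score of `x` relative to `values`.
// Returns 0.0 when the standard deviation is zero (same guard as the UDF in the question).
// Assumes `values` is non-empty.
def zScore(x: Double, values: Seq[Double]): Double = {
  val mean = values.sum / values.size
  val variance = values.map(v => math.pow(v - mean, 2)).sum / values.size
  val sd = math.sqrt(variance)
  if (sd == 0.0) 0.0 else (x - mean) / sd
}

// Peers of id = 1 in group ("A", "X") are ids 1, 2, 3 -> metrics 10, 0 (missing), 40
val peerMetrics = Seq(10.0, 0.0, 40.0)
val z = zScore(10.0, peerMetrics)
println(z) // approximately -0.39223, as in the worked example
```

Note that this uses the population variance (divide by n), which is what `getPopulationVariance` gives in the code below; a sample variance (divide by n - 1) would produce different values.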
The output should be the following table. A `peersmetrics` column (shown on the right) instead of the `zScoreValue` column would also be enough, since I can compute the z-score from it as my code does.
+---+------+---------+-----------+-----+-----+
| id|metric| peers|zScoreValue|type1|type2| peersmetrics
+---+------+---------+-----------+-----+-----+
| 1| 10.0|[1, 2, 3]| -0.39| A| X| [10, 0, 40]
| 3| 40.0|[3, 1, 2]| 1.37| A| X| [40, 10, 0]
| 5| 20.0|[5, 4, 6]| 1.41| A| X| [20, 0 , 0]
| 1| 40.0|[1, 2, 3]| 0.98| B| Y| [40, 30, 0]
| 2| 30.0|[2, 1, 6]| 0.27| B| Y| [30, 40, 10]
| 4| 10.0|[4, 5, 6]| 0.71| B| Y|
| 6| 10.0|[6, 1, 2]| -1.34| B| Y|
| 1| 30.0|[1, 2, 3]| 1.07| B| X|
| 2| 20.0|[2, 1, 6]| 0.27| B| X|
| 5| 30.0|[5, 4, 6]| 1.41| B| X|
| 1| 20.0|[1, 2, 3]| 1.22| A| Y|
| 2| 10.0|[2, 1, 6]| -1.07| A| Y|
| 6| 40.0|[6, 1, 2]| 1.34| A| Y|
+---+------+---------+-----------+-----+-----+
Edit 1: A SQL solution is equally appreciated. I can translate SQL into Scala code in my Spark job.
My solution follows, but the computation takes longer than I would like.
The sizes of the real datasets:
idPeersDS has 17000 rows and infoDS has 17000 * 6 * 15 rows.
Any other solution is greatly appreciated. Feel free to edit/recommend title and correct grammar. English is not my first language. Thanks.
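Here is my code.
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
// In spark-shell, `spark` and spark.implicits._ are already in scope.
val idPeersDS = Seq(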
(1, Seq(1,2,3)),
(2, Seq(2,1,6)),
(3, Seq(3,1,2)),
(4, Seq(4,5,6)),
(5, Seq(5,4,6)),
(6, Seq(6,1,2))
).toDS.select($"_1" as "id", $"_2" as "peers")
val infoDS = Seq(
(1, "A", "X", 10),
(1, "A", "Y", 20),
(1, "B", "X", 30),
(1, "B", "Y", 40),
(2, "A", "Y", 10),
(2, "B", "X", 20),
(2, "B", "Y", 30),
(3, "A", "X", 40),
(4, "B", "Y", 10),
(5, "A", "X", 20),
(5, "B", "X", 30),
(6, "A", "Y", 40),
(6, "B", "Y", 10)
).toDS.select($"_1" as "id", $"_2" as "type1", $"_3" as "type2", $"_4" cast "double" as "metric")
def calculateZScoreGivenPeers(idMetricDS: DataFrame, irPeersDS: DataFrame, roundTo: Int = 2)
(implicit spark: SparkSession): DataFrame = {
import spark.implicits._
// for every id in the idMetricDS, get the peers and their metric for zscore, calculate zscore
val fir = idMetricDS.join(irPeersDS, "id")
val fsMapBroadcast = spark.sparkContext.broadcast(
idMetricDS.toDF.map((r: Row) => {r.getInt(0) -> r.getDouble(1)}).rdd.collectAsMap)
val fsMap = fsMapBroadcast.value
val funUdf = udf((currId: Int, xs: WrappedArray[Int]) => {
val zScoreMetrics: Array[Double] = xs.toArray.map(x => fsMap.getOrElse(x, 0.0))
val ds = new DescriptiveStatistics(zScoreMetrics)
val mean = ds.getMean()
val sd = Math.sqrt(ds.getPopulationVariance())
val zScore = if (sd == 0.0) {0.0} else {(fsMap.getOrElse(currId, 0.0)- mean) / sd}
zScore
})
val idStatsWithZscoreDS =
fir.withColumn("zScoreValue", round(funUdf(fir("id"), fir("peers")), roundTo))
fsMapBroadcast.unpersist
fsMapBroadcast.destroy
return idStatsWithZscoreDS
}
val typesComb = infoDS.select("type1", "type2").dropDuplicates.collect
val zScoreDS = typesComb.map(
ept => {
val et = ept.getString(0)
val pt = ept.getString(1)
val idMetricDS = infoDS.where($"type1" === lit(et) && $"type2" === lit(pt)).select($"id", $"metric")
val zScoreDS = calculateZScoreGivenPeers(idMetricDS, idPeersDS)(spark)
zScoreDS.select($"id", $"metric", $"peers", $"zScoreValue").withColumn("type1", lit(et)).withColumn("type2", lit(pt))
}
).reduce(_.union(_))
scala> idPeersDS.show(100)
+---+---------+
| id| peers|
+---+---------+
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
+---+---------+
scala> infoDS.show(100)
+---+-----+-----+------+
| id|type1|type2|metric|
+---+-----+-----+------+
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
+---+-----+-----+------+
scala> typesComb
res3: Array[org.apache.spark.sql.Row] = Array([A,X], [B,Y], [B,X], [A,Y])
scala> zScoreDS.show(100)
+---+------+---------+-----------+-----+-----+
| id|metric| peers|zScoreValue|type1|type2|
+---+------+---------+-----------+-----+-----+
| 1| 10.0|[1, 2, 3]| -0.39| A| X|
| 3| 40.0|[3, 1, 2]| 1.37| A| X|
| 5| 20.0|[5, 4, 6]| 1.41| A| X|
| 1| 40.0|[1, 2, 3]| 0.98| B| Y|
| 2| 30.0|[2, 1, 6]| 0.27| B| Y|
| 4| 10.0|[4, 5, 6]| 0.71| B| Y|
| 6| 10.0|[6, 1, 2]| -1.34| B| Y|
| 1| 30.0|[1, 2, 3]| 1.07| B| X|
| 2| 20.0|[2, 1, 6]| 0.27| B| X|
| 5| 30.0|[5, 4, 6]| 1.41| B| X|
| 1| 20.0|[1, 2, 3]| 1.22| A| Y|
| 2| 10.0|[2, 1, 6]| -1.07| A| Y|
| 6| 40.0|[6, 1, 2]| 1.34| A| Y|
+---+------+---------+-----------+-----+-----+
I solved it. Here is my answer. On my real datasets this solution ran significantly faster (in less than 1/10th of the time) than the solution in the question.
I avoided the collect to the driver, as well as the map over type combinations and the union of datasets in the reduce.
val idPeersDS = Seq(
(1, Seq(1,2,3)),
(2, Seq(2,1,6)),
(3, Seq(3,1,2)),
(4, Seq(4,5,6)),
(5, Seq(5,4,6)),
(6, Seq(6,1,2))
).toDS.select($"_1" as "id", $"_2" as "peers")
val infoDS = Seq(
(1, "A", "X", 10),
(1, "A", "Y", 20),
(1, "B", "X", 30),
(1, "B", "Y", 40),
(2, "A", "Y", 10),
(2, "B", "X", 20),
(2, "B", "Y", 30),
(3, "A", "X", 40),
(4, "B", "Y", 10),
(5, "A", "X", 20),
(5, "B", "X", 30),
(6, "A", "Y", 40),
(6, "B", "Y", 10)
).toDS.select($"_1" as "id", $"_2" as "type1", $"_3" as "type2", $"_4" cast "double" as "metric")
// Exiting paste mode, now interpreting.
idPeersDS: org.apache.spark.sql.DataFrame = [id: int, peers: array<int>]
infoDS: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 2 more fields]
scala> idPeersDS.show
+---+---------+
| id| peers|
+---+---------+
| 1|[1, 2, 3]|
| 2|[2, 1, 6]|
| 3|[3, 1, 2]|
| 4|[4, 5, 6]|
| 5|[5, 4, 6]|
| 6|[6, 1, 2]|
+---+---------+
scala> infoDS.show
+---+-----+-----+------+
| id|type1|type2|metric|
+---+-----+-----+------+
| 1| A| X| 10.0|
| 1| A| Y| 20.0|
| 1| B| X| 30.0|
| 1| B| Y| 40.0|
| 2| A| Y| 10.0|
| 2| B| X| 20.0|
| 2| B| Y| 30.0|
| 3| A| X| 40.0|
| 4| B| Y| 10.0|
| 5| A| X| 20.0|
| 5| B| X| 30.0|
| 6| A| Y| 40.0|
| 6| B| Y| 10.0|
+---+-----+-----+------+
scala> val infowithpeers = infoDS.join(idPeersDS, "id")
infowithpeers: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 3 more fields]
scala> infowithpeers.show
+---+-----+-----+------+---------+
| id|type1|type2|metric| peers|
+---+-----+-----+------+---------+
| 1| A| X| 10.0|[1, 2, 3]|
| 1| A| Y| 20.0|[1, 2, 3]|
| 1| B| X| 30.0|[1, 2, 3]|
| 1| B| Y| 40.0|[1, 2, 3]|
| 2| A| Y| 10.0|[2, 1, 6]|
| 2| B| X| 20.0|[2, 1, 6]|
| 2| B| Y| 30.0|[2, 1, 6]|
| 3| A| X| 40.0|[3, 1, 2]|
| 4| B| Y| 10.0|[4, 5, 6]|
| 5| A| X| 20.0|[5, 4, 6]|
| 5| B| X| 30.0|[5, 4, 6]|
| 6| A| Y| 40.0|[6, 1, 2]|
| 6| B| Y| 10.0|[6, 1, 2]|
+---+-----+-----+------+---------+
scala> val joinMap = udf { values: Seq[Map[Int,Double]] => values.flatten.toMap }
joinMap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(IntegerType,DoubleType,false),Some(List(ArrayType(MapType(IntegerType,DoubleType,false),true))))
scala> val zScoreCal = udf { (metric: Double, zScoreMetrics: WrappedArray[Double]) =>
| val ds = new DescriptiveStatistics(zScoreMetrics.toArray)
| val mean = ds.getMean()
| val sd = Math.sqrt(ds.getPopulationVariance())
| val zScore = if (sd == 0.0) {0.0} else {(metric - mean) / sd}
| zScore
| }
zScoreCal: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,DoubleType,Some(List(DoubleType, ArrayType(DoubleType,false))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
val infowithpeersidmetric = infowithpeers.withColumn("idmetric", map($"id",$"metric"))
val idsingrpdf = infowithpeersidmetric.groupBy("type1","type2").agg(joinMap(collect_list(map($"id", $"metric"))) as "idsingrp")
val metricsMap = udf { (peers: Seq[Int], values: Map[Int,Double]) => {
peers.map(p => values.getOrElse(p,0.0))
}
}
// Exiting paste mode, now interpreting.
infowithpeersidmetric: org.apache.spark.sql.DataFrame = [id: int, type1: string ... 4 more fields]
idsingrpdf: org.apache.spark.sql.DataFrame = [type1: string, type2: string ... 1 more field]
metricsMap: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(DoubleType,false),Some(List(ArrayType(IntegerType,false), MapType(IntegerType,DoubleType,false))))
scala> val infoWithMap = infowithpeers.join(idsingrpdf, Seq("type1","type2")).withColumn("zScoreMetrics", metricsMap($"peers", $"idsingrp")).withColumn("zscore", round(zScoreCal($"metric",$"zScoreMetrics"),2))
infoWithMap: org.apache.spark.sql.DataFrame = [type1: string, type2: string ... 6 more fields]
scala> infoWithMap.show
+-----+-----+---+------+---------+--------------------+------------------+------+
|type1|type2| id|metric| peers| idsingrp| zScoreMetrics|zscore|
+-----+-----+---+------+---------+--------------------+------------------+------+
| A| X| 1| 10.0|[1, 2, 3]|[3 -> 40.0, 5 -> ...| [10.0, 0.0, 40.0]| -0.39|
| A| Y| 1| 20.0|[1, 2, 3]|[2 -> 10.0, 6 -> ...| [20.0, 10.0, 0.0]| 1.22|
| B| X| 1| 30.0|[1, 2, 3]|[1 -> 30.0, 2 -> ...| [30.0, 20.0, 0.0]| 1.07|
| B| Y| 1| 40.0|[1, 2, 3]|[4 -> 10.0, 1 -> ...| [40.0, 30.0, 0.0]| 0.98|
| A| Y| 2| 10.0|[2, 1, 6]|[2 -> 10.0, 6 -> ...|[10.0, 20.0, 40.0]| -1.07|
| B| X| 2| 20.0|[2, 1, 6]|[1 -> 30.0, 2 -> ...| [20.0, 30.0, 0.0]| 0.27|
| B| Y| 2| 30.0|[2, 1, 6]|[4 -> 10.0, 1 -> ...|[30.0, 40.0, 10.0]| 0.27|
| A| X| 3| 40.0|[3, 1, 2]|[3 -> 40.0, 5 -> ...| [40.0, 10.0, 0.0]| 1.37|
| B| Y| 4| 10.0|[4, 5, 6]|[4 -> 10.0, 1 -> ...| [10.0, 0.0, 10.0]| 0.71|
| A| X| 5| 20.0|[5, 4, 6]|[3 -> 40.0, 5 -> ...| [20.0, 0.0, 0.0]| 1.41|
| B| X| 5| 30.0|[5, 4, 6]|[1 -> 30.0, 2 -> ...| [30.0, 0.0, 0.0]| 1.41|
| A| Y| 6| 40.0|[6, 1, 2]|[2 -> 10.0, 6 -> ...|[40.0, 20.0, 10.0]| 1.34|
| B| Y| 6| 10.0|[6, 1, 2]|[4 -> 10.0, 1 -> ...|[10.0, 40.0, 30.0]| -1.34|
+-----+-----+---+------+---------+--------------------+------------------+------+
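For readers who want to follow the logic without a Spark session, the same two steps (build one id -> metric map per group, then look up each row's peers in that map with a default of 0.0) can be sketched with plain Scala collections. This is only an illustration of the idea on the sample data, not the Spark job itself; the names `metricsByGroup` and `scored` are made up for this sketch.

```scala
// (id, type1, type2, metric) rows and id -> peers, copied from the sample data
val info = Seq(
  (1, "A", "X", 10.0), (1, "A", "Y", 20.0), (1, "B", "X", 30.0), (1, "B", "Y", 40.0),
  (2, "A", "Y", 10.0), (2, "B", "X", 20.0), (2, "B", "Y", 30.0), (3, "A", "X", 40.0),
  (4, "B", "Y", 10.0), (5, "A", "X", 20.0), (5, "B", "X", 30.0),
  (6, "A", "Y", 40.0), (6, "B", "Y", 10.0)
)
val peers = Map(
  1 -> Seq(1, 2, 3), 2 -> Seq(2, 1, 6), 3 -> Seq(3, 1, 2),
  4 -> Seq(4, 5, 6), 5 -> Seq(5, 4, 6), 6 -> Seq(6, 1, 2)
)

// Population z-score with the same zero-sd guard as zScoreCal
def zScore(x: Double, values: Seq[Double]): Double = {
  val mean = values.sum / values.size
  val sd = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / values.size)
  if (sd == 0.0) 0.0 else (x - mean) / sd
}

// Step 1: one id -> metric map per (type1, type2) group (the role of `idsingrp`)
val metricsByGroup: Map[(String, String), Map[Int, Double]] =
  info.groupBy(r => (r._2, r._3)).map { case (group, rows) =>
    group -> rows.map(r => r._1 -> r._4).toMap
  }

// Step 2: per row, look up each peer's metric in the row's own group (default 0.0)
val scored = info.map { case (id, t1, t2, metric) =>
  val peerMetrics = peers(id).map(p => metricsByGroup((t1, t2)).getOrElse(p, 0.0))
  (id, t1, t2, metric, peerMetrics, zScore(metric, peerMetrics))
}
scored.foreach(println)
```

The point of the design is that the per-group map is built once per (type1, type2) and shared by every row of that group, instead of filtering the dataset and broadcasting a map once per type combination.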
Related
How to compute a cumulative sum under a limit with Spark?
After several tries and some research, I'm stuck trying to solve the following problem with Spark. I have a DataFrame of elements with a priority and a quantity.
+------+-------+--------+---+
|family|element|priority|qty|
+------+-------+--------+---+
|    f1| elmt 1|       1| 20|
|    f1| elmt 2|       2| 40|
|    f1| elmt 3|       3| 10|
|    f1| elmt 4|       4| 50|
|    f1| elmt 5|       5| 40|
|    f1| elmt 6|       6| 10|
|    f1| elmt 7|       7| 20|
|    f1| elmt 8|       8| 10|
+------+-------+--------+---+
I have a fixed limit quantity:
+------+--------+
|family|limitQty|
+------+--------+
|    f1|     100|
+------+--------+
I want to mark as "ok" the elements whose cumulative sum stays within the limit. Here is the expected result:
+------+-------+--------+---+---+
|family|element|priority|qty| ok|
+------+-------+--------+---+---+
|    f1| elmt 1|       1| 20|  1| -> 20 < 100 => ok
|    f1| elmt 2|       2| 40|  1| -> 20 + 40 < 100 => ok
|    f1| elmt 3|       3| 10|  1| -> 20 + 40 + 10 < 100 => ok
|    f1| elmt 4|       4| 50|  0| -> 20 + 40 + 10 + 50 > 100 => ko
|    f1| elmt 5|       5| 40|  0| -> 20 + 40 + 10 + 40 > 100 => ko
|    f1| elmt 6|       6| 10|  1| -> 20 + 40 + 10 + 10 < 100 => ok
|    f1| elmt 7|       7| 20|  1| -> 20 + 40 + 10 + 10 + 20 <= 100 => ok
|    f1| elmt 8|       8| 10|  0| -> 20 + 40 + 10 + 10 + 20 + 10 > 100 => ko
+------+-------+--------+---+---+
I tried to solve it with a cumulative sum:
initDF
  .join(limitQtyDF, Seq("family"), "left_outer")
  .withColumn("cumulSum", sum($"qty").over(Window.partitionBy("family").orderBy("priority")))
  .withColumn("ok", when($"cumulSum" <= $"limitQty", 1).otherwise(0))
  .drop("cumulSum", "limitQty")
But this is not enough, because elements that come after the one exceeding the limit are not taken into account. I can't find a way to solve it with Spark. Do you have an idea?
Here is the corresponding Scala code:
val sparkSession = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()
import sparkSession.implicits._
val initDF = Seq(
  ("f1", "elmt 1", 1, 20),
  ("f1", "elmt 2", 2, 40),
  ("f1", "elmt 3", 3, 10),
  ("f1", "elmt 4", 4, 50),
  ("f1", "elmt 5", 5, 40),
  ("f1", "elmt 6", 6, 10),
  ("f1", "elmt 7", 7, 20),
  ("f1", "elmt 8", 8, 10)
).toDF("family", "element", "priority", "qty")
val limitQtyDF = Seq(("f1", 100)).toDF("family", "limitQty")
val expectedDF = Seq(
  ("f1", "elmt 1", 1, 20, 1),
  ("f1", "elmt 2", 2, 40, 1),
  ("f1", "elmt 3", 3, 10, 1),
  ("f1", "elmt 4", 4, 50, 0),
  ("f1", "elmt 5", 5, 40, 0),
  ("f1", "elmt 6", 6, 10, 1),
  ("f1", "elmt 7", 7, 20, 1),
  ("f1", "elmt 8", 8, 10, 0)
).toDF("family", "element", "priority", "qty", "ok")
expectedDF.show()
Thank you for your help!
The solution is shown below:
scala> initDF.show
+------+-------+--------+---+
|family|element|priority|qty|
+------+-------+--------+---+
|    f1| elmt 1|       1| 20|
|    f1| elmt 2|       2| 40|
|    f1| elmt 3|       3| 10|
|    f1| elmt 4|       4| 50|
|    f1| elmt 5|       5| 40|
|    f1| elmt 6|       6| 10|
|    f1| elmt 7|       7| 20|
|    f1| elmt 8|       8| 10|
+------+-------+--------+---+
scala> val df1 = initDF.groupBy("family").agg(collect_list("qty").as("comb_qty"), collect_list("priority").as("comb_prior"), collect_list("element").as("comb_elem"))
df1: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 2 more fields]
scala> df1.show
+------+--------------------+--------------------+--------------------+
|family|            comb_qty|          comb_prior|           comb_elem|
+------+--------------------+--------------------+--------------------+
|    f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...|
+------+--------------------+--------------------+--------------------+
scala> val df2 = df1.join(limitQtyDF, df1("family") === limitQtyDF("family")).drop(limitQtyDF("family"))
df2: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 3 more fields]
scala> df2.show
+------+--------------------+--------------------+--------------------+--------+
|family|            comb_qty|          comb_prior|           comb_elem|limitQty|
+------+--------------------+--------------------+--------------------+--------+
|    f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...|     100|
+------+--------------------+--------------------+--------------------+--------+
scala> def validCheck = (qty: Seq[Int], limit: Int) => {
     |   var sum = 0
     |   qty.map(elem => {
     |     if (elem + sum <= limit) {
     |       sum = sum + elem
     |       1
     |     } else {
     |       0
     |     }})}
validCheck: (scala.collection.mutable.Seq[Int], Int) => scala.collection.mutable.Seq[Int]
scala> val newUdf = udf(validCheck)
newUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(IntegerType,false),Some(List(ArrayType(IntegerType,false), IntegerType)))
scala> val df3 = df2.withColumn("valid", newUdf(col("comb_qty"),col("limitQty"))).drop("limitQty")
df3: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 3 more fields]
scala> df3.show
+------+--------------------+--------------------+--------------------+--------------------+
|family|            comb_qty|          comb_prior|           comb_elem|               valid|
+------+--------------------+--------------------+--------------------+--------------------+
|    f1|[20, 40, 10, 50, ...|[1, 2, 3, 4, 5, 6...|[elmt 1, elmt 2, ...|[1, 1, 1, 0, 0, 1...|
+------+--------------------+--------------------+--------------------+--------------------+
scala> val myUdf = udf((qty: Seq[Int], prior: Seq[Int], elem: Seq[String], valid: Seq[Int]) => {
     |   elem zip prior zip qty zip valid map {
     |     case (((a,b),c),d) => (a,b,c,d)}
     |   }
     | )
scala> val df4 = df3.withColumn("combined", myUdf(col("comb_qty"),col("comb_prior"),col("comb_elem"),col("valid")))
df4: org.apache.spark.sql.DataFrame = [family: string, comb_qty: array<int> ... 4 more fields]
scala> val df5 = df4.drop("comb_qty","comb_prior","comb_elem","valid")
df5: org.apache.spark.sql.DataFrame = [family: string, combined: array<struct<_1:string,_2:int,_3:int,_4:int>>]
scala> df5.show(false)
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|family|combined                                                                                                                                                        |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|f1    |[[elmt 1, 1, 20, 1], [elmt 2, 2, 40, 1], [elmt 3, 3, 10, 1], [elmt 4, 4, 50, 0], [elmt 5, 5, 40, 0], [elmt 6, 6, 10, 1], [elmt 7, 7, 20, 1], [elmt 8, 8, 10, 0]]|
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
scala> val df6 = df5.withColumn("combined",explode(col("combined")))
df6: org.apache.spark.sql.DataFrame = [family: string, combined: struct<_1: string, _2: int ... 2 more fields>]
scala> df6.show
+------+------------------+
|family|          combined|
+------+------------------+
|    f1|[elmt 1, 1, 20, 1]|
|    f1|[elmt 2, 2, 40, 1]|
|    f1|[elmt 3, 3, 10, 1]|
|    f1|[elmt 4, 4, 50, 0]|
|    f1|[elmt 5, 5, 40, 0]|
|    f1|[elmt 6, 6, 10, 1]|
|    f1|[elmt 7, 7, 20, 1]|
|    f1|[elmt 8, 8, 10, 0]|
+------+------------------+
scala> val df7 = df6.select("family", "combined._1", "combined._2", "combined._3", "combined._4").withColumnRenamed("_1","element").withColumnRenamed("_2","priority").withColumnRenamed("_3", "qty").withColumnRenamed("_4","ok")
df7: org.apache.spark.sql.DataFrame = [family: string, element: string ... 3 more fields]
scala> df7.show
+------+-------+--------+---+---+
|family|element|priority|qty| ok|
+------+-------+--------+---+---+
|    f1| elmt 1|       1| 20|  1|
|    f1| elmt 2|       2| 40|  1|
|    f1| elmt 3|       3| 10|  1|
|    f1| elmt 4|       4| 50|  0|
|    f1| elmt 5|       5| 40|  0|
|    f1| elmt 6|       6| 10|  1|
|    f1| elmt 7|       7| 20|  1|
|    f1| elmt 8|       8| 10|  0|
+------+-------+--------+---+---+
Let me know if it helps!
Another way to do it is to iterate row by row on the driver (note that df.collect brings the whole DataFrame to the driver).
var bufferRow: collection.mutable.Buffer[Row] = collection.mutable.Buffer.empty[Row]
var tempSum: Double = 0
val iterator = df.collect.iterator
while (iterator.hasNext) {
  val record = iterator.next()
  val y = record.getAs[Integer]("qty")
  tempSum = tempSum + y
  print(record)
  if (tempSum <= 100.0) {
    bufferRow = bufferRow ++ Seq(transformRow(record, 1))
  } else {
    bufferRow = bufferRow ++ Seq(transformRow(record, 0))
    // this row does not fit, so undo its contribution to the running sum
    tempSum = tempSum - y
  }
}
The transformRow function used above appends a column to a row. It has to be defined before the loop runs.
def transformRow(row: Row, flag: Int): Row = Row.fromSeq(row.toSeq ++ Array[Integer](flag))
The next step is to add the additional column to the schema.
val newSchema = StructType(df.schema.fields ++ Array(StructField("C_Sum", IntegerType, false)))
Then create a new DataFrame.
val outputdf = spark.createDataFrame(spark.sparkContext.parallelize(bufferRow.toSeq), newSchema)
Output DataFrame:
+------+-------+--------+---+-----+
|family|element|priority|qty|C_Sum|
+------+-------+--------+---+-----+
|    f1|  elmt1|       1| 20|    1|
|    f1|  elmt2|       2| 40|    1|
|    f1|  elmt3|       3| 10|    1|
|    f1|  elmt4|       4| 50|    0|
|    f1|  elmt5|       5| 40|    0|
|    f1|  elmt6|       6| 10|    1|
|    f1|  elmt7|       7| 20|    1|
|    f1|  elmt8|       8| 10|    0|
+------+-------+--------+---+-----+
I am new to Spark so this solution may not be optimal. I am assuming the value of 100 is an input to the program here. In that case:
case class Frame(family: String, element: String, priority: Int, qty: Int)
import scala.collection.JavaConverters._
val ans = df.as[Frame].toLocalIterator
  .asScala
  .foldLeft((Seq.empty[Int], 0))((acc, a) =>
    if (acc._2 + a.qty <= 100) (acc._1 :+ a.priority, acc._2 + a.qty) else acc)._1
df.withColumn("OK", when($"priority".isin(ans: _*), 1).otherwise(0)).show
results in:
+------+-------+--------+---+--------+
|family|element|priority|qty|OK      |
+------+-------+--------+---+--------+
|    f1| elmt 1|       1| 20|       1|
|    f1| elmt 2|       2| 40|       1|
|    f1| elmt 3|       3| 10|       1|
|    f1| elmt 4|       4| 50|       0|
|    f1| elmt 5|       5| 40|       0|
|    f1| elmt 6|       6| 10|       1|
|    f1| elmt 7|       7| 20|       1|
|    f1| elmt 8|       8| 10|       0|
+------+-------+--------+---+--------+
The idea is simply to get a Scala iterator, extract the participating priority values from it, and then use those values to flag the participating rows. Since this solution gathers all the data in memory on one machine, it could run into memory problems if the DataFrame is too large to fit in memory.
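The rule that all of these answers implement (skip an element that would push the running total over the limit, but keep considering later ones) can be sketched without Spark as a single fold. This is a minimal illustration on the sample data; the `Item` case class and the name `flagged` are made up for the sketch, and it assumes one family whose rows fit in memory.

```scala
case class Item(family: String, element: String, priority: Int, qty: Int)

val items = Seq(
  Item("f1", "elmt 1", 1, 20), Item("f1", "elmt 2", 2, 40),
  Item("f1", "elmt 3", 3, 10), Item("f1", "elmt 4", 4, 50),
  Item("f1", "elmt 5", 5, 40), Item("f1", "elmt 6", 6, 10),
  Item("f1", "elmt 7", 7, 20), Item("f1", "elmt 8", 8, 10)
)
val limit = 100

// Walk the items in priority order, carrying the running total of accepted
// quantities. An item gets ok = 1 only if adding it keeps the total within
// the limit; a rejected item leaves the total unchanged.
val (flagged, _) =
  items.sortBy(_.priority).foldLeft((Vector.empty[(Item, Int)], 0)) {
    case ((acc, total), item) =>
      if (total + item.qty <= limit) (acc :+ (item -> 1), total + item.qty)
      else (acc :+ (item -> 0), total)
  }

flagged.foreach { case (item, ok) => println(s"${item.element} ${item.qty} $ok") }
```

Because each decision depends on which earlier elements were accepted, a plain window sum cannot express it, which is why the answers fall back to sequential processing per family.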
Cumulative sum for each group
from pyspark.sql import functions as f
from pyspark.sql.window import Window as window
from pyspark.sql.types import IntegerType, StringType, FloatType, StructType, StructField, DateType
schema = StructType() \
    .add(StructField("empno", IntegerType(), True)) \
    .add(StructField("ename", StringType(), True)) \
    .add(StructField("job", StringType(), True)) \
    .add(StructField("mgr", StringType(), True)) \
    .add(StructField("hiredate", DateType(), True)) \
    .add(StructField("sal", FloatType(), True)) \
    .add(StructField("comm", StringType(), True)) \
    .add(StructField("deptno", IntegerType(), True))
emp = spark.read.csv('data/emp.csv', schema)
dept_partition = window.partitionBy(emp.deptno).orderBy(emp.sal)
emp_win = emp.withColumn("dept_cum_sal",
    f.sum(emp.sal).over(dept_partition.rowsBetween(window.unboundedPreceding, window.currentRow)))
emp_win.show()
The results look like this:
+-----+------+---------+----+----------+------+-------+------+------------+
|empno| ename|      job| mgr|  hiredate|   sal|   comm|deptno|dept_cum_sal|
+-----+------+---------+----+----------+------+-------+------+------------+
| 7369| SMITH|    CLERK|7902|1980-12-17| 800.0|   null|    20|       800.0|
| 7876| ADAMS|    CLERK|7788|1983-01-12|1100.0|   null|    20|      1900.0|
| 7566| JONES|  MANAGER|7839|1981-04-02|2975.0|   null|    20|      4875.0|
| 7788| SCOTT|  ANALYST|7566|1982-12-09|3000.0|   null|    20|      7875.0|
| 7902|  FORD|  ANALYST|7566|1981-12-03|3000.0|   null|    20|     10875.0|
| 7934|MILLER|    CLERK|7782|1982-01-23|1300.0|   null|    10|      1300.0|
| 7782| CLARK|  MANAGER|7839|1981-06-09|2450.0|   null|    10|      3750.0|
| 7839|  KING|PRESIDENT|null|1981-11-17|5000.0|   null|    10|      8750.0|
| 7900| JAMES|    CLERK|7698|1981-12-03| 950.0|   null|    30|       950.0|
| 7521|  WARD| SALESMAN|7698|1981-02-22|1250.0| 500.00|    30|      2200.0|
| 7654|MARTIN| SALESMAN|7698|1981-09-28|1250.0|1400.00|    30|      3450.0|
| 7844|TURNER| SALESMAN|7698|1981-09-08|1500.0|   0.00|    30|      4950.0|
| 7499| ALLEN| SALESMAN|7698|1981-02-20|1600.0| 300.00|    30|      6550.0|
| 7698| BLAKE|  MANAGER|7839|1981-05-01|2850.0|   null|    30|      9400.0|
+-----+------+---------+----+----------+------+-------+------+------------+
Please find the answer below.
val initDF = Seq(
  ("f1", "elmt 1", 1, 20), ("f1", "elmt 2", 2, 40), ("f1", "elmt 3", 3, 10), ("f1", "elmt 4", 4, 50),
  ("f1", "elmt 5", 5, 40), ("f1", "elmt 6", 6, 10), ("f1", "elmt 7", 7, 20), ("f1", "elmt 8", 8, 10)
).toDF("family", "element", "priority", "qty")
val limitQtyDF = Seq(("f1", 100)).toDF("family", "limitQty")
sc.broadcast(limitQtyDF)
val joinedInitDF = initDF.join(limitQtyDF, Seq("family"), "left")
case class dataResult(family: String, element: String, priority: Int, qty: Int, comutedValue: Int, limitQty: Int, controlOut: String)
val familyIDs = initDF.select("family").distinct.collect.map(_(0).toString).toList
def checkingUDF(inputRows: List[Row]) = {
  var controlVarQty = 0
  val outputArrayBuffer = collection.mutable.ArrayBuffer[dataResult]()
  val setLimit = inputRows.head.getInt(4)
  for (inputRow <- inputRows) {
    val currQty = inputRow.getInt(3)
    //val outpurForRec=
    controlVarQty + currQty match {
      case value if value <= setLimit =>
        controlVarQty += currQty
        outputArrayBuffer += dataResult(inputRow.getString(0), inputRow.getString(1), inputRow.getInt(2), inputRow.getInt(3), value, setLimit, "ok")
      case value =>
        outputArrayBuffer += dataResult(inputRow.getString(0), inputRow.getString(1), inputRow.getInt(2), inputRow.getInt(3), value, setLimit, "ko")
    }
    //outputArrayBuffer+=Row(inputRow.getString(0),inputRow.getString(1),inputRow.getInt(2),inputRow.getInt(3),controlVarQty+currQty,setLimit,outpurForRec)
  }
  outputArrayBuffer.toList
}
val tmpAB = collection.mutable.ArrayBuffer[List[dataResult]]()
for (familyID <- familyIDs) { // val familyID="f1"
  val currentFamily = joinedInitDF.filter(s"family = '${familyID}'").orderBy("element", "priority").collect.toList
  tmpAB += checkingUDF(currentFamily)
}
tmpAB.toSeq.flatMap(x => x).toDF.show(false)
This works for me.
+------+-------+--------+---+------------+--------+----------+
|family|element|priority|qty|comutedValue|limitQty|controlOut|
+------+-------+--------+---+------------+--------+----------+
|f1    |elmt 1 |1       |20 |20          |100     |ok        |
|f1    |elmt 2 |2       |40 |60          |100     |ok        |
|f1    |elmt 3 |3       |10 |70          |100     |ok        |
|f1    |elmt 4 |4       |50 |120         |100     |ko        |
|f1    |elmt 5 |5       |40 |110         |100     |ko        |
|f1    |elmt 6 |6       |10 |80          |100     |ok        |
|f1    |elmt 7 |7       |20 |100         |100     |ok        |
|f1    |elmt 8 |8       |10 |110         |100     |ko        |
+------+-------+--------+---+------------+--------+----------+
Please drop any unnecessary columns from the output.
Writing select queries on dataframe based on where condition from other dataframe, scala
I have two DataFrames with the following columns:
DF1 - partitionNum, lowerBound, upperBound
DF2 - ID, cumulativeCount
I want a resulting DataFrame with ID, partitionNum.
I have done a cross join, which performs badly:
DF2.crossJoin(DF1).where(col("cumulativeCount").between(col("lowerBound"), col("upperBound"))).orderBy("cumulativeCount")
  .select("ID", "partitionNum")
Since DF2 has 5 million rows and DF1 has 50 rows, this cross join yields 250 million rows and the task is dying. How can I write this as a select where the resulting DataFrame has ID from DF2 and partitionNum from DF1, with the condition: select partitionNum from DF1 where cumulativeCount of DF2 is between lowerBound and upperBound of DF1?
I am looking for something like the query below. Will this work?
sparkSession.sqlContext.sql("SELECT ID, cumulativeCount, A.partitionNum FROM CumulativeCountViewById WHERE cumulativeCount IN " +
  "(SELECT partitionNum FROM CumulativeRangeView WHERE cumulativeCount BETWEEN lowerBound and upperBound) AS A")
Try this. You don't need a cross join. Since your DF1 has only 50 rows, convert it to a map with key: partitionNum, value: Tuple2(lowerBound, upperBound). Create a UDF which takes a number (your cumulativeCount) and checks it against the map to return the keys (i.e., partitionNums) for which lowerBound < cumulativeCount < upperBound. You may edit the UDF to return only the partition numbers and explode the "partNums" array column at the end if you choose to.
scala> DF1.show
+------------+----------+----------+
|partitionNum|lowerBound|upperBound|
+------------+----------+----------+
|           1|        10|        20|
|           2|         5|        10|
|           3|         6|        15|
|           4|         8|        20|
+------------+----------+----------+
scala> DF2.show
+---+---------------+
| ID|cumulativeCount|
+---+---------------+
|100|              5|
|100|             10|
|100|             15|
|100|             20|
|100|             25|
|100|             30|
|100|              6|
|100|             12|
|100|             18|
|100|             24|
|101|              1|
|101|              2|
|101|              3|
|101|              4|
|101|              5|
|101|              6|
|101|              7|
|101|              8|
|101|              9|
|101|             10|
+---+---------------+
scala> val smallData = DF1.collect.map(row => row.getInt(0) -> (row.getInt(1), row.getInt(2))).toMap
smallData: scala.collection.immutable.Map[Int,(Int, Int)] = Map(1 -> (10,20), 2 -> (5,10), 3 -> (6,15), 4 -> (8,20))
scala> val myUdf = udf((num:Int) => smallData.filter((v) => v._2._2 > num && num > v._2._1))
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,MapType(IntegerType,StructType(StructField(_1,IntegerType,false), StructField(_2,IntegerType,false)),true),Some(List(IntegerType)))
scala> DF2.withColumn("partNums", myUdf($"cumulativeCount")).show(false)
+---+---------------+-------------------------------------------+
|ID |cumulativeCount|partNums                                   |
+---+---------------+-------------------------------------------+
|100|5              |[]                                         |
|100|10             |[3 -> [6, 15], 4 -> [8, 20]]               |
|100|15             |[1 -> [10, 20], 4 -> [8, 20]]              |
|100|20             |[]                                         |
|100|25             |[]                                         |
|100|30             |[]                                         |
|100|6              |[2 -> [5, 10]]                             |
|100|12             |[1 -> [10, 20], 3 -> [6, 15], 4 -> [8, 20]]|
|100|18             |[1 -> [10, 20], 4 -> [8, 20]]              |
|100|24             |[]                                         |
|101|1              |[]                                         |
|101|2              |[]                                         |
|101|3              |[]                                         |
|101|4              |[]                                         |
|101|5              |[]                                         |
|101|6              |[2 -> [5, 10]]                             |
|101|7              |[2 -> [5, 10], 3 -> [6, 15]]               |
|101|8              |[2 -> [5, 10], 3 -> [6, 15]]               |
|101|9              |[2 -> [5, 10], 3 -> [6, 15], 4 -> [8, 20]] |
|101|10             |[3 -> [6, 15], 4 -> [8, 20]]               |
+---+---------------+-------------------------------------------+
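One detail worth checking when adapting the UDF above: it uses strict comparisons, while the question's `between` (SQL BETWEEN) includes both bounds. A small plain-Scala sketch of the lookup, using the same sample ranges, shows the difference; the function names `partitionsExclusive` and `partitionsInclusive` are made up for this illustration.

```scala
// partitionNum -> (lowerBound, upperBound), same values as `smallData` above
val ranges: Map[Int, (Int, Int)] =
  Map(1 -> (10, 20), 2 -> (5, 10), 3 -> (6, 15), 4 -> (8, 20))

// Strict bounds, matching the UDF in the answer
def partitionsExclusive(n: Int): Seq[Int] =
  ranges.collect { case (p, (lo, hi)) if lo < n && n < hi => p }.toSeq.sorted

// Inclusive bounds, matching BETWEEN in the question
def partitionsInclusive(n: Int): Seq[Int] =
  ranges.collect { case (p, (lo, hi)) if lo <= n && n <= hi => p }.toSeq.sorted

println(partitionsExclusive(10)) // partitions 3 and 4, as in the table above
println(partitionsInclusive(10)) // additionally includes partitions 1 and 2
```

Returning only the partition numbers (rather than the whole map entries) also keeps the resulting column small, which is the variation the answer suggests before exploding the array.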
Grab last different data on Spark Dataframe?
I have this data in a Spark DataFrame:
+------+-------+-----+------------+----------+---------+
|sernum|product|state|testDateTime|testResult|      msg|
+------+-------+-----+------------+----------+---------+
|     8|    PA1|  1.0|        1.18|      pass|testlog18|
|     7|    PA1|  1.0|        1.17|      fail|testlog17|
|     6|    PA1|  1.0|        1.16|      pass|testlog16|
|     5|    PA1|  1.0|        1.15|      fail|testlog15|
|     4|    PA1|  2.0|        1.14|      fail|testlog14|
|     3|    PA1|  1.0|        1.13|      pass|testlog13|
|     2|    PA1|  2.0|        1.12|      pass|testlog12|
|     1|    PA1|  1.0|        1.11|      fail|testlog11|
+------+-------+-----+------------+----------+---------+
What I care about are the rows with testResult == "fail", and the hard part is that I need to get the last "pass" message as an extra column, grouped by product + state:
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult|      msg|  passMsg|
+------+-------+-----+------------+----------+---------+---------+
|     7|    PA1|  1.0|        1.17|      fail|testlog17|testlog16|
|     5|    PA1|  1.0|        1.15|      fail|testlog15|testlog13|
|     4|    PA1|  2.0|        1.14|      fail|testlog14|testlog12|
|     1|    PA1|  1.0|        1.11|      fail|testlog11|     null|
+------+-------+-----+------------+----------+---------+---------+
How can I achieve this using DataFrame or SQL?
The trick is to define groups where each group starts with a passed test. Then use window functions again, with the group as an additional partition column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq(
  (8, "PA1", 1.0, 1.18, "pass", "testlog18"),
  (7, "PA1", 1.0, 1.17, "fail", "testlog17"),
  (6, "PA1", 1.0, 1.16, "pass", "testlog16"),
  (5, "PA1", 1.0, 1.15, "fail", "testlog15"),
  (4, "PA1", 2.0, 1.14, "fail", "testlog14"),
  (3, "PA1", 1.0, 1.13, "pass", "testlog13"),
  (2, "PA1", 2.0, 1.12, "pass", "testlog12"),
  (1, "PA1", 1.0, 1.11, "fail", "testlog11")
).toDF("sernum", "product", "state", "testDateTime", "testResult", "msg")
df
  // running count of passes: every pass starts a new group; rows before the first pass get null
  .withColumn("group", sum(when($"testResult" === "pass", 1)).over(Window.partitionBy($"product", $"state").orderBy($"testDateTime")))
  // the first row of each group is the pass that started it
  .withColumn("passMsg", when($"group".isNotNull, first($"msg").over(Window.partitionBy($"product", $"state", $"group").orderBy($"testDateTime"))))
  .drop($"group")
  .where($"testResult" === "fail")
  .orderBy($"product", $"state", $"testDateTime")
  .show()
+------+-------+-----+------------+----------+---------+---------+
|sernum|product|state|testDateTime|testResult|      msg|  passMsg|
+------+-------+-----+------------+----------+---------+---------+
|     7|    PA1|  1.0|        1.17|      fail|testlog17|testlog16|
|     5|    PA1|  1.0|        1.15|      fail|testlog15|testlog13|
|     4|    PA1|  2.0|        1.14|      fail|testlog14|testlog12|
|     1|    PA1|  1.0|        1.11|      fail|testlog11|     null|
+------+-------+-----+------------+----------+---------+---------+
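To see what the grouping computes, here is a plain-Scala equivalent on the same sample rows. It is not the window-function technique itself: instead of numbering groups, it carries the most recent "pass" message along a time-ordered scan, which yields the same passMsg for every failed row. The `Test` case class and the name `failsWithLastPass` are made up for this sketch.

```scala
case class Test(sernum: Int, product: String, state: Double,
                testDateTime: Double, testResult: String, msg: String)

val tests = Seq(
  Test(8, "PA1", 1.0, 1.18, "pass", "testlog18"),
  Test(7, "PA1", 1.0, 1.17, "fail", "testlog17"),
  Test(6, "PA1", 1.0, 1.16, "pass", "testlog16"),
  Test(5, "PA1", 1.0, 1.15, "fail", "testlog15"),
  Test(4, "PA1", 2.0, 1.14, "fail", "testlog14"),
  Test(3, "PA1", 1.0, 1.13, "pass", "testlog13"),
  Test(2, "PA1", 2.0, 1.12, "pass", "testlog12"),
  Test(1, "PA1", 1.0, 1.11, "fail", "testlog11")
)

// For each (product, state), walk the tests in time order while carrying the
// most recent "pass" message, then keep only the failed rows.
val failsWithLastPass: Seq[(Test, Option[String])] =
  tests.groupBy(t => (t.product, t.state)).values.toSeq.flatMap { group =>
    val sorted = group.sortWith(_.testDateTime < _.testDateTime)
    // lastPassBefore(i) is the last pass message strictly before sorted(i)
    val lastPassBefore = sorted.scanLeft(Option.empty[String]) { (last, t) =>
      if (t.testResult == "pass") Some(t.msg) else last
    }
    sorted.zip(lastPassBefore).filter(_._1.testResult == "fail")
  }.sortBy(_._1.sernum)

failsWithLastPass.foreach { case (t, passMsg) =>
  println(s"${t.sernum} ${t.msg} ${passMsg.getOrElse("null")}")
}
```

A failed row that precedes every pass in its partition gets no message, which corresponds to the null in the passMsg column above.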
This is an alternate approach: join the failed logs with the passed ones from earlier times, and take the latest "pass" message log.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy($"msg").orderBy($"p_testDateTime".desc)
val fDf = df.filter($"testResult" === "fail")
var pDf = df.filter($"testResult" === "pass")
pDf.columns.foreach(x => pDf = pDf.withColumnRenamed(x, "p_" + x))
val jDf = fDf.join(
    pDf,
    pDf("p_product") === fDf("product") &&
      pDf("p_state") === fDf("state") &&
      fDf("testDateTime") > pDf("p_testDateTime"),
    "left").
  select(fDf("*"),
    pDf("p_testResult"),
    pDf("p_testDateTime"),
    pDf("p_msg")
  )
jDf.withColumn(
    "rnk",
    row_number().
      over(window)
  ).
  filter($"rnk" === 1).
  drop("rnk", "p_testResult", "p_testDateTime").
  show()
+---------+-------+------+-----+------------+----------+---------+
|      msg|product|sernum|state|testDateTime|testResult|    p_msg|
+---------+-------+------+-----+------------+----------+---------+
|testlog14|    PA1|     4|    2|        1.14|      fail|testlog12|
|testlog11|    PA1|     1|    1|        1.11|      fail|     null|
|testlog15|    PA1|     5|    1|        1.15|      fail|testlog13|
|testlog17|    PA1|     7|    1|        1.17|      fail|testlog16|
+---------+-------+------+-----+------------+----------+---------+
How do I pass parameters to selectExpr? SparkSQL-Scala
When you have a DataFrame, you can add columns and fill their rows with the selectExpr method. Something like this:

scala> table.show
+------+--------+---------+--------+--------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|
+------+--------+---------+--------+--------+
|  OlcM|       h|999999999|       J|       0|
|  zOcQ|       r|777777777|       J|       1|
|  kyGp|       t|333333333|       J|       2|
|  BEuX|       A|999999999|       F|       3|

scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo")
tabla: org.apache.spark.sql.DataFrame = [idempr: string, tipperrd: string, codperrd: decimal(9,0), tipperrt: string, codperrt: decimal(9,0), Saludo: string]

scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
|  OlcM|       h|999999999|       J|       0|  hola|
|  zOcQ|       r|777777777|       J|       1|  hola|
|  kyGp|       t|333333333|       J|       2|  hola|
|  BEuX|       A|999999999|       F|       3|  hola|

My point is: I define a string and call a method that uses this String parameter to fill a column in the DataFrame. But I cannot get the select expression to pick up the value of the string (I tried $, +, etc.). I want to achieve something like this:

scala> var english = "hello"
scala> def generar_informe(df: DataFrame, tabla: String) {
         var selectExpr_df = df.selectExpr(
           "TIPPERSCON_BAS as TIP.PERSONA CONTACTABILIDAD",
           "CODPERSCON_BAS as COD.PERSONA CONTACTABILIDAD",
           "'tabla' as PUNTO DEL FLUJO"
         )
       }
scala> generar_informe(df,english)
.....
scala> table2.show
+------+--------+---------+--------+--------+------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|Saludo|
+------+--------+---------+--------+--------+------+
|  OlcM|       h|999999999|       J|       0| hello|
|  zOcQ|       r|777777777|       J|       1| hello|
|  kyGp|       t|333333333|       J|       2| hello|
|  BEuX|       A|999999999|       F|       3| hello|

I tried:

scala> var result = tabl.selectExpr("A", "B", "$tabla as C")
scala> var abc = tabl.selectExpr("A", "B", ${tabla} as C)
<console>:31: error: not found: value $
       var abc = tabl.selectExpr("A", "B", ${tabla} as C)
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")
scala> sqlContext.sql("set tabla='hello'")
scala> var abc = tabl.selectExpr("A", "B", "${tabla} as C")

SAME ERROR:
java.lang.RuntimeException: [1.1] failure: identifier expected
${tabla} as C
^
at scala.sys.package$.error(package.scala:27)

Thanks in advance!
Can you try this? The s prefix makes the string interpolated, so ${english} is replaced by the value of the variable, and the single quotes around it make Spark SQL read that value as a string literal rather than a column name.

import org.apache.spark.sql.DataFrame

def generar_informe(df: DataFrame, english: String) = {
  df.selectExpr("transactionId", "customerId", "itemId", "amountPaid", s"""'${english}' as saludo """)
}

val english = "hello"
generar_informe(data, english).show()

This is the output I got.

17/11/02 23:56:44 INFO CodeGenerator: Code generated in 13.857987 ms
+-------------+----------+------+----------+------+
|transactionId|customerId|itemId|amountPaid|saludo|
+-------------+----------+------+----------+------+
|          111|         1|     1|     100.0| hello|
|          112|         2|     2|     505.0| hello|
|          113|         3|     3|     510.0| hello|
|          114|         4|     4|     600.0| hello|
|          115|         1|     2|     500.0| hello|
|          116|         1|     2|     500.0| hello|
|          117|         1|     2|     500.0| hello|
|          118|         1|     2|     500.0| hello|
|          119|         2|     3|     500.0| hello|
|          120|         1|     2|     500.0| hello|
|          121|         1|     4|     500.0| hello|
|          122|         1|     2|     500.0| hello|
|          123|         1|     4|     500.0| hello|
|          124|         1|     2|     500.0| hello|
+-------------+----------+------+----------+------+
17/11/02 23:56:44 INFO SparkContext: Invoking stop() from shutdown hook
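To see why the attempts in the question failed while the interpolated version works, it may help to look at the strings that actually reach selectExpr. The sketch below is plain Scala with no Spark dependency; literalAs is a hypothetical helper introduced for illustration, and it assumes the value contains no single quote or backslash.

```scala
object SelectExprStringSketch {
  // Hypothetical helper: builds an expression such as 'hello' as saludo.
  // Assumes value contains no single quote or backslash (no escaping is done).
  def literalAs(value: String, alias: String): String = s"'$value' as $alias"

  def main(args: Array[String]): Unit = {
    val tabla = "hello"
    // No s prefix: the text ${tabla} is passed through unchanged,
    // which is what the SQL parser rejected in the question.
    println("${tabla} as C")
    // s prefix but no quotes: yields hello as C, which Spark would
    // read as a column named hello, not as a string value.
    println(s"${tabla} as C")
    // s prefix plus single quotes: a SQL string literal with an alias.
    println(literalAs(tabla, "C"))
  }
}
```

If the value can contain quotes, building the column with lit(value).as(alias) and select avoids SQL string assembly altogether.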
Find the date of each week from the week number in a Spark DataFrame
I want to add a column with the date corresponding to each week in a DataFrame (the Friday of each week). My DataFrame looks like this:

+----+------+---------+
|Week|  City|sum(Sale)|
+----+------+---------+
|  29|City 2|       72|
|  28|City 3|       48|
|  28|City 2|       19|
|  27|City 2|       16|
|  28|City 1|       84|
|  28|City 4|       72|
|  29|City 4|       39|
|  27|City 3|       42|
|  26|City 3|       68|
|  27|City 1|       89|
|  27|City 4|      104|
|  26|City 2|       19|
|  29|City 3|       27|
+----+------+---------+

I need to convert it to the DataFrame below:

+----+------+---------+---------------------------+
|Week|  City|sum(Sale)|particular day (mm/dd/yyyy)|
+----+------+---------+---------------------------+
|  29|City 2|       72|         Friday(07/21/2017)|
|  28|City 3|       48|         Friday(07/14/2017)|
|  28|City 2|       19|         Friday(07/14/2017)|
|  27|City 2|       16|         Friday(07/07/2017)|
|  28|City 1|       84|         Friday(07/14/2017)|
|  28|City 4|       72|         Friday(07/14/2017)|
|  29|City 4|       39|         Friday(07/21/2017)|
|  27|City 3|       42|         Friday(07/07/2017)|
|  26|City 3|       68|         Friday(06/30/2017)|
|  27|City 1|       89|         Friday(07/07/2017)|
|  27|City 4|      104|         Friday(07/07/2017)|
|  26|City 2|       19|         Friday(06/30/2017)|
|  29|City 3|       27|         Friday(07/21/2017)|
+----+------+---------+---------------------------+

Please help me.
You can write a simple UDF that derives the date by adding the week number to a base date. Here is a simple example:

import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  (29, "City 2", 72), (28, "City 3", 48), (28, "City 2", 19), (27, "City 2", 16),
  (28, "City 1", 84), (28, "City 4", 72), (29, "City 4", 39), (27, "City 3", 42),
  (26, "City 3", 68), (27, "City 1", 89), (27, "City 4", 104), (26, "City 2", 19),
  (29, "City 3", 27)
)).toDF("week", "city", "sale")

val getDateFromWeek = udf((week: Int) => {
  // base date: Friday 30 Dec 2016, one week before the Friday of week 1 of 2017
  // (the year 2017 is hard-coded here)
  val week1 = LocalDate.of(2016, 12, 30)
  val day = "Friday"
  // add the number of weeks from the week column
  val result = week1.plusWeeks(week).format(DateTimeFormatter.ofPattern("MM/dd/yyyy"))
  // return the result as Friday (date)
  s"${day} (${result})"
})

// use the udf to create a new column named day
data.withColumn("day", getDateFromWeek($"week")).show

Can anyone convert this to PySpark?
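The date arithmetic inside the UDF can be checked on its own, without Spark. The sketch below repeats the same plusWeeks computation as a plain function; WeekToFridaySketch and fridayOfWeek are hypothetical names used only here, and the anchor date still assumes that all weeks belong to 2017.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object WeekToFridaySketch {
  private val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")
  // Friday 30 Dec 2016: adding n weeks lands on the Friday of week n of 2017.
  private val anchor = LocalDate.of(2016, 12, 30)

  // Returns the label in the same "Friday (MM/dd/yyyy)" shape as the UDF.
  def fridayOfWeek(week: Int): String =
    s"Friday (${anchor.plusWeeks(week.toLong).format(fmt)})"
}
```

For the weeks in the question, fridayOfWeek(26) gives Friday (06/30/2017) and fridayOfWeek(29) gives Friday (07/21/2017), in line with the requested output apart from the space before the parenthesis.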