Add total per group as a new row in dataframe in Pyspark - apache-spark-sql

Referring to my previous question here: I am trying to compute and add a total row (total of usage) for each brand, parent and week_num.
Here is a dummy sample:
df0 = spark.createDataFrame(
    [
        (2, "A", "A2", "A2web", 2500),
        (2, "A", "A2", "A2TV", 3500),
        (4, "A", "A1", "A2app", 5500),
        (4, "A", "AD", "ADapp", 2000),
        (4, "B", "B25", "B25app", 7600),
        (4, "B", "B26", "B26app", 5600),
        (5, "C", "c25", "c25app", 2658),
        (5, "C", "c27", "c27app", 1100),
        (5, "C", "c28", "c26app", 1200),
    ],
    ["week_num", "parent", "brand", "channel", "usage"],
)
This snippet adds a total row per channel:
from pyspark.sql import functions as f

# Group by and sum to get the totals
totals = (
    df0.groupBy(["week_num", "parent", "brand"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("channel", f.lit("Total"))
)
# create a temp variable to sort
totals = totals.withColumn("sort_id", f.lit(2))
df0 = df0.withColumn("sort_id", f.lit(1))
# Union dataframes, drop the temp variable and show
df1 = (
    df0.unionByName(totals)
    .sort(["week_num", "parent", "brand", "sort_id"])
    .drop("sort_id")
)
df1.show()
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2|     A|   A2|  A2web| 2500|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  Total| 6000|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   AD|  ADapp| 2000|
|       4|     A|   AD|  Total| 2000|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B26| B26app| 5600|
|       4|     B|  B26|  Total| 5600|
|       5|     C|  c25| c25app| 2658|
|       5|     C|  c25|  Total| 2658|
|       5|     C|  c27| c27app| 1100|
|       5|     C|  c27|  Total| 1100|
|       5|     C|  c28| c26app| 1200|
|       5|     C|  c28|  Total| 1200|
+--------+------+-----+-------+-----+
That is OK for the channel column. In order to get something like the output below, I simply repeat the first process (groupBy + sum) and then union the result back:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2|     A|   A2|  A2web| 2500|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  Total| 6000|
|       2|     A|Total|       | 6000|
|       2| Total|     |       | 6000|
Here it is, in two steps:
# add brand total row
df2 = (
    df0.groupBy(["week_num", "parent"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("brand", f.lit("Total"))
    .withColumn("channel", f.lit(""))
)
df2 = df1.unionByName(df2).sort(["week_num", "parent", "brand", "channel"])
# add week_num total row
df3 = (
    df0.groupBy(["week_num"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("parent", f.lit("Total"))
    .withColumn("brand", f.lit(""))
    .withColumn("channel", f.lit(""))
)
df3 = df2.unionByName(df3).sort(["week_num", "parent", "brand", "channel"])
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  A2web| 2500|
|       2|     A|   A2|  Total| 6000|
|       2|     A|Total|       | 6000|
|       2| Total|     |       | 6000|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   AD|  ADapp| 2000|
|       4|     A|   AD|  Total| 2000|
|       4|     A|Total|       | 7500|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B26| B26app| 5600|
|       4|     B|  B26|  Total| 5600|
|       4|     B|Total|       |13200|
|       4| Total|     |       |20700|
|       5|     C|Total|       | 4958|
|       5|     C|  c25|  Total| 2658|
|       5|     C|  c25| c25app| 2658|
|       5|     C|  c27|  Total| 1100|
+--------+------+-----+-------+-----+
First question: is there an alternative or more efficient approach that avoids this repetition?
And second: what if I want to always show the total at the top of each group, regardless of the alphabetical order of parent/brand/channel? How can I sort it like this? (This is dummy data, but I hope it is clear enough.)
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2| Total|     |       | 6000|
|       2|     A|Total|       | 6000|
|       2|     A|   A2|  Total| 6000|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  A2web| 2500|
|       4| Total|     |       |20700|
|       4|     A|Total|       | 7500|
|       4|     B|Total|       |13200|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   AD|  Total| 2000|
|       4|     A|   AD|  ADapp| 2000|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B26|  Total| 5600|
|       4|     B|  B26| B26app| 5600|

I think you just need the rollup method.
from pyspark.sql import functions as F

agg_cols = ["week_num", "parent", "brand", "channel"]
agg_df = (
    df0.rollup(agg_cols)
    .agg(F.sum("usage").alias("usage"), F.grouping_id().alias("lvl"))
    .orderBy(agg_cols)
)
agg_df.show()
+--------+------+-----+-------+-----+---+
|week_num|parent|brand|channel|usage|lvl|
+--------+------+-----+-------+-----+---+
|    null|  null| null|   null|31658| 15|
|       2|  null| null|   null| 6000|  7|
|       2|     A| null|   null| 6000|  3|
|       2|     A|   A2|   null| 6000|  1|
|       2|     A|   A2|   A2TV| 3500|  0|
|       2|     A|   A2|  A2web| 2500|  0|
|       4|  null| null|   null|20700|  7|
|       4|     A| null|   null| 7500|  3|
|       4|     A|   A1|   null| 5500|  1|
|       4|     A|   A1|  A2app| 5500|  0|
|       4|     A|   AD|   null| 2000|  1|
|       4|     A|   AD|  ADapp| 2000|  0|
|       4|     B| null|   null|13200|  3|
|       4|     B|  B25|   null| 7600|  1|
|       4|     B|  B25| B25app| 7600|  0|
|       4|     B|  B26|   null| 5600|  1|
|       4|     B|  B26| B26app| 5600|  0|
|       5|  null| null|   null| 4958|  7|
|       5|     C| null|   null| 4958|  3|
|       5|     C|  c25|   null| 2658|  1|
+--------+------+-----+-------+-----+---+
only showing top 20 rows
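For readers unfamiliar with grouping_id(): it encodes, as a bit mask over the rollup columns, which columns were aggregated away, which is what the lvl values above mean. A quick sketch of the mapping for this particular rollup (the last line is just an optional sanity check):
# grouping_id() bit mask for rollup(["week_num", "parent", "brand", "channel"])
# (rightmost bit = last rollup column):
#   0  (0b0000) -> detail row, nothing rolled up
#   1  (0b0001) -> channel rolled up               -> channel total
#   3  (0b0011) -> brand + channel rolled up       -> brand total
#   7  (0b0111) -> parent + brand + channel        -> week_num total
#   15 (0b1111) -> everything rolled up            -> grand total
agg_df.select("lvl").distinct().orderBy("lvl").show()  # expect 0, 1, 3, 7, 15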
The rest is purely cosmetic. It is probably not a good idea to do that with Spark; it is better done in the presentation/reporting tool you will use afterwards.
from pyspark.sql.window import Window
agg_df = agg_df.withColumn("lvl", F.dense_rank().over(Window.orderBy("lvl")))
TOTAL = "Total"
agg_df = (
    agg_df.withColumn(
        "parent", F.when(F.col("lvl") == 4, TOTAL).otherwise(F.col("parent"))
    )
    .withColumn(
        "brand",
        F.when(F.col("lvl") == 3, TOTAL).otherwise(
            F.coalesce(F.col("brand"), F.lit(""))
        ),
    )
    .withColumn(
        "channel",
        F.when(F.col("lvl") == 2, TOTAL).otherwise(
            F.coalesce(F.col("channel"), F.lit(""))
        ),
    )
)
agg_df.where(F.col("lvl") != 5).orderBy(
    "week_num", F.col("lvl").desc(), "parent", "brand", "channel"
).drop("lvl").show(500)
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
|       2| Total|     |       | 6000|
|       2|     A|Total|       | 6000|
|       2|     A|   A2|  Total| 6000|
|       2|     A|   A2|   A2TV| 3500|
|       2|     A|   A2|  A2web| 2500|
|       4| Total|     |       |20700|
|       4|     A|Total|       | 7500|
|       4|     B|Total|       |13200|
|       4|     A|   A1|  Total| 5500|
|       4|     A|   AD|  Total| 2000|
|       4|     B|  B25|  Total| 7600|
|       4|     B|  B26|  Total| 5600|
|       4|     A|   A1|  A2app| 5500|
|       4|     A|   AD|  ADapp| 2000|
|       4|     B|  B25| B25app| 7600|
|       4|     B|  B26| B26app| 5600|
|       5| Total|     |       | 4958|
|       5|     C|Total|       | 4958|
|       5|     C|  c25|  Total| 2658|
|       5|     C|  c27|  Total| 1100|
|       5|     C|  c28|  Total| 1200|
|       5|     C|  c25| c25app| 2658|
|       5|     C|  c27| c27app| 1100|
|       5|     C|  c28| c26app| 1200|
+--------+------+-----+-------+-----+
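If the goal from the second question is instead to have each total row sit directly above its own sub-group (the week total, then each parent's total just before that parent's brands, and so on), one possible ordering is sketched below; the extra boolean sort keys are my own assumption, not part of the original answer:
(
    agg_df.where(F.col("lvl") != 5)
    .orderBy(
        "week_num",
        (F.col("lvl") == 4).desc(),   # week total first
        "parent",
        (F.col("lvl") == 3).desc(),   # then each parent's own total
        "brand",
        (F.col("lvl") == 2).desc(),   # then each brand's own total
        "channel",
    )
    .drop("lvl")
    .show(500)
)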

Related

assign max+1 to Sequence field if the Register Number set reappears

I have a dataframe like the one below:
+-------+--------------+----+-------------+
|recType|registerNumber|mnId| sequence|
+-------+--------------+----+-------------+
| 01| 13578999| 0| 1|
| 11| 13578999| 1| 1|
| 13| 13578999| 2| 1|
| 14| 13578999| 3| 1|
| 14| 13578999| 4| 1|
| 01| 11121000| 5| 2|
| 11| 11121000| 6| 2|
| 13| 11121000| 7| 2|
| 14| 11121000| 8| 2|
| 01| OC387520| 9| 3|
| 11| OC387520| 10| 3|
| 13| OC387520| 11| 3|
| 01| 11121000| 12| 2|
| 11| 11121000| 13| 2|
| 13| 11121000| 14| 2|
| 14| 11121000| 15| 2|
| 01| OC321000| 16| 4|
| 11| OC321000| 17| 4|
| 13| OC321000| 18| 4|
| 01| OC522000| 19| 5|
| 11| OC522000| 20| 5|
| 13| OC522000| 21| 5|
+-------+--------------+----+-------------+
Each record set starts with recType equal to 01 and ends either with recType equal to 13 or 14.
In some cases there are duplicate registerNumber values, which assign a duplicate sequence value to the record set.
In the given dataframe, the registerNumber value 11121000 is duplicated.
I want to assign a new sequence value to the duplicate occurrence of registerNumber 11121000, so the output dataframe should look as below:
+-------+--------------+----+-------------+
|recType|registerNumber|mnId| sequence|
+-------+--------------+----+-------------+
| 01| 13578999| 0| 1|
| 11| 13578999| 1| 1|
| 13| 13578999| 2| 1|
| 14| 13578999| 3| 1|
| 14| 13578999| 4| 1|
| 01| 11121000| 5| 2|
| 11| 11121000| 6| 2|
| 13| 11121000| 7| 2|
| 14| 11121000| 8| 2|
| 01| OC387520| 9| 3|
| 11| OC387520| 10| 3|
| 13| OC387520| 11| 3|
| 01| 11121000| 12| 6|
| 11| 11121000| 13| 6|
| 13| 11121000| 14| 6|
| 14| 11121000| 15| 6|
| 01| OC321000| 16| 4|
| 11| OC321000| 17| 4|
| 13| OC321000| 18| 4|
| 01| OC522000| 19| 5|
| 11| OC522000| 20| 5|
| 13| OC522000| 21| 5|
+-------+--------------+----+-------------+
Please guide me on how to approach this problem.
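No answer is included in this excerpt, but one possible approach, sketched under the assumptions that mnId preserves the original record order and that every record set starts with recType 01, is:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number each physical record set by counting the "01" start markers in mnId order.
w_order = Window.orderBy("mnId")
with_sets = df.withColumn(
    "set_id", F.sum((F.col("recType") == "01").cast("int")).over(w_order)
)

# Within one registerNumber, the first set keeps its sequence; later sets are duplicates.
w_reg = Window.partitionBy("registerNumber").orderBy("set_id")
with_sets = with_sets.withColumn("dup_nr", F.dense_rank().over(w_reg))

# Re-number the duplicate sets after the current maximum sequence value.
max_seq = df.agg(F.max("sequence")).first()[0]
dup_seq = (
    with_sets.where("dup_nr > 1")
    .select("set_id")
    .distinct()
    .withColumn("new_sequence", max_seq + F.row_number().over(Window.orderBy("set_id")))
)

result = (
    with_sets.join(dup_seq, "set_id", "left")
    .withColumn("sequence", F.coalesce("new_sequence", "sequence"))
    .select("recType", "registerNumber", "mnId", "sequence")
    .orderBy("mnId")
)
result.show()
The unpartitioned windows pull everything into a single partition, so treat this as a sketch of the logic rather than something to run as-is on a large dataset.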

Rowwise sum per group and add total as a new row in dataframe in Pyspark

I have a dataframe like this sample
df = spark.createDataFrame(
    [
        (2, "A", "A2", 2500),
        (2, "A", "A11", 3500),
        (2, "A", "A12", 5500),
        (4, "B", "B25", 7600),
        (4, "B", "B26", 5600),
        (5, "C", "c25", 2658),
        (5, "C", "c27", 1100),
        (5, "C", "c28", 1200),
    ],
    ["parent", "group", "brand", "usage"],
)
output :
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
| 2| A| A2| 2500|
| 2| A| A11| 3500|
| 4| B| B25| 7600|
| 4| B| B26| 5600|
| 5| C| c25| 2658|
| 5| C| c27| 1100|
| 5| C| c28| 1200|
+------+-----+-----+-----+
What I would like to do is compute, for each group, the total usage and add it as a new row with the value Total for brand. How can I do this in PySpark?
Expected result:
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
| 2| A| A2| 2500|
| 2| A| A11| 3500|
| 2| A|Total| 6000|
| 4| B| B25| 7600|
| 4| B| B26| 5600|
| 4| B|Total|18700|
| 5| C| c25| 2658|
| 5| C| c27| 1100|
| 5| C| c28| 1200|
| 5| C|Total| 4958|
+------+-----+-----+-----+
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (2, "A", "A2", 2500),
        (2, "A", "A11", 3500),
        (2, "A", "A12", 5500),
        (4, "B", "B25", 7600),
        (4, "B", "B26", 5600),
        (5, "C", "c25", 2658),
        (5, "C", "c27", 1100),
        (5, "C", "c28", 1200),
    ],
    ["parent", "group", "brand", "usage"],
)
df.show()
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
| 2| A| A2| 2500|
| 2| A| A11| 3500|
| 2| A| A12| 5500|
| 4| B| B25| 7600|
| 4| B| B26| 5600|
| 5| C| c25| 2658|
| 5| C| c27| 1100|
| 5| C| c28| 1200|
+------+-----+-----+-----+
# Group by and sum to get the totals
totals = (
    df.groupBy(["group", "parent"])
    .agg(F.sum("usage").alias("usage"))
    .withColumn("brand", F.lit("Total"))
)
# create a temp variable to sort
totals = totals.withColumn("sort_id", F.lit(2))
df = df.withColumn("sort_id", F.lit(1))
# Union dataframes, drop the temp variable and show
df.unionByName(totals).sort(["group", "sort_id"]).drop("sort_id").show()
+------+-----+-----+-----+
|parent|group|brand|usage|
+------+-----+-----+-----+
| 2| A| A12| 5500|
| 2| A| A11| 3500|
| 2| A| A2| 2500|
| 2| A|Total|11500|
| 4| B| B25| 7600|
| 4| B| B26| 5600|
| 4| B|Total|13200|
| 5| C| c25| 2658|
| 5| C| c28| 1200|
| 5| C| c27| 1100|
| 5| C|Total| 4958|
+------+-----+-----+-----+
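As in the main answer above, the two-step groupBy + union can also be collapsed into a single rollup; a hedged sketch (note that rollup reorders the columns relative to the original df):
import pyspark.sql.functions as F

# gid comes from grouping_id(): 0 = detail row, 1 = brand rolled up (per-group total).
(
    df.rollup("group", "parent", "brand")
    .agg(F.sum("usage").alias("usage"), F.grouping_id().alias("gid"))
    .where(F.col("gid").isin(0, 1))
    .withColumn("brand", F.when(F.col("gid") == 1, "Total").otherwise(F.col("brand")))
    .orderBy("group", "gid", "brand")
    .drop("gid")
    .show()
)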

Pyspark dataframes group by

I have a dataframe like the one below:
+---+---+---+
|123|124|125|
+---+---+---+
|  1|  2|  3|
|  9|  9|  4|
|  4| 12|  1|
|  2|  4|  8|
|  7|  6|  3|
| 19| 11|  2|
| 21| 10| 10|
+---+---+---+
I need the data to be in the form
1: [123, 125]
2: [123, 124, 125]
3: [125]
The order does not need to be sorted. I am new to dataframes in PySpark; any help would be appreciated.
There are no melt or pivot APIs in PySpark that will accomplish this directly. Instead, flatMap over the RDD into a new dataframe and aggregate:
df.show()
+---+---+---+
|123|124|125|
+---+---+---+
| 1| 2| 3|
| 9| 9| 4|
| 4| 12| 1|
| 2| 4| 8|
| 7| 6| 3|
| 19| 11| 2|
| 21| 10| 10|
+---+---+---+
For each column of each row in the RDD, output a row with two columns: the value of the column and the column name:
cols = df.columns
(df.rdd
    .flatMap(lambda row: [(row[c], c) for c in cols])
    .toDF(["value", "column_name"])
    .show())
+-----+-----------+
|value|column_name|
+-----+-----------+
| 1| 123|
| 2| 124|
| 3| 125|
| 9| 123|
| 9| 124|
| 4| 125|
| 4| 123|
| 12| 124|
| 1| 125|
| 2| 123|
| 4| 124|
| 8| 125|
| 7| 123|
| 6| 124|
| 3| 125|
| 19| 123|
| 11| 124|
| 2| 125|
| 21| 123|
| 10| 124|
+-----+-----------+
Then, group by the value and aggregate the column names into a list:
from pyspark.sql import functions as f
(df.rdd
    .flatMap(lambda row: [(row[c], c) for c in cols])
    .toDF(["value", "column_name"])
    .groupby("value").agg(f.collect_list("column_name"))
    .show())
+-----+-------------------------+
|value|collect_list(column_name)|
+-----+-------------------------+
| 19| [123]|
| 7| [123]|
| 6| [124]|
| 9| [123, 124]|
| 1| [123, 125]|
| 10| [124, 125]|
| 3| [125, 125]|
| 12| [124]|
| 8| [125]|
| 11| [124]|
| 2| [124, 123, 125]|
| 4| [125, 123, 124]|
| 21| [123]|
+-----+-------------------------+
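For completeness, the same reshaping can stay entirely in the DataFrame API instead of going through the RDD; a sketch (it assumes, as here, that all columns share a common type):
from pyspark.sql import functions as f

# Build an array of (value, column_name) structs per row, explode it, then aggregate.
pairs = f.explode(
    f.array(*[
        f.struct(f.col(c).alias("value"), f.lit(c).alias("column_name"))
        for c in df.columns
    ])
)
(
    df.select(pairs.alias("p"))
    .select("p.value", "p.column_name")
    .groupby("value")
    .agg(f.collect_list("column_name").alias("columns"))
    .show(truncate=False)
)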

Getting Memory Error in PySpark during Filter & GroupBy computation

This is the error:
Job aborted due to stage failure: Task 12 in stage 37.0 failed 4 times, most recent failure: Lost task 12.3 in stage 37.0 (TID 325, 10.139.64.5, executor 20): ExecutorLostFailure (executor 20 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
So is there an alternative, more efficient way to apply those functions without causing an out-of-memory error? I have billions of rows to compute.
Input Dataframe on which filtering is to be done:
+------+-------+-------+------+-------+-------+
|Pos_id|level_p|skill_p|Emp_id|level_e|skill_e|
+------+-------+-------+------+-------+-------+
| 1| 2| a| 100| 2| a|
| 1| 2| a| 100| 3| f|
| 1| 2| a| 100| 2| d|
| 1| 2| a| 101| 4| a|
| 1| 2| a| 101| 5| b|
| 1| 2| a| 101| 1| e|
| 1| 2| a| 102| 5| b|
| 1| 2| a| 102| 3| d|
| 1| 2| a| 102| 2| c|
| 2| 2| d| 100| 2| a|
| 2| 2| d| 100| 3| f|
| 2| 2| d| 100| 2| d|
| 2| 2| d| 101| 4| a|
| 2| 2| d| 101| 5| b|
| 2| 2| d| 101| 1| e|
| 2| 2| d| 102| 5| b|
| 2| 2| d| 102| 3| d|
| 2| 2| d| 102| 2| c|
| 2| 4| b| 100| 2| a|
| 2| 4| b| 100| 3| f|
+------+-------+-------+------+-------+-------+
Filtering Code:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql import functions as sf
function = udf(lambda item, items: 1 if item in items else 0, IntegerType())
df_result = new_df.withColumn('result', function(sf.col('skill_p'), sf.col('skill_e')))
df_filter = df_result.filter(sf.col("result") == 1)
df_filter.show()
res = df_filter.groupBy("Pos_id", "Emp_id").agg(
    sf.collect_set("skill_p").alias("SkillsMatch"),
    sf.sum("result").alias("SkillsMatchedCount"),
)
res.show()
This needs to be done on billions of rows.
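No answer is included in this excerpt; one common first step (a sketch, and not a guaranteed fix for the executor loss itself) is to replace the Python UDF with built-in column functions so the comparison stays in the JVM and Spark can optimize it:
from pyspark.sql import functions as sf

# Assumes skill_p and skill_e are plain strings, as shown above.
df_filter = new_df.filter(sf.col("skill_e").contains(sf.col("skill_p")))

res = df_filter.groupBy("Pos_id", "Emp_id").agg(
    sf.collect_set("skill_p").alias("SkillsMatch"),
    sf.count("*").alias("SkillsMatchedCount"),
)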

Pyspark: Add a new column containing the value in one column whose counterpart in another column meets a specified condition

I want to add a new column that contains the value from one column whose counterpart in another column meets a specified condition.
For instance, the original DF is as follows:
+-----+-----+-----+
|col1 |col2 |col3 |
+-----+-----+-----+
| A| 17| 1|
| A| 16| 2|
| A| 18| 2|
| A| 30| 3|
| B| 35| 1|
| B| 34| 2|
| B| 36| 2|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
+-----+-----+-----+
For each col1 group, I need to repeat the col2 value whose col3 counterpart equals 1; and if a group has more than one row with col3 = 1, repeat the minimum such col2 value.
The desired DF is as follows:
+----+----+----+----------+
|col1|col2|col3|new_column|
+----+----+----+----------+
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
+----+----+----+----------+
df3 = df.filter(df.col3 == 1)
df3.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| B| 35| 1|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
| A| 17| 1|
+----+----+----+
df3.createOrReplaceTempView("mytable")
To obtain the minimum value of col2, I followed the accepted answer in this link: How to find exact median for grouped data in Spark
df6=spark.sql("select col1, min(col2) as minimum from mytable group by col1 order by col1")
df6.show()
+----+-------+
|col1|minimum|
+----+-------+
| A| 17|
| B| 35|
| C| 20|
+----+-------+
df_a = df.join(df6, ['col1'], 'leftouter')
df_a.show()
+----+----+----+-------+
|col1|col2|col3|minimum|
+----+----+----+-------+
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
+----+----+----+-------+
Is there a better way than this solution?
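One simpler alternative, sketched here rather than taken from an accepted answer: a conditional aggregate over a window avoids both the temp view and the join (min ignores the nulls produced by when):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("col1")
df_a = df.withColumn(
    "new_column",
    F.min(F.when(F.col("col3") == 1, F.col("col2"))).over(w),
)
df_a.show()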