Add total per group as a new row in a DataFrame in PySpark
Referring to my previous question here, I am trying to compute and add a total row (total usage) for each brand, parent and week_num.
Here is a dummy sample:
df0 = spark.createDataFrame(
    [
        (2, "A", "A2", "A2web", 2500),
        (2, "A", "A2", "A2TV", 3500),
        (4, "A", "A1", "A2app", 5500),
        (4, "A", "AD", "ADapp", 2000),
        (4, "B", "B25", "B25app", 7600),
        (4, "B", "B26", "B26app", 5600),
        (5, "C", "c25", "c25app", 2658),
        (5, "C", "c27", "c27app", 1100),
        (5, "C", "c28", "c26app", 1200),
    ],
    ["week_num", "parent", "brand", "channel", "usage"],
)
This snippet adds a Total row per channel:
import pyspark.sql.functions as f

# Group by and sum to get the totals
totals = (
    df0.groupBy(["week_num", "parent", "brand"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("channel", f.lit("Total"))
)
# Create a temp variable so the Total row sorts after the detail rows
totals = totals.withColumn("sort_id", f.lit(2))
df0 = df0.withColumn("sort_id", f.lit(1))
# Union the dataframes, sort, drop the temp variable and show
df1 = (
    df0.unionByName(totals)
    .sort(["week_num", "parent", "brand", "sort_id"])
    .drop("sort_id")
)
df1.show()
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 5| C| c25| c25app| 2658|
| 5| C| c25| Total| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c27| Total| 1100|
| 5| C| c28| c26app| 1200|
| 5| C| c28| Total| 1200|
+--------+------+-----+-------+-----+
That is fine for the channel column. To get something like the output below, I simply repeat the first process (groupBy + sum) and union the result back:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
Here it is, in two steps:
# Add brand-level total rows
df2 = (
    df0.groupBy(["week_num", "parent"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("brand", f.lit("Total"))
    .withColumn("channel", f.lit(""))
)
df2 = df1.unionByName(df2).sort(["week_num", "parent", "brand", "channel"])

# Add week_num-level total rows
df3 = (
    df0.groupBy(["week_num"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("parent", f.lit("Total"))
    .withColumn("brand", f.lit(""))
    .withColumn("channel", f.lit(""))
)
df3 = df2.unionByName(df3).sort(["week_num", "parent", "brand", "channel"])
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| A|Total| | 7500|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 4| B|Total| |13200|
| 4| Total| | |20700|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c25| c25app| 2658|
| 5| C| c27| Total| 1100|
+--------+------+-----+-------+-----+
First question: is there an alternative or more efficient approach that avoids the repetition?
Second: what if I want the Total row to always appear at the top of each group, regardless of the alphabetical order of the parent/brand/channel names? How can I sort it like this? (This is dummy data, but I hope it is clear enough.)
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| A1| A2app| 5500|
| 4| A| AD| Total| 2000|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B25| B25app| 7600|
| 4| B| B26| Total| 5600|
| 4| B| B26| B26app| 5600|
I think you just need the rollup method.
from pyspark.sql import Window
from pyspark.sql import functions as F

agg_cols = ["week_num", "parent", "brand", "channel"]
agg_df = (
    df0.rollup(agg_cols)
    .agg(F.sum("usage").alias("usage"), F.grouping_id().alias("lvl"))
    .orderBy(agg_cols)
)
agg_df.show()
+--------+------+-----+-------+-----+---+
|week_num|parent|brand|channel|usage|lvl|
+--------+------+-----+-------+-----+---+
| null| null| null| null|31658| 15|
| 2| null| null| null| 6000| 7|
| 2| A| null| null| 6000| 3|
| 2| A| A2| null| 6000| 1|
| 2| A| A2| A2TV| 3500| 0|
| 2| A| A2| A2web| 2500| 0|
| 4| null| null| null|20700| 7|
| 4| A| null| null| 7500| 3|
| 4| A| A1| null| 5500| 1|
| 4| A| A1| A2app| 5500| 0|
| 4| A| AD| null| 2000| 1|
| 4| A| AD| ADapp| 2000| 0|
| 4| B| null| null|13200| 3|
| 4| B| B25| null| 7600| 1|
| 4| B| B25| B25app| 7600| 0|
| 4| B| B26| null| 5600| 1|
| 4| B| B26| B26app| 5600| 0|
| 5| null| null| null| 4958| 7|
| 5| C| null| null| 4958| 3|
| 5| C| c25| null| 2658| 1|
+--------+------+-----+-------+-----+---+
only showing top 20 rows
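A side note (not from the original answer): F.grouping_id() returns a bitmask with one bit per rollup column, so the raw lvl values above identify the aggregation level directly. A minimal sketch of the mapping, assuming the agg_df built above:
# grouping_id() sets one bit per rolled-up column (the last column, channel, is bit 0),
# so for rollup(["week_num", "parent", "brand", "channel"]):
#   0  -> detail row (nothing rolled up)
#   1  -> channel rolled up                  -> total per (week_num, parent, brand)
#   3  -> brand + channel rolled up          -> total per (week_num, parent)
#   7  -> parent + brand + channel rolled up -> total per week_num
#   15 -> all four columns rolled up         -> grand total
# For example, keeping only the per-(week_num, parent) totals:
agg_df.where(F.col("lvl") == 3).show()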
The rest is purely cosmetic. It is probably not a good idea to do that in Spark; it is better handled in the reporting tool you will use afterwards.
# Turn the grouping_id bitmask into a dense rank:
#   1 = detail row
#   2 = total per (week_num, parent, brand) -> channel shows "Total"
#   3 = total per (week_num, parent)        -> brand shows "Total"
#   4 = total per week_num                  -> parent shows "Total"
#   5 = grand total                         (dropped below)
agg_df = agg_df.withColumn("lvl", F.dense_rank().over(Window.orderBy("lvl")))

TOTAL = "Total"
agg_df = (
    agg_df.withColumn(
        "parent", F.when(F.col("lvl") == 4, TOTAL).otherwise(F.col("parent"))
    )
    .withColumn(
        "brand",
        F.when(F.col("lvl") == 3, TOTAL).otherwise(
            F.coalesce(F.col("brand"), F.lit(""))
        ),
    )
    .withColumn(
        "channel",
        F.when(F.col("lvl") == 2, TOTAL).otherwise(
            F.coalesce(F.col("channel"), F.lit(""))
        ),
    )
)

# Drop the grand-total row and show the totals first within each week
agg_df.where(F.col("lvl") != 5).orderBy(
    "week_num", F.col("lvl").desc(), "parent", "brand", "channel"
).drop("lvl").show(500)
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| AD| Total| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B26| Total| 5600|
| 4| A| A1| A2app| 5500|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B26| B26app| 5600|
| 5| Total| | | 4958|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c27| Total| 1100|
| 5| C| c28| Total| 1200|
| 5| C| c25| c25app| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c28| c26app| 1200|
+--------+------+-----+-------+-----+
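As a small variation (a sketch, not part of the original answer), the relabelling can also be driven directly off the raw grouping_id() bitmask, which avoids the extra dense_rank step (that window has no partition, so it collects everything onto a single partition):
# Sketch only: assumes df0 and the imports above; lvl is the raw grouping_id() value
agg_cols = ["week_num", "parent", "brand", "channel"]
totals_df = (
    df0.rollup(agg_cols)
    .agg(F.sum("usage").alias("usage"), F.grouping_id().alias("lvl"))
    .where(F.col("lvl") != 15)  # drop the grand-total row
    # Relabel the rolled-up (null) columns: "Total" at the aggregated level, "" below it
    .withColumn("parent", F.when(F.col("lvl") == 7, "Total").otherwise(F.col("parent")))
    .withColumn("brand", F.when(F.col("lvl") == 3, "Total").otherwise(F.coalesce(F.col("brand"), F.lit(""))))
    .withColumn("channel", F.when(F.col("lvl") == 1, "Total").otherwise(F.coalesce(F.col("channel"), F.lit(""))))
    .orderBy("week_num", F.col("lvl").desc(), "parent", "brand", "channel")
    .drop("lvl")
)
totals_df.show(50)
This should produce the same table as shown above.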