PySpark: Add a new column containing the value from one column whose counterpart in another column meets a specified condition - sql

Add a new column containing the value from one column whose counterpart in another column meets a specified condition.
For instance, the original DF is as follows:
+-----+-----+-----+
|col1 |col2 |col3 |
+-----+-----+-----+
| A| 17| 1|
| A| 16| 2|
| A| 18| 2|
| A| 30| 3|
| B| 35| 1|
| B| 34| 2|
| B| 36| 2|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
+-----+-----+-----+
For each group in col1, I need to repeat the value of col2 from the row where col3 is 1. If a group has more than one row with col3 = 1, repeat the minimum such col2 value.
The desired DF is as follows:
+----+----+----+----------+
|col1|col2|col3|new_column|
+----+----+----+----------+
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
+----+----+----+----------+

df3 = df.filter(df.col3 == 1)
+----+----+----+
|col1|col2|col3|
+----+----+----+
| B| 35| 1|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
| A| 17| 1|
+----+----+----+
df3.createOrReplaceTempView("mytable")
To obtain the minimum value of col2 per group, I followed the accepted answer in this link: How to find exact median for grouped data in Spark
df6 = spark.sql("select col1, min(col2) as minimum from mytable group by col1 order by col1")
df6.show()
+----+-------+
|col1|minimum|
+----+-------+
| A| 17|
| B| 35|
| C| 20|
+----+-------+
df_a = df.join(df6, ['col1'], 'leftouter')
+----+----+----+-------+
|col1|col2|col3|minimum|
+----+----+----+-------+
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
+----+----+----+-------+
Is there a better way than this solution?
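
A more direct single-pass alternative (a sketch, not the asker's code) would be a conditional minimum over a window partitioned by col1, which avoids the separate filter, groupBy and join:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("col1")

# min() ignores the nulls produced by when() for rows where col3 != 1,
# so this picks the smallest col2 among the col3 == 1 rows of each group.
df_a = df.withColumn(
    "new_column",
    F.min(F.when(F.col("col3") == 1, F.col("col2"))).over(w),
)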

Related

sum of row values within a window range in spark dataframe

I have a dataframe as shown below, where the count column holds the number of values of A that have to be added to get a new column.
+---+----+----+------------+
| ID|date| A| count|
+---+----+----+------------+
| 1| 1| 10| null|
| 1| 2| 10| null|
| 1| 3|null| null|
| 1| 4| 20| 1|
| 1| 5|null| null|
| 1| 6|null| null|
| 1| 7| 60| 2|
| 1| 7| 60| null|
| 1| 8|null| null|
| 1| 9|null| null|
| 1| 10|null| null|
| 1| 11| 80| 3|
| 2| 1| 10| null|
| 2| 2| 10| null|
| 2| 3|null| null|
| 2| 4| 20| 1|
| 2| 5|null| null|
| 2| 6|null| null|
| 2| 7| 60| 2|
+---+----+----+------------+
The expected output is as shown below.
+---+----+----+-----+-------+
| ID|date| A|count|new_col|
+---+----+----+-----+-------+
| 1| 1| 10| null| null|
| 1| 2| 10| null| null|
| 1| 3| 10| null| null|
| 1| 4| 20| 2| 30|
| 1| 5| 10| null| null|
| 1| 6| 10| null| null|
| 1| 7| 60| 3| 80|
| 1| 7| 60| null| null|
| 1| 8|null| null| null|
| 1| 9|null| null| null|
| 1| 10| 10| null| null|
| 1| 11| 80| 2| 90|
| 2| 1| 10| null| null|
| 2| 2| 10| null| null|
| 2| 3|null| null| null|
| 2| 4| 20| 1| 20|
| 2| 5|null| null| null|
| 2| 6| 20| null| null|
| 2| 7| 60| 2| 80|
+---+----+----+-----+-------+
I tried with a window function as follows:
val w2 = Window.partitionBy("ID").orderBy("date")
val newDf = df.withColumn(
  "new_col",
  when(col("A").isNotNull && col("count").isNotNull,
    sum(col("A")).over(w2.rowsBetween(Window.currentRow - col("count"), Window.currentRow))))
But I am getting the error below:
error: overloaded method value - with alternatives:
(x: Long)Long <and>
(x: Int)Long <and>
(x: Char)Long <and>
(x: Short)Long <and>
(x: Byte)Long
cannot be applied to (org.apache.spark.sql.Column)
It seems the Column value provided inside the window frame definition is causing the issue.
Any idea how to resolve this error and achieve the requirement, or any other alternative solutions?
Any leads appreciated!
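
As background: rowsBetween takes plain Long bounds, so Window.currentRow - col("count") subtracts a Column from a Long, which is exactly what the overload error complains about; the frame size cannot vary per row. One possible workaround, sketched here in PySpark for consistency with the rest of this page (assuming Spark 2.4+ for the slice/aggregate SQL functions), is to collect the trailing non-null A values into an array and sum a variable-length slice of it; whether the slice should cover count or count + 1 rows depends on the intended semantics, which the sample output leaves ambiguous:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running history of the non-null A values seen so far within each ID.
w_hist = (
    Window.partitionBy("ID")
    .orderBy("date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

new_df = (
    df.withColumn("a_hist", F.collect_list("A").over(w_hist))  # collect_list skips nulls
    .withColumn(
        "new_col",
        F.when(
            F.col("A").isNotNull() & F.col("count").isNotNull(),
            # sum the last (count + 1) collected values of A; edge cases such as
            # count + 1 exceeding the collected history are not handled here
            F.expr("aggregate(slice(a_hist, -(count + 1), count + 1), 0L, (acc, x) -> acc + x)"),
        ),
    )
    .drop("a_hist")
)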

Pyspark sum of columns after union of dataframe

How can I sum all columns after unioning two dataframes?
I have this first df with one row per user:
df = sqlContext.createDataFrame(
    [("2022-01-10", 3, 2, "a"), ("2022-01-10", 3, 4, "b"), ("2022-01-10", 1, 3, "c")],
    ["date", "value1", "value2", "userid"],
)
df.show()
+----------+------+------+------+
| date|value1|value2|userid|
+----------+------+------+------+
|2022-01-10| 3| 2| a|
|2022-01-10| 3| 4| b|
|2022-01-10| 1| 3| c|
+----------+------+------+------+
The date value will always be today's date.
And I have another df, with multiple rows per userid this time, so one value for each day:
df2 = sqlContext.createDataFrame(
    [("2022-01-01", 13, 12, "a"), ("2022-01-02", 13, 14, "b"), ("2022-01-03", 11, 13, "c"),
     ("2022-01-04", 3, 2, "a"), ("2022-01-05", 3, 4, "b"), ("2022-01-06", 1, 3, "c"),
     ("2022-01-10", 31, 21, "a"), ("2022-01-07", 31, 41, "b"), ("2022-01-09", 11, 31, "c")],
    ["date", "value3", "value4", "userid"],
)
df2.show()
+----------+------+------+------+
| date|value3|value4|userid|
+----------+------+------+------+
|2022-01-01| 13| 12| a|
|2022-01-02| 13| 14| b|
|2022-01-03| 11| 13| c|
|2022-01-04| 3| 2| a|
|2022-01-05| 3| 4| b|
|2022-01-06| 1| 3| c|
|2022-01-10| 31| 21| a|
|2022-01-07| 31| 41| b|
|2022-01-09| 11| 31| c|
+----------+------+------+------+
After unioning the two of them with this function, here is what I have:
from pyspark.sql import functions as f

def union_different_tables(df1, df2):
    columns_df1 = df1.columns
    columns_df2 = df2.columns
    data_types_df1 = [i.dataType for i in df1.schema.fields]
    data_types_df2 = [i.dataType for i in df2.schema.fields]
    for col, _type in zip(columns_df1, data_types_df1):
        if col not in df2.columns:
            df2 = df2.withColumn(col, f.lit(None).cast(_type))
    for col, _type in zip(columns_df2, data_types_df2):
        if col not in df1.columns:
            df1 = df1.withColumn(col, f.lit(None).cast(_type))
    union = df1.unionByName(df2)
    return union
+----------+------+------+------+------+------+
| date|value1|value2|userid|value3|value4|
+----------+------+------+------+------+------+
|2022-01-10| 3| 2| a| null| null|
|2022-01-10| 3| 4| b| null| null|
|2022-01-10| 1| 3| c| null| null|
|2022-01-01| null| null| a| 13| 12|
|2022-01-02| null| null| b| 13| 14|
|2022-01-03| null| null| c| 11| 13|
|2022-01-04| null| null| a| 3| 2|
|2022-01-05| null| null| b| 3| 4|
|2022-01-06| null| null| c| 1| 3|
|2022-01-10| null| null| a| 31| 21|
|2022-01-07| null| null| b| 31| 41|
|2022-01-09| null| null| c| 11| 31|
+----------+------+------+------+------+------+
What I want to get is the sum of all the columns of df2 (I have 10 of them in the real case) up to today's date for each userid, so one row per user:
+----------+------+------+------+------+------+
| date|value1|value2|userid|value3|value4|
+----------+------+------+------+------+------+
|2022-01-10| 3| 2| a| 47 | 35 |
|2022-01-10| 3| 4| b| 47 | 59 |
|2022-01-10| 1| 3| c| 23 | 47 |
+----------+------+------+------+------+------+
Since I have to do this operation for multiple tables, here is what I tried:
from pyspark.sql import Window

user_window = Window.partitionBy(['userid']).orderBy('date')
list_tables = [df2]
list_col_original = df.columns

for table in list_tables:
    df = union_different_tables(df, table)
    list_column = list(set(table.columns) - set(list_col_original))
    list_col_original.extend(list_column)
    df = df.select(
        'userid',
        *[f.sum(f.col(col_name)).over(user_window).alias(col_name) for col_name in list_column]
    )
df.show()
+------+------+------+
|userid|value4|value3|
+------+------+------+
| c| 13| 11|
| c| 16| 12|
| c| 47| 23|
| c| 47| 23|
| b| 14| 13|
| b| 18| 16|
| b| 59| 47|
| b| 59| 47|
| a| 12| 13|
| a| 14| 16|
| a| 35| 47|
| a| 35| 47|
+------+------+------+
But that gives me a sort of cumulative sum, plus I didn't find a way to add all the columns to the resulting df.
The one constraint is that I can't do any join! My dataframes are very large and any join takes too long to compute.
Do you know how I can fix my code to get the result I want?
After the union of df1 and df2, you can group by userid and sum all columns except date, for which you take the max.
Note that for the union part, you can use DataFrame.unionByName with allowMissingColumns=True (available since Spark 3.1) when the data types match and only the set of columns differs:
df = df1.unionByName(df2, allowMissingColumns=True)
Then group by and agg:
import pyspark.sql.functions as F
result = df.groupBy("userid").agg(
    F.max("date").alias("date"),
    *[F.sum(c).alias(c) for c in df.columns if c not in ("date", "userid")]
)
result.show()
#+------+----------+------+------+------+------+
#|userid| date|value1|value2|value3|value4|
#+------+----------+------+------+------+------+
#| a|2022-01-10| 3| 2| 47| 35|
#| b|2022-01-10| 3| 4| 47| 59|
#| c|2022-01-10| 1| 3| 23| 47|
#+------+----------+------+------+------+------+
This assumes the second dataframe contains only dates prior to today's date in the first one. Otherwise, you'll need to filter df2 before the union.
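
If that filtering is needed, here is a minimal sketch (assuming the date column keeps the yyyy-MM-dd format shown above):

# Keep only df2 rows up to and including today before the union.
df2_filtered = df2.filter(F.to_date("date") <= F.current_date())
df = df1.unionByName(df2_filtered, allowMissingColumns=True)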

Add total per group as a new row in dataframe in Pyspark

Referring to my previous question here, I am trying to compute and add a total row for each brand, parent and week_num (the total of usage).
Here is a dummy sample:
df0 = spark.createDataFrame(
    [
        (2, "A", "A2", "A2web", 2500),
        (2, "A", "A2", "A2TV", 3500),
        (4, "A", "A1", "A2app", 5500),
        (4, "A", "AD", "ADapp", 2000),
        (4, "B", "B25", "B25app", 7600),
        (4, "B", "B26", "B26app", 5600),
        (5, "C", "c25", "c25app", 2658),
        (5, "C", "c27", "c27app", 1100),
        (5, "C", "c28", "c26app", 1200),
    ],
    ["week_num", "parent", "brand", "channel", "usage"],
)
This snippet adds a total row per channel:
from pyspark.sql import functions as f

# Group by and sum to get the totals
totals = (
    df0.groupBy(["week_num", "parent", "brand"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("channel", f.lit("Total"))
)
# create a temp variable to sort
totals = totals.withColumn("sort_id", f.lit(2))
df0 = df0.withColumn("sort_id", f.lit(1))
# Union dataframes, drop temp variable and show
df1 = (
    df0.unionByName(totals)
    .sort(["week_num", "parent", "brand", "sort_id"])
    .drop("sort_id")
)
df1.show()
df1.show()
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 5| C| c25| c25app| 2658|
| 5| C| c25| Total| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c27| Total| 1100|
| 5| C| c28| c26app| 1200|
| 5| C| c28| Total| 1200|
+--------+------+-----+-------+-----+
That is OK for the channel column. In order to get something like the output below, I simply repeat the first process (groupby + sum) and then union the result back:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
Here it is in two steps:
# add brand total row
df2 = (
    df0.groupBy(["week_num", "parent"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("brand", f.lit("Total"))
    .withColumn("channel", f.lit(""))
)
df2 = df1.unionByName(df2).sort(["week_num", "parent", "brand", "channel"])
# add weeknum total row
df3 = (
    df0.groupBy(["week_num"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("parent", f.lit("Total"))
    .withColumn("brand", f.lit(""))
    .withColumn("channel", f.lit(""))
)
df3 = df2.unionByName(df3).sort(["week_num", "parent", "brand", "channel"])
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| A|Total| | 7500|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 4| B|Total| |13200|
| 4| Total| | |20700|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c25| c25app| 2658|
| 5| C| c27| Total| 1100|
+--------+------+-----+-------+-----+
First question: is there any alternative approach or more efficient way that avoids the repetition?
And second: what if I want to always show the total at the top of each group, regardless of the alphabetical order of the parent/brand/channel names? How can I sort it like this (this is dummy data, but I hope it is clear enough):
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| A1| A2app| 5500|
| 4| A| AD| Total| 2000|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B25| B25app| 7600|
| 4| B| B26| Total| 5600|
| 4| B| B26| B26app| 5600|
I think you just need the rollup method.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

agg_cols = ["week_num", "parent", "brand", "channel"]
agg_df = (
    df0.rollup(agg_cols)
    .agg(F.sum("usage").alias("usage"), F.grouping_id().alias("lvl"))
    .orderBy(agg_cols)
)
agg_df.show()
+--------+------+-----+-------+-----+---+
|week_num|parent|brand|channel|usage|lvl|
+--------+------+-----+-------+-----+---+
| null| null| null| null|31658| 15|
| 2| null| null| null| 6000| 7|
| 2| A| null| null| 6000| 3|
| 2| A| A2| null| 6000| 1|
| 2| A| A2| A2TV| 3500| 0|
| 2| A| A2| A2web| 2500| 0|
| 4| null| null| null|20700| 7|
| 4| A| null| null| 7500| 3|
| 4| A| A1| null| 5500| 1|
| 4| A| A1| A2app| 5500| 0|
| 4| A| AD| null| 2000| 1|
| 4| A| AD| ADapp| 2000| 0|
| 4| B| null| null|13200| 3|
| 4| B| B25| null| 7600| 1|
| 4| B| B25| B25app| 7600| 0|
| 4| B| B26| null| 5600| 1|
| 4| B| B26| B26app| 5600| 0|
| 5| null| null| null| 4958| 7|
| 5| C| null| null| 4958| 3|
| 5| C| c25| null| 2658| 1|
+--------+------+-----+-------+-----+---+
only showing top 20 rows
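
For readers puzzled by the lvl values (this is my reading of the output, not something the answer spells out): F.grouping_id() sets one bit per rollup column that was aggregated away, most significant bit first, which is why only 0, 1, 3, 7 and 15 appear:

# Interpretation of the lvl column produced by F.grouping_id() above,
# with bits in the order week_num, parent, brand, channel:
#    0 (0b0000)  detail row, nothing rolled up
#    1 (0b0001)  channel rolled up                   -> per-brand total
#    3 (0b0011)  brand + channel rolled up           -> per-parent total
#    7 (0b0111)  parent + brand + channel rolled up  -> per-week total
#   15 (0b1111)  everything rolled up                -> grand total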
The rest is purely cosmetic. It is probably not a good idea to do that with Spark; it is better done in the reporting tool you will use afterwards.
agg_df = agg_df.withColumn("lvl", F.dense_rank().over(Window.orderBy("lvl")))

TOTAL = "Total"
agg_df = (
    agg_df.withColumn(
        "parent", F.when(F.col("lvl") == 4, TOTAL).otherwise(F.col("parent"))
    )
    .withColumn(
        "brand",
        F.when(F.col("lvl") == 3, TOTAL).otherwise(
            F.coalesce(F.col("brand"), F.lit(""))
        ),
    )
    .withColumn(
        "channel",
        F.when(F.col("lvl") == 2, TOTAL).otherwise(
            F.coalesce(F.col("channel"), F.lit(""))
        ),
    )
)

agg_df.where(F.col("lvl") != 5).orderBy(
    "week_num", F.col("lvl").desc(), "parent", "brand", "channel"
).drop("lvl").show(500)
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| AD| Total| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B26| Total| 5600|
| 4| A| A1| A2app| 5500|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B26| B26app| 5600|
| 5| Total| | | 4958|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c27| Total| 1100|
| 5| C| c28| Total| 1200|
| 5| C| c25| c25app| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c28| c26app| 1200|
+--------+------+-----+-------+-----+

Pyspark dataframes group by

I have a dataframe like the one below:
+---+---+---+
|123|124|125|
+---+---+---+
|  1|  2|  3|
|  9|  9|  4|
|  4| 12|  1|
|  2|  4|  8|
|  7|  6|  3|
| 19| 11|  2|
| 21| 10| 10|
+---+---+---+
I need the data to be in the form:
1: [123, 125]
2: [123, 124, 125]
3: [125]
The order does not matter. I am new to DataFrames in PySpark; any help would be appreciated.
There are no melt or pivot APIs in PySpark that will accomplish this directly. Instead, flatMap the RDD into a new dataframe and aggregate:
df.show()
+---+---+---+
|123|124|125|
+---+---+---+
| 1| 2| 3|
| 9| 9| 4|
| 4| 12| 1|
| 2| 4| 8|
| 7| 6| 3|
| 19| 11| 2|
| 21| 10| 10|
+---+---+---+
For each column of each row in the RDD, output a row with two columns: the value of the column and the column name:
cols = df.columns
(df.rdd
    .flatMap(lambda row: [(row[c], c) for c in cols])
    .toDF(["value", "column_name"])
    .show())
+-----+-----------+
|value|column_name|
+-----+-----------+
| 1| 123|
| 2| 124|
| 3| 125|
| 9| 123|
| 9| 124|
| 4| 125|
| 4| 123|
| 12| 124|
| 1| 125|
| 2| 123|
| 4| 124|
| 8| 125|
| 7| 123|
| 6| 124|
| 3| 125|
| 19| 123|
| 11| 124|
| 2| 125|
| 21| 123|
| 10| 124|
+-----+-----------+
Then, group by the value and aggregate the column names into a list:
from pyspark.sql import functions as f

(df.rdd
    .flatMap(lambda row: [(row[c], c) for c in cols])
    .toDF(["value", "column_name"])
    .groupby("value")
    .agg(f.collect_list("column_name"))
    .show())
+-----+-------------------------+
|value|collect_list(column_name)|
+-----+-------------------------+
| 19| [123]|
| 7| [123]|
| 6| [124]|
| 9| [123, 124]|
| 1| [123, 125]|
| 10| [124, 125]|
| 3| [125, 125]|
| 12| [124]|
| 8| [125]|
| 11| [124]|
| 2| [124, 123, 125]|
| 4| [125, 123, 124]|
| 21| [123]|
+-----+-------------------------+
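
Two small follow-ups (mine, not part of the answer): collect_list keeps duplicates, which is why value 3 ends up with [125, 125]; collect_set would match the desired output exactly. Also, the RDD round trip can be avoided with a stack()-based unpivot, sketched here assuming the column names shown in the question:

from pyspark.sql import functions as f

cols = df.columns  # ["123", "124", "125"]

# stack(n, val1, name1, val2, name2, ...) emits one (value, column_name) row per column.
stack_expr = "stack({}, {}) as (value, column_name)".format(
    len(cols), ", ".join("`{c}`, '{c}'".format(c=c) for c in cols)
)

(df.selectExpr(stack_expr)
    .groupby("value")
    .agg(f.collect_set("column_name").alias("columns"))
    .show())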

Getting Memory Error in PySpark during Filter & GroupBy computation

This is the error:
Job aborted due to stage failure: Task 12 in stage 37.0 failed 4 times, most recent failure: Lost task 12.3 in stage 37.0 (TID 325, 10.139.64.5, executor 20): ExecutorLostFailure (executor 20 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
So is there any alternative, more efficient way to apply those functions without causing an out-of-memory error? The data to be computed runs into the billions of rows.
Input Dataframe on which filtering is to be done:
+------+-------+-------+------+-------+-------+
|Pos_id|level_p|skill_p|Emp_id|level_e|skill_e|
+------+-------+-------+------+-------+-------+
| 1| 2| a| 100| 2| a|
| 1| 2| a| 100| 3| f|
| 1| 2| a| 100| 2| d|
| 1| 2| a| 101| 4| a|
| 1| 2| a| 101| 5| b|
| 1| 2| a| 101| 1| e|
| 1| 2| a| 102| 5| b|
| 1| 2| a| 102| 3| d|
| 1| 2| a| 102| 2| c|
| 2| 2| d| 100| 2| a|
| 2| 2| d| 100| 3| f|
| 2| 2| d| 100| 2| d|
| 2| 2| d| 101| 4| a|
| 2| 2| d| 101| 5| b|
| 2| 2| d| 101| 1| e|
| 2| 2| d| 102| 5| b|
| 2| 2| d| 102| 3| d|
| 2| 2| d| 102| 2| c|
| 2| 4| b| 100| 2| a|
| 2| 4| b| 100| 3| f|
+------+-------+-------+------+-------+-------+
Filtering Code:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql import functions as sf
function = udf(lambda item, items: 1 if item in items else 0, IntegerType())
df_result = new_df.withColumn('result', function(sf.col('skill_p'), sf.col('skill_e')))
df_filter = df_result.filter(sf.col("result") == 1)
df_filter.show()
res = df_filter.groupBy("Pos_id", "Emp_id").agg(
    sf.collect_set("skill_p").alias("SkillsMatch"),
    sf.sum("result").alias("SkillsMatchedCount"),
)
res.show()
This needs to be done on billions of rows.
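
One direction that may help (a suggestion, not from the question): the Python UDF forces every row through Python serialization, which adds executor memory pressure at this scale, and the same match can be expressed with native column functions. A sketch, assuming skill_e holds a single skill per row as in the sample (if the "in" check is meant as substring containment rather than equality, sf.col("skill_e").contains(sf.col("skill_p")) is the native equivalent):

from pyspark.sql import functions as sf

# Filter without a Python UDF: keep rows where the position skill matches the employee skill.
df_filter = new_df.filter(sf.col("skill_p") == sf.col("skill_e"))

# After the filter every remaining row is a match, so a plain count replaces sum("result").
res = df_filter.groupBy("Pos_id", "Emp_id").agg(
    sf.collect_set("skill_p").alias("SkillsMatch"),
    sf.count("*").alias("SkillsMatchedCount"),
)
res.show()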