Assign max+1 to the sequence field if the registerNumber set reappears - dataframe

I have a dataframe like the one below:
+-------+--------------+----+-------------+
|recType|registerNumber|mnId| sequence|
+-------+--------------+----+-------------+
| 01| 13578999| 0| 1|
| 11| 13578999| 1| 1|
| 13| 13578999| 2| 1|
| 14| 13578999| 3| 1|
| 14| 13578999| 4| 1|
| 01| 11121000| 5| 2|
| 11| 11121000| 6| 2|
| 13| 11121000| 7| 2|
| 14| 11121000| 8| 2|
| 01| OC387520| 9| 3|
| 11| OC387520| 10| 3|
| 13| OC387520| 11| 3|
| 01| 11121000| 12| 2|
| 11| 11121000| 13| 2|
| 13| 11121000| 14| 2|
| 14| 11121000| 15| 2|
| 01| OC321000| 16| 4|
| 11| OC321000| 17| 4|
| 13| OC321000| 18| 4|
| 01| OC522000| 19| 5|
| 11| OC522000| 20| 5|
| 13| OC522000| 21| 5|
+-------+--------------+----+-------------+
Each record set starts with recType equal to 01 and ends with recType equal to either 13 or 14.
In some cases there are duplicate registerNumber values, which assign a duplicate sequence value to the record set.
In the given dataframe, the registerNumber value 11121000 is duplicated.
I want to assign a new sequence value to the duplicate registerNumber value 11121000, so the output dataframe should look like the one below:
+-------+--------------+----+-------------+
|recType|registerNumber|mnId| sequence|
+-------+--------------+----+-------------+
| 01| 13578999| 0| 1|
| 11| 13578999| 1| 1|
| 13| 13578999| 2| 1|
| 14| 13578999| 3| 1|
| 14| 13578999| 4| 1|
| 01| 11121000| 5| 2|
| 11| 11121000| 6| 2|
| 13| 11121000| 7| 2|
| 14| 11121000| 8| 2|
| 01| OC387520| 9| 3|
| 11| OC387520| 10| 3|
| 13| OC387520| 11| 3|
| 01| 11121000| 12| 6|
| 11| 11121000| 13| 6|
| 13| 11121000| 14| 6|
| 14| 11121000| 15| 6|
| 01| OC321000| 16| 4|
| 11| OC321000| 17| 4|
| 13| OC321000| 18| 4|
| 01| OC522000| 19| 5|
| 11| OC522000| 20| 5|
| 13| OC522000| 21| 5|
+-------+--------------+----+-------------+
Please guide me on how to approach this problem.
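One possible approach (a minimal sketch, untested; it assumes mnId gives the global row order, every record set starts with recType 01, and the dataframe is named df): number the record sets, rank the sets that share a registerNumber by their first mnId, and give every repeated occurrence max(sequence) plus a running offset.
from pyspark.sql import functions as F
from pyspark.sql import Window

# number the record sets in mnId order; the unpartitioned window pulls all rows
# into a single partition, which is acceptable for a sketch
w_all = Window.orderBy("mnId")
sets = (df
    .withColumn("is_start", (F.col("recType") == "01").cast("int"))
    .withColumn("set_id", F.sum("is_start").over(w_all)))

# rank the record sets that share a registerNumber by their first mnId
occ = (sets
    .groupBy("registerNumber", "set_id")
    .agg(F.min("mnId").alias("first_mnId"))
    .withColumn("occurrence",
                F.row_number().over(Window.partitionBy("registerNumber").orderBy("first_mnId"))))

# every repeated occurrence gets max(sequence) + a running offset (max+1, max+2, ...)
max_seq = df.agg(F.max("sequence")).first()[0]
new_seqs = (occ.filter("occurrence > 1")
               .withColumn("new_sequence",
                           F.lit(max_seq) + F.row_number().over(Window.orderBy("first_mnId"))))

# join the new sequence values back and keep the old ones where no duplicate was found
result = (sets
    .join(new_seqs.select("set_id", "new_sequence"), "set_id", "left")
    .withColumn("sequence", F.coalesce("new_sequence", "sequence"))
    .drop("is_start", "set_id", "new_sequence")
    .orderBy("mnId"))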

Related

How to pivot by value in pyspark

Here's my input
+----+-----+---+------+----+------+-------+--------+
|year|month|day|new_ts|hour|minute|ts_rank| label|
+----+-----+---+------+----+------+-------+--------+
|2022| 1| 1| 13| 13| 24| 1| 7|
|2022| 1| 1| 14| 13| 24| 1| 8|
|2022| 1| 2| 15| 13| 24| 1| 7|
|2022| 1| 2| 16| 13| 44| 7| 8|
+----+-----+---+------+----+------+-------+--------+
Here's my output
+----+-----+---+-------+--------+
|year|month|day| 7 | 8|
+----+-----+---+-------+--------+
|2022| 1| 1| 13| 14|
|2022| 1| 2| 15| 16|
+----+-----+---+-------+--------+
Here's the pandas code
df_pivot = df.pivot(index=["year","month","day"], columns="label", values="new_ts").reset_index()
What I tried:
df_pivot = df.groupBy(["year","month","day"]).pivot("label").value("new_ts")
Note: sorry, I can't show the error message here because I'm using a cloud solution and it only shows the line of the error, not the error message.
df.groupBy("year","month","day").pivot('label').agg(first('new_ts')).show()
+----+-----+---+---+---+
|year|month|day| 7| 8|
+----+-----+---+---+---+
|2022| 1| 1| 13| 14|
|2022| 1| 2| 15| 16|
+----+-----+---+---+---+
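A side note (not part of the original answer): if the label values are known up front, passing them to pivot() avoids the extra job Spark runs to discover them and fixes the column order. A sketch, assuming label holds the integer values 7 and 8:
from pyspark.sql.functions import first

df.groupBy("year", "month", "day").pivot("label", [7, 8]).agg(first("new_ts")).show()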

How can count occurrence frequency of records in Spark data frame and add it as new column to data frame without affecting on index column?

I'm trying to add a new column named Freq to the given Spark dataframe, without affecting the index column or the record order, so that the statistical frequency (i.e. the count) is assigned back to the right row/record in the dataframe.
This is my data frame:
+---+-------------+------+------------+-------------+-----------------+
| id| Type|Length|Token_number|Encoding_type|Character_feature|
+---+-------------+------+------------+-------------+-----------------+
| 0| Sentence| 4014| 198| false| 136|
| 1| contextid| 90| 2| false| 15|
| 2| Sentence| 172| 11| false| 118|
| 3| String| 12| 0| true| 11|
| 4|version-style| 16| 0| false| 13|
| 5| Sentence| 339| 42| false| 110|
| 6|version-style| 16| 0| false| 13|
| 7| url_variable| 10| 2| false| 9|
| 8| url_variable| 10| 2| false| 9|
| 9| Sentence| 172| 11| false| 117|
| 10| contextid| 90| 2| false| 15|
| 11| Sentence| 170| 11| false| 114|
| 12|version-style| 16| 0| false| 13|
| 13| Sentence| 68| 10| false| 59|
| 14| String| 12| 0| true| 11|
| 15| Sentence| 173| 11| false| 118|
| 16| String| 12| 0| true| 11|
| 17| Sentence| 132| 8| false| 96|
| 18| String| 12| 0| true| 11|
| 19| contextid| 88| 2| false| 15|
+---+-------------+------+------------+-------------+-----------------+
I tried the following script unsuccessfully, due to the presence of the index column id:
from pyspark.sql import functions as F
from pyspark.sql import Window
bo = features_sdf.select('id', 'Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature')
sdf2 = (
    bo.na.fill(0).withColumn(
        'Freq',
        F.count("*").over(Window.partitionBy(bo.columns))
    ).withColumn(
        'MaxFreq',
        F.max('Freq').over(Window.partitionBy())
    ).withColumn(
        'MinFreq',
        F.min('Freq').over(Window.partitionBy())
    )
)
sdf2.show()
# bad result: the id column makes every record unique, so Freq = 1 for every row
+---+-------------+------+------------+-------------+-----------------+----+-------+-------+
| id| Type|Length|Token_number|Encoding_type|Character_feature|Freq|MaxFreq|MinFreq|
+---+-------------+------+------------+-------------+-----------------+----+-------+-------+
| 0| Sentence| 4014| 198| false| 136| 1| 1| 1|
| 1| contextid| 90| 2| false| 15| 1| 1| 1|
| 2| Sentence| 172| 11| false| 118| 1| 1| 1|
| 3| String| 12| 0| true| 11| 1| 1| 1|
| 4|version-style| 16| 0| false| 13| 1| 1| 1|
| 5| Sentence| 339| 42| false| 110| 1| 1| 1|
| 6|version-style| 16| 0| false| 13| 1| 1| 1|
| 7| url_variable| 10| 2| false| 9| 1| 1| 1|
| 8| url_variable| 10| 2| false| 9| 1| 1| 1|
| 9| Sentence| 172| 11| false| 117| 1| 1| 1|
| 10| contextid| 90| 2| false| 15| 1| 1| 1|
| 11| Sentence| 170| 11| false| 114| 1| 1| 1|
| 12|version-style| 16| 0| false| 13| 1| 1| 1|
| 13| Sentence| 68| 10| false| 59| 1| 1| 1|
| 14| String| 12| 0| true| 11| 1| 1| 1|
| 15| Sentence| 173| 11| false| 118| 1| 1| 1|
| 16| String| 12| 0| true| 11| 1| 1| 1|
| 17| Sentence| 132| 8| false| 96| 1| 1| 1|
| 18| String| 12| 0| true| 11| 1| 1| 1|
| 19| contextid| 88| 2| false| 15| 1| 1| 1|
+---+-------------+------+------------+-------------+-----------------+----+-------+-------+
If I exclude the index column id, the code works, but it messes up the order (due to unwanted sorting) and the results are not assigned to the right record/row, as follows:
+--------+------+------------+-------------+-----------------+----+-------+-------+
| Type|Length|Token_number|Encoding_type|Character_feature|Freq|MaxFreq|MinFreq|
+--------+------+------------+-------------+-----------------+----+-------+-------+
|Sentence| 7| 1| false| 6| 2| 1665| 1|
|Sentence| 7| 1| false| 6| 2| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
+--------+------+------------+-------------+-----------------+----+-------+-------+
In the end, I want to normalize this frequency between 0 and 1 using a simple mathematical formula and use it as a new feature. During normalization I also ran into problems and got null values. I already implemented the pandas version, which is easy, but I'm stuck in Spark:
# Statistical preprocessing
def add_freq_to_features(df):
    frequencies_df = df.groupby(list(df.columns)).size().to_frame().rename(columns={0: "Freq"})
    frequencies_df["Freq"] = frequencies_df["Freq"] / frequencies_df["Freq"].sum()  # normalizing between 0 and 1
    new_df = pd.merge(df, frequencies_df, how='left', on=list(df.columns))
    return new_df

# Apply frequency allocation and merge with extracted features df
features_df = add_freq_to_features(oba)
features_df.head(20)
and it returns the right results, as I expected.
I also tried to literally translate the pandas script using df.groupBy(df.columns).count(), but I couldn't get it to work:
# this is to build "raw" Freq based on @pltc's answer
sdf2 = (sdf
    .groupBy(sdf.columns)
    .agg(F.count('*').alias('Freq'))
    .withColumn('Encoding_type', F.col('Encoding_type').cast('string'))
)
sdf2.cache().count()
sdf2.show()
Here is the full PySpark code of what we have tried on a simplified example (available in this colab notebook), based on the answer of @ggordon:
def add_freq_to_features_(df):
    from pyspark.sql import functions as F
    from pyspark.sql import Window

    sdf_pltc = df.select('id', 'Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature')
    print("Before Any Modification")  # only included for debugging purposes
    sdf_pltc.show(5, truncate=0)

    # fill missing values with 0 using `na.fill(0)` before applying count as a window function
    sdf2 = (
        sdf_pltc.na.fill(0).withColumn(
            'Freq',
            F.count("*").over(Window.partitionBy(sdf_pltc.columns))
        ).withColumn(
            'MaxFreq',
            F.max('Freq').over(Window.partitionBy())
        ).withColumn(
            'MinFreq',
            F.min('Freq').over(Window.partitionBy())
        )
        .withColumn('id', F.col('id'))
    )
    print("After replacing null with 0 and counting by partitions")  # only included for debugging purposes

    # use orderBy as your last operation, only included here for debugging purposes
    #sdf2 = sdf2.orderBy(F.col('Type').desc(), F.col('Length').desc())
    sdf2.show(5, truncate=False)  # only included for debugging purposes

    sdf2 = (
        sdf2.withColumn('Freq', F.when(
            F.col('MaxFreq') == 0.000000000, 0
        ).otherwise(
            (F.col('Freq') - F.col('MinFreq')) / (F.col('MaxFreq') - F.col('MinFreq'))
        ))  # normalizing between 0 & 1
    )
    sdf2 = sdf2.drop('MinFreq').drop('MaxFreq')
    sdf2 = sdf2.withColumn('Encoding_type', F.col('Encoding_type').cast('string'))
    #sdf2 = sdf2.orderBy(F.col('Type').desc(), F.col('Length').desc())
    print("After normalization, encoding transformation and order by")  # only included for debugging purposes
    sdf2.show(50, truncate=False)
    return sdf2
Sadly, since I'm dealing with big data, I can't hack it with df.toPandas(): it is expensive and causes OOM errors.
Any help will be forever appreciated.
The pandas behavior is different because the ID field is the DataFrame index so it does not count in the "group by all" you do. You can get the same behavior in Spark with one change.
partitionBy takes an ordinary list of strings. Try removing the id column from your partition key list, like this:
bo = features_sdf.select('id', 'Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature')
# build a new list rather than calling bo.columns.remove('id'), which mutates in place and returns None
partition_columns = [c for c in bo.columns if c != 'id']
sdf2 = (
    bo.na.fill(0).withColumn(
        'Freq',
        F.count("*").over(Window.partitionBy(partition_columns))
    ).withColumn(
        'MaxFreq',
        F.max('Freq').over(Window.partitionBy())
    ).withColumn(
        'MinFreq',
        F.min('Freq').over(Window.partitionBy())
    )
)
That will give you the results you said worked but keep the ID field. You'll need to figure out how to do the division for your frequencies but that should get you started.
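For the division, a sketch building on the sdf2 above (not part of the original answer): either mirror the pandas version, which divides each group count by the total row count, or keep the min-max normalization from the question while guarding the MaxFreq == MinFreq case, which is the likely source of the null values mentioned above.
# variant 1: mirror the pandas version (group count / total number of rows)
total_rows = F.count("*").over(Window.partitionBy())
sdf2_norm = sdf2.withColumn("Freq", F.col("Freq") / total_rows).drop("MaxFreq", "MinFreq")

# variant 2: min-max normalization, with a guard so a zero denominator no longer produces nulls
sdf2_norm = (sdf2
    .withColumn("Freq",
                F.when(F.col("MaxFreq") == F.col("MinFreq"), F.lit(1.0))
                 .otherwise((F.col("Freq") - F.col("MinFreq")) / (F.col("MaxFreq") - F.col("MinFreq"))))
    .drop("MaxFreq", "MinFreq"))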

Add total per group as a new row in dataframe in Pyspark

Referring to my previous question here: I am trying to compute and add a total row of usage for each brand, parent and week_num.
Here is a dummy sample:
df0 = spark.createDataFrame(
    [
        (2, "A", "A2", "A2web", 2500),
        (2, "A", "A2", "A2TV", 3500),
        (4, "A", "A1", "A2app", 5500),
        (4, "A", "AD", "ADapp", 2000),
        (4, "B", "B25", "B25app", 7600),
        (4, "B", "B26", "B26app", 5600),
        (5, "C", "c25", "c25app", 2658),
        (5, "C", "c27", "c27app", 1100),
        (5, "C", "c28", "c26app", 1200),
    ],
    ["week_num", "parent", "brand", "channel", "usage"],
)
This snippet adds a total row per channel:
# Group by and sum to get the totals
totals = (
    df0.groupBy(["week_num", "parent", "brand"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("channel", f.lit("Total"))
)
# create a temp variable to sort on
totals = totals.withColumn("sort_id", f.lit(2))
df0 = df0.withColumn("sort_id", f.lit(1))
# Union dataframes, drop the temp variable and show
df1 = df0.unionByName(totals).sort(["week_num", "parent", "brand", "sort_id"]).drop("sort_id")
df1.show()
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 5| C| c25| c25app| 2658|
| 5| C| c25| Total| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c27| Total| 1100|
| 5| C| c28| c26app| 1200|
| 5| C| c28| Total| 1200|
+--------+------+-----+-------+-----+
That is OK for the channel column. In order to get something like below, I simply repeat the first process (groupBy + sum) and then union the result back:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2web| 2500|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
Here it is, in two steps:
# add brand total row
df2 = (
    df0.groupBy(["week_num", "parent"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("brand", f.lit("Total"))
    .withColumn("channel", f.lit(""))
)
df2 = df1.unionByName(df2).sort(["week_num", "parent", "brand", "channel"])
# add weeknum total row
df3 = (
    df0.groupBy(["week_num"])
    .agg(f.sum("usage").alias("usage"))
    .withColumn("parent", f.lit("Total"))
    .withColumn("brand", f.lit(""))
    .withColumn("channel", f.lit(""))
)
df3 = df2.unionByName(df3).sort(["week_num", "parent", "brand", "channel"])
result:
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 2| A| A2| Total| 6000|
| 2| A|Total| | 6000|
| 2| Total| | | 6000|
| 4| A| A1| A2app| 5500|
| 4| A| A1| Total| 5500|
| 4| A| AD| ADapp| 2000|
| 4| A| AD| Total| 2000|
| 4| A|Total| | 7500|
| 4| B| B25| B25app| 7600|
| 4| B| B25| Total| 7600|
| 4| B| B26| B26app| 5600|
| 4| B| B26| Total| 5600|
| 4| B|Total| |13200|
| 4| Total| | |20700|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c25| c25app| 2658|
| 5| C| c27| Total| 1100|
+--------+------+-----+-------+-----+
First question: is there an alternative approach or a more efficient way to do this without the repetition?
And second: what if I want to always show the total at the top of each group, regardless of the alphabetical order of parent/brand/channel? How can I sort it like this (this is dummy data, but I hope it is clear enough):
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| A1| A2app| 5500|
| 4| A| AD| Total| 2000|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B25| B25app| 7600|
| 4| B| B26| Total| 5600|
| 4| B| B26| B26app| 5600|
I think you just need the rollup method.
from pyspark.sql import functions as F, Window

agg_cols = ["week_num", "parent", "brand", "channel"]
agg_df = (
    df0.rollup(agg_cols)
    .agg(F.sum("usage").alias("usage"), F.grouping_id().alias("lvl"))
    .orderBy(agg_cols)
)
agg_df.show()
agg_df.show()
+--------+------+-----+-------+-----+---+
|week_num|parent|brand|channel|usage|lvl|
+--------+------+-----+-------+-----+---+
| null| null| null| null|31658| 15|
| 2| null| null| null| 6000| 7|
| 2| A| null| null| 6000| 3|
| 2| A| A2| null| 6000| 1|
| 2| A| A2| A2TV| 3500| 0|
| 2| A| A2| A2web| 2500| 0|
| 4| null| null| null|20700| 7|
| 4| A| null| null| 7500| 3|
| 4| A| A1| null| 5500| 1|
| 4| A| A1| A2app| 5500| 0|
| 4| A| AD| null| 2000| 1|
| 4| A| AD| ADapp| 2000| 0|
| 4| B| null| null|13200| 3|
| 4| B| B25| null| 7600| 1|
| 4| B| B25| B25app| 7600| 0|
| 4| B| B26| null| 5600| 1|
| 4| B| B26| B26app| 5600| 0|
| 5| null| null| null| 4958| 7|
| 5| C| null| null| 4958| 3|
| 5| C| c25| null| 2658| 1|
+--------+------+-----+-------+-----+---+
only showing top 20 rows
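For reference (not part of the original answer), grouping_id() is a bit mask over the rollup columns, with the leftmost column in the most significant bit; a bit is 1 when that column has been rolled up, i.e. is null in the output row:
#   week_num parent brand channel -> lvl
#      0       0     0      0     ->  0   detail rows
#      0       0     0      1     ->  1   totals per (week_num, parent, brand)
#      0       0     1      1     ->  3   totals per (week_num, parent)
#      0       1     1      1     ->  7   totals per week_num
#      1       1     1      1     -> 15   grand total

# e.g. keep only the brand-level totals:
agg_df.where(F.col("lvl") == 3).show()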
The rest is purely cosmetic. It's probably not a good idea to do that with Spark; it is better done in the reporting tool you will use afterwards.
agg_df = agg_df.withColumn("lvl", F.dense_rank().over(Window.orderBy("lvl")))
TOTAL = "Total"
agg_df = (
agg_df.withColumn(
"parent", F.when(F.col("lvl") == 4, TOTAL).otherwise(F.col("parent"))
)
.withColumn(
"brand",
F.when(F.col("lvl") == 3, TOTAL).otherwise(
F.coalesce(F.col("brand"), F.lit(""))
),
)
.withColumn(
"channel",
F.when(F.col("lvl") == 2, TOTAL).otherwise(
F.coalesce(F.col("channel"), F.lit(""))
),
)
)
agg_df.where(F.col("lvl") != 5).orderBy(
"week_num", F.col("lvl").desc(), "parent", "brand", "channel"
).drop("lvl").show(500)
+--------+------+-----+-------+-----+
|week_num|parent|brand|channel|usage|
+--------+------+-----+-------+-----+
| 2| Total| | | 6000|
| 2| A|Total| | 6000|
| 2| A| A2| Total| 6000|
| 2| A| A2| A2TV| 3500|
| 2| A| A2| A2web| 2500|
| 4| Total| | |20700|
| 4| A|Total| | 7500|
| 4| B|Total| |13200|
| 4| A| A1| Total| 5500|
| 4| A| AD| Total| 2000|
| 4| B| B25| Total| 7600|
| 4| B| B26| Total| 5600|
| 4| A| A1| A2app| 5500|
| 4| A| AD| ADapp| 2000|
| 4| B| B25| B25app| 7600|
| 4| B| B26| B26app| 5600|
| 5| Total| | | 4958|
| 5| C|Total| | 4958|
| 5| C| c25| Total| 2658|
| 5| C| c27| Total| 1100|
| 5| C| c28| Total| 1200|
| 5| C| c25| c25app| 2658|
| 5| C| c27| c27app| 1100|
| 5| C| c28| c26app| 1200|
+--------+------+-----+-------+-----+

Pyspark dataframes group by

I have a dataframe like the one below:
+---+---+---+
|123|124|125|
+---+---+---+
|  1|  2|  3|
|  9|  9|  4|
|  4| 12|  1|
|  2|  4|  8|
|  7|  6|  3|
| 19| 11|  2|
| 21| 10| 10|
+---+---+---+
I need the data to be in the form:
1:[123,125]
2:[123,124,125]
3:[125]
The order does not need to be sorted. I am new to dataframes in PySpark; any help would be appreciated.
There are no melt or pivot APIs in PySpark that will accomplish this directly. Instead, flatMap from the RDD into a new dataframe and aggregate:
df.show()
+---+---+---+
|123|124|125|
+---+---+---+
| 1| 2| 3|
| 9| 9| 4|
| 4| 12| 1|
| 2| 4| 8|
| 7| 6| 3|
| 19| 11| 2|
| 21| 10| 10|
+---+---+---+
For each column of each row in the RDD, output a row with two columns: the value of the column and the column name:
cols = df.columns
(df.rdd
.flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
.show())
+-----+-----------+
|value|column_name|
+-----+-----------+
| 1| 123|
| 2| 124|
| 3| 125|
| 9| 123|
| 9| 124|
| 4| 125|
| 4| 123|
| 12| 124|
| 1| 125|
| 2| 123|
| 4| 124|
| 8| 125|
| 7| 123|
| 6| 124|
| 3| 125|
| 19| 123|
| 11| 124|
| 2| 125|
| 21| 123|
| 10| 124|
+-----+-----------+
Then, group by the value and aggregate the column names into a list:
from pyspark.sql import functions as f
(df.rdd
.flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
.groupby("value").agg(f.collect_list("column_name"))
.show())
+-----+-------------------------+
|value|collect_list(column_name)|
+-----+-------------------------+
| 19| [123]|
| 7| [123]|
| 6| [124]|
| 9| [123, 124]|
| 1| [123, 125]|
| 10| [124, 125]|
| 3| [125, 125]|
| 12| [124]|
| 8| [125]|
| 11| [124]|
| 2| [124, 123, 125]|
| 4| [125, 123, 124]|
| 21| [123]|
+-----+-------------------------+
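A side note (not part of the original answer): the desired output lists each column name once per value (e.g. 3:[125]), while collect_list keeps duplicates such as [125, 125]. Swapping in collect_set deduplicates; a sketch reusing cols and f from above:
(df.rdd
 .flatMap(lambda row: [(row[c], c) for c in cols]).toDF(["value", "column_name"])
 .groupby("value").agg(f.collect_set("column_name").alias("columns"))
 .show())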

Pyspark: add a new column containing the value from one column whose counterpart in another column meets a specified condition

Add a new column containing the value from one column whose counterpart in another column meets a specified condition.
For instance, the original DF is as follows:
+-----+-----+-----+
|col1 |col2 |col3 |
+-----+-----+-----+
| A| 17| 1|
| A| 16| 2|
| A| 18| 2|
| A| 30| 3|
| B| 35| 1|
| B| 34| 2|
| B| 36| 2|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
+-----+-----+-----+
For each col1 group, I need to repeat the col2 value whose counterpart in col3 is 1; and if a col1 group has more than one row with col3 = 1, repeat the minimum such col2 value.
The desired DF is as follows:
+----+----+----+----------+
|col1|col2|col3|new_column|
+----+----+----+----------+
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
+----+----+----+----------+
df3=df.filter(df.col3==1)
+----+----+----+
|col1|col2|col3|
+----+----+----+
| B| 35| 1|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
| A| 17| 1|
+----+----+----+
df3.createOrReplaceTempView("mytable")
To obtain the minimum value of col2, I followed the accepted answer in this link: How to find exact median for grouped data in Spark
df6=spark.sql("select col1, min(col2) as minimum from mytable group by col1 order by col1")
df6.show()
+----+-------+
|col1|minimum|
+----+-------+
| A| 17|
| B| 35|
| C| 20|
+----+-------+
df_a=df.join(df6,['col1'],'leftouter')
+----+----+----+-------+
|col1|col2|col3|minimum|
+----+----+----+-------+
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
+----+----+----+-------+
Is there a better way than this solution?
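One alternative (a sketch, untested): a single window aggregate avoids the filter + groupBy + join round trip. when() without otherwise() yields null for rows where col3 is not 1, and min() ignores nulls, so this takes the minimum col2 among the col3 = 1 rows of each col1 partition.
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy("col1")
df_a = df.withColumn(
    "new_column",
    F.min(F.when(F.col("col3") == 1, F.col("col2"))).over(w)
)
df_a.show()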