PySpark reassign values of duplicate rows - dataframe

I have a dataframe like this:
id,p1
1,A
2,null
3,B
4,null
4,null
2,C
Using PySpark, I want to remove all duplicates by id. However, when an id has a duplicate whose p1 column is not null, I want to drop the null row and keep the non-null one. For example, I want to remove the first occurrence of id 2 and either occurrence of id 4. Right now I am splitting the dataframe into two dataframes, like this:
id,p1
1,A
3,B
2,C
id,p1
2,null
4,null
4,null
I then remove the duplicates from both and add back the rows of the second dataframe whose id does not appear in the first. That gives me this dataframe:
id,p1
1,A
3,B
4,null
2,C
This is what I have so far:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
d = spark.createDataFrame(
[(1,"A"),
(2,None),
(3,"B"),
(4,None),
(4,None),
(2,"C")],
["id", "p"]
)
d1 = d.filter(d.p.isNull())
d2 = d.filter(d.p.isNotNull())
d1 = d1.dropDuplicates()
d2 = d2.dropDuplicates()
d3 = d1.join(d2, "id", 'left_anti')
d4 = d2.unionByName(d3)
Is there a more beautiful way of doing this? It really feels redundant like this but I can't come up with a better way. I tried using groupby but couldn't achieve it. Any ideas? Thanks.

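from pyspark.sql.functions import col
# Sort p1 descending (nulls go last), then keep one row per id.
# Note: dropDuplicates is not guaranteed to keep the first row of the sorted order.
(df1.sort(col('p1').desc())
    .dropDuplicates(subset=['id'])
    .show())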
+---+----+
| id| p1|
+---+----+
| 1| A|
| 2| C|
| 3| B|
| 4|null|
+---+----+

Use the row_number() window function, partitioned by "id" and ordered by the "p" column descending, then keep the first row of each partition.
Example:
d.show()
#+---+----+
#| id| p|
#+---+----+
#| 1| A|
#| 2|null|
#| 3| B|
#| 4|null|
#| 4|null|
#| 2| C|
#+---+----+
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window
window_spec=row_number().over(Window.partitionBy("id").orderBy(col("p").desc()))
d.withColumn("rn",window_spec).filter(col("rn")==1).drop("rn").show()
#+---+----+
#| id| p|
#+---+----+
#| 1| A|
#| 3| B|
#| 2| C|
#| 4|null|
#+---+----+
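To sanity-check the Spark output on a small sample, the rule can be modelled in plain Python. This is only a reference sketch, not Spark code: the helper name keep_one_per_id is made up for illustration, and it assumes the same ordering as above (largest non-null p wins, null is kept only when nothing else exists for that id).

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[int, Optional[str]]


def keep_one_per_id(rows: Iterable[Row]) -> List[Row]:
    """Keep one row per id, preferring the largest non-null p (nulls last)."""
    best: Dict[int, Optional[str]] = {}
    for row_id, p in rows:
        if row_id not in best:
            # First time we see this id: take whatever value it has.
            best[row_id] = p
        elif p is not None and (best[row_id] is None or p > best[row_id]):
            # A non-null value beats null, and a larger value beats a smaller one.
            best[row_id] = p
    return sorted(best.items())


rows = [(1, "A"), (2, None), (3, "B"), (4, None), (4, None), (2, "C")]
result = keep_one_per_id(rows)
print(result)  # [(1, 'A'), (2, 'C'), (3, 'B'), (4, None)]
```

The row set matches the tables shown in the answers; only the row order differs, since Spark does not promise one without an explicit orderBy.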

Related

pyspark get latest non-null element of every column in one row

Let me explain my question using an example:
I have a dataframe:
pd_1 = pd.DataFrame({'day':[1,2,3,2,1,3],
'code': [10, 10, 20,20,30,30],
'A': [44, 55, 66,77,88,99],
'B':['a',None,'c',None,'d', None],
'C':[None,None,'12',None,None, None]
})
df_1 = sc.createDataFrame(pd_1)
df_1.show()
Output:
+---+----+---+----+----+
|day|code| A| B| C|
+---+----+---+----+----+
| 1| 10| 44| a|null|
| 2| 10| 55|null|null|
| 3| 20| 66| c| 12|
| 2| 20| 77|null|null|
| 1| 30| 88| d|null|
| 3| 30| 99|null|null|
+---+----+---+----+----+
What I want to achieve is a new dataframe in which each row corresponds to a code and each column holds the most recent non-null value (the one with the highest day).
In pandas, I can simply do
pd_2 = pd_1.sort_values('day', ascending=True).groupby('code').last()
pd_2.reset_index()
to get
code day A B C
0 10 2 55 a None
1 20 3 66 c 12
2 30 3 99 d None
My question is, how can I do it in pyspark (preferably version < 3)?
What I have tried so far is:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy('code').orderBy(F.desc('day')).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
## Update: after applying @Steven's idea to remove the for loop:
df_1 = df_1.select([F.collect_list(x).over(w).getItem(0).alias(x) for x in df_1.columns])
##for x in df_1.columns:
## df_1 = df_1.withColumn(x, F.collect_list(x).over(w).getItem(0))
df_1 = df_1.distinct()
df_1.show()
Output
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
Which I'm not very happy with, especially due to the for loop.
I think your current solution is quite nice. If you want another solution, you can try the first/last window functions with ignorenulls=True:
from pyspark.sql import functions as F, Window
w = Window.partitionBy("code").orderBy(F.col("day").desc())
df2 = (
    df.select(
        "day",
        "code",
        F.row_number().over(w).alias("rwnb"),
        *(
            F.first(F.col(col), ignorenulls=True)
            .over(w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
            .alias(col)
            for col in ("A", "B", "C")
        ),
    )
    .where("rwnb = 1")
    .drop("rwnb")
)
and the result :
df2.show()
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
Here's another way of doing it, using array functions and struct ordering instead of a Window:
from pyspark.sql import functions as F
other_cols = ["day", "A", "B", "C"]
df_1 = df_1.groupBy("code").agg(
F.collect_list(F.struct(*other_cols)).alias("values")
).selectExpr(
"code",
*[f"array_max(filter(values, x-> x.{c} is not null))['{c}'] as {c}" for c in other_cols]
)
df_1.show()
#+----+---+---+---+----+
#|code|day| A| B| C|
#+----+---+---+---+----+
#| 10| 2| 55| a|null|
#| 30| 3| 99| d|null|
#| 20| 3| 66| c| 12|
#+----+---+---+---+----+
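For checking any of these answers against the pandas result in the question, here is a plain-Python model of "latest non-null value per column, per code". It is a sketch for small samples only; the function name latest_non_null and its parameters are hypothetical, and it assumes the rows fit in memory as dicts.

```python
from typing import Any, Dict, Hashable, List


def latest_non_null(rows: List[Dict[str, Any]], key: str, order_by: str) -> List[Dict[str, Any]]:
    """For each key, keep the last non-null value of every column in order_by order."""
    merged: Dict[Hashable, Dict[str, Any]] = {}
    for row in sorted(rows, key=lambda r: r[order_by]):  # ascending, so later rows win
        target = merged.setdefault(row[key], {})
        for name, value in row.items():
            if value is not None:
                target[name] = value  # a newer non-null value overwrites the older one
            else:
                target.setdefault(name, None)  # keep the column even if it is always null
    return [merged[k] for k in sorted(merged)]


columns = ["day", "code", "A", "B", "C"]
data = zip(
    [1, 2, 3, 2, 1, 3],
    [10, 10, 20, 20, 30, 30],
    [44, 55, 66, 77, 88, 99],
    ["a", None, "c", None, "d", None],
    [None, None, "12", None, None, None],
)
rows = [dict(zip(columns, values)) for values in data]
result = latest_non_null(rows, key="code", order_by="day")
for row in result:
    print(row)
```

With the sample data this prints one dict per code (10, 20, 30), with the same values as the pandas groupby('code').last() output in the question.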

How to transpose a long dataframe to wide dataframe

I have a dataframe that looks like this:
group, rate
A,0.1
A,0.2
B,0.3
B,0.1
C,0.1
C,0.2
How can I transpose this to a wide dataframe? This is what I expect to get:
group, rate_1, rate_2
A,0.1,0.2
B,0.3,0.1
C,0.1,0.2
The number of records in each group is the same. Also, how can I create consistent column names with a prefix or suffix while transposing?
Do you know which function I can use?
Thanks,
Try groupBy with collect_list, then dynamically split the array column into new columns.
Example:
df.show()
#+-----+----+
#|group|rate|
#+-----+----+
#| A| 0.1|
#| A| 0.2|
#| B| 0.3|
#| B| 0.1|
#+-----+----+
from pyspark.sql.functions import col, collect_list, expr
arr_size = 2
exprs=['group']+[expr('lst[' + str(x) + ']').alias('rate_'+str(x+1)) for x in range(0, arr_size)]
df1=df.groupBy("group").agg(collect_list(col("rate")).alias("lst"))
df1.select(*exprs).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.3| 0.1|
#| A| 0.1| 0.2|
#+-----+------+------+
To preserve order in collect_list():
from pyspark.sql.functions import *
from pyspark.sql import *
df=spark.createDataFrame([('A',0.1),('A',0.2),('B',0.3),('B',0.1)],['group','rate']).withColumn("mid",monotonically_increasing_id()).repartition(100)
w=Window.partitionBy("group").orderBy("mid")
w1=Window.partitionBy("group").orderBy(desc("mid"))
df1=df.withColumn("lst",collect_list(col("rate")).over(w)).\
withColumn("snr",row_number().over(w1)).\
filter(col("snr") == 1).\
drop(*['mid','snr','rate'])
df1.show()
#+-----+----------+
#|group| lst|
#+-----+----------+
#| B|[0.3, 0.1]|
#| A|[0.1, 0.2]|
#+-----+----------+
arr_size = 2
exprs=['group']+[expr('lst[' + str(x) + ']').alias('rate_'+str(x+1)) for x in range(0, arr_size)]
df1.select(*exprs).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.3| 0.1|
#| A| 0.1| 0.2|
#+-----+------+------+
I would create a column to rank your "rate" column and then pivot:
First create a "rank" column and concatenate the string "rate_" to the row_number:
from pyspark.sql.functions import concat, first, lit, row_number
from pyspark.sql import Window
df = df.withColumn(
    "rank",
    concat(
        lit("rate_"),
        row_number().over(Window.partitionBy("group").orderBy("rate")).cast("string")
    )
)
df.show()
#+-----+----+------+
#|group|rate| rank|
#+-----+----+------+
#| B| 0.1|rate_1|
#| B| 0.3|rate_2|
#| C| 0.1|rate_1|
#| C| 0.2|rate_2|
#| A| 0.1|rate_1|
#| A| 0.2|rate_2|
#+-----+----+------+
Now group by the "group" column and pivot on the "rank" column. Since you need an aggregation, use first.
df.groupBy("group").pivot("rank").agg(first("rate")).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.1| 0.3|
#| C| 0.1| 0.2|
#| A| 0.1| 0.2|
#+-----+------+------+
The above does not depend on knowing the number of records in each group ahead of time.
However, if (as you said) you know the number of records in each group, you can make the pivot more efficient by passing in the values:
num_records = 2
values = ["rate_" + str(i+1) for i in range(num_records)]
df.groupBy("group").pivot("rank", values=values).agg(first("rate")).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.1| 0.3|
#| C| 0.1| 0.2|
#| A| 0.1| 0.2|
#+-----+------+------+
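Note that the two answers number the columns differently: the collect_list version keeps input order within a group, while the pivot version ranks by rate. A plain-Python model of the input-order variant is sketched below for checking small samples; long_to_wide and its prefix parameter are names made up for this illustration, not part of any library.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def long_to_wide(pairs: Iterable[Tuple[str, float]], prefix: str = "rate_") -> Dict[str, Dict[str, float]]:
    """Collect values per group in input order and name them prefix1, prefix2, ..."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for group, rate in pairs:
        grouped[group].append(rate)
    return {
        group: {f"{prefix}{i}": rate for i, rate in enumerate(rates, start=1)}
        for group, rates in grouped.items()
    }


pairs = [("A", 0.1), ("A", 0.2), ("B", 0.3), ("B", 0.1), ("C", 0.1), ("C", 0.2)]
wide = long_to_wide(pairs)
print(wide["B"])  # {'rate_1': 0.3, 'rate_2': 0.1}
```

For group B this gives rate_1 = 0.3 and rate_2 = 0.1, which is what the question asks for; the rank-by-rate pivot would give 0.1 and 0.3 instead.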

Show different values in another column that has the same id pyspark dataframe

I have a pyspark dataframe that looks like this:
How can I show the count of every unique time under every id, ordered by id? The ideal result is below.
Try with groupBy,count
Example:
df.show()
#+---+-------------------+
#| ID| TIME|
#+---+-------------------+
#| 1|07-24-2019,19:47:36|
#| 2|07-24-2019,20:43:39|
#| 1|07-24-2019,20:47:36|
#| 1|07-24-2019,19:47:36|
#+---+-------------------+
from pyspark.sql.functions import *
df.groupBy("ID","TIME").\
agg(count(col("ID")).alias("count")).\
orderBy("ID","TIME").\
show()
#or using time as aggregation
df.groupBy("ID","TIME").\
agg(count(col("TIME")).alias("count")).\
orderBy("ID","TIME").\
show()
#+---+-------------------+-----+
#| ID| TIME|count|
#+---+-------------------+-----+
#| 1|07-24-2019,19:47:36| 2|
#| 1|07-24-2019,20:47:36| 1|
#| 2|07-24-2019,20:43:39| 1|
#+---+-------------------+-----+
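The grouping above counts occurrences of each (ID, TIME) pair. As a quick cross-check on a small sample, the same counting can be sketched in plain Python with collections.Counter; this is only a reference model and uses the example rows from this answer.

```python
from collections import Counter

# Same sample rows as in the df.show() output above.
rows = [
    (1, "07-24-2019,19:47:36"),
    (2, "07-24-2019,20:43:39"),
    (1, "07-24-2019,20:47:36"),
    (1, "07-24-2019,19:47:36"),
]

# Count each (ID, TIME) pair, then sort by ID and TIME like orderBy("ID", "TIME").
counts = sorted(Counter(rows).items())
for (row_id, time), count in counts:
    print(row_id, time, count)
```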

how to sort value before concatenate text columns in pyspark

I need help converting the code below to PySpark or PySpark SQL.
df["full_name"] = df.apply(lambda x: "_".join(sorted((x["first"], x["last"]))), axis=1)
It basically adds a new column named full_name that concatenates the values of the columns first and last in sorted order.
I have written the code below, but I don't know how to sort the column values before concatenating them.
df= df.withColumn('full_name', f.concat(f.col('first'),f.lit('_'), f.col('last')))
From Spark-2.4+:
We can use array_join, array_sort functions for this case.
Example:
df.show()
#+-----+----+
#|first|last|
#+-----+----+
#| a| b|
#| e| c|
#| d| a|
#+-----+----+
from pyspark.sql.functions import *
#first we create array of first,last columns then apply sort and join on array
df.withColumn("full_name",array_join(array_sort(array(col("first"),col("last"))),"_")).show()
#+-----+----+---------+
#|first|last|full_name|
#+-----+----+---------+
#| a| b| a_b|
#| e| c| c_e|
#| d| a| a_d|
#+-----+----+---------+

How to get unique values for each column in HIVE/PySpark table?

I have a table in HIVE/PySpark with A, B and C columns.
I want to get unique values for each of the column like
{A: [1, 2, 3], B:[a, b], C:[10, 20]}
in any format (dataframe, table, etc.)
How can I do this efficiently (in parallel for each column) in HIVE or PySpark?
My current approach handles each column separately and therefore takes a lot of time.
We can use collect_set() from the pyspark.sql.functions module:
>>> df = spark.createDataFrame([(1,'a',10),(2,'a',20),(3,'b',10)],['A','B','C'])
>>> df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| a| 10|
| 2| a| 20|
| 3| b| 10|
+---+---+---+
>>> from pyspark.sql import functions as F
>>> df.select([F.collect_set(x).alias(x) for x in df.columns]).show()
+---------+------+--------+
| A| B| C|
+---------+------+--------+
|[1, 2, 3]|[b, a]|[20, 10]|
+---------+------+--------+
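As the output shows ([b, a], [20, 10]), collect_set does not return the values in any particular order. When comparing against an expected result, it may help to sort first; the plain-Python sketch below builds the dictionary shape from the question for the same sample rows. It is a reference model for small data only, not a replacement for the Spark query.

```python
# Same sample rows and column names as in the createDataFrame call above.
rows = [(1, "a", 10), (2, "a", 20), (3, "b", 10)]
columns = ["A", "B", "C"]

# zip(*rows) transposes rows into columns; set() removes duplicates,
# and sorted() gives a stable order for comparison.
unique = {name: sorted(set(values)) for name, values in zip(columns, zip(*rows))}
print(unique)  # {'A': [1, 2, 3], 'B': ['a', 'b'], 'C': [10, 20]}
```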