I have a dataframe like this:
id,p1
1,A
2,null
3,B
4,null
4,null
2,C
Using PySpark, I want to remove all the duplicates. However, if an id has both a null and a non-null row, I want to keep the non-null one. For example, I want to remove the null occurrence of id 2 and either row of id 4. Right now I am splitting the dataframe into two dataframes, as such:
id,p1
1,A
3,B
2,C
id,p1
2,null
4,null
4,null
I remove the duplicates from both, then add back the rows that are not in the first dataframe. That way I get this dataframe:
id,p1
1,A
3,B
4,null
2,C
This is what I have so far:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
d = spark.createDataFrame(
    [(1, "A"),
     (2, None),
     (3, "B"),
     (4, None),
     (4, None),
     (2, "C")],
    ["id", "p"]
)
d1 = d.filter(d.p.isNull())
d2 = d.filter(d.p.isNotNull())
d1 = d1.dropDuplicates()
d2 = d2.dropDuplicates()
d3 = d1.join(d2, "id", 'left_anti')
d4 = d2.unionByName(d3)
Is there a more elegant way of doing this? It feels redundant, but I can't come up with a better way. I tried using groupBy but couldn't achieve it. Any ideas? Thanks.
from pyspark.sql.functions import col

(d.sort(col('p').desc())          # sort descending; desc() puts nulls last
  .dropDuplicates(subset=['id'])  # keep one row per id
  .show())
+---+----+
| id|   p|
+---+----+
|  1|   A|
|  2|   C|
|  3|   B|
|  4|null|
+---+----+
Use the window function row_number() and sort by the "p" column descending.
Example:
d.show()
#+---+----+
#| id| p|
#+---+----+
#| 1| A|
#| 2|null|
#| 3| B|
#| 4|null|
#| 4|null|
#| 2| C|
#+---+----+
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window
rn = row_number().over(Window.partitionBy("id").orderBy(col("p").desc()))
d.withColumn("rn", rn).filter(col("rn") == 1).drop("rn").show()
#+---+----+
#| id| p|
#+---+----+
#| 1| A|
#| 3| B|
#| 2| C|
#| 4|null|
#+---+----+
Let me explain my question using an example:
I have a dataframe:
import pandas as pd

pd_1 = pd.DataFrame({'day': [1, 2, 3, 2, 1, 3],
                     'code': [10, 10, 20, 20, 30, 30],
                     'A': [44, 55, 66, 77, 88, 99],
                     'B': ['a', None, 'c', None, 'd', None],
                     'C': [None, None, '12', None, None, None]})
df_1 = spark.createDataFrame(pd_1)
df_1.show()
Output:
+---+----+---+----+----+
|day|code| A| B| C|
+---+----+---+----+----+
| 1| 10| 44| a|null|
| 2| 10| 55|null|null|
| 3| 20| 66| c| 12|
| 2| 20| 77|null|null|
| 1| 30| 88| d|null|
| 3| 30| 99|null|null|
+---+----+---+----+----+
What I want to achieve is a new dataframe, each row corresponds to a code, and for each column I want to have the most recent non-null value (with highest day).
In pandas, I can simply do
pd_2 = pd_1.sort_values('day', ascending=True).groupby('code').last()
pd_2.reset_index()
to get
code day A B C
0 10 2 55 a None
1 20 3 66 c 12
2 30 3 99 d None
My question is, how can I do it in pyspark (preferably version < 3)?
What I have tried so far is:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy('code').orderBy(F.desc('day')).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

## Update: after applying @Steven's idea to remove the for loop:
df_1 = df_1.select([F.collect_list(x).over(w).getItem(0).alias(x) for x in df_1.columns])
## for x in df_1.columns:
##     df_1 = df_1.withColumn(x, F.collect_list(x).over(w).getItem(0))
df_1 = df_1.distinct()
df_1.show()
Output
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
Which I'm not very happy with, especially due to the for loop.
I think your current solution is quite nice. If you want another one, you can try using the first/last window functions:
from pyspark.sql import functions as F, Window
w = Window.partitionBy("code").orderBy(F.col("day").desc())
df2 = (
    df.select(
        "day",
        "code",
        F.row_number().over(w).alias("rwnb"),
        *(
            F.first(F.col(col), ignorenulls=True)
            .over(w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
            .alias(col)
            for col in ("A", "B", "C")
        ),
    )
    .where("rwnb = 1")
    .drop("rwnb")
)
and the result:
df2.show()
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
Here's another way, using array functions and struct ordering instead of a Window:
from pyspark.sql import functions as F

other_cols = ["day", "A", "B", "C"]

df_1 = df_1.groupBy("code").agg(
    F.collect_list(F.struct(*other_cols)).alias("values")
).selectExpr(
    "code",
    *[f"array_max(filter(values, x -> x.{c} is not null))['{c}'] as {c}" for c in other_cols]
)
df_1.show()
#+----+---+---+---+----+
#|code|day| A| B| C|
#+----+---+---+---+----+
#| 10| 2| 55| a|null|
#| 30| 3| 99| d|null|
#| 20| 3| 66| c| 12|
#+----+---+---+---+----+
I have a dataframe that looks like:
group, rate
A,0.1
A,0.2
B,0.3
B,0.1
C,0.1
C,0.2
How can I transpose this into a wide dataframe? This is what I expect to get:
group, rate_1, rate_2
A,0.1,0.2
B,0.3,0.1
C,0.1,0.2
The number of records in each group is the same. Also, how can I create consistent column names with a prefix or suffix while transposing?
Do you know which function I can use?
Thanks.
Try groupBy with collect_list, then dynamically split the array column into new columns.
Example:
df.show()
#+-----+----+
#|group|rate|
#+-----+----+
#| A| 0.1|
#| A| 0.2|
#| B| 0.3|
#| B| 0.1|
#+-----+----+
from pyspark.sql.functions import col, collect_list, expr

arr_size = 2
exprs = ['group'] + [expr(f'lst[{x}]').alias(f'rate_{x+1}') for x in range(arr_size)]
df1 = df.groupBy("group").agg(collect_list(col("rate")).alias("lst"))
df1.select(*exprs).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.3| 0.1|
#| A| 0.1| 0.2|
#+-----+------+------+
To preserve order in collect_list():
from pyspark.sql.functions import *
from pyspark.sql import *

df = spark.createDataFrame([('A', 0.1), ('A', 0.2), ('B', 0.3), ('B', 0.1)], ['group', 'rate']).\
    withColumn("mid", monotonically_increasing_id()).repartition(100)

w = Window.partitionBy("group").orderBy("mid")
w1 = Window.partitionBy("group").orderBy(desc("mid"))

df1 = df.withColumn("lst", collect_list(col("rate")).over(w)).\
    withColumn("snr", row_number().over(w1)).\
    filter(col("snr") == 1).\
    drop(*['mid', 'snr', 'rate'])
df1.show()
#+-----+----------+
#|group| lst|
#+-----+----------+
#| B|[0.3, 0.1]|
#| A|[0.1, 0.2]|
#+-----+----------+
arr_size = 2
exprs = ['group'] + [expr(f'lst[{x}]').alias(f'rate_{x+1}') for x in range(arr_size)]
df1.select(*exprs).show()
+-----+------+------+
|group|rate_1|rate_2|
+-----+------+------+
| B| 0.3| 0.1|
| A| 0.1| 0.2|
+-----+------+------+
I would create a column to rank your "rate" column and then pivot:
First create a "rank" column and concatenate the string "rate_" to the row_number:
from pyspark.sql.functions import concat, first, lit, row_number
from pyspark.sql import Window
df = df.withColumn(
"rank",
concat(
lit("rate_"),
row_number().over(Window.partitionBy("group")\
.orderBy("rate")).cast("string")
)
)
df.show()
#+-----+----+------+
#|group|rate| rank|
#+-----+----+------+
#| B| 0.1|rate_1|
#| B| 0.3|rate_2|
#| C| 0.1|rate_1|
#| C| 0.2|rate_2|
#| A| 0.1|rate_1|
#| A| 0.2|rate_2|
#+-----+----+------+
Now group by the "group" column and pivot on the "rank" column. Since you need an aggregation, use first.
df.groupBy("group").pivot("rank").agg(first("rate")).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.1| 0.3|
#| C| 0.1| 0.2|
#| A| 0.1| 0.2|
#+-----+------+------+
The above does not depend on knowing the number of records in each group ahead of time.
However, if (as you said) you know the number of records per group, you can make the pivot more efficient by passing in the values:
num_records = 2
values = ["rate_" + str(i+1) for i in range(num_records)]
df.groupBy("group").pivot("rank", values=values).agg(first("rate")).show()
#+-----+------+------+
#|group|rate_1|rate_2|
#+-----+------+------+
#| B| 0.1| 0.3|
#| C| 0.1| 0.2|
#| A| 0.1| 0.2|
#+-----+------+------+
I have a PySpark dataframe with an ID column and a TIME column (shown in the example below).
How can I show the count of every unique TIME under every ID, ordered by ID?
Try groupBy with count.
Example:
df.show()
#+---+-------------------+
#| ID| TIME|
#+---+-------------------+
#| 1|07-24-2019,19:47:36|
#| 2|07-24-2019,20:43:39|
#| 1|07-24-2019,20:47:36|
#| 1|07-24-2019,19:47:36|
#+---+-------------------+
from pyspark.sql.functions import *
df.groupBy("ID","TIME").\
agg(count(col("ID")).alias("count")).\
orderBy("ID","TIME").\
show()
#or using time as aggregation
df.groupBy("ID","TIME").\
agg(count(col("TIME")).alias("count")).\
orderBy("ID","TIME").\
show()
#+---+-------------------+-----+
#| ID| TIME|count|
#+---+-------------------+-----+
#| 1|07-24-2019,19:47:36| 2|
#| 1|07-24-2019,20:47:36| 1|
#| 2|07-24-2019,20:43:39| 1|
#+---+-------------------+-----+
I need help converting the code below to PySpark code or PySpark SQL code.
df["full_name"] = df.apply(lambda x: "_".join(sorted((x["first"], x["last"]))), axis=1)
It's basically adding a new column, full_name, which concatenates the values of the first and last columns in sorted order.
I have written the code below, but I don't know how to sort the two columns' text values before concatenating:
df = df.withColumn('full_name', f.concat(f.col('first'), f.lit('_'), f.col('last')))
From Spark 2.4+:
We can use the array_join and array_sort functions for this case.
Example:
df.show()
#+-----+----+
#|first|last|
#+-----+----+
#| a| b|
#| e| c|
#| d| a|
#+-----+----+
from pyspark.sql.functions import *
#first build an array from the first and last columns, then sort it and join with "_"
df.withColumn("full_name",array_join(array_sort(array(col("first"),col("last"))),"_")).show()
#+-----+----+---------+
#|first|last|full_name|
#+-----+----+---------+
#| a| b| a_b|
#| e| c| c_e|
#| d| a| a_d|
#+-----+----+---------+
I have a table in HIVE/PySpark with A, B and C columns.
I want to get unique values for each of the column like
{A: [1, 2, 3], B:[a, b], C:[10, 20]}
in any format (dataframe, table, etc.)
How to do this efficiently (in parallel for each column) in HIVE or PySpark?
Current approach that I have does this for each column separately and thus is taking a lot of time.
We can use collect_set() from the pyspark.sql.functions module:
>>> df = spark.createDataFrame([(1,'a',10),(2,'a',20),(3,'b',10)],['A','B','C'])
>>> df.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| a| 10|
| 2| a| 20|
| 3| b| 10|
+---+---+---+
>>> from pyspark.sql import functions as F
>>> df.select([F.collect_set(x).alias(x) for x in df.columns]).show()
+---------+------+--------+
| A| B| C|
+---------+------+--------+
|[1, 2, 3]|[b, a]|[20, 10]|
+---------+------+--------+