how to create & sort by an ordered categorical variable in pyspark - dataframe

I'm migrating some code from pandas to pyspark. My source dataframe looks like this:
   a         b  c
0  1    insert  1
1  2    update  1
2  3      seed  1
3  4    insert  2
4  5    update  2
5  6    delete  2
6  7  snapshot  1
and the operation (in python / pandas) that I'm applying is:
df.b = pd.Categorical(df.b, ordered=True, categories=['insert', 'seed', 'update', 'snapshot', 'delete'])
df.sort_values(['c', 'b'])
resulting in the output dataframe:
   a         b  c
0  1    insert  1
2  3      seed  1
1  2    update  1
6  7  snapshot  1
3  4    insert  2
4  5    update  2
5  6    delete  2
I'm unsure how best to set up ordered categoricals using pyspark. My initial approach creates a precedence column with case/when, which I then intend to use for sorting:
from pyspark.sql.functions import col, when

df = df.withColumn(
    "_precedence",
    when(col("b") == "insert", 1)
    .when(col("b") == "seed", 2)
    .when(col("b") == "update", 3)
    .when(col("b") == "snapshot", 4)
    .when(col("b") == "delete", 5)
)
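Concretely, the idea is to sort on the helper column and drop it afterwards, roughly:
# sort by c, then by the categorical precedence, and drop the helper column
df.orderBy("c", "_precedence").drop("_precedence").show()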

You can use a map:
from pyspark.sql.functions import create_map, lit, col
categories=['insert', 'seed', 'update', 'snapshot', 'delete']
# per @HaleemurAli, adjusted the below list comprehension to create the map
map1 = create_map([val for (i, c) in enumerate(categories) for val in (lit(c), lit(i))])
#Column<b'map(insert, 0, seed, 1, update, 2, snapshot, 3, delete, 4)'>
df.orderBy('c', map1[col('b')]).show()
+---+---+--------+---+
| id| a| b| c|
+---+---+--------+---+
| 0| 1| insert| 1|
| 2| 3| seed| 1|
| 1| 2| update| 1|
| 6| 7|snapshot| 1|
| 3| 4| insert| 2|
| 4| 5| update| 2|
| 5| 6| delete| 2|
+---+---+--------+---+
To reverse the order on column b: df.orderBy('c', map1[col('b')].desc()).show()

You could also do this using coalesce with your when statements.
from pyspark.sql import functions as F
categories=['insert', 'seed', 'update', 'snapshot', 'delete']
cols = [F.when(F.col("b") == cat, F.lit(i)) for i, cat in enumerate(categories, start=1)]
df.orderBy("c",F.coalesce(*cols)).show()
#+---+--------+---+
#| a| b| c|
#+---+--------+---+
#| 1| insert| 1|
#| 3| seed| 1|
#| 2| update| 1|
#| 7|snapshot| 1|
#| 4| insert| 2|
#| 5| update| 2|
#| 6| delete| 2|
#+---+--------+---+
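Analogously to the map-based answer, the order on column b can be reversed here too (a sketch following the same pattern, not shown in the original):
df.orderBy("c", F.coalesce(*cols).desc()).show()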

Related

pyspark get latest non-null element of every column in one row

Let me explain my question using an example:
I have a dataframe:
import pandas as pd

pd_1 = pd.DataFrame({'day': [1, 2, 3, 2, 1, 3],
                     'code': [10, 10, 20, 20, 30, 30],
                     'A': [44, 55, 66, 77, 88, 99],
                     'B': ['a', None, 'c', None, 'd', None],
                     'C': [None, None, '12', None, None, None]})
df_1 = spark.createDataFrame(pd_1)
df_1.show()
Output:
+---+----+---+----+----+
|day|code| A| B| C|
+---+----+---+----+----+
| 1| 10| 44| a|null|
| 2| 10| 55|null|null|
| 3| 20| 66| c| 12|
| 2| 20| 77|null|null|
| 1| 30| 88| d|null|
| 3| 30| 99|null|null|
+---+----+---+----+----+
What I want to achieve is a new dataframe, each row corresponds to a code, and for each column I want to have the most recent non-null value (with highest day).
In pandas, I can simply do
pd_2 = pd_1.sort_values('day', ascending=True).groupby('code').last()
pd_2.reset_index()
to get
   code  day   A  B     C
0    10    2  55  a  None
1    20    3  66  c    12
2    30    3  99  d  None
My question is, how can I do it in pyspark (preferably version < 3)?
What I have tried so far is:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy('code').orderBy(F.desc('day')).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
## Update: after applying @Steven's idea to remove the for loop:
df_1 = df_1.select([F.collect_list(x).over(w).getItem(0).alias(x) for x in df_1.columns])
##for x in df_1.columns:
## df_1 = df_1.withColumn(x, F.collect_list(x).over(w).getItem(0))
df_1 = df_1.distinct()
df_1.show()
Output
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
Which I'm not very happy with, especially due to the for loop.
I think your current solution is quite nice. If you want another one, you can try using the first/last window functions:
from pyspark.sql import functions as F, Window
w = Window.partitionBy("code").orderBy(F.col("day").desc())
df2 = (
    df.select(
        "day",
        "code",
        F.row_number().over(w).alias("rwnb"),
        *(
            F.first(F.col(col), ignorenulls=True)
            .over(w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
            .alias(col)
            for col in ("A", "B", "C")
        ),
    )
    .where("rwnb = 1")
    .drop("rwnb")
)
and the result:
df2.show()
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
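A last()-based variant should give the same result (a sketch over the same df, not part of the original answer), with the window ordered ascending by day:
from pyspark.sql import functions as F, Window

w_asc = (
    Window.partitionBy("code")
    .orderBy("day")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

# take the last non-null value per column within each code, then collapse the duplicates
df3 = df.select(
    "code",
    *[F.last(c, ignorenulls=True).over(w_asc).alias(c) for c in ("day", "A", "B", "C")],
).distinct()
df3.show()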
Here's another way of doing it, using array functions and struct ordering instead of Window. Structs compare field by field, so with day listed first, array_max keeps the entry with the highest day among those where the column is non-null:
from pyspark.sql import functions as F

other_cols = ["day", "A", "B", "C"]

df_1 = df_1.groupBy("code").agg(
    F.collect_list(F.struct(*other_cols)).alias("values")
).selectExpr(
    "code",
    *[f"array_max(filter(values, x -> x.{c} is not null))['{c}'] as {c}" for c in other_cols]
)
df_1.show()
#+----+---+---+---+----+
#|code|day| A| B| C|
#+----+---+---+---+----+
#| 10| 2| 55| a|null|
#| 30| 3| 99| d|null|
#| 20| 3| 66| c| 12|
#+----+---+---+---+----+

PySpark pivot as SQL query

Looking to write the full-SQL equivalent of a pivot implemented in pyspark. Code below creates a pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    'id': ['a','a','a','b','b','b','b','c','c'],
    'name': ['up','down','left','up','down','left','right','up','down'],
    'count': [6,7,5,3,4,2,9,12,4]})
#   id   name  count
# 0  a     up      6
# 1  a   down      7
# 2  a   left      5
# 3  b     up      3
# 4  b   down      4
# 5  b   left      2
# 6  b  right      9
# 7  c     up     12
# 8  c   down      4
Code below then converts to a pyspark DataFrame and implements a pivot on the name column.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
ds = spark.createDataFrame(df)
dp = ds.groupBy('id').pivot('name').max().toPandas()
#   id  down  left  right  up
# 0  c     4   NaN    NaN  12
# 1  b     4   2.0    9.0   3
# 2  a     7   5.0    NaN   6
I'm trying to do the equivalent of ds.groupBy('id').pivot('name').max() in full SQL, i.e. something like:
ds.createOrReplaceTempView('ds')
dp = spark.sql(f"""
SELECT * FROM ds
PIVOT
(MAX(count)
FOR
...)""").toPandas()
Taking reference from SparkSQL Pivot -
Pivot
spark.sql(f"""
SELECT * FROM ds
PIVOT (
MAX(count)
FOR name in ('up','down','left','right')
)""").show()
+---+---+----+----+-----+
| id| up|down|left|right|
+---+---+----+----+-----+
| c| 12| 4|null| null|
| b| 3| 4| 2| 9|
| a| 6| 7| 5| null|
+---+---+----+----+-----+
Dynamic Approach
You can dynamically create the PIVOT clause; I tried to create a general wrapper around it:
def pivot_by(inp_df, by):
    distinct_by = inp_df[by].unique()
    distinct_name_str = ''
    for i, name in enumerate(distinct_by):
        if i == 0:
            distinct_name_str += f'\'{name}\''
        else:
            distinct_name_str += f',\'{name}\''
    final_str = f'FOR {by} in ({distinct_name_str})'
    return final_str
pivot_clause_str = pivot_by(df,'name')
### O/p - FOR name in ('up','down','left','right')
spark.sql(f"""
SELECT * FROM ds
PIVOT (
MAX(count)
{pivot_clause_str}
)""").show()
+---+---+----+----+-----+
| id| up|down|left|right|
+---+---+----+----+-----+
| c| 12| 4|null| null|
| b| 3| 4| 2| 9|
| a| 6| 7| 5| null|
+---+---+----+----+-----+
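As an aside, the same helper can be written more compactly (a sketch equivalent to the function above):
def pivot_by(inp_df, by):
    # join the distinct values into a quoted, comma-separated list
    quoted = ','.join(f"'{v}'" for v in inp_df[by].unique())
    return f'FOR {by} in ({quoted})'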
Dynamic Approach Usage
pivot_clause_str = pivot_by(df,'id')
##O/p - FOR id in ('a','b','c')
ds.createOrReplaceTempView('ds')
spark.sql(f"""
SELECT * FROM ds
PIVOT (
MAX(count)
{pivot_clause_str}
)""").show()
+-----+----+---+----+
| name| a| b| c|
+-----+----+---+----+
| down| 7| 4| 4|
| left| 5| 2|null|
| up| 6| 3| 12|
|right|null| 9|null|
+-----+----+---+----+

How to get value_counts for a spark row?

I have a spark dataframe with 3 columns storing 3 different predictions. I want to know the count of each output value so that I can pick the value that was obtained the maximum number of times as the final output.
I can do this in pandas easily by calling my lambda function for each row to get value_counts as shown below. I have converted my spark df to pandas df here, but I need to be able to perform similar operation on the spark df directly.
from pyspark.sql import Row

r = [Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1 = spark.createDataFrame(r)
df1.show()
df2 = df1.toPandas()
r = df2.iloc[0]
val_counts = r[['run_1','run_2','run_3']].value_counts()
print(val_counts)
top_val = val_counts.index[0]
top_val_cnt = val_counts.values[0]
print('Majority output = %s, occurred %s out of 3 times' % (top_val, top_val_cnt))
The output tells me that the value 1 occurred the most, twice in this case:
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occurred 2 out of 3 times
I am trying to write a udf function which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using spark df?
This is Scala, but the Python code should be similar; maybe it will help you:
val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()

df1.select(array('*)).map(s => {
  val list = s.getList(0)
  (list.toString(), list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
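A rough PySpark equivalent of that idea (a sketch using Spark 2.4+ array and map functions, not code from the original answer; df1 and the run_1/run_2/run_3 columns are the ones from the question):
from pyspark.sql import functions as F

runs = ["run_1", "run_2", "run_3"]

# collect the three predictions into an array and count how often each value occurs
counted = df1.withColumn("runs", F.array(*[F.col(c) for c in runs])).withColumn(
    "value_counts",
    F.expr("map_from_arrays(array_distinct(runs), "
           "transform(array_distinct(runs), v -> size(filter(runs, x -> x = v))))"),
)

# pick the value with the highest count (ties broken by the larger value)
top = counted.withColumn(
    "best",
    F.array_max(F.expr("transform(map_keys(value_counts), "
                       "k -> named_struct('cnt', value_counts[k], 'val', k))")),
).select("id", F.col("best.val").alias("top_val"), F.col("best.cnt").alias("top_val_cnt"))

top.show()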
Let's have a test dataframe similar to yours.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4)]
df = spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])

newdf = df.rdd.map(lambda x: (x[0], x[1], x[2:])) \
    .map(lambda x: (x[0], x[1], x[2][0], x[2][1], x[2][2], [max(set(x[2]), key=x[2].count)])) \
    .toDF(['id','test','run_1','run_2','run_3','most_frequent'])
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
Or you may need to handle the case when every item in the list is different, i.e. return a null.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4),(4,'test run',1,2,3)]
df = spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])

from pyspark.sql.functions import udf

@udf("int")  # the runs are integers, so declare an int return type (the default is string)
def most_frequent(*mylist):
    counter = 1
    num = mylist[0]
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
            return num
    else:
        return None
The counter is initialized to 1, so a value is returned only if its count is greater than 1; otherwise the udf returns null.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+

Partition PySpark DataFrame depending on unique values in column (Custom Partitioning)

I have a PySpark data frame in which I have separate columns for names, types, days and values. An example of the dataframe can be seen below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
| name4| b| 1| 145|
| name5| b| 1| 185|
| name6| c| 1| 155|
| name7| c| 1| 160|
| name8| a| 2| 120|
| name9| a| 2| 110|
|name10| b| 2| 125|
|name11| b| 2| 185|
|name12| c| 3| 195|
+------+----+---+-----+
For a selected value of Type, I want to create separate dataframes depending on the unique values of the column titled Day. Let's say I have chosen a as my preferred Type. In the aforementioned example, I have three unique values of Day (viz. 1, 2, 3). For each unique value of Day that has a row with the chosen Type a (that is, days 1 and 2 in the above data), I want to create a dataframe containing all rows with the chosen Type and Day. In the example above, I will have the two dataframes below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
+------+----+---+-----+
and
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name8| a| 2| 120|
| name9| a| 2| 110|
+------+----+---+-----+
How can I do this? In the actual data that I will be working with, I have millions of columns. So, I want to know about the most efficient way in which I can realize the above mentioned aim.
You can use the below mentioned code to generate the example given above.
from pyspark.sql import *
import numpy as np
Stats = Row("Name", "Type", "Day", "Value")
stat1 = Stats('name1', 'a', 1, 140)
stat2 = Stats('name2', 'a', 1, 180)
stat3 = Stats('name3', 'a', 1, 150)
stat4 = Stats('name4', 'b', 1, 145)
stat5 = Stats('name5', 'b', 1, 185)
stat6 = Stats('name6', 'c', 1, 155)
stat7 = Stats('name7', 'c', 1, 160)
stat8 = Stats('name8', 'a', 2, 120)
stat9 = Stats('name9', 'a', 2, 110)
stat10 = Stats('name10', 'b', 2, 125)
stat11 = Stats('name11', 'b', 2, 185)
stat12 = Stats('name12', 'c', 3, 195)
df = spark.createDataFrame([stat1, stat2, stat3, stat4, stat5, stat6, stat7, stat8,
                            stat9, stat10, stat11, stat12])
You can just use df.repartition("Type", "Day")
See the docs for DataFrame.repartition.
When I validate using the following function, I get the output shown below.
def validate(partition):
    count = 0
    for row in partition:
        print(row)
        count += 1
    print(count)
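The function would be applied per partition with something like the following (the exact call is not shown above, so this is a sketch):
df.repartition("user_id").rdd.foreachPartition(validate)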
My data
+------+--------------------+-------+-------+
|amount| trans_date|user_id|row_num|
+------+--------------------+-------+-------+
| 99.1|2019-06-04T00:00:...| 101| 1|
| 89.27|2019-06-04T00:00:...| 102| 2|
| 89.1|2019-03-04T00:00:...| 102| 3|
| 73.11|2019-09-10T00:00:...| 103| 4|
|-69.81|2019-09-11T00:00:...| 101| 5|
| 12.51|2018-12-14T00:00:...| 101| 6|
| 43.23|2018-09-11T00:00:...| 101| 7|
+------+--------------------+-------+-------+
After df.repartition("user_id") I get the following:
Output
Row(amount=73.11, trans_date='2019-09-10T00:00:00.000+05:30', user_id='103', row_num=4)
1
Row(amount=89.27, trans_date='2019-06-04T00:00:00.000+05:30', user_id='102', row_num=2)
Row(amount=89.1, trans_date='2019-03-04T00:00:00.000+05:30', user_id='102', row_num=3)
2
Row(amount=99.1, trans_date='2019-06-04T00:00:00.000+05:30', user_id='101', row_num=1)
Row(amount=-69.81, trans_date='2019-09-11T00:00:00.000+05:30', user_id='101', row_num=5)
Row(amount=12.51, trans_date='2018-12-14T00:00:00.000+05:30', user_id='101', row_num=6)
Row(amount=43.23, trans_date='2018-09-11T00:00:00.000+05:30', user_id='101', row_num=7)
4
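If the goal is literally a separate DataFrame per Day for a chosen Type (rather than physical partitioning), a simple filter-based sketch over the example df would be:
chosen_type = "a"
typed = df.filter(df.Type == chosen_type)

# collect the distinct days that occur for the chosen type,
# then build one filtered DataFrame per day
days = [r["Day"] for r in typed.select("Day").distinct().collect()]
frames = {d: typed.filter(typed.Day == d) for d in days}

frames[1].show()
frames[2].show()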

Spark SQL: Is there a way to distinguish columns with same name?

I have a csv whose header contains columns with the same name.
I want to process them with spark using only SQL and be able to refer to these columns unambiguously.
Ex.:
id  name    age  height  name
1   Alex    23   1.70
2   Joseph  24   1.89
I want to get only the first name column, using only Spark SQL.
As mentioned in the comments, I think that the less error prone method would be to have the schema of the input data changed.
Yet, in case you are looking for a quick workaround, you can simply index the duplicated names of the columns.
For instance, let's create a dataframe with three id columns.
val df = spark.range(3)
.select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id| x| id| y| id|
+---+---+---+---+---+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+---+---+---+---+---+
Then I can use toDF to set new column names. Let's consider that I know that only id is duplicated. If we don't know that, adding the extra logic to figure out which columns are duplicated would not be very difficult.
var i = -1
val names = df.columns.map( n =>
  if (n == "id") {
    i += 1
    s"id_$i"
  } else n )
val new_df = df.toDF(names : _*)
new_df.show
+----+---+----+---+----+
|id_0| x|id_1| y|id_2|
+----+---+----+---+----+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+----+---+----+---+----+
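For a PySpark version of the same renaming trick (a sketch, assuming df is a DataFrame with duplicated column names):
from collections import Counter

dup_names = {name for name, n in Counter(df.columns).items() if n > 1}

seen = Counter()
new_names = []
for name in df.columns:
    if name in dup_names:
        new_names.append(f"{name}_{seen[name]}")  # e.g. id -> id_0, id_1, id_2
        seen[name] += 1
    else:
        new_names.append(name)

new_df = df.toDF(*new_names)
new_df.show()
After that, the first duplicated column can be referenced unambiguously in SQL, e.g. register new_df as a temp view and SELECT name_0 (or id_0 in the example above).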