Update pyspark dataframe from a column having the target column values - dataframe

I have a dataframe with a column ('target_column' in this case) that names another column, and I need to update the named columns with the 'val' column values.
I have tried udfs and .withColumn, but they all expect a fixed column name; in my case it can vary per row. Using RDD map transformations didn't work either, since RDDs are immutable.
from pyspark.sql import SparkSession

def test():
    data = [("jose_1", "mase", "firstname", "jane"), ("li_1", "ken", "lastname", "keno"), ("liz_1", "durn", "firstname", "liz")]
    source_df = spark.createDataFrame(data, ["firstname", "lastname", "target_column", "val"])
    source_df.show()

if __name__ == "__main__":
    spark = SparkSession.builder.appName('Name Group').getOrCreate()
    test()
    spark.stop()
Input:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
| jose_1| mase| firstname|jane|
| li_1| ken| lastname|keno|
| liz_1| durn| firstname| liz|
+---------+--------+-------------+----+
Expected output:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
| jane| mase| firstname|jane|
| li_1| keno| lastname|keno|
| liz| durn| firstname| liz|
+---------+--------+-------------+----+
For example, in the first input row the target_column is 'firstname' and val is 'jane', so I need to update firstname to 'jane' in that row.
Thanks

You can do a loop over all your columns:
from pyspark.sql import functions as F

for col in df.columns:
    df = df.withColumn(
        col,
        F.when(
            F.col("target_column") == F.lit(col),
            F.col("val")
        ).otherwise(F.col(col))
    )
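Applied to the question's source_df, this loop reproduces the expected output. A minimal end-to-end sketch (data and column names are taken from the question; df is just source_df under another name):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('Name Group').getOrCreate()

data = [("jose_1", "mase", "firstname", "jane"),
        ("li_1", "ken", "lastname", "keno"),
        ("liz_1", "durn", "firstname", "liz")]
df = spark.createDataFrame(data, ["firstname", "lastname", "target_column", "val"])

# For every column, replace its value with `val` on the rows where
# `target_column` names that column; other rows keep their original value.
for col in df.columns:
    df = df.withColumn(
        col,
        F.when(F.col("target_column") == F.lit(col), F.col("val")).otherwise(F.col(col))
    )

df.show()
# +---------+--------+-------------+----+
# |firstname|lastname|target_column| val|
# +---------+--------+-------------+----+
# |     jane|    mase|    firstname|jane|
# |     li_1|    keno|     lastname|keno|
# |      liz|    durn|    firstname| liz|
# +---------+--------+-------------+----+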

Related

pyspark- elif statement and assign position to extract a value

I have a dataframe like the example below:
productNo | prodcuctMT                        | productPR                        | productList
2389      | ['xy-5', 'yz-12','zb-56','iu-30'] | ['pr-1', 'pr-2', 'pr-3', 'pr-4'] | ['67230','7839','1339','9793']
6745      | ['xy-4', 'yz-34','zb-8','iu-9']   | ['pr-6', 'pr-1', 'pr-3', 'pr-7'] | ['1111','0987','8910','0348']
I would like to use an elif-style statement for multiple conditions: look at productMT, and if it passes the condition, look at productPR and take the position at which it satisfies the condition.
If productMT contains 'xy-5', then if productPR contains 'pr-1', take its position and add a new column with the value at that position from productList.
productNo | prodcuctMT                        | productPR                        | productList                    | productList
2389      | ['xy-5', 'yz-12','zb-56','iu-30'] | ['pr-1', 'pr-2', 'pr-3', 'pr-4'] | ['67230','7839','1339','9793'] | 67230
I tried using a filter, but it only applies one condition and I need to run multiple conditions so it loops through all rows:
filtered = F.filter(
    F.arrays_zip('productList', 'prodcuctMT', 'productPR'),
    lambda x: (x.prodcuctMT == 'xy-5') & (x.productPR != 'pr-1')
)
df_array_pos = df_array.withColumn('output', filtered[0].productList).withColumn('flag', filtered[0].prodcuctMT)
You just need to chain multiple when functions, one for each elif condition you want.
Your sample data:
df = spark.createDataFrame([
    (2389, ['xy-5', 'yz-12','zb-56','iu-30'], ['pr-1', 'pr-2', 'pr-3', 'pr-4'], ['67230','7839','1339','9793']),
    (6745, ['xy-4', 'yz-34','zb-8','iu-9'], ['pr-6', 'pr-1', 'pr-3', 'pr-7'], ['1111','0987','8910','0348']),
], ['productNo', 'productMT', 'productPR', 'productList'])
+---------+---------------------------+------------------------+-------------------------+
|productNo|productMT |productPR |productList |
+---------+---------------------------+------------------------+-------------------------+
|2389 |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|
|6745 |[xy-4, yz-34, zb-8, iu-9] |[pr-6, pr-1, pr-3, pr-7]|[1111, 0987, 8910, 0348] |
+---------+---------------------------+------------------------+-------------------------+
You can chain as many when clauses as you like:
from pyspark.sql import functions as F

(df
    .withColumn('output', F
        .when(F.array_contains('productMT', 'xy-5') & F.array_contains('productPR', 'pr-1'),
              F.col('productList')[F.array_position('productMT', 'xy-5') - 1])
    )
    .show(10, False)
)
+---------+---------------------------+------------------------+-------------------------+------+
|productNo|productMT |productPR |productList |output|
+---------+---------------------------+------------------------+-------------------------+------+
|2389 |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|67230 |
|6745 |[xy-4, yz-34, zb-8, iu-9] |[pr-6, pr-1, pr-3, pr-7]|[1111, 0987, 8910, 0348] |null |
+---------+---------------------------+------------------------+-------------------------+------+
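For example, a second "elif" branch can be chained as another when before a final otherwise. A sketch (the 'xy-4'/'pr-6' pair is only an illustrative condition, not taken from the question):

from pyspark.sql import functions as F

(df
    .withColumn('output', F
        .when(F.array_contains('productMT', 'xy-5') & F.array_contains('productPR', 'pr-1'),
              F.col('productList')[F.array_position('productMT', 'xy-5') - 1])
        .when(F.array_contains('productMT', 'xy-4') & F.array_contains('productPR', 'pr-6'),
              F.col('productList')[F.array_position('productMT', 'xy-4') - 1])
        .otherwise(F.lit(None))   # rows matching no branch stay null
    )
    .show(10, False)
)
# With this extra branch, row 6745 would pick up '1111' instead of null.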

Vectorized pandas udf in pyspark with dict lookup

I'm trying to learn to use pandas_udf in pyspark (Databricks).
One of the assignments is to write a pandas_udf to sort by day of the week. I know how to do this using spark udf:
from pyspark.sql.functions import *

data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)
print('Original')
df.show()

@udf()
def udf(day: str) -> str:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day] + '-' + day

print('with spark udf')
final_df = df.select(col('avg_users'), udf(col('day')).alias('day')).sort('day')
final_df.show()
Prints:
Original
+---+-----------+
|day| avg_users|
+---+-----------+
|Sun| 282905.5|
|Mon| 238195.5|
|Thu| 264620.0|
|Sat| 278482.0|
|Wed| 227214.0|
+---+-----------+
with spark udf
+-----------+-----+
| avg_users| day|
+-----------+-----+
| 238195.5|1-Mon|
| 227214.0|3-Wed|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 282905.5|7-Sun|
+-----------+-----+
Trying to do the same with pandas_udf:
import pandas as pd

@pandas_udf('string')
def p_udf(day: pd.Series) -> pd.Series:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day.str] + '-' + day.str

p_final_df = df.select(df.avg_users, p_udf(df.day))
print('with pandas udf')
p_final_df.show()
I get KeyError: <pandas.core.strings.accessor.StringMethods object at 0x7f31197cd9a0>. I think it's coming from dow[day.str], which kinda makes sense.
I also tried:
return dow[day.str.__str__()] + '-' + day.str # KeyError: .... StringMethods
return dow[str(day.str)] + '-' + day.str # KeyError: .... StringMethods
return dow[day.str.upper()] + '-' + day.str # TypeError: unhashable type: 'Series'
return f"{dow[day.str]}-{day.str}" # KeyError: .... StringMethods (but I think this is logically
# wrong, returning a string instead of a Series)
I've read:
API reference
PySpark equivalent for lambda function in Pandas UDF
How to convert Scalar Pyspark UDF to Pandas UDF?
Pandas UDF in pyspark
Using the .str accessor alone, without any actual vectorized transformation, was giving you the error. Also, you cannot use the whole Series as a key for your dow dict. Use the map method of pandas.Series:
from pyspark.sql.functions import *
import pandas as pd

data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)

@pandas_udf("string")
def p_udf(day: pd.Series) -> pd.Series:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return day.map(dow) + '-' + day

df.select(df.avg_users, p_udf(df.day).alias("day")).show()
+---------+-----+
|avg_users| day|
+---------+-----+
| 282905.5|7-Sun|
| 238195.5|1-Mon|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 227214.0|3-Wed|
+---------+-----+
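If you also want the rows ordered as in the spark udf version, sort on the derived column (same df and p_udf as above):

df.select(df.avg_users, p_udf(df.day).alias("day")).sort("day").show()

Because each value is prefixed with a single digit 1-7, the lexical sort matches weekday order.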
What about returning a dataframe using grouped data and applying an orderBy after the udf? Pandas sort_values is quite problematic within udfs.
Basically, in the udf I generate the numbers using Python and then concatenate them back onto the day column.
from pyspark.sql.functions import pandas_udf
import pandas as pd
from pyspark.sql.types import *
import calendar

def sortdf(pdf):
    day = pdf.day
    pdf = pdf.assign(day=(day.map(dict(zip(calendar.day_abbr, range(7)))) + 1).astype(str) + '-' + day)
    return pdf

df.groupby('avg_users').applyInPandas(sortdf, schema=df.schema).show()
+-----+---------+
| day|avg_users|
+-----+---------+
|3-Wed| 227214.0|
|1-Mon| 238195.5|
|4-Thu| 264620.0|
|6-Sat| 278482.0|
|7-Sun| 282905.5|
+-----+---------+
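As mentioned above, the ordering can then be added after applyInPandas (a small extension of the same example):

df.groupby('avg_users').applyInPandas(sortdf, schema=df.schema).orderBy('day').show()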

How to compute hypotenuses with pandas udf, pyspark

I want to write a pandas udf which will take two arguments, cathetus1 and cathetus2, from another dataframe and return the hypotenuse.
# this data is a list containing the cathetuses.
from pyspark.sql.types import StructType, StructField, DoubleType

data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1", DoubleType(), True), StructField("cathetus2", DoubleType(), True)])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
# and this creates the dataframe where only the cathetuses are shown.
This is the function I have written so far:
def pandaUdf(cat1, cat2):
    leg1 = []
    leg2 = []
    for i in data:
        x = 0
        leg1.append(i[x])
        leg2.append(i[x + 1])
        hypoData.append(np.hypot(leg1, leg2))
    return np.hypot(leg1, leg2)
    #example_series = pd.Series(data)
#example_series = pd.Series(data)
And I'm trying to create a new column in df whose values will be the hypotenuses:
df.withColumn(col('Hypo'), pandaUdf(example_df.cathetus1, example_df.cathetus2)).show()
But this gives me an error --> col should be Column.
I don't understand how I can fix this error or why it's even there.
The error comes from withColumn, which expects a column name string and a Column: 'Hypo' should be passed as a plain string, and pandaUdf as written is an ordinary Python function (not a registered UDF), so its result is not a Column. Beyond that, you can apply np.hypot on the 2 cathetus columns directly without extracting individual values.
from pyspark.sql import functions as F
from pyspark.sql.types import *
import pandas as pd
import numpy as np

data = [(3.0, 4.0), (6.0, 8.0), (3.3, 5.6)]
schema = StructType([StructField("cathetus1", DoubleType(), True), StructField("cathetus2", DoubleType(), True)])
df = spark.createDataFrame(data=data, schema=schema)
df.show()
"""
+---------+---------+
|cathetus1|cathetus2|
+---------+---------+
| 3.0| 4.0|
| 6.0| 8.0|
| 3.3| 5.6|
+---------+---------+
"""
def hypot(cat1: pd.Series, cat2: pd.Series) -> pd.Series:
    return np.hypot(cat1, cat2)

hypot_pandas_df = F.pandas_udf(hypot, returnType=FloatType())
df.withColumn("Hypo", hypot_pandas_df("cathetus1", "cathetus2")).show()
"""
+---------+---------+----+
|cathetus1|cathetus2|Hypo|
+---------+---------+----+
| 3.0| 4.0| 5.0|
| 6.0| 8.0|10.0|
| 3.3| 5.6| 6.5|
+---------+---------+----+
"""

cast a date to integer pyspark

Is it possible to convert a date column to an integer column in a pyspark dataframe? I tried 2 different ways, but every attempt returns a column with nulls. What am I missing?
from pyspark.sql.types import *
# DUMMY DATA
simpleData = [("James",34,"2006-01-01","true","M",3000.60),
("Michael",33,"1980-01-10","true","F",3300.80),
("Robert",37,"1992-07-01","false","M",5000.50)
]
columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df=df.withColumn("jobStartDate", df['jobStartDate'].cast(DateType()))
# ATTEMPT 1 with cast()
df=df.withColumn("jobStartDateAsInteger1", df['jobStartDate'].cast(IntegerType()))
# ATTEMPT 2 with selectExpr()
df=df.selectExpr("*","CAST(jobStartDate as int) as jobStartDateAsInteger2")
df.show()
You can try converting it to a UNIX timestamp using F.unix_timestamp():
from pyspark.sql.types import *
import pyspark.sql.functions as F
# DUMMY DATA
simpleData = [("James",34,"2006-01-01","true","M",3000.60),
("Michael",33,"1980-01-10","true","F",3300.80),
("Robert",37,"1992-07-01","false","M",5000.50)
]
columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df=df.withColumn("jobStartDate", df['jobStartDate'].cast(DateType()))
df=df.withColumn("jobStartDateAsInteger1", F.unix_timestamp(df['jobStartDate']))
df.show()
+---------+---+------------+-----------+------+------+----------------------+
|firstname|age|jobStartDate|isGraduated|gender|salary|jobStartDateAsInteger1|
+---------+---+------------+-----------+------+------+----------------------+
| James| 34| 2006-01-01| true| M|3000.6| 1136073600|
| Michael| 33| 1980-01-10| true| F|3300.8| 316310400|
| Robert| 37| 1992-07-01| false| M|5000.5| 709948800|
+---------+---+------------+-----------+------+------+----------------------+
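The direct casts return null because Spark does not support casting DateType straight to IntegerType. If you want days since the epoch rather than seconds, one alternative (not shown in the original answer) is datediff against a literal epoch date:

import pyspark.sql.functions as F

# Integer number of days between jobStartDate and 1970-01-01.
df = df.withColumn("jobStartDateAsDays", F.datediff(F.col("jobStartDate"), F.lit("1970-01-01")))
df.show()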

Recommendation - Creating a new dataframe with conditions

I've been studying Spark for a while, but today I got stuck. I'm working on a recommendation model using the Audioscrobbler dataset.
I have my model based on ALS and the following definition for making the recommendations:
def makeRecommendations(model: ALSModel, userID: Int, howMany: Int): DataFrame = {
  val toRecommend = model.itemFactors.select($"id".as("artist")).withColumn("user", lit(userID))
  model.transform(toRecommend).
    select("artist", "prediction", "user").
    orderBy($"prediction".desc).
    limit(howMany)
}
It's generating the expected output, but now I would like to create a new list of DataFrames using the predictions DF and the user data DF.
DataFrame Example
The new list of DFs consists of the predicted value from the predictions DF and a "Listened" column that is 0 if the user didn't listen to the artist and 1 if the user did, something like this:
Expected DF
I tried the following solution:
val recommendationsSeq = someUsers.map { userID =>
  // Gets the artists from the user in testData
  val artistsOfUser = testData.where($"user" === userID).select("artist").rdd.map(r => r(0)).collect.toList
  // Recommendations for each user
  val recoms = makeRecommendations(model, userID, numRecom)
  // Insert a column "listened" with 1 if the artist is in the test set for the user and 0 otherwise
  val recomOutput = recoms.withColumn("listened", when($"artist".isin(artistsOfUser: _*), 1.0).otherwise(0.0)).drop("artist")
  (recomOutput)
}.toSeq
But it's very time-consuming when the recommendations cover more than 30 users. I believe there's a better way to do it.
Could someone give me some ideas?
Thanks,
You can try joining the dataframes, then groupBy and count:
scala> val df1 = Seq((1205,0.9873411,1000019)).toDF("artist","prediction","user")
scala> df1.show()
+------+----------+-------+
|artist|prediction| user|
+------+----------+-------+
| 1205| 0.9873411|1000019|
+------+----------+-------+
scala> val df2 = Seq((1000019,1205,40)).toDF("user","artist","playcount")
scala> df2.show()
+-------+------+---------+
| user|artist|playcount|
+-------+------+---------+
|1000019| 1205| 40|
+-------+------+---------+
scala> df1.join(df2,Seq("artist","user")).groupBy('prediction).count().show()
+----------+-----+
|prediction|count|
+----------+-----+
| 0.9873411| 1|
+----------+-----+