pyspark: DataFrame - UDF with multiple arguments

I have a dataframe where a column holds an array and each element of the array is a dictionary (struct). For example, the class / product column contains rows like:
[{"deleteDate": null, "class":"AB", "validFrom": "2022-09-01", "validTo": "2009-08-31"}, {"deleteDate": null, "class":"CD", "validFrom": "2009-09-01", "validTo": "2024-08-31"}]
[{"deleteDate": "2021-09-01", "class":"AB", "validFrom": "2003-09-01", "validTo": "2009-03-01"}, {"deleteDate": null, "class":"CD", "validFrom": "2009-09-01", "validTo": "2024-08-31"}]
I am trying to filter elements based on a few conditions.
def getelement(value, entity):
    list_url = []
    for i in range(len(value)):
        if value[i] is not None and (value[i].deleteDate is None):
            if (value[i].validFrom <= (Date of Today)) & (value[i].validFrom >= (date of today)):
                list_url.append(value[i].entity)
    if list_url:
        return str(list_url[-1])
    if not list_url:
        return None

udfgeturl = F.udf(lambda z: getelement(z) if z is not None else "", StringType())
master = df.withColumn('ClassName', udfgeturl('Class'))
The function takes two arguments, value and entity, where value refers to the column name and entity refers to the key in the dictionary whose result I want to save.
The code works with a single argument, getelement(value), as a UDF, but I do not know how the UDF can take two arguments. Any suggestion, please?

To improve performance (see Spark functions vs UDF performance?), you could use only Spark transformations:
I'm assuming (value[i].validFrom >= (date of today)) is actually supposed to be (value[i].validTo >= (date of today)).
import pyspark.sql.functions as f

def getelement(df, value, entity):
    # keep only the array elements that are not deleted and are valid today,
    # take the requested field from each, and return the last remaining value
    valid = f.expr(
        f'filter({value}, element -> (element.deleteDate is null) '
        'AND (element.validFrom <= current_date()) '
        'AND (element.validTo >= current_date()))'
    )
    df = df.withColumn('output', f.element_at(valid[entity], -1))
    return df
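For example, assuming the array column is named product and we want the class field of each struct (hypothetical names taken from the sample data above), a call could look like:
# hypothetical usage: 'product' is the array column, 'class' the struct field
result = getelement(df, 'product', 'class')
result.select('output').show(truncate=False)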

You can use a struct to bundle the parameters into one object. Then access the elements of the struct with the . operator.
Code example:
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

def getelement(row):
    # unpack the two values that were bundled into the struct
    value = row.value
    entity = row.entity
    return str(entity + " " + value)

udfgeturl = f.udf(getelement, StringType())

df.select(
    udfgeturl(
        f.struct(
            f.col("col1").alias("value"),
            f.col("col2").alias("entity"))
    )
).show()
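As a sketch of how this pattern could be wired back to the question (the column name Class and the key name class are assumptions), the array column and the key can be bundled into one struct and unpacked inside the UDF:
# sketch under assumptions: 'Class' is the array-of-structs column and
# 'class' is the key to extract; the date filtering is omitted for brevity
def getelement_struct(bundle):
    value, entity = bundle.value, bundle.entity
    if value is None:
        return None
    picked = [v[entity] for v in value
              if v is not None and v['deleteDate'] is None]
    return str(picked[-1]) if picked else None

udfgeturl2 = f.udf(getelement_struct, StringType())
master = df.withColumn(
    'ClassName',
    udfgeturl2(f.struct(f.col('Class').alias('value'),
                        f.lit('class').alias('entity')))
)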

Related

Computing the mean of a given column from a dataframe

I need to find the arithmetic mean of each column and return res.
import pandas as pd

def ave(df, name):
    df = pd.DataFrame({
        'Courses':["Spark","PySpark","Python","pandas",None],
        'Fee' :[20000,25000,22000,None,30000],
        'Duration':['30days','40days','35days','None','50days'],
        'Discount':[1000,2300,1200,2000,None]})
    #CODE HERE
    res = []
    for i in df.columns:
        res.append(col_ave(df, i))
I tried individually writing code for the mean, but I'm having trouble.
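A minimal sketch of what the missing col_ave helper could look like, assuming df is a pandas DataFrame and that non-numeric columns should yield None:
# minimal sketch, assuming pandas; non-numeric columns return None
import pandas as pd

def col_ave(df, name):
    col = pd.to_numeric(df[name], errors='coerce')  # non-numeric values become NaN
    mean = col.mean()                               # NaN values are skipped by default
    return None if pd.isna(mean) else mean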

Extract 1st column data and update 2nd column based on 1st column data

I have an excel file with the following data:
LogID
T-1111
P-09899
P-09189,T-0011
T-111,T-2111
P-09099,P-7897
RCT-0989,RCT-099
I need to extract the prefix of the LogID column before the delimiter "-" and then populate a second column 'LogType' based on the string extracted (T is the Tank LogType, P is the Pump LogType).
For the above input, the output should be:
LogID              LogType
T-1111             Tank
P-09899            Pump
P-09189,T-0011     Multiple
T-111,T-2111       Tank
P-09099,P-7897     Pump
RCT-0989,RCT-099   Reactor
I have written a function to do this in python:
def log_parser(log_string):
    log_dict = { "T":"Tank","P":"Pump" }
    log_list = log_string.split(",")
    for i in log_list:
        str_extract = i.upper().split("-",1)
        if len(log_list) == 1:
            result = log_dict[str_extract[0]]
            return result
            break
        else:
            idx = log_list.index(i)
            for j in range(len(log_list)):
                if (idx == j):
                    continue
                str_extract_j = log_list[j].upper().split("-",1)
                if str_extract_j[0] != str_extract[0]:
                    result = "Multiple"
                    return result
                    break
                else:
                    result = log_dict[str_extract[0]]
                    return result
I am not sure how to implement this function in pandas.
Can I define the function and then use a lambda with apply like this:
test_df['LogType'] = test_df[['LogID']].apply(lambda x: log_parser(x), axis=1)
You can use:
# mapping dictionary for types
d = {'T': 'Tank', 'P': 'Pump'}
# extract letters before -
s = df['LogID'].str.extractall('([A-Z])-')[0]
# group by index
g = s.groupby(level=0)
df['LogType'] = (g.first()                  # get first match
                  .map(d)                   # map type name
                  .mask(g.nunique().gt(1),  # mask if several types
                        'Multiple')
                )
Output:
LogID LogType
0 T-1111 Tank
1 P-09899 Pump
2 P-09189,T-0011 Multiple
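Note that the pattern ([A-Z])- captures only the single letter immediately before "-", so the RCT-... rows are grouped under T. A possible tweak (an assumption, not part of the original answer) to distinguish Reactor would be to capture the whole prefix and extend the mapping:
# possible tweak (assumption, not part of the original answer):
# capture the full prefix and add 'RCT' to the mapping dictionary
d = {'T': 'Tank', 'P': 'Pump', 'RCT': 'Reactor'}
s = df['LogID'].str.extractall(r'([A-Z]+)-')[0]
g = s.groupby(level=0)
df['LogType'] = g.first().map(d).mask(g.nunique().gt(1), 'Multiple')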

How to add a "when" condition only when the dataframe has the respective column

I'm using spark-sql-2.4.1v with Java 8.
I have a scenario where I need to add a when condition on a column only if that column exists in the respective dataframe. How can this be done?
Ex :
val df = ... // may contain columns abc, x or y or both, depending on some business logic
val result_df = df
.withColumn("new_column", when(col("abc") === "a" , concat(col("x"),lit("_"),col("y"))))
The problem here is that sometimes df may not contain the "x" column; in that case result_df should just get the "y" value. But the statement above throws an error because the "x" column is not present in df at that point.
So how can I check whether a column (i.e. "x") is present and use it in concat(), and otherwise go with the remaining column (i.e. "y")?
The reverse is also possible, i.e. only col("x") is present in the df but not col("y"). When both columns x and y are available in the df, it works fine.
Question: how can I add the condition in the when clause only when the column is present in the dataframe?
One correction to the question: if a column is not there, I should not go into that withColumn call at all.
Ex :
If column x is present:
val result_df = df
.withColumn("new_x", when(col("abc") === "a" , concat(col("x"))))
If column y is present:
val result_df = df
.withColumn("new_y", when(col("abc") === "a" , concat(col("y"))))
If both columns x and y are present:
val result_df = df
.withColumn("new_x", when(col("abc") === "a" , concat(col("x"))))
.withColumn("new_y", when(col("abc") === "a" , concat(col("y"))))
.withColumn("new_x_y", when(col("abc") === "a" , concat(col("x"),lit("_"),col("y"))))
You could achieve that by building the list of columns dynamically using the columns property and a simple Scala/Java if statement. The list includes targetColumn only if that column is found in the dataframe schema (Scala code):
import org.apache.spark.sql.functions.{col, concat_ws, when}
// the column we should check for. Change this accordingly to your requirements i.e "y"
val targetColumn = "x"
var concatItems = Seq(col("y"))
// add targetColumn if found in df.columns
if (df.columns.contains(targetColumn))
concatItems = concatItems :+ col(targetColumn)
df.withColumn("new_column", when(col("abc") === "a", concat_ws("_", concatItems:_*)))
Note that instead of concat we use concat_ws, since it automatically checks whether concatItems contains one or more items and applies the _ separator accordingly.
Update:
Here is the new updated code using a select statement:
import org.apache.spark.sql.Column

var selectExpr: Seq[Column] = null
if (df.columns.contains("x") && df.columns.contains("y"))
  selectExpr = Seq(
    when(col("abc") === "a", col("x")).as("new_x"),
    when(col("abc") === "a", col("y")).as("new_y"),
    when(col("abc") === "a", concat_ws("_", col("x"), col("y"))).as("new_x_y")
  )
else if (df.columns.contains("y"))
  selectExpr = Seq(when(col("abc") === "a", col("y")).as("new_y"))
else
  selectExpr = Seq(when(col("abc") === "a", col("x")).as("new_x"))

df.select(selectExpr:_*)
Note that we don't need to use withColumn; select is exactly what you need for your case.
You need to do this with your language's native flow control, e.g. in Python/PySpark with if/else statements.
The reason is that Spark DataFrame functions work on columns, so you cannot apply a .when() condition that checks column names; it only looks at the values within the columns and applies the logic/condition row-wise.
E.g. for F.when(F.col(x) == F.col(y), ...), Spark translates it to Java code that applies that logic row-wise across the two columns.
This also makes sense if you think of Spark DataFrames as being made up of Row objects, so the condition is applied to every object (row), which looks like [Row(x=2), Row(y=5)].
from pyspark.sql.functions import col, concat, lit

def check_columns(df, col_x, col_y, concat_name):
    '''
    df: spark dataframe
    col_x & col_y: the two cols to concat if both present
    concat_name: name for new concated col
    -----------------------------------------------------
    returns: df with the new concated col if both x & y cols are present,
             otherwise the df with only the existing x or y col
    '''
    cols = [col_x, col_y]
    if all(item in df.columns for item in cols):
        df = df.withColumn(concat_name, concat(col(col_x), lit("_"), col(col_y)))
    return df
We only need to apply the action if both x & y are present; if only one of them is, the function returns the df with the existing x or y column anyway.
I would apply something like the above, saved as a function for re-usability.
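For example (hypothetical column and output names), it could be called as shown below:
# hypothetical usage: 'new_x_y' is only created when both 'x' and 'y' exist in df
df = check_columns(df, "x", "y", "new_x_y")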
What you can do with .when() is concatenate values only where a condition is met row-wise; this gives you a column with concatenated values wherever the condition holds:
df = df.withColumn(
    'concat_col',
    F.when(F.col('age') < F.lit('18'),
           F.concat(F.col('name'), F.lit('_underAge')))
     .otherwise(F.concat(F.col('name'), F.lit('_notUnderAge')))
)
Hope this helps!

Dataframe column filter from a list of tuples

I'm trying to create a function to filter a dataframe from a list of tuples. I've created the below function but it doesn't seem to be working.
Each tuple in the list has a dataframe column name, a min value, and a max value to filter on.
eg:
eg_tuple = [('colname1', 10, 20), ('colname2', 30, 40), ('colname3', 50, 60)]
My attempted function is below:
def col_cut(df, cutoffs):
    for c in cutoffs:
        df_filter = df[(df[c[0]] >= c[1]) & (df[c[0]] <= c[2])]
    return df_filter
Note that the function should not filter out rows where the value is equal to the max or min. Appreciate the help.
The problem is that each iteration takes the original df as the source to filter, so only the last cutoff is applied. You should filter with:
def col_cut(df, cutoffs):
    df_filter = df
    for col, mn, mx in cutoffs:
        dfcol = df_filter[col]
        df_filter = df_filter[(dfcol >= mn) & (dfcol <= mx)]
    return df_filter
Note that you can use .between(..) [pandas-doc] here:
def col_cut(df, cutoffs):
    df_filter = df
    for col, mn, mx in cutoffs:
        df_filter = df_filter[df_filter[col].between(mn, mx)]
    return df_filter
Use np.logical_and + reduce on all masks created by a list comprehension with Series.between:
import numpy as np

def col_cut(df, cutoffs):
    mask = np.logical_and.reduce([df[col].between(min1, max1) for col, min1, max1 in cutoffs])
    return df[mask]
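With the eg_tuple from the question (placeholder column names), any of the versions above can be called as:
# placeholder column names, as in the question's example
eg_tuple = [('colname1', 10, 20), ('colname2', 30, 40), ('colname3', 50, 60)]
filtered = col_cut(df, eg_tuple)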

How to map different indices in Pyomo?

I am a new Pyomo/Python user. I need to formulate a set of constraints with index 'n', where all three components have different indices that correlate with index 'n'. I am curious how I can map the relationship between these sets.
In my case, I read csv files whose indices are related to 'n' in order to generate my sets. For example: a1.n1, a2.n3, a3.n5 /// b1.n2, b2.n4, b3.n6, b4.n7 /// c1.n1, c2.n2, c3.n4, c4.n6 ///. The constraint expressions for indices n1 and n2 are, for example:
for n1: P(a1.n1) + L(c1.n1) == D(n1)
for n2: - F(b1.n2) + L(c2.n2) == D(n2)
Now let's get to the code. The set-creating code is as follows, inside a class:
import pyomo
import pandas
import pyomo.opt
import pyomo.environ as pe

class MyModel:
    def __init__(self, Afile, Bfile, Cfile):
        self.A_data = pandas.read_csv(Afile)
        self.A_data.set_index(['a'], inplace = True)
        self.A_data.sort_index(inplace = True)
        self.A_set = self.A_data.index.unique()
        ... ...
Then I tried to map the relationship in the constraint construction as follows:
def createModel(self):
    self.m = pe.ConcreteModel()
    self.m.A_set = pe.Set( initialize = self.A_set )

    def obj_rule(m):
        return ...

    self.m.OBJ = pe.Objective(rule = obj_rule, sense = pe.minimize)

    def constr(m, n):
        As = self.A_data.reset_index()
        Amap = As[ As['n'] == n ]['a']
        Bs = self.B_data.reset_index()
        Bmap = Bs[ Bs['n'] == n ]['b']
        Cs = self.C_data.reset_index()
        Cmap = Cs[ Cs['n'] == n ]['c']
        return sum(m.P[(p,n)] for p in Amap) - sum(m.F[(s,n)] for s in Bmap) + sum(m.L[(r,n)] for r in Cmap) == self.D_data.ix[n, 'D']

    self.m.cons = pe.Constraint(self.m.D_set, rule = constr)

def solve(self):
    ... ...
Finally, this error is raised when I run it:
KeyError: "Index '(1, 1)' is not valid for indexed component 'P'"
I know it is the wrong way, so I am wondering if there is a good way to map their relationships. Thanks in advance!
Gabriel
I just forgot to post the answer to my own question when I solved this a week ago. The key to this problem is setting up a mapped index.
Let me just modify the code in the question. Firstly, we need to modify the dataframe to include the information of the mapped indices. Then the set for the mapped index can be constructed, taking two mapped indices as an example:
self.m.A_set = pe.Set( initialize = self.A_set, dimen = 2 )
The names of the two mapped indices are 'alpha' and 'beta' respectively. Then the constraint can be formulated, based on the variables declared at the beginning:
def constr(m, n):
    Amap = self.A_data[ self.A_data['alpha'] == n ]['beta']
    Bmap = self.B_data[ self.B_data['alpha'] == n ]['beta']
    return sum(m.P[(i,n)] for i in Amap) + sum(m.L[(r,n)] for r in Bmap) == D.loc[n, 'D']

m.TravelingBal = pe.Constraint(m.A_set, rule = constr)
The summation groups all B entries associated with A through the mapped index set.
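A minimal sketch of how the 2-dimensional set could be built from the dataframe, assuming A_data has 'alpha' and 'beta' columns (the exact dataframe preparation is not shown in the original answer):
# sketch under assumptions: A_data holds the ('alpha', 'beta') index pairs
pairs = list(self.A_data[['alpha', 'beta']].itertuples(index=False, name=None))
self.m.A_set = pe.Set(initialize=pairs, dimen=2)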