I have defined a custom function to correct outliers in one of my DataFrame columns. The function works as expected, but I can't figure out how to call it on the DataFrame. Could you please help me solve this?
Below is my custom function:
def corr_sft_outlier(in_bhk, in_sft):
    # 20%-90% quantile band of avg_sft for rows with the same bhk_size
    bhk_band = np.quantile(outlierdf2[outlierdf2.bhk_size == in_bhk]['avg_sft'], (.20, .90))
    lower_band = round(bhk_band[0])
    upper_band = round(bhk_band[1])
    if lower_band <= in_sft <= upper_band:
        return in_sft
    elif in_sft < lower_band:
        return lower_band
    elif in_sft > upper_band:
        return upper_band
    else:
        return None
I am calling this function in the ways below, but neither works.
outlierdf2[['bhk_size','avg_sft']].apply(corr_sft_outlier)
outlierdf2.apply(corr_sft_outlier(outlierdf2['bhk_size'],outlierdf2['avg_sft']))
Here you go. Apply the function row by row with axis=1, passing the two column values of each row:
outlierdf2['adj_avg_sft'] = outlierdf2.apply(lambda x: corr_sft_outlier(x['bhk_size'], x['avg_sft']), axis=1)
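The row-wise apply above recomputes the quantile band for every row, which can be slow on larger frames. A minimal sketch of a vectorized alternative, assuming the same column names (bhk_size, avg_sft); the sample data and the helper name clip_to_band are made up for illustration:

```python
import pandas as pd

# Made-up sample data, only for illustration.
outlierdf2 = pd.DataFrame({
    "bhk_size": [2, 2, 2, 2, 2, 3, 3, 3],
    "avg_sft": [100, 200, 300, 400, 5000, 1000, 1100, 1200],
})

def clip_to_band(s):
    # Hypothetical helper: clip one group's values to its
    # rounded 20%-90% quantile band.
    lower = round(s.quantile(0.20))
    upper = round(s.quantile(0.90))
    return s.clip(lower, upper)

# transform() runs the helper once per bhk_size group and keeps the row order.
outlierdf2["adj_avg_sft"] = (
    outlierdf2.groupby("bhk_size")["avg_sft"].transform(clip_to_band)
)
print(outlierdf2)
```

Unlike the original function, this leaves missing avg_sft values as NaN rather than returning None.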
I call a function inside agg() on a Series. In the snippet below, the first call prints an int for the variable "a", and the second call prints a Series. I can't figure out the reason for this behaviour.
import pandas as pd
ser = pd.Series([1,2,3,4,5])
def find_second_last(a):
    print(a)
    return a.iloc[-2]
ser.agg(find_second_last)
.iloc with a single position (no [] around it) returns a scalar by default:
a.iloc[[-2]]  # returns a pd.Series
a.iloc[-2]    # returns an int
a.iloc[1:]    # returns a pd.Series
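The three indexing forms can be checked directly. A small sketch, reusing the Series from the question; the variable names single, as_list and as_slice are only illustrative:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5])

single = ser.iloc[-2]     # one integer position: a scalar value
as_list = ser.iloc[[-2]]  # list with one position: a one-element Series
as_slice = ser.iloc[1:]   # slice: a Series

print(type(single).__name__, type(as_list).__name__, type(as_slice).__name__)
```

This may also explain the two printouts: in pandas 1.x and 2.x, Series.agg appears to first try the function on each element (where a is an int, so a.iloc raises) and then fall back to calling it once on the whole Series. Treat that as an observation about those versions, not a guarantee.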
I am looking for a function that can check whether the input object is a DataFrame or a groupby object.
def fun(input_object):
    if is_goupby_object(input_object):
        pass  # do something with the groupby object
    else:
        pass  # otherwise do something with the dataframe
You can try this:
import pandas as pd

def is_goupby_object(obj):
    # GroupBy objects expose ngroups; plain DataFrames do not.
    try:
        return obj.ngroups > 0
    except AttributeError:
        return False

if is_goupby_object(obj):
    pass  # do something with the groupby object
elif isinstance(obj, pd.DataFrame):
    pass  # otherwise do something with the dataframe
else:
    pass  # neither a groupby nor a DataFrame object
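Instead of probing for the ngroups attribute, the type can be checked explicitly. A minimal sketch, assuming the DataFrameGroupBy class can be imported from pandas.core.groupby (an internal path that may change between versions); describe_input is a hypothetical helper name:

```python
import pandas as pd
from pandas.core.groupby import DataFrameGroupBy  # internal path, may change

def describe_input(obj):
    # Hypothetical helper: dispatch on the type of obj.
    if isinstance(obj, DataFrameGroupBy):
        return "groupby"
    if isinstance(obj, pd.DataFrame):
        return "dataframe"
    return "other"

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
print(describe_input(df.groupby("key")))
print(describe_input(df))
```

An explicit isinstance check avoids treating unrelated objects that happen to have an ngroups attribute as groupby objects.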
I have more of a general question. I've written a couple of functions that transform data successively:
def func1(df):
    pass
...
def main():
    df = pd.read_csv()
    df1 = func1(df)
    df2 = func2(df1)
    df3 = func3(df2)
    df4 = func4(df3)
    df4.to_csv()

if __name__ == "__main__":
    main()
Is there a better way of organizing the logic of my script?
Should I use classes for cases like this when everything is tied to one dataset?
It depends on your use case. From what I understand, I would use a dictionary of the functions that process a df.
For instance:
function_returning_a_df = {"f1": func1, "f2": func2, "f3": func3}
df = pd.read_csv(csv)
# if this df needs 3 functions to be applied
df_processing = ["f1", "f2", "f3"]  # functions will be applied in this order
# If you need to keep the df at every step, you can collect them in a list
dfs_processed = []
for func in df_processing:
    dfs_processed.append(df)  # if you want to save all steps
    df = function_returning_a_df[func](df)
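Another way to organize successive transformations is DataFrame.pipe, which chains functions that take and return a DataFrame. A minimal sketch; the step functions add_total and keep_large and the sample data are hypothetical stand-ins for func1, func2 and so on:

```python
import pandas as pd

def add_total(df):
    # Hypothetical step: add a derived column without mutating the input.
    return df.assign(total=df["a"] + df["b"])

def keep_large(df, threshold):
    # Hypothetical step: keep rows whose total exceeds the threshold.
    return df[df["total"] > threshold]

raw = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Each pipe() call passes the current frame as the first argument.
result = raw.pipe(add_total).pipe(keep_large, threshold=15)
print(result)
```

This keeps the steps as plain functions, so a class is not required just because everything operates on one dataset.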
I have created two functions that each return a DataFrame. I want to create a third function that merges the DataFrames from function1 and function2 and manipulates the data there. How can I call the functions and merge the results? The way I called them doesn't work for me.
def func1():
    return df1

def func2():
    return df2

def func3():
    func1()
    func2()
Your question is not entirely clear but what I think you mean is:
Use merge:
def func3():
    df = func1().merge(func2())
    # do something with df
    return df
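With no arguments, merge joins on the columns the two frames have in common. A small sketch that makes the key and join type explicit; the column names (id, name, score) and the data are made up for illustration:

```python
import pandas as pd

def func1():
    # Made-up data standing in for the first DataFrame.
    return pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

def func2():
    # Made-up data standing in for the second DataFrame.
    return pd.DataFrame({"id": [2, 3, 4], "score": [20, 30, 40]})

def func3():
    # Use the return values of func1/func2, then manipulate the merged frame.
    df = func1().merge(func2(), on="id", how="inner")
    df["score_doubled"] = df["score"] * 2
    return df

merged = func3()
print(merged)
```

The key point is that func3 has to use the values returned by func1() and func2(); calling them without capturing the result discards the DataFrames.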
I have a dataframe (maple) that, amongst others, has the columns 'THM', which is filled with float64 and 'Season_index', which is filled with int64. The 'THM' column has some missing values, and I want to fill them using the following function:
def fill_thm(cols):
    THM = cols[0]
    Season_index = cols[1]
    if pd.isnull[THM]:
        if Season_index == 1:
            return 10
        elif Season_index == 2:
            return 20
        elif Season_index == 3:
            return 30
        else:
            return 40
    else:
        return THM
Then, to apply the function I used
maple['THM']= maple[['THM','Season_index']].apply(fill_thm,axis=1)
But I am getting the error ("'function' object is not subscriptable", 'occurred at index 0'). Does anyone have any idea why? Thanks!
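The error comes from pd.isnull[THM]: square brackets try to index the function instead of calling it. Use parentheses, pd.isnull(THM). Try this:
def fill_thm(THM, S_i):
    if pd.isnull(THM):
        if S_i == 1:
            return 10
        elif S_i == 2:
            return 20
        elif S_i == 3:
            return 30
        else:
            return 40
    else:
        return THM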
And apply with:
maple.loc[:, 'THM'] = maple[['THM', 'Season_index']].apply(lambda row: fill_thm(row['THM'], row['Season_index']), axis=1)
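The same fill can also be done without a row-wise apply. A minimal sketch, assuming the column names from the question and the fill values 10, 20, 30 and 40; the sample data is made up:

```python
import numpy as np
import pandas as pd

# Made-up sample data, only for illustration.
maple = pd.DataFrame({
    "THM": [1.5, np.nan, np.nan, np.nan, np.nan],
    "Season_index": [1, 1, 2, 3, 4],
})

# Map each season to its fill value; any other season falls back to 40.
defaults = maple["Season_index"].map({1: 10, 2: 20, 3: 30}).fillna(40)

# fillna with a Series fills each missing THM from the matching row.
maple["THM"] = maple["THM"].fillna(defaults)
print(maple)
```

Existing THM values are left untouched; only the missing ones take the per-season default.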
Try this code:
def fill(cols):
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 30
        else:
            return 28
    else:
        return Age
train.loc[:, 'Age'] = train[['Age', 'Pclass']].apply(fill, axis=1)
First, when you call apply on a single column (a Series), you do not need to specify axis=1; it is only needed for a row-wise apply on a DataFrame.
Second, if you are using pandas 0.22, upgrade to 0.24, which should resolve the issues with apply on DataFrames.
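The difference between the two kinds of apply can be sketched as follows; the frame and the lambda functions are made up for illustration:

```python
import pandas as pd

train = pd.DataFrame({"Age": [22.0, 38.0], "Pclass": [3, 1]})

# Series.apply: the function receives one value at a time, no axis needed.
age_plus_one = train["Age"].apply(lambda age: age + 1)

# DataFrame.apply with axis=1: the function receives one row at a time.
age_times_class = train.apply(lambda row: row["Age"] * row["Pclass"], axis=1)

print(age_plus_one.tolist(), age_times_class.tolist())
```

Because the fill functions above need two columns per row, they fall into the second case and do need axis=1.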