("'function' object is not subscriptable", 'occurred at index 0') - TypeError

I have a dataframe (maple) that, amongst others, has the columns 'THM', which is filled with float64 and 'Season_index', which is filled with int64. The 'THM' column has some missing values, and I want to fill them using the following function:
def fill_thm(cols):
    THM = cols[0]
    Season_index = cols[1]
    if pd.isnull[THM]:
        if Season_index == 1:
            return 10
        elif Season_index == 2:
            return 20
        elif Season_index == 3:
            return 30
        else:
            return 40
    else:
        return THM
Then, to apply the function I used
maple['THM']= maple[['THM','Season_index']].apply(fill_thm,axis=1)
But I am getting the ("'function' object is not subscriptable", 'occurred at index 0') error. Does anyone have an idea why? Thanks!

The error comes from pd.isnull[THM]: pd.isnull is a function, so it has to be called with parentheses, not indexed with square brackets. Try this:
def fill_thm(THM, S_i):
    if pd.isnull(THM):
        if S_i == 1:
            return 10
        elif S_i == 2:
            return 20
        elif S_i == 3:
            return 30
        else:
            return 40
    else:
        return THM
And apply with:
maple['THM'] = maple[['THM','Season_index']].apply(lambda row: fill_thm(row['THM'], row['Season_index']), axis=1)

Try this code (pd.isnull is called with parentheses and receives the value itself, not the string 'Age'):
def fill(cols):
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 30
        else:
            return 28
    else:
        return Age
train['Age'] = train[['Age','Pclass']].apply(fill, axis=1)

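First of all, axis=1 is only unnecessary when you call apply on a single column (a Series). Here apply runs on a two-column DataFrame and the function needs both values of each row, so axis=1 has to stay.
Second, if you are using pandas 0.22, consider upgrading to 0.24 or later. Note, though, that the error above comes from the square brackets after pd.isnull, so an upgrade alone will not remove it.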
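The same fill can also be done without apply. This is only a minimal sketch, assuming the season-to-value mapping from the question (1 gives 10, 2 gives 20, 3 gives 30, anything else gives 40); the small maple frame below is made-up sample data, not the asker's real data:

```python
import numpy as np
import pandas as pd

# Made-up sample data standing in for the real `maple` frame.
maple = pd.DataFrame({
    'THM': [1.5, np.nan, np.nan, np.nan, np.nan],
    'Season_index': [1, 1, 2, 3, 4],
})

# Look up a default per season; seasons missing from the dict become NaN,
# which the inner fillna turns into the fallback value 40.
defaults = maple['Season_index'].map({1: 10, 2: 20, 3: 30}).fillna(40)

# Only the missing THM values are replaced; existing values are kept.
maple['THM'] = maple['THM'].fillna(defaults)

print(maple['THM'].tolist())
```

Because fillna aligns on the index, each missing THM value picks up the default computed for its own row.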

Related

return pandas dataframe from function

I want to return a dataframe from this function, which can be used elsewhere (for plotly graph to be exact).
My idea is to use the dataframe I can create with points_sum(), save it as the team name, and then use that dataframe in my px.line(dataframe = team_name).
In essence, I want to use the men_points_df variable after I created it.
def points_sum(team):
    points = 0
    men_points = []
    for index, row in menscore_df.iterrows():
        if row['hometeam'] == team:
            if row['homegoals'] > row['awaygoals']:
                points += 2
            elif row['homegoals'] == row['awaygoals']:
                points += 1
            elif row['homegoals'] < row['awaygoals']:
                pass  # a loss adds no points
            date = str(row['date'])
            men_points.append([date, points])
        if row['awayteam'] == team:
            if row['homegoals'] < row['awaygoals']:
                points += 2
            elif row['homegoals'] == row['awaygoals']:
                points += 1
            elif row['homegoals'] > row['awaygoals']:
                pass  # a loss adds no points
            date = str(row['date'])
            men_points.append([date, points])
    men_points_df = pd.DataFrame(men_points, columns = ["Date", 'Points'])
    return men_points_df
In plotly, I am trying to use my new dataframe (men_points_df) as shown below, but I get an "undefined name" error. I can create and print it in the console without problems: for example, test = points_sum("FIF") (FIF is one of the team names) shows the correct dataframe when I type test.
elif pathname == "/page-3":
    return [html.H1('Seasonal performance',
                    style={'textAlign':'center'}),
            html.Div(
                children=[
                    html.H2('Select team',style={'textAlign':'center'}),
                    html.Br(),
                    html.Br(),
                    dcc.Dropdown(
                        id='team_dd',
                        options=[{'label': v, 'value': k} for k,v in teams_all.items()],
                    )]),
            dcc.Graph(id="performance_graph")
            ]
@app.callback(  # assumes the Dash app object is named app
    Output(component_id="performance_graph", component_property="figure"),
    Input(component_id="team_dd", component_property="value")
)
def update_graph(option_selected):
    title = "none selected"
    if option_selected:
        title = option_selected
    line_fig = px.line(
        test, # <------------ THIS IS THE ISSUE
        title = f"{title}",
        x = "Date", y = "Points")
    return line_fig
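update_graph can only use names that exist in its own scope or at module level, so test has to be defined there. Just call points_sum in the update_graph function, before you use test:
def update_graph(option_selected):
    title = "none selected"
    if option_selected:
        title = option_selected
    # vvv Here vvv
    # (pass option_selected instead of "FIF" if the dropdown values are team names)
    test = points_sum("FIF")
    line_fig = px.line(
        test,
        title = f"{title}",
        x = "Date", y = "Points")
    return line_fig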
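The pattern above (return the dataframe, then bind it to a name inside the function that needs it) can be tried without Dash. This is a rough sketch under assumptions: the matches frame is made-up sample data, points_sum takes the frame as a parameter instead of reading a global, and figure_data is a hypothetical stand-in for the part of update_graph that prepares the data for px.line:

```python
import pandas as pd


def points_sum(matches, team):
    """Return the running points total for `team` as a DataFrame."""
    points = 0
    rows = []
    for _, row in matches.iterrows():
        if team == row['hometeam']:
            goals_for, goals_against = row['homegoals'], row['awaygoals']
        elif team == row['awayteam']:
            goals_for, goals_against = row['awaygoals'], row['homegoals']
        else:
            continue  # the team did not play in this match
        if goals_for > goals_against:
            points += 2
        elif goals_for == goals_against:
            points += 1
        rows.append([str(row['date']), points])
    return pd.DataFrame(rows, columns=['Date', 'Points'])


def figure_data(option_selected, matches):
    # The returned frame is bound to a local name, so it exists in this scope.
    team_points = points_sum(matches, option_selected)
    return team_points


# Made-up sample data.
matches = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-08', '2021-01-15'],
    'hometeam': ['FIF', 'ABC', 'FIF'],
    'awayteam': ['ABC', 'FIF', 'XYZ'],
    'homegoals': [3, 1, 0],
    'awaygoals': [1, 1, 2],
})

print(figure_data('FIF', matches))
```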

Pandas Data frame column condition check based on length of the value

I have a pandas dataframe that is created by reading an Excel file. The Excel file has a column called Serial Number. I pass each serial number to another function, which connects to an API and fetches the result set for that serial number.
My code:
def create_excel(filename):
    try:
        data = pd.read_excel(filename, usecols=[4,18,19,20,26,27,28], converters={'Serial Number': '{:0>32}'.format})
    except Exception as e:
        sys.exit("Error reading %s: %s" % (filename, e))
    data["Subject Organization"] = data["Subject Organization"].fillna("N/A")
    df = data[data['Subject Organization'].str.contains("Fannie", case=False)]
    #df['Serial Number'].apply(lambda x: '000'+x if len(x) == 29 else '00'+x if len(x) == 30 else '0'+x if len(x) == 31 else x)
    print(df)
    df.to_excel(r'Data.xlsx', index=False)
    output = df['Serial Number'].apply(fetch_by_ser_no)
    df2 = pd.DataFrame(output)
    df2.columns = ['Output']
    df5 = pd.concat([df, df2], axis=1)
The problem I am facing: if the result returned by fetch_by_ser_no() is blank, I want to pad the serial number to 34 characters by adding two more leading zeros and then call the function again.
How can I do this without creating multiple dataframes?
Any help is appreciated.
Thanks
You can try to use if ... else ...:
output = df['Serial Number'].apply(lambda x: 'ok' if fetch_by_ser_no(x) else 'badly')
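To retry with a longer serial number instead of only flagging the result, the fallback can live in a small helper, so no extra dataframe is needed. This is a minimal sketch, assuming fetch_by_ser_no returns something falsy (None or an empty result) when nothing is found; fetch_with_fallback and the stub fake_fetch are hypothetical names used only for illustration:

```python
def fetch_with_fallback(serial, fetch, widths=(32, 34)):
    """Try `fetch` with the serial zero-padded to each width in turn."""
    for width in widths:
        result = fetch(serial.rjust(width, '0'))
        if result:  # assumption: an empty or None result means "not found"
            return result
    return None


# Hypothetical stand-in for fetch_by_ser_no: it only knows one 34-character serial.
KNOWN = {'ABC123'.rjust(34, '0'): 'certificate data'}


def fake_fetch(serial):
    return KNOWN.get(serial)


found = fetch_with_fallback('ABC123'.rjust(32, '0'), fake_fetch)
missing = fetch_with_fallback('ZZZ'.rjust(32, '0'), fake_fetch)
print(found, missing)
```

With the real function, the call inside create_excel would then look like df['Serial Number'].apply(lambda s: fetch_with_fallback(s, fetch_by_ser_no)).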

Remove the first or last char so the values from a column should start with numbers

I'm new to Pandas and I'd like to ask your advice.
Let's take this dataframe:
df_test = pd.DataFrame({'Dimensions': ['22.67x23.5', '22x24.6', '45x56', 'x23x56.22','46x23x','34x45'],
'Other': [59, 29, 73, 56,48,22]})
I want to detect the rows whose value starts with "x" (row 4) or ends with "x" (row 5) and remove that character, so that my dataframe looks like this:
Dimensions  Other
22.67x23.5     59
22x24.6        29
45x56          73
23x56.22       56
46x23          48
34x45          22
I wanted to create a function and apply it to a column
def remove_x(x):
    if (x.str.match('^[a-zA-Z]') == True):
        x = x[1:]
        return x
    if (x.str.match('.*[a-zA-Z]$') == True):
        x = x[:-1]
        return x
If I apply this function to the column
df_test['Dimensions'] = df_test['Dimensions'].apply(remove_x)
I get the error: 'str' object has no attribute 'str'
I removed 'str' from the function and re-ran everything, but without success.
What should I do?
Thank you for any suggestions or if there is another way to do it I'm interested in.
Just use str.strip:
df_test['Dimensions'] = df_test['Dimensions'].str.strip('x')
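For general patterns, you can try str.replace (pass regex=True, which recent pandas versions need for regular-expression patterns):
df_test['Dimensions'] = df_test['Dimensions'].str.replace('(^x)|(x$)', '', regex=True)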
Output:
   Dimensions  Other
0  22.67x23.5     59
1     22x24.6     29
2       45x56     73
3    23x56.22     56
4       46x23     48
5       34x45     22
@QuangHoang's answer is better (for simplicity and efficiency), but here's what went wrong in your approach. In your function you use the .str accessor, which only exists on a Series. But when you call df_test['Dimensions'].apply(remove_x), the values passed to remove_x are the elements of df_test['Dimensions'], i.e. the str values themselves. So you should write the function as if x is an incoming str.
Here's how you could implement that (avoiding any regex):
def remove_x(x):
    if x.startswith('x'):
        x = x[1:]
    if x.endswith('x'):
        x = x[:-1]
    return x
More idiomatically:
def remove_x(x):
    return x.strip('x')
Or even:
df_test['Dimensions'] = df_test['Dimensions'].apply(lambda x : x.strip('x'))
All that said, better to not use apply and follow the built-ins shown by Quang.
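One difference between the two approaches is worth knowing: strip('x') removes every leading and trailing "x", while the anchored pattern removes at most one at each end. A small plain-Python sketch (the sample strings are taken from the question, plus one made-up value with a doubled "x"):

```python
import re

dimensions = ['22.67x23.5', 'x23x56.22', '46x23x', 'xx10x20']

# str.strip removes all leading and trailing 'x' characters.
stripped = [d.strip('x') for d in dimensions]

# The anchored pattern removes at most one 'x' at the start and one at the end.
trimmed_once = [re.sub(r'^x|x$', '', d) for d in dimensions]

print(stripped)      # ['22.67x23.5', '23x56.22', '46x23', '10x20']
print(trimmed_once)  # ['22.67x23.5', '23x56.22', '46x23', 'x10x20']
```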

How to call custom function in Pandas

I have defined a custom function to correct outliers in one of my DataFrame columns. The function works as expected, but I cannot figure out how to call it on the DataFrame. Could you please help me solve this?
Below is my custom function:
def corr_sft_outlier(in_bhk, in_sft):
    bhk_band = np.quantile(outlierdf2[outlierdf2.bhk_size==in_bhk]['avg_sft'], (.20,.90))
    lower_band = round(bhk_band[0])
    upper_band = round(bhk_band[1])
    if (in_sft>=lower_band)&(in_sft<=upper_band):
        return in_sft
    elif (in_sft<lower_band):
        return lower_band
    elif (in_sft>upper_band):
        return upper_band
    else:
        return None
I am calling the function in the two ways below, but neither works:
outlierdf2[['bhk_size','avg_sft']].apply(corr_sft_outlier)
outlierdf2.apply(corr_sft_outlier(outlierdf2['bhk_size'],outlierdf2['avg_sft']))
Here you go (apply row-wise with axis=1, so the lambda receives one row at a time):
outlierdf2['adj_avg_sft'] = outlierdf2.apply(lambda x: corr_sft_outlier(x['bhk_size'], x['avg_sft']), axis=1)
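Since the function recomputes the quantiles for every row, a grouped version may be worth considering for larger frames. This is a sketch under assumptions, not a drop-in replacement: it uses made-up sample data, computes the same 20% and 90% bands per bhk_size with groupby, and limits the values with clip:

```python
import pandas as pd

# Made-up sample data standing in for the real `outlierdf2`.
outlierdf2 = pd.DataFrame({
    'bhk_size': [2, 2, 2, 2, 2, 3, 3, 3],
    'avg_sft': [100.0, 200.0, 300.0, 400.0, 5000.0, 1000.0, 1500.0, 2000.0],
})

grouped = outlierdf2.groupby('bhk_size')['avg_sft']

# One band value per row, taken from the row's own bhk_size group.
lower = grouped.transform(lambda s: s.quantile(0.20)).round()
upper = grouped.transform(lambda s: s.quantile(0.90)).round()

# Values below the lower band or above the upper band are pulled to the band.
outlierdf2['adj_avg_sft'] = outlierdf2['avg_sft'].clip(lower=lower, upper=upper)

print(outlierdf2)
```

Unlike the original function, a missing avg_sft stays NaN here instead of becoming None.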

Binary-search without an explicit array

I want to perform a binary-search using e.g. np.searchsorted, however, I do not want to create an explicit array containing values. Instead, I want to define a function giving the value to be expected at the desired position of the array, e.g. p(i) = i, where i denotes the position within the array.
Generating an array of values regarding the function would, in my case, be neither efficient nor elegant. Is there any way to achieve this?
What about something like:
import collections.abc

# collections.Sequence was removed in Python 3.10; the ABC lives in collections.abc
class GeneratorSequence(collections.abc.Sequence):
    def __init__(self, func, size):
        self._func = func
        self._len = size
    def __len__(self):
        return self._len
    def __getitem__(self, i):
        if 0 <= i < self._len:
            return self._func(i)
        else:
            raise IndexError
    def __iter__(self):
        for i in range(self._len):
            yield self[i]
This would work with np.searchsorted(), e.g.:
import numpy as np
gen_seq = GeneratorSequence(lambda x: x ** 2, 100)
np.searchsorted(gen_seq, 9)
# 3
You could also write your own binary search function, you do not really need NumPy in this case, and it can actually be beneficial:
def bin_search(seq, item):
    first = 0
    last = len(seq) - 1
    found = False
    while first <= last and not found:
        midpoint = (first + last) // 2
        if seq[midpoint] == item:
            first = midpoint
            found = True
        else:
            if item < seq[midpoint]:
                last = midpoint - 1
            else:
                first = midpoint + 1
    return first
Which gives identical results:
all(bin_search(gen_seq, i) == np.searchsorted(gen_seq, i) for i in range(100))
# True
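Incidentally, this is also much faster, because np.searchsorted first converts the whole sequence to an array, which evaluates the function at every index: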
gen_seq = GeneratorSequence(lambda x: x ** 2, 1000000)
%timeit np.searchsorted(gen_seq, 10000)
# 1 loop, best of 3: 1.23 s per loop
%timeit bin_search(gen_seq, 10000)
# 100000 loops, best of 3: 16.1 µs per loop
Inspired by @norok2's comment, I think you can use something like this:
from collections.abc import Sequence

def f(i):
    return i*2 # Just an example

class MySeq(Sequence):
    def __init__(self, f, maxi):
        self.maxi = maxi
        self.f = f
    def __getitem__(self, x):
        if x < 0 or x > self.maxi:
            raise IndexError()
        return self.f(x)
    def __len__(self):
        return self.maxi + 1
In this case f is your function, while maxi is the maximum index. This of course only works if the function f returns values in sorted order.
At this point you can use an object of type MySeq inside np.searchsorted.
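The standard library's bisect module is another option, since bisect_left only needs len() and indexing and therefore evaluates the function at a handful of positions. A minimal sketch, assuming the function is non-decreasing; LazySequence and its calls counter are hypothetical names added for illustration:

```python
from bisect import bisect_left
from collections.abc import Sequence


class LazySequence(Sequence):
    """Read-only sequence whose i-th item is computed as func(i) on demand."""

    def __init__(self, func, size):
        self._func = func
        self._size = size
        self.calls = 0  # counts how often func is evaluated

    def __len__(self):
        return self._size

    def __getitem__(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)
        self.calls += 1
        return self._func(i)


squares = LazySequence(lambda i: i * i, 1_000_000)

# Leftmost position whose value is >= 10_000, found without building an array.
position = bisect_left(squares, 10_000)
print(position, squares.calls)
```

For a sequence of one million items the search needs about twenty evaluations of the function, instead of one million.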