How to apply a defined function to a column other than the one it was written for - pandas

I have a function that returns the BMI from a given dataframe with columns 'Weight' and 'Height'
Here is the function:
def BMI(dataframe):
    return dataframe['Weight'] / (dataframe['Height']**2)
I added new column 'Height In Meters' to the dataframe 'data' with:
data['Height In Meters'] = data['Height']/100
What I would like to do next is apply the original function to the dataframe 'data',
but use the new column 'Height In Meters' in the calculation instead of the column 'Height'.
The result should be a new column called 'BMI' in the dataframe 'data' that holds, for each row, the value calculated from 'Height In Meters'.
I tried:
data['BMI'] = data[['Weight','Height In Meters']].apply(BMI,axis=1)
But that doesn't seem to work.

You could pass the column names as arguments:
def BMI(dataframe, col1, col2):
    return dataframe[col1] / (dataframe[col2]**2)
a = data.apply(BMI, args=('Weight', 'Height'), axis=1)
data['Height In Meters'] = data['Height']/100
b = data.apply(BMI, args=('Weight', 'Height In Meters'), axis=1)
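For context on why the original attempt fails: with axis=1, apply passes each row to BMI as a Series, and the function then looks up 'Height', which is not among the two selected columns, so a KeyError is raised. Since the arithmetic is already vectorized, a row-wise apply is not strictly needed either. The following is a minimal sketch, assuming made-up sample values and the hypothetical parameter names weight_col and height_col:

```python
import pandas as pd

# Made-up sample data, for illustration only
data = pd.DataFrame({"Weight": [70.0, 80.0], "Height": [175.0, 180.0]})
data["Height In Meters"] = data["Height"] / 100

def BMI(dataframe, weight_col="Weight", height_col="Height"):
    # Operates on whole columns at once, so no apply(axis=1) is required
    return dataframe[weight_col] / (dataframe[height_col] ** 2)

# Call the function directly and store the resulting Series as a new column
data["BMI"] = BMI(data, "Weight", "Height In Meters")
print(data)
```

Calling the function directly on the dataframe usually runs faster than a row-wise apply, because pandas performs the division on entire columns.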

Related

Computing for the mean of a given column from a dataframe

I need to find the arithmetic mean of each column and return the results in res.
def ave(df, name):
df = {
    'Courses':["Spark","PySpark","Python","pandas",None],
    'Fee' :[20000,25000,22000,None,30000],
    'Duration':['30days','40days','35days','None','50days'],
    'Discount':[1000,2300,1200,2000,None]}
#CODE HERE
res = []
for i in df.columns:
    res.append(col_ave(df, i))
I tried writing the code for each mean individually, but I'm having trouble.
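One possible way to fill in the missing piece, offered as a sketch rather than the intended solution: convert the dict to a DataFrame first, then let a helper return the mean for numeric columns and None for the rest. The body of col_ave below is an assumption, since the question never shows it, and the variable name raw is also introduced here for illustration.

```python
import pandas as pd

raw = {
    'Courses': ["Spark", "PySpark", "Python", "pandas", None],
    'Fee': [20000, 25000, 22000, None, 30000],
    'Duration': ['30days', '40days', '35days', 'None', '50days'],
    'Discount': [1000, 2300, 1200, 2000, None],
}
df = pd.DataFrame(raw)  # a plain dict has no .columns, so build a DataFrame

def col_ave(df, name):
    # Assumed behavior: mean for numeric columns (NaN is skipped), None otherwise
    if pd.api.types.is_numeric_dtype(df[name]):
        return df[name].mean()
    return None

res = [col_ave(df, i) for i in df.columns]
print(res)
```

If only the numeric columns matter, df.mean(numeric_only=True) gives the same means in a single call.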

groupby with transform minmax

For every city, I want to create a new column that holds the min-max scaling of another column (age).
I tried this and got the error: Input contains infinity or a value too large for dtype('float64').
cols=['age']
def f(x):
    scaler1=preprocessing.MinMaxScaler()
    x[['age_minmax']] = scaler1.fit_transform(x[cols])
    return x
df = df.groupby(['city']).apply(f)
From the comments:
df['age'].replace([np.inf, -np.inf], np.nan, inplace=True)
Or
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)
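To see the replacement and the per-city scaling together, here is a small sketch on made-up data. It swaps MinMaxScaler for plain pandas arithmetic inside groupby(...).transform, so it is not the asker's exact method; the helper name minmax is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample data containing one infinite value
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B"],
    "age": [20.0, 30.0, np.inf, 40.0, 60.0],
})

# Replace +/- infinity with NaN so min and max stay finite
df["age"] = df["age"].replace([np.inf, -np.inf], np.nan)

def minmax(s):
    # Scale one group's values to [0, 1]; NaN entries stay NaN
    return (s - s.min()) / (s.max() - s.min())

df["age_minmax"] = df.groupby("city")["age"].transform(minmax)
print(df)
```

A group whose values are all identical has a zero range and would come out as NaN with this helper, so that case may need separate handling.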

Dataframe column filter from a list of tuples

I'm trying to create a function to filter a dataframe from a list of tuples. I've created the below function but it doesn't seem to be working.
Each tuple would hold a dataframe column name, a min value, and a max value to filter on.
eg:
eg_tuple = [('colname1', 10, 20), ('colname2', 30, 40), ('colname3', 50, 60)]
My attempted function is below:
def col_cut(df, cutoffs):
    for c in cutoffs:
        df_filter = df[ (df[c[0]] >= c[1]) & (df[c[0]] <= c[2])]
    return df_filter
Note that the function should not filter on rows where the value is equal to max or min. Appreciate the help.
The problem is that each iteration filters the original df again, so only the last cutoff takes effect. You should filter the already-filtered dataframe instead:
def col_cut(df, cutoffs):
    df_filter = df
    for col, mn, mx in cutoffs:
        dfcol = df_filter[col]
        df_filter = df_filter[(dfcol >= mn) & (dfcol <= mx)]
    return df_filter
Note that you can use .between(..) [pandas-doc] here:
def col_cut(df, cutoffs):
    df_filter = df
    for col, mn, mx in cutoffs:
        df_filter = df_filter[df_filter[col].between(mn, mx)]
    return df_filter
Use np.logical_and + reduce of all masks created by list comprehension with Series.between:
def col_cut(df, cutoffs):
    mask = np.logical_and.reduce([df[col].between(min1,max1) for col,min1,max1 in cutoffs])
    return df[mask]
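A short usage sketch of the mask-based version, run on made-up data, shows that rows sitting exactly on a bound are kept, because Series.between includes both endpoints by default. The sample values and the names lo and hi are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def col_cut(df, cutoffs):
    # One boolean mask per (column, min, max) tuple, combined with logical AND
    mask = np.logical_and.reduce(
        [df[col].between(lo, hi) for col, lo, hi in cutoffs]
    )
    return df[mask]

# Made-up sample data
df = pd.DataFrame({
    "colname1": [5, 10, 15, 20, 25],
    "colname2": [30, 35, 45, 40, 31],
})

filtered = col_cut(df, [("colname1", 10, 20), ("colname2", 30, 40)])
print(filtered)
```

Recent pandas versions also accept an inclusive argument on between ("both", "neither", "left", "right") if the bounds should be excluded instead.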

Assigning values to dataframe columns

In the below code, the dataframe df5 is not getting populated. I am just assigning the values to dataframe's columns and I have specified the column beforehand. When I print the dataframe, it returns an empty dataframe. Not sure whether I am missing something.
Any help would be appreciated.
import math
import pandas as pd
columns = ['ClosestLat','ClosestLong']
df5 = pd.DataFrame(columns=columns)
def distance(pt1, pt2):
    return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
for pt1 in df1:
    closestPoints = [pt1, df2[0]]
    for pt2 in df2:
        if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
            closestPoints = [pt1, pt2]
    df5['ClosestLat'] = closestPoints[1][0]
    df5['ClosestLong'] = closestPoints[1][1]
    print ("Point: " + str(closestPoints[0]) + " is closest to " + str(closestPoints[1]))
From the look of your code, you're trying to populate df5 with a list of latitudes and longitudes. However, you're making a couple of mistakes.
The columns of a pandas dataframe are Series, which hold sequential data. So df5['ClosestLat'] = closestPoints[1][0] assigns a single scalar to the whole column; because df5 has no rows yet, there is nothing to fill and the column stays empty.
Even if the assignment did store the value, you would lose data, because each pass through the loop overwrites the column.
The solution: build lists of lats and longs, then insert them into the dataframe.
import math
import pandas as pd
columns = ['ClosestLat','ClosestLong']
df5 = pd.DataFrame(columns=columns)
def distance(pt1, pt2):
    return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
lats, lngs = [], []
for pt1 in df1:
    closestPoints = [pt1, df2[0]]
    for pt2 in df2:
        if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
            closestPoints = [pt1, pt2]
    # one closest point per pt1, recorded after the inner loop finishes
    lats.append(closestPoints[1][0])
    lngs.append(closestPoints[1][1])
df5['ClosestLat'] = pd.Series(lats)
df5['ClosestLong'] = pd.Series(lngs)
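As a self-contained illustration of the same idea, the sketch below assumes the two inputs are plain lists of (lat, long) tuples, here given the hypothetical names points_a and points_b with made-up coordinates, and builds the dataframe in one step from the collected results.

```python
import math
import pandas as pd

# Made-up sample points; the real df1 and df2 are not shown in the question
points_a = [(0.0, 0.0), (5.0, 5.0)]
points_b = [(1.0, 1.0), (4.0, 6.0), (10.0, 10.0)]

def distance(pt1, pt2):
    # Straight-line (Euclidean) distance between two 2-D points
    return math.sqrt((pt1[0] - pt2[0]) ** 2 + (pt1[1] - pt2[1]) ** 2)

# For each point in points_a, pick the nearest point in points_b
closest = [
    min(points_b, key=lambda pt2: distance(pt1, pt2))
    for pt1 in points_a
]

# Build the dataframe once, from the complete list of results
df5 = pd.DataFrame(closest, columns=['ClosestLat', 'ClosestLong'])
print(df5)
```

Constructing the dataframe once from a finished list avoids both problems described above: nothing is assigned to an empty frame, and nothing is overwritten inside the loop.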

Applying different functions to different columns of grouped dataframe

I am new to Pandas. I have grouped a dataframe by date and applied a function to different columns of the dataframe as shown below
def func(x):
    questionID = x['questionID'].size()
    is_true = x['is_bounty'].sum()
    is_closed = x['is_closed'].sum()
    flag = True
    return pd.Series([questionID, is_true, is_closed, flag], index=['questionID', 'is_true', 'is_closed', 'flag'])
df_grouped = df1.groupby(['date'], as_index = False)
df_grouped = df_grouped.apply(func)
But when I run this I get an error saying
questionID = x['questionID'].size()
TypeError: 'int' object is not callable.
When I do the same thing this way it doesn't give any error.
df_grouped1 = df_grouped['questionID'].size()
I don't understand where I am going wrong.
The error 'int' object is not callable means that size has to be used without () here:
x['questionID'].size
On a Series, size is an attribute that holds an integer; on a GroupBy object, size() is a method.
The same distinction can apply to other names that are attributes on some objects and methods on others.
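The difference can be seen side by side in the sketch below, which uses made-up data with the column names from the question. The last step uses named aggregation with agg as an alternative to a custom apply function; that is a swapped-in technique, not the asker's code, and it leaves out the constant flag column.

```python
import pandas as pd

# Made-up sample data using the question's column names
df1 = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "questionID": [101, 102, 103],
    "is_bounty": [True, False, True],
    "is_closed": [False, False, True],
})

# On a Series, size is an attribute: no parentheses
n_rows = df1["questionID"].size

# On a GroupBy object, size() is a method: parentheses required
group_sizes = df1.groupby("date")["questionID"].size()

# Alternative: named aggregation, one output column per (column, function) pair
summary = df1.groupby("date").agg(
    questionID=("questionID", "size"),
    is_true=("is_bounty", "sum"),
    is_closed=("is_closed", "sum"),
)
print(summary)
```

Named aggregation keeps each output column tied to an explicit input column and function, which can be easier to read than building a Series by hand inside apply.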