Drop rows that don't have a float value in a column - pandas

I have this df:
My task is to find results matching these conditions:
df[(df.neighbourhood_group == 'Manhattan') & (df.room_type == 'Entire home/apt') & (df.price.between(150.0, 175.0))]
But this is not working. The error message says:
TypeError: '>=' not supported between instances of 'str' and 'float'
Because somewhere in the price column the value Private room is written.
How can I write a piece of code that keeps only the float values and drops all the others?
NOTE
These are not working:
df = df[df['price'].apply(lambda x: type(x) in [float])]
clean['price']=df['price'].str.replace('Private room', '0.0')
clean.price = clean.price.astype(float)
df.select_dtypes(exclude=['str'])
This is the CSV data.

One way to achieve it:
df['price'] = df.apply(lambda r: r['price'] if type(r['price']) == float else np.nan, axis=1)
df.dropna(inplace=True)
This replaces any non-float value in price with np.nan (it requires import numpy as np) and then drops those rows.
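A more idiomatic alternative (a sketch, not part of the original answer) is pd.to_numeric with errors='coerce', which converts every non-numeric entry to NaN in one pass and also fixes the underlying problem that the whole column was read in as strings:
import pandas as pd

# Stray text such as 'Private room' becomes NaN;
# numeric strings become proper floats.
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df.dropna(subset=['price'])
After this, df.price.between(150.0, 175.0) compares floats and no longer raises the TypeError.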

Related

Dataframe index with isclose function

I have a dataframe with numerical values between 0 and 1. I am trying to create simple summary statistics (manually). When I use a boolean comparison I can get the index, but when I try to use math.isclose the function does not work and gives an error.
For example:
import pandas as pd
df1 = pd.DataFrame({'col1': [0, .05, 0.74, 0.76, 1],
                    'col2': [0, 0.05, 0.5, 0.75, 1],
                    'x1': [1, 2, 3, 4, 5],
                    'x2': [5, 6, 7, 8, 9]})
result75 = df1.index[round(df1['col2'],2) == 0.75].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This gives the correct result, but occasionally the result is NaN, so I tried:
result75 = df1.index[math.isclose(round(df1['col2'],2), 0.75, abs_tol = 0.011)].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This results in the following error message:
TypeError: cannot convert the series to <class 'float'>
Both are of type "bool", so I'm not sure what is going wrong here...
This works, because math.isclose expects scalar floats (it raises TypeError when handed a whole Series), while element-wise comparisons operate on every row:
rows_meeting_condition = df1[(df1['col2'] > 0.74) & (df1['col2'] < 0.76)]
print(rows_meeting_condition['x2'].mean())
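If you specifically want isclose-style tolerance rather than hard bounds, numpy's vectorized np.isclose accepts a whole Series (a sketch, reusing the tolerance from the question):
import numpy as np

# np.isclose works element-wise and returns a boolean array
# that can index the DataFrame directly.
mask = np.isclose(df1['col2'], 0.75, atol=0.011)
print(df1.loc[mask, 'x2'].mean())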

groupby with transform minmax

For every city, I want to create a new column that is the min-max scaling of another column (age).
I tried this and get: Input contains infinity or a value too large for dtype('float64').
from sklearn import preprocessing

cols = ['age']

def f(x):
    scaler1 = preprocessing.MinMaxScaler()
    x[['age_minmax']] = scaler1.fit_transform(x[cols])
    return x

df = df.groupby(['city']).apply(f)
From the comments:
df['age'].replace([np.inf, -np.inf], np.nan, inplace=True)
Or
df['age'] = df['age'].replace([np.inf, -np.inf], np.nan)
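Once the infinities are replaced, a pure-pandas sketch of the groupby-with-transform min-max scaling from the title (no sklearn needed; note that a city where all ages are equal yields a zero denominator and NaN):
# Min-max scale age within each city.
df['age_minmax'] = df.groupby('city')['age'].transform(
    lambda s: (s - s.min()) / (s.max() - s.min())
)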

Creating a new column in pandas using existing column values as a filter - .isin() fails with AttributeError

Error: AttributeError: 'int' object has no attribute 'isin'
Question: There are no null values, and the logic works in an individual code block. I tried changing the data type of series R to object; the error then becomes: 'str' object has no attribute 'isin'
What am I missing?
Code:
X = [1, 2, 3, 4]
if dg['RFM_Segment'] == '111':
    return 'Core'
elif (dg['R'].isin(X) & dg['F'].isin([1]) & dg['M'].isin(X) & (dg['RFM_Segment'] != '111')).any():
    return 'Loyal'
elif (dg['R'].isin(X) & dg['F'].isin(X) & dg['M'].isin([1]) & (dg['RFM_Segment'] != '111')).any():
    return 'Whales'
elif (dg['R'].isin(X) & dg['F'].isin([1]) & dg['M'].isin([3, 4])).any():
    return 'Promising'
elif (dg['R'].isin([1]) & dg['F'].isin([4]) & dg['M'].isin(X)).any():
    return 'Rookies'
elif (dg['R'].isin([4]) & dg['F'].isin([4]) & dg['M'].isin(X)).any():
    return 'Slipping'
else:
    return 'NA'

dg['user_segment'] = dg.apply(user_segment, axis=1)
I will assume that you accidentally cut off the top of your code snippet, in which you define user_segment.
The issue lies in the way you tried to use apply. Note that with axis=1, apply operates on each row as a Series rather than on the DataFrame. So by indexing into an element of that Series you do not receive a Series object (as you would when indexing into a DataFrame column), but rather an object of the given column's type (int, str, etc.). An example:
import pandas as pd

X = ['a', 'c']
df = pd.DataFrame([['a', 'b'], ['c', 'd'], ['e', 'f']], columns=['col1', 'col2'])
df['col1'].isin(X)  # this works, because I'm applying `isin` on the entire column

def test_apply(x):
    print(x['col1'].isin(X))
    return x

df.apply(test_apply, axis=1)  # this doesn't work, because I'm applying
                              # `isin` on a non-pandas object, in this example `str`
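A sketch of how the row-wise function might look instead (abridged to the first few rules; X, dg and the column names are taken from the question's snippet). Inside apply(axis=1) each field is a scalar, so plain in and == replace .isin, and the .any() calls become unnecessary:
def user_segment(row):
    # every field of `row` is a scalar here
    if row['RFM_Segment'] == '111':
        return 'Core'
    elif row['R'] in X and row['F'] == 1 and row['M'] in X:
        return 'Loyal'
    elif row['R'] in X and row['F'] in X and row['M'] == 1:
        return 'Whales'
    else:
        return 'NA'

dg['user_segment'] = dg.apply(user_segment, axis=1)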

Pandas Data frame column condition check based on length of the value

I have a pandas DataFrame created by reading an Excel file. The Excel file has a column called Serial Number. I then pass each serial number to another function, which connects to an API and fetches the result set for that serial number.
My code:
def create_excel(filename):
    try:
        data = pd.read_excel(filename, usecols=[4, 18, 19, 20, 26, 27, 28],
                             converters={'Serial Number': '{:0>32}'.format})
    except Exception as e:
        sys.exit("Error reading %s: %s" % (filename, e))
    data["Subject Organization"].fillna("N/A", inplace=True)
    df = data[data['Subject Organization'].str.contains("Fannie", case=False)]
    #df['Serial Number'].apply(lambda x: '000'+x if len(x) == 29 else '00'+x if len(x) == 30 else '0'+x if len(x) == 31 else x)
    print(df)
    df.to_excel(r'Data.xlsx', index=False)
    output = df['Serial Number'].apply(lambda x: fetch_by_ser_no(x))
    df2 = pd.DataFrame(output)
    df2.columns = ['Output']
    df5 = pd.concat([df, df2], axis=1)
The problem I am facing: if the result returned by fetch_by_ser_no() is blank, I want to pad the serial number to 34 characters by adding two more leading zeros and then call the function again.
How can I do this without creating multiple DataFrames?
Any help appreciated. Thanks.
You can try to use a conditional expression (if ... else ...) inside the lambda:
output = df['Serial Number'].apply(lambda x: 'ok' if fetch_by_ser_no(x) else 'badly')
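Extending that idea to the retry-with-padding requirement (a sketch; it assumes fetch_by_ser_no returns a falsy value such as '' or None when the lookup fails, and the helper name fetch_with_retry is hypothetical):
def fetch_with_retry(ser_no):
    result = fetch_by_ser_no(ser_no)
    if not result:
        # pad to 34 characters with leading zeros and try once more
        result = fetch_by_ser_no(ser_no.rjust(34, '0'))
    return result

df['Output'] = df['Serial Number'].apply(fetch_with_retry)
Because the retry happens per value inside apply, no extra DataFrame is needed.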

Dataframe loc with multiple string value conditions

Hi, given this dataframe, is it possible to fetch the Number value associated with certain conditions using df.loc? This is what I came up with so far.
if df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"),"Number"]:
I want the output to be 1. Is this the correct way to do it?
You're on the right track, but you have to append .values[0] to the end of the .loc statement to extract the single value from the pandas Series.
df = pd.DataFrame({
    'Tags': ['Brunei', 'China'],
    'Type': ['Host', 'Address'],
    'Number': [1, 1192]
})
display(df)
series = df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"),"Number"]
print(type(series))
value = df.loc[(df["Tags"]=="Brunei") & (df["Type"]=="Host"),"Number"].values[0]
print(type(value))
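As a side note (not from the original answer): Series.item() does the same extraction but raises a ValueError when there is not exactly one match, which surfaces empty or duplicate matches early:
# .item() fails loudly if the mask matches zero or multiple rows.
value = df.loc[(df["Tags"] == "Brunei") & (df["Type"] == "Host"), "Number"].item()
print(value)  # 1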