boolean indexing with loc returns NaN - pandas

import pandas as pd
numbers = [
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
]
df = pd.DataFrame(numbers)
condition = df.loc[:, 1:2] < 4
df[condition]
0 1 2
0 NaN 2.0 3.0
1 NaN NaN NaN
2 NaN NaN NaN
Why am I getting these wrong results, and what can I do to get the correct results?

The boolean condition has to be a Series, but selecting multiple columns here returns a DataFrame:
print (condition)
1 2
0 True True
1 False False
2 False False
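Indexing a DataFrame with a boolean DataFrame is equivalent to DataFrame.where: cells where the mask is True keep their value, and everything else, including columns missing from the mask (column 0 here), becomes NaN. A minimal sketch reproducing the output above:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
condition = df.loc[:, 1:2] < 4

# df[condition] aligns the mask with df and masks non-matching cells with NaN
print (df.where(condition))
     0    1    2
0  NaN  2.0  3.0
1  NaN  NaN  NaN
2  NaN  NaN  NaN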
So, to convert the boolean DataFrame to a mask, use DataFrame.all to test whether all values in a row are True, or DataFrame.any to test whether at least one value in a row is True (here both produce the same mask):
print (condition.any(axis=1))
print (condition.all(axis=1))
0 True
1 False
2 False
dtype: bool
Or select only one column for the condition:
print (df.loc[:, 1] < 4)
0 True
1 False
2 False
Name: 1, dtype: bool
print (df[condition.any(axis=1)])
0 1 2
0 1 2 3
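Filtering with the single-column condition returns the same row:
print (df[df.loc[:, 1] < 4])
   0  1  2
0  1  2  3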

Related

Pandas - drop n rows by column value

I need to remove last n rows where Status equals 1
v = df[df['Status'] == 1].count()
f = df[df['Status'] == 0].count()
diff = v - f
diff
df2 = df[~df['Status'] == 1].tail(diff).all() #ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
df2
Check whether Status is equal to 1 and keep only the places where it is (.loc[lambda s: s] does that using boolean indexing). The index of the last n such rows from tail is then dropped:
df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
sample run:
In [343]: df
Out[343]:
Status
0 1
1 2
2 3
3 2
4 1
5 1
6 1
7 2
In [344]: n
Out[344]: 2
In [345]: df.Status.eq(1)
Out[345]:
0 True
1 False
2 False
3 False
4 True
5 True
6 True
7 False
Name: Status, dtype: bool
In [346]: df.Status.eq(1).loc[lambda s: s]
Out[346]:
0 True
4 True
5 True
6 True
Name: Status, dtype: bool
In [347]: df.Status.eq(1).loc[lambda s: s].tail(n)
Out[347]:
5 True
6 True
Name: Status, dtype: bool
In [348]: df.Status.eq(1).loc[lambda s: s].tail(n).index
Out[348]: Int64Index([5, 6], dtype='int64')
In [349]: df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
Out[349]:
Status
0 1
1 2
2 3
3 2
4 1
7 2
Alternatively, use groupby() and transform() to mark the rows to keep:
df = pd.DataFrame({"Status": [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]})
n = 3
df["Keep"] = df.groupby("Status")["Status"].transform(
lambda x: x.reset_index().index < len(x) - n if x.name == 1 else True
)
df.loc[df["Keep"]].drop(columns="Keep")

Performing calculation based off multiple rows in Pandas dataframe

Set up an example dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, False, np.nan, np.nan], [5, False, np.nan, np.nan],
                   [np.nan, True, 'a', 'b'], [np.nan, True, 'b', 'a']],
                  columns=['value', 'IsRatio', 'numerator', 'denominator'],
                  index=['a', 'b', 'c', 'd'])
index  value  IsRatio  numerator  denominator
a      10     False    NaN        NaN
b      5      False    NaN        NaN
c      NaN    True     a          b
d      NaN    True     b          a
For rows where IsRatio is True, I would like to lookup the values for the numerator and denominator, and calculate a ratio.
For a single row I can use .loc
numerator_name = df.loc['c','numerator']
denominator_name = df.loc['c','denominator']
df.loc['c','value'] = int(df.loc[numerator_name]['value'])/int(df.loc[denominator_name]['value'])
This will calculate the ratio for a single row
index  value  IsRatio  numerator  denominator
a      10     False    NaN        NaN
b      5      False    NaN        NaN
c      2      True     a          b
d      NaN    True     b          a
How can I generalise this to all rows? I think I might need an apply function but I can't figure it out.
You can use apply to apply your computation to each row (mind the axis=1 input argument):
df['value'] = df.apply(
    lambda x: int(df.loc[x.numerator]['value']) / int(df.loc[x.denominator]['value'])
    if x.IsRatio else x.value,
    axis=1
)
The result is the following:
value IsRatio numerator denominator
a 10 False nan nan
b 5 False nan nan
c 2 True a b
d 0.5 True b a
Note: you should remove the np.array wrapper from the creation of the example DataFrame, otherwise every column, including IsRatio, has type str. So df should be defined as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, False, np.nan, np.nan], [5, False, np.nan, np.nan],
                   [np.nan, True, 'a', 'b'], [np.nan, True, 'b', 'a']],
                  columns=['value', 'IsRatio', 'numerator', 'denominator'],
                  index=['a', 'b', 'c', 'd'])
Otherwise, if the IsRatio column is actually of type str, you should modify the previous code as follows:
df['value'] = df.apply(
    lambda x: int(df.loc[x.numerator]['value']) / int(df.loc[x.denominator]['value'])
    if x.IsRatio == 'True' else x.value,
    axis=1
)
value IsRatio numerator denominator
a 10 False nan nan
b 5 False nan nan
c 2 True a b
d 0.5 True b a
For a vectorised solution, numpy's where() works well. (Note that this example uses numeric numerator and denominator values rather than row labels.)
df = pd.DataFrame(np.array([[10, False, np.nan, np.nan], [5, False, np.nan, np.nan],
                            [np.nan, True, 10, 5], [np.nan, True, 30, 10]]),
                  columns=['value', 'IsRatio', 'numerator', 'denominator'],
                  index=['a', 'b', 'c', 'd'])
# np.array upcast everything to float, so restore IsRatio to bool
df.IsRatio = df.IsRatio.astype(bool)
df.assign(v=np.where(df.IsRatio, df.numerator / df.denominator, df.value))
index  value  IsRatio  numerator  denominator  v
a      10     False    NaN        NaN          10
b      5      False    NaN        NaN          5
c      NaN    True     10         5            2
d      NaN    True     30         10           3
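If numerator and denominator hold row labels as in the original frame, the lookup itself can also be vectorised with Series.map, which looks each label up in the index of df['value']. A sketch assuming the original label-based df defined earlier (like the apply version, it expects the referenced rows to already contain plain values):
# map each label to its 'value' entry; NaN labels stay NaN
num = df['numerator'].map(df['value'])
den = df['denominator'].map(df['value'])
df['value'] = np.where(df['IsRatio'], num / den, df['value'])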

Pandas subtract columns with groupby and mask

For groups under one "SN", I would like to compute three performance indicators for each group. A group's boundaries are a serial number SN and a run of sequential True values in mask (so multiple True sequences can exist under one SN).
The first indicator, Csub, is the difference between the first and last values of each group in column 'C'. The second, Bmean, is the mean of each group in column 'B'.
For example:
In:
df = pd.DataFrame({"SN" : ["66", "66", "66", "77", "77", "77", "77", "77"], "B" : [-2, -1, -2, 3, 1, -1, 1, 1], "C" : [1, 2, 3, 15, 11, 2, 1, 2],
"mask" : [False, False, False, True, True, False, True, True] })
SN B C mask
0 66 -2 1 False
1 66 -1 2 False
2 66 -2 3 False
3 77 3 15 True
4 77 1 11 True
5 77 -1 2 False
6 77 1 1 True
7 77 1 2 True
Out:
   SN   B   C   mask  Csub  Bmean  CdivB
0  66  -2   1  False   NaN    NaN    NaN
1  66  -1   2  False   NaN    NaN    NaN
2  66  -2   3  False   NaN    NaN    NaN
3  77   3  15   True    -4     13   -0.3
4  77   1  11   True    -4     13   -0.3
5  77  -1   2  False   NaN    NaN    NaN
6  77   1   1   True     1      1      1
7  77   1   2   True     1      1      1
I cooked up something like this, but it groups by the mask's T/F values. It should group by SN and sequential True values, not by ALL True values. Further, I cannot figure out how to get a subtraction squeezed into this.
# Extracting performance values
perf = (df.assign(Bmean=df['B'], CdivB=df['C'] / df['B'])
        .groupby(['SN', 'mask'])
        .agg(dict(Bmean='mean', CdivB='mean'))
        .reset_index(drop=False))
It's not pretty, but you can try the following.
First, prepare a 'group_key' column in order to group by consecutive True values in 'mask':
# Select the rows where 'mask' is True preceded by False.
first_true = df.loc[
(df['mask'] == True)
& (df['mask'].shift(fill_value=False) == False)
]
# Add the column, initialised to NaN.
df['group_key'] = np.nan
# Each row in first_true gets assigned a different 'group_key' value.
df.loc[first_true.index, 'group_key'] = range(len(first_true))
# Forward fill 'group_key' on mask.
df.loc[df['mask'], 'group_key'] = df.loc[df['mask'], 'group_key'].ffill()
Then we can group by 'SN' and 'group_key' and compute and assign the indicator values.
# Group by 'SN' and 'group_key'.
gdf = df.groupby(by=['SN', 'group_key'], as_index=False)
# Compute indicator values
indicators = pd.DataFrame(gdf.nth(0)) # pd.DataFrame used here to avoid a SettingWithCopyWarning.
indicators['Csub'] = gdf.nth(0)['C'].array - gdf.nth(-1)['C'].array
indicators['Bmean'] = gdf.mean()['B'].array
# Write values to original dataframe
df = df.join(indicators.reindex(columns=['Csub', 'Bmean']))
# Forward fill the indicator values
df.loc[df['mask'], ['Csub', 'Bmean']] = df.loc[df['mask'], ['Csub', 'Bmean']].ffill()
# Drop 'group_key' column
df = df.drop(columns=['group_key'])
I excluded 'CdivB' since I couldn't understand what its value should be.
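A more compact way to build the same kind of group key is the cumsum idiom for labelling consecutive runs: every False row increments a counter, so each consecutive run of True values shares one key. A sketch under the same assumptions, using transform to broadcast the per-group results back onto the masked rows:
# consecutive True rows share one group_key value
df['group_key'] = (~df['mask']).cumsum()
g = df[df['mask']].groupby(['SN', 'group_key'])
# first minus last value of 'C' per group, mean of 'B' per group
df.loc[df['mask'], 'Csub'] = g['C'].transform('first') - g['C'].transform('last')
df.loc[df['mask'], 'Bmean'] = g['B'].transform('mean')
df = df.drop(columns='group_key')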

Replace a string value with NaN in pandas data frame - Python

I have to replace the value '?' with NaN so I can invoke the .isnull() method. I have found several solutions, but some errors are always returned. Suppose:
data = pd.DataFrame([[1, '?', 5], ['?', '?', 4], ['?', 32.1, 1]])
and if I try:
data.replace('?', np.nan)
I have:
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
but data.isnull() returns:
0 1 2
0 False False False
1 False False False
2 False False False
Why?
I think you forgot to assign it back:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
data = data.replace('?', np.nan)
#alternative
#data.replace('?', np.nan, inplace=True)
print (data)
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
print (data.isnull())
0 1 2
0 False True False
1 True True False
2 True False False
# a dataframe with string values
dat = pd.DataFrame({'a':[1,'FG', 2, 4], 'b':[2, 5, 'NA', 7]})
Removing non numerical elements from the dataframe:
"Method 1 - with regex"
dat2 = dat.replace(r'^([A-Za-z]|[0-9]|_)+$', np.NaN, regex=True)
dat2
"Method 2 - with pd.to_numeric"
dat3 = pd.DataFrame()
for col in dat.columns:
dat3[col] = pd.to_numeric(dat[col], errors='coerce')
dat3
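The loop in method 2 can also be written as a single apply over the columns:
dat3 = dat.apply(pd.to_numeric, errors='coerce')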
'?' is not null, so you should expect to get False from the isnull test:
>>> data = pd.DataFrame([[1, '?', 5], ['?', '?', 4], ['?', 32.1, 1]])
>>> data.isnull()
0 1 2
0 False False False
1 False False False
2 False False False
After you replace '?' with NaN, the test looks much different:
>>> data = data.replace('?', np.nan)
>>> data.isnull()
0 1 2
0 False True False
1 True True False
2 True False False
I believe that when you are doing data.replace('?', np.nan), the action is not done in place, so you must try:
data = data.replace('?', np.nan)
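If the data comes from a file, the replacement can also happen at read time; read_csv's na_values parameter parses the marker directly as NaN:
data = pd.read_csv('data.csv', na_values=['?'])  # 'data.csv' is a hypothetical file name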

Unable to remove NaN from panda Series

I know this question has been asked many times before, but all the solutions I have found don't seem to be working for me. I am unable to remove the NaN values from my pandas Series or DataFrame.
First, I tried removing the values directly from the DataFrame, like in In/Out 7 and 8 of the documentation (http://pandas.pydata.org/pandas-docs/stable/missing_data.html):
In[1]:
df['salary'][:5]
Out[1]:
0 365788
1 267102
2 170941
3 NaN
4 243293
In [2]:
pd.isnull(df['salary'][:5])
Out[2]:
0 False
1 False
2 False
3 False
4 False
I was expecting row 3 to show up as True, but it didn't. I extracted the Series from the DataFrame to try again.
sal = df['salary'][:5]
In [100]:
type(sal)
Out[100]:
pandas.core.series.Series
In [101]:
sal.isnull()
Out[101]:
0 False
1 False
2 False
3 False
4 False
Name: salary, dtype: bool
In [102]:
sal.dropna()
Out[102]:
0 365788
1 267102
2 170941
3 NaN
4 243293
Name: salary, dtype: object
Can someone tell me what I'm doing wrong? I am using IPython Notebook 2.2.0.
The datatype of your column is object, which tells me it probably contains strings rather than numerical values. Try converting to float:
>>> sa1 = pd.Series(["365788", "267102", "170941", "NaN", "243293"])
>>> sa1
0 365788
1 267102
2 170941
3 NaN
4 243293
dtype: object
>>> sa1.isnull()
0 False
1 False
2 False
3 False
4 False
dtype: bool
>>> sa1 = sa1.astype(float)
>>> sa1.isnull()
0 False
1 False
2 False
3 True
4 False
dtype: bool
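If the column might also contain strings that don't parse as floats, pd.to_numeric with errors='coerce' is a safer conversion than astype(float), since it turns unparseable entries into NaN instead of raising:
>>> sa1 = pd.Series(["365788", "267102", "170941", "NaN", "243293"])
>>> pd.to_numeric(sa1, errors='coerce').isnull()
0    False
1    False
2    False
3     True
4    False
dtype: bool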