Pandas subtract columns with groupby and mask - pandas

For groups under one "SN", I would like to compute three performance indicators for each group. A group is bounded by the serial number SN and by a run of consecutive True values in mask, so multiple True sequences can exist under one SN.
The first indicator I want is Csub, the difference between the first and last values of each group in column 'C'. The second, Bmean, is the mean of each group in column 'B'.
For example:
In:
df = pd.DataFrame({"SN" : ["66", "66", "66", "77", "77", "77", "77", "77"], "B" : [-2, -1, -2, 3, 1, -1, 1, 1], "C" : [1, 2, 3, 15, 11, 2, 1, 2],
"mask" : [False, False, False, True, True, False, True, True] })
SN B C mask
0 66 -2 1 False
1 66 -1 2 False
2 66 -2 3 False
3 77 3 15 True
4 77 1 11 True
5 77 -1 2 False
6 77 1 1 True
7 77 1 2 True
Out:
SN B C mask Csub Bmean CdivB
0 66 -2 1 False Nan Nan Nan
1 66 -1 2 False Nan Nan Nan
2 66 -2 3 False Nan Nan Nan
3 77 3 15 True -4 13 -0.3
4 77 1 11 True -4 13 -0.3
5 77 -1 2 False Nan Nan Nan
6 77 1 1 True 1 1 1
7 77 1 2 True 1 1 1
I cooked up something like this, but it groups by the mask True/False values. It should group by SN and consecutive True values, not all True values. Further, I cannot figure out how to squeeze a subtraction into this.
# Extracting performance values
perf = (df.assign(
Bmean = df['B'], CdivB = df['C']/df['B']
).groupby(['SN','mask'])
.agg(dict(Bmean ='mean', CdivB = 'mean'))
.reset_index(drop=False)
)

It's not pretty, but you can try the following.
First, prepare a 'group_key' column in order to group by consecutive True values in 'mask':
# Select the rows where 'mask' is True preceded by False.
first_true = df.loc[
(df['mask'] == True)
& (df['mask'].shift(fill_value=False) == False)
]
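# Add the column, initially all NaN (the explicit dtype avoids the empty-Series dtype warning in older pandas).
df['group_key'] = pd.Series(dtype='float64')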
# Each row in first_true gets assigned a different 'group_key' value.
df.loc[first_true.index, 'group_key'] = range(len(first_true))
# Forward fill 'group_key' on mask.
df.loc[df['mask'], 'group_key'] = df.loc[df['mask'], 'group_key'].ffill()
Then we can group by 'SN' and 'group_key' and compute and assign the indicator values.
# Group by 'SN' and 'group_key'.
gdf = df.groupby(by=['SN', 'group_key'], as_index=False)
# Compute indicator values
indicators = pd.DataFrame(gdf.nth(0)) # pd.DataFrame used here to avoid a SettingWithCopyWarning.
# Last minus first, which matches the signs in the expected output (-4 and 1).
indicators['Csub'] = gdf.nth(-1)['C'].array - gdf.nth(0)['C'].array
indicators['Bmean'] = gdf.mean()['B'].array
# Write values to original dataframe
df = df.join(indicators.reindex(columns=['Csub', 'Bmean']))
# Forward fill the indicator values
df.loc[df['mask'], ['Csub', 'Bmean']] = df.loc[df['mask'], ['Csub', 'Bmean']].ffill()
# Drop 'group_key' column
df = df.drop(columns=['group_key'])
I excluded 'CdivB' since I couldn't understand what its value should be.
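A more compact variant of the same idea, as a rough sketch: number each run of consecutive equal 'mask' values within an SN using cumsum, then use transform on the True rows only. The names run_start, run_id, true_rows and grouped are made up for this illustration, and Csub is assumed to be last minus first because that is what the signs in the expected output suggest.

```python
import pandas as pd

df = pd.DataFrame({
    "SN": ["66", "66", "66", "77", "77", "77", "77", "77"],
    "B": [-2, -1, -2, 3, 1, -1, 1, 1],
    "C": [1, 2, 3, 15, 11, 2, 1, 2],
    "mask": [False, False, False, True, True, False, True, True],
})

# A new run starts wherever 'mask' or 'SN' differs from the previous row.
run_start = df["mask"].ne(df["mask"].shift()) | df["SN"].ne(df["SN"].shift())
# Cumulative sum of the run starts gives one integer label per run.
run_id = run_start.cumsum()

# Only the True rows carry indicators; group them by their run label.
true_rows = df[df["mask"]]
grouped = true_rows.groupby(run_id[df["mask"]])

# Assumption: Csub = last - first, as in the expected output.
# Rows where 'mask' is False are absent from the result and become NaN on assignment.
df["Csub"] = grouped["C"].transform("last") - grouped["C"].transform("first")
df["Bmean"] = grouped["B"].transform("mean")

print(df)
```

Because transform returns one value per input row, no join or forward fill is needed afterwards.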

Related

Pandas - drop n rows by column value

I need to remove the last n rows where Status equals 1.
v = df[df['Status'] == 1].count()
f = df[df['Status'] == 0].count()
diff = v - f
diff
df2 = df[~df['Status'] == 1].tail(diff).all() #ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
df2
Check where Status equals 1 and keep only the True entries (.loc[lambda s: s] does that via boolean indexing). Then take the last n of those entries with tail(n) and drop the rows at their index:
df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
sample run:
In [343]: df
Out[343]:
Status
0 1
1 2
2 3
3 2
4 1
5 1
6 1
7 2
In [344]: n
Out[344]: 2
In [345]: df.Status.eq(1)
Out[345]:
0 True
1 False
2 False
3 False
4 True
5 True
6 True
7 False
Name: Status, dtype: bool
In [346]: df.Status.eq(1).loc[lambda s: s]
Out[346]:
0 True
4 True
5 True
6 True
Name: Status, dtype: bool
In [347]: df.Status.eq(1).loc[lambda s: s].tail(n)
Out[347]:
5 True
6 True
Name: Status, dtype: bool
In [348]: df.Status.eq(1).loc[lambda s: s].tail(n).index
Out[348]: Int64Index([5, 6], dtype='int64')
In [349]: df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
Out[349]:
Status
0 1
1 2
2 3
3 2
4 1
7 2
Using groupby() and transform() to mark rows to keep:
df = pd.DataFrame({"Status": [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]})
n = 3
df["Keep"] = df.groupby("Status")["Status"].transform(
lambda x: x.reset_index().index < len(x) - n if x.name == 1 else True
)
df.loc[df["Keep"]].drop(columns="Keep")
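The first approach can be wrapped in a small helper, sketched here under the assumption that the index labels are unique (drop works by label). The function name drop_last_n_matching and its parameters are invented for this example. The guard for n == 0 is needed because slicing with [-0:] would select every match rather than none.

```python
import pandas as pd

def drop_last_n_matching(frame, column, value, n):
    """Return a copy of frame without the last n rows where frame[column] == value."""
    # Index labels of all rows whose column equals the value, in row order.
    matches = frame.index[frame[column].eq(value)]
    # Take the last n labels; for n == 0 take none.
    to_drop = matches[-n:] if n > 0 else matches[:0]
    return frame.drop(to_drop)

df = pd.DataFrame({"Status": [1, 2, 3, 2, 1, 1, 1, 2]})
result = drop_last_n_matching(df, "Status", 1, 2)
print(result)
```

With the sample frame from the run above and n = 2, the rows labelled 5 and 6 are removed.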

boolean indexing with loc returns NaN

import pandas as pd
numbers = [
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
]
df = pd.DataFrame(numbers)
condition = df.loc[:, 1:2] < 4
df[condition]
0 1 2
0 NaN 2.0 3.0
1 NaN NaN NaN
2 NaN NaN NaN
Why am I getting these wrong results, and what can I do to get the correct results?
To filter rows, the boolean condition has to be a Series, but here the selected columns return a DataFrame:
print (condition)
1 2
0 True True
1 False False
2 False False
So to convert the boolean DataFrame to a row mask, use DataFrame.all to test whether all values in a row are True, or
DataFrame.any to test whether at least one value in a row is True:
print (condition.any(axis=1))
print (condition.all(axis=1))
0 True
1 False
2 False
dtype: bool
Or select only one column for condition:
print (df.loc[:, 1] < 4)
0 True
1 False
2 False
Name: 1, dtype: bool
print (df[condition.any(axis=1)])
0 1 2
0 1 2 3
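To see where the NaN values come from, here is a small sketch that contrasts the two kinds of indexing. The variable names masked and rows are only for illustration. Indexing with a boolean DataFrame masks cell by cell, and cells the condition does not cover (column 0 here) are treated as False, whereas a boolean Series selects whole rows.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
condition = df.loc[:, 1:2] < 4  # boolean DataFrame covering columns 1 and 2 only

# Element-wise masking: every cell that is not True in condition becomes NaN.
masked = df[condition]

# Row filtering: reduce the condition to one boolean per row first.
rows = df[condition.any(axis=1)]

print(masked)
print(rows)
```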

Python Pandas: How to delete row with certain value of 'Object' datatype?

I have a dataframe named train_data.
This is the datatype of each column.
The columns workclass, occupation, and native-country are of "Object" datatype and some of the rows contain values of "?".
In this example, you can see row index 5 has some values with "?".
I want to delete all rows with any cell that has any "?".
I tried the following code, but it didn't work.
train_data = train_data[~(train_data.values == '?').any(1)]
train_data
Use .loc for index slicing.
import pandas as pd
df1 = pd.DataFrame({'A' : [0,1,2,3,'?'],
'B' : [2,4,5,'?',9],
'C' : [0,'?',2,3,4]})
print(df1)
A B C
0 0 2 0
1 1 4 ?
2 2 5 2
3 3 ? 3
4 ? 9 4
print(df1.loc[~df1.eq('?').any(axis=1)])
A B C
0 0 2 0
2 2 5 2
If you only want to check object columns, use
DataFrame.select_dtypes:
df1.select_dtypes('object').eq('?').any(axis=1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
Edit.
One method for handling leading or trailing white spaces.
df1 = pd.DataFrame({'A' : [0,1,2,3,'?'],
'B' : [2,4,5,' ?',9],
'C' : [0,'? ',2,3,4]})
df1.eq('?').any(axis=1)
0 False
1 False
2 False
3 False
4 True
dtype: bool
df1.replace(r'(\s+\?)|(\?\s+)', '?', regex=True).eq('?').any(axis=1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
str.strip() with a lambda
str_cols = df1.select_dtypes('object').columns
df1[str_cols] = df1[str_cols].apply(lambda x : x.str.strip())
df1.eq('?').any(axis=1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
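One caveat with the str.strip() variant: on a mixed-type object column, the .str accessor returns NaN for every non-string element, so the numeric values in df1 are lost after the assignment. A possible workaround, sketched here with a hypothetical helper strip_if_str, is to strip only the values that are actually strings:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 1, 2, 3, "?"],
                    "B": [2, 4, 5, " ?", 9],
                    "C": [0, "? ", 2, 3, 4]})

def strip_if_str(value):
    """Strip surrounding whitespace from strings; return other values unchanged."""
    return value.strip() if isinstance(value, str) else value

# Apply the helper to every cell, column by column, keeping numbers intact.
cleaned = df1.apply(lambda col: col.map(strip_if_str))

# Drop every row that contains a "?" in any column.
has_question_mark = cleaned.eq("?").any(axis=1)
result = cleaned.loc[~has_question_mark]

print(result)
```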

Replace a string value with NaN in pandas data frame - Python

I need to replace the value ? with NaN so that I can call the .isnull() method. I have found several solutions, but they always return some error. Suppose:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
and if I try:
pd.data.replace('?', np.nan)
I have:
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
but data.isnull() returns:
0 1 2
0 False False False
1 False False False
2 False False False
Why?
I think you forgot to assign the result back:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
data = data.replace('?', np.nan)
#alternative
#data.replace('?', np.nan, inplace=True)
print (data)
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
print (data.isnull())
0 1 2
0 False True False
1 True True False
2 True False False
# A dataframe with string values
dat = pd.DataFrame({'a':[1,'FG', 2, 4], 'b':[2, 5, 'NA', 7]})
Removing non-numerical elements from the dataframe:
# Method 1 - with regex
dat2 = dat.replace(r'^([A-Za-z]|[0-9]|_)+$', np.nan, regex=True)
dat2
# Method 2 - with pd.to_numeric
dat3 = pd.DataFrame()
for col in dat.columns:
    dat3[col] = pd.to_numeric(dat[col], errors='coerce')
dat3
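The per-column loop in Method 2 can also be written with DataFrame.apply, which forwards extra keyword arguments to the function it calls. This is just a shorter sketch of the same conversion; the name numeric is chosen for illustration.

```python
import pandas as pd

dat = pd.DataFrame({"a": [1, "FG", 2, 4], "b": [2, 5, "NA", 7]})

# Convert each column to numbers; values that cannot be parsed become NaN.
numeric = dat.apply(pd.to_numeric, errors="coerce")

print(numeric)
print(numeric.isnull())
```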
'?' is not a null value, so you should expect False under the isnull test:
>>> data = pandas.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
>>> data.isnull()
0 1 2
0 False False False
1 False False False
2 False False False
After you replace ? with NaN, the test looks quite different:
>>> data = data.replace('?', np.nan)
>>> data.isnull()
0 1 2
0 False True False
1 True True False
2 True False False
I believe that pd.data.replace('?', np.nan) does not modify the DataFrame in place, so assign the result back:
data = data.replace('?', np.nan)

Unable to remove NaN from pandas Series

I know this question has been asked many times before, but all the solutions I have found don't seem to be working for me. I am unable to remove the NaN values from my pandas Series or DataFrame.
First, I tried removing directly from the DataFrame like in I/O 7 and 8 in the documentation (http://pandas.pydata.org/pandas-docs/stable/missing_data.html)
In[1]:
df['salary'][:5]
Out[1]:
0 365788
1 267102
2 170941
3 NaN
4 243293
In [2]:
pd.isnull(df['salary'][:5])
Out[2]:
0 False
1 False
2 False
3 False
4 False
I was expecting line 3 to show up as True, but it didn't. I removed the Series from the DataFrame to try it again.
sal = df['salary'][:5]
In [100]:
type(sal)
Out[100]:
pandas.core.series.Series
In [101]:
sal.isnull()
Out[101]:
0 False
1 False
2 False
3 False
4 False
Name: salary, dtype: bool
In [102]:
sal.dropna()
Out[102]:
0 365788
1 267102
2 170941
3 NaN
4 243293
Name: salary, dtype: object
Can someone tell me what I'm doing wrong? I am using IPython Notebook 2.2.0.
The datatype of your column is object, which tells me it probably contains strings rather than numerical values. Try converting to float:
>>> sa1 = pd.Series(["365788", "267102", "170941", "NaN", "243293"])
>>> sa1
0 365788
1 267102
2 170941
3 NaN
4 243293
dtype: object
>>> sa1.isnull()
0 False
1 False
2 False
3 False
4 False
dtype: bool
>>> sa1 = sa1.astype(float)
>>> sa1.isnull()
0 False
1 False
2 False
3 True
4 False
dtype: bool
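If the column might also contain strings that are not valid numbers at all, astype(float) would raise an error. In that case pd.to_numeric with errors="coerce" is a possible alternative, sketched below on the same sample values (the Series here is constructed for illustration, not taken from the original DataFrame):

```python
import pandas as pd

# Salaries stored as strings, including the literal text "NaN".
sal = pd.Series(["365788", "267102", "170941", "NaN", "243293"], name="salary")

# Parse the strings; anything that cannot be parsed becomes a real NaN.
sal_numeric = pd.to_numeric(sal, errors="coerce")

# Now isnull() and dropna() behave as expected.
print(sal_numeric.isnull())
clean = sal_numeric.dropna()
print(clean)
```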