(pandas) Why does .bfill().ffill() act differently than ffill().bfill() on groups? - pandas

I think I'm missing something basic conceptually, but I'm not able to find the answer in the docs.
>>> df=pd.DataFrame({'a':[1,1,2,2,3,3], 'b':[5,np.nan, 6, np.nan, np.nan, np.nan]})
>>> df
a b
0 1 5.0
1 1 NaN
2 2 6.0
3 2 NaN
4 3 NaN
5 3 NaN
Using ffill() and then bfill():
>>> df.groupby('a')['b'].ffill().bfill()
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Using bfill() and then ffill():
>>> df.groupby('a')['b'].bfill().ffill()
0 5.0
1 5.0
2 6.0
3 6.0
4 6.0
5 6.0
Doesn't the second way break the groupings? Will the first way always make sure that the values are filled in only with other values in that group?

I think you need:
print (df.groupby('a')['b'].apply(lambda x: x.ffill().bfill()))
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Name: b, dtype: float64
print (df.groupby('a')['b'].apply(lambda x: x.bfill().ffill()))
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Name: b, dtype: float64
because in your sample only first ffill or bfill is DataFrameGroupBy.ffill or DataFrameGroupBy.bfill, second is working with output Series. So it break groups, because Series has no groups.
print (df.groupby('a')['b'].ffill())
0 5.0
1 5.0
2 6.0
3 6.0
4 NaN
5 NaN
Name: b, dtype: float64
print (df.groupby('a')['b'].bfill())
0 5.0
1 NaN
2 6.0
3 NaN
4 NaN
5 NaN
Name: b, dtype: float64

Related

How do I append an uneven column to an existing one?

I am having trouble appending later values from column C to column A within the same df using pandas. I have tried .append and .concat with ignore_index=True, still not working.
import pandas as pd
d = {'a':[1,2,3,None, None], 'b':[7,8,9, None, None], 'c':[None, None, None, 5, 6]}
df = pd.DataFrame(d)
df['a'] = df['a'].append(df['c'], ignore_index=True)
print(df)
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 NaN NaN 5.0
4 NaN NaN 6.0
Desired:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
Thank you for updating that, this is what I would do:
df['a'] = df['a'].fillna(df['c'])
print(df)
Output:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0

How to represent the column with max Nan values in pandas df?

i can show it by: df.isnull().sum() and get the max value with: df.isnull().sum().max() ,
but someone can tell me how to represent the column name with max Nan's ?
Thank you all!
Use Series.idxmax with DataFrame.loc for filter column with most missing values:
df.loc[:, df.isnull().sum().idxmax()]
If need select multiple columns with more maximes compare Series with max value:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,np.nan,5,np.nan,4],
'C':[7,8,9,np.nan,2,np.nan],
'D':[1,np.nan,5,7,1,0]
})
print (df)
A B C D
0 a 4.0 7.0 1.0
1 b 5.0 8.0 NaN
2 c NaN 9.0 5.0
3 d 5.0 NaN 7.0
4 e NaN 2.0 1.0
5 f 4.0 NaN 0.0
s = df.isnull().sum()
df = df.loc[:, s.eq(s.max())]
print (df)
B C
0 4.0 7.0
1 5.0 8.0
2 NaN 9.0
3 5.0 NaN
4 NaN 2.0
5 4.0 NaN

How to keep True and None Value using pandas?

I've one DataFrame
import pandas as pd
data = {'a': [1,2,3,None,4,None,2,4,5,None],'b':[6,6,6,'NaN',4,'NaN',11,11,11,'NaN']}
df = pd.DataFrame(data)
condition = (df['a']>2) | (df['a'] == None)
print(df[condition])
a b
0 1.0 6
1 2.0 6
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
6 2.0 11
7 4.0 11
8 5.0 11
9 NaN NaN
Here, i've to keep where condition is coming True and Where None is there i want to keep those rows as well.
Expected output is :
a b
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
7 4.0 11
8 5.0 11
9 NaN NaN
Thanks in Advance
You can use another | or condition (Note: See #ALlolz's comment, you shouldnt compare a series with np.nan)
condition = (df['a']>2) | (df['a'].isna())
df[condition]
a b
2 3.0 6
3 NaN NaN
4 4.0 4
5 NaN NaN
7 4.0 11
8 5.0 11
9 NaN NaN

How to perform a rolling window on a pandas DataFrame, whereby each row consists nan values that should not be replaced?

I have the following dataframe:
df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan,1],
[0, 1, 2 ,np.nan, np.nan, np.nan,np.nan,1],
[0, 2, 2 ,np.nan, 2, np.nan,1,1]])
With output:
0 1 2 3 4 5 6 7
0 0 1 2 4 NaN NaN NaN 1
1 0 1 2 NaN NaN NaN NaN 1
2 0 2 2 NaN 2 NaN 1 1
with dtypes:
df.dtypes
0 int64
1 int64
2 int64
3 float64
4 float64
5 float64
6 float64
7 int64
Then the underneath rolling summation is applied:
df.rolling(window = 7, min_periods =1, axis = 'columns').sum()
And the output is as follows:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 4.0 4.0 4.0 4.0 4.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 2.0 2.0 3.0 5.0
I notice that the rolling window stops and starts again whenever the dtype of the next column is different.
I however have a dataframe whereby all columns are of the same object type.
df = df.astype('object')
which has output:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 7.0 7.0 7.0 8.0
1 0.0 1.0 3.0 3.0 3.0 3.0 3.0 4.0
2 0.0 2.0 4.0 4.0 6.0 6.0 7.0 8.0
My desired output however, stops and starts again after a nan value appears. This would look like:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 NaN NaN NaN 8.0
1 0.0 1.0 3.0 NaN NaN NaN Nan 4.0
2 0.0 2.0 4.0 NaN 6.0 NaN 7.0 8.0
I figured there must be a way that NaN values are not considered but also not filled in with values obtained from the rolling window.
Anything would help!
Workaround is:
Where are the nan-values located:
nan = df.isnull()
Apply the rolling window.
df = df.rolling(window = 7, min_periods =1, axis = 'columns').sum()
Only show values labeled as false.
df[~nan]

How to select NaN values in pandas in specific range

I have a dataframe like this:
df = pd.DataFrame({'col1': [5,6,np.nan, np.nan,np.nan, 4, np.nan, np.nan,np.nan, np.nan,7,8,8, np.nan, 5 , np.nan]})
df:
col1
0 5.0
1 6.0
2 NaN
3 NaN
4 NaN
5 4.0
6 NaN
7 NaN
8 NaN
9 NaN
10 7.0
11 8.0
12 8.0
13 NaN
14 5.0
15 NaN
These NaN values should be replaced in the following way. The first selection should look like this.
2 NaN
3 NaN
4 NaN
5 4.0
6 NaN
7 NaN
8 NaN
9 NaN
And then these Nan values should be replace with the only value in that selection, 4.
The second selection is:
13 NaN
14 5.0
15 NaN
and these NaN values should be replaced with 5.
With isnull() you can select the NaN values in a dataframe but how are able to filter/select these specific ranges in pandas?
Solution if missing values are around one non missing val - solution create unique groups and replace in groups by forward and back filling:
#test missing values
s = df['col1'].isna()
#create unique groups
v = s.ne(s.shift()).cumsum()
#count groups and get only 1 value around, filter only misising values groups
mask = v.map(v.value_counts()).eq(1) | s
#groups for replacement per groups
g = mask.ne(mask.shift()).cumsum()
df['col2'] = df.groupby(g)['col1'].apply(lambda x: x.ffill().bfill())
print (df)
col1 col2
0 5.0 5.0
1 6.0 6.0
2 NaN 4.0
3 NaN 4.0
4 NaN 4.0
5 4.0 4.0
6 NaN 4.0
7 NaN 4.0
8 NaN 4.0
9 NaN 4.0
10 7.0 7.0
11 8.0 8.0
12 8.0 8.0
13 NaN 5.0
14 5.0 5.0
15 NaN 5.0