Fill only the last of consecutive NaNs in Pandas with the mean of previous and next valid values - pandas

Fill only the last of consecutive NaNs in Pandas with the mean of the previous and next valid values. If there is a single NaN, fill it with the mean of the next and previous values. If there are two consecutive NaNs, impute only the second one with the mean of the next and previous valid values.
Series:
expected output:

The idea is to exclude all consecutive missing values except the last one of each run, then use interpolate and assign the result back to those last missing values via a boolean mask:
m = df['header'].isna()
# True only for the last NaN of each consecutive run
mask = m & ~m.shift(-1, fill_value=False)
df.loc[mask, 'header'] = df.loc[mask | ~m, 'header'].interpolate()
print (df)
header
0 10.0
1 20.0
2 20.0
3 20.0
4 30.0
5 NaN
6 35.0
7 40.0
8 10.0
9 NaN
10 NaN
11 30.0
12 50.0
Details:
print (df.assign(m=m, mask=mask))
header m mask
0 10.0 False False
1 20.0 False False
2 20.0 True True
3 20.0 False False
4 30.0 False False
5 NaN True False
6 35.0 True True
7 40.0 False False
8 10.0 False False
9 NaN True False
10 NaN True False
11 30.0 True True
12 50.0 False False
print (df.loc[mask | ~m, 'header'])
0 10.0
1 20.0
2 NaN
3 20.0
4 30.0
6 NaN
7 40.0
8 10.0
11 NaN
12 50.0
Name: header, dtype: float64
The solution for interpolating per group is:
df.loc[mask, 'header'] = (df.loc[mask | ~m, 'header']
                            .groupby(df['groups'])
                            .transform(lambda x: x.interpolate()))
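For reference, a self-contained sketch of the non-grouped solution; the input series is reconstructed from the expected output printed above:

```python
import numpy as np
import pandas as pd

# input reconstructed from the expected output shown above
df = pd.DataFrame({'header': [10, 20, np.nan, 20, 30, np.nan, np.nan,
                              40, 10, np.nan, np.nan, np.nan, 50]})

m = df['header'].isna()
# True only for the last NaN of each consecutive run
mask = m & ~m.shift(-1, fill_value=False)
# interpolate over the valid values plus the last NaN of each run,
# then write the interpolated values back to those positions
df.loc[mask, 'header'] = df.loc[mask | ~m, 'header'].interpolate()
```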

You can try:
s = df['header']
m = s.isna()
# mean of ffill and bfill, re-blanked wherever a NaN is followed by another NaN
df['header'] = s.ffill().add(s.bfill()).div(2).mask(m & m.shift(-1, fill_value=False))
output and intermediates:
header output ffill bfill m m&m.shift(-1)
0 10.0 10.0 10.0 10.0 False False
1 20.0 20.0 20.0 20.0 False False
2 NaN 20.0 20.0 20.0 True False
3 20.0 20.0 20.0 20.0 False False
4 30.0 30.0 30.0 30.0 False False
5 NaN NaN 30.0 40.0 True True
6 NaN 35.0 30.0 40.0 True False
7 40.0 40.0 40.0 40.0 False False
8 10.0 10.0 10.0 10.0 False False
9 NaN NaN 10.0 50.0 True True
10 NaN NaN 10.0 50.0 True True
11 NaN 30.0 10.0 50.0 True False
12 50.0 50.0 50.0 50.0 False False
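The same input as a self-contained sketch; the intermediates match the table above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'header': [10, 20, np.nan, 20, 30, np.nan, np.nan,
                              40, 10, np.nan, np.nan, np.nan, 50]})

s = df['header']
m = s.isna()
# mean of the previous valid value (ffill) and the next valid value (bfill),
# re-blanked wherever a NaN is directly followed by another NaN
df['header'] = s.ffill().add(s.bfill()).div(2).mask(m & m.shift(-1, fill_value=False))
```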

Related

Pandas easy API to find out all inf or nan cells?

I have searched Stack Overflow about this, but all answers are quite complex.
I want to output the row and column info for all cells that are inf or NaN.
You can replace np.inf with missing values, test them with DataFrame.isna, then test for at least one True per row and per column with DataFrame.any and pass the results to DataFrame.loc to get the sub-DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, np.inf],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [np.inf, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7.0 1 inf a
1 b 5.0 NaN 3 3.0 a
2 c 4.0 9.0 5 6.0 a
3 d 5.0 4.0 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f inf 3.0 0 NaN b
m = df.replace(np.inf, np.nan).isna()
print (m)
A B C D E F
0 False False False False True False
1 False False True False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
5 False True False False True False
df = df.loc[m.any(axis=1), m.any()]
print (df)
B C E
0 4.0 7.0 inf
1 5.0 NaN 3.0
5 inf 3.0 NaN
Or, if you need the index and column names in a DataFrame, use DataFrame.stack with Index.to_frame:
s = df.replace(np.inf, np.nan).stack(dropna=False)
df1 = s[s.isna()].index.to_frame(index=False)
print (df1)
0 1
0 0 E
1 1 C
2 5 B
3 5 E
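A self-contained sketch of both steps; note the (row, column) listing here uses np.where on the boolean mask rather than stack, which sidesteps stack's deprecated dropna= argument:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, np.inf],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [np.inf, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb'),
})

m = df.replace(np.inf, np.nan).isna()
# sub-DataFrame restricted to the rows and columns containing any inf/NaN
sub = df.loc[m.any(axis=1), m.any()]
# (row, column) pairs of every inf/NaN cell
rows, cols = np.where(m.to_numpy())
bad = pd.DataFrame({'row': df.index[rows], 'column': df.columns[cols]})
```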

How to determine the end of a non-NaN series in pandas

For a data frame
df = pd.DataFrame([[np.nan, 3.0, 7.0], [0.0, 5.0, 8.0], [0.0, 0.0, 0.0], [1.0, 3.0, np.nan], [1.0, np.nan, np.nan]],
columns=[1, 2, 3], index=pd.date_range('20180101', periods=5))
which is
1 2 3
2018-01-01 NaN 3.0 7.0
2018-01-02 0.0 5.0 8.0
2018-01-03 0.0 0.0 0.0
2018-01-04 1.0 3.0 NaN
2018-01-05 1.0 NaN NaN
I would like to know when a non-NaN series (column) is over. The resulting data frame should look like:
1 2 3
2018-01-01 False False False
2018-01-02 False False False
2018-01-03 False False False
2018-01-04 False False True
2018-01-05 False True True
I tried to work with
df.apply(lambda x: x.last_valid_index())
which results in
1 2018-01-05
2 2018-01-04
3 2018-01-03
So far so good. But how to continue? All solutions (also those not containing last_valid_index()) are welcome!
Back fill the missing values, then test for missing values:
df1 = df.bfill().isna()
print (df1)
1 2 3
2018-01-01 False False False
2018-01-02 False False False
2018-01-03 False False False
2018-01-04 False False True
2018-01-05 False True True
Detail:
print (df.bfill())
1 2 3
2018-01-01 0.0 3.0 7.0
2018-01-02 0.0 5.0 8.0
2018-01-03 0.0 0.0 0.0
2018-01-04 1.0 3.0 NaN
2018-01-05 1.0 NaN NaN
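As a self-contained sketch with the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, 3.0, 7.0], [0.0, 5.0, 8.0], [0.0, 0.0, 0.0],
     [1.0, 3.0, np.nan], [1.0, np.nan, np.nan]],
    columns=[1, 2, 3], index=pd.date_range('20180101', periods=5))

# a cell is past the end of its column exactly when even back filling
# cannot supply a value for it
ended = df.bfill().isna()
```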

How to forward fill row values with function in pandas MultiIndex dataframe?

I have the following MultiIndex dataframe:
Close ATR condition
Date Symbol
1990-01-01 A 24 1 True
B 72 1 False
C 40 3 False
D 21 5 True
1990-01-02 A 65 4 True
B 19 2 True
C 43 3 True
D 72 1 False
1990-01-03 A 92 5 False
B 32 3 True
C 52 2 False
D 33 1 False
I perform the following calculation on this dataframe:
data.loc[data.index.levels[0][0], 'Shares'] = 0
data.loc[data.index.levels[0][0], 'Closed_P/L'] = 0
data = data.reset_index()
Equity = 10000

def calcs(x):
    global Equity
    # Skip first date
    if x.index[0] == 0:
        return x
    # calculate Shares where condition is True
    x.loc[x['condition'] == True, 'Shares'] = np.floor((Equity * 0.02 / x['ATR']).astype(float))
    # other calculations
    x['Closed_P/L'] = x['Shares'] * x['Close']
    Equity += x['Closed_P/L'].sum()
    return x

data = data.groupby('Date').apply(calcs)
data['Equity'] = data.groupby('Date')['Closed_P/L'].transform('sum')
data['Equity'] = data.groupby('Symbol')['Equity'].cumsum() + Equity
data = data.set_index(['Date','Symbol'])
The output is:
Close ATR condition Shares Closed_P/L Equity
Date Symbol
1990-01-01 A 24 1.2 True 0.0 0.0 10000.0
B 72 1.4 False 0.0 0.0 10000.0
C 40 3 False 0.0 0.0 10000.0
D 21 5 True 0.0 0.0 10000.0
1990-01-02 A 65 4 True 50.0 3250.0 17988.0
B 19 2 True 100.0 1900.0 17988.0
C 43 3 True 66.0 2838.0 17988.0
D 72 1 False NaN NaN 17988.0
1990-01-03 A 92 5 False NaN NaN 21796.0
B 32 3 True 119.0 3808.0 21796.0
C 52 2 False NaN NaN 21796.0
D 33 1 False NaN NaN 21796.0
I want to forward fill Shares values - grouped by Symbol - in case condition evaluates to False (except for first date). So the Shares value on 1990-01-02 for D should be 0 (because on 1990-01-01 the Shares value for D was 0 and the condition on 1990-01-02 is False). Also values for Shares on 1990-01-03 for A, C and D should be 50, 66 and 0 respectively based on the logic described above. How can I do that?
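No answer is recorded here, but a minimal sketch of the described behaviour - forward filling Shares within each Symbol group - might look like this (the toy frame only mirrors the Shares column of the question, with NaN wherever condition was False):

```python
import numpy as np
import pandas as pd

# toy frame: Shares is NaN wherever condition was False
df = pd.DataFrame({
    'Date': ['1990-01-01', '1990-01-01', '1990-01-02',
             '1990-01-02', '1990-01-03', '1990-01-03'],
    'Symbol': ['A', 'D', 'A', 'D', 'A', 'D'],
    'Shares': [0.0, 0.0, 50.0, np.nan, np.nan, np.nan],
})

# carry the last known Shares value forward within each Symbol
df['Shares'] = df.groupby('Symbol')['Shares'].ffill()
```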

How can I find non-NaN occurrence patterns in a multi-index dataframe?

I am dealing with a multi-indexed dataframe that looks like this:
(sorry for writing null instead of NaN)
What could be the most efficient way to find occurrences of the patterns I highlighted?
I expect to reach a result like this one:
Thanks in advance for any insight!
For who wants to play with it:
from io import StringIO
import pandas as pd
df1_text = """ A B C
STAND1 CH1 NaN NaN NaN
STAND1 CH2 NaN 11.2 NaN
STAND1 CH3 12.4 7.0 NaN
STAND1 CH4 10.2 2.0 NaN
STAND2 CH1 NaN 2.5 NaN
STAND2 CH2 NaN 11.2 NaN
STAND2 CH3 NaN NaN 6.3
STAND2 CH4 NaN NaN 23.5
STAND3 CH1 NaN NaN NaN
STAND3 CH2 12.3 NaN NaN
STAND3 CH3 5.3 4.5 NaN
STAND3 CH4 7.2 25.6 NaN"""
df1 = pd.read_csv(StringIO(df1_text), delim_whitespace=True)
Here is one approach. In short, you can use
df2 = df1.swaplevel(0,1).unstack().notnull()
print(pd.Series(np.dot(df2.index, df2)).value_counts())
The first line creates df2 that lines up the channel column with 9 columns of boolean indicators of cells that are not null, e.g.
# A B C
# STAND1 STAND2 STAND3 STAND1 STAND2 STAND3 STAND1 STAND2 STAND3
# CH1 False False False False True False False False False
# CH2 False False True True True False False False False
# CH3 True False True True False True False True False
# CH4 True False True True False True False True False
The goal of the second step is to replace each column in df2 with a string representing an event. Using the fact that Python strings can be multiplied by integers, we get
np.dot([CH1, CH2, CH3, CH4], [True, True, False, False]) <==>
'CH1' * True + 'CH2' * True + 'CH3' * False + 'CH4' * False <==>
'CH1' * 1 + 'CH2' * 1 + 'CH3' * 0 + 'CH4' * 0 <==>
'CH1' + 'CH2' <==>
'CH1CH2'
This has a cosmetic defect of omitting commas and including an "empty" event.
Full example:
from io import StringIO
import numpy as np
import pandas as pd
df1_text = """ A B C
STAND1 CH1 NaN NaN NaN
STAND1 CH2 NaN 11.2 NaN
STAND1 CH3 12.4 7.0 NaN
STAND1 CH4 10.2 2.0 NaN
STAND2 CH1 NaN 2.5 NaN
STAND2 CH2 NaN 11.2 NaN
STAND2 CH3 NaN NaN 6.3
STAND2 CH4 NaN NaN 23.5
STAND3 CH1 NaN NaN NaN
STAND3 CH2 12.3 NaN NaN
STAND3 CH3 5.3 4.5 NaN
STAND3 CH4 7.2 25.6 NaN"""
df1 = pd.read_csv(StringIO(df1_text), delim_whitespace=True)
# solution
df2 = df1.swaplevel(0,1).unstack().notnull()
print(pd.Series(np.dot(df2.index, df2)).value_counts())
# In [559]: df1.swaplevel(0,1).unstack().notnull()
# Out[559]:
# A B C
# STAND1 STAND2 STAND3 STAND1 STAND2 STAND3 STAND1 STAND2 STAND3
# CH1 False False False False True False False False False
# CH2 False False True True True False False False False
# CH3 True False True True False True False True False
# CH4 True False True True False True False True False
# In [560]: np.dot(df2.index, df2)
# Out[560]:
# array(['CH3CH4', '', 'CH2CH3CH4', 'CH2CH3CH4', 'CH1CH2', 'CH3CH4', '',
# 'CH3CH4', ''], dtype=object)
# In [561]: pd.Series(np.dot(df2.index, df2)).value_counts()
# Out[561]:
# CH3CH4 3
# 3
# CH2CH3CH4 2
# CH1CH2 1
# dtype: int64
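The cosmetic defect can be addressed by comma-joining the channel labels via boolean indexing instead of np.dot, and dropping the empty events - a sketch:

```python
from io import StringIO
import pandas as pd

df1_text = """ A B C
STAND1 CH1 NaN NaN NaN
STAND1 CH2 NaN 11.2 NaN
STAND1 CH3 12.4 7.0 NaN
STAND1 CH4 10.2 2.0 NaN
STAND2 CH1 NaN 2.5 NaN
STAND2 CH2 NaN 11.2 NaN
STAND2 CH3 NaN NaN 6.3
STAND2 CH4 NaN NaN 23.5
STAND3 CH1 NaN NaN NaN
STAND3 CH2 12.3 NaN NaN
STAND3 CH3 5.3 4.5 NaN
STAND3 CH4 7.2 25.6 NaN"""
df1 = pd.read_csv(StringIO(df1_text), sep=r'\s+')

df2 = df1.swaplevel(0, 1).unstack().notnull()
# per column: comma-join the channel labels of the non-null cells
events = [','.join(df2.index[df2[c].to_numpy()]) for c in df2.columns]
# drop the "no event" columns before counting
counts = pd.Series([e for e in events if e]).value_counts()
```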

Count How Many Columns in Dataframe before NaN

I want to count how many columns of data (pd.DataFrame) each row has before the first NaN. My data:
df
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Id
A 1 1 2 3 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
B 6 6 7 7 8 9 10 NaN NaN NaN NaN NaN NaN NaN
C 1 2 3 3 4 5 6 6 7 7 8 9 10 NaN
my desire output:
df_result
count
Id
A 5
B 7
C 13
Thank you in advance for the answer.
Use:
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
A 1 1 2 3 3 NaN NaN NaN NaN NaN NaN NaN NaN 54.0
B 6 6 7 7 8 9.0 10.0 NaN NaN NaN NaN NaN 5.0 NaN
C 1 2 3 3 4 5.0 6.0 6.0 7.0 7.0 8.0 9.0 10.0 NaN
df = df.isnull().cumsum(axis=1).eq(0).sum(axis=1)
print (df)
A 5
B 7
C 13
dtype: int64
Detail:
First check NaNs:
print (df.isnull())
0 1 2 3 4 5 6 7 8 9 \
A False False False False False True True True True True
B False False False False False False False True True True
C False False False False False False False False False False
10 11 12 13
A True True True False
B True True False True
C False False False True
Get the cumulative sum - True values are processed as 1, False as 0:
print (df.isnull().cumsum(axis=1))
0 1 2 3 4 5 6 7 8 9 10 11 12 13
A 0 0 0 0 0 1 2 3 4 5 6 7 8 8
B 0 0 0 0 0 0 0 1 2 3 4 5 5 6
C 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Compare with 0:
print (df.isnull().cumsum(axis=1).eq(0))
0 1 2 3 4 5 6 7 8 9 10 \
A True True True True True False False False False False False
B True True True True True True True False False False False
C True True True True True True True True True True True
11 12 13
A False False False
B False False False
C True True False
Sum the boolean mask - True values count as 1:
print (df.isnull().cumsum(axis=1).eq(0).sum(axis=1))
A 5
B 7
C 13
dtype: int64
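Self-contained, with the answer's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 2, 3, 3] + [np.nan] * 8 + [54.0],
                   [6, 6, 7, 7, 8, 9, 10] + [np.nan] * 5 + [5.0, np.nan],
                   [1, 2, 3, 3, 4, 5, 6, 6, 7, 7, 8, 9, 10, np.nan]],
                  index=['A', 'B', 'C'])

# running count of NaNs per row; positions where the count is still 0
# precede the first NaN, so summing them counts the leading valid cells
out = df.isnull().cumsum(axis=1).eq(0).sum(axis=1)
```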