How to determine the end of a non-NaN series in pandas

For a data frame
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 3.0, 7.0], [0.0, 5.0, 8.0], [0.0, 0.0, 0.0],
                   [1.0, 3.0, np.nan], [1.0, np.nan, np.nan]],
                  columns=[1, 2, 3], index=pd.date_range('20180101', periods=5))
which is
1 2 3
2018-01-01 NaN 3.0 7.0
2018-01-02 0.0 5.0 8.0
2018-01-03 0.0 0.0 0.0
2018-01-04 1.0 3.0 NaN
2018-01-05 1.0 NaN NaN
I would like to know when a non-NaN series (column) is over. The resulting data frame should look like this:
1 2 3
2018-01-01 False False False
2018-01-02 False False False
2018-01-03 False False False
2018-01-04 False False True
2018-01-05 False True True
I tried to work with
df.apply(lambda x: x.last_valid_index())
which results in
1 2018-01-05
2 2018-01-04
3 2018-01-03
So far so good. But how do I continue? All solutions (also those not using last_valid_index()) are welcome!

Back fill the missing values, then test for missing values:
df1 = df.bfill().isna()
print (df1)
1 2 3
2018-01-01 False False False
2018-01-02 False False False
2018-01-03 False False False
2018-01-04 False False True
2018-01-05 False True True
Detail:
print (df.bfill())
1 2 3
2018-01-01 0.0 3.0 7.0
2018-01-02 0.0 5.0 8.0
2018-01-03 0.0 0.0 0.0
2018-01-04 1.0 3.0 NaN
2018-01-05 1.0 NaN NaN
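Back filling propagates the next valid value upwards, so the only cells still missing after bfill are those with no valid value at or below them, i.e. everything after each column's last valid index. If you would rather continue from your last_valid_index attempt, a minimal sketch (assuming the df defined above):
last = df.apply(pd.Series.last_valid_index)  # last valid timestamp per column
df1 = pd.DataFrame({c: df.index > ts for c, ts in last.items()}, index=df.index)
print (df1)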

Related

Fill only the last of consecutive NaNs in Pandas with the mean of previous and next valid values

Fill only the last of consecutive NaNs in Pandas with the mean of previous and next valid values. If there is one NaN, fill it with the mean of the next and previous valid values. If there are two consecutive NaNs, impute only the second one with the mean of the next and previous valid values.
Series:
header
0 10.0
1 20.0
2 NaN
3 20.0
4 30.0
5 NaN
6 NaN
7 40.0
8 10.0
9 NaN
10 NaN
11 NaN
12 50.0
expected output:
header
0 10.0
1 20.0
2 20.0
3 20.0
4 30.0
5 NaN
6 35.0
7 40.0
8 10.0
9 NaN
10 NaN
11 30.0
12 50.0
The idea is to remove consecutive missing values except the last one, then interpolate, and assign the result back to the last missing value of each run by condition:
m = df['header'].isna()
# last NaN of each consecutive run: NaN here, but not NaN in the next row
mask = m & ~m.shift(-1, fill_value=False)
df.loc[mask, 'header'] = df.loc[mask | ~m, 'header'].interpolate()
print (df)
header
0 10.0
1 20.0
2 20.0
3 20.0
4 30.0
5 NaN
6 35.0
7 40.0
8 10.0
9 NaN
10 NaN
11 30.0
12 50.0
Details:
print (df.assign(m=m, mask=mask))
header m mask
0 10.0 False False
1 20.0 False False
2 20.0 True True
3 20.0 False False
4 30.0 False False
5 NaN True False
6 35.0 True True
7 40.0 False False
8 10.0 False False
9 NaN True False
10 NaN True False
11 30.0 True True
12 50.0 False False
print (df.loc[mask | ~m, 'header'])
0 10.0
1 20.0
2 NaN
3 20.0
4 30.0
6 NaN
7 40.0
8 10.0
11 NaN
12 50.0
Name: header, dtype: float64
A solution for interpolating per group is:
df.loc[mask, 'header'] = (df.loc[mask | ~m, 'header']
                          .groupby(df['groups'])
                          .transform(lambda x: x.interpolate()))
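As a self-contained sketch of this grouped variant (the groups column and its contents are assumptions for illustration only):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'header': [10.0, 20.0, np.nan, 20.0, 30.0, np.nan, np.nan, 40.0],
    'groups': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],  # hypothetical grouping
})

m = df['header'].isna()
mask = m & ~m.shift(-1, fill_value=False)  # last NaN of each consecutive run
df.loc[mask, 'header'] = (df.loc[mask | ~m, 'header']
                          .groupby(df['groups'])  # group labels align by index
                          .transform(lambda x: x.interpolate()))
print (df)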
You can try:
s = df['header']
m = s.isna()
# average of forward and back fill, masked back to NaN where the next row is also NaN
df['header'] = s.ffill().add(s.bfill()).div(2).mask(m & m.shift(-1, fill_value=False))
output and intermediates:
header output ffill bfill m m&m.shift(-1)
0 10.0 10.0 10.0 10.0 False False
1 20.0 20.0 20.0 20.0 False False
2 NaN 20.0 20.0 20.0 True False
3 20.0 20.0 20.0 20.0 False False
4 30.0 30.0 30.0 30.0 False False
5 NaN NaN 30.0 40.0 True True
6 NaN 35.0 30.0 40.0 True False
7 40.0 40.0 40.0 40.0 False False
8 10.0 10.0 10.0 10.0 False False
9 NaN NaN 10.0 50.0 True True
10 NaN NaN 10.0 50.0 True True
11 NaN 30.0 10.0 50.0 True False
12 50.0 50.0 50.0 50.0 False False
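Note the design choice here: only the last NaN of each run (and any isolated NaN) has a non-NaN next row, so the mask keeps the earlier NaNs of a run as NaN, while the survivors receive the average of the previous (ffill) and next (bfill) valid values.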

Pandas easy API to find out all inf or nan cells?

I have searched Stack Overflow about this, but all the answers are quite complex.
I want to output the row and column info of all cells that are inf or NaN.
You can replace np.inf with missing values, test them with DataFrame.isna, and then check for at least one True per row and per column with DataFrame.any, passing the result to DataFrame.loc to get the sub-DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, np.inf],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [np.inf, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7.0 1 inf a
1 b 5.0 NaN 3 3.0 a
2 c 4.0 9.0 5 6.0 a
3 d 5.0 4.0 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f inf 3.0 0 NaN b
m = df.replace(np.inf, np.nan).isna()
print (m)
A B C D E F
0 False False False False True False
1 False False True False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
5 False True False False True False
df1 = df.loc[m.any(axis=1), m.any()]
print (df1)
B C E
0 4.0 7.0 inf
1 5.0 NaN 3.0
5 inf 3.0 NaN
Or, if you need the index and column names in a DataFrame, use DataFrame.stack with Index.to_frame:
s = df.replace(np.inf, np.nan).stack(dropna=False)
df2 = s[s.isna()].index.to_frame(index=False)
print (df2)
0 1
0 0 E
1 1 C
2 5 B
3 5 E
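Another possible sketch, using positional indices via np.where instead of stacking (assuming the same df):
m = df.replace(np.inf, np.nan).isna()
# integer positions of the bad cells, translated back to labels
rows, cols = np.where(m)
out = pd.DataFrame({'row': df.index[rows], 'column': df.columns[cols]})
print (out)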

Pandas Equivalent of Excel COUNTIFS

I've read through some previous questions and am having trouble implementing. Here is my table.
Value Bool
abc TRUE
abc TRUE
bca TRUE
bca FALSE
asd FALSE
asd FALSE
I want this:
Value Bool Count
abc TRUE 2
abc TRUE 2
bca TRUE 1
bca FALSE 1
asd FALSE 0
asd FALSE 0
For each group of terms in Value, count the number of occurrences of TRUE, which is a boolean in my df.
In Excel you can use COUNTIFS to do this. Can someone please show me the way in Pandas?
Try with groupby transform:
df['Count'] = df.groupby('Value')['Bool'].transform('sum')
print(df)
Value Bool Count
0 abc True 2.0
1 abc True 2.0
2 bca True 1.0
3 bca False 1.0
4 asd False 0.0
5 asd False 0.0
Or:
df['Count'] = df.groupby('Value')['Bool'].transform(lambda x: x.sum())
print(df)
Value Bool Count
0 abc True 2
1 abc True 2
2 bca True 1
3 bca False 1
4 asd False 0
5 asd False 0
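For a closer COUNTIFS analogue with more than one criterion, a minimal sketch (the DataFrame construction and the second criterion are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({
    'Value': ['abc', 'abc', 'bca', 'bca', 'asd', 'asd'],
    'Bool': [True, True, True, False, False, False],
})

# per Value, count rows where Bool is True AND Value starts with 'a'
cond = df['Bool'] & df['Value'].str.startswith('a')
df['Count'] = cond.groupby(df['Value']).transform('sum')
print (df)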

How to forward fill row values with function in pandas MultiIndex dataframe?

I have the following MultiIndex dataframe:
Close ATR condition
Date Symbol
1990-01-01 A 24 1 True
B 72 1 False
C 40 3 False
D 21 5 True
1990-01-02 A 65 4 True
B 19 2 True
C 43 3 True
D 72 1 False
1990-01-03 A 92 5 False
B 32 3 True
C 52 2 False
D 33 1 False
I perform the following calculation on this dataframe:
data.loc[data.index.levels[0][0], 'Shares'] = 0
data.loc[data.index.levels[0][0], 'Closed_P/L'] = 0
data = data.reset_index()
Equity = 10000
def calcs(x):
    global Equity
    # skip the first date
    if x.index[0] == 0:
        return x
    # calculate Shares where condition is True
    x.loc[x['condition'] == True, 'Shares'] = np.floor((Equity * 0.02 / x['ATR']).astype(float))
    # other calculations
    x['Closed_P/L'] = x['Shares'] * x['Close']
    Equity += x['Closed_P/L'].sum()
    return x
data = data.groupby('Date').apply(calcs)
data['Equity'] = data.groupby('Date')['Closed_P/L'].transform('sum')
data['Equity'] = data.groupby('Symbol')['Equity'].cumsum() + Equity
data = data.set_index(['Date','Symbol'])
The output is:
Close ATR condition Shares Closed_P/L Equity
Date Symbol
1990-01-01 A 24 1.2 True 0.0 0.0 10000.0
B 72 1.4 False 0.0 0.0 10000.0
C 40 3 False 0.0 0.0 10000.0
D 21 5 True 0.0 0.0 10000.0
1990-01-02 A 65 4 True 50.0 3250.0 17988.0
B 19 2 True 100.0 1900.0 17988.0
C 43 3 True 66.0 2838.0 17988.0
D 72 1 False NaN NaN 17988.0
1990-01-03 A 92 5 False NaN NaN 21796.0
B 32 3 True 119.0 3808.0 21796.0
C 52 2 False NaN NaN 21796.0
D 33 1 False NaN NaN 21796.0
I want to forward fill Shares values - grouped by Symbol - in case condition evaluates to False (except for first date). So the Shares value on 1990-01-02 for D should be 0 (because on 1990-01-01 the Shares value for D was 0 and the condition on 1990-01-02 is False). Also values for Shares on 1990-01-03 for A, C and D should be 50, 66 and 0 respectively based on the logic described above. How can I do that?
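A possible sketch of an answer (not from the original thread): since Shares is NaN exactly where condition is False, a forward fill per Symbol on the final frame may be enough:
# assumes `data` as produced above, with MultiIndex (Date, Symbol) ordered by Date
data['Shares'] = data.groupby(level='Symbol')['Shares'].ffill()
print (data)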

returning same initial input when filtering dataframes' values

I have the following dataframe, obtained with pandas' read_html function.
A        1.48        2.64    1.02         2.46   2.73
B       658.4        14.33    7.41        15.35   8.59
C        3.76         2.07    4.61         2.26   2.05
D   513854.86         5.70    0.00         5.35  30.16
I would like to remove the rows that contain values over 150, so I did df1 = df[df > 150], but it returns the same table.
Then I thought to include the decimals in the route, route = pd.read_html(https//route, decimal='.'), but it keeps returning the same initial dataframe with no filters.
This would be my desired output:
A        1.48        2.64    1.02         2.46   2.73
C        3.76         2.07    4.61         2.26   2.05
Need:
print (df)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
1 B 658.40 14.33 7.41 15.35 8.59
2 C 3.76 2.07 4.61 2.26 2.05
3 D 513854.86 5.70 0.00 5.35 30.16
df1 = df[~(df.iloc[:, 1:] > 150).any(axis=1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Or:
df1 = df[(df.iloc[:, 1:] <= 150).all(axis=1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Explanation:
First select all columns except the first with iloc:
print (df.iloc[:, 1:])
1 2 3 4 5
0 1.48 2.64 1.02 2.46 2.73
1 658.40 14.33 7.41 15.35 8.59
2 3.76 2.07 4.61 2.26 2.05
3 513854.86 5.70 0.00 5.35 30.16
Then compare to get a boolean DataFrame:
print (df.iloc[:, 1:] > 150)
1 2 3 4 5
0 False False False False False
1 True False False False False
2 False False False False False
3 True False False False False
print (df.iloc[:, 1:] <= 150)
1 2 3 4 5
0 True True True True True
1 False True True True True
2 True True True True True
3 False True True True True
Then use all to check whether all values in a row are True,
or any to check whether at least one value is True:
print ((df.iloc[:, 1:] > 150).any(axis=1))
0 False
1 True
2 False
3 True
dtype: bool
print ((df.iloc[:, 1:] <= 150).all(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
Last, invert the first Series with ~ and filter by boolean indexing.
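If the comparison still changes nothing, read_html may have parsed the numeric columns as strings; a minimal sketch to coerce them first (assuming the df above):
# convert every column except the first to numbers; unparseable cells become NaN
df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')
df1 = df[(df.iloc[:, 1:] <= 150).all(axis=1)]
print (df1)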