How to forward fill row values with a function in a pandas MultiIndex dataframe?

I have the following MultiIndex dataframe:
                   Close  ATR  condition
Date       Symbol
1990-01-01 A          24    1       True
           B          72    1      False
           C          40    3      False
           D          21    5       True
1990-01-02 A          65    4       True
           B          19    2       True
           C          43    3       True
           D          72    1      False
1990-01-03 A          92    5      False
           B          32    3       True
           C          52    2      False
           D          33    1      False
I perform the following calculation on this dataframe:
import numpy as np
import pandas as pd

data.loc[data.index.levels[0][0], 'Shares'] = 0
data.loc[data.index.levels[0][0], 'Closed_P/L'] = 0
data = data.reset_index()
Equity = 10000

def calcs(x):
    global Equity
    # Skip the first date
    if x.index[0] == 0:
        return x
    # Calculate Shares where condition is True
    x.loc[x['condition'] == True, 'Shares'] = np.floor((Equity * 0.02 / x['ATR']).astype(float))
    # Other calculations
    x['Closed_P/L'] = x['Shares'] * x['Close']
    Equity += x['Closed_P/L'].sum()
    return x
data = data.groupby('Date').apply(calcs)
data['Equity'] = data.groupby('Date')['Closed_P/L'].transform('sum')
data['Equity'] = data.groupby('Symbol')['Equity'].cumsum() + Equity
data = data.set_index(['Date','Symbol'])
The output is:
                   Close  ATR  condition  Shares  Closed_P/L   Equity
Date       Symbol
1990-01-01 A          24    1       True     0.0         0.0  10000.0
           B          72    1      False     0.0         0.0  10000.0
           C          40    3      False     0.0         0.0  10000.0
           D          21    5       True     0.0         0.0  10000.0
1990-01-02 A          65    4       True    50.0      3250.0  17988.0
           B          19    2       True   100.0      1900.0  17988.0
           C          43    3       True    66.0      2838.0  17988.0
           D          72    1      False     NaN         NaN  17988.0
1990-01-03 A          92    5      False     NaN         NaN  21796.0
           B          32    3       True   119.0      3808.0  21796.0
           C          52    2      False     NaN         NaN  21796.0
           D          33    1      False     NaN         NaN  21796.0
I want to forward fill the Shares values - grouped by Symbol - where condition evaluates to False (except for the first date). So the Shares value on 1990-01-02 for D should be 0 (because on 1990-01-01 the Shares value for D was 0 and the condition on 1990-01-02 is False). Likewise, the Shares values on 1990-01-03 for A, C and D should be 50, 66 and 0 respectively, based on the logic described above. How can I do that?
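One possible approach (a sketch of my own, not from the original thread): after the code above runs, Shares is NaN exactly on the rows where condition is False beyond the first date, so a per-Symbol forward fill produces the desired values:
# Forward fill Shares within each Symbol; rows where condition is False
# (currently NaN) inherit the most recent Shares value for that Symbol.
data['Shares'] = data.groupby(level='Symbol')['Shares'].ffill()
The same pattern applies to Closed_P/L if those NaN values should also carry forward.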

Related

On use of any method

Here is some code from Kaggle which is said to remove outliers:
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
Wouldn't any return a single boolean - whether an item is in a list or not?
So what the code says is: save in the mask all absolute values in ft which are above the quantile (set by another variable)? What does the any stand for, and what is it for? Thank you.
I think the first part returns a DataFrame filled with boolean True and/or False values:
(ft.abs() > ft.abs().quantile(outl_thresh))
so DataFrame.any is added to test whether at least one value per row is True, reducing the result to a boolean Series.
df = pd.DataFrame({'a':[False, False, True],
                   'b':[False, True, True],
                   'c':[False, False, True]})
print (df)
       a      b      c
0  False  False  False
1  False   True  False
2   True   True   True
print (df.any(axis=1))
0    False   <- no True in this row
1     True   <- one True in this row
2     True   <- three Trues in this row
dtype: bool
The similar method, for testing whether all values are True, is DataFrame.all:
print (df.all(axis=1))
0 False
1 False
2 True
dtype: bool
The reason: filtering by boolean indexing requires a boolean Series, not a boolean DataFrame.
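A quick illustration of the difference (my own sketch, using the small df above): indexing with a boolean DataFrame masks cell-wise, while a boolean Series selects whole rows:
# Boolean DataFrame: the same shape comes back, False cells become NaN
print (df[df])
# Boolean Series: whole rows are selected where the Series is True
print (df[df.any(axis=1)])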
Another sample of data:
np.random.seed(2021)
ft = pd.DataFrame(np.random.randint(100, size=(10, 5))).sub(20)
print (ft)
    0   1   2   3   4
0  65  37 -20  74  66
1  24  42  71   9   1
2  73   4  -8  50  50
3  13 -13 -19  77   6
4  46  28  79  43  29
5  -4  30  34  32  73
6 -15  29  18  -6  51
7  65  50  21   1   5
8 -10  16  -1  37  62
9  70  -5  20  56  33
outl_thresh = 0.95
print (ft.abs().quantile(outl_thresh))
0 71.65
1 46.40
2 75.40
3 75.65
4 69.85
Name: 0.95, dtype: float64
print((ft.abs() > ft.abs().quantile(outl_thresh)))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True False False False False
3 False False False True False
4 False False True False False
5 False False False False True
6 False False False False False
7 False True False False False
8 False False False False False
9 False False False False False
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
print (outliers_mask)
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 True
8 False
9 False
dtype: bool
df1 = ft[outliers_mask]
print (df1)
0 1 2 3 4
2 73 4 -8 50 50
3 13 -13 -19 77 6
4 46 28 79 43 29
5 -4 30 34 32 73
7 65 50 21 1 5
df2 = ft[~outliers_mask]
print (df2)
0 1 2 3 4
0 65 37 -20 74 66
1 24 42 71 9 1
6 -15 29 18 -6 51
8 -10 16 -1 37 62
9 70 -5 20 56 33

Pandas easy API to find out all inf or nan cells?

I have searched Stack Overflow about this; all the answers are so complex.
I want to output the row and column info for all cells that are inf or NaN.
You can replace np.inf with missing values, test them with DataFrame.isna, and finally test for at least one True with DataFrame.any, passing the result to DataFrame.loc to get a sub-DataFrame:
df = pd.DataFrame({
    'A':list('abcdef'),
    'B':[4,5,4,5,5,np.inf],
    'C':[7,np.nan,9,4,2,3],
    'D':[1,3,5,7,1,0],
    'E':[np.inf,3,6,9,2,np.nan],
    'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7.0 1 inf a
1 b 5.0 NaN 3 3.0 a
2 c 4.0 9.0 5 6.0 a
3 d 5.0 4.0 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f inf 3.0 0 NaN b
m = df.replace(np.inf, np.nan).isna()
print (m)
A B C D E F
0 False False False False True False
1 False False True False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
5 False True False False True False
df = df.loc[m.any(axis=1), m.any()]
print (df)
B C E
0 4.0 7.0 inf
1 5.0 NaN 3.0
5 inf 3.0 NaN
Or, if you need the index and column names in a DataFrame, use DataFrame.stack with Index.to_frame:
s = df.replace(np.inf, np.nan).stack(dropna=False)
df1 = s[s.isna()].index.to_frame(index=False)
print (df1)
0 1
0 0 E
1 1 C
2 5 B
3 5 E
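If you want descriptive names on that positions frame (the names here are my own choice, not part of the original answer):
# Optional: label the stacked positions for readability
df1.columns = ['row', 'column']
print (df1)
   row column
0    0      E
1    1      C
2    5      B
3    5      E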

Percentage of False in a column, groupby

I'm fairly new to this. I'm trying to figure out how to calculate the percentage of element values that are True/False after a groupby command. Instead of a count, I need a percent.
I'd appreciate any kind of help.
Here's how my data looks:
            element   FY
comp  isB
1750  false      62   62
      true      305  305
1800  false      52   52
      true      356  356
# Print original DataFrame
>>> df
comp isB element FY
0 1750 False 62 62
1 1750 True 305 305
2 1800 False 52 52
3 1800 True 356 356
# Sum the number of elements per group
>>> df['total_count'] = df.groupby('comp')['element'].transform('sum')
>>> df
comp isB element FY total_count
0 1750 False 62 62 367
1 1750 True 305 305 367
2 1800 False 52 52 408
3 1800 True 356 356 408
# Calculate fraction or percent according to preference
>>> df['fraction'] = df['element'] / df['total_count']
>>> df['percent'] = df['fraction'] * 100
>>> df
comp isB element FY total_count fraction percent
0 1750 False 62 62 367 0.168937 16.893733
1 1750 True 305 305 367 0.831063 83.106267
2 1800 False 52 52 408 0.127451 12.745098
3 1800 True 356 356 408 0.872549 87.254902
# Get series using group-by
>>> df.groupby(['comp', 'isB'])['percent'].max()
comp isB
1750 False 16.893733
True 83.106267
1800 False 12.745098
True 87.254902
Name: percent, dtype: float64
You could just use .mean(), since numpy casts booleans to integers during that operation.
In [17]: import pandas as pd
In [18]: import numpy as np
In [19]: df = pd.DataFrame({'a': np.random.choice([True, False], size=10),
    ...:                    'b': np.random.choice(['x', 'y'], size=10)})
In [20]: df
Out[20]:
a b
0 False x
1 True y
2 False y
3 True x
4 True y
5 False y
6 False x
7 False y
8 True x
9 True y
In [21]: df.groupby(['b']).mean()
Out[21]:
a
b
x 0.5
y 0.5
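To turn those fractions into percentages, multiply by 100 (a small follow-up sketch, my own addition):
In [22]: df.groupby(['b'])['a'].mean().mul(100)
Out[22]:
b
x    50.0
y    50.0
Name: a, dtype: float64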

Count How Many Columns in Dataframe before NaN

I want to count how many columns of data there are (in a pd.DataFrame) before the first NaN in each row. My data:
df
     0  1  2  3  4    5    6    7    8    9   10   11   12   13
Id
A    1  1  2  3  3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
B    6  6  7  7  8    9   10  NaN  NaN  NaN  NaN  NaN  NaN  NaN
C    1  2  3  3  4    5    6    6    7    7    8    9   10  NaN
My desired output:
df_result
    count
Id
A       5
B       7
C      13
Thank you in advance for the answer.
Use (note the sample below has non-NaN values after the first NaN, to show they are not counted):
print (df)
   0  1  2  3  4    5     6    7    8    9   10   11    12    13
A  1  1  2  3  3  NaN   NaN  NaN  NaN  NaN  NaN  NaN   NaN  54.0
B  6  6  7  7  8  9.0  10.0  NaN  NaN  NaN  NaN  NaN   5.0   NaN
C  1  2  3  3  4  5.0   6.0  6.0  7.0  7.0  8.0  9.0  10.0   NaN
df = df.isnull().cumsum(axis=1).eq(0).sum(axis=1)
print (df)
A 5
B 7
C 13
dtype: int64
Detail:
First check NaNs:
print (df.isnull())
0 1 2 3 4 5 6 7 8 9 \
A False False False False False True True True True True
B False False False False False False False True True True
C False False False False False False False False False False
10 11 12 13
A True True True False
B True True False True
C False False False True
Get the cumulative sum - True is processed as 1, False as 0:
print (df.isnull().cumsum(axis=1))
0 1 2 3 4 5 6 7 8 9 10 11 12 13
A 0 0 0 0 0 1 2 3 4 5 6 7 8 8
B 0 0 0 0 0 0 0 1 2 3 4 5 5 6
C 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Compare with 0:
print (df.isnull().cumsum(axis=1).eq(0))
0 1 2 3 4 5 6 7 8 9 10 \
A True True True True True False False False False False False
B True True True True True True True False False False False
C True True True True True True True True True True True
11 12 13
A False False False
B False False False
C True True False
Sum the boolean mask - each True counts as 1:
print (df.isnull().cumsum(axis=1).eq(0).sum(axis=1))
A 5
B 7
C 13
dtype: int64
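An equivalent alternative (my own sketch, not part of the original answer): take a running product of the not-NaN mask along each row. The product drops to 0 at the first NaN and stays there, so the row sum counts only the leading non-NaN cells:
# Run on the original DataFrame (before the reassignment above):
# cumprod zeroes everything from the first NaN onward
counts = df.notna().astype(int).cumprod(axis=1).sum(axis=1)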

returning same initial input when filtering dataframes' values

I have the following dataframe, obtained with pandas' read_html function.
A        1.48        2.64    1.02         2.46   2.73
B       658.4        14.33    7.41        15.35   8.59
C        3.76         2.07    4.61         2.26   2.05
D   513854.86         5.70    0.00         5.35  30.16
I would like to remove the rows that contain values over 150, so I did df1 = df[df > 150]; however, it returns the same table.
Then I thought to specify the decimals in the route, route = pd.read_html('https//route', decimal='.'), and it still returns the same initial dataframe with no filters.
This would be my desired output:
A        1.48        2.64    1.02         2.46   2.73
C        3.76         2.07    4.61         2.26   2.05
You need:
print (df)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
1 B 658.40 14.33 7.41 15.35 8.59
2 C 3.76 2.07 4.61 2.26 2.05
3 D 513854.86 5.70 0.00 5.35 30.16
df1 = df[~(df.iloc[:, 1:] > 150).any(axis=1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Or:
df1 = df[(df.iloc[:, 1:] <= 150).all(axis=1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Explanation:
First select all columns except the first with iloc:
print (df.iloc[:, 1:])
1 2 3 4 5
0 1.48 2.64 1.02 2.46 2.73
1 658.40 14.33 7.41 15.35 8.59
2 3.76 2.07 4.61 2.26 2.05
3 513854.86 5.70 0.00 5.35 30.16
Then compare to get a boolean DataFrame:
print (df.iloc[:, 1:] > 150)
1 2 3 4 5
0 False False False False False
1 True False False False False
2 False False False False False
3 True False False False False
print (df.iloc[:, 1:] <= 150)
1 2 3 4 5
0 True True True True True
1 False True True True True
2 True True True True True
3 False True True True True
Then use all to check whether all values in a row are True,
or any to check whether at least one value is True:
print ((df.iloc[:, 1:] > 150).any(axis=1))
0 False
1 True
2 False
3 True
dtype: bool
print ((df.iloc[:, 1:] <= 150).all(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
Finally, invert the first Series with ~ and filter by boolean indexing.
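For completeness, a self-contained runnable version of the second variant (the data is hard-coded here in place of the elided read_html call):
import pandas as pd

# Reconstruction of the scraped table
df = pd.DataFrame([
    ['A', 1.48, 2.64, 1.02, 2.46, 2.73],
    ['B', 658.40, 14.33, 7.41, 15.35, 8.59],
    ['C', 3.76, 2.07, 4.61, 2.26, 2.05],
    ['D', 513854.86, 5.70, 0.00, 5.35, 30.16],
])

# Keep only the rows whose numeric values are all <= 150
df1 = df[(df.iloc[:, 1:] <= 150).all(axis=1)]
print (df1)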