Returning the same initial input when filtering dataframe values - pandas

I have the following dataframe, obtained with pandas' read_html function:
A        1.48        2.64    1.02         2.46   2.73
B       658.4        14.33    7.41        15.35   8.59
C        3.76         2.07    4.61         2.26   2.05
D   513854.86         5.70    0.00         5.35  30.16
I would like to remove the rows that contain values over 150, so I tried df1 = df[df > 150]; however, it returns the same table.
Then I thought to specify the decimal separator in the call, route = pd.read_html(https//route, decimal='.'), but it still returns the same initial dataframe with no filtering.
This would be my desired output:
A        1.48        2.64    1.02         2.46   2.73
C        3.76         2.07    4.61         2.26   2.05

Use:
print (df)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
1 B 658.40 14.33 7.41 15.35 8.59
2 C 3.76 2.07 4.61 2.26 2.05
3 D 513854.86 5.70 0.00 5.35 30.16
df1 = df[~(df.iloc[:, 1:] > 150).any(axis=1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Or:
df1 = df[(df.iloc[:, 1:] <= 150).all(axis=1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Explanation:
First select all columns except the first with iloc:
print (df.iloc[:, 1:])
1 2 3 4 5
0 1.48 2.64 1.02 2.46 2.73
1 658.40 14.33 7.41 15.35 8.59
2 3.76 2.07 4.61 2.26 2.05
3 513854.86 5.70 0.00 5.35 30.16
Then compare to get a boolean DataFrame:
print (df.iloc[:, 1:] > 150)
1 2 3 4 5
0 False False False False False
1 True False False False False
2 False False False False False
3 True False False False False
print (df.iloc[:, 1:] <= 150)
1 2 3 4 5
0 True True True True True
1 False True True True True
2 True True True True True
3 False True True True True
Then use all to check whether all values in a row are True,
or any to check whether at least one value is True:
print ((df.iloc[:, 1:] > 150).any(axis=1))
0 False
1 True
2 False
3 True
dtype: bool
print ((df.iloc[:, 1:] <= 150).all(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
Finally, invert the first Series with ~ and filter by boolean indexing.
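For reference, a minimal self-contained sketch of the whole approach (the data is retyped from the question; the integer column labels are an assumption matching read_html's default):

import pandas as pd

df = pd.DataFrame({0: ['A', 'B', 'C', 'D'],
                   1: [1.48, 658.40, 3.76, 513854.86],
                   2: [2.64, 14.33, 2.07, 5.70],
                   3: [1.02, 7.41, 4.61, 0.00],
                   4: [2.46, 15.35, 2.26, 5.35],
                   5: [2.73, 8.59, 2.05, 30.16]})

# df[df > 150] cannot drop rows: the comparison yields a boolean DataFrame,
# and indexing with it only masks cells with NaN (here it would even raise,
# because column 0 holds strings). Reducing to one boolean per row does
# drop rows:
df1 = df[(df.iloc[:, 1:] <= 150).all(axis=1)]
print (df1)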

Related

Working with a multiindex dataframe, to get summation results over a boolean column, based on a condition from another column

We have a multiindex dataframe that looks like:
                              date  condition_1  condition_2
item1 0  2021-06-10 06:30:00+00:00         True        False
      1  2021-06-10 07:00:00+00:00        False         True
      2  2021-06-10 07:30:00+00:00         True         True
item2 3  2021-06-10 06:30:00+00:00         True        False
      4  2021-06-10 07:00:00+00:00         True         True
      5  2021-06-10 07:30:00+00:00         True         True
item3 6  2021-06-10 06:30:00+00:00         True         True
      7  2021-06-10 07:00:00+00:00        False         True
      8  2021-06-10 07:30:00+00:00         True         True
The value of date repeats between items (because the df is a result of a default concat on a dictionary of dataframes).
The logic we basically want to vectorize is "for every date where condition_1 is true for all items: sum the occurrences where condition_2 is true in a new results column for all of them".
The result would basically look like this, based on the above example (comments on how each value is derived appear next to the result column):
                              date  condition_1  condition_2  result
item1 0  2021-06-10 06:30:00+00:00         True        False       1  [because condition_1 is True for all items and condition_2 is True once]
      1  2021-06-10 07:00:00+00:00        False         True       0  [condition_1 is not True for all items so condition_2 is irrelevant]
      2  2021-06-10 07:30:00+00:00         True         True       3  [both conditions are True for all 3 items]
item2 3  2021-06-10 06:30:00+00:00         True        False       1  [a repeat for the same reasons]
      4  2021-06-10 07:00:00+00:00         True         True       0  [a repeat for the same reasons]
      5  2021-06-10 07:30:00+00:00         True         True       3  [a repeat for the same reasons]
item3 6  2021-06-10 06:30:00+00:00         True         True       1  [a repeat for the same reasons]
      7  2021-06-10 07:00:00+00:00        False         True       0  [a repeat for the same reasons]
      8  2021-06-10 07:30:00+00:00         True         True       3  [a repeat for the same reasons]
Here is what I came up with (column names match the question's condition_1/condition_2; the per-date result is mapped back onto the rows so the assignment aligns correctly):
def cond_sum(s):
    return s['condition_1'].all() * s['condition_2'].sum()

df = df.reset_index(level=0)
result = df.groupby('date').apply(cond_sum)
df['result'] = df['date'].map(result)
Then if you want the original index, you can add it back:
df = df.set_index('item', append=True).swaplevel()
Note, you mentioned vectorized, so you could swap the apply out for:
dfg = df.groupby('date').agg({'condition_1': 'all', 'condition_2': 'sum'})
df['result'] = df['date'].map(dfg['condition_1'] * dfg['condition_2'])
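A self-contained sketch of the vectorized version (the data is retyped from the question; the index construction is an assumption made to reproduce the shown frame):

import pandas as pd

dates = pd.to_datetime(['2021-06-10 06:30:00+00:00',
                        '2021-06-10 07:00:00+00:00',
                        '2021-06-10 07:30:00+00:00'])
df = pd.DataFrame({
    'item': ['item1'] * 3 + ['item2'] * 3 + ['item3'] * 3,
    'date': list(dates) * 3,
    'condition_1': [True, False, True, True, True, True, True, False, True],
    'condition_2': [False, True, True, False, True, True, True, True, True],
}).set_index('item', append=True).swaplevel()

# per date: if condition_1 holds for all items, count the rows where
# condition_2 is True; otherwise the result is 0
dfg = df.groupby('date').agg({'condition_1': 'all', 'condition_2': 'sum'})
df['result'] = df['date'].map(dfg['condition_1'] * dfg['condition_2']).astype(int)
print (df)   # result column: 1, 0, 3 repeated for each item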

On the use of the any method

Here is code from Kaggle which is said to remove outliers:
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
Wouldn't any return a single boolean, i.e. whether an item is in a list or not?
So does the code say: save in the mask all absolute values in ft which are above the quantile (given by another variable)? What does the any stand for, and what is it for? Thank you.
I think the first part returns a DataFrame filled with boolean True and/or False values:
(ft.abs() > ft.abs().quantile(outl_thresh))
so DataFrame.any is added to test whether there is at least one True per row, reducing it to a boolean Series.
df = pd.DataFrame({'a': [False, False, True],
                   'b': [False, True, True],
                   'c': [False, False, True]})
print (df)
a b c
0 False False False
1 False True False
2 True True True
print (df.any(axis=1))
0    False <- no True in row
1     True <- one True in row
2     True <- three Trues in row
dtype: bool
The similar method, testing whether all values are True, is DataFrame.all:
print (df.all(axis=1))
0 False
1 False
2 True
dtype: bool
The reason is that filtering by boolean indexing requires a boolean Series, not a boolean DataFrame.
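A quick sketch of that difference, reusing the small df above: indexing with the boolean DataFrame only masks cells (same shape, False cells become NaN), while indexing with a boolean Series drops whole rows.

print (df[df])              # same shape, False cells become NaN
print (df[df.any(axis=1)])  # only rows 1 and 2 remain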
Another data sample:
np.random.seed(2021)
ft = pd.DataFrame(np.random.randint(100, size=(10, 5))).sub(20)
print (ft)
0 1 2 3 4
0 65 37 -20 74 66
1 24 42 71 9 1
2 73 4 -8 50 50
3 13 -13 -19 77 6
4 46 28 79 43 29
5 -4 30 34 32 73
6 -15 29 18 -6 51
7 65 50 21 1 5
8 -10 16 -1 37 62
9 70 -5 20 56 33
outl_thresh = 0.95
print (ft.abs().quantile(outl_thresh))
0 71.65
1 46.40
2 75.40
3 75.65
4 69.85
Name: 0.95, dtype: float64
print((ft.abs() > ft.abs().quantile(outl_thresh)))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True False False False False
3 False False False True False
4 False False True False False
5 False False False False True
6 False False False False False
7 False True False False False
8 False False False False False
9 False False False False False
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
print (outliers_mask)
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 True
8 False
9 False
dtype: bool
df1 = ft[outliers_mask]
print (df1)
0 1 2 3 4
2 73 4 -8 50 50
3 13 -13 -19 77 6
4 46 28 79 43 29
5 -4 30 34 32 73
7 65 50 21 1 5
df2 = ft[~outliers_mask]
print (df2)
0 1 2 3 4
0 65 37 -20 74 66
1 24 42 71 9 1
6 -15 29 18 -6 51
8 -10 16 -1 37 62
9 70 -5 20 56 33

How to forward fill row values with function in pandas MultiIndex dataframe?

I have the following MultiIndex dataframe:
                   Close  ATR  condition
Date       Symbol
1990-01-01 A          24    1       True
           B          72    1      False
           C          40    3      False
           D          21    5       True
1990-01-02 A          65    4       True
           B          19    2       True
           C          43    3       True
           D          72    1      False
1990-01-03 A          92    5      False
           B          32    3       True
           C          52    2      False
           D          33    1      False
I perform the following calculation on this dataframe:
data.loc[data.index.levels[0][0], 'Shares'] = 0
data.loc[data.index.levels[0][0], 'Closed_P/L'] = 0
data = data.reset_index()
Equity = 10000
def calcs(x):
    global Equity
    # Skip first date
    if x.index[0] == 0: return x
    # calculate Shares where condition is True
    x.loc[x['condition'] == True, 'Shares'] = np.floor((Equity * 0.02 / x['ATR']).astype(float))
    # other calculations
    x['Closed_P/L'] = x['Shares'] * x['Close']
    Equity += x['Closed_P/L'].sum()
    return x
data = data.groupby('Date').apply(calcs)
data['Equity'] = data.groupby('Date')['Closed_P/L'].transform('sum')
data['Equity'] = data.groupby('Symbol')['Equity'].cumsum() + Equity
data = data.set_index(['Date','Symbol'])
The output is:
                   Close  ATR  condition  Shares  Closed_P/L   Equity
Date       Symbol
1990-01-01 A          24  1.2       True     0.0         0.0  10000.0
           B          72  1.4      False     0.0         0.0  10000.0
           C          40    3      False     0.0         0.0  10000.0
           D          21    5       True     0.0         0.0  10000.0
1990-01-02 A          65    4       True    50.0      3250.0  17988.0
           B          19    2       True   100.0      1900.0  17988.0
           C          43    3       True    66.0      2838.0  17988.0
           D          72    1      False     NaN         NaN  17988.0
1990-01-03 A          92    5      False     NaN         NaN  21796.0
           B          32    3       True   119.0      3808.0  21796.0
           C          52    2      False     NaN         NaN  21796.0
           D          33    1      False     NaN         NaN  21796.0
I want to forward fill Shares values - grouped by Symbol - in case condition evaluates to False (except for first date). So the Shares value on 1990-01-02 for D should be 0 (because on 1990-01-01 the Shares value for D was 0 and the condition on 1990-01-02 is False). Also values for Shares on 1990-01-03 for A, C and D should be 50, 66 and 0 respectively based on the logic described above. How can I do that?
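The excerpt ends with the question, so here is a minimal sketch of one possible approach (an assumption, not taken from the original thread): since Shares comes out as NaN exactly where condition is False, a forward fill grouped by Symbol produces the described values.

# hypothetical fix: per-symbol forward fill of the NaN Shares values
data['Shares'] = data.groupby(level='Symbol')['Shares'].ffill()

This would fill the 1990-01-02 value for D with 0.0 and the 1990-01-03 values for A, C and D with 50.0, 66.0 and 0.0 respectively.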

Converting boolean to zero-or-one, for all elements in an array

I have the following dataset of boolean columns:
date hr energy
0 5-Feb-18 False False
1 29-Jan-18 False False
2 6-Dec-17 True False
3 16-Nov-17 False False
4 14-Nov-17 True True
5 25-Oct-17 False False
6 24-Oct-17 False False
7 5-Oct-17 False False
8 3-Oct-17 False False
9 26-Sep-17 False False
10 13-Sep-17 True False
11 7-Sep-17 False False
12 31-Aug-17 False False
I want to multiply each boolean column by 1 to turn it into a dummy variable.
I tried:
df = df.iloc[:, 1:]
for col in df:
    col = col * 1
but the columns remain boolean, why?
The loop has no effect because iterating over a DataFrame yields column labels, and rebinding col = col*1 never writes anything back to the DataFrame. Just use:
df.iloc[:, 1:] = df.iloc[:, 1:].astype(int)
df
Out[477]:
date hr energy
0 5-Feb-18 0 0
1 29-Jan-18 0 0
2 6-Dec-17 1 0
3 16-Nov-17 0 0
4 14-Nov-17 1 1
5 25-Oct-17 0 0
6 24-Oct-17 0 0
7 5-Oct-17 0 0
8 3-Oct-17 0 0
9 26-Sep-17 0 0
10 13-Sep-17 1 0
11 7-Sep-17 0 0
12 31-Aug-17 0 0
For future cases other than True or False, if you want to convert categorical values into numerical ones, you can always use the replace function:
df.iloc[:, 1:] = df.iloc[:, 1:].replace({True: 1, False: 0})
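For completeness, a minimal sketch of how the original loop could be repaired, assigning back into the DataFrame instead of rebinding the loop variable:

# skip the first (date) column, convert each remaining column in place
for col in df.columns[1:]:
    df[col] = df[col].astype(int)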

Count How Many Columns in Dataframe before NaN

I want to count, for each row of a pd.DataFrame, how many columns of data come before the first NaN. My data:
df
0 1 2 3 4 5 6 7 8 9 10 11 12 13
Id
A 1 1 2 3 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
B 6 6 7 7 8 9 10 NaN NaN NaN NaN NaN NaN NaN
C 1 2 3 3 4 5 6 6 7 7 8 9 10 NaN
My desired output:
df_result
count
Id
A 5
B 7
C 13
Thank you in advance for the answer.
Use:
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
A 1 1 2 3 3 NaN NaN NaN NaN NaN NaN NaN NaN 54.0
B 6 6 7 7 8 9.0 10.0 NaN NaN NaN NaN NaN 5.0 NaN
C 1 2 3 3 4 5.0 6.0 6.0 7.0 7.0 8.0 9.0 10.0 NaN
df = df.isnull().cumsum(axis=1).eq(0).sum(axis=1)
print (df)
A 5
B 7
C 13
dtype: int64
Detail:
First check NaNs:
print (df.isnull())
0 1 2 3 4 5 6 7 8 9 \
A False False False False False True True True True True
B False False False False False False False True True True
C False False False False False False False False False False
10 11 12 13
A True True True False
B True True False True
C False False False True
Get the cumulative sum - True is processed as 1, False as 0:
print (df.isnull().cumsum(axis=1))
0 1 2 3 4 5 6 7 8 9 10 11 12 13
A 0 0 0 0 0 1 2 3 4 5 6 7 8 8
B 0 0 0 0 0 0 0 1 2 3 4 5 5 6
C 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Compare with 0:
print (df.isnull().cumsum(axis=1).eq(0))
0 1 2 3 4 5 6 7 8 9 10 \
A True True True True True False False False False False False
B True True True True True True True False False False False
C True True True True True True True True True True True
11 12 13
A False False False
B False False False
C True True False
Sum the boolean mask - each True counts as 1:
print (df.isnull().cumsum(axis=1).eq(0).sum(axis=1))
A 5
B 7
C 13
dtype: int64
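A self-contained version of the demonstration above (the data is retyped from the answer):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, 2, 3, 3] + [np.nan] * 8 + [54.0],
                   [6, 6, 7, 7, 8, 9, 10] + [np.nan] * 5 + [5.0, np.nan],
                   [1, 2, 3, 3, 4, 5, 6, 6, 7, 7, 8, 9, 10, np.nan]],
                  index=['A', 'B', 'C'])

# the cumulative sum stays 0 until the first NaN in each row, so counting
# the zeros counts the columns before the first NaN
counts = df.isnull().cumsum(axis=1).eq(0).sum(axis=1)
df_result = counts.to_frame('count').rename_axis('Id')
print (df_result)   # A 5, B 7, C 13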