Pandas ffill based on condition - pandas

I have a data frame of daily prices, some of which are empty. I have then written a function which gives the week-commencing Monday date for each day.
Day WC Monday Price 1 Price 2
1/1/12 1/1/12 44 34
2/1/13 1/1/12 55 34
3/1/12 1/1/12 44 34
4/1/13 1/1/12 NA NA
5/1/12 1/1/12 NA NA
6/1/13 1/1/12 34 NA
7/1/12 1/1/12 33 NA
8/1/13 8/1/12 12 NA
9/1/12 8/1/12 34 NA
10/1/13 8/1/12 23 NA
I want to say: if a price column has only NAs left until the end of the column, then fill with the last value, but only to the end of the incomplete week.
So the expected output is:
Day WC Monday Price 1 Price 2
1/1/12 1/1/12 44 34
2/1/13 1/1/12 55 34
3/1/12 1/1/12 44 34
4/1/13 1/1/12 NA 34
5/1/12 1/1/12 NA 34
6/1/13 1/1/12 34 34
7/1/12 1/1/12 33 34
8/1/13 8/1/12 12 NA
9/1/12 8/1/12 34 NA
10/1/13 8/1/12 23 NA

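The idea is to test whether the last row of each group is missing, using GroupBy.transform with GroupBy.last, and then replace the missing values in those groups with DataFrame.mask and GroupBy.ffill. The mask is built from the price columns only and grouped by the original 'WC Monday' values, so that the grouping key itself is not turned into booleans by isna:
c = ['Price 1','Price 2']
m = df[c].isna().groupby(df['WC Monday']).transform('last')
df[c] = df[c].mask(m, df.groupby('WC Monday')[c].ffill())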
print (df)
Day WC Monday Price 1 Price 2
0 1/1/12 1/1/12 44.0 34.0
1 2/1/13 1/1/12 55.0 34.0
2 3/1/12 1/1/12 44.0 34.0
3 4/1/13 1/1/12 NaN 34.0
4 5/1/12 1/1/12 NaN 34.0
5 6/1/13 1/1/12 34.0 34.0
6 7/1/12 1/1/12 33.0 34.0
7 8/1/13 8/1/12 12.0 NaN
8 9/1/12 8/1/12 34.0 NaN
9 10/1/13 8/1/12 23.0 NaN
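
For reference, here is a self-contained sketch of the same approach that can be run as is. The frame is rebuilt by hand from the question's table (the Day values are simplified), and the names week, ends_missing and filled are only illustrative. The astype(bool) call is a defensive assumption in case transform returns an object column.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "Day": [f"{d}/1/12" for d in range(1, 11)],        # simplified dates
    "WC Monday": ["1/1/12"] * 7 + ["8/1/12"] * 3,
    "Price 1": [44, 55, 44, nan, nan, 34, 33, 12, 34, 23],
    "Price 2": [34, 34, 34, nan, nan, nan, nan, nan, nan, nan],
})

cols = ["Price 1", "Price 2"]
week = df["WC Monday"]

# True on every row of a week whose final row is missing in that column
ends_missing = df[cols].isna().groupby(week).transform("last").astype(bool)

# forward fill inside each week only, never across a week boundary
filled = df.groupby("WC Monday")[cols].ffill()

# take the filled values only where the week ends with a missing value
df[cols] = df[cols].mask(ends_missing, filled)
print(df)
```

Price 1 keeps its two gaps because its weeks end with real values, while Price 2 is filled to the end of the first week and stays empty in the second.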

Related

Shift row left by leading NaN's without removing all NaN's

How can I remove leading NaN's in pandas when reading in a csv file?
Example code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'c1': [ 20, 30, np.nan, np.nan, np.nan, 17, np.nan],
'c2': [np.nan, 74, 65, np.nan, np.nan, 74, 82],
'c3': [ 250, 290, 340, 325, 345, 315, 248],
'c4': [ 250, np.nan, 340, 325, 345, 315, 248],
'c5': [np.nan, np.nan, 340, np.nan, 345, np.nan, 248],
'c6': [np.nan, np.nan, np.nan, 325, 345, np.nan, np.nan]})
The code displays this
| | c1 | c2 | c3 | c4 | c5 | c6 |
|:-|-----:|-----:|----:|------:|------:|------:|
|0 | 20.0 | NaN | 250 | 250.0 | NaN | NaN |
|1 | 30.0 | 74.0 | 290 | NaN | NaN | NaN |
|2 | NaN | 65.0 | 340 | 340.0 | 340.0 | NaN |
|3 | NaN | NaN | 325 | 325.0 | NaN | 325.0 |
|4 | NaN | NaN | 345 | 345.0 | 345.0 | 345.0 |
|5 | 17.0 | 74.0 | 315 | 315.0 | NaN | NaN |
|6 | NaN | 82.0 | 248 | 248.0 | 248.0 | NaN |
I'd like to remove only the leading NaN's, so the result would look like this:
| | c1 | c2 | c3 | c4 | c5 | c6 |
|:-|-----:|-----:|----:|------:|------:|------:|
|0 | 20 | NaN | 250.0 | 250.0 | NaN | NaN |
|1 | 30 | 74.0 | 290.0 | NaN | NaN | NaN |
|2 | 65 | 340.0 | 340.0 | 340.0 | NaN | NaN |
|3 | 325 | 325.0 | NaN | 325.0 | NaN | NaN |
|4 | 345 | 345.0 | 345.0 | 345.0 | NaN | NaN |
|5 | 17 | 74.0 | 315.0 | 315.0 | NaN | NaN |
|6 | 82 | 248.0 | 248.0 | 248.0 | NaN | NaN |
I have tried the following, but it didn't work:
response = pd.read_csv (r'MonthlyPermitReport.csv')
df = pd.DataFrame(response)
df.loc[df.first_valid_index():]
Help please.
You can try this:
s = df.isna().cumprod(axis=1).sum(axis=1)
df.apply(lambda x: x.shift(-s[x.name]), axis=1)
Output:
c1 c2 c3 c4 c5 c6
0 20.0 NaN 250.0 250.0 NaN NaN
1 30.0 74.0 290.0 NaN NaN NaN
2 65.0 340.0 340.0 340.0 NaN NaN
3 325.0 325.0 NaN 325.0 NaN NaN
4 345.0 345.0 345.0 345.0 NaN NaN
5 17.0 74.0 315.0 315.0 NaN NaN
6 82.0 248.0 248.0 248.0 NaN NaN
Details:
s is a series that counts the number of leading NaN in each row. isna finds all the NaN in the dataframe; then cumprod along the row axis zeroes out every NaN that comes after a non-NaN value, because the running product becomes zero as soon as it meets a non-NaN. Lastly, sum along the row gives the number of places to shift each row.
When dataframe apply is used with axis=1 (row-wise), the name of the pd.Series passed to the function is the row index of the dataframe. We can therefore use it to look up the number of periods to shift in s, defined above.
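
To make the intermediate steps visible, here is a small sketch on a made-up three-row frame (the column names a, b, c and the variable names are illustrative, not from the question). The boolean mask is cast to int before cumprod only to make the arithmetic explicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, 1.0, np.nan],
    "b": [np.nan, np.nan, 2.0],
    "c": [3.0, 4.0, np.nan],
})

is_na = df.isna().astype(int)

# the running product stays 1 only while every value so far in the row is NaN
leading = is_na.cumprod(axis=1)

# number of leading NaN per row = number of places to shift left
shift_by = leading.sum(axis=1)
print(shift_by.tolist())

# row.name is the row label, so it can be used to look up the shift
shifted = df.apply(lambda row: row.shift(-int(shift_by[row.name])), axis=1)
print(shifted)
```

The row that starts with a real value is left alone, and a NaN that sits after a real value is not counted as leading.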
Alternatively, use apply to build a list for each row with the leading NaN dropped, then recreate the dataframe from those lists:
out = pd.DataFrame(df.apply(lambda x : [x[x.notna().cumsum()>0].tolist()],1).str[0].tolist(),
index=df.index,
columns=df.columns)
Out[102]:
c1 c2 c3 c4 c5 c6
0 20.0 NaN 250.0 250.0 NaN NaN
1 30.0 74.0 290.0 NaN NaN NaN
2 65.0 340.0 340.0 340.0 NaN NaN
3 325.0 325.0 NaN 325.0 NaN NaN
4 345.0 345.0 345.0 345.0 NaN NaN
5 17.0 74.0 315.0 315.0 NaN NaN
6 82.0 248.0 248.0 248.0 NaN NaN

7 days hourly mean with pandas

I need some help calculating a 7-day mean for every hour.
The time series has an hourly resolution, and I need the 7-day mean for each hour of the day, e.g. for 13 o'clock:
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but rolling includes every value from the last 7 days.
Thanks for any hints!
Add a new hour column, group by it, and then take a rolling mean over 7 rows within each group.
Each value is then averaged over the same hour on 7 consecutive days, which is consistent with the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429
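
A variant that keeps the original time order can be sketched with transform instead of reset_index. The data below is synthetic (x simply counts the hours), and the column name x_7d is made up for the example. Note the assumption that no hours are missing: rolling(7) counts rows, not calendar days.

```python
import numpy as np
import pandas as pd

# ten days of synthetic hourly data
idx = pd.date_range("2020-07-01", periods=24 * 10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"x": np.arange(len(idx), dtype=float)}, index=idx)

# within each hour of the day, average the current and the 6 previous days
df["x_7d"] = (
    df.groupby(df.index.hour)["x"]
      .transform(lambda s: s.rolling(7).mean())
)

# 13:00 on the seventh day averages the 13:00 values of days 1 to 7
print(df.loc[pd.Timestamp("2020-07-07 13:00"), "x_7d"])
```

Because transform returns a result aligned with the original index, the new column can be assigned directly without re-sorting.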

For each CohortGroup assign the Proper CohortPeriod Count

I am trying to assign the proper cohort period count to the 'Cohort Period' column for each cohort group. I believe showing what I am trying to achieve makes more sense.
For loops seem like the way to go, was wondering if the same can be achieved using some nifty pandas function.
Out[7]:
OrderPeriod CohortGroup Cohort Period
0 1/1/2017 1/1/2017 NaN
1 1/1/2017 1/1/2017 NaN
2 1/1/2017 1/1/2017 NaN
3 1/1/2017 1/1/2017 NaN
4 1/1/2017 1/1/2017 NaN
5 1/1/2017 1/1/2017 NaN
6 1/1/2017 1/1/2017 NaN
7 1/1/2017 1/1/2017 NaN
8 4/1/2017 1/1/2017 NaN
9 6/1/2017 1/1/2017 NaN
10 8/1/2017 1/1/2017 NaN
11 9/1/2017 1/1/2017 NaN
12 9/1/2017 1/1/2017 NaN
13 11/1/2017 1/1/2017 NaN
14 4/1/2018 1/1/2017 NaN
15 6/1/2018 1/1/2017 NaN
16 12/1/2018 1/1/2017 NaN
17 1/1/2019 1/1/2017 NaN
18 5/1/2019 1/1/2017 NaN
19 2/1/2017 2/1/2017 NaN
20 3/1/2017 3/1/2017 NaN
21 3/1/2017 3/1/2017 NaN
22 3/1/2017 3/1/2017 NaN
23 3/1/2017 3/1/2017 NaN
24 3/1/2017 3/1/2017 NaN
25 4/1/2017 3/1/2017 NaN
If CohortGroup and OrderPeriod are the same, the row is assigned a 1; the count then increases by one for each new OrderPeriod, and that number is assigned to Cohort Period. Once a new CohortGroup starts, the process begins again.
Out[7]:
OrderPeriod CohortGroup Cohort Period
0 1/1/2017 1/1/2017 1.0
1 1/1/2017 1/1/2017 1.0
2 1/1/2017 1/1/2017 1.0
3 1/1/2017 1/1/2017 1.0
4 1/1/2017 1/1/2017 1.0
5 1/1/2017 1/1/2017 1.0
6 1/1/2017 1/1/2017 1.0
7 1/1/2017 1/1/2017 1.0
8 4/1/2017 1/1/2017 2.0
9 6/1/2017 1/1/2017 3.0
10 8/1/2017 1/1/2017 4.0
11 9/1/2017 1/1/2017 5.0
12 9/1/2017 1/1/2017 5.0
13 11/1/2017 1/1/2017 6.0
14 4/1/2018 1/1/2017 7.0
15 6/1/2018 1/1/2017 8.0
16 12/1/2018 1/1/2017 9.0
17 1/1/2019 1/1/2017 10.0
18 5/1/2019 1/1/2017 11.0
19 2/1/2017 2/1/2017 1.0
20 3/1/2017 3/1/2017 1.0
21 3/1/2017 3/1/2017 1.0
22 3/1/2017 3/1/2017 1.0
23 3/1/2017 3/1/2017 1.0
24 3/1/2017 3/1/2017 1.0
25 4/1/2017 3/1/2017 2.0
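I would use a dense rank within each CohortGroup, after converting the columns to datetimes so that they sort chronologically:
df=df.apply(pd.to_datetime)
df['Cohort Period']=df.groupby('CohortGroup')['OrderPeriod'].rank(method='dense')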
df
OrderPeriod CohortGroup Cohort Period
0 2017-01-01 2017-01-01 1.0
1 2017-01-01 2017-01-01 1.0
2 2017-01-01 2017-01-01 1.0
3 2017-01-01 2017-01-01 1.0
4 2017-01-01 2017-01-01 1.0
5 2017-01-01 2017-01-01 1.0
6 2017-01-01 2017-01-01 1.0
7 2017-01-01 2017-01-01 1.0
8 2017-04-01 2017-01-01 2.0
9 2017-06-01 2017-01-01 3.0
10 2017-08-01 2017-01-01 4.0
11 2017-09-01 2017-01-01 5.0
12 2017-09-01 2017-01-01 5.0
13 2017-11-01 2017-01-01 6.0
14 2018-04-01 2017-01-01 7.0
15 2018-06-01 2017-01-01 8.0
16 2018-12-01 2017-01-01 9.0
17 2019-01-01 2017-01-01 10.0
18 2019-05-01 2017-01-01 11.0
19 2017-02-01 2017-02-01 1.0
20 2017-03-01 2017-03-01 1.0
21 2017-03-01 2017-03-01 1.0
22 2017-03-01 2017-03-01 1.0
23 2017-03-01 2017-03-01 1.0
24 2017-03-01 2017-03-01 1.0
25 2017-04-01 2017-03-01 2.0
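
The date conversion matters because, as strings, "11/1/2017" sorts before "4/1/2017". A minimal runnable sketch on a shortened version of the data (eight rows picked from the question; the explicit format string is an assumption about how the dates are written):

```python
import pandas as pd

df = pd.DataFrame({
    "OrderPeriod": ["1/1/2017", "1/1/2017", "4/1/2017", "9/1/2017",
                    "9/1/2017", "11/1/2017", "3/1/2017", "4/1/2017"],
    "CohortGroup": ["1/1/2017"] * 6 + ["3/1/2017"] * 2,
})

# parse both columns so that ranking is chronological, not alphabetical
for col in ["OrderPeriod", "CohortGroup"]:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")

# dense rank: equal OrderPeriods share a number and no numbers are skipped
df["Cohort Period"] = (
    df.groupby("CohortGroup")["OrderPeriod"].rank(method="dense").astype(int)
)
print(df)
```

Unlike a cumulative count, the rank does not depend on the row order inside a group.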
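First we build the CohortGroup groups by checking where the value changes, using shift.
Then we use groupby.apply to flag, within each group, the rows where OrderPeriod differs both from CohortGroup and from its neighbouring row, and take the cumulative sum of those flags: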
groups = df['CohortGroup'].ne(df['CohortGroup'].shift()).cumsum()
cohort_period = df.groupby(groups)\
.apply(lambda x: (x['OrderPeriod'].ne(x['CohortGroup'])\
& x['OrderPeriod'].ne(x['OrderPeriod'].shift()))\
.cumsum().add(1)).values
df['Cohort Period'] = cohort_period
output
OrderPeriod CohortGroup Cohort Period
0 1/1/2017 1/1/2017 1
1 1/1/2017 1/1/2017 1
2 1/1/2017 1/1/2017 1
3 1/1/2017 1/1/2017 1
4 1/1/2017 1/1/2017 1
5 1/1/2017 1/1/2017 1
6 1/1/2017 1/1/2017 1
7 1/1/2017 1/1/2017 1
8 4/1/2017 1/1/2017 2
9 6/1/2017 1/1/2017 3
10 8/1/2017 1/1/2017 4
11 9/1/2017 1/1/2017 5
12 9/1/2017 1/1/2017 5
13 11/1/2017 1/1/2017 6
14 4/1/2018 1/1/2017 7
15 6/1/2018 1/1/2017 8
16 12/1/2018 1/1/2017 9
17 1/1/2019 1/1/2017 10
18 5/1/2019 1/1/2017 11
19 2/1/2017 2/1/2017 1
20 3/1/2017 3/1/2017 1
21 3/1/2017 3/1/2017 1
22 3/1/2017 3/1/2017 1
23 3/1/2017 3/1/2017 1
24 3/1/2017 3/1/2017 1
25 4/1/2017 3/1/2017 2

New Column With Repeated Value from a different column

I have the following code. I need to add a column deaths_last_tuesday which shows the deaths from last Tuesday, for each day.
import pandas as pd
data = {'date': ['2014-05-01', '2014-05-02', '2014-05-03', '2014-05-04', '2014-05-05', '2014-05-06', '2014-05-07',
'2014-05-08', '2014-05-09', '2014-05-10', '2014-05-11', '2014-05-12', '2014-05-13', '2014-05-14',
'2014-05-15', '2014-05-16', '2014-05-17', '2014-05-18', '2014-05-19', '2014-05-20'],
'battle_deaths': [34, 25, 26, 15, 15, 14, 26, 25, 62, 41, 23, 56, 23, 34, 23, 67, 54, 34, 45, 12]}
df = pd.DataFrame(data, columns=['date', 'battle_deaths'])
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.dayofweek
df = df.set_index('date')
df.sort_index()
battle_deaths day_of_week deaths_last_tuesday
date
2014-05-01 34 3
2014-05-02 25 4 24
2014-05-03 26 5 24
2014-05-04 15 6 24
2014-05-05 15 0 24
2014-05-06 14 1 24
2014-05-07 26 2 24
2014-05-08 25 3 24
2014-05-09 62 4 25
2014-05-10 41 5 25
2014-05-11 23 6 25
2014-05-12 56 0 25
I want to do this so that I can compare the deaths of each day with the deaths of the previous Tuesday.
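Use the following. Note that dt.dayofweek numbers Monday as 0, so 3 is actually Thursday; it is used here because it matches the pattern of the expected output in the question, and eq(1) would select real Tuesdays: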
df['deaths_last_tuesday'] = df['battle_deaths'].where(df['day_of_week'].eq(3)).ffill().shift()
print (df)
battle_deaths day_of_week deaths_last_tuesday
date
2014-05-01 34 3 NaN
2014-05-02 25 4 34.0
2014-05-03 26 5 34.0
2014-05-04 15 6 34.0
2014-05-05 15 0 34.0
2014-05-06 14 1 34.0
2014-05-07 26 2 34.0
2014-05-08 25 3 34.0
2014-05-09 62 4 25.0
2014-05-10 41 5 25.0
2014-05-11 23 6 25.0
2014-05-12 56 0 25.0
2014-05-13 23 1 25.0
2014-05-14 34 2 25.0
2014-05-15 23 3 25.0
2014-05-16 67 4 23.0
2014-05-17 54 5 23.0
2014-05-18 34 6 23.0
2014-05-19 45 0 23.0
2014-05-20 12 1 23.0
Explanation:
First compare by eq (==):
print (df['day_of_week'].eq(3))
date
2014-05-01 True
2014-05-02 False
2014-05-03 False
2014-05-04 False
2014-05-05 False
2014-05-06 False
2014-05-07 False
2014-05-08 True
2014-05-09 False
2014-05-10 False
2014-05-11 False
2014-05-12 False
2014-05-13 False
2014-05-14 False
2014-05-15 True
2014-05-16 False
2014-05-17 False
2014-05-18 False
2014-05-19 False
2014-05-20 False
Name: day_of_week, dtype: bool
Then use where to replace the non-matching values with missing values:
print (df['battle_deaths'].where(df['day_of_week'].eq(3)))
date
2014-05-01 34.0
2014-05-02 NaN
2014-05-03 NaN
2014-05-04 NaN
2014-05-05 NaN
2014-05-06 NaN
2014-05-07 NaN
2014-05-08 25.0
2014-05-09 NaN
2014-05-10 NaN
2014-05-11 NaN
2014-05-12 NaN
2014-05-13 NaN
2014-05-14 NaN
2014-05-15 23.0
2014-05-16 NaN
2014-05-17 NaN
2014-05-18 NaN
2014-05-19 NaN
2014-05-20 NaN
Name: battle_deaths, dtype: float64
Forward fill the missing values:
print (df['battle_deaths'].where(df['day_of_week'].eq(3)).ffill())
date
2014-05-01 34.0
2014-05-02 34.0
2014-05-03 34.0
2014-05-04 34.0
2014-05-05 34.0
2014-05-06 34.0
2014-05-07 34.0
2014-05-08 25.0
2014-05-09 25.0
2014-05-10 25.0
2014-05-11 25.0
2014-05-12 25.0
2014-05-13 25.0
2014-05-14 25.0
2014-05-15 23.0
2014-05-16 23.0
2014-05-17 23.0
2014-05-18 23.0
2014-05-19 23.0
2014-05-20 23.0
Name: battle_deaths, dtype: float64
Finally, shift by one row, so that the matching day itself shows the value from the week before:
print (df['battle_deaths'].where(df['day_of_week'].eq(3)).ffill().shift())
date
2014-05-01 NaN
2014-05-02 34.0
2014-05-03 34.0
2014-05-04 34.0
2014-05-05 34.0
2014-05-06 34.0
2014-05-07 34.0
2014-05-08 34.0
2014-05-09 25.0
2014-05-10 25.0
2014-05-11 25.0
2014-05-12 25.0
2014-05-13 25.0
2014-05-14 25.0
2014-05-15 25.0
2014-05-16 23.0
2014-05-17 23.0
2014-05-18 23.0
2014-05-19 23.0
2014-05-20 23.0
Name: battle_deaths, dtype: float64
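
Since dayofweek counts from Monday as 0, the eq(3) above selects Thursdays. If real Tuesdays are wanted, the same chain can be sketched with 1 instead; the frame below is rebuilt from the question's data, and is_tuesday is just an illustrative name.

```python
import pandas as pd

deaths = [34, 25, 26, 15, 15, 14, 26, 25, 62, 41,
          23, 56, 23, 34, 23, 67, 54, 34, 45, 12]
dates = pd.date_range("2014-05-01", periods=len(deaths), freq="D")
df = pd.DataFrame({"battle_deaths": deaths}, index=dates)

# Monday is 0, so Tuesday is 1
is_tuesday = df.index.dayofweek == 1

df["deaths_last_tuesday"] = (
    df["battle_deaths"]
      .where(is_tuesday)   # keep Tuesday values, NaN elsewhere
      .ffill()             # carry each Tuesday forward
      .shift()             # a Tuesday itself sees the Tuesday before
)
print(df)
```

Rows before the day after the first Tuesday (2014-05-06) stay NaN, because there is no earlier Tuesday in the data.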

Capping values after a trigger level in a different variable _after GroupBy

EdChum provided an elegant answer to a question almost like this one. The difference is that the capping now needs to be applied to data that has been grouped with GroupBy.
Original Data:
Symbol DTE Spot Strike Vol
AAPL 30.00 100.00 80.00 14.58
AAPL 30.00 100.00 85.00 16.20
AAPL 30.00 100.00 90.00 18.00
AAPL 30.00 100.00 95.00 20.00
AAPL 30.00 100.00 100.00 22.00
AAPL 30.00 100.00 105.00 25.30
AAPL 30.00 100.00 110.00 29.10
AAPL 30.00 100.00 115.00 33.46
AAPL 30.00 100.00 120.00 38.48
AAPL 50.00 102.00 80.00 13.08
AAPL 50.00 102.00 85.00 14.70
AAPL 50.00 102.00 90.00 16.50
AAPL 50.00 102.00 95.00 18.50
AAPL 50.00 102.00 100.00 20.50
AAPL 50.00 102.00 105.00 23.80
AAPL 50.00 102.00 110.00 27.60
AAPL 50.00 102.00 115.00 31.96
AAPL 50.00 102.00 120.00 36.98
IBM 30.00 170.00 150.00 7.29
IBM 30.00 170.00 155.00 8.10
IBM 30.00 170.00 160.00 9.00
IBM 30.00 170.00 165.00 10.00
IBM 30.00 170.00 170.00 11.00
IBM 30.00 170.00 175.00 12.65
IBM 30.00 170.00 180.00 14.55
IBM 30.00 170.00 185.00 16.73
IBM 30.00 170.00 190.00 19.24
IBM 60.00 171.00 150.00 5.79
IBM 60.00 171.00 155.00 6.60
IBM 60.00 171.00 160.00 7.50
IBM 60.00 171.00 165.00 8.50
IBM 60.00 171.00 170.00 9.50
IBM 60.00 171.00 175.00 11.15
IBM 60.00 171.00 180.00 13.05
IBM 60.00 171.00 185.00 15.23
IBM 60.00 171.00 190.00 17.74
I then create a few new variables:
df['ATM_dist'] = abs(df['Spot'] - df['Strike'])
imin = df.groupby(['DTE','Symbol'])['ATM_dist'].transform('idxmin')
# ATMvol has to be created before NormStrike, which depends on it
df['ATMvol'] = df.loc[imin,'Vol'].values
df['NormStrike'] = np.log(df['Strike']/df['Spot'])/(((df['DTE']/365)**.5)*df['ATMvol']/100)
The results are below:
Symbol DTE Spot Strike Vol ATM_dist ATMvol NormStrike
0 AAPL 30 100 80 14.58 20 22.0 -3.537916
1 AAPL 30 100 85 16.20 15 22.0 -2.576719
2 AAPL 30 100 90 18.00 10 22.0 -1.670479
3 AAPL 30 100 95 20.00 5 22.0 -0.813249
4 AAPL 30 100 100 22.00 0 22.0 0.000000
5 AAPL 30 100 105 25.30 5 22.0 0.773562
6 AAPL 30 100 110 29.10 10 22.0 1.511132
7 AAPL 30 100 115 33.46 15 22.0 2.215910
8 AAPL 30 100 120 38.48 20 22.0 2.890688
9 AAPL 50 102 80 13.08 22 20.5 -3.201973
10 AAPL 50 102 85 14.70 17 20.5 -2.402955
11 AAPL 50 102 90 16.50 12 20.5 -1.649620
12 AAPL 50 102 95 18.50 7 20.5 -0.937027
13 AAPL 50 102 100 20.50 2 20.5 -0.260994
14 AAPL 50 102 105 23.80 3 20.5 0.382049
15 AAPL 50 102 110 27.60 8 20.5 0.995172
16 AAPL 50 102 115 31.96 13 20.5 1.581035
17 AAPL 50 102 120 36.98 18 20.5 2.141961
18 IBM 30 170 150 7.29 20 11.0 -3.968895
19 IBM 30 170 155 8.10 15 11.0 -2.929137
20 IBM 30 170 160 9.00 10 11.0 -1.922393
21 IBM 30 170 165 10.00 5 11.0 -0.946631
22 IBM 30 170 170 11.00 0 11.0 0.000000
23 IBM 30 170 175 12.65 5 11.0 0.919188
24 IBM 30 170 180 14.55 10 11.0 1.812480
25 IBM 30 170 185 16.73 15 11.0 2.681295
26 IBM 30 170 190 19.24 20 11.0 3.526940
27 IBM 60 171 150 5.79 21 9.5 -3.401827
28 IBM 60 171 155 6.60 16 9.5 -2.550520
29 IBM 60 171 160 7.50 11 9.5 -1.726243
30 IBM 60 171 165 8.50 6 9.5 -0.927332
31 IBM 60 171 170 9.50 1 9.5 -0.152273
32 IBM 60 171 175 11.15 4 9.5 0.600317
33 IBM 60 171 180 13.05 9 9.5 1.331704
34 IBM 60 171 185 15.23 14 9.5 2.043051
35 IBM 60 171 190 17.74 19 9.5 2.735427
I want to cap the values of 'Vol' at the level they have where another column, 'NormStrike', hits a trigger (in this case abs(NormStrike) >= 2). The capped values should go in a new column, 'Desired_Level', leaving the 'Vol' column unchanged. The first cap should set the value at index location 0 to 16.2, because the cap was triggered at index location 1, when NormStrike hit -2.576719.
Added clarification:
I am looking for a generic solution, that works away from the lowest abs(NormStrike) level in both directions to hit both the -2 and the +2 trigger. If it is not hit (which it might not be) then desired level is just original_level
An additional note, it will always be true that the abs(NormStrike) continues to grow in size from the min(abs(NormStrike)) level as it is a function of abs(distance from spot to strike)
The code that EdChum provided (before I brought GroupBy into the mix) is below:
clip = 4
lower = df.loc[df['NS'] <= -clip, 'Vol'].idxmax()
upper = df.loc[df['NS'] >= clip, 'Vol'].idxmin()
df['Original_level'] = df['Original_level'].clip(df.loc[lower,'Original_level'], df.loc[upper, 'Original_level'])
There are two issues. First, it did not work after groupby. Second, if a particular group of data does not have an NS value that exceeds the "clip" value, it generates an error. The ideal outcome in that case would be that nothing is done to the Vol level for the Symbol/DTE group in question.
Ed suggested implementing a reset_index(), but I am not sure how to use that to solve the issue.
I hope this was not too convoluted a question.
Thank you for any assistance.
You can try the following to see whether it works for you. I assume that once the cap has been triggered the value is set to NaN and then filled from the nearest untriggered row in the group. You can replace that with a rule of your own choice.
import pandas as pd
import numpy as np
# np.where(criterion, x, y) is a vectorized if/else: x where criterion is True, y elsewhere
def func(group):
    group = group.copy()
    # thresholds as used in this answer; adjust them to the trigger you need
    triggered = (group['NormStrike'] >= 2) | (group['NormStrike'] <= -4)
    group['Triggered'] = np.where(triggered, 'Yes', 'No')
    group['Desired_Level'] = np.where(triggered, np.nan, group['Vol'])
    # fill the capped rows from the nearest untriggered row in the group
    group = group.ffill().bfill()
    return group
df = df.groupby(['Symbol', 'DTE']).apply(func)
Out[410]:
Symbol DTE Spot Strike Vol ATM_dist ATMvol NormStrike Triggered Desired_Level
0 AAPL 30 100 80 14.58 20 22 -3.5379 No 14.58
1 AAPL 30 100 85 16.20 15 22 -2.5767 No 16.20
2 AAPL 30 100 90 18.00 10 22 -1.6705 No 18.00
3 AAPL 30 100 95 20.00 5 22 -0.8132 No 20.00
4 AAPL 30 100 100 22.00 0 22 0.0000 No 22.00
5 AAPL 30 100 105 25.30 5 22 0.7736 No 25.30
6 AAPL 30 100 110 29.10 10 22 1.5111 No 29.10
7 AAPL 30 100 115 33.46 15 22 2.2159 Yes 29.10
8 AAPL 30 100 120 38.48 20 22 2.8907 Yes 29.10
9 AAPL 50 102 80 14.58 22 22 -3.5379 No 14.58
10 AAPL 50 102 85 16.20 17 22 -2.5767 No 16.20
11 AAPL 50 102 90 18.00 12 22 -1.6705 No 18.00
12 AAPL 50 102 95 20.00 7 22 -0.8132 No 20.00
13 AAPL 50 102 100 22.00 2 22 0.0000 No 22.00
14 AAPL 50 102 105 25.30 3 22 0.7736 No 25.30
15 AAPL 50 102 110 29.10 8 22 1.5111 No 29.10
16 AAPL 50 102 115 33.46 13 22 2.2159 Yes 29.10
17 AAPL 50 102 120 38.48 18 22 2.8907 Yes 29.10
18 AAPL 30 170 150 14.58 20 22 -3.5379 No 14.58
19 AAPL 30 170 155 16.20 15 22 -2.5767 No 16.20
20 AAPL 30 170 160 18.00 10 22 -1.6705 No 18.00
21 AAPL 30 170 165 20.00 5 22 -0.8132 No 20.00
22 AAPL 30 170 170 22.00 0 22 0.0000 No 22.00
23 AAPL 30 170 175 25.30 5 22 0.7736 No 25.30
24 AAPL 30 170 180 29.10 10 22 1.5111 No 29.10
25 AAPL 30 170 185 33.46 15 22 2.2159 Yes 29.10
26 AAPL 30 170 190 38.48 20 22 2.8907 Yes 29.10
27 AAPL 60 171 150 14.58 21 22 -3.5379 No 14.58
28 AAPL 60 171 155 16.20 16 22 -2.5767 No 16.20
29 AAPL 60 171 160 18.00 11 22 -1.6705 No 18.00
30 AAPL 60 171 165 20.00 6 22 -0.8132 No 20.00
31 AAPL 60 171 170 22.00 1 22 0.0000 No 22.00
32 AAPL 60 171 175 25.30 4 22 0.7736 No 25.30
33 AAPL 60 171 180 29.10 9 22 1.5111 No 29.10
34 AAPL 60 171 185 33.46 14 22 2.2159 Yes 29.10
35 AAPL 60 171 190 38.48 19 22 2.8907 Yes 29.10
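
An alternative that follows the question's wording more literally (cap at the Vol of the first row that hits the trigger on each side, and leave a group untouched if the trigger is never hit) could be sketched as below. This is not the code from the answer above: cap_group and its trigger parameter are hypothetical names, the sample rows are copied from the question's table, and the second group is deliberately cut down to strikes 90 to 110 so that it never triggers.

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["AAPL"] * 14,
    "DTE": [30] * 9 + [50] * 5,
    "Vol": [14.58, 16.20, 18.00, 20.00, 22.00, 25.30, 29.10, 33.46, 38.48,
            16.50, 18.50, 20.50, 23.80, 27.60],
    "NormStrike": [-3.537916, -2.576719, -1.670479, -0.813249, 0.0,
                   0.773562, 1.511132, 2.215910, 2.890688,
                   -1.649620, -0.937027, -0.260994, 0.382049, 0.995172],
})

def cap_group(g, trigger=2.0):
    """Return Vol for one group, capped where abs(NormStrike) >= trigger."""
    out = g["Vol"].copy()
    low = g["NormStrike"] <= -trigger
    high = g["NormStrike"] >= trigger
    if low.any():
        # triggered row closest to at-the-money on the downside
        cap_row = g.loc[low, "NormStrike"].idxmax()
        out[low] = g.at[cap_row, "Vol"]
    if high.any():
        # triggered row closest to at-the-money on the upside
        cap_row = g.loc[high, "NormStrike"].idxmin()
        out[high] = g.at[cap_row, "Vol"]
    return out

# an explicit loop avoids the shape quirks of groupby.apply
parts = [cap_group(g) for _, g in df.groupby(["Symbol", "DTE"])]
df["Desired_Level"] = pd.concat(parts)  # aligned on the original index
print(df)
```

In the first group the two lowest strikes are capped at 16.20 and the two highest at 33.46, while the second group is returned unchanged because no row reaches the trigger.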