For each CohortGroup assign the Proper CohortPeriod Count - pandas

I am trying to assign the proper cohort period count to the 'Cohort Period' column for each cohort group. I believe showing what I am trying to achieve makes more sense.
For loops seem like the way to go, was wondering if the same can be achieved using some nifty pandas function.
Out[7]:
OrderPeriod CohortGroup Cohort Period
0 1/1/2017 1/1/2017 NaN
1 1/1/2017 1/1/2017 NaN
2 1/1/2017 1/1/2017 NaN
3 1/1/2017 1/1/2017 NaN
4 1/1/2017 1/1/2017 NaN
5 1/1/2017 1/1/2017 NaN
6 1/1/2017 1/1/2017 NaN
7 1/1/2017 1/1/2017 NaN
8 4/1/2017 1/1/2017 NaN
9 6/1/2017 1/1/2017 NaN
10 8/1/2017 1/1/2017 NaN
11 9/1/2017 1/1/2017 NaN
12 9/1/2017 1/1/2017 NaN
13 11/1/2017 1/1/2017 NaN
14 4/1/2018 1/1/2017 NaN
15 6/1/2018 1/1/2017 NaN
16 12/1/2018 1/1/2017 NaN
17 1/1/2019 1/1/2017 NaN
18 5/1/2019 1/1/2017 NaN
19 2/1/2017 2/1/2017 NaN
20 3/1/2017 3/1/2017 NaN
21 3/1/2017 3/1/2017 NaN
22 3/1/2017 3/1/2017 NaN
23 3/1/2017 3/1/2017 NaN
24 3/1/2017 3/1/2017 NaN
25 4/1/2017 3/1/2017 NaN
If Cohort Group and OrderPeriod are the same it's assigned a 1, then counts for each new OrderPeriod and assigns that number to Cohort Period. Once a new CohortGroup starts the process begins again.
Out[7]:
OrderPeriod CohortGroup Cohort Period
0 1/1/2017 1/1/2017 1.0
1 1/1/2017 1/1/2017 1.0
2 1/1/2017 1/1/2017 1.0
3 1/1/2017 1/1/2017 1.0
4 1/1/2017 1/1/2017 1.0
5 1/1/2017 1/1/2017 1.0
6 1/1/2017 1/1/2017 1.0
7 1/1/2017 1/1/2017 1.0
8 4/1/2017 1/1/2017 2.0
9 6/1/2017 1/1/2017 3.0
10 8/1/2017 1/1/2017 4.0
11 9/1/2017 1/1/2017 5.0
12 9/1/2017 1/1/2017 5.0
13 11/1/2017 1/1/2017 6.0
14 4/1/2018 1/1/2017 7.0
15 6/1/2018 1/1/2017 8.0
16 12/1/2018 1/1/2017 9.0
17 1/1/2019 1/1/2017 10.0
18 5/1/2019 1/1/2017 11.0
19 2/1/2017 2/1/2017 1.0
20 3/1/2017 3/1/2017 1.0
21 3/1/2017 3/1/2017 1.0
22 3/1/2017 3/1/2017 1.0
23 3/1/2017 3/1/2017 1.0
24 3/1/2017 3/1/2017 1.0
25 4/1/2017 3/1/2017 2.0

I will do rank
df=df.apply(pd.to_datetime)
df['Cohort Period']=df.groupby('CohortGroup')['OrderPeriod'].rank('dense')
df
OrderPeriod CohortGroup Cohort Period
0 2017-01-01 2017-01-01 1.0
1 2017-01-01 2017-01-01 1.0
2 2017-01-01 2017-01-01 1.0
3 2017-01-01 2017-01-01 1.0
4 2017-01-01 2017-01-01 1.0
5 2017-01-01 2017-01-01 1.0
6 2017-01-01 2017-01-01 1.0
7 2017-01-01 2017-01-01 1.0
8 2017-04-01 2017-01-01 2.0
9 2017-06-01 2017-01-01 3.0
10 2017-08-01 2017-01-01 4.0
11 2017-09-01 2017-01-01 5.0
12 2017-09-01 2017-01-01 5.0
13 2017-11-01 2017-01-01 6.0
14 2018-04-01 2017-01-01 7.0
15 2018-06-01 2017-01-01 8.0
16 2018-12-01 2017-01-01 9.0
17 2019-01-01 2017-01-01 10.0
18 2019-05-01 2017-01-01 11.0
19 2017-02-01 2017-02-01 1.0
20 2017-03-01 2017-03-01 1.0
21 2017-03-01 2017-03-01 1.0
22 2017-03-01 2017-03-01 1.0
23 2017-03-01 2017-03-01 1.0
24 2017-03-01 2017-03-01 1.0
25 2017-04-01 2017-03-01 2.0

First we make your CohortGroup groups be checking where it changes with shift
Then we use groupby.apply to check where OrderPeriod is not the same as CohortGroup:
groups = df['CohortGroup'].ne(df['CohortGroup'].shift()).cumsum()
cohort_period = df.groupby(groups)\
.apply(lambda x: (x['OrderPeriod'].ne(x['CohortGroup'])\
& x['OrderPeriod'].ne(x['OrderPeriod'].shift(-1)))\
.cumsum().add(1)).values
df['Cohort Period'] = cohort_period
output
OrderPeriod CohortGroup Cohort Period
0 1/1/2017 1/1/2017 1
1 1/1/2017 1/1/2017 1
2 1/1/2017 1/1/2017 1
3 1/1/2017 1/1/2017 1
4 1/1/2017 1/1/2017 1
5 1/1/2017 1/1/2017 1
6 1/1/2017 1/1/2017 1
7 1/1/2017 1/1/2017 1
8 4/1/2017 1/1/2017 2
9 6/1/2017 1/1/2017 3
10 8/1/2017 1/1/2017 4
11 9/1/2017 1/1/2017 4
12 9/1/2017 1/1/2017 5
13 11/1/2017 1/1/2017 6
14 4/1/2018 1/1/2017 7
15 6/1/2018 1/1/2017 8
16 12/1/2018 1/1/2017 9
17 1/1/2019 1/1/2017 10
18 5/1/2019 1/1/2017 11
19 2/1/2017 2/1/2017 1
20 3/1/2017 3/1/2017 1
21 3/1/2017 3/1/2017 1
22 3/1/2017 3/1/2017 1
23 3/1/2017 3/1/2017 1
24 3/1/2017 3/1/2017 1
25 4/1/2017 3/1/2017 2

Related

7 days hourly mean with pandas

I need some help calculating a 7 days mean for every hour.
The timeseries has a hourly resolution and I need the 7 days mean for each hour e.g. for 13 o'clock
date, x
2020-07-01 13:00 , 4
2020-07-01 14:00 , 3
.
.
.
2020-07-02 13:00 , 3
2020-07-02 14:00 , 7
.
.
.
I tried it with pandas and a rolling mean, but rolling includes last 7 days.
Thanks for any hints!
Add a new hour column, grouping by hour column, and then add
The average was calculated over 7 days. This is consistent with the intent of the question.
df['hour'] = df.index.hour
df = df.groupby(df.hour)['x'].rolling(7).mean().reset_index()
df.head(35)
hour level_1 x
0 0 2020-07-01 00:00:00 NaN
1 0 2020-07-02 00:00:00 NaN
2 0 2020-07-03 00:00:00 NaN
3 0 2020-07-04 00:00:00 NaN
4 0 2020-07-05 00:00:00 NaN
5 0 2020-07-06 00:00:00 NaN
6 0 2020-07-07 00:00:00 48.142857
7 0 2020-07-08 00:00:00 50.285714
8 0 2020-07-09 00:00:00 60.000000
9 0 2020-07-10 00:00:00 63.142857
10 1 2020-07-01 01:00:00 NaN
11 1 2020-07-02 01:00:00 NaN
12 1 2020-07-03 01:00:00 NaN
13 1 2020-07-04 01:00:00 NaN
14 1 2020-07-05 01:00:00 NaN
15 1 2020-07-06 01:00:00 NaN
16 1 2020-07-07 01:00:00 52.571429
17 1 2020-07-08 01:00:00 48.428571
18 1 2020-07-09 01:00:00 38.000000
19 2 2020-07-01 02:00:00 NaN
20 2 2020-07-02 02:00:00 NaN
21 2 2020-07-03 02:00:00 NaN
22 2 2020-07-04 02:00:00 NaN
23 2 2020-07-05 02:00:00 NaN
24 2 2020-07-06 02:00:00 NaN
25 2 2020-07-07 02:00:00 46.571429
26 2 2020-07-08 02:00:00 47.714286
27 2 2020-07-09 02:00:00 42.714286
28 3 2020-07-01 03:00:00 NaN
29 3 2020-07-02 03:00:00 NaN
30 3 2020-07-03 03:00:00 NaN
31 3 2020-07-04 03:00:00 NaN
32 3 2020-07-05 03:00:00 NaN
33 3 2020-07-06 03:00:00 NaN
34 3 2020-07-07 03:00:00 72.571429

Pandas ffill based on condition

I have a data frame of empty daily prices. I have then written a function which give week commencing monday dates.
Day WC Monday Price 1 Price 2
1/1/12 1/1/12 44 34
2/1/13 1/1/12 55 34
3/1/12 1/1/12 44 34
4/1/13 1/1/12 NA NA
5/1/12 1/1/12 NA NA
6/1/13 1/1/12 34 NA
7/1/12 1/1/12 33 NA
8/1/13 8/1/12 12 NA
9/1/12 8/1/12 34 NA
10/1/13 8/1/12 23 NA
I want to say if the price only has NAs left until the end of the column, then fill with the last value only to the end of the incomplete week
So the expected output is:
Day WC Monday Price 1 Price 2
1/1/12 1/1/12 44 34
2/1/13 1/1/12 55 34
3/1/12 1/1/12 44 34
4/1/13 1/1/12 NA 34
5/1/12 1/1/12 NA 34
6/1/13 1/1/12 34 34
7/1/12 1/1/12 33 34
8/1/13 8/1/12 12 NA
9/1/12 8/1/12 34 NA
10/1/13 8/1/12 23 NA
Idea is test, if last row per group is missing values by GroupBy.transform with GroupBy.last and then replace missing values with DataFrame.mask and GroupBy.ffill:
c = ['Price 1','Price 2']
m = df.isna().groupby('WC Monday')[c].transform('last')
df[c] = df[c].mask(m, df.groupby('WC Monday')[c].ffill())
print (df)
Day WC Monday Price 1 Price 2
0 1/1/12 1/1/12 44.0 34.0
1 2/1/13 1/1/12 55.0 34.0
2 3/1/12 1/1/12 44.0 34.0
3 4/1/13 1/1/12 NaN 34.0
4 5/1/12 1/1/12 NaN 34.0
5 6/1/13 1/1/12 34.0 34.0
6 7/1/12 1/1/12 33.0 34.0
7 8/1/13 8/1/12 12.0 NaN
8 9/1/12 8/1/12 34.0 NaN
9 10/1/13 8/1/12 23.0 NaN

New Column With Repeated Value from a different column

I have the following code. I need to add a column deaths_last_tuesday which shows the deaths from last Tuesday, for each day.
import pandas as pd
data = {'date': ['2014-05-01', '2014-05-02', '2014-05-03', '2014-05-04', '2014-05-05', '2014-05-06', '2014-05-07',
'2014-05-08', '2014-05-09', '2014-05-10', '2014-05-11', '2014-05-12', '2014-05-13', '2014-05-14',
'2014-05-15', '2014-05-16', '2014-05-17', '2014-05-18', '2014-05-19', '2014-05-20'],
'battle_deaths': [34, 25, 26, 15, 15, 14, 26, 25, 62, 41, 23, 56, 23, 34, 23, 67, 54, 34, 45, 12]}
df = pd.DataFrame(data, columns=['date', 'battle_deaths'])
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.dayofweek
df = df.set_index('date')
df.sort_index()
battle_deaths day_of_week deaths_last_tuesday
date
2014-05-01 34 3
2014-05-02 25 4 24
2014-05-03 26 5 24
2014-05-04 15 6 24
2014-05-05 15 0 24
2014-05-06 14 1 24
2014-05-07 26 2 24
2014-05-08 25 3 24
2014-05-09 62 4 25
2014-05-10 41 5 25
2014-05-11 23 6 25
2014-05-12 56 0 25
I want to do this so that I want to compare the deaths of each day with the deaths of the previous Tuesday.
Use:
df['deaths_last_tuesday'] = df['battle_deaths'].where(df['day_of_week'].eq(3)).ffill().shift()
print (df)
battle_deaths day_of_week deaths_last_tuesday
date
2014-05-01 34 3 NaN
2014-05-02 25 4 34.0
2014-05-03 26 5 34.0
2014-05-04 15 6 34.0
2014-05-05 15 0 34.0
2014-05-06 14 1 34.0
2014-05-07 26 2 34.0
2014-05-08 25 3 34.0
2014-05-09 62 4 25.0
2014-05-10 41 5 25.0
2014-05-11 23 6 25.0
2014-05-12 56 0 25.0
2014-05-13 23 1 25.0
2014-05-14 34 2 25.0
2014-05-15 23 3 25.0
2014-05-16 67 4 23.0
2014-05-17 54 5 23.0
2014-05-18 34 6 23.0
2014-05-19 45 0 23.0
2014-05-20 12 1 23.0
Explanation:
First compare by eq (==):
print (df['day_of_week'].eq(3))
date
2014-05-01 True
2014-05-02 False
2014-05-03 False
2014-05-04 False
2014-05-05 False
2014-05-06 False
2014-05-07 False
2014-05-08 True
2014-05-09 False
2014-05-10 False
2014-05-11 False
2014-05-12 False
2014-05-13 False
2014-05-14 False
2014-05-15 True
2014-05-16 False
2014-05-17 False
2014-05-18 False
2014-05-19 False
2014-05-20 False
Name: day_of_week, dtype: bool
Then create missing values for not matched values by where:
print (df['battle_deaths'].where(df['day_of_week'].eq(3)))
date
2014-05-01 34.0
2014-05-02 NaN
2014-05-03 NaN
2014-05-04 NaN
2014-05-05 NaN
2014-05-06 NaN
2014-05-07 NaN
2014-05-08 25.0
2014-05-09 NaN
2014-05-10 NaN
2014-05-11 NaN
2014-05-12 NaN
2014-05-13 NaN
2014-05-14 NaN
2014-05-15 23.0
2014-05-16 NaN
2014-05-17 NaN
2014-05-18 NaN
2014-05-19 NaN
2014-05-20 NaN
Name: battle_deaths, dtype: float64
Forwrd fill missing values:
print (df['battle_deaths'].where(df['day_of_week'].eq(3)).ffill())
date
2014-05-01 34.0
2014-05-02 34.0
2014-05-03 34.0
2014-05-04 34.0
2014-05-05 34.0
2014-05-06 34.0
2014-05-07 34.0
2014-05-08 25.0
2014-05-09 25.0
2014-05-10 25.0
2014-05-11 25.0
2014-05-12 25.0
2014-05-13 25.0
2014-05-14 25.0
2014-05-15 23.0
2014-05-16 23.0
2014-05-17 23.0
2014-05-18 23.0
2014-05-19 23.0
2014-05-20 23.0
Name: battle_deaths, dtype: float64
And last shift:
print (df['battle_deaths'].where(df['day_of_week'].eq(3)).ffill().shift())
date
2014-05-01 NaN
2014-05-02 34.0
2014-05-03 34.0
2014-05-04 34.0
2014-05-05 34.0
2014-05-06 34.0
2014-05-07 34.0
2014-05-08 34.0
2014-05-09 25.0
2014-05-10 25.0
2014-05-11 25.0
2014-05-12 25.0
2014-05-13 25.0
2014-05-14 25.0
2014-05-15 25.0
2014-05-16 23.0
2014-05-17 23.0
2014-05-18 23.0
2014-05-19 23.0
2014-05-20 23.0
Name: battle_deaths, dtype: float64

How to get count incremental by date

I am trying to get a count of rows with incremental dates.
My table looks like this:
ID name status create_date
1 John AC 2016-01-01 00:00:26.513
2 Jane AC 2016-01-02 00:00:26.513
3 Kane AC 2016-01-02 00:00:26.513
4 Carl AC 2016-01-03 00:00:26.513
5 Dave AC 2016-01-04 00:00:26.513
6 Gina AC 2016-01-04 00:00:26.513
Now what I want to return from the SQL is something like this:
Date Count
2016-01-01 1
2016-01-02 3
2016-01-03 4
2016-01-04 6
You can make use of COUNT() OVER () without PARTITION BY,by using ORDER BY. It will give you the cumulative sum.Use DISTINCT to filter out the duplicate values.
SELECT DISTINCT CAST(create_date AS DATE) [Date],
COUNT(create_date) OVER (ORDER BY CAST(create_date AS DATE)) as [COUNT]
FROM [YourTable]
SELECT create_date, COUNT(create_date) as [COUNT]
FROM (
SELECT CAST(create_date AS DATE) create_date
FROM [YourTable]
) T
GROUP BY create_date
Per your description, you need a continuous dates list, Does it make sense?
This sample only generating one-month data.
CREATE TABLE #tt(ID INT, name VARCHAR(10), status VARCHAR(10), create_date DATETIME)
INSERT INTO #tt
SELECT 1,'John','AC','2016-01-01 00:00:26.513' UNION
SELECT 2,'Jane','AC','2016-01-02 00:00:26.513' UNION
SELECT 3,'Kane','AC','2016-01-02 00:00:26.513' UNION
SELECT 4,'Carl','AC','2016-01-03 00:00:26.513' UNION
SELECT 5,'Dave','AC','2016-01-04 00:00:26.513' UNION
SELECT 6,'Gina','AC','2016-01-04 00:00:26.513' UNION
SELECT 7,'Tina','AC','2016-01-08 00:00:26.513'
SELECT * FROM #tt
SELECT CONVERT(DATE,DATEADD(d,sv.number,n.FirstDate)) AS [Date],COUNT(n.num) AS [Count]
FROM master.dbo.spt_values AS sv
LEFT JOIN (
SELECT MIN(t.create_date)OVER() AS FirstDate,DATEDIFF(d,MIN(t.create_date)OVER(),t.create_date) AS num FROM #tt AS t
) AS n ON n.num<=sv.number
WHERE sv.type='P' AND sv.number>=0 AND MONTH(DATEADD(d,sv.number,n.FirstDate))=MONTH(n.FirstDate)
GROUP BY CONVERT(DATE,DATEADD(d,sv.number,n.FirstDate))
Date Count
---------- -----------
2016-01-01 1
2016-01-02 3
2016-01-03 4
2016-01-04 6
2016-01-05 6
2016-01-06 6
2016-01-07 6
2016-01-08 7
2016-01-09 7
2016-01-10 7
2016-01-11 7
2016-01-12 7
2016-01-13 7
2016-01-14 7
2016-01-15 7
2016-01-16 7
2016-01-17 7
2016-01-18 7
2016-01-19 7
2016-01-20 7
2016-01-21 7
2016-01-22 7
2016-01-23 7
2016-01-24 7
2016-01-25 7
2016-01-26 7
2016-01-27 7
2016-01-28 7
2016-01-29 7
2016-01-30 7
2016-01-31 7
2017-01-01 7
2017-01-02 7
2017-01-03 7
2017-01-04 7
2017-01-05 7
2017-01-06 7
2017-01-07 7
2017-01-08 7
2017-01-09 7
2017-01-10 7
2017-01-11 7
2017-01-12 7
2017-01-13 7
2017-01-14 7
2017-01-15 7
2017-01-16 7
2017-01-17 7
2017-01-18 7
2017-01-19 7
2017-01-20 7
2017-01-21 7
2017-01-22 7
2017-01-23 7
2017-01-24 7
2017-01-25 7
2017-01-26 7
2017-01-27 7
2017-01-28 7
2017-01-29 7
2017-01-30 7
2017-01-31 7
2018-01-01 7
2018-01-02 7
2018-01-03 7
2018-01-04 7
2018-01-05 7
2018-01-06 7
2018-01-07 7
2018-01-08 7
2018-01-09 7
2018-01-10 7
2018-01-11 7
2018-01-12 7
2018-01-13 7
2018-01-14 7
2018-01-15 7
2018-01-16 7
2018-01-17 7
2018-01-18 7
2018-01-19 7
2018-01-20 7
2018-01-21 7
2018-01-22 7
2018-01-23 7
2018-01-24 7
2018-01-25 7
2018-01-26 7
2018-01-27 7
2018-01-28 7
2018-01-29 7
2018-01-30 7
2018-01-31 7
2019-01-01 7
2019-01-02 7
2019-01-03 7
2019-01-04 7
2019-01-05 7
2019-01-06 7
2019-01-07 7
2019-01-08 7
2019-01-09 7
2019-01-10 7
2019-01-11 7
2019-01-12 7
2019-01-13 7
2019-01-14 7
2019-01-15 7
2019-01-16 7
2019-01-17 7
2019-01-18 7
2019-01-19 7
2019-01-20 7
2019-01-21 7
2019-01-22 7
2019-01-23 7
2019-01-24 7
2019-01-25 7
2019-01-26 7
2019-01-27 7
2019-01-28 7
2019-01-29 7
2019-01-30 7
2019-01-31 7
2020-01-01 7
2020-01-02 7
2020-01-03 7
2020-01-04 7
2020-01-05 7
2020-01-06 7
2020-01-07 7
2020-01-08 7
2020-01-09 7
2020-01-10 7
2020-01-11 7
2020-01-12 7
2020-01-13 7
2020-01-14 7
2020-01-15 7
2020-01-16 7
2020-01-17 7
2020-01-18 7
2020-01-19 7
2020-01-20 7
2020-01-21 7
2020-01-22 7
2020-01-23 7
2020-01-24 7
2020-01-25 7
2020-01-26 7
2020-01-27 7
2020-01-28 7
2020-01-29 7
2020-01-30 7
2020-01-31 7
2021-01-01 7
2021-01-02 7
2021-01-03 7
2021-01-04 7
2021-01-05 7
2021-01-06 7
2021-01-07 7
2021-01-08 7
2021-01-09 7
2021-01-10 7
2021-01-11 7
2021-01-12 7
2021-01-13 7
2021-01-14 7
2021-01-15 7
2021-01-16 7
2021-01-17 7
2021-01-18 7
2021-01-19 7
2021-01-20 7
2021-01-21 7
2021-01-22 7
2021-01-23 7
2021-01-24 7
2021-01-25 7
2021-01-26 7
2021-01-27 7
2021-01-28 7
2021-01-29 7
2021-01-30 7
2021-01-31 7
select r.date,count(r.date) count
from
(
select id,name,substring(convert(nvarchar(50),create_date),1,10) date
from tblName
) r
group by r.date
In this code, in the subquery part,
I select the first 10 letter of date which is converted from dateTime to nvarchar so I make like '2016-01-01'. (which is not also necessary but for make code more readable I prefer to do it in this way).
Then with a simple group by I have date and date's count.

Capping values after a trigger level in a different variable _after GroupBy

There was an elegant answer to a question almost like this provided by EdChum. The difference between that question and this is that now the capping needs to be applied to data that had had "GroupBy" performed.
Original Data:
Symbol DTE Spot Strike Vol
AAPL 30.00 100.00 80.00 14.58
AAPL 30.00 100.00 85.00 16.20
AAPL 30.00 100.00 90.00 18.00
AAPL 30.00 100.00 95.00 20.00
AAPL 30.00 100.00 100.00 22.00
AAPL 30.00 100.00 105.00 25.30
AAPL 30.00 100.00 110.00 29.10
AAPL 30.00 100.00 115.00 33.46
AAPL 30.00 100.00 120.00 38.48
AAPL 50.00 102.00 80.00 13.08
AAPL 50.00 102.00 85.00 14.70
AAPL 50.00 102.00 90.00 16.50
AAPL 50.00 102.00 95.00 18.50
AAPL 50.00 102.00 100.00 20.50
AAPL 50.00 102.00 105.00 23.80
AAPL 50.00 102.00 110.00 27.60
AAPL 50.00 102.00 115.00 31.96
AAPL 50.00 102.00 120.00 36.98
IBM 30.00 170.00 150.00 7.29
IBM 30.00 170.00 155.00 8.10
IBM 30.00 170.00 160.00 9.00
IBM 30.00 170.00 165.00 10.00
IBM 30.00 170.00 170.00 11.00
IBM 30.00 170.00 175.00 12.65
IBM 30.00 170.00 180.00 14.55
IBM 30.00 170.00 185.00 16.73
IBM 30.00 170.00 190.00 19.24
IBM 60.00 171.00 150.00 5.79
IBM 60.00 171.00 155.00 6.60
IBM 60.00 171.00 160.00 7.50
IBM 60.00 171.00 165.00 8.50
IBM 60.00 171.00 170.00 9.50
IBM 60.00 171.00 175.00 11.15
IBM 60.00 171.00 180.00 13.05
IBM 60.00 171.00 185.00 15.23
IBM 60.00 171.00 190.00 17.74
I then create a few new variables:
df['ATM_dist'] =abs(df['Spot']-df['Strike'])
imin = df.groupby(['DTE','Symbol'])['ATM_dist'].transform('idxmin')
df['NormStrike']=np.log(df['Strike']/df['Spot'])/(((df['DTE']/365)**.5)*df['ATMvol']/100)
df['ATMvol'] = df.loc[imin,'Vol'].values
The results are below:
Symbol DTE Spot Strike Vol ATM_dist ATMvol NormStrike
0 AAPL 30 100 80 14.58 20 22.0 -3.537916
1 AAPL 30 100 85 16.20 15 22.0 -2.576719
2 AAPL 30 100 90 18.00 10 22.0 -1.670479
3 AAPL 30 100 95 20.00 5 22.0 -0.813249
4 AAPL 30 100 100 22.00 0 22.0 0.000000
5 AAPL 30 100 105 25.30 5 22.0 0.773562
6 AAPL 30 100 110 29.10 10 22.0 1.511132
7 AAPL 30 100 115 33.46 15 22.0 2.215910
8 AAPL 30 100 120 38.48 20 22.0 2.890688
9 AAPL 50 102 80 13.08 22 20.5 -3.201973
10 AAPL 50 102 85 14.70 17 20.5 -2.402955
11 AAPL 50 102 90 16.50 12 20.5 -1.649620
12 AAPL 50 102 95 18.50 7 20.5 -0.937027
13 AAPL 50 102 100 20.50 2 20.5 -0.260994
14 AAPL 50 102 105 23.80 3 20.5 0.382049
15 AAPL 50 102 110 27.60 8 20.5 0.995172
16 AAPL 50 102 115 31.96 13 20.5 1.581035
17 AAPL 50 102 120 36.98 18 20.5 2.141961
18 IBM 30 170 150 7.29 20 11.0 -3.968895
19 IBM 30 170 155 8.10 15 11.0 -2.929137
20 IBM 30 170 160 9.00 10 11.0 -1.922393
21 IBM 30 170 165 10.00 5 11.0 -0.946631
22 IBM 30 170 170 11.00 0 11.0 0.000000
23 IBM 30 170 175 12.65 5 11.0 0.919188
24 IBM 30 170 180 14.55 10 11.0 1.812480
25 IBM 30 170 185 16.73 15 11.0 2.681295
26 IBM 30 170 190 19.24 20 11.0 3.526940
27 IBM 60 171 150 5.79 21 9.5 -3.401827
28 IBM 60 171 155 6.60 16 9.5 -2.550520
29 IBM 60 171 160 7.50 11 9.5 -1.726243
30 IBM 60 171 165 8.50 6 9.5 -0.927332
31 IBM 60 171 170 9.50 1 9.5 -0.152273
32 IBM 60 171 175 11.15 4 9.5 0.600317
33 IBM 60 171 180 13.05 9 9.5 1.331704
34 IBM 60 171 185 15.23 14 9.5 2.043051
35 IBM 60 171 190 17.74 19 9.5 2.735427
I wish to have the values of 'Vol' cap to the level where another column 'NormStrike' hits a trigger (in this case abs(NormStrike) >= 2 ). This new column, 'Desired_Level', created while leaving the 'Vol' column unchanged. The first cap should cause the Vol value at index location 0 to be 16.2 because the cap was triggered at index location 1 when NormStrike hit -2.576719.
Added clarification:
I am looking for a generic solution, that works away from the lowest abs(NormStrike) level in both directions to hit both the -2 and the +2 trigger. If it is not hit (which it might not be) then desired level is just original_level
An additional note, it will always be true that the abs(NormStrike) continues to grow in size from the min(abs(NormStrike)) level as it is a function of abs(distance from spot to strike)
the code that EdChum provided (prior to me bringing GroupBy into the mix) is below:
clip = 4
lower = df.loc[df['NS'] <= -clip, 'Vol'].idxmax()
upper = df.loc[df['NS'] >= clip, 'Vol'].idxmin()
df['Original_level'] = df['Original_level'].clip(df.loc[lower,'Original_level'], df.loc[upper, 'Original_level'])
There are 2 issues, first, it did not work after groupby and second, if a particular group of data does not have a NS value that exceeds the "clip" value then it generates an error. The ideal outcome would be, in this case, nothing is done to the Vol level for the particular Symbol/DTE group in question.
Ed suggested implementing a reset_index() but I am not sure how to use that to solve the issue.
I hope this was not to convoluted of a question
thank you for any assistance
You can try this to see whether it works out. I assume if the clip has been triggered, then NaN will be put. You can replace it by your customized choice.
import pandas as pd
import numpy as np
# use np.where(criterion, x, y) to do a vectorized statement like if criterion is True, then set it to x, else set it to y
def func(group):
group['Triggered'] = np.where((group['NormStrike'] >= 2) | (group['NormStrike'] <= -4), 'Yes', 'No')
group['Desired_Level'] = np.where((group['NormStrike'] >= 2) | (group['NormStrike'] <= -4), np.nan, group['Vol'])
group = group.fillna(method='ffill').fillna(method='bfill')
return group
df = df.groupby(['Symbol', 'DTE']).apply(func)
Out[410]:
Symbol DTE Spot Strike Vol ATM_dist ATMvol NormStrike Triggered Desired_Level
0 AAPL 30 100 80 14.58 20 22 -3.5379 No 14.58
1 AAPL 30 100 85 16.20 15 22 -2.5767 No 16.20
2 AAPL 30 100 90 18.00 10 22 -1.6705 No 18.00
3 AAPL 30 100 95 20.00 5 22 -0.8132 No 20.00
4 AAPL 30 100 100 22.00 0 22 0.0000 No 22.00
5 AAPL 30 100 105 25.30 5 22 0.7736 No 25.30
6 AAPL 30 100 110 29.10 10 22 1.5111 No 29.10
7 AAPL 30 100 115 33.46 15 22 2.2159 Yes 29.10
8 AAPL 30 100 120 38.48 20 22 2.8907 Yes 29.10
9 AAPL 50 102 80 14.58 22 22 -3.5379 No 14.58
10 AAPL 50 102 85 16.20 17 22 -2.5767 No 16.20
11 AAPL 50 102 90 18.00 12 22 -1.6705 No 18.00
12 AAPL 50 102 95 20.00 7 22 -0.8132 No 20.00
13 AAPL 50 102 100 22.00 2 22 0.0000 No 22.00
14 AAPL 50 102 105 25.30 3 22 0.7736 No 25.30
15 AAPL 50 102 110 29.10 8 22 1.5111 No 29.10
16 AAPL 50 102 115 33.46 13 22 2.2159 Yes 29.10
17 AAPL 50 102 120 38.48 18 22 2.8907 Yes 29.10
18 AAPL 30 170 150 14.58 20 22 -3.5379 No 14.58
19 AAPL 30 170 155 16.20 15 22 -2.5767 No 16.20
20 AAPL 30 170 160 18.00 10 22 -1.6705 No 18.00
21 AAPL 30 170 165 20.00 5 22 -0.8132 No 20.00
22 AAPL 30 170 170 22.00 0 22 0.0000 No 22.00
23 AAPL 30 170 175 25.30 5 22 0.7736 No 25.30
24 AAPL 30 170 180 29.10 10 22 1.5111 No 29.10
25 AAPL 30 170 185 33.46 15 22 2.2159 Yes 29.10
26 AAPL 30 170 190 38.48 20 22 2.8907 Yes 29.10
27 AAPL 60 171 150 14.58 21 22 -3.5379 No 14.58
28 AAPL 60 171 155 16.20 16 22 -2.5767 No 16.20
29 AAPL 60 171 160 18.00 11 22 -1.6705 No 18.00
30 AAPL 60 171 165 20.00 6 22 -0.8132 No 20.00
31 AAPL 60 171 170 22.00 1 22 0.0000 No 22.00
32 AAPL 60 171 175 25.30 4 22 0.7736 No 25.30
33 AAPL 60 171 180 29.10 9 22 1.5111 No 29.10
34 AAPL 60 171 185 33.46 14 22 2.2159 Yes 29.10
35 AAPL 60 171 190 38.48 19 22 2.8907 Yes 29.10