Vectorize for loop and return x day high and low - pandas

Overview
For each row of a dataframe I want to calculate the x day high and low.
An x day high is higher than previous x days.
An x day low is lower than previous x days.
The for loop is explained in further detail in this post
Update:
Answer by #mozway below completes in around 20 seconds with dataset containing 18k rows. Can this be improved with numpy with broadcasting etc?
Example
2020-03-20 has an x_day_low value of 1 as it is lower than the previous day.
2020-03-27 has an x_day_high value of 8 as it is higher than the previous 8 days.
See desired output and test code below which is calculated with a for loop in the findHighLow function. How would I vectorize findHighLow as the actual dataframe is somewhat larger.
Test data
def genMockDataFrame(days,startPrice,colName,startDate,seed=None):
periods = days*24
np.random.seed(seed)
steps = np.random.normal(loc=0, scale=0.0018, size=periods)
steps[0]=0
P = startPrice+np.cumsum(steps)
P = [round(i,4) for i in P]
fxDF = pd.DataFrame({
'ticker':np.repeat( [colName], periods ),
'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
'price':(P)})
fxDF.index = pd.to_datetime(fxDF.date)
fxDF = fxDF.price.resample('D').ohlc()
fxDF.columns = [i.title() for i in fxDF.columns]
return fxDF
#rows set to 15 for minimal example but actual dataframe contains around 18000 rows.
number_of_rows = 15
df = genMockDataFrame(number_of_rows,1.1904,'tttmmm','19/3/2020',seed=157)
def findHighLow (df):
df['x_day_high'] = 0
df['x_day_low'] = 0
for n in reversed(range(len(df['High']))):
for i in reversed(range(n)):
if df['High'][n] > df['High'][i]:
df['x_day_high'][n] = n - i
else: break
for n in reversed(range(len(df['Low']))):
for i in reversed(range(n)):
if df['Low'][n] < df['Low'][i]:
df['x_day_low'][n] = n - i
else: break
return df
df = findHighLow (df)
Desired output should match this:
df[["High","Low","x_day_high","x_day_low"]]
High Low x_day_high x_day_low
date
2020-03-19 1.1937 1.1832 0 0
2020-03-20 1.1879 1.1769 0 1
2020-03-21 1.1767 1.1662 0 2
2020-03-22 1.1721 1.1611 0 3
2020-03-23 1.1819 1.1690 2 0
2020-03-24 1.1928 1.1807 4 0
2020-03-25 1.1939 1.1864 6 0
2020-03-26 1.2141 1.1964 7 0
2020-03-27 1.2144 1.2039 8 0
2020-03-28 1.2099 1.2018 0 1
2020-03-29 1.2033 1.1853 0 4
2020-03-30 1.1887 1.1806 0 6
2020-03-31 1.1972 1.1873 1 0
2020-04-01 1.1997 1.1914 2 0
2020-04-02 1.1924 1.1781 0 9

Here are two so solutions. Both produce the desired output, as posted in the question.
The first solution uses Numba and completes in 0.5 seconds on my machine for 20k rows. If you can use Numba, this is the way to go. The second solution uses only Pandas/Numpy and completes in 1.5 seconds for 20k rows.
Numba
#numba.njit
def count_smaller(arr):
current = arr[-1]
count = 0
for i in range(arr.shape[0]-2, -1, -1):
if arr[i] > current:
break
count += 1
return count
#numba.njit
def count_greater(arr):
current = arr[-1]
count = 0
for i in range(arr.shape[0]-2, -1, -1):
if arr[i] < current:
break
count += 1
return count
df["x_day_high"] = df.High.expanding().apply(count_smaller, engine='numba', raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, engine='numba', raw=True)
Pandas/Numpy
def count_consecutive_true(bool_arr):
return bool_arr[::-1].cumprod().sum()
def count_smaller(arr):
return count_consecutive_true(arr <= arr[-1]) - 1
def count_greater(arr):
return count_consecutive_true(arr >= arr[-1]) - 1
df["x_day_high"] = df.High.expanding().apply(count_smaller, raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, raw=True)
This last solution is similar to mozway's. However it runs faster because it doesn't need to perform a join and uses numpy as much as possible. It also looks arbitrarily far back.

You can use rolling to get the last N days, a comparison + cumprod on the reversed boolean array to keep only the last consecutive valid values, and sum to count them. Apply on each column using agg and join the output after adding a prefix.
# number of days
N = 8
df.join(df.rolling(f'{N+1}d', min_periods=1)
.agg({'High': lambda s: s.le(s.iloc[-1])[::-1].cumprod().sum()-1,
'Low': lambda s: s.ge(s.iloc[-1])[::-1].cumprod().sum()-1,
})
.add_prefix(f'{N}_days_')
)
Output:
Open High Low Close 8_days_High 8_days_Low
date
2020-03-19 1.1904 1.1937 1.1832 1.1832 0.0 0.0
2020-03-20 1.1843 1.1879 1.1769 1.1772 0.0 1.0
2020-03-21 1.1755 1.1767 1.1662 1.1672 0.0 2.0
2020-03-22 1.1686 1.1721 1.1611 1.1721 0.0 3.0
2020-03-23 1.1732 1.1819 1.1690 1.1819 2.0 0.0
2020-03-24 1.1836 1.1928 1.1807 1.1922 4.0 0.0
2020-03-25 1.1939 1.1939 1.1864 1.1936 6.0 0.0
2020-03-26 1.1967 1.2141 1.1964 1.2114 7.0 0.0
2020-03-27 1.2118 1.2144 1.2039 1.2089 7.0 0.0
2020-03-28 1.2080 1.2099 1.2018 1.2041 0.0 1.0
2020-03-29 1.2033 1.2033 1.1853 1.1880 0.0 4.0
2020-03-30 1.1876 1.1887 1.1806 1.1879 0.0 6.0
2020-03-31 1.1921 1.1972 1.1873 1.1939 1.0 0.0
2020-04-01 1.1932 1.1997 1.1914 1.1914 2.0 0.0
2020-04-02 1.1902 1.1924 1.1781 1.1862 0.0 7.0

Related

How to add Multilevel Columns and create new column?

I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a total column and then a on% column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
Oof this was a roughie, but you can do it like this if you want to avoid loops. Worth noting it redefines your df twice because i need the total columns. Sorry about that, but is the best i could do. Also if you have any questions just comment.
df = pd.concat([y.assign(**{'Total {0}'.format(x+1): y.iloc[:,0] + y.iloc[:,1]})for x , y in df.groupby(np.arange(df.shape[1])//2,axis=1)],axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x+1): (y.iloc[:,0] / y.iloc[:,2])*100})for x , y in df.groupby(np.arange(df.shape[1])//3,axis=1)],axis=1)
print(df)
This groups by the column's first index (rooms) and then loops through each group to add the total and percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
df[(room, 'total')] = group.sum(axis=1)
df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme value - more than 5 std.
I want to replace, per column, each value that is more than 5 std with the max other value.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s=df.mask((df-df.apply(lambda x: x.std() )).gt(5))#mask where condition applies
s=s.assign(A=s.A.fillna(s.A.max()),B=s.B.fillna(s.B.max())).sort_index(axis = 0)#fill with max per column and resort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments you need to decide what your threshold is. say it is q=100, then you can do
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
do the same for B
Calculate a column-wise z-score (if you deem something an outlier if it lies outside a given number of standard deviations of the column) and then calculate a boolean mask of values outside your desired range
def calc_zscore(col):
return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something

How to shift every value of selected rows by one place in pandas

In the pandas dataframe I need to shift all the values in every seventh row (every Saturday) by one place, so that all the 10s line up vertically.
So 2020-11-07
should go from
10 1 10 3 10 9 10 7 10 3 10 5 10 5 10 10
to
10 10 1 10 3 10 9 10 7 10 3 10 5 10 5 10
And likewise for 2020-11-14, 2020-11-21 etc.
For every 7th row in the dataframe, use shift method to move everything right by one place and then concatenate the last value and everything from the shifted row except the first one (which will be null)
for i in range(len(df)):
if i % 6 == 0:
df.iloc[i, :] = [df.iloc[i, -1]] + df.iloc[i, :].shift(1).tolist()[1:]
If needed, this solution can be generalized to shift every kth row by r places
for i in range(len(df)):
if i % (k-1) == 0:
df.iloc[i, :] = df.iloc[i, -r:].tolist() + df.iloc[i, :].shift(r).tolist()[r:]
EDIT:
You can achieve this also without using the shift method
For every 7th row 1 place:
for i in range(len(df)):
if i % 6 == 0:
df.iloc[i, :] = [df.iloc[i, -1]] + df.iloc[i, :-1].tolist()
For the general case of kth row r places:
for i in range(len(df)):
if i % (k-1) == 0:
df.iloc[i, :] = df.iloc[i, -r:].tolist() + df.iloc[i, :-r].tolist()
Do you want something like this ?
import pandas as pd
import numpy as np
# Build a little exemple (with only 2 columns)
rng = pd.date_range('2021-11-01', periods=15, freq='D')
df = pd.DataFrame({ 'Date': rng, '1': range(15), '2':range(15) })
df["2"] = df["2"] * 100
# Algo :
# creat a new column with the max hour+1
df["3"] = np.NaN
for i in [2,1]:
# Increment hour every sunday (day 6, because starting at 0)
df[str(i+1)] = df[str(i)].where(df["Date"].dt.weekday == 6, df[str(i+1)])
# Last hour become first hour every sunday
df[str(1)] = df[str(3)].where(df["Date"].dt.weekday == 6, df[str(1)])
#EndFor
# Keep only first columns
df = df[["Date", "1", "2"]]
Result :
Date 1 2
0 2021-11-01 0.0 0
1 2021-11-02 1.0 100
2 2021-11-03 2.0 200
3 2021-11-04 3.0 300
4 2021-11-05 4.0 400
5 2021-11-06 5.0 500
6 2021-11-07 600.0 6
7 2021-11-08 7.0 700
8 2021-11-09 8.0 800
9 2021-11-10 9.0 900
10 2021-11-11 10.0 1000
11 2021-11-12 11.0 1100
12 2021-11-13 12.0 1200
13 2021-11-14 1300.0 13
14 2021-11-15 14.0 1400
This is not the best way to get the result, but this is one.

Pandas - calculate rolling average of group excluding current row

For an example:
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
'Date' : [1,2,3,4,5,6,1,2,3,4,5,1,2,3,4,5],
'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
This works to calculate the rolling average, inclusive of the current row:
df['avg'] = df.groupby(['Platoon'])['Casualties'].transform(lambda x: x.rolling(2, 1).mean())
Which gives:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 2.5
A 3 5 4.5
A 4 7 6.0
......
What I want to get is:
Platoon Date Casualties Avg
A 1 1 1.0
A 2 4 1.0
A 3 5 2.5
A 4 7 4.5
......
I suspect I can use shift here but I can't figure it out!
You need shift with bfill
df.groupby(['Platoon'])['Casualties'].apply(lambda x: x.rolling(2, 1).mean().shift().bfill())

Sorting Pandas data frame with groupby and conditions

I'm trying to sort a data frame based on groups meeting conditions.
The I'm getting a syntax error for the way I'm sorting the groups.
And I'm losing the initial order of the data frame before attempting the above.
This is the order of sorting that I'm trying to achieve:
1) Sort on First and Test columns.
2) Test==1 groups, sort on Secondary then by Final column.
---Test==0 groups, sort on Final column only.
import pandas as pd
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.1,.1,.2,.2,.3,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
def sorter(x):
if x["Test"]==1:
x.sort_values(['Secondary','Final'], inplace=True)
else:
x=x.sort_values('Final', inplace=True)
df=df.sort_values(["First","Test"],ascending=[False, False]).reset_index(drop=True)
df.groupby(['First','Test']).apply(lambda x: sorter(x))
df
Expected result:
First Test Secondary Final
200 1 0.4 10.1
200 1 0.3* 9.9*
200 1 0.3* 8.8*
200 0 0.4 11.11*
200 0 0.3 7.7*
100 1 0.5 2.2
100 1 0.1* 3.3*
100 1 0.1* 1.1*
100 0 0.3 6.6*
100 0 0.2 5.5*
100 0 0.2 4.4*
You can try of sorting in descending order without groupby,
w.r.t sequence you gave, the order of sorting will change.will it work for you
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df = df.groupby(['First','Test']).apply(lambda x: x.sort_values(['First','Test','Secondary','Final'],ascending=False) if x.iloc[0]['Test']==1 else x.sort_values(['First','Test','Final'],ascending=False)).reset_index(drop=True)
df.sort_values(['First','Test'],ascending=[True,False])
Out:
Final First Secondary Test
3 2.20 100 0.5 1
4 3.30 100 0.1 1
5 1.10 100 0.1 1
0 6.60 100 0.1 0
1 5.50 100 0.4 0
2 4.40 100 0.9 0
8 10.10 200 0.4 1
9 9.90 200 0.3 1
10 8.80 200 0.3 1
6 11.11 200 0.4 0
7 7.70 200 0.3 0
The trick was to sort subsets separately and replace the values in the original df.
This came up in other solutions to pandas sorting problems.
import pandas as pd
df=pd.DataFrame({"First":[100,100,100,100,100,100,200,200,200,200,200],"Test":[1,1,1,0,0,0,0,1,1,1,0],"Secondary":[.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],"Final":[1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df.sort_values(['First','Test','Secondary','Final'],ascending=False, inplace=True)
index_subset=df[df["Test"]==0].index
sorted_subset=df[df["Test"]==0].sort_values(['First','Final'],ascending=False)
df.loc[index_subset,:]=sorted_subset.values
print(df)