Pandas: Get rolling metric with adaptive window size [duplicate]

I am not sure I understand the min_periods parameter of pandas rolling functions: why does it have to be smaller than the window parameter?
I would like to compute (for instance) the rolling max minus the rolling min over a window of ten values, BUT I want to wait maybe 20 values before starting the computation:
In[1]: import pandas as pd
In[2]: import numpy as np
In[3]: df = pd.DataFrame(columns=['A','B'], data=np.random.randint(low=0,high=100,size=(100,2)))
In[4]: roll = df['A'].rolling(window=10, min_periods=20)
In[5]: df['C'] = roll.max() - roll.min()
In[6]: roll
Out[6]: Rolling [window=10,min_periods=20,center=False,axis=0]
In[7]: df['C'] = roll.max()-roll.min()
I get the following error:
ValueError: Invalid min_periods size 20 greater than window 10
I thought that min_periods was there to tell how many values the function had to wait before starting computations. The documentation says:
min_periods : int, default None
Minimum number of observations in window required to have a value
(otherwise result is NA)
I had not paid attention to the "in window" detail here...
So what would be the most efficient way to achieve this? Should I do something like:
roll = df.loc[20:,'A'].rolling(window=10)
df['C'] = roll.max() - roll.min()
Is there a more efficient way?

The min_periods=n option simply means that you require at least n valid observations in the window to compute your rolling statistic.
For example, suppose min_periods=5 and you take a rolling mean over the last 10 observations. Now, what happens if 6 of those 10 observations are missing values? Then, since only 4 non-missing values are left and 4 < 5 (you require at least 5 non-missing observations), the rolling mean will be missing as well.
It's a very, very important option.
From the documentation
min_periods : int, default None
Minimum number of observations in window required to have a value
(otherwise result is NA).
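A minimal sketch of that behaviour on a made-up series (the values here are only for illustration):
import pandas as pd
import numpy as np
# 10 observations, 6 of which are missing
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0, np.nan, np.nan, np.nan, 10.0])
# window=10, min_periods=5: the full window holds only 4 valid values, so the result is NaN
print(s.rolling(window=10, min_periods=5).mean().iloc[-1])   # nan
# with min_periods=4 the same window now produces a value
print(s.rolling(window=10, min_periods=4).mean().iloc[-1])   # 5.0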

The min_periods argument is just a way to apply the function to a smaller sample than the full rolling window. So say you want the rolling minimum with a window of 10: passing min_periods=5 lets pandas calculate the min of the first 5 data points, then the first 6, then 7, 8, 9 and finally 10. Once at least 10 data points are available, pandas keeps rolling its usual 10-point window.
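A small sketch of that warm-up behaviour, using a made-up series purely for illustration:
import pandas as pd
s = pd.Series(range(1, 13))
# window=10 alone leaves the first 9 results as NaN
print(s.rolling(window=10).min().head(6).tolist())
# [nan, nan, nan, nan, nan, nan]
# with min_periods=5 the min is computed as soon as 5 values are available
print(s.rolling(window=10, min_periods=5).min().head(6).tolist())
# [nan, nan, nan, nan, 1.0, 1.0]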

Related

How to calculate rolling.agg('max') utilising a dataframe column as input to my function

I'm working with a kline dataframe. I'm adding a Swing_High and Swing_Low column to my df.
I've picked up an error where, during low-volatility periods, my Close == Swing_Low price. This gives me an inf error in another function of mine where I compute close / Swing_Low.
To fix this I need to calculate the max/min value based on whether Close == Swing_Low or not. The default rolling period is 10, but if the above is true then the rolling period should increase to 15.
Below is how I calculated Swing_High and Swing_Low up to the point of encountering the inf error.
import pandas as pd
df = pd.read_csv('Data/bybit_BTCUSD_15m.csv')
df["Date"] = df["Date"].astype('datetime64[ns]')
# Calculate the swing high and low for a given length
df['Swing_High'] = df['High'].rolling(10).agg('max')
df['Swing_Low'] = df['Low'].rolling(10).agg('min')
I tried the below function but it gives me a ValueError: The truth value of a Series is ambiguous
def swing_high(close, high, period1, period2):
    a = high.rolling(period1).agg('max')
    b = high.rolling(period2).agg('max')
    if a != close:
        return a
    else:
        return b
df['Swing_High'] = swing_high(df['Close'], df['High'], 10, 15)
How do I fix this or is there a better way to achieve my desired outcome?
A simple solution for what you're trying to achieve: use the where() function.
Here's the basic syntax of the pandas where() method; the calling series is kept wherever the condition is True, and the other value is used wherever it is False:
df['col'] = (value_if_true).where(condition, value_if_false)
df['Swing_High_10'] = df['High'].rolling(10).agg('max')
df['Swing_High_15'] = df['High'].rolling(15).agg('max')
df['Swing_High'] = df['Swing_High_10'].where(df['Swing_High_10'] != df['Close'], df['Swing_High_15'])
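A tiny self-contained sketch of those where() semantics, with made-up stand-in values:
import pandas as pd
s10 = pd.Series([5.0, 7.0, 9.0])     # stand-in for a 10-period rolling max
s15 = pd.Series([6.0, 8.0, 10.0])    # stand-in for a 15-period rolling max
close = pd.Series([5.0, 6.5, 9.0])
# keep the 10-period value where it differs from close, otherwise fall back to the 15-period value
print(s10.where(s10 != close, s15).tolist())
# [6.0, 7.0, 10.0]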

Sample random row from df.groupby("column1")["column2"].max() and not first one if multiple candidates

What would be the correct way to return n random max values from a groupby?
I have a dataframe containing audio events, with the following columns:
audio
start_time
end_time
duration
labelling confidence (1 to 5)
label ("Ambulance", "Engine", ...)
I have multiple events/rows for each label and I have 26 labels in total.
What I would like to achieve is to get one event per label with max confidence.
Let's say we have 7 events that have label "Ambulance" and they have the following labelling confidence: 2, 5, 5, 4, 4, 3, 5.
The max confidence is 5 in this case, which gives us 3 selectable events.
I would like to get one of the three at random.
Doing the following with pandas: df.groupby("label").max() will return the first row with max labelling confidence. I would like it to be a random selection.
Edit: following a comment from the OP, the simplest solution is to shuffle the data frame before picking the max rows:
import numpy as np
import pandas as pd
# Some random data
labels = list('ABCDE')
repeats = np.random.randint(1, 6, len(labels))
df = pd.DataFrame({
    'label': np.repeat(labels, repeats),
    'confidence': np.random.randint(1, 6, repeats.sum())
})
# Shuffle the data frame, then sort so the highest `confidence` comes
# first within each `label`. The first row per `label` is then a max row,
# and the shuffle breaks ties between equal-confidence rows randomly.
(
    df.sample(frac=1)
      .sort_values(['label', 'confidence'], ascending=[True, False])
      .groupby('label')
      .head(1)
)
If you are running this in IPython / Jupyter Notebook, watch the index of the resulting data frame to see the randomness of the result.
Here is how I finally managed to do it:
shuffled_df = df.sample(frac=1)
filtered_df = shuffled_df.loc[shuffled_df.groupby("label")["confidence"].idxmax()]
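A minimal self-contained run of that approach, using the confidences from the question's "Ambulance" example as toy data:
import pandas as pd
df = pd.DataFrame({
    'label': ['Ambulance'] * 7,
    'confidence': [2, 5, 5, 4, 4, 3, 5],
})
# shuffle first, so groupby().idxmax() hits a random one of the tied max rows
shuffled_df = df.sample(frac=1)
filtered_df = shuffled_df.loc[shuffled_df.groupby('label')['confidence'].idxmax()]
print(filtered_df)   # one of the three rows with confidence 5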

Pandas aggregate by unique occurrence per group

In pandas, I'd like to analyze groups that have at least one occurrence of a conditional value. I've included a sample dataframe with a first-step attempt at identifying such groups below. So, let's say, in the data frame below, I want to filter the original data frame only for species of iris that ever had a sepal length greater than 6. In the last command, I'm counting the number of unique species groups that had a sepal length greater than 6 (so, at least I can count them).
But, what I really want is the original dataframe where I analyze rows only if the species had a sepal length greater than 6 (so, it would be a dataframe without the species "setosa" since they never have one).
The longer explanation is that I have a real dataset of users. Each user will have values in certain columns that may exceed a threshold value of interest. I haven't figured out how to analyze users who have these threshold values.
Perhaps a loop would be better: I might loop through each unique user name, check whether any row for that user ever exceeds a certain value, and assign some kind of new column. But I know loops are frowned upon in pandas, so I'm posting here to see if there's a well-known method of identifying groups by occurrence.
Thanks and let me know if I can make this question any more clear!
import pandas as pd
import seaborn as sns
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
iris = sns.load_dataset('iris')
iris['longsepal'] = iris['sepal_length'] > 7
iris['longpetal'] = iris['petal_length'] > 5
iris.groupby(['longsepal'])['species'].nunique()
Consider groupby().transform() to calculate an inline max aggregate per species, which can then be filtered on its value. Technically, the > 7 filter only returns one species, as the versicolor max reaches exactly 7.0. Below shows both the operator and the functional form of the inequality logic.
iris['longsepal'] = iris.groupby(['species'])['sepal_length'].transform('max')
iris['longpetal'] = iris.groupby(['species'])['petal_length'].transform('max')
# DATA FILTERS
longsepal_iris = iris.loc[iris['longsepal'] > 7] # GREATER THAN OPERATOR FORM: >
longsepal_iris = iris.loc[iris['longsepal'].gt(7)] # GREATER THAN FUNCTIONAL FORM: gt()
longpetal_iris = iris.loc[iris['longpetal'] > 5] # GREATER THAN OPERATOR FORM: >
longpetal_iris = iris.loc[iris['longpetal'].gt(5)] # GREATER THAN FUNCTIONAL FORM: gt()
# SPECIES
longsepal_iris['species'].unique()
# ['virginica']
longpetal_iris['species'].unique()
# ['versicolor' 'virginica']
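As a possible alternative (not part of the original answer), groupby().filter() can return the qualifying rows of the original frame directly; a sketch under the same iris setup:
# keep only rows of species whose sepal length ever exceeds 7
longsepal_iris = iris.groupby('species').filter(lambda g: g['sepal_length'].gt(7).any())
longsepal_iris['species'].unique()
# ['virginica']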

pandas resample when cumulative function returns data frame

I would like to use the resampling functionality from pandas, but apply my own custom function. The problem I'm facing is that the custom function returns a pandas DataFrame instead of a single array.
The following example illustrate my problem:
>>> import pandas as pd
>>> import numpy as np
>>> def f(data):
... return ((1+data).cumprod(axis=0)-1)
...
>>> data = np.random.randn(1000,3)
>>> index = pd.date_range("20170101", periods = 1000, freq="B")
>>> df = pd.DataFrame(data=data, index=index)
Now suppose I want to resample from business days to business month-end frequency:
>>> resampler = df.resample("BM")
If I now apply my function f I don't get the desired result; I would like to get just the last row of the output of f.
>>> resampler.apply(f)
This is because the cumprod in my function f returns a pandas DataFrame. I could write f such that it returns just the last row; however, I would like to use this function in other places as well, where it should return the whole DataFrame. This could be solved by introducing a flag like "last_row" in f that controls whether the complete frame or just the last row is returned, but that solution seems rather nasty.
Just define your function f with a last_row parameter. You can default it to False so that it returns the entire dataframe; when True, it returns the last row.
def f(data, last_row=False):
    df = ((1 + data).cumprod(axis=0) - 1)
    if last_row:
        return df.iloc[-1]
    return df
Get the last row
df.resample('BM').apply(f, last_row=True)
0 1 2
2017-01-31 0.185662 -0.580058 -1.004879
2017-02-28 -1.004035 -0.999878 17.059846
2017-03-31 -0.995280 -1.000001 -1.000507
2017-04-28 -1.000656 -240.369487 -1.002645
2017-05-31 47.646827 -72.042190 -1.000016
....
Return all the rows as you already did.
df.resample('BM').apply(f)
I think you could refactor in the following way, which will be much faster for larger dataframes:
(1+df).resample('BM').prod() - 1
0 1 2
2017-01-31 -0.999436 -1.259078 -1.000215
2017-02-28 -1.221404 0.342863 9.841939
2017-03-31 -0.820196 -1.002598 -0.450662
2017-04-28 -1.000299 2.739184 -1.035557
2017-05-31 -0.999986 -0.920445 -2.103289
That gives the same answer as @TedPetrou's, although you can't tell here because we used different random seeds; you can easily test this yourself. The reason prod() works in place of cumprod() is that taking the last row of a cumulative product over each resample bin is the same as taking the plain product over that bin, so (1 + x).cumprod().iloc[-1] - 1 equals (1 + x).prod() - 1 within each month. Still, as you can see this is a mix of intuition and reverse engineering, and I'll update as I double-check things...
For this relatively small dataframe with 1,000 rows, this way is only around twice as fast, but if you increase the rows you'll find this way scales much better (about 250x faster at 10,000 rows).
Alternative approaches: These give different answers from the above (and from each other) but I wonder if they might be closer to what you are looking for?
(1+df).resample('BM').mean().expanding().apply(lambda x: x.prod() - 1)
(1+df).expanding().apply(lambda x: x.prod() - 1).resample('BM').mean()

Realign Pandas Rolling Average series in a dataframe

Very new to pandas. I'm importing stock data into a DataFrame and I want to calculate a 10-day rolling average; that part I can figure out. The issue is that it gives 9 NaNs because of the 10-day moving-average period.
I want to re-align the data so that the 10th value's rolling average appears at the top of the new column in the data frame. I tried moving the data by writing the following code:
small = pd.rolling_mean(df['Close'],10)
and then trying to add that to the df with the following code
df['MA10D2'] = small[9:]
but it still gives 9 NaNs at the top. Can anyone help me out?
Assignment is done based on the index: small[9:] starts at index position 9, so the assigned values keep their positions starting at index 9.
The function you are searching for is called shift:
df['MA10D2'] = small.shift(-9)
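Note that pd.rolling_mean is the old pandas API (since replaced by the .rolling() method); here is a small self-contained sketch of the same shift idea with the current syntax, using toy data in place of the question's CSV:
import pandas as pd
import numpy as np
# toy data standing in for the question's 'Close' column
df = pd.DataFrame({'Close': np.arange(1.0, 21.0)})
# modern equivalent of the legacy pd.rolling_mean(df['Close'], 10)
small = df['Close'].rolling(10).mean()
# shift the series up by 9 rows so the first computed average sits in the first row
df['MA10D2'] = small.shift(-9)
print(df['MA10D2'].head(3).tolist())
# [5.5, 6.5, 7.5]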