Computing rolling autocorrelation using pandas.rolling

I am attempting to calculate the rolling autocorrelation for a Series object using Pandas (0.23.3).
Setting up the example:
import numpy as np
import pandas as pd

dt_index = pd.date_range('2018-01-01', '2018-02-01', freq='B')
data = np.random.rand(len(dt_index))
s = pd.Series(data, index=dt_index)
Creating a Rolling object with window size = 5:
r = s.rolling(5)
Getting:
Rolling [window=5,center=False,axis=0]
Now when I try to calculate the correlation (pretty sure this is the wrong approach):
r.corr(other=r)
I get only NaNs
I tried another approach based on the documentation:
df = pd.DataFrame()
df['a'] = s
df['b'] = s.shift(-1)
df.rolling(window=5).corr()
Getting something like:
...
2018-03-01 a NaN NaN
b NaN NaN
Really not sure where I'm going wrong with this. Any help would be immensely appreciated! The docs use float64 as well, so I wondered whether the correlation is very close to zero and is therefore showing as NaN? Somebody raised a bug report here, but I think jreback resolved it in an earlier bug fix.
This is another relevant answer, but it's using pd.rolling_apply, which does not seem to be supported in Pandas version 0.23.3?

IIUC,
>>> s.rolling(5).apply(lambda x: x.autocorr(), raw=False)
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 -0.502455
2018-01-08 -0.072132
2018-01-09 -0.216756
2018-01-10 -0.090358
2018-01-11 -0.928272
2018-01-12 -0.754725
2018-01-15 -0.822256
2018-01-16 -0.941788
2018-01-17 -0.765803
2018-01-18 -0.680472
2018-01-19 -0.902443
2018-01-22 -0.796185
2018-01-23 -0.691141
2018-01-24 -0.427208
2018-01-25 0.176668
2018-01-26 0.016166
2018-01-29 -0.876047
2018-01-30 -0.905765
2018-01-31 -0.859755
2018-02-01 -0.795077

This is a lot faster than Pandas' autocorr but the results are different. In my dataset, there is a 0.87 Pearson correlation between the results of those two methods. There is a discussion about why the results are different here.
from statsmodels.tsa.stattools import acf
s.rolling(5).apply(lambda x: acf(x, unbiased=True, fft=False)[1], raw=True)
Note that the input cannot have null values, otherwise it will return all nulls.
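If the series does contain missing values, one workaround (a sketch, not part of the original answer; it assumes the gaps left by dropped rows can safely be ignored) is to drop them before applying the rolling window:
# drop NaNs first so acf does not return all NaN
clean = s.dropna()
clean.rolling(5).apply(lambda x: acf(x, unbiased=True, fft=False)[1], raw=True)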

Related

Pandas resample().size() returns different object types depending on sample frequency

I have a Pandas DataFrame which I am summarizing using the groupby(), resample() and size() functions:
freq = upgrade.groupby("UPGRADESTRATEGY")
type(freq)
<class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
print(freq.resample('Q'))
DatetimeIndexResamplerGroupby [freq=<QuarterEnd: startingMonth=12>, axis=0, closed=right, label=right, convention=e, base=0]
print(freq.resample('Y'))
DatetimeIndexResamplerGroupby [freq=<YearEnd: month=12>, axis=0, closed=right, label=right, convention=e, base=0]
To my surprise, the size() function returns different types depending on the sampling frequency:
if I sample by quarter, I receive a Series,
if I sample by year, I receive a DataFrame.
Code:
print(type(freq.resample('Q').size()))
<class 'pandas.core.series.Series'>
print(type(freq.resample('Y').size()))
<class 'pandas.core.frame.DataFrame'>
The outputs are also semantically different, so I cannot re-use the same pipeline to process the results. I expected that the output object type and the structure of the output data would not depend on the sampling frequency I used.
Questions:
Why am I getting a different object type from the size() function
when I use different sampling frequencies?
How can I change the code to get the same object type and structure regardless of the specified sampling frequency?
I am using Pandas version 0.23.4.
UPDATE:
The problem is apparently data-dependent. I boiled it down to a handful of records whose inclusion changes the output type. I cannot explain why I am getting a different output type here. Sorting the values by date does not change the result.
u2.dtypes
UPGRADESTRATEGY object
SCENARIO_STARTDATE datetime64[ns]
dtype: object
This generates a DataFrame:
print(u2.iloc[-105:-96])
UPGRADESTRATEGY SCENARIO_STARTDATE
18645 b 2016-12-20 14:48:57
18646 a 2017-01-07 16:58:44
18647 b 2017-01-11 14:39:58
18648 a 2017-01-10 15:42:22
18649 a 2017-01-10 10:07:34
18650 a 2017-01-12 15:31:14
18651 a 2017-01-13 12:44:02
18652 a 2017-01-13 14:51:59
18653 a 2016-12-12 22:30:01
type(u2.iloc[-105:-96].groupby(["UPGRADESTRATEGY"]).resample('Q', on='SCENARIO_STARTDATE').size())
pandas.core.frame.DataFrame
This generates a Series:
print(u2.iloc[-104:-96])
UPGRADESTRATEGY SCENARIO_STARTDATE
18646 a 2017-01-07 16:58:44
18647 b 2017-01-11 14:39:58
18648 a 2017-01-10 15:42:22
18649 a 2017-01-10 10:07:34
18650 a 2017-01-12 15:31:14
18651 a 2017-01-13 12:44:02
18652 a 2017-01-13 14:51:59
18653 a 2016-12-12 22:30:01
type(u2.iloc[-104:-96].groupby(["UPGRADESTRATEGY"]).resample('Q', on='SCENARIO_STARTDATE').size())
pandas.core.series.Series
DataFrame output:
u2.iloc[-105:-96].groupby(["UPGRADESTRATEGY"]).resample('Q', on='SCENARIO_STARTDATE').size()
SCENARIO_STARTDATE 2016-12-31 2017-03-31
UPGRADESTRATEGY
a 1 6
b 1 1
Series output:
u2.iloc[-104:-96].groupby(["UPGRADESTRATEGY"]).resample('Q', on='SCENARIO_STARTDATE').size()
UPGRADESTRATEGY SCENARIO_STARTDATE
a 2016-12-31 1
2017-03-31 6
b 2017-03-31 1
dtype: int64
I am out of ideas!
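One way to sidestep the inconsistency (a sketch, not an explanation of the underlying behaviour) is to avoid groupby().resample() entirely and group by both keys at once using pd.Grouper; this always returns a Series with a MultiIndex, regardless of the frequency or the data:
out = u2.groupby(['UPGRADESTRATEGY',
                  pd.Grouper(key='SCENARIO_STARTDATE', freq='Q')]).size()
# out is always a MultiIndex Series; use .unstack(fill_value=0) if you want a DataFrame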

Re-sampling and interpolating data using pandas from a given date column to a different date column

Using pandas I can mostly find conversions and down-/upsampling from, e.g., a daily date range to a monthly one, or from monthly/yearly date ranges to daily ones.
Is there a way, given data for some arbitrary days, to map them onto different days using interpolation/extrapolation?
Index.union, reindex, and interpolate
MCVE
Create toy data. Three rows every other day.
import pandas as pd

tidx = pd.date_range('2018-01-01', periods=3, freq='2D')
df = pd.DataFrame(dict(A=[1, 3, 5]), tidx)
df
A
2018-01-01 1
2018-01-03 3
2018-01-05 5
New index for those days in between
other_tidx = pd.date_range(tidx.min(), tidx.max()).difference(tidx)
Solution
Create a new index that is the union of the old index and the new index
union_idx = other_tidx.union(df.index)
When we reindex with this we get
df.reindex(union_idx)
A
2018-01-01 1.0
2018-01-02 NaN
2018-01-03 3.0
2018-01-04 NaN
2018-01-05 5.0
We see the gaps we expected. Now we can use interpolate. But we need to use the argument method='index' to ensure we interpolate relative to the size of the gaps in the index.
df.reindex(union_idx).interpolate('index')
A
2018-01-01 1.0
2018-01-02 2.0
2018-01-03 3.0
2018-01-04 4.0
2018-01-05 5.0
And now those gaps are filled.
We can reindex again to reduce to just the other index values
df.reindex(union_idx).interpolate('index').reindex(other_tidx)
A
2018-01-02 2.0
2018-01-04 4.0
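If you need this repeatedly, the whole recipe can be wrapped in a small helper (the name remap_days is just illustrative):
def remap_days(df, new_index):
    # reindex to the union, interpolate by index distance, keep only the new days
    union_idx = new_index.union(df.index)
    return df.reindex(union_idx).interpolate('index').reindex(new_index)

remap_days(df, other_tidx)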

Error when changing date format in dataframe index

I have the following df :
A B
2018-01-02 100.000000 100.000000
2018-01-03 100.808036 100.325886
2018-01-04 101.616560 102.307700
I want to change the date format of the index, so I tried (using @jezrael's response in the linked question Format pandas dataframe index date):
df.index = rdo.index.strftime('%d-%m-%Y')
But it outputs :
AttributeError: 'Index' object has no attribute 'strftime'
My desired output would be:
A B
02-01-2018 100.000000 100.000000
03-01-2018 100.808036 100.325886
04-01-2018 101.616560 102.307700
The question linked above seems quite similar to my issue. I do not really understand why the AttributeError arises.
Your index seems to be of string (object) dtype (which you can check with df.info()), but strftime needs a DatetimeIndex, so convert it first:
In [19]: df.index = pd.to_datetime(df.index).strftime('%d-%m-%Y')
In [20]: df
Out[20]:
A B
02-01-2018 100.000000 100.000000
03-01-2018 100.808036 100.325886
04-01-2018 101.616560 102.307700
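If you still need to do date arithmetic afterwards, an alternative (a sketch, not part of the original answer; the filename is just an example) is to keep the DatetimeIndex and only format the dates on output:
df.index = pd.to_datetime(df.index)            # keep a real DatetimeIndex for further work
df.to_csv('out.csv', date_format='%d-%m-%Y')   # format the dates only when writing out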

Pandas add column based on grouped by rolling average

I have successfully added a new summed Volume column using Transform when grouping by Date like so:
df
Name Date Volume
--------------------------
APL 12-01-2017 1102
BSC 12-01-2017 4500
CDF 12-02-2017 5455
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
Name Date Volume vol_all_daily
------------------------------------------
APL 12-01-2017 1102 5602
BSC 12-01-2017 4500 5602
CDF 12-02-2017 5455 5455
However when I want to take the rolling average it doesn't work!
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
This returns a grouped result that gives an error and is too hard to put back into a df column anyway.
df['vol_all_ma_2'] = df['vol_all_daily'].groupby([df['Date']]).transform('mean').rolling(window=2).mean()
This just produces a near-identical result to the vol_all_daily column.
Update:
I wasn't taking just one row per date. The above code will still take multiple rows per date, so instead I add .first() to the groupby. Not sure why groupby isn't taking one row per date.
The behavior of what you have written seems correct (Part 1 below), but perhaps you want to be calling something different (Part 2 below).
Part 1: Why what you have written is behaving correctly:
d = {'Name':['APL', 'BSC', 'CDF'],'Date':pd.DatetimeIndex(['2017-12-01', '2017-12-01', '2017-12-02']),'Volume':[1102,4500,5455]}
df = pd.DataFrame(d)
df['vol_all_daily'] = df['Volume'].groupby([df['Date']]).transform('sum')
print(df)
rolling_vol = df['vol_all_daily'].groupby([df['Date']]).rolling(window=2).mean()
print('')
print(rolling_vol)
I get as output:
Date Name Volume vol_all_daily
0 2017-12-01 APL 1102 5602
1 2017-12-01 BSC 4500 5602
2 2017-12-02 CDF 5455 5455
Date
2017-12-01 0 NaN
1 5602.0
2017-12-02 2 NaN
Name: vol_all_daily, dtype: float64
To understand why this result rolling_vol is correct, notice that you called groupby first and rolling only afterwards, so the rolling mean is computed within each date group; that should not be expected to line up with df.
Part 2: What I think you wanted to call (just a rolling average):
If you instead run:
# same as above but without groupby
rolling_vol2 = df['vol_all_daily'].rolling(window=2).mean()
print('')
print(rolling_vol2)
You should get:
0 NaN
1 5602.0
2 5528.5
Name: vol_all_daily, dtype: float64
which looks more like the rolling average you seem to want. To explain that, I suggest reading the details of pandas resampling vs rolling.
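If the goal was a rolling mean over the daily totals that lines up with the original rows, one possible approach (a sketch, not from the original answer) is to compute it on one value per date and map it back:
daily = df.groupby('Date')['Volume'].sum()       # one total per date
daily_ma2 = daily.rolling(window=2).mean()       # rolling mean across the dates
df['vol_all_ma_2'] = df['Date'].map(daily_ma2)   # broadcast back onto every row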

Pandastic way of growing a dataframe

So, I have a year-indexed dataframe that I would like to extend beyond the end year (2013) by some logic, say, grow the last value by n percent for 10 years; but the logic could also be to just add a constant, or a slightly growing number. I will leave that to a function and just put the logic there.
I can't think of a neat vectorized way to do that for an arbitrary length of time and arbitrary logic, leaving a longer dataframe with the extra increments added, and I would prefer not to loop it.
The particular calculation matters. In general you would have to compute the values in a loop. Some NumPy ufuncs (such as np.add, np.multiply, np.minimum, np.maximum) have an accumulate method, however, which may be useful depending on the calculation.
For example, to calculate values given a constant growth rate, you could use np.multiply.accumulate (or cumprod):
import numpy as np
import pandas as pd
N = 10
index = pd.date_range(end='2013-12-31', periods=N, freq='D')
df = pd.DataFrame({'val':np.arange(N)}, index=index)
last = df['val'][-1]
# val
# 2013-12-22 0
# 2013-12-23 1
# 2013-12-24 2
# 2013-12-25 3
# 2013-12-26 4
# 2013-12-27 5
# 2013-12-28 6
# 2013-12-29 7
# 2013-12-30 8
# 2013-12-31 9
# expand df
index = pd.date_range(start='2014-1-1', periods=N, freq='D')
df = df.reindex(df.index.union(index))
# compute new values
rate = 1.1
df['val'][-N:] = last*np.multiply.accumulate(np.full(N, fill_value=rate))
yields
val
2013-12-22 0.000000
2013-12-23 1.000000
2013-12-24 2.000000
2013-12-25 3.000000
2013-12-26 4.000000
2013-12-27 5.000000
2013-12-28 6.000000
2013-12-29 7.000000
2013-12-30 8.000000
2013-12-31 9.000000
2014-01-01 9.900000
2014-01-02 10.890000
2014-01-03 11.979000
2014-01-04 13.176900
2014-01-05 14.494590
2014-01-06 15.944049
2014-01-07 17.538454
2014-01-08 19.292299
2014-01-09 21.221529
2014-01-10 23.343682
To increment by a constant value you could simply use np.arange:
step=2
df['val'][-N:] = np.arange(last+step, last+(N+1)*step, step)
or cumsum:
step=2
df['val'][-N:] = last + np.full(N, fill_value=step).cumsum()
Some linear recurrence relations can be expressed using scipy.signal.lfilter. See, for example,
Trying to vectorize iterative calculation with numpy and Recursive definitions in Pandas.
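As an illustration of the lfilter idea (a sketch under the same setup as above, i.e. a constant growth rate applied to the last value), the geometric growth is the recurrence y[n] = rate * y[n-1]:
from scipy.signal import lfilter
x = np.zeros(N)
x[0] = last * rate                           # seed the recurrence with the first new value
new_vals = lfilter([1.0], [1.0, -rate], x)   # implements y[n] = x[n] + rate * y[n-1]
# new_vals matches last * np.multiply.accumulate(np.full(N, fill_value=rate))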