Pandas: vectorize sliding time window aggregation

I have a big dataframe from which I need sliding time-window averages for a given set of query points. I tried df.rolling, but it wouldn't let me query arbitrary points. The following works, but seems inefficient and does not allow for vectorized usage:
import pandas as pd

df = pd.DataFrame({'B': range(5)},
                  index=[pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:02'),
                         pd.Timestamp('20130101 09:00:03'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:06')])

query = pd.date_range(df.index[0], df.index[-1], freq='s')
time_window = pd.Timedelta(seconds=2)

f = lambda t: df[(t - time_window < df.index) & (df.index <= t)]["B"].mean()

[f(t) for t in query]  # works but is slow
f(query)               # throws ValueError: length must match
Probably this can be done better ...
Edit: The real application has measurements that arrive at random intervals of 30 to 90 seconds. Sometimes there are periods of several days or weeks without data. The time_window is typically 15 minutes. The overall time horizon is 10 years.

You're skipping just a small step.
Your "query" is really a time series resampling operation. That is, in addition to calculating a rolling mean, you are also trying to smoothly resample the time series at a frequency of one second. You can do that using the asfreq method, applying it prior to the rolling operation:
import numpy as np

resample_rolling = df.asfreq('1s').rolling(pd.Timedelta(seconds=2)).mean()
print(np.array([f(t) for t in query]))
print(resample_rolling.to_numpy()[:, 0])
Output:
[0. 0. 1. 1.5 2. 3. 3.5]
[0. 0. 1. 1.5 2. 3. 3.5]
Note that by default, the asfreq method fills in missing values with NaN.
>>> df.asfreq(pd.Timedelta(seconds=1))
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 NaN
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 NaN
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 4.0
The rolling operation then ignores those values. If instead you want to fill the values with something other than nans, you have two options. You can supply a fill_value:
>>> df.asfreq('1s', fill_value=0.0)
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 0.0
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 4.0
Or you can specify a method, such as backfill, which uses the next value in the series:
>>> df.asfreq('1s', method='backfill')
B
2013-01-01 09:00:00 0
2013-01-01 09:00:01 1
2013-01-01 09:00:02 1
2013-01-01 09:00:03 2
2013-01-01 09:00:04 3
2013-01-01 09:00:05 3
2013-01-01 09:00:06 4
The resulting rolling mean is then different, of course:
>>> df.asfreq('1s', method='backfill').rolling('1s').mean()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:01 1.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:04 3.0
2013-01-01 09:00:05 3.0
2013-01-01 09:00:06 4.0
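For the real data described in the question's edit (measurements every 30 to 90 seconds, a 15-minute window, gaps of days or weeks, and a 10-year horizon), a one-second asfreq grid would be very large, and a coarser grid would drop measurements that fall between grid points. A hedged variant of the same idea is to reindex onto the union of the data index and the query points, roll over that, and then select the query rows:
# sketch: evaluate the rolling mean only at the data times plus the query times
union_index = df.index.union(query)
rolled = df.reindex(union_index).rolling(time_window).mean()
result = rolled.loc[query, 'B']  # one value per query point; NaN where the window holds no data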

After some research I came up with the following solution with two rolling windows, one for points entering the window and one for points leaving it:
import pandas as pd
import numpy as np

df = pd.DataFrame({'B': range(5)},
                  index=[pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:02'),
                         pd.Timestamp('20130101 09:00:03'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:06')])

query = pd.date_range(df.index[0], df.index[-1], freq='s')
time_window = pd.Timedelta(seconds=2)
aggregates = ['mean']

### Preparation
# one data point for each point entering the window
df1 = df.rolling(window=time_window, closed='right').agg(aggregates)
# one data point for each point leaving the window - use the reversed df
df2 = df[::-1].rolling(window=time_window, closed='left').agg(aggregates)
df2.index += time_window
# Caution: for my real data, in the reversed rolling step I had to add
# a small Timedelta to the window for it to function properly.

# merge both together and remove duplicates
df_windowed = pd.concat([df1, df2])
df_windowed.sort_index(inplace=True)
df_windowed = df_windowed[~df_windowed.index.duplicated(keep='first')]

### the vectorized function
# Caution: get_indexer returns -1 for not-found values (below df.index.min()),
# which is interpreted as the last position. But the last value of df_windowed is always NaN.
f = lambda t: df_windowed.iloc[
    df_windowed.index.get_indexer(t, method='ffill') if isinstance(t, (pd.Index, pd.Series, np.ndarray)) else
    df_windowed.index.get_loc(t, method='ffill')
]["B"]["mean"].to_numpy()

f(query)
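As flagged in the caution comment, get_indexer(..., method='ffill') returns -1 for query times before the first window, and iloc then picks the last row; the solution relies on that last row always being NaN. An explicit guard (illustrative, not part of the original solution) makes this independent of that assumption:
import numpy as np

# vectorized lookup with an explicit mask for query points before the first data point
idx = df_windowed.index.get_indexer(query, method='ffill')
values = df_windowed["B"]["mean"].to_numpy()[idx]
values[idx == -1] = np.nan  # not-found positions become NaN instead of whatever the last row holds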

Related

Vectorize for loop and return x day high and low

Overview
For each row of a dataframe I want to calculate the x day high and low.
An x day high is higher than previous x days.
An x day low is lower than previous x days.
The for loop is explained in further detail in this post.
Update:
The answer by @mozway below completes in around 20 seconds on a dataset containing 18k rows. Can this be improved with NumPy broadcasting, etc.?
Example
2020-03-20 has an x_day_low value of 1 as it is lower than the previous day.
2020-03-27 has an x_day_high value of 8 as it is higher than the previous 8 days.
See the desired output and test code below, which is calculated with a for loop in the findHighLow function. How would I vectorize findHighLow, as the actual dataframe is somewhat larger?
Test data
import numpy as np
import pandas as pd

def genMockDataFrame(days, startPrice, colName, startDate, seed=None):
    periods = days * 24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0] = 0
    P = startPrice + np.cumsum(steps)
    P = [round(i, 4) for i in P]
    fxDF = pd.DataFrame({
        'ticker': np.repeat([colName], periods),
        'date': np.tile(pd.date_range(startDate, periods=periods, freq='H'), 1),
        'price': P})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF

# rows set to 15 for a minimal example, but the actual dataframe contains around 18000 rows.
number_of_rows = 15
df = genMockDataFrame(number_of_rows, 1.1904, 'tttmmm', '19/3/2020', seed=157)
def findHighLow(df):
    df['x_day_high'] = 0
    df['x_day_low'] = 0
    for n in reversed(range(len(df['High']))):
        for i in reversed(range(n)):
            if df['High'][n] > df['High'][i]:
                df['x_day_high'][n] = n - i
            else:
                break
    for n in reversed(range(len(df['Low']))):
        for i in reversed(range(n)):
            if df['Low'][n] < df['Low'][i]:
                df['x_day_low'][n] = n - i
            else:
                break
    return df

df = findHighLow(df)
Desired output should match this:
df[["High","Low","x_day_high","x_day_low"]]
High Low x_day_high x_day_low
date
2020-03-19 1.1937 1.1832 0 0
2020-03-20 1.1879 1.1769 0 1
2020-03-21 1.1767 1.1662 0 2
2020-03-22 1.1721 1.1611 0 3
2020-03-23 1.1819 1.1690 2 0
2020-03-24 1.1928 1.1807 4 0
2020-03-25 1.1939 1.1864 6 0
2020-03-26 1.2141 1.1964 7 0
2020-03-27 1.2144 1.2039 8 0
2020-03-28 1.2099 1.2018 0 1
2020-03-29 1.2033 1.1853 0 4
2020-03-30 1.1887 1.1806 0 6
2020-03-31 1.1972 1.1873 1 0
2020-04-01 1.1997 1.1914 2 0
2020-04-02 1.1924 1.1781 0 9
Here are two solutions. Both produce the desired output, as posted in the question.
The first solution uses Numba and completes in 0.5 seconds on my machine for 20k rows. If you can use Numba, this is the way to go. The second solution uses only Pandas/Numpy and completes in 1.5 seconds for 20k rows.
Numba
import numba

@numba.njit
def count_smaller(arr):
    current = arr[-1]
    count = 0
    for i in range(arr.shape[0] - 2, -1, -1):
        if arr[i] > current:
            break
        count += 1
    return count

@numba.njit
def count_greater(arr):
    current = arr[-1]
    count = 0
    for i in range(arr.shape[0] - 2, -1, -1):
        if arr[i] < current:
            break
        count += 1
    return count

df["x_day_high"] = df.High.expanding().apply(count_smaller, engine='numba', raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, engine='numba', raw=True)
Pandas/Numpy
def count_consecutive_true(bool_arr):
    return bool_arr[::-1].cumprod().sum()

def count_smaller(arr):
    return count_consecutive_true(arr <= arr[-1]) - 1

def count_greater(arr):
    return count_consecutive_true(arr >= arr[-1]) - 1

df["x_day_high"] = df.High.expanding().apply(count_smaller, raw=True)
df["x_day_low"] = df.Low.expanding().apply(count_greater, raw=True)
This last solution is similar to mozway's. However, it runs faster because it doesn't need to perform a join and uses NumPy as much as possible. It also looks arbitrarily far back.
You can use rolling to get the last N days, a comparison + cumprod on the reversed boolean array to keep only the last consecutive valid values, and sum to count them. Apply on each column using agg and join the output after adding a prefix.
# number of days
N = 8

df.join(df.rolling(f'{N+1}d', min_periods=1)
          .agg({'High': lambda s: s.le(s.iloc[-1])[::-1].cumprod().sum() - 1,
                'Low': lambda s: s.ge(s.iloc[-1])[::-1].cumprod().sum() - 1,
                })
          .add_prefix(f'{N}_days_')
        )
Output:
Open High Low Close 8_days_High 8_days_Low
date
2020-03-19 1.1904 1.1937 1.1832 1.1832 0.0 0.0
2020-03-20 1.1843 1.1879 1.1769 1.1772 0.0 1.0
2020-03-21 1.1755 1.1767 1.1662 1.1672 0.0 2.0
2020-03-22 1.1686 1.1721 1.1611 1.1721 0.0 3.0
2020-03-23 1.1732 1.1819 1.1690 1.1819 2.0 0.0
2020-03-24 1.1836 1.1928 1.1807 1.1922 4.0 0.0
2020-03-25 1.1939 1.1939 1.1864 1.1936 6.0 0.0
2020-03-26 1.1967 1.2141 1.1964 1.2114 7.0 0.0
2020-03-27 1.2118 1.2144 1.2039 1.2089 7.0 0.0
2020-03-28 1.2080 1.2099 1.2018 1.2041 0.0 1.0
2020-03-29 1.2033 1.2033 1.1853 1.1880 0.0 4.0
2020-03-30 1.1876 1.1887 1.1806 1.1879 0.0 6.0
2020-03-31 1.1921 1.1972 1.1873 1.1939 1.0 0.0
2020-04-01 1.1932 1.1997 1.1914 1.1914 2.0 0.0
2020-04-02 1.1902 1.1924 1.1781 1.1862 0.0 7.0

Rolling median of date-indexed data with duplicate dates

My date-indexed data can have multiple observations for a given date.
I want to get the rolling median of a value but am not getting the result that I am looking for:
df = pd.DataFrame({
    'date': ['2020-06-22', '2020-06-23', '2020-06-24', '2020-06-24', '2020-06-25', '2020-06-26'],
    'value': [2, 8, 5, 1, 3, 7]
})
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Attempt to get the 3-day rolling median of 'value':
df['value'].rolling('3D').median()
# This yields the following, i.e. one median value
# per **observation**
# (two values for 6/24 in this example):
date
2020-06-22 2.0
2020-06-23 5.0
2020-06-24 5.0
2020-06-24 3.5
2020-06-25 4.0
2020-06-26 4.0
Name: value, dtype: float64
# I was hoping to get one median value
# per **distinct date** in the index
# The median for 6/24, for example, would be computed
# from **all** observations on 6/22, 6/23 and 6/24(2 observations)
date
2020-06-22 NaN
2020-06-23 NaN
2020-06-24 3.5
2020-06-25 4.0
2020-06-26 4.0
Name: value, dtype: float64
How do I need to change my code?
As far as I can tell, your code produces the right answer for the second occurrence of 2020-06-24, as 3.5 is the median of the four numbers 2, 8, 5, 1. The first occurrence of 2020-06-24 only uses its own value and the ones from the two prior days. Presumably (and I am speculating here) the '3D' window looks at the rows preceding each element in the time series, not the ones following it.
So I think your code only needs a small modification to satisfy your requirement: if there are multiple rows with the same date, just keep the last one. We will do this below with groupby. You also want the first two values to be NaN rather than medians of shorter series; this can be achieved by passing min_periods=3 to the rolling function. Here is all the code; I put the median into its own column:
df['median'] = df['value'].rolling('3D', min_periods=3).median()
df.groupby(level=0, axis=0).last()
prints
value median
date
2020-06-22 2 NaN
2020-06-23 8 NaN
2020-06-24 1 3.5
2020-06-25 3 4.0
2020-06-26 7 4.0

Merging 2 or more data frames and transposing the result

I have several DFs derived from a Pandas binning process using the code below:
from datetime import timedelta

df2 = df.resample(rule=timedelta(milliseconds=250))['diffA'].mean().dropna()
df3 = df.resample(rule=timedelta(milliseconds=250))['diffB'].mean().dropna()
# .. etc
Every DF will have a column containing 'time' in Datetime format (example: 2019-11-22 13:18:00.000) and a second column containing a number (e.g. 0.06). Different DFs will have different 'time' bins. I am trying to concatenate all DFs into one, where certain elements of the resulting DF may contain NaN.
The Datetime format of the DFs gives an error when using:
method 1) df4=pd.merge(df2,df3,left_on='time',right_on='time')
method 2) pd.pivot_table(df2, values = 'diffA', index=['time'], columns = 'time').reset_index()
Once the DFs have been combined, I also want to transpose the resulting DF, where:
Rows are 'DiffA', 'DiffB', etc.
Columns are the corresponding time bins.
I have tried the transpose() method with individual DFs, just to try, but I get an error as my time index is in Datetime format.
Once that is in place, I am looking for a method to extract rows from the resulting transposed DF as individual data series.
Please advise how I can achieve the above with some guidance; I appreciate any feedback. Thank you so much for your help.
Data frames (two, for example):
time DiffA
2019-11-25 08:18:01.250 0.06
2019-11-25 08:18:01.500 0.05
2019-11-25 08:18:01.750 0.04
2019-11-25 08:18:02.000 0
2019-11-25 08:18:02.250 0.22
2019-11-25 08:18:02.500 0.06
time DiffB
2019-11-26 08:18:01.250 0.2
2019-11-27 08:18:01.500 0.05
2019-11-25 08:18:01.000 0.6
2019-11-25 08:18:02.000 0.01
2019-11-25 08:18:02.250 0.8
2019-11-25 08:18:02.500 0.5
The resulting merged DF should be as follows (text only):
time ( first row )
2019-11-25 08:18:01.000,
2019-11-25 08:18:01.250,
2019-11-25 08:18:01.500,
2019-11-25 08:18:01.750,
2019-11-25 08:18:02.000,
2019-11-25 08:18:02.250,
2019-11-25 08:18:02.500,
2019-11-26 08:18:01.250,
2019-11-27 08:18:01.500
(second row)
diffA nan 0.06 0.05 0.04 0 0.22 0.06 nan nan
(third row)
diffB 0.6 nan nan nan 0.01 0.8 0.5 0.2 0.05
Solution
The core logic: you need to use an outer join on the column 'time' to merge each of the sampled dataframes together. Finally, setting the index to the column 'time' and transposing completes the solution.
I will use the dummy data I created below to create a reproducible solution.
Note: I have used df as the final dataframe and df0 as the original dataframe. My df0 is your df.
column_names = ['A', 'B', 'C', 'D', 'E']  # assumed example names; not shown in the original answer

df = pd.DataFrame()
for i, column_name in zip(range(5), column_names):
    if i == 0:
        df = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
    else:
        df_other = df0.sample(n=10, random_state=i).rename(columns={'data': f'df{column_name}'})
        df = pd.merge(df, df_other, on='time', how='outer')

print(df.set_index('time').T)
Output:
Dummy Data
import pandas as pd
import numpy as np

# dummy data:
df0 = pd.DataFrame()
df0['time'] = pd.date_range(start='2020-02-01', periods=15, freq='D')
df0['data'] = np.random.randint(0, high=9, size=15)
print(df0)
Output:
time data
0 2020-02-01 6
1 2020-02-02 1
2 2020-02-03 7
3 2020-02-04 0
4 2020-02-05 8
5 2020-02-06 8
6 2020-02-07 1
7 2020-02-08 6
8 2020-02-09 2
9 2020-02-10 6
10 2020-02-11 8
11 2020-02-12 3
12 2020-02-13 0
13 2020-02-14 1
14 2020-02-15 0
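The question also asks how to extract rows of the transposed DF as individual data series; once the frame is transposed, plain label-based indexing does that. A short sketch (the row label 'dfA' is illustrative and depends on the column names used above):
df_t = df.set_index('time').T   # rows are the value columns, columns are the time bins
series_a = df_t.loc['dfA']      # one row as a pandas Series indexed by time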

Pandastic way of growing a dataframe

So, I have a year-indexed dataframe that I would like to extend beyond the end year (2013) by some logic, say, grow the last value by n percent for 10 years, but the logic could also be to just add a constant, or a slightly growing number. I will leave that to a function and just stuff the logic there.
I can't think of a neat vectorized way to do that for an arbitrary length of time and arbitrary logic, ending up with a longer dataframe with the extra increments added, and I would prefer not to loop it.
The particular calculation matters. In general you would have to compute the values in a loop. Some NumPy ufuncs (such as np.add, np.multiply, np.minimum, np.maximum) have an accumulate method, however, which may be useful depending on the calculation.
For example, to calculate values given a constant growth rate, you could use np.multiply.accumulate (or cumprod):
import numpy as np
import pandas as pd
N = 10
index = pd.date_range(end='2013-12-31', periods=N, freq='D')
df = pd.DataFrame({'val':np.arange(N)}, index=index)
last = df['val'][-1]
# val
# 2013-12-22 0
# 2013-12-23 1
# 2013-12-24 2
# 2013-12-25 3
# 2013-12-26 4
# 2013-12-27 5
# 2013-12-28 6
# 2013-12-29 7
# 2013-12-30 8
# 2013-12-31 9
# expand df
index = pd.date_range(start='2014-1-1', periods=N, freq='D')
df = df.reindex(df.index.union(index))
# compute new values
rate = 1.1
df['val'][-N:] = last*np.multiply.accumulate(np.full(N, fill_value=rate))
yields
val
2013-12-22 0.000000
2013-12-23 1.000000
2013-12-24 2.000000
2013-12-25 3.000000
2013-12-26 4.000000
2013-12-27 5.000000
2013-12-28 6.000000
2013-12-29 7.000000
2013-12-30 8.000000
2013-12-31 9.000000
2014-01-01 9.900000
2014-01-02 10.890000
2014-01-03 11.979000
2014-01-04 13.176900
2014-01-05 14.494590
2014-01-06 15.944049
2014-01-07 17.538454
2014-01-08 19.292299
2014-01-09 21.221529
2014-01-10 23.343682
To increment by a constant value you could simply use np.arange:
step=2
df['val'][-N:] = np.arange(last+step, last+(N+1)*step, step)
or cumsum:
step=2
df['val'][-N:] = last + np.full(N, fill_value=step).cumsum()
Some linear recurrence relations can be expressed using scipy.signal.lfilter. See, for example, Trying to vectorize iterative calculation with numpy and Recursive definitions in Pandas.
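As a small illustration of the lfilter idea (not from the linked posts), the constant-growth recurrence y[n] = rate * y[n-1] + x[n] can be evaluated as an IIR filter, reproducing the 9.9, 10.89, 11.979, ... series from the example above:
import numpy as np
from scipy.signal import lfilter

rate = 1.1
x = np.zeros(10)
x[0] = 9.0 * rate          # seed with last * rate so y matches the growth example (assumption)
# y[n] = x[n] + rate * y[n-1], computed without an explicit Python loop
y = lfilter([1.0], [1.0, -rate], x)
print(y)                   # [ 9.9  10.89  11.979 ... 23.343682...]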

pandas HDFStore select rows by datetime index

I'm sure this is probably very simple but I can't figure out how to slice a pandas HDFStore table by its datetime index to get a specific range of rows.
I have a table that looks like this:
mdstore = pd.HDFStore('store.h5')
histTable = '/ES_USD20120615_MIDPOINT30s'
print(mdstore[histTable])
open high low close volume WAP \
date
2011-12-04 23:00:00 1266.000 1266.000 1266.000 1266.000 -1 -1
2011-12-04 23:00:30 1266.000 1272.375 1240.625 1240.875 -1 -1
2011-12-04 23:01:00 1240.875 1242.250 1240.500 1242.125 -1 -1
...
[488000 rows x 7 columns]
For example I'd like to get the range from 2012-01-11 23:00:00 to 2012-01-12 22:30:00. If it were in a df I would just use datetimes to slice on the index, but I can't figure out how to do that directly from the store table so I don't have to load the whole thing into memory.
I tried mdstore.select(histTable, where='index>20120111') and that worked inasmuch as I got everything on the 11th and 12th, but I couldn't see how to add a time in.
Example is here
needs pandas >= 0.13.0
In [2]: df = DataFrame(np.random.randn(5),index=date_range('20130101 09:00:00',periods=5,freq='s'))
In [3]: df
Out[3]:
0
2013-01-01 09:00:00 -0.110577
2013-01-01 09:00:01 -0.420989
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
2013-01-01 09:00:04 -0.830469
[5 rows x 1 columns]
In [4]: df.to_hdf('test.h5','data',mode='w',format='table')
Specify it as a quoted string
In [8]: pd.read_hdf('test.h5','data',where='index>"20130101 09:00:01" & index<"20130101 09:00:04"')
Out[8]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]
You can also specify it directly as a Timestamp
In [10]: pd.read_hdf('test.h5','data',where='index>Timestamp("20130101 09:00:01") & index<Timestamp("20130101 09:00:04")')
Out[10]:
0
2013-01-01 09:00:02 0.656626
2013-01-01 09:00:03 -0.350615
[2 rows x 1 columns]