Pandas rolling apply multiplication - pandas

I would have thought this would be a basic application of pd.DataFrame().rolling() or pd.Series().rolling(), but it appears that the pandas rolling function cannot handle scalar multiplication being applied to a rolling window; I am hoping I am wrong and someone can spot the error.
I am trying to take a rolling window of a series (or dataframe) and multiply each row of that series/dataframe by a series/dataframe of weights (these weights have been precomputed). The code I thought should work is:
data.rolling(5).apply( lambda x: x*weights )
with
data = pd.Series( np.random.randint(1,101,2000) )
weights = pd.Series([ 0.10650, 0.1405310, 0.1854318, 0.2446788, 0.3228556 ])
I thought data.rolling(5).apply( lambda x: x*weights ) would produce a new rolling series, but the following error is returned every time: "TypeError: cannot convert the series to <class 'float'>".
I should note that the only reason I am trying to multiply by the weights is to apply a corr/cov/mean statistic on the new rolling series/dataframe afterward...something like
rolling_weighted_corr = data.rolling(5).apply( lambda x: x*weights ).corr()
Does anyone know how to multiply (scalar) a series with a rolling series to produce a new rolling series?

Scipy can do this with signal.convolve. For example, using the mode="same" flag will return an array of the same size as data where a centered window around each row is multiplied by weights and summed (thus, np.sum(weights) is used to normalize). Example:
from scipy import signal
data_smoothed = signal.convolve(data, weights, mode="same")/np.sum(weights)
np.var(data) # 836.0666897499997
np.var(data_smoothed) # 185.52213418878418
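Note that rolling(...).apply(func) expects func to return a single scalar per window, which is why returning a Series of products raises the TypeError above. If the end goal is a weighted rolling statistic rather than the intermediate products themselves, a minimal sketch along these lines (reusing data and weights from the question) computes a weighted rolling mean directly:
import numpy as np
import pandas as pd

data = pd.Series(np.random.randint(1, 101, 2000))
weights = pd.Series([0.10650, 0.1405310, 0.1854318, 0.2446788, 0.3228556])
w = weights.to_numpy() / weights.sum()  # normalise the weights

# each window is reduced to a single scalar, so apply() accepts it
weighted_rolling_mean = data.rolling(5).apply(lambda x: np.dot(x, w), raw=True)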

Related

Is there a Rolling implementation of PCA in python?

For time-series analysis, it's useful to have rolling PCA functions to analyse how the dynamics of the time-series change over time while avoiding look-ahead bias.
We may want to answer the question: 'how many principal components are needed to keep 90% of the variance?'. The number of principal components needed to explain 90% of the variance may change over time, depending on the dynamics of the time-series.
In addition, we may want to reduce the number of components p in a given dataset to k < p on a rolling basis to more easily visualise the data.
While scikit has a PCA module, it does not support rolling calculations. Similarly with numpy SVD. We could use these packages in a manual for loop, but for large arrays (>10,000 rows) it would become very slow.
Is there a fast rolling implementation of PCA in python to address some of the questions above?
While I didn't manage to find a rolling implementation of PCA, it is a relatively straightforward matter to use the packages and tools mentioned in the question to code a manual rolling PCA function. In addition, we will use numba to gain a small speed-up, as it supports numpy.linalg.svd and numpy.linalg.eig.
The code in this answer is inspired by the excellent explanations of PCA here and here
import numpy as np
from numpy.linalg import eig
from numba import njit
import numpy.typing as npt


@njit
def rolling_pca(
        arr: npt.NDArray[np.float64],
        n_components: int,
        window: int,
        min_periods: int
) -> npt.NDArray[np.float64]:
    """Perform PCA on the covariance matrix of arr.

    Return the lower dimensional array.

    Data is assumed to have non-zero mean, so will be demeaned
    in the process.

    Args:
        arr: Input data. Shape (n_samples, n_variables).
        n_components: Number of components to reduce data matrix to.
            Must be less than arr.shape[1].
        window: Sliding window size.
        min_periods: Minimum number of observations required to perform calculation.

    Returns:
        Reduced data matrix. Shape (n_samples, n_components)
    """
    # create a copy to ensure we don't change data in place
    arr_copy = arr.copy()
    n = arr_copy.shape[0]
    # create an empty array which will be populated with the output
    reduced_out = np.full((n, n_components), np.nan)
    # iterate over each row (timestamp in a timeseries)
    for i in range(min_periods, n + 1):
        if i < window:
            lookback = i
        else:
            lookback = window
        start_idx = i - lookback
        curr_arr = arr_copy[start_idx: i, :]
        # demean returns
        curr_arr = curr_arr - (np.sum(curr_arr, axis=0) / lookback)
        # calculate the covariance matrix
        cov = (curr_arr.T @ curr_arr) / (lookback - 1)
        # get the eigenvectors
        # sort eigvals in descending order to keep the largest components
        evals, evecs = eig(cov)
        idx = np.argsort(evals)[::-1][:n_components]
        evecs = evecs[:, idx]
        # multiply the top eigenvectors by the current array to get a reduced matrix
        reduced = (evecs.T @ curr_arr.T).T
        reduced_out[start_idx: i, :] = reduced
    return reduced_out
After profiling the code, the two slowest parts are, as expected, the calls to eig() and the matrix multiplication curr_arr.T @ curr_arr. As the array curr_arr is limited by the window size, a pure numpy (no numba) implementation of matrix multiplication is faster than using numba. This is because the arrays used in the matrix multiplication are small and are not contiguous (see this post for more details). I didn't get around to resolving this issue, but if anyone has any suggestions, it would speed up this function quite a bit more.
I've compared the average timings between 3 implementations to see the effect of the speedup that numba offers. The 3 implementations are:
Manual for loop using numba, exactly as above
Manual for loop without numba, but otherwise same as the code above
Manual for loop using Sklearn instead of numpy eig, no numba (as numba does not support Sklearn)
Note that the following parameters are fixed so we get as fair a comparison as possible between implementations
Window size = 120
Minimum number of periods = 22
Input data number of variables = 20
Number of components to reduce to via PCA = 10
Number of iterations to time function over so as to get an average timing per implementation = 10
Only the row size (number of samples) is allowed to vary so we can visualise how execution time varies with array length.
We can see that for large arrays (100k rows) the time drops from about 14.19s using Sklearn to 5.36s using numba, roughly a 2.6x speedup.
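A minimal usage sketch of rolling_pca (with hypothetical random data, and the window, minimum-period and component counts from the comparison above):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # hypothetical input: 1000 samples, 20 variables

# reduce to 10 components over a 120-row rolling window
reduced = rolling_pca(X, n_components=10, window=120, min_periods=22)
print(reduced.shape)  # (1000, 10)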
PCA Reconstruction
We implement some code similar to what we used above to reconstruct the original data matrix, using only the top principal components. Namely, we use SVD to decompose the matrix X into 3 matrices, U, S, and V^T. With these matrices, we can calculate how much variance is kept cumulatively by the components, and only keep the top k components that explain a desired amount of variance.
import numpy as np
from numpy.linalg import svd
from numba import njit
import numpy.typing as npt


@njit
def __get_num_top_components(
        singular_values: npt.NDArray[np.float64],
        threshold: float,
        n: int,
) -> int:
    """Get the number of top eigen-components required by threshold.

    Args:
        singular_values: Singular values from SVD.
        threshold: Minimum amount of explained variance to be kept.
        n: Number of samples in data matrix.

    Returns:
        Required number of components to keep.
    """
    evals = singular_values ** 2 / (n - 1)
    evals = evals / np.sum(evals)
    cumsum_evals = np.cumsum(evals)
    top_k = np.argwhere(cumsum_evals > threshold).min()
    return top_k


@njit
def rolling_pca_reconstruction(
        arr: npt.NDArray[np.float64],
        threshold: float,
        window: int,
        min_periods: int
) -> npt.NDArray[np.float64]:
    """Perform PCA on arr and return reconstructed matrix.

    This method follows the logic succinctly outlined here:
    https://stats.stackexchange.com/a/134283/178320

    Args:
        arr: Input data. Shape (n_samples, n_variables).
        threshold: Minimum amount of explained variance to be kept.
            Must be a number in (0., 1.).
        window: Sliding window size.
        min_periods: Minimum number of observations required to perform calculation.

    Returns:
        Reconstructed data matrix. Shape (n_samples, n_variables)
    """
    arr_copy = arr.copy()
    n = arr_copy.shape[0]
    p = arr_copy.shape[1]
    recon_out = np.full((n, p), np.nan)
    for i in range(min_periods, n + 1):
        if i < window:
            lookback = i
        else:
            lookback = window
        start_idx = i - lookback
        curr_arr = arr_copy[start_idx: i, :]
        # demean data
        curr_arr = curr_arr - (np.sum(curr_arr, axis=0) / lookback)
        # perform SVD on data, no need for full matrices, this is faster
        u, s, vh = svd(curr_arr, full_matrices=False)
        # calculate the number of components that explains threshold variance
        top_k = __get_num_top_components(
            singular_values=s,
            threshold=threshold,
            n=lookback,
        )
        # reconstruct the data matrix using the top_k components
        # (the rows of vh are the principal directions, hence vh[:top_k, :])
        tmp_recon = u[:, :top_k] @ np.diag(s)[:top_k, :top_k] @ vh[:top_k, :]
        recon_out[start_idx: i, :] = tmp_recon
    return recon_out
The output of rolling_pca_reconstruction() is the reconstructed data, of the same dimensions as the input data arr. One useful modification that could be made to this code is to record top_k at each iteration, to understand how many components are needed to explain threshold variance over time.
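A minimal usage sketch of rolling_pca_reconstruction (again with hypothetical random data; threshold is the fraction of variance to keep):
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # hypothetical input: 1000 samples, 20 variables

# reconstruct each window keeping enough components for 90% of the variance
recon = rolling_pca_reconstruction(X, threshold=0.9, window=120, min_periods=22)
print(recon.shape)  # (1000, 20), same shape as the input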

How to average data for a variable over a number of timesteps

I was wondering if anyone could shed some light on how I can average this data:
I have a .nc file with data (dimensions: 2029, 64, 32) which relate to time, latitude and longitude. Using these commands I can plot individual timesteps:
timestep = data.variables['precip'][0]
plt.imshow(timestep)
plt.colorbar()
plt.show()
Giving a graph in this format for the 0th timestep:
I was wondering if there was any way to average this first dimension (the snapshots in time).
If you are looking to take a mean over all times, try using np.mean with the axis keyword to specify which axis you want to average over.
time_averaged = np.mean(data.variables['precip'], axis=0)
If you have NaN values then np.mean will give NaN for that lon/lat point. If you'd rather ignore them, use np.nanmean.
If you want to use specific times only, e.g. the first 1000 time steps, then you could do
time_averaged = np.mean(data.variables['precip'][:1000,:,:], axis=0)
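Putting it together, a minimal sketch assuming the file is opened with the netCDF4 package (the filename here is hypothetical):
import numpy as np
from netCDF4 import Dataset
import matplotlib.pyplot as plt

data = Dataset('precip_data.nc')            # hypothetical filename
precip = data.variables['precip'][:]        # shape (2029, 64, 32): time, lat, lon

time_averaged = np.nanmean(precip, axis=0)  # average over the time axis, ignoring NaNs

plt.imshow(time_averaged)
plt.colorbar()
plt.show()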
I think if you're using pandas and numpy, this may help you. Look here for more details:
import pandas as pd
import numpy as np
data = np.array([10,5,8,9,15,22,26,11,15,16,18,7])
d = pd.Series(data)
print(d.rolling(4).mean())

Can I extract or construct as a Pandas dataframe the table with coefficient values etc. provided by the summary() method in statsmodels?

I have run an OLS model in statsmodels and I would like to have the table in the summary as a Pandas dataframe.
This is what I mean:
I would like the table within the red frame to be constructed / extracted and become a Pandas DataFrame.
My code up to that point was straightforward:
from statsmodels.regression.linear_model import OLS
mod = OLS(endog = coded_design_poly_select.response.values, exog = coded_design_poly_select.iloc[:, :-1].values)
fitted_model = mod.fit()
fitted_model.summary()
What would you suggest?
The fitted_model is in fact a RegressionResults object that stores all the regression results and you can access them via the corresponding methods/attributes.
For what you asked for, I believe the following code would work
import pandas as pd

ci = fitted_model.conf_int()  # columns 0 and 1 hold the lower/upper 95% bounds (a DataFrame when the model was built from pandas inputs)
data = {'coef': fitted_model.params,
        'std err': fitted_model.bse,
        't': fitted_model.tvalues,
        'P>|t|': fitted_model.pvalues,
        '[0.025': ci[0],
        '0.975]': ci[1]}
pd.DataFrame(data).round(3)
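As an aside, depending on your statsmodels version, summary2() already exposes the coefficient table as a DataFrame, so a shorter sketch (assuming summary2() is available) would be:
coef_table = fitted_model.summary2().tables[1]  # coefficient table as a pandas DataFrame
print(coef_table)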

Huge sparse dataframe to scipy sparse matrix without dense transform

I have data with more than 1 million rows and 30 columns; one of the columns is user_id (more than 1500 different users).
I want to one-hot-encode this column and use the data in ML algorithms (xgboost, FFM, scikit). But due to the huge number of rows and unique user values the matrix will be ~1 million x 1500, so I need to do this in a sparse format (otherwise the data would kill all my RAM).
For me a convenient way to work with the data is through a pandas DataFrame, which now also supports a sparse format:
df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
This works pretty fast and has a small size in RAM. But for working with scikit algos and xgboost it's necessary to transform the dataframe to a sparse matrix.
Is there any way to do this other than iterating through the columns and hstacking them into one scipy sparse matrix?
I tried df.as_matrix() and df.values, but both of these first transform the data to dense, which raises a MemoryError :(
P.S.
The same goes for getting a DMatrix for xgboost
UPDATE:
So I came up with the following solution (I'd be thankful for optimisation suggestions):
import pandas as pd
from scipy import sparse

def sparse_df_to_sparse_matrix(sparse_df):
    index_list = sparse_df.index.values.tolist()
    matrix_columns = []
    sparse_matrix = None
    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        # to_coo() needs a MultiIndex of (row label, column label)
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        curr_sps_column, rows, cols = sps_series.to_coo()
        if sparse_matrix is not None:
            sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
        else:
            sparse_matrix = curr_sps_column
        matrix_columns.extend(cols)
    return sparse_matrix, index_list, matrix_columns
And the following code allows one to get a sparse dataframe:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)
I have created a sparse matrix of 1.1 million rows x 1150 columns. But during creation it still uses a significant amount of RAM (~10Gb, on the edge with my 12Gb).
I don't know why, because the resulting sparse matrix uses only 300 Mb (after loading from HDD). Any ideas?
You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()
This method, instead of taking a DataFrame (rows/columns), takes a Series with rows and columns in a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So, you need to use the .to_sparse() method before calling .to_coo().
The Series returned by .stack(), even though it's not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
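For the xgboost part of the question, the resulting COO matrix can then be converted to CSR and passed to DMatrix, roughly like this (a sketch; DMatrix accepts scipy CSR matrices):
import xgboost as xgb

# one_hot_df is now a scipy coo_matrix (from the snippet above)
dtrain = xgb.DMatrix(one_hot_df.tocsr())  # labels can be passed via the label= argument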
Does my answer from a few months back help?
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
It was accepted but I didn't get any further feedback.
I'm familiar with the scipy sparse formats and their inputs, but don't know much about pandas sparse.

Exponential decay curve fitting in numpy and scipy

I'm having a bit of trouble with fitting a curve to some data, but can't work out where I am going wrong.
In the past I have done this with numpy.linalg.lstsq for exponential functions and scipy.optimize.curve_fit for sigmoid functions. This time I wished to create a script that would let me specify various functions, determine parameters and test their fit against the data. While doing this I noticed that Scipy leastsq and Numpy lstsq seem to provide different answers for the same set of data and the same function. The function is simply y = e^(l*x) and is constrained such that y=1 at x=0.
Excel trend line agrees with the Numpy lstsq result, but as Scipy leastsq is able to take any function, it would be good to work out what the problem is.
import scipy.optimize as optimize
import numpy as np
import matplotlib.pyplot as plt
## Sampled data
x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763, 0.003630962, 0.001485394, 0.000495131])
# function
fp = lambda p, x: np.exp(p*x)
# error function
e = lambda p, x, y: (fp(p, x) - y)
# using scipy least squares
l1, s = optimize.leastsq(e, -0.004, args=(x,y))
print(l1)
# [-0.0132281]
# using numpy least squares
l2 = np.linalg.lstsq(np.vstack([x, np.zeros(len(x))]).T,np.log(y))[0][0]
print(l2)
# -0.00313461628963 (same answer as Excel trend line)
# smooth x for plotting
x_ = np.arange(0, x[-1], 0.2)
plt.figure()
plt.plot(x, y, 'rx', x_, fp(l1, x_), 'b-', x_, fp(l2, x_), 'g-')
plt.show()
Edit - additional information
The MWE above includes a small sample of the dataset. When fitting the actual data the scipy.optimize.curve_fit curve presents an R^2 of 0.82, while the numpy.linalg.lstsq curve, which is the same as that calculated by Excel, has an R^2 of 0.41.
You are minimizing different error functions.
When you use numpy.linalg.lstsq, the error function being minimized is
np.sum((np.log(y) - p * x)**2)
while scipy.optimize.leastsq minimizes the function
np.sum((y - np.exp(p * x))**2)
The first case requires a linear dependency between the dependent and independent variables (here, log(y) and x), but the solution is known analytically, while the second can handle any dependency but relies on an iterative method.
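To see this concretely, one can evaluate both criteria at the two fitted values from the question (l1 and l2); each solution comes out best under its own error function:
import numpy as np

x = np.array([0, 14, 37, 975, 2013, 2095, 2147])
y = np.array([1.0, 0.764317544, 0.647136491, 0.070803763, 0.003630962, 0.001485394, 0.000495131])

def log_space_error(p):
    return np.sum((np.log(y) - p * x) ** 2)      # what np.linalg.lstsq minimises

def original_space_error(p):
    return np.sum((y - np.exp(p * x)) ** 2)      # what optimize.leastsq minimises

for p in (-0.0132281, -0.00313461628963):
    print(p, log_space_error(p), original_space_error(p))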
On a separate note, I cannot test it right now, but when using numpy.linalg.lstsq you don't need to vstack a row of zeros; the following works as well:
l2 = np.linalg.lstsq(x[:, None], np.log(y))[0][0]
To expound a bit on Jaime's point, any non-linear transformation of the data will lead to a different error function and hence to different solutions. These will lead to different confidence intervals for the fitting parameters. So you have three possible criteria to use to make a decision: which error you want to minimize, which parameters you want more confidence in, and finally, if you are using the fit to predict some value, which method yields less error in the predicted value of interest. Playing around a bit analytically and in Excel suggests that different kinds of noise in the data (e.g. whether the noise scales the amplitude, affects the time constant, or is additive) lead to different choices of solution.
I'll also add that while this trick "works" for exponential decay to 0, it can't be used in the more general (and common) case of damped exponentials (rising or falling) to values that cannot be assumed to be 0.