calculating the covariance matrix fast in python with some minor customization - pandas

I have a pandas data frame and I'm trying to find the covariance of the percentage change of each column. For each pair of columns, I want rows with missing values to be dropped, and the percentage change to be calculated afterwards. That is, I want something like this:
import pandas as pd
import numpy as np
# create dataframe example
N_ROWS, N_COLS = 249, 3535
df = pd.DataFrame(np.random.random((N_ROWS, N_COLS)))
df.iloc[np.random.choice(N_ROWS, N_COLS), np.random.choice(10, 50)] = np.nan
cov_df = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        cov = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().cov()
        cov_df.loc[col_i, col_j] = cov.iloc[0, 1]
The thing is, this is super slow. The code below gives me results that are similar to (but not exactly) what I want, and it runs quite fast:
df.dropna(how='any', axis=0).pct_change().cov()
I am not sure why the second one runs so much faster. I want to speed up the first snippet, but I can't figure out how.
I have tried using combinations from itertools to avoid repeating the calculation for (col_i, col_j) and (col_j, col_i), and using map from multiprocessing to do the computations in parallel, but it still hasn't finished running after 90+ minutes.

Somehow the following runs fast enough, although I am not sure why:
from scipy.stats import pearsonr

x = df.to_numpy()  # work on the underlying NumPy array rather than the DataFrame
corr = np.zeros((x.shape[1], x.shape[1]))
for i in range(x.shape[1]):
    for j in range(i + 1, x.shape[1]):
        y = x[:, [i, j]]
        y = y[~np.isnan(y).any(axis=1)]      # drop rows with missing values for this pair
        y = np.diff(y, axis=0) / y[:-1, :]   # percentage change
        if len(y) < 2:
            corr[i, j] = np.nan
            continue
        corr[i, j] = pearsonr(y[:, 0], y[:, 1])[0]
corr = corr + corr.T
np.fill_diagonal(corr, 1)
This finishes in under 8 minutes, which is fast enough for my use case.
On the other hand, the version below has been running for 30 minutes and still isn't done:
corr = pd.DataFrame(index=df.columns, columns=df.columns)
for col_i in df:
    for col_j in df:
        corr_ij = df[[col_i, col_j]].dropna(how='any', axis=0).pct_change().corr().iloc[0, 1]
        corr.loc[col_i, col_j] = corr_ij
I don't know why that is, but anyway the first approach is a good enough solution for me for now.
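Note that the loop above fills a correlation matrix, while the original question asked for covariances. Below is a minimal, hedged sketch of the same NumPy-level loop adapted to covariances (using np.cov on the two percentage-change columns); it assumes df from the example above and is untimed, so treat it as an illustration rather than a benchmarked solution.
import numpy as np

x = df.to_numpy()
cov = np.full((x.shape[1], x.shape[1]), np.nan)
for i in range(x.shape[1]):
    for j in range(i, x.shape[1]):
        y = x[:, [i, j]]
        y = y[~np.isnan(y).any(axis=1)]      # pairwise-complete rows
        y = np.diff(y, axis=0) / y[:-1, :]   # percentage change
        if len(y) < 2:
            continue                         # leave this pair as NaN
        cov[i, j] = cov[j, i] = np.cov(y[:, 0], y[:, 1])[0, 1]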

Related

How do you speed up a score calculation based on two rows in a Pandas Dataframe?

TLDR: How can one adjust the for-loop for a faster execution time:
import numpy as np
import pandas as pd
import time
np.random.seed(0)
# Given a DataFrame df and a row_index
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start = time.time()
target_row = df.loc[target_row_index]
result = []
# Method 1: Optimize this for-loop
for row in df.iterrows():
    """
    Logic of calculating the variables check and score:
    if the values for a specific column are 2 for both rows (row/target_row), it should add 1 to the score
    if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
    """
    check = row[1] + target_row  # row[1] takes 30 microseconds per call
    score = np.sum(check == 4) - np.sum(check == 3)  # np.sum takes 47 microseconds per call
    result.append(score)
print(time.time()-start)
# Goal: Calculate the list result as efficient as possible
# Method 2: Optimize Apply
def add(a, b):
    check = a + b
    return np.sum(check == 4) - np.sum(check == 3)
start = time.time()
q = df.apply(lambda row: add(row, target_row), axis=1)
print(time.time()-start)
So I have a dataframe with 30,000 rows and a target row in this dataframe with a given row index. Now I want to compare this row to all the other rows in the dataset by calculating a score. The score is calculated as follows:
if the values for a specific column are 2 for both rows, it should add 1 to the score
if for one of the rows the value is 1 and for the other 2 for a specific column, it should subtract 1 from the score.
The result is then the list of all the scores we just calculated.
As I need to execute this code quite often, I would like to optimize it for performance.
Any help is very much appreciated.
I already read Optimization when using Pandas; are there further resources you can recommend? Thanks.
If you're willing to convert your df to a NumPy array, NumPy has some really good vectorisation that helps. My code using NumPy is below:
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5
start_time = time.time()
# Converting stuff to NumPy arrays
target_row = df.loc[target_row_index].to_numpy()
np_arr = df.to_numpy()
# Calculations
np_arr += target_row
check = np.sum(np_arr == 4, axis=1) - np.sum(np_arr == 3, axis=1)
result = list(check)
end_time = time.time()
print(end_time - start_time)
Your complete code (on Google Colab for me) outputs a time of 14.875332832336426 s, while the NumPy code above outputs a time of 0.018691539764404297 s, and of course, the result list is the same in both cases.
Note that in general, if your calculations are purely numerical, NumPy will virtually always be better than Pandas and a for loop. Pandas really shines through with strings and when you need the column and row names, but for pure numbers, NumPy is the way to go due to vectorisation.
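As a small follow-up design note, here is a hedged sketch of the same idea that avoids the in-place np_arr += target_row (which mutates the converted array; fine here, but surprising if the array is reused). Plain broadcasting gives the same scores without modifying anything; the setup values are copied from the question.
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 3, size=(30000, 50)))
target_row_index = 5

arr = df.to_numpy()
check = arr + arr[target_row_index]   # broadcasting adds the target row to every row
result = (np.sum(check == 4, axis=1) - np.sum(check == 3, axis=1)).tolist()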

What is fastest way to compute quantile over grouped dataframe?

I am creating monthly diurnal plots from a pandas dataframe. I need to plot the mean, median, or any quantile. I am able to achieve it correctly, but with large data, the quantile computation is much slower than the mean or median computation. Is there any faster way to achieve this?
import pandas as pd
import numpy as np
import datetime as dt
date_range = pd.date_range(start=dt.datetime(2018,1,1,00,00), end=dt.datetime(2018,12,31,23,59), freq='1min')
N = len(date_range)
df = pd.DataFrame({'Test': np.random.rand(N)}, index=date_range)
df['Time'] = df.index.time
df['Month'] = df.index.month
time_mean_median = dt.datetime(2019, 1, 1, 0, 0, 0)
time_quantiles = dt.datetime(2019, 1, 1, 0, 0, 0)
for i in range(12):
    df_month = df[['Test', 'Time']].loc[df['Month'] == i + 1]
    start_time = dt.datetime.now()
    df1_group = df[['Test', 'Time']].groupby('Time').agg([np.mean, np.median])
    time_mean_median += dt.datetime.now() - start_time
    quantiles = [0.23, 0.72]
    start_time = dt.datetime.now()
    df2_group = df[['Test', 'Time']].groupby('Time').quantile(q=quantiles).unstack()
    time_quantiles += dt.datetime.now() - start_time
print('Mean/median computation time {}'.format(time_mean_median.time()))
print('Quantile computation time {}'.format(time_quantiles.time()))
In this example I get a mean/median total computation time of around 0.7 seconds, compared to almost 12 seconds for the quantile computation.
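No answer is recorded here, but one possible speed-up, as a hedged, unbenchmarked sketch: because the index is a complete, sorted 1-minute grid, every time of day occurs exactly once per day, so one month's values can be reshaped to (n_days, 1440) and np.quantile applied along axis 0 in a single vectorised call. Both np.quantile and the groupby quantile use linear interpolation by default, so the results should agree.
import numpy as np

# operates on one month's subset, e.g. df_month from the loop above;
# assumes a gap-free, sorted 1-minute DatetimeIndex
values = df_month['Test'].to_numpy()
n_days = values.size // 1440
per_time = values.reshape(n_days, 1440)          # rows: days, columns: minutes of the day
q = np.quantile(per_time, [0.23, 0.72], axis=0)  # shape (2, 1440)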

get less correlated variable names

I have a dataset (50 columns, 100 rows).
I also have 50 variable names, 0, 1, 2, ..., 49, for the 50 columns.
I have to find the less correlated variables, say with correlation < 0.7.
I tried as follows:
import os, glob, time, numpy as np, pandas as pd
data = np.random.randint(1,99,size=(100, 50))
dataframe = pd.DataFrame(data)
print (dataframe.shape)
codes = np.arange(50).astype(str)
dataframe.columns = codes
corr = dataframe.corr()
corr = corr.unstack().sort_values()
print (corr)
corr = corr.values
indices = np.where(corr < 0.7)
print (indices)
res = codes[indices[0]].tolist() + codes[indices[1]].tolist()
print (len(res))
res = list(set(res))
print (len(res))
The result is 50 (all variables!), which is unexpected.
How can I solve this problem?
As mentioned in the comments, your question is somewhat ambiguous. First, there is the possibility that no column pair is correlated. Second, the unstacking doesn't make sense, because you create an index array that you can't use directly on your 2D array. Third, which should be first, but I was blind to this: as #AmiTavory mentioned, there is no point in "correlating names".
The correlation procedure per se works, as you can see in the following example:
import numpy as np
import pandas as pd
A = np.arange(100).reshape(25, 4)
#random order in column 2, i.e. a low correlation to the first columns
np.random.shuffle(A[:,2])
#flip column 3 to create a negative correlation with the first columns
A[:,3] = np.flipud(A[:,3])
#column 1 is unchanged, therefore positively correlated to column 0
df = pd.DataFrame(A)
print(df)
#establish a correlation matrix
corr = df.corr()
#retrieve index of pairs below a certain value
#use only the upper triangle with np.triu to filter for symmetric solutions
#use np.abs to take also negative correlation into account
res = np.argwhere(np.triu(np.abs(corr.values) <0.7))
print(res)
Output:
[[0 2]
 [1 2]
 [2 3]]
As expected, column 2 is the only one that is not correlated to any other column, meaning that all the other columns are correlated with each other.
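If the goal is the variable names rather than positions, a small hedged follow-up (reusing df and res from the answer above) maps the index pairs back to the DataFrame's column labels:
low_corr_pairs = [(df.columns[i], df.columns[j]) for i, j in res]
print(low_corr_pairs)   # [(0, 2), (1, 2), (2, 3)] for the example above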

Calculate "v^T A v" for a matrix of vectors v

I have a k*n matrix X, and a k*k matrix A. For each column of X, I'd like to calculate the scalar
X[:, i].T.dot(A).dot(X[:, i])
(or, mathematically, Xi' * A * Xi).
Currently, I have a for loop:
out = np.empty((n,))
for i in xrange(n):
    out[i] = X[:, i].T.dot(A).dot(X[:, i])
but since n is large, I'd like to do this faster if possible (i.e. using some NumPy functions instead of a loop).
This seems to do it nicely:
(X.T.dot(A)*X.T).sum(axis=1)
Edit: This is a little faster. np.einsum('...i,...i->...', X.T.dot(A), X.T). Both work better if X and A are Fortran contiguous.
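As a quick, hedged sanity check (small random inputs, not a benchmark), all three formulations should agree up to floating-point error:
import numpy as np

k, n = 4, 6
X = np.random.random((k, n))
A = np.random.random((k, k))

loop = np.array([X[:, i].T.dot(A).dot(X[:, i]) for i in range(n)])
fast = (X.T.dot(A) * X.T).sum(axis=1)
ein = np.einsum('...i,...i->...', X.T.dot(A), X.T)

print(np.allclose(loop, fast), np.allclose(loop, ein))  # True True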
You can use the numpy.einsum:
np.einsum('ji,jk,ki->i',x,a,x)
This will get the same result. Let's see if it is much faster:
Looks like dot is still the fastest option, particularly because it uses threaded BLAS, as opposed to einsum which runs on one core.
import perfplot
import numpy as np

def setup(n):
    k = n
    x = np.random.random((k, n))
    A = np.random.random((k, k))
    return x, A

def loop(data):
    x, A = data
    n = x.shape[1]
    out = np.empty(n)
    for i in range(n):
        out[i] = x[:, i].T.dot(A).dot(x[:, i])
    return out

def einsum(data):
    x, A = data
    return np.einsum('ji,jk,ki->i', x, A, x)

def dot(data):
    x, A = data
    return (x.T.dot(A)*x.T).sum(axis=1)

perfplot.show(
    setup=setup,
    kernels=[loop, einsum, dot],
    n_range=[2**k for k in range(10)],
    logx=True,
    logy=True,
    xlabel='n, k'
)
You can't do it faster unless you parallelize the whole thing: One thread per column. You'll still use loops - you can't get away from that.
Map reduce is a nice way to look at this problem: map step multiples, reduce step sums.
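For what it's worth, that map/reduce view translates directly into NumPy (a hedged sketch, equivalent to the dot answer above): the "map" step is the element-wise multiply X * (A @ X), and the "reduce" step is the sum over rows.
import numpy as np

k, n = 4, 6
X = np.random.random((k, n))
A = np.random.random((k, k))

out = (X * (A @ X)).sum(axis=0)   # out[i] == X[:, i] @ A @ X[:, i]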

inverse of FFT not the same as original function

I don't understand why ifft(fft(myFunction)) is not the same as my function. It seems to have the same shape but is a factor of 2 out (ignoring the constant y-offset). All the documentation I can see says there is some normalisation that fft doesn't do, but that ifft should take care of that. Here's some example code below - you can see where I've bodged the factor of 2 to give me the right answer. Thanks for any help - it's driving me nuts.
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
    # get FFT
    myfft = fftp.fft(y, n)
    # kill higher freqs above wavenumber wn
    myfft[wn:] = 0
    # make new series
    y2 = fftp.ifft(myfft).real
    # find constant y offset
    myfft[1:] = 0
    c = fftp.ifft(myfft)[0]
    # remove c, apply factor of 2 and re-apply c
    y2 = (y2 - c)*2 + c
    plt.figure(num=None)
    plt.plot(x, y, x, y2)
    plt.show()
if __name__ == '__main__':
    x = np.array([float(i) for i in range(0, 360)])
    y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
    fourier_series(x, y, 3, 360)
You're removing half the spectrum when you do myfft[wn:] = 0. The negative frequencies are those in the top half of the array and are required.
You have a second fudge to get your results which is taking the real part to find y2: y2 = fftp.ifft(myfft).real (fftp.ifft(myfft) has a non-negligible imaginary part due to the asymmetry in the spectrum).
Fix it with myfft[wn:-wn] = 0 instead of myfft[wn:] = 0, and remove the fudges. So the fixed code looks something like:
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
    # get FFT
    myfft = fftp.fft(y, n)
    # kill higher freqs above wavenumber wn
    myfft[wn:-wn] = 0
    # make new series
    y2 = fftp.ifft(myfft)
    plt.figure(num=None)
    plt.plot(x, y, x, y2)
    plt.show()
if __name__ == '__main__':
    x = np.array([float(i) for i in range(0, 360)])
    y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
    fourier_series(x, y, 3, 360)
It's really worth paying attention to the interim arrays that you are creating when trying to do signal processing. Invariably, there are clues as to what is going wrong that should direct you to the problem. In this case, taking the real part masked the problem and made your task more difficult.
Just to add another quick point: sometimes taking the real part of the resultant array is exactly the correct thing to do. It's often the case that you end up with an imaginary part to the signal output which is just down to numerical errors in the input to the inverse FFT. Typically this manifests itself as very small imaginary values, so taking the real part gives essentially the same array.
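Here is a minimal sketch (assuming the same x, y, and wn=3 as in the question) illustrating both points: after zeroing the spectrum symmetrically, the leftover imaginary part is only floating-point noise, and the real part reproduces the original signal with no factor-of-2 fudge.
import numpy as np
import scipy.fftpack as fftp

x = np.arange(360, dtype=float)
y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5

spec = fftp.fft(y)
spec[3:-3] = 0                    # keep DC and the first two harmonics on both sides
y2 = fftp.ifft(spec)

print(np.max(np.abs(y2.imag)))    # tiny: numerical noise only
print(np.allclose(y2.real, y))    # True: the truncated series reproduces y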
You are killing the negative frequencies between 0 and -wn.
I think what you mean to do is to set myfft to 0 for all frequencies outside [-wn, wn].
Change the following line:
myfft[wn:] = 0
to:
myfft[wn:-wn] = 0
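A hedged illustration of why the slice matters, using scipy.fftpack.fftfreq on a small example: the second half of the FFT output holds the negative-frequency bins, so myfft[wn:] also wipes out the bins -1, ..., -(wn-1) that mirror the frequencies you want to keep, while myfft[wn:-wn] leaves them intact.
import scipy.fftpack as fftp

print(fftp.fftfreq(8, d=0.125))   # [ 0.  1.  2.  3. -4. -3. -2. -1.]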