Calculating auto covariance in pandas - pandas

Following on from the answer provided by @pltrdy in this thread:
https://stackoverflow.com/a/27164416/14744492
How do you convert the pandas.Series.autocorr(), which calculates lag-N (default=1) autocorrelation on Series, into autocovariances?
Sadly, pandas.Series.autocov() is not implemented in pandas.

What .autocorr(k) calculates is the (Pearson) correlation coefficient for lag k. But we know that, for a series x, that coefficient for lag k is:
\rho_k = \frac{Cov(x_{t}, x_{t-k})}{Var(x)}
Then, to get the autocovariance, you multiply the autocorrelation by the variance:
def autocov_series(x, lag=1):
    # Series.autocorr takes only the lag; the series itself is the bound instance
    return x.autocorr(lag=lag) * x.var()
Note that Series.var uses ddof=1 by default, so the sum of squared deviations is divided by N - 1, where N == x.size (which gives an unbiased estimate of the population variance).
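A minimal sketch of an alternative, assuming a numeric Series without missing values. Series.autocorr(lag) is the Pearson correlation between the series and its shifted copy, so it normalizes by the standard deviations of the two overlapping sub-series rather than by the variance of the whole series. Taking the covariance against the shifted series directly sidesteps that mismatch. The name autocov_direct and the sample data are illustrative only.

```python
import pandas as pd

def autocov_direct(x: pd.Series, lag: int = 1) -> float:
    """Sample covariance (ddof=1) between x_t and x_{t-lag}."""
    # shift() introduces NaN at the start; Series.cov skips incomplete pairs
    return x.cov(x.shift(lag))

x = pd.Series([2.0, 4.0, 3.0, 7.0, 6.0, 9.0, 8.0, 12.0])
lag = 2
direct = autocov_direct(x, lag)

# Rescaling autocorr by the standard deviations of the two overlapping
# sub-series recovers the same number
rescaled = x.autocorr(lag=lag) * x.iloc[lag:].std() * x.iloc[:-lag].std()
```

For long series and small lags the result is typically close to autocorr(lag) * var(); the two can differ noticeably for short series or large lags.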

Related

Weird numpy matrix values

When I want to calculate the determinant of a matrix using np.linalg.det(mat1), or calculate the inverse, it gives strange output values. For example, it gives 1.11022302e-16 instead of 0.
I tried to round the number for the determinant, but I couldn't do the same for the matrix elements.
This is most likely floating-point round-off: the computation uses finite-precision numbers, so after the multiplications and divisions a result can be very close to zero without being exactly zero.
You can define a delta that determines whether the result is close enough, and then compute the absolute distance between the result and the expected value.
Maybe like this:
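import math
import numpy as np

def det_snapped(mat, delta=0.0001):
    res = np.linalg.det(mat)
    # snap to the integer just below or just above if it is within delta
    if abs(math.floor(res) - res) < delta:
        return math.floor(res)
    if abs(math.ceil(res) - res) < delta:
        return math.ceil(res)
    return res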
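The same idea can be applied to every element of a matrix at once. The following is a sketch only: snap_to_integers and the tolerance 1e-9 are illustrative choices, and it assumes the exact result is expected to contain integers.

```python
import numpy as np

def snap_to_integers(a, atol=1e-9):
    """Replace entries within atol of an integer by that integer."""
    a = np.asarray(a, dtype=float)
    nearest = np.round(a)
    # rtol=0 so that only the absolute tolerance decides
    return np.where(np.isclose(a, nearest, rtol=0.0, atol=atol), nearest, a)

values = np.array([1.11022302e-16, 0.5, 2.9999999999])
cleaned = snap_to_integers(values)

A = np.array([[2.0, 1.0], [1.0, 1.0]])
A_inv = snap_to_integers(np.linalg.inv(A))
```

When the goal is only to compare against an expected value, np.isclose or np.allclose on the raw result is usually enough and avoids modifying the data.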

How to calculate approximate fourier coefficients using np.trapz

I have a dataset which looks roughly as follows (and is sinusoidal in nature):
TW-240-run1.txt
Point Number Temperature
0 51.504781
1 51.487722
2 51.487722
3 51.828893
4 51.828893
5 51.436547
6 51.368312
7 51.726542
8 51.368312
9 51.317137
10 51.317137
11 51.283020
12 51.590073
.
.
.
9599 51.675366
I am tasked with finding the fundamental (first) Fourier coefficients, a_n and b_n, for this dataset by means of a numerical integration technique. In this case, I am simply using numpy.trapz, which implements the trapezium rule. The Fourier coefficients a_n and b_n can be calculated with the following formulae:
a_n = \frac{2}{\tau} \int_0^{\tau} T(t) \cos\left(\frac{2\pi n t}{\tau}\right) dt
b_n = \frac{2}{\tau} \int_0^{\tau} T(t) \sin\left(\frac{2\pi n t}{\tau}\right) dt
where tau (𝛕) is the time period of the sine function. For my case, 𝛕 = 240 seconds (referring to the point number 240 on the data sheet), and thus the bounds of integration are 0 to 240. T(t) in the above formulae is the data set and n = 1.
My current code for trying to calculate the fourier coefficients is as follows:
# Packages
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
# input data from datasheet; the loadtxt below takes in the data from t = 0 s to t = 240 s
x1, y1 = np.loadtxt(r'C:\Users\Sidharth\Documents\y2python\y2python\thermal_4min_a.txt', unpack=True, skiprows=3)
tau_4min = 240.0
def cosine(period, t, n):
    return np.cos((2*np.pi*n*t)/(period))  # defines the cos term for the a_n formula
def sine(period, t, n):  # defines the sin term for the b_n formula
    return np.sin((2*np.pi*n*t)/(period))
# evaluate the n = 1 terms at the sample times
cos_term_4min = cosine(tau_4min, x1, 1)
sin_term_4min = sine(tau_4min, x1, 1)
a_1_4min = (2/tau_4min)*np.trapz((y1*cos_term_4min), x1)  # implement a_n formula (trapezium rule for T(t)*cos)
print('a_1 is', a_1_4min)
b_1_4min = (2/tau_4min)*np.trapz((y1*sin_term_4min), x1)  # implement b_n formula (trapezium rule for T(t)*sin)
print('b_1 is', b_1_4min)
Essentially, this takes in the data, but only up to row index 241 (point number 240), and then multiplies it by the sine/cosine term from each of the above formulae. However, I realise this isn't calculating the Fourier coefficients properly.
My question is as follows:
Will my code work if I import the entire data set and find a way to set limits of integration for np.trapz (0 and 240 are supposed to be my limits of integration), instead of importing only the data points from 0 to 240, multiplying them by the cos or sine term, and then using np.trapz on that product, as I am currently doing?
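One way to express limits of integration with a sample-based rule is to load everything and select the window with a boolean mask before integrating. The sketch below assumes one sample per second (so the point number is the time in seconds) and uses synthetic data in place of the file; fourier_coefficients and t_start are illustrative names. It also assumes that NumPy 2.0 provides the trapezium rule as np.trapezoid and falls back to np.trapz on older versions.

```python
import numpy as np

# Use np.trapezoid where it exists, otherwise the older np.trapz
trapezoid = getattr(np, "trapezoid", None) or np.trapz

def fourier_coefficients(t, y, period, n=1, t_start=0.0):
    """Trapezium-rule estimate of a_n and b_n over [t_start, t_start + period]."""
    mask = (t >= t_start) & (t <= t_start + period)  # limits of integration
    t_win, y_win = t[mask], y[mask]
    phase = 2 * np.pi * n * t_win / period
    a_n = (2 / period) * trapezoid(y_win * np.cos(phase), t_win)
    b_n = (2 / period) * trapezoid(y_win * np.sin(phase), t_win)
    return a_n, b_n

# Synthetic stand-in for the data file (hypothetical values)
t = np.arange(9600, dtype=float)
y = 51.5 + 0.3 * np.cos(2 * np.pi * t / 240) + 0.2 * np.sin(2 * np.pi * t / 240)
a_1, b_1 = fourier_coefficients(t, y, period=240.0)
```

With this synthetic signal the estimates should come out close to the amplitudes 0.3 and 0.2 that were put in. Passing a different t_start moves the window to another period of the record.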

Fast algorithm for computing cofactor matrix

I wonder if there is a fast algorithm, say O(n^3), for computing the cofactor matrix (or adjugate matrix) of an N×N square matrix. Yes, one could compute its determinant and inverse separately and then multiply them together. But what if the matrix is non-invertible?
I am curious about the accepted answer here: Speed up python code for computing matrix cofactors
What is meant by "This probably means that also for non-invertible matrixes, there is some clever way to calculate the cofactor (i.e., not use the mathematical formula that you use above, but some other equivalent definition)."?
Factorize M = L x D x U, where L is lower triangular with ones on the main diagonal, U is upper triangular with ones on the main diagonal, and D is diagonal.
You can use back-substitution as with Cholesky factorization, which is similar. Then,
M^{ -1 } = U^{ -1 } x D^{ -1 } x L^{ -1 }
and the transpose of the cofactor matrix is:
Cof( M )^T = Det( U ) x Det( D ) x Det( L ) x M^{ -1 }.
If M is singular or nearly so, one element (or more) of D will be zero or nearly zero. Replace those elements with zero in the matrix product and 1 in the determinant, and use the above equation for the transpose cofactor matrix.
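As an illustration of the same idea with a different factorization, here is a sketch that uses the SVD rather than the L x D x U route described above. It assumes a real square matrix and relies on the identity adj(AB) = adj(B) adj(A) together with adj(Q) = det(Q) Q^T for orthogonal Q. The function name cofactor_matrix is illustrative.

```python
import numpy as np

def cofactor_matrix(M):
    """Cofactor matrix of a real square matrix via SVD; also works when M is singular."""
    M = np.asarray(M, dtype=float)
    U, s, Vt = np.linalg.svd(M)
    n = s.size
    # Adjugate of diag(s): entry i is the product of all singular values except s[i]
    g = np.array([np.prod(np.delete(s, i)) for i in range(n)])
    # det(U) and det(Vt) are each +1 or -1 because U and Vt are orthogonal
    sign = np.linalg.det(U) * np.linalg.det(Vt)
    # (U * g) scales the columns of U, i.e. U @ diag(g)
    return sign * (U * g) @ Vt

singular = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
cof_singular = cofactor_matrix(singular)

regular = np.array([[2.0, 1.0], [1.0, 3.0]])
cof_regular = cofactor_matrix(regular)
```

No division by a singular value occurs, so a zero singular value simply zeroes the corresponding products. The cost is dominated by the SVD, which is O(n^3).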

How do I bandpass-filter a signal using a Gaussian function in Python (Numpy/Scipy)

I have a time series (more specifically a correlation function). I want to bandpass-filter this signal using a Gaussian function H:
H(w) = e^(-alpha((w-wn)/wn)^2),
where wn is the central frequency in my bandpass filter and alpha is a certain constant value that I know.
I apply an (inverse) FFT to my H function:
H = np.e ** (-alfa * ((w - wn) / wn) ** 2)
H = np.fft.ifft(H)
# swap the two halves so the kernel is centered (integer division for Python 3)
HH = np.asarray([i1 for i1 in itertools.chain(H[len(H)//2:len(H)], H[0:len(H)//2])])
Then I use fftconvolve:
filtered = fftconvolve(data, HH.real, mode='same')
but the "filtered signal" that I see seems to pass frequencies centered at 2 times wn.
What is the correct way of doing this? Is there a restriction in the length of my filter with respect to the length of my time series?
Perhaps what you are looking for is the Gaussian filter from SciPy:
from scipy.ndimage import gaussian_filter
output = gaussian_filter(input, sigma)
where sigma is the standard deviation of the Gaussian kernel. See the SciPy documentation for more details: https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
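Another option is to apply H directly in the frequency domain instead of transforming it to a kernel and convolving. The sketch below assumes a real-valued, evenly sampled signal and that w and wn are angular frequencies in rad/s. The function name gaussian_bandpass, the sampling step dt and the test signal are all hypothetical.

```python
import numpy as np

def gaussian_bandpass(data, dt, wn, alpha):
    """Apply H(w) = exp(-alpha * ((w - wn) / wn)**2) in the frequency domain."""
    n = len(data)
    # Angular frequencies (rad/s) of the non-negative FFT bins
    w = 2 * np.pi * np.fft.rfftfreq(n, d=dt)
    H = np.exp(-alpha * ((w - wn) / wn) ** 2)
    # rfft/irfft cover the negative frequencies through conjugate symmetry
    return np.fft.irfft(np.fft.rfft(data) * H, n=n)

# Hypothetical test signal: a 5 Hz component to keep and a 50 Hz one to reject
dt = 0.001
t = np.arange(2000) * dt
low = np.sin(2 * np.pi * 5 * t)
high = 0.5 * np.sin(2 * np.pi * 50 * t)
filtered = gaussian_bandpass(low + high, dt, wn=2 * np.pi * 5, alpha=50.0)
```

Because the product is taken bin by bin, the filter automatically has the same length as the data, so there is no separate kernel length to choose. Note that multiplying spectra corresponds to circular convolution, so the ends of the record can wrap around unless the signal is padded or tapered.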

Is it possible to optimize this Matlab code for doing vector quantization with centroids from k-means?

I've created a codebook using k-means of size 4000x300 (4000 centroids, each with 300 features). Using the codebook, I then want to label an input vector (for purposes of binning later on). The input vector is of size Nx300, where N is the total number of input instances I receive.
To compute the labels, I calculate the closest centroid for each of the input vectors. To do so, I compare each input vector against all centroids and pick the centroid with the minimum distance. The label is then just the index of that centroid.
My current Matlab code looks like:
function labels = assign_labels(centroids, X)
    labels = zeros(size(X, 1), 1);
    % for each X, calculate the distance from each centroid
    for i = 1:size(X, 1)
        % distance of X_i from all j centroids is: sum((X_i - centroid_j)^2)
        % note: we leave off the sqrt as an optimization
        distances = sum(bsxfun(@minus, centroids, X(i, :)) .^ 2, 2);
        [value, label] = min(distances);
        labels(i) = label;
    end
end
However, this code is still fairly slow (for my purposes), and I was hoping there might be a way to optimize the code further.
One obvious issue is that there is a for-loop, which is the bane of good performance in Matlab. I've been trying to come up with a way to get rid of it, but with no luck (I looked into using arrayfun in conjunction with bsxfun, but haven't gotten that to work). Alternatively, if someone knows of any other way to speed this up, I would greatly appreciate it.
Update
After doing some searching, I couldn't find a great solution using Matlab, so I decided to look at what is used in Python's scikits.learn package for 'euclidean_distance' (shortened):
XX = sum(X * X, axis=1)[:, newaxis]
YY = Y.copy()
YY **= 2
YY = sum(YY, axis=1)[newaxis, :]
distances = XX + YY
distances -= 2 * dot(X, Y.T)
distances = maximum(distances, 0)
which uses the binomial form of the Euclidean distance ((x-y)^2 -> x^2 + y^2 - 2xy), which from what I've read usually runs faster. My completely untested Matlab translation is:
XX = sum(data .* data, 2);
YY = sum(center .^ 2, 2);
% nearest centroid = minimum distance along each row
[~, labels] = min(bsxfun(@plus, XX, YY') - 2*data*center', [], 2);
Use the following function to calculate your distances. You should see an order-of-magnitude speed-up.
The two matrices A and B have the dimensions as columns and one point per row.
A is your matrix of centroids. B is your matrix of data points.
function D=getSim(A,B)
Qa=repmat(dot(A,A,2),1,size(B,1));
Qb=repmat(dot(B,B,2),1,size(A,1));
D=Qa+Qb'-2*A*B';
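For comparison, the same expansion can be sketched in NumPy, followed by the argmin step that turns distances into labels. The array sizes are hypothetical and much smaller than the 4000x300 codebook, and the returned labels are 0-based, unlike Matlab's 1-based indices.

```python
import numpy as np

def assign_labels(centroids, X):
    """Index of the nearest centroid (squared Euclidean distance) for each row of X."""
    xx = np.einsum("ij,ij->i", X, X)[:, np.newaxis]                  # shape (N, 1)
    cc = np.einsum("ij,ij->i", centroids, centroids)[np.newaxis, :]  # shape (1, K)
    d2 = xx + cc - 2.0 * X @ centroids.T                             # shape (N, K)
    return np.argmin(d2, axis=1)

rng = np.random.default_rng(0)
centroids = rng.normal(size=(40, 30))  # hypothetical small codebook
X = rng.normal(size=(200, 30))         # hypothetical input vectors
labels = assign_labels(centroids, X)
```

The dominant cost is the single matrix product X @ centroids.T, which is why this form tends to be faster than looping over rows.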
You can vectorize it by converting to cells and using cellfun:
[nRows,nCols]=size(X);
XCell=num2cell(X,2);
dist=reshape(cell2mat(cellfun(@(x)(sum(bsxfun(@minus,centroids,x).^2,2)),XCell,'UniformOutput',false)),size(centroids,1),nRows);
[~,labels]=min(dist);
Explanation:
We assign each row of X to its own cell in the second line.
The piece @(x)(sum(bsxfun(@minus,centroids,x).^2,2)) is an anonymous function that is the same as your distances=... line; cellfun applies it to each row of X, and cell2mat stacks the results.
Each column of dist holds the distances from one row of X to every centroid, so the labels are the indices of the minimum along each column.
For a true matrix implementation, you may consider trying something along the lines of:
P2 = kron(centroids, ones(size(X,1),1));
Q2 = kron(ones(size(centroids,1),1), X);
distances = reshape(sum((Q2-P2).^2,2), size(X,1), size(centroids,1));
Note
This assumes the data is organized as [x1 y1 ...; x2 y2 ...;...]
You can use a more efficient algorithm for nearest-neighbor search than brute force.
The most popular approach is the Kd-tree, which gives O(log(n)) average query time instead of the O(n) of brute force.
For a Matlab implementation of Kd-trees, you can have a look here
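As a rough sketch of the tree-based approach in Python, scipy.spatial.cKDTree can be built once from the centroids and then queried for every input vector. The data below are hypothetical and deliberately low-dimensional: Kd-trees tend to lose their advantage as the number of features grows, so with 300 features it is worth timing this against the brute-force matrix form before adopting it.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
centroids = rng.normal(size=(500, 8))  # hypothetical codebook, low-dimensional
X = rng.normal(size=(1000, 8))         # hypothetical input vectors

tree = cKDTree(centroids)               # build once, reuse for every batch
distances, labels = tree.query(X, k=1)  # nearest centroid index for each row of X
```

The returned labels are 0-based row indices into centroids, and distances holds the corresponding Euclidean distances.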