Computation complexity of the following matrix multiplications - time-complexity

I would like to know the computational complexities of the following operations including matrices:
a) $H = \beta I + P^{T}P$ where, $\beta$ is a known constant, $I$ is a $N \times N$ identity matrix, $P$ is a $M \times N$ sparse matrix sampled from a $N \times N$ identity matrix. Thus, P contains $M$ ones and rest all elements are zeros.
b) $a = F^{-1}Z + P^{T}B$, where $Z$ and $B$ are $N \times N$ and $M \times N$ respectively and $F{-1}$ is the inverse Fast Fourier transform (FFT).
Summary - I am not able to see how the special structure of $P$ will reduce the time complexity.
My attempt so far:
a) Since $\beta$ is a constant and we know all the places of ones in $I$, $\beta I$ will have $O(1)$ complexity.
I know multiplying two matrices of size $N \times M$ and $M \times N$ has a complexity of $O(N^{2}M) but I am not able to leverage the sparsity and special structure of P to get the actual complexity of this matrix multiplication.
b) For this one, I think that the first one will have $O(N\log N)$ complexity and the second one should have $O(N^{2}M)$ complexity, so the overall complexity will be dominated by $O(N^{2}M)$.
def get_I_matrix(N):
return np.diag(np.ones(N))
def get_P_matrix(picks, N):
return get_I_matrix(N)[np.ix_(picks),:]
def get_H_hat_diag(N, beta, picks):
H_hat_diag = beta*np.ones(N*N)
P_diag = np.zeros(N)
P_diag[picks] += 1
H_hat_diag = H_hat_diag + np.kron(P_diag, P_diag)
return H_hat_diag
def get_a_hat_vec(a_hat, z, beta):
return (a_hat + ifftw2d(beta*z)).flatten('F') # column first
a_hat = np.zeros([N, N])
a_hat = a_hat.astype(complex)
a_hat[np.ix_(picks, picks)] = B
H_hat_diag = get_H_hat_diag(N, beta, picks)
a_hat_vec = get_a_hat_vec(y_hat, z, beta)
Since, latex isn't supported, I didn't want to write $\hat$ everywhere, hence $H$ and $a$ in the problem.

Related

How to add magnitude or value to a vector in Python?

I am using this function to calculate distance between 2 vectors a,b, of size 300, word2vec, I get the distance between 'hot' and 'cold' to be equal 1.
How to add this value (1) to a vector, becz i thought simply new_vec=model['hot']+1, but when I do the calc dist(new_vec,model['hot'])=17?
import numpy
def dist(a,b):
return numpy.linalg.norm(a-b)
a=model['hot']
c=a+1
dist(a,c)
17
I expected dist(a,c) will give me back 1!
You should review what the norm is. In the case of numpy, the default is to use the L-2 norm (a.k.a the Euclidean norm). When you add 1 to a vector, the call is to add 1 to all of the elements in the vector.
>> vec1 = np.random.normal(0,1,size=300)
>> print(vec1[:5])
... [ 1.18469795 0.04074346 -1.77579852 0.23806222 0.81620881]
>> vec2 = vec1 + 1
>> print(vec2[:5])
... [ 2.18469795 1.04074346 -0.77579852 1.23806222 1.81620881]
Now, your call to norm is saying sqrt( (a1-b1)**2 + (a2-b2)**2 + ... + (aN-bN)**2 ) where N is the length of the vector and a is the first vector and b is the second vector (and ai being the ith element in a). Since (a1-b1)**2 == (a2-b2)**2 == ... == (aN-bN)**2 == 1 we expect this sum to produce N which in your case is 300. So sqrt(300) = 17.3 is the expected answer.
>> print(np.linalg.norm(vec1-vec2))
... 17.320508075688775
To answer the question, "How to add a value to a vector": you have done this correctly. If you'd like to add a value to a specific element then you can do vec2[ix] += value where ix indexes the element that you wish to add. If you want to add a value uniformly across all elements in the vector that will change the norm by 1, then add np.sqrt(1/300).
Also possibly relevant is a more commonly used distance metric for word2vec vectors: the cosine distance which measures the angle between two vectors.

How to find time-varying coefficients for a VAR model by using the Kalman Filter

I'm trying to write some code in R to reproduce the model i found in this article.
The idea is to model the signal as a VAR model, but fit the coefficients by a Kalman-filter model. This would essentially enable me to create a robust time-varying VAR(p) model and analyze non-stationary data to a degree.
The model to track the coefficients is:
X(t) = F(t) X(t− 1) +W(t)
Y(t) = H(t) X(t) + E(t),
where H(t) is the Kronecker product between lagged measurements in my time-series Y and a unit vector, and X(t) fills the role of regression-coefficients. F(t) is taken to be an identity matrix, as that should mean we assume coefficients to evolve as a random walk.
In the article, from W(T), the state noise covariance matrix Q(t) is chosen at 10^-3 at first and then fitted based on some iteration scheme. From E(t) the state noise covariance matrix is R(t) substituted by the covariance of the noise term unexplained by the model: Y(t) - H(t)Xhat(t)
I have the a priori covariance matrix of estimation error (denoted Σ in the article) written as P (based on other sources) and the a posteriori as Pmin, since it will be used in the next recursion as a priori, if that makes sense.
So far i've written the following, based on the articles Appendix A 1.2
Y <- *my timeseries, for test purposes two channels of 3000 points*
F <- diag(8) # F is (m^2*p by m^2 *p) where m=2 dimensions and p =2 lags
H <- diag(2) %x% t(vec(Y[,1:2])) #the Kronecker product of vectorized lags Y-1 and Y-2
Xhatminus <- matrix(1,8,1) # an arbitrary *a priori* coefficient matrix
Q <- diag(8)%x%(10**-7) #just a number a really low number diagonal matrix, found it used in some examples
R<- 1 #Didnt know what else to put here just yet
Pmin = diag(8) #*a priori* error estimate, just some 1-s...
Now should start the reccursion. To test i just took the first 3000 points of one trial of my data.
Xhatstorage <- matrix(0,8,3000)
for(j in 3:3000){
H <- diag(2) %x% t(vec(Y[,(j-2):(j-1)]))
K <- (Pmin %*% t(H)) %*% solve((H%*%Pmin%*%t(H) + R)) ##Solve gives inverse matrix ()^-1
P <- Pmin - K%*% H %*% Pmin
Xhatplus <- F%*%( Xhatminus + K%*%(Y[,j]-H%*%Xhatminus) )
Pplus <- (F%*% P %*% F) + Q
Xhatminus <- Xhatplus
Xhatstorage[,j] <- Xhatplus
Pmin <- Pplus
}
I extracted Xhatplus values into a storage matrix and used them to write this primitive VAR model with them:
Yhat<-array(0,3000)
for(t in 3:3000){
Yhat[t]<- t(vec(Y[,(t-2)])) %*% Xhatstorage[c(1,3),t] + t(vec(Y[,(t-1)])) %*% Xhatstorage[c(2,4),t]
}
The looks like this .
The blue line is VAR with Kalman filter found coefficients, Black is original data..
I'm having issue understanding how i can better evaluate my coefficients? Why is it so off?
How should i better choose the first a priori and a posteriori estimates to start the recursion? Currently, adding more lags to the VAR is not the issue i'm sure, it's that i don't know how to choose the initial values for Pmin and Xhatmin. Most places i pieced this together from start from arbitrary 0 assumptions in toy models, but in this case, choosing any of the said matrixes as 0 will just collapse the entire algorithm.
Lastly, is this recursion even a correct implementation of Oya et al describe in the article? I know im still missing the R evaluation based on previously unexplained errors (V(t) in Appendix A 1.2), but in general?

Stable sampling of large Gaussian distributions

I'm trying to sample a Gaussian distribution of covariance matrix P that is N by N, with N very large (around 4000 ).
Usually one would proceed like so:
Compute the Cholesky decomposition of P : L, such that L * L.T = P
Sample a normal Gaussian distribution : X ~N(0,I_N), where I_N is the identity and N = 4000
Obtain the desired sample Y from Y = L * X
The snag here is in the computation of L. The algorithm does not seem to be stable for such a large matrix, as the computed Cholesky decomposition does not satisfy L * L.T != P.
I've tried to normalize P before computing its Cholesky decomposition (dividing it by its largest value), to no avail. I'm using the C++ library Eigen, and I've noticed this problem with numpy as well.
Any advice?
Cholesky decomposition should be quite stable, if the input matrix is actually positive definite. It can have issues if the matrix is (near) semi- or in-definite.
In that case you can use the LDLT decomposition instead. For an input A it computes a permutation P, a unit-diagonal triangular L and a diagonal D, such that
A = P.T*L*D*L.T*P
Then instead of multiplying Y = L * X you need of course Y = sqrt(D) * L * X, where sqrt(D) is an element-wise sqrt (I don't know the python syntax for that).
Note that you can ignore the permutation, since permuting a vector of identically independent distributed random numbers, is still a vector of i.i.d. numbers.
If that still does not work, try using the SelfAdjointEigenSolver-decomposition.
This computes a diagonal matrix of Eigenvalues D and a unitarian matrix V of Eigenvectors, such that
A = V * D * V^{-1}
And you can do essentially the same as above. (Note that for unitarian matrices, V^{-1} is just the adjoint of V, i.e., V^{-1} = V^T in the real-valued case).

Fast way to set diagonals of an (M x N x N) matrix? Einsum / n-dimensional fill_diagonal?

I'm trying to write fast, optimized code based on matrices, and have recently discovered einsum as a tool for achieving significant speed-up.
Is it possible to use this to set the diagonals of a multidimensional array efficiently, or can it only return data?
In my problem, I'm trying to set the diagonals for an array of square matrices (shape: M x N x N) by summing the columns in each square (N x N) matrix.
My current (slow, loop-based) solution is:
# Build dummy array
dimx = 2 # Dimension x (likely to be < 100)
dimy = 3 # Dimension y (likely to be between 2 and 10)
M = np.random.randint(low=1, high=9, size=[dimx, dimy, dimy])
# Blank the diagonals so we can see the intended effect
np.fill_diagonal(M[0], 0)
np.fill_diagonal(M[1], 0)
# Compute diagonals based on summing columns
diags = np.einsum('ijk->ik', M)
# Set the diagonal for each matrix
# THIS IS LOW. CAN IT BE IMPROVED?
for i in range(len(M)):
np.fill_diagonal(M[i], diags[i])
# Print result
M
Can this be improved at all please? It seems np.fill_diagonal doesn't accepted non-square matrices (hence forcing my loop based solution). Perhaps einsum can help here too?
One approach would be to reshape to 2D, set the columns at steps of ncols+1 with the diagonal values. Reshaping creates a view and as such allows us to directly access those diagonal positions. Thus, the implementation would be -
s0,s1,s2 = M.shape
M.reshape(s0,-1)[:,::s2+1] = diags
If you do np.source(np.fill_diagonal) you'll see that in the 2d case it uses a 'strided' approach
if a.ndim == 2:
step = a.shape[1] + 1
end = a.shape[1] * a.shape[1]
a.flat[:end:step] = val
#Divakar's solution applies this to your 3d case by 'flattening' on 2 dimensions.
You could sum the columns with M.sum(axis=1). Though I vaguely recall some timings that found that einsum was actually a bit faster. sum is a little more conventional.
Someone has has asked for an ability to expand dimensions in einsum, but I don't think that will happen.

Coordinate Descent Algorithm in Julia for Least Squares not converging

As a warm-up to writing my own elastic net solver, I'm trying to get a fast enough version of ordinary least squares implemented using coordinate descent.
I believe I've implemented the coordinate descent algorithm correctly, but when I use the "fast" version (see below), the algorithm is insanely unstable, outputting regression coefficients that routinely overflow a 64-bit float when the number of features is of moderate size compared to the number of samples.
Linear Regression and OLS
If b = A*x, where A is a matrix, x a vector of the unknown regression coefficients, and y is the output, I want to find x that minimizes
||b - Ax||^2
If A[j] is the jth column of A and A[-j] is A without column j, and the columns of A are normalized so that ||A[j]||^2 = 1 for all j, the coordinate-wise update is then
Coordinate Descent:
x[j] <-- A[j]^T * (b - A[-j] * x[-j])
I'm following along with these notes (page 9-10) but the derivation is simple calculus.
It's pointed out that instead of recomputing A[j]^T(b - A[-j] * x[-j]) all the time, a faster way to do it is with
Fast Coordinate Descent:
x[j] <-- A[j]^T*r + x[j]
where the total residual r = b - Ax is computed outside the loop over coordinates. The equivalence of these update rules follows from noting that Ax = A[j]*x[j] + A[-j]*x[-j] and rearranging terms.
My problem is that while the second method is indeed faster, it's wildly numerically unstable for me whenever the number of features isn't small compared to the number of samples. I was wondering if anyone might have some insight as to why that's the case. I should note that the first method, which is more stable, still starts disagreeing with more standard methods as the number of features approaches the number of samples.
Julia code
Below is some Julia code for the two update rules:
function OLS_builtin(A,b)
x = A\b
return(x)
end
function OLS_coord_descent(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
for j = 1:P
x[j] = dot(A[:,j], b - A[:,1:P .!= j]*x[1:P .!= j])
end
end
return(x)
end
function OLS_coord_descent_fast(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
r = b - A*x
for j = 1:P
x[j] += dot(A[:,j],r)
end
end
return(x)
end
Example of the problem
I generate data with the following:
n = 100
p = 50
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(Float64,p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Here I use p=50, and I get good agreement between OLS_coord_descent(X,y) and OLS_builtin(X,y), whereas OLS_coord_descent_fast(X,y)returns exponentially large values for the regression coefficients.
When p is less than about 20, OLS_coord_descent_fast(X,y) agrees with the other two.
Conjecture
Since things agrees for the regime of p << n, I think the algorithm is formally correct, but numerically unstable. Does anyone have any thoughts on whether this guess is correct, and if so how to correct for the instability while retaining (most) of the performance gains of the fast version of the algorithm?
The quick answer: You forgot to update r after each x[j] update. Following is the fixed function which behaves like OLS_coord_descent:
function OLS_coord_descent_fast(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
r = b - A*x
for j = 1:P
x[j] += dot(A[:,j],r)
r -= A[:,j]*dot(A[:,j],r) # Add this line
end
end
return(x)
end