Fortran: efficient matrix-vector multiplication - optimization

I have a piece of code which is a significant bottleneck:
do s = 1,ns
msum = 0.d0
do k = 1,ns
msum = msum + tm(k,s)*f(:,:,k)
end do
m(:,:,s) = msum
end do
This is a simple matrix-vector product m=tm*f (where f is length k) for every x,y.
I thought about using a BLAS routine but i am not sure if any allows multiplying along a specific dimension (k). Do any of you have any good advice?

Unfortunately you do not mention the actual shape of f, i.e. the number of x and y. Since you mention this piece of code to be a bottleneck, you can and should replace msum and use the memory m(:,:,s) and spare the first step in you loop, e.g.
do s = 1,ns
m = tm(k,1)*f(:,:,k)
do k = 2, ns
m(:,:,s) = m(:,:,s) + tm(k,s)*f(:,:,k)
end do
end do
Secondly, a more general appraoch
There are ns summations of nK 2D matrices f(:,:,1:nK) by means of scalar factors that are stored in tm(:,1:ns). The goal is to store these sums in m(:,:,1:ns). Why not sum up element-wise wrt x and y to exploit contiguuos memory sections by means of the result? You already mentioned that you can redesign such that k is the first dimension in f, i.e. f(k,:,:).
Considering only the desired outcome, you ought to have ns 2D matrices m(:,:,1:ns) that are independent of each other (outer loop remains at it is). Lets drop this dimension for a moment. The problem then becomes:
m(:,:) = \sum_{k=1}^{ns} tm_k * f_k(:,:)
We should thus sum over k, e.g. have f(k,:,:) to determine m(:,:) as follows (note that I am adding the outer loop for s again):
nK = size(f, 1) ! the "k"s
nX = size(f, 2) ! the "x"s
nY = size(f, 3) ! the "y"s
m = 0.d0
do s = 1, ns
do ii = 1, nY
call DGEMV('N', nK, nY, &
1.d0, f(:,:,nY), 1, tm(:,s), 1, &
1.d0, m(:,nY,s), 1)
end do !ii
end do !s
See the documentation of DGEMV for more details on its usage.
Of course, the above advice of excluding the first step of the loop to spare the initialization by means of zeros may be applied at well.

Related

How to iterate through all non zero values of a sparse matrix and normal matrix

I am using Julia and I want to iterate over the values of a matrix. This matrix can either be a normal matrix or a sparse matrix but I do not have the prior knowledge of that. I would like to create a code that would work in both cases and being optimised for both cases.
For simplicity, I did a example that computes the sum of the vector multiplied by a random value. What I want to do is actually similar to this but instead of being multiplied by a random number is actually an function that takes long time to compute.
myiterator(m::SparseVector) = m.nzval
myiterator(m::AbstractVector) = m
function sumtimesrand(m)
a = 0.
for i in myiterator(m)
a += i * rand()
end
return a
end
I = [1, 4, 3, 5]; V = [1, 2, -5, 3];
Msparse = sparsevec(I,V)
M = rand(5)
sumtimesrand(Msparse)
sumtimesrand(M)
I want my code to work this way. I.e. most of the code is the same and by using the right iterator the code is optimised for both cases (sparse and normal vector).
My question is: is there any iterator that does what I am trying to achieve? In this case, the iterator returns the values but an iterator over the indices would work to.
Cheers,
Dylan
I think you almost had what you are asking for? I.e., change your AbstractVector and SparseVector into AbstractArray and AbstractSparseArray. But maybe I am missing something? See MWE below:
using SparseArrays
using BenchmarkTools # to compare performance
# note the changes here to "Array":
myiterator(m::AbstractSparseArray) = m.nzval
myiterator(m::AbstractArray) = m
function sumtimesrand(m)
a = 0.
for i in myiterator(m)
a += i * rand()
end
return a
end
N = 1000
spV = sprand(N, 0.01); V = Vector(spV)
spM = sprand(N, N, 0.01); M = Matrix(spM)
#btime sumtimesrand($spV); # 0.044936 μs
#btime sumtimesrand($V); # 3.919 μs
#btime sumtimesrand($spM); # 0.041678 ms
#btime sumtimesrand($M); # 4.095 ms

Coordinate Descent Algorithm in Julia for Least Squares not converging

As a warm-up to writing my own elastic net solver, I'm trying to get a fast enough version of ordinary least squares implemented using coordinate descent.
I believe I've implemented the coordinate descent algorithm correctly, but when I use the "fast" version (see below), the algorithm is insanely unstable, outputting regression coefficients that routinely overflow a 64-bit float when the number of features is of moderate size compared to the number of samples.
Linear Regression and OLS
If b = A*x, where A is a matrix, x a vector of the unknown regression coefficients, and y is the output, I want to find x that minimizes
||b - Ax||^2
If A[j] is the jth column of A and A[-j] is A without column j, and the columns of A are normalized so that ||A[j]||^2 = 1 for all j, the coordinate-wise update is then
Coordinate Descent:
x[j] <-- A[j]^T * (b - A[-j] * x[-j])
I'm following along with these notes (page 9-10) but the derivation is simple calculus.
It's pointed out that instead of recomputing A[j]^T(b - A[-j] * x[-j]) all the time, a faster way to do it is with
Fast Coordinate Descent:
x[j] <-- A[j]^T*r + x[j]
where the total residual r = b - Ax is computed outside the loop over coordinates. The equivalence of these update rules follows from noting that Ax = A[j]*x[j] + A[-j]*x[-j] and rearranging terms.
My problem is that while the second method is indeed faster, it's wildly numerically unstable for me whenever the number of features isn't small compared to the number of samples. I was wondering if anyone might have some insight as to why that's the case. I should note that the first method, which is more stable, still starts disagreeing with more standard methods as the number of features approaches the number of samples.
Julia code
Below is some Julia code for the two update rules:
function OLS_builtin(A,b)
x = A\b
return(x)
end
function OLS_coord_descent(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
for j = 1:P
x[j] = dot(A[:,j], b - A[:,1:P .!= j]*x[1:P .!= j])
end
end
return(x)
end
function OLS_coord_descent_fast(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
r = b - A*x
for j = 1:P
x[j] += dot(A[:,j],r)
end
end
return(x)
end
Example of the problem
I generate data with the following:
n = 100
p = 50
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(Float64,p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Here I use p=50, and I get good agreement between OLS_coord_descent(X,y) and OLS_builtin(X,y), whereas OLS_coord_descent_fast(X,y)returns exponentially large values for the regression coefficients.
When p is less than about 20, OLS_coord_descent_fast(X,y) agrees with the other two.
Conjecture
Since things agrees for the regime of p << n, I think the algorithm is formally correct, but numerically unstable. Does anyone have any thoughts on whether this guess is correct, and if so how to correct for the instability while retaining (most) of the performance gains of the fast version of the algorithm?
The quick answer: You forgot to update r after each x[j] update. Following is the fixed function which behaves like OLS_coord_descent:
function OLS_coord_descent_fast(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
r = b - A*x
for j = 1:P
x[j] += dot(A[:,j],r)
r -= A[:,j]*dot(A[:,j],r) # Add this line
end
end
return(x)
end

Optimising Sparse Array Math

I have a sparse array: term_doc
its size is 622256x715 of Float64. It is very sparse:
Of its ~444,913,040 cells, only about 22,215 of them normally are nonempty.
Of the 622256 rows only 4,699 are occupied
though of the 715 columns all are occupied.
The operator I would like to perform can be described as returning the row normalized and column normalized versions this matrix.
The Naive nonsparse version, I wrote is:
function doUnsparseWay()
gc() #Force Garbage collect before I start (and periodically during). This uses alot of memory
term_doc
N = term_doc./sum(term_doc,1)
println("N done")
gc()
P = term_doc./sum(term_doc,2)
println("P done")
gc()
N[isnan(N)] = 0.0
P[isnan(P)] = 0.0
N,P,term_doc
end
Running this:
> #time N,P,term_doc= doUnsparseWay()
outputs:
N done
P done
elapsed time: 30.97332475 seconds (14466 MB allocated, 5.15% gc time in 13 pauses with 3 full sweep)
It is fairly simple.
It chews memory, and will crash if the garbage collection does not occur at the right times (Thus I call it manually).
But it is fairly fast
I wanted to get it to work on the sparse matrix.
So as not to chew my memory out,
and because logically it is a faster operation -- less cells need operating on.
I followed suggestions from this post and from the performance page of the docs.
function doSparseWay()
term_doc::SparseMatrixCSC{Float64,Int64}
N= spzeros(size(term_doc)...)
N::SparseMatrixCSC{Float64,Int64}
for (doc,total_terms::Float64) in enumerate(sum(term_doc,1))
if total_terms == 0
continue
end
#fastmath #inbounds N[:,doc] = term_doc[:,doc]./total_terms
end
println("N done")
P = spzeros(size(term_doc)...)'
P::SparseMatrixCSC{Float64,Int64}
gfs = sum(term_doc,2)[:]
gfs::Array{Float64,1}
nterms = size(term_doc,1)
nterms::Int64
term_doc = term_doc'
#inbounds #simd for term in 1:nterms
#fastmath #inbounds P[:,term] = term_doc[:,term]/gfs[term]
end
println("P done")
P=P'
N[isnan(N)] = 0.0
P[isnan(P)] = 0.0
N,P,term_doc
end
It never completes.
It gets up to outputting "N Done",
but never outputs "P Done".
I have left it running for several hours.
How can I optimize it so it can complete in reasonable time?
Or if this is not possible, explain why.
First, you're making term_doc a global variable, which is a big problem for performance. Pass it as an argument, doSparseWay(term_doc::SparseMatrixCSC). (The type annotation at the beginning of your function does not do anything useful.)
You want to use an approach similar to the answer by walnuss:
function doSparseWay(term_doc::SparseMatrixCSC)
I, J, V = findnz(term_doc)
normI = sum(term_doc, 1)
normJ = sum(term_doc, 2)
NV = similar(V)
PV = similar(V)
for idx = 1:length(V)
NV[idx] = V[idx]/normI[J[idx]]
PV[idx] = V[idx]/normJ[I[idx]]
end
m, n = size(term_doc)
sparse(I, J, NV, m, n), sparse(I, J, PV, m, n), term_doc
end
This is a general pattern: when you want to optimize something for sparse matrices, extract the I, J, V and perform all your computations on V.

Time complexity in backtracking algorithm

I what to calculate the worst case, time complexity for this recursive function.
list is a list of m*n pieces.
matrix is a matrix of mxn to fill with this peaces.
Backtrack(list, matrix):
if(matrix is complete) //O(1)
return true
from the list of m*n pieces, make a list of candidatePieces to put in the matrix. // O(m*n)
for every candidatePiece // Worst case of n*m calls
Put that piece in the matrix // O(1)
if(Backtrack(list, matrix) is true)
return true
What I guess is that the formula is something like:
T(n*m) = T(n*m - 1) + O(n*m) + O(1) = T(n*m - 1) + O(n*m)
Is this correct?
I can't use the master theorem, witch other method can I use to get a closed formula?
If you unwind your formula, you'll get
T(n*m) = T(n*m-1)+O(n*m) = T(n*m-2)+O(n*m-1) + O(n*m) = ...
= O(n*m) + O(n*m-1) + O(n*m-2) +... + O(1) => ~ O(n^2*m^2)
But I wonder is that algorithm complete? It does not seem to return false at all.
So we have:
Backtrack(list, matrix):
if(matrix is complete) //O(1)
return true
from the list of m*n pieces, make a list of candidatePieces to put in the matrix. // O(m*n)
for every candidatePiece // Worst case of n*m calls
Put that piece in the matrix // O(1)
if(Backtrack(list, matrix) is true)
return true
We assume worst-case behaviour, some matrix is complete is false for as long as possible. I'm going to assume this is up until O(n·m) insertions.
So consider for every candidatePiece. In order to get to the second item, you'll need Backtrack(list, matrix) is true to be false at least once. This requires termination other than by return true. As the only way to do that requires exhausting the loop, and that requires exhausting the loop (and so on), this can only happen if candidatePiece is empty.
I assume pieces can only be used once. In this case, there are O(n·m) things in the matrix. This means matrix is complete has returned true, so this actually doesn't let the loop run a second time.
If pieces don't run out, this is also obviously true.
We can then simplify to
Backtrack(list, matrix):
if(matrix is complete)
return true
if(candidatePiece is not empty)
Put the first piece in the matrix
if(Backtrack(list, matrix) is true)
return true
Then we have
T(n·m) = T(n·m - 1) + O(n·m)
as you said, which leads to Ashalynd's answer.
Assume instead that matrix is complete can return false even when the matrix is full. Assume also that pieces run out, so the matrix can never be over-full. I also assume that pieces are removed as appropriate.
The outmost loop (loop 0) costs O(n·m) and runs the inner loop (loop 1) n·m times.
The inner loop (loop 1) costs O(n·m - 1) and runs the inner inner loop (loop 2) n·m -1 times.
...
Loop n·m - 1 costs 1 and runs loop n·m once.
Loop n·m costs 0 (call overhead moves into loop n·m - 1) and terminates.
So, let T(n·m) be the cost for loop 0, and T(0) be the cost for loop n·m.
T(x) = O(x) + x · T(x - 1)
and WolframAlpha solves this to be O(Γ(x+1) + x·Γ(x, 1)).
By some research, this is equal to
O(x! + x· [ (x-1)! · e⁻¹ · eₓ₋₁(1) ])
= O(x!) + O(x· eₓ₋₁(1))
= O(x!) + O(x· ∑ 1/k! from k=0 to x-1)
= O(x!)
So for x = n·m we are talking O((n·m)!) time. That's pretty poor :/.

Is it possible to optimize this Matlab code for doing vector quantization with centroids from k-means?

I've created a codebook using k-means of size 4000x300 (4000 centroids, each with 300 features). Using the codebook, I then want to label an input vector (for purposes of binning later on). The input vector is of size Nx300, where N is the total number of input instances I receive.
To compute the labels, I calculate the closest centroid for each of the input vectors. To do so, I compare each input vector against all centroids and pick the centroid with the minimum distance. The label is then just the index of that centroid.
My current Matlab code looks like:
function labels = assign_labels(centroids, X)
labels = zeros(size(X, 1), 1);
% for each X, calculate the distance from each centroid
for i = 1:size(X, 1)
% distance of X_i from all j centroids is: sum((X_i - centroid_j)^2)
% note: we leave off the sqrt as an optimization
distances = sum(bsxfun(#minus, centroids, X(i, :)) .^ 2, 2);
[value, label] = min(distances);
labels(i) = label;
end
However, this code is still fairly slow (for my purposes), and I was hoping there might be a way to optimize the code further.
One obvious issue is that there is a for-loop, which is the bane of good performance on Matlab. I've been trying to come up with a way to get rid of it, but with no luck (I looked into using arrayfun in conjunction with bsxfun, but haven't gotten that to work). Alternatively, if someone know of any other way to speed this up, I would be greatly appreciate it.
Update
After doing some searching, I couldn't find a great solution using Matlab, so I decided to look at what is used in Python's scikits.learn package for 'euclidean_distance' (shortened):
XX = sum(X * X, axis=1)[:, newaxis]
YY = Y.copy()
YY **= 2
YY = sum(YY, axis=1)[newaxis, :]
distances = XX + YY
distances -= 2 * dot(X, Y.T)
distances = maximum(distances, 0)
which uses the binomial form of the euclidean distance ((x-y)^2 -> x^2 + y^2 - 2xy), which from what I've read usually runs faster. My completely untested Matlab translation is:
XX = sum(data .* data, 2);
YY = sum(center .^ 2, 2);
[val, ~] = max(XX + YY - 2*data*center');
Use the following function to calculate your distances. You should see an order of magnitude speed up
The two matrices A and B have the columns as the dimenions and the rows as each point.
A is your matrix of centroids. B is your matrix of datapoints.
function D=getSim(A,B)
Qa=repmat(dot(A,A,2),1,size(B,1));
Qb=repmat(dot(B,B,2),1,size(A,1));
D=Qa+Qb'-2*A*B';
You can vectorize it by converting to cells and using cellfun:
[nRows,nCols]=size(X);
XCell=num2cell(X,2);
dist=reshape(cell2mat(cellfun(#(x)(sum(bsxfun(#minus,centroids,x).^2,2)),XCell,'UniformOutput',false)),nRows,nRows);
[~,labels]=min(dist);
Explanation:
We assign each row of X to its own cell in the second line
This piece #(x)(sum(bsxfun(#minus,centroids,x).^2,2)) is an anonymous function which is the same as your distances=... line, and using cell2mat, we apply it to each row of X.
The labels are then the indices of the minimum row along each column.
For a true matrix implementation, you may consider trying something along the lines of:
P2 = kron(centroids, ones(size(X,1),1));
Q2 = kron(ones(size(centroids,1),1), X);
distances = reshape(sum((Q2-P2).^2,2), size(X,1), size(centroids,1));
Note
This assumes the data is organized as [x1 y1 ...; x2 y2 ...;...]
You can use a more efficient algorithm for nearest neighbor search than brute force.
The most popular approach are Kd-Tree. O(log(n)) average query time instead of the O(n) brute force complexity.
Regarding a Maltab implementation of Kd-Trees, you can have a look here