Optimising Sparse Array Math

I have a sparse array, term_doc.
Its size is 622256x715, of Float64, and it is very sparse:
of its ~444,913,040 cells, only about 22,215 are normally nonzero.
Of the 622,256 rows only 4,699 are occupied,
though all of the 715 columns are occupied.
The operation I would like to perform can be described as returning the row-normalized and column-normalized versions of this matrix.
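Concretely, writing T for term_doc, the two outputs are (this just restates what the code below computes, with 0/0 taken as 0):

N_{ij} = T_{ij} / \sum_k T_{kj}    (each column of N sums to 1)
P_{ij} = T_{ij} / \sum_k T_{ik}    (each occupied row of P sums to 1)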
The naive, non-sparse version I wrote is:
function doUnsparseWay()
gc() # Force garbage collection before I start (and periodically during). This uses a lot of memory
term_doc
N = term_doc./sum(term_doc,1)
println("N done")
gc()
P = term_doc./sum(term_doc,2)
println("P done")
gc()
N[isnan(N)] = 0.0
P[isnan(P)] = 0.0
N,P,term_doc
end
Running this:
> @time N,P,term_doc = doUnsparseWay()
outputs:
N done
P done
elapsed time: 30.97332475 seconds (14466 MB allocated, 5.15% gc time in 13 pauses with 3 full sweep)
It is fairly simple.
It chews memory, and will crash if the garbage collection does not occur at the right times (thus I call it manually).
But it is fairly fast.
I wanted to get it to work on the sparse matrix,
so as not to chew up my memory,
and because logically it is a faster operation -- fewer cells need operating on.
I followed suggestions from this post and from the performance page of the docs.
function doSparseWay()
term_doc::SparseMatrixCSC{Float64,Int64}
N= spzeros(size(term_doc)...)
N::SparseMatrixCSC{Float64,Int64}
for (doc,total_terms::Float64) in enumerate(sum(term_doc,1))
if total_terms == 0
continue
end
@fastmath @inbounds N[:,doc] = term_doc[:,doc]./total_terms
end
println("N done")
P = spzeros(size(term_doc)...)'
P::SparseMatrixCSC{Float64,Int64}
gfs = sum(term_doc,2)[:]
gfs::Array{Float64,1}
nterms = size(term_doc,1)
nterms::Int64
term_doc = term_doc'
@inbounds @simd for term in 1:nterms
@fastmath @inbounds P[:,term] = term_doc[:,term]/gfs[term]
end
println("P done")
P=P'
N[isnan(N)] = 0.0
P[isnan(P)] = 0.0
N,P,term_doc
end
It never completes.
It gets as far as outputting "N done",
but never outputs "P done".
I have left it running for several hours.
How can I optimize it so it can complete in reasonable time?
Or if this is not possible, explain why.

First, you're making term_doc a global variable, which is a big problem for performance. Pass it as an argument, doSparseWay(term_doc::SparseMatrixCSC). (The type annotation at the beginning of your function does not do anything useful.)
You want to use an approach similar to the answer by walnuss:
function doSparseWay(term_doc::SparseMatrixCSC)
I, J, V = findnz(term_doc)       # row indices, column indices, nonzero values
normI = sum(term_doc, 1)         # column sums
normJ = sum(term_doc, 2)         # row sums
NV = similar(V)
PV = similar(V)
for idx = 1:length(V)
NV[idx] = V[idx]/normI[J[idx]]   # divide by the sum of this value's column
PV[idx] = V[idx]/normJ[I[idx]]   # divide by the sum of this value's row
end
m, n = size(term_doc)
sparse(I, J, NV, m, n), sparse(I, J, PV, m, n), term_doc
end
This is a general pattern: when you want to optimize something for sparse matrices, extract the I, J, V and perform all your computations on V.
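A quick sanity check on a small example (my own sketch, in the same pre-1.0 Julia syntax as the rest of this post; on Julia 1.x add using SparseArrays and write sum(N, dims=1)):

    # Toy 5x3 "term-document" matrix with 4 nonzeros
    T = sparse([1, 1, 3, 4], [1, 2, 2, 3], [2.0, 1.0, 3.0, 4.0], 5, 3)
    N, P, T = doSparseWay(T)
    sum(N, 1)   # every column of N sums to 1.0
    sum(P, 2)   # every occupied row of P sums to 1.0; empty rows stay 0.0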

Related

Is it possible to optimize these fortran loops?

Here is my problem: I have a Fortran code with a number of nested loops, and first I wanted to know whether it's possible to optimize them (by rearranging them) in order to get a time gain. Second, I wonder if I could use OpenMP to optimize them.
I have seen a lot of posts about nested do loops in Fortran and how to optimize them, but I didn't find an example suited to mine. I have also searched about OpenMP for nested do loops in Fortran, but I'm at level 0 in OpenMP and it's difficult for me to know how to use it in my case.
Here are two very similar examples of loops that I have, first:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,c,d) - B(p,q,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,k,l) - B(p,q,l,k))*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,c,d) - B(p,q,d,c))*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,k,l) - B(p,q,l,k))*G(kl,ij)
end do
end do
end do
end do
end do
and the other one is:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,c,d)*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,k,l)*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,c,d)*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,k,l)*G(kl,ij)
end do
end do
end do
end do
end do
The very small difference between the two examples is mainly in the indices of the loops. I don't know if you need more info about the different integers in the loops, but in general: nO < nOO < N < nVV. So I don't know if it's possible to optimize these loops and/or rewrite them in a way that will facilitate the use of OpenMP (I don't know yet if I will use OpenMP; it will depend on how much I can gain by optimizing the loops without it).
I already tried to rearrange the loops in different ways without any success (no time gain) and I also tried a little bit of OpenMP but I don't know much about it, so again no success.
From the initial comments it appears that, at least in some cases, you may be using more memory than the available RAM, which means you may be hitting the swap file, with all the bad consequences for performance. To fix this, you have to either install more RAM if possible, or deeply reorganize your code so that the full B array (by far the largest one) is not stored at once (again, if possible).
Now, let's assume that you have enough RAM. As I wrote in the comments, the access pattern on the B array is far from optimal, as the inner loops correspond to the last indices of B, which can result in many cache misses (all the more given the size of B). Changing the loop order, if possible, is the way to go.
Just looking at your first example, I am focusing on the computation of the array A (the computation of the array E looks completely independent of A, so it can be processed separately):
!! test it at first without OpenMP
!!$OMP PARALLEL DO PRIVATE(cd,c,d,kl,k,l)
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,c,d) - B(:,:,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,k,l) - B(:,:,l,k))*D(kl,ab)
end do
end do
end do
!!$OMP END PARALLEL DO
What I did:
moved the loops on p and q from outer to inner positions (it's not always as easy as it is here)
replaced them with array syntax (no performance gain to expect, just code that is easier to read)
Now the inner loops (abstracted by the array syntax) tackle contiguous elements in memory, which is much better for performance. The code is even ready for OpenMP multithreading on the (now) outer loop.
EDIT/Hint
Fortran stores arrays in "column-major order", that is, when incrementing the first index one accesses contiguous elements in memory. In C, arrays are stored in "row-major order", that is, when incrementing the last index one accesses contiguous elements in memory. So a general rule is to have the inner loops on the first indices (and the opposite in C).
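The same effect can be reproduced in Julia, which (like Fortran) is column-major; this is just an illustrative sketch, not part of the original answer:

    # Fast: the first index varies fastest, so memory access is contiguous.
    function sum_colmajor(A)
        s = 0.0
        for j in 1:size(A,2), i in 1:size(A,1)   # j outer, i inner
            s += A[i,j]
        end
        return s
    end

    # Slower: the last index varies fastest, so access is strided.
    function sum_rowmajor(A)
        s = 0.0
        for i in 1:size(A,1), j in 1:size(A,2)   # i outer, j inner
            s += A[i,j]
        end
        return s
    end

    A = rand(4000, 4000)
    @time sum_colmajor(A)   # first calls include compile time; run twice
    @time sum_rowmajor(A)   # typically several times slower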
It would be helpful if you could describe the operations you'd like to perform using tensor notation and the Einstein summation rule. I have the feeling the code could be written much more succinctly using something like np.einsum in NumPy.
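For reference, reading the contraction directly off the first loop nest in the question gives (with (cd) and (kl) denoting the packed pair indices built by the counters, nO < c < d <= N and 1 <= k < l <= nO):

A_{pq,ab} += \sum_{c<d} ( B_{pq,cd} - B_{pq,dc} ) C_{(cd),ab} + \sum_{k<l} ( B_{pq,kl} - B_{pq,lk} ) D_{(kl),ab}

and the E update has the same form, with F, G and the index ij in place of C, D and ab.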
For the second block of loop nests (the ones where you iterate across a square sub-section of B as opposed to a triangle) you could try to introduce some sub-programs or primitives from which the full solution is built.
Working from the bottom up, you start with a simple sum of two matrices.
!
! a_ij := a_ij + beta * b_ij
!
pure subroutine apb(A,B,beta)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:)
real(dp), intent(in) :: beta
A = A + beta*B
end subroutine
(for first code block in the original post, you would substitute this primitive with one that only updates the upper/lower triangle of the matrix)
One step higher is a tensor contraction
!
! a_ij := a_ij + b_ijkl c_kl
!
pure subroutine reduce_b(A,B,C)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:,:,:)
real(dp), intent(in) :: C(:,:)
integer :: k, l
do l = 1, size(B,4)
do k = 1, size(B,3)
call apb( A, B(:,:,k,l), C(k,l) )
end do
end do
end subroutine
Note that the dimensions of C must match the last two dimensions of B. (In the original loop nest above, the storage order of C is swapped, i.e. c_lk instead of c_kl.)
Working our way upward, we have the contractions with two different sub-blocks of B, moreover A, C, and D have an additional outer dimension:
!
! A_n := A_n + B1_cd C_cdn + B2_kl D_kln
!
! The elements of A_n are a_ijn
! The elements of B1_cd are B1_ijcd
! The elements of B2_kl are B2_ijkl
!
subroutine abcd(A,B1,C,B2,D)
real(dp), intent(inout), contiguous :: A(:,:,:)
real(dp), intent(in) :: B1(:,:,:,:)
real(dp), intent(in) :: B2(:,:,:,:)
real(dp), intent(in), contiguous, target :: C(:,:), D(:,:)
real(dp), pointer :: p_C(:,:,:) => null()
real(dp), pointer :: p_D(:,:,:) => null()
integer :: k
integer :: nc, nd
nc = size(B1,3)*size(B1,4)
nd = size(B2,3)*size(B2,4)
if (nc /= size(C,1)) then
error stop "FATAL ERROR: Dimension mismatch between B1 and C"
end if
if (nd /= size(D,1)) then
error stop "FATAL ERROR: Dimension mismatch between B2 and D"
end if
! Pointer remapping of arrays C and D to rank-3
p_C(1:size(B1,3),1:size(B1,4),1:size(C,2)) => C
p_D(1:size(B2,3),1:size(B2,4),1:size(D,2)) => D
!$omp parallel do default(private) shared(A,B1,p_C,B2,p_D)
do k = 1, size(A,3)
call reduce_b( A(:,:,k), B1, p_C(:,:,k))
call reduce_b( A(:,:,k), B2, p_D(:,:,k))
end do
!$omp end parallel do
end subroutine
Finally, we reach the main level where we select the subblocks of B
program doit
use transform, only: abcd, dp
implicit none
! n0 [2,10]
!
integer, parameter :: n0 = 6
integer, parameter :: n00 = n0*n0
integer :: N, nVV
real(dp), allocatable :: A(:,:,:), B(:,:,:,:), C(:,:), D(:,:)
! N [100,200]
!
read(*,*) N
nVV = (N - n0)**2
allocate(A(N,N,nVV))
allocate(B(N,N,N,N))
allocate(C(nVV,nVV))
allocate(D(n00,nVV))
print *, "Memory occupied (MB): ", &
real(sizeof(A) + sizeof(B) + sizeof(C) + sizeof(D),dp) / 1024._dp**2
A = 0
call random_number(B)
call random_number(C)
call random_number(D)
call abcd(A=A, &
B1=B(:,:,n0+1:N,n0+1:N), &
B2=B(:,:,1:n0,1:n0), &
C=C, &
D=D)
deallocate(A,B,C,D)
end program
Similar to the answer by PierU, parallelization is on the outermost loop. On my PC, for N = 50, this re-engineered routine is about 8 times faster when executed serially. With OpenMP on 4 threads the factor is 20. For N = 100, I got tired of waiting for the original code; the re-engineered version on 4 threads took about 3 minutes.
The full code I used for testing, configurable via environment variables (ORIG=<0|1> N=100 ./abcd), is available here: https://gist.github.com/ivan-pi/385b3ae241e517381eb5cf84f119423d
With more fine-tuning it should be possible to bring the numbers down even further. Even better performance could be sought with a specialized library like cuTENSOR (also used under the hood of the Fortran intrinsics, as explained in Bringing Tensor Cores to Standard Fortran) or a tool like the Tensor Contraction Engine.
One last thing I found odd was that large parts of B are unused. The subsections B(:,:,1:n0,n0+1:N) and B(:,:,n0+1:N,1:n0) appear to be wasted space.

numpy matmul is very very slow

I'm trying to implement the Gradient descent method for solving $Ax = b$ for a positive definite symmetric matrix $A$ of size about $9600 \times 9600$. I thought my code was relatively simple
import numpy as np

# Solves the problem Ax = b for x within epsilon tolerance or until MAX_ITERATION is reached
def GradientDescent(Amat, target, epsilon=.01, MAX_ITERATION=100, x=np.zeros(9604)):
    CurrentRes = target - np.matmul(Amat, x)
    count = 0
    while(np.linalg.norm(CurrentRes) > epsilon and count < MAX_ITERATION):
        Ar = np.matmul(Amat, CurrentRes)
        alpha = CurrentRes.T.dot(CurrentRes)/CurrentRes.T.dot(Ar)
        x = x + alpha*CurrentRes
        Ax = np.matmul(Amat, x)
        CurrentRes = target - Ax
        count = count + 1
    return (x, count, np.linalg.norm(CurrentRes))

# A is a square matrix about 9600x9600 and b is about 9600x1
GDSum = GradientDescent(A, b)
but the above takes almost 3 minutes to run a single iteration of the main while loop.
I didn't think that $9600 \times 9600$ was too big for NumPy to handle effectively, but even the step of computing alpha, which is just the quotient of two dot products, is taking over 30 seconds.
I tried error-testing the code by timing each action in the while loop, and all of them are running much slower than expected. A single matrix multiplication is taking almost a minute. The steps involving vector addition or subtraction at least seem to be running quickly.
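For scale (rough arithmetic): one dense matrix-vector product at this size is only about $9600^2 \approx 9.2 \times 10^7$ multiply-adds, which an optimized NumPy/BLAS build should finish in milliseconds, so the problem size alone cannot explain these timings.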
Perhaps the most relevant bit of information is missing.
Your function is fast when A and b are NumPy arrays, but it's terribly slow when they are lists.
Is that your case?

Coordinate Descent Algorithm in Julia for Least Squares not converging

As a warm-up to writing my own elastic net solver, I'm trying to get a fast enough version of ordinary least squares implemented using coordinate descent.
I believe I've implemented the coordinate descent algorithm correctly, but when I use the "fast" version (see below), the algorithm is insanely unstable, outputting regression coefficients that routinely overflow a 64-bit float when the number of features is of moderate size compared to the number of samples.
Linear Regression and OLS
If b = A*x, where A is a matrix, x is a vector of the unknown regression coefficients, and b is the output, I want to find x that minimizes
||b - Ax||^2
If A[j] is the jth column of A and A[-j] is A without column j, and the columns of A are normalized so that ||A[j]||^2 = 1 for all j, the coordinate-wise update is then
Coordinate Descent:
x[j] <-- A[j]^T * (b - A[-j] * x[-j])
I'm following along with these notes (page 9-10) but the derivation is simple calculus.
It's pointed out that instead of recomputing A[j]^T(b - A[-j] * x[-j]) all the time, a faster way to do it is with
Fast Coordinate Descent:
x[j] <-- A[j]^T*r + x[j]
where the total residual r = b - Ax is computed outside the loop over coordinates. The equivalence of these update rules follows from noting that Ax = A[j]*x[j] + A[-j]*x[-j] and rearranging terms.
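Writing the rearrangement out explicitly (same notation as above, and using the normalization ||A[j]||^2 = 1):

A[j]^T * (b - A[-j]*x[-j]) = A[j]^T * (b - A*x + A[j]*x[j])
                           = A[j]^T*r + x[j] * A[j]^T*A[j]
                           = A[j]^T*r + x[j]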
My problem is that while the second method is indeed faster, it's wildly numerically unstable for me whenever the number of features isn't small compared to the number of samples. I was wondering if anyone might have some insight as to why that's the case. I should note that the first method, which is more stable, still starts disagreeing with more standard methods as the number of features approaches the number of samples.
Julia code
Below is some Julia code for the two update rules:
function OLS_builtin(A,b)
x = A\b
return(x)
end
function OLS_coord_descent(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
for j = 1:P
x[j] = dot(A[:,j], b - A[:,1:P .!= j]*x[1:P .!= j])
end
end
return(x)
end
function OLS_coord_descent_fast(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
r = b - A*x
for j = 1:P
x[j] += dot(A[:,j],r)
end
end
return(x)
end
Example of the problem
I generate data with the following:
n = 100
p = 50
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(Float64,p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Here I use p=50, and I get good agreement between OLS_coord_descent(X,y) and OLS_builtin(X,y), whereas OLS_coord_descent_fast(X,y) returns exponentially large values for the regression coefficients.
When p is less than about 20, OLS_coord_descent_fast(X,y) agrees with the other two.
Conjecture
Since things agree in the regime p << n, I think the algorithm is formally correct, but numerically unstable. Does anyone have any thoughts on whether this guess is correct, and if so, how to correct for the instability while retaining (most of) the performance gains of the fast version of the algorithm?
The quick answer: You forgot to update r after each x[j] update. Following is the fixed function which behaves like OLS_coord_descent:
function OLS_coord_descent_fast(A,b)
N,P = size(A)
x = zeros(P)
for cycle in 1:1000
r = b - A*x
for j = 1:P
x[j] += dot(A[:,j],r)
r -= A[:,j]*dot(A[:,j],r) # Add this line: keep the residual consistent with the updated x[j]
end
end
return(x)
end
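A quick check, reusing the data generated in the question (a sketch in the same pre-1.0 Julia syntax as the rest of the post; on Julia 1.x, norm requires using LinearAlgebra):

    x_ref  = OLS_builtin(X, y)
    x_fast = OLS_coord_descent_fast(X, y)   # the fixed version above
    norm(x_ref - x_fast)                    # should now be small instead of overflowing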

Fortran: efficient matrix-vector multiplication

I have a piece of code which is a significant bottleneck:
do s = 1,ns
msum = 0.d0
do k = 1,ns
msum = msum + tm(k,s)*f(:,:,k)
end do
m(:,:,s) = msum
end do
This is a simple matrix-vector product m = tm*f for every x,y (the vector runs over the index k).
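In index form, the loop computes

m(x,y,s) = \sum_{k=1}^{ns} tm(k,s) * f(x,y,k)

for every x, y and s.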
I thought about using a BLAS routine, but I am not sure if any of them allows multiplying along a specific dimension (k). Do any of you have any good advice?
Unfortunately you do not mention the actual shape of f, i.e. the number of x and y values. Since you mention this piece of code is a bottleneck, you can and should drop msum, accumulate directly into m(:,:,s), and spare the first step of your loop, e.g.
do s = 1,ns
m(:,:,s) = tm(1,s)*f(:,:,1)
do k = 2, ns
m(:,:,s) = m(:,:,s) + tm(k,s)*f(:,:,k)
end do
end do
Secondly, a more general approach:
There are ns summations of nK 2D matrices f(:,:,1:nK), weighted by scalar factors that are stored in tm(:,1:ns). The goal is to store these sums in m(:,:,1:ns). Why not sum up element-wise with respect to x and y, so that the result is built from contiguous memory sections? You already mentioned that you can redesign such that k is the first dimension in f, i.e. f(k,:,:).
Considering only the desired outcome, you ought to have ns 2D matrices m(:,:,1:ns) that are independent of each other (the outer loop remains as it is). Let's drop this dimension for a moment. The problem then becomes:
m(:,:) = \sum_{k=1}^{ns} tm_k * f_k(:,:)
We should thus sum over k, e.g. have f(k,:,:) to determine m(:,:) as follows (note that I am adding the outer loop for s again):
nK = size(f, 1) ! the "k"s
nX = size(f, 2) ! the "x"s
nY = size(f, 3) ! the "y"s
m = 0.d0
do s = 1, ns
do ii = 1, nY
! m(:,ii,s) = m(:,ii,s) + matmul(transpose(f(:,:,ii)), tm(:,s))
call DGEMV('T', nK, nX, &
1.d0, f(:,:,ii), nK, tm(:,s), 1, &
1.d0, m(:,ii,s), 1)
end do !ii
end do !s
See the documentation of DGEMV for more details on its usage.
Of course, the above advice of excluding the first step of the loop to spare the initialization by means of zeros may be applied as well.

Time complexity in backtracking algorithm

I want to calculate the worst-case time complexity of this recursive function.
list is a list of m*n pieces.
matrix is an m x n matrix to fill with these pieces.
Backtrack(list, matrix):
if(matrix is complete) //O(1)
return true
from the list of m*n pieces, make a list of candidatePieces to put in the matrix. // O(m*n)
for every candidatePiece // Worst case of n*m calls
Put that piece in the matrix // O(1)
if(Backtrack(list, matrix) is true)
return true
What I guess is that the formula is something like:
T(n*m) = T(n*m - 1) + O(n*m) + O(1) = T(n*m - 1) + O(n*m)
Is this correct?
I can't use the master theorem; which other method can I use to get a closed formula?
If you unwind your formula, you'll get
T(n*m) = T(n*m-1)+O(n*m) = T(n*m-2)+O(n*m-1) + O(n*m) = ...
= O(n*m) + O(n*m-1) + O(n*m-2) +... + O(1) => ~ O(n^2*m^2)
But I wonder: is that algorithm complete? It does not seem to return false at all.
So we have:
Backtrack(list, matrix):
if(matrix is complete) //O(1)
return true
from the list of m*n pieces, make a list of candidatePieces to put in the matrix. // O(m*n)
for every candidatePiece // Worst case of n*m calls
Put that piece in the matrix // O(1)
if(Backtrack(list, matrix) is true)
return true
We assume worst-case behaviour: matrix is complete is false for as long as possible. I'm going to assume this is up until O(n·m) insertions.
So consider for every candidatePiece. In order to get to the second item, you'll need Backtrack(list, matrix) is true to be false at least once. This requires termination other than by return true. As the only way to do that is to exhaust the loop, and that in turn requires exhausting the loop one level down (and so on), this can only happen if candidatePiece is empty.
I assume pieces can only be used once. In this case, there are O(n·m) things in the matrix. This means matrix is complete has returned true, so this actually doesn't let the loop run a second time.
If pieces don't run out, this is also obviously true.
We can then simplify to
Backtrack(list, matrix):
if(matrix is complete)
return true
if(candidatePiece is not empty)
Put the first piece in the matrix
if(Backtrack(list, matrix) is true)
return true
Then we have
T(n·m) = T(n·m - 1) + O(n·m)
as you said, which leads to Ashalynd's answer.
Assume instead that matrix is complete can return false even when the matrix is full. Assume also that pieces run out, so the matrix can never be over-full. I also assume that pieces are removed as appropriate.
The outermost loop (loop 0) costs O(n·m) and runs the inner loop (loop 1) n·m times.
The inner loop (loop 1) costs O(n·m - 1) and runs the inner inner loop (loop 2) n·m -1 times.
...
Loop n·m - 1 costs 1 and runs loop n·m once.
Loop n·m costs 0 (call overhead moves into loop n·m - 1) and terminates.
So, let T(n·m) be the cost for loop 0, and T(0) be the cost for loop n·m.
T(x) = O(x) + x · T(x - 1)
and WolframAlpha solves this to be O(Γ(x+1) + x·Γ(x, 1)).
By some research, this is equal to
O(x! + x · [ (x-1)! · e⁻¹ · eₓ₋₁(1) ])
= O(x!) + O(x! · eₓ₋₁(1))
= O(x!) + O(x! · ∑ 1/k! from k=0 to x-1)
= O(x!)    (since the sum is bounded above by e)
So for x = n·m we are talking O((n·m)!) time. That's pretty poor :/.