Faster than numpy.matmul for multiplying matrix by its transpose - numpy

I have a large 2D NumPy array M in Python, and I want to compute numpy.matmul(M, M.T), or equivalently, numpy.dot(M, M.T).
However, numpy.matmul and numpy.dot won't exploit the symmetry involved in multiplication with the transpose, so I believe I am doing twice the work that I really need to do.
Is there an easy way to make this faster by exploiting the symmetry and only doing half the work? Perhaps there is a NumPy/SciPy function or some other Python library I'm not aware of that accomplishes this?

Someone informed me that numpy actually already accounts for the symmetry:
https://github.com/numpy/numpy/blob/9a1229f86ca4d4041c9aa48027a21c7ad97da748/numpy/core/src/umath/matmul.c.src#L157
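If you want to call the underlying BLAS routine for this product yourself, the symmetric rank-k update (?syrk) computes M @ M.T with roughly half the flops of a general matmul and fills only one triangle of the result; SciPy exposes it. A minimal sketch (the mirroring step is only needed if you want the full dense matrix):

import numpy as np
from scipy.linalg import blas

M = np.random.rand(500, 1000)

# dsyrk writes only one triangle of M @ M.T (upper, since lower=0 by default)
C = blas.dsyrk(1.0, M)
# mirror the strict upper triangle down to get the full symmetric matrix
C = C + np.triu(C, 1).T

assert np.allclose(C, M @ M.T)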

Related

Copying a PyTorch Variable to a Numpy array

Suppose I have a PyTorch Variable in GPU:
var = Variable(torch.rand((100,100,100))).cuda()
What's the best way to copy (not bridge) this variable to a NumPy array?
var.clone().data.cpu().numpy()
or
var.data.cpu().numpy().copy()
By running a quick benchmark, .clone() was slightly faster than .copy(). However, .clone() + .numpy() will create a PyTorch Variable plus a NumPy bridge, while .copy() will create a NumPy bridge + a NumPy array.
This is a very interesting question. In my opinion, it is a little bit opinion-based, and I would like to share my view on it.
Of the two approaches, I would prefer the first one (using clone()). Since your goal is to copy information, you essentially need to spend extra memory either way. clone() and copy() should take a similar amount of storage, since creating the NumPy bridge doesn't allocate extra memory. Also, I didn't understand what you meant by copy() creating two NumPy arrays. And since, as you mentioned, clone() is faster than copy(), I don't see any other problem with using clone().
I would be happy to reconsider if anyone can provide some counterarguments.
Because clone() is recorded by autograd (AD), the second option is less intensive. There are a few other options you may also consider.
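For reference, a minimal sketch of both idioms on a modern PyTorch, where Variable has been merged into Tensor; using detach() instead of .data is my assumption for sidestepping the autograd recording mentioned above:

import torch

var = torch.rand(100, 100, 100, device="cuda")

# Option 1: clone on the GPU, then move to CPU and bridge to NumPy
arr1 = var.clone().detach().cpu().numpy()
# Option 2: bridge first, then copy on the NumPy side
arr2 = var.detach().cpu().numpy().copy()
# Note: .cpu() on a CUDA tensor already copies the data off the GPU,
# so for GPU tensors the trailing .copy() duplicates data a second time
arr3 = var.detach().cpu().numpy()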

How to use the backslash operator in Julia?

I am currently trying to invert huge matrices of order 1 million by 1 million, and I figured that the backslash operator will be helpful in doing this. Any idea as to how it's implemented? I did not find any concrete examples, so any help is much appreciated.
Any idea as to how it's implemented?
It's a multialgorithm. This shows how to use it:
julia> A = rand(10,10)
10×10 Array{Float64,2}:
0.330453 0.294142 0.682869 0.991427 … 0.533443 0.876566 0.157157
0.666233 0.47974 0.172657 0.427015 0.501511 0.0978822 0.634164
0.829653 0.380123 0.589555 0.480963 0.606704 0.642441 0.159564
0.709197 0.570496 0.484826 0.17325 0.699379 0.0281233 0.66744
0.478663 0.87298 0.488389 0.188844 0.38193 0.641309 0.448757
0.471705 0.804767 0.420039 0.0528729 … 0.658368 0.911007 0.705696
0.679734 0.542958 0.22658 0.977581 0.197043 0.717683 0.21933
0.771544 0.326557 0.863982 0.641557 0.969889 0.382148 0.508773
0.932684 0.531116 0.838293 0.031451 0.242338 0.663352 0.784813
0.283031 0.754613 0.938358 0.0408097 0.609105 0.325545 0.671151
julia> b = rand(10)
10-element Array{Float64,1}:
0.0795157
0.219318
0.965155
0.896807
0.701626
0.741823
0.954437
0.573683
0.493615
0.0821557
julia> A\b
10-element Array{Float64,1}:
1.47909
2.39816
-0.15789
0.144003
-1.10083
-0.273698
-0.775122
0.590762
-0.0266894
-2.36216
You can use @which to see how it's defined:
julia> @which A\b
\(A::AbstractArray{T,2} where T, B::Union{AbstractArray{T,1}, AbstractArray{T,2}} where T) in Base.LinAlg at linalg\generic.jl:805
Which leads us here: https://github.com/JuliaLang/julia/blob/master/base/linalg/generic.jl#L827 (line numbers change slightly because of version differences). As you can see, it does a few quick function calls to determine what type of matrix it is. istril finds out if it's lower triangular: https://github.com/JuliaLang/julia/blob/master/base/linalg/generic.jl#L987 , etc. Once it determines the matrix type, it specializes the matrix as much as possible so it can be efficient, and then calls \. These specialized matrix types either perform a factorization on which \ then does the back-substitution (calling \ on a factorization yourself is a nice way to re-use it, BTW), or they "directly know" the answer, as for triangular or diagonal matrices.
Can't get more concrete than the source.
Note that \ is slightly different from just inverting. You usually do not want to invert a matrix, let alone a large matrix; the factorizations are much more numerically stable. However, inv will do an inversion, which is a lot like an LU factorization (which in Julia is lufact). You may also want to look into pinv for the pseudo-inverse in cases where the matrix is singular or close to singular, but you should really avoid this and instead factorize and solve the system rather than using the inverse.
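As an aside for Python readers, the same factorize-once, then back-substitute pattern looks like this in SciPy (my analogy for illustration, not Julia's implementation):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.random.rand(10, 10)
b1 = np.random.rand(10)
b2 = np.random.rand(10)

lu, piv = lu_factor(A)        # factor A once (LU with partial pivoting)
x1 = lu_solve((lu, piv), b1)  # each subsequent solve is just back-substitution
x2 = lu_solve((lu, piv), b2)

assert np.allclose(A @ x1, b1) and np.allclose(A @ x2, b2)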
For very large sparse matrices, you'll want to use iterative solvers. You'll find a lot of implementations in IterativeSolvers.jl

Update submatrix in Tensorflow

Quite simply, what I want to do is the following
import numpy as np

A = np.ones((3,3))  # arbitrary matrix
B = np.ones((2,2))  # arbitrary matrix
A[1:,1:] = A[1:,1:] + B
except in Tensorflow (where the matrices can be arbitrarily complicated tensor expressions). Neither A nor B is a Tensorflow Variable, but just a run-of-the-mill tensor.
What I have gathered so far: tensors are immutable, so I cannot assign to a submatrix. tf.scatter_nd is the current option for sub-assignment, but does not appear to support sub-matrices, only slices.
Methods that should work, but are perhaps not ideal:
I could pad B with zeros (sketched below), but I'm sure this leads to instantiation of an unnecessarily large B - can it be made sparse, maybe?
I could use the padding idea, but write it as a low-rank decomposition, e.g. in Numpy: A + U.dot(B).dot(U.T), where U is a stacked zero and identity matrix. I'm not sure this is actually advantageous.
I could split A into submatrices, and stack them back together. Might be the most efficient, but sounds like the code would be convoluted.
Ideally, I want to do this operation N times for progressively smaller matrices, resulting in one large final result, but this is tangential.
I'll use one of the hacks for now, but I'm hoping someone can tell me what the idiomatic version is!
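For reference, a minimal sketch of the padding idea, plus a scatter-based route; note that tf.tensor_scatter_nd_add appeared in TensorFlow releases newer than this question assumes, so treat that part as an option for current code:

import tensorflow as tf

A = tf.ones((3, 3))
B = tf.ones((2, 2))

# Padding idea: zero-pad B up to A's shape, then add.
# [[1, 0], [1, 0]] puts one row of zeros above B and one column to its left.
result = A + tf.pad(B, [[1, 0], [1, 0]])

# Scatter-based alternative: add B's entries at the (row, col)
# positions of the A[1:, 1:] block.
rows, cols = tf.meshgrid(tf.range(1, 3), tf.range(1, 3), indexing="ij")
indices = tf.stack([tf.reshape(rows, [-1]), tf.reshape(cols, [-1])], axis=1)
result2 = tf.tensor_scatter_nd_add(A, indices, tf.reshape(B, [-1]))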

Numpy/Scipy pinv and pinv2 behave differently

I am working with two-dimensional NumPy arrays for Extreme Learning Machines. One of my arrays, H, is random, and I want to compute its pseudoinverse.
If I use scipy.linalg.pinv2 everything runs smoothly. However, if I use scipy.linalg.pinv, problems sometimes arise (30-40% of the time).
The reason why I am using pinv2 is because I read (here: http://vene.ro/blog/inverses-pseudoinverses-numerical-issues-speed-symmetry.html ) that pinv2 performs better on "tall" and on "wide" arrays.
The problem is that, if H has a column j of all ones, pinv(H) has huge coefficients in row j.
This is in turn a problem because, in such cases, np.dot(pinv(H), Y) contains some nan values (Y is an array of small integers).
Now, I don't know enough about linear algebra and numerical computation to tell whether this is a bug or some precision-related property of the two functions. I would like you to answer this question so that, if it's warranted, I can file a bug report (honestly, at the moment I would not even know what to write).
I saved the arrays with np.savetxt(fn, a, '%.2e', ';'): please, see https://dl.dropboxusercontent.com/u/48242012/example.tar.gz to find them.
Any help is appreciated. In the provided file, you can see in pinv(H).csv that rows 14, 33, 55, 56 and 99 have huge values, while in pinv2(H) the same rows have more reasonable values.
In short, the two functions implement two different ways to calculate the pseudoinverse matrix:
scipy.linalg.pinv uses least squares, which may be quite compute intensive and take up a lot of memory.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html#scipy.linalg.pinv
scipy.linalg.pinv2 uses SVD (singular value decomposition), which should run with a smaller memory footprint in most cases.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv2.html#scipy.linalg.pinv2
numpy.linalg.pinv also implements this method.
As these are two different evaluation methods, the resulting matrices will not be the same. Each method has its own advantages and disadvantages, and it is not always easy to determine which one should be used without deeply understanding the data and what the pseudoinverse will be used for. I'd simply suggest some trial-and-error and use the one which gives you the best results for your classifier.
Note that in some cases these functions cannot converge to a solution, and will then raise a scipy.linalg.LinAlgError. In that case you may try the other pinv implementation, which may greatly reduce the number of errors you receive.
Starting from SciPy 1.7.0, pinv2 is deprecated, because pinv itself now uses an SVD-based solution:
DeprecationWarning: scipy.linalg.pinv2 is deprecated since SciPy 1.7.0, use scipy.linalg.pinv instead
That means numpy.linalg.pinv, scipy.linalg.pinv and scipy.linalg.pinv2 now all compute equivalent solutions. They are also roughly equally fast, with the SciPy versions being slightly faster.
import numpy as np
import scipy.linalg  # a plain "import scipy" does not reliably expose scipy.linalg

arr = np.random.rand(1000, 2000)
res1 = np.linalg.pinv(arr)
res2 = scipy.linalg.pinv(arr)
res3 = scipy.linalg.pinv2(arr)  # emits the DeprecationWarning above on SciPy >= 1.7.0
np.testing.assert_array_almost_equal(res1, res2, decimal=10)
np.testing.assert_array_almost_equal(res1, res3, decimal=10)

Optimize Blas-like operation - A`*B*A

Given two matrices, A and B, where B is symmetric (and positive semi-definite), what is the best (fastest) way to calculate A`*B*A?
Currently, using BLAS, I first compute C=B*A using dsymm (introducing a temporary matrix C) and then A`*C using dgemm.
Is there a better (faster, no temporaries) way to do this using BLAS and mkl?
Thanks.
I'll offer some kind of answer: compared to the general case A*B*C, you know that the end result is a symmetric matrix. After computing C=B*A with the BLAS subroutine dsymm, you want to compute A`C, but you only need to compute the upper triangular part of the matrix and then copy the strictly upper triangular part to the lower triangular part.
Unfortunately there doesn't seem to be a BLAS routine where you can claim beforehand that, given two general matrices, the output matrix will be symmetric. I'm not sure if it would be beneficial to write your own function for this. It probably depends on the size of your matrices and the implementation.
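For concreteness, here is the dsymm-then-dgemm two-step sketched through SciPy's BLAS wrappers (the question is not Python-specific; dsymm and dgemm are the standard BLAS routines named above):

import numpy as np
from scipy.linalg import blas

n, m = 500, 300
B = np.random.rand(n, n)
B = B + B.T                    # make B symmetric
A = np.random.rand(n, m)

C = blas.dsymm(1.0, B, A)                     # C = B*A, exploiting B's symmetry
result = blas.dgemm(1.0, A, C, trans_a=True)  # A`*C via a general matmul

assert np.allclose(result, A.T @ B @ A)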
EDIT:
This idea seems to be addressed recently here: A Matrix Multiplication Routine that Updates Only the Upper or Lower Triangular Part of the Result Matrix