How to speed up matrix functions such as expm in scipy/numpy?

I'm using scipy and numpy to compute the exponential of a 6x6 matrix many times.
Compared to Matlab, it's about 10 times slower.
The function I'm using is scipy.linalg.expm; I have also tried the deprecated methods scipy.linalg.expm2 and scipy.linalg.expm3, and those are only about two times faster than expm. My questions are:
1. What's wrong with expm2 and expm3, given that they are faster than expm?
2. I'm using the wheel package from http://www.lfd.uci.edu/~gohlke/pythonlibs/, and I found https://software.intel.com/en-us/articles/building-numpyscipy-with-intel-mkl-and-intel-fortran-on-windows. Is the wheel package compiled with MKL? If not, can I optimize numpy and scipy by compiling them myself against MKL?
3. Are there any other ways to optimize the performance?

Well, I think I have found the answers to questions 1 and 2 myself:
1. It seems expm2 and expm3 return an array rather than a matrix, but they are about 2 times faster than expm.
2. After a whole day of trying to compile scipy against MKL, I succeeded. It's really hard to build scipy, especially on Windows with x64 and Python 3, and it turned out to be a waste of time: it's not even a bit faster than the whl package from http://www.lfd.uci.edu/~gohlke/pythonlibs/ .
Hoping someone can answer question 3.

Your matrix is relatively small, so the numerical part may not be the bottleneck. You should use a profiler to make sure that the limitation really is in the exponentiation.
You can also take a look at the source code of these implementations and write an equivalent function with fewer conditionals and less checking.
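A minimal profiling sketch of that first suggestion (my own illustration, not from the answer): time scipy.linalg.expm on a random 6x6 matrix and let cProfile show whether the numerical kernel or the Python-level checking dominates.

import cProfile
import numpy as np
from scipy.linalg import expm

a = np.random.rand(6, 6)
# sort by cumulative time to see which internal calls dominate
cProfile.run('for _ in range(1000): expm(a)', sort='cumulative')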

Related

With numpy, what is the fastest way to compute one solution to an underdetermined linear system?

I don't care which solution the method returns; I'd be happy with any solution.
In particular, I'm dealing with a 7x7 rank-6 matrix which describes the dynamics of a physical system. I'm noticing numpy.linalg.lstsq, numpy.linalg.qr, scipy.linalg.null_space, and scipy.linalg.lu run on the full matrix are all slower on my machine than numpy.linalg.solve run on a correctly-trimmed 6x6 full-rank matrix; solve is twice as fast as lstsq (14.8 µs vs 29.1 µs).
Is there any way to speed up the computation without some horrible C LAPACK-level hacking?
Numpy is not designed to be efficient on very small matrices. Its overheads (due to type checks, value checks, iterators, allocations, etc.) can be quite big on such matrices; in fact, dozens of microseconds is a reasonable time for such a Numpy function call. Numba can reduce the overheads thanks to fully compiled native code. That being said, Numba can still have a small overhead (due to the call from CPython, a few type checks and allocations), but it is generally reasonable unless you work on extremely small inputs. In that case, it is better to use Numba in the caller function, since the real problem is the slow CPython interpreter. The lazy compilation of a Numba function makes the first execution significantly slower; you can provide the signature to Numba to make it faster (eager compilation).
import numba as nb
import numpy as np

@nb.njit('(float64[:,::1], float64[::1])')  # eager compilation via an explicit signature
def solve_nb(a, b):
    return np.linalg.solve(a, b)
On my machine, this is about 16% faster on a 7x7 matrix. It requires the matrices to be contiguous (working on non-contiguous arrays is fundamentally inefficient, especially here). If this is not fast enough, then you can call dgesv directly for double-precision matrices (or sgesv for single-precision ones).
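As a hedged sketch of that last point (my own illustration, not from the answer), SciPy exposes the raw LAPACK routine through scipy.linalg.lapack, which skips most of the Python-level checking that solve performs:

import numpy as np
from scipy.linalg.lapack import dgesv

a = np.asfortranarray(np.random.rand(7, 7))  # Fortran order avoids an internal copy
b = np.random.rand(7)

# dgesv returns the LU factors, the pivot indices, the solution and a status code
lu, piv, x, info = dgesv(a, b)
assert info == 0  # info > 0 would mean the matrix is singular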
Actually, solve does use dgesv internally. lstsq appears to use a singular value decomposition (SVD). An SVD is significantly slower than a QR decomposition, which is generally a bit slower than an LU decomposition.
I am not an expert on the numerical/mathematical part, but AFAIK, solving this with an LU decomposition is less numerically stable than using a QR decomposition, which is in turn less numerically stable than an SVD. Also, I think an SVD/QR method should be used instead of a simple LU decomposition for matrices that are not full-rank.
The implementation of dgesv in the standard Netlib LAPACK uses an LU factorization followed by a call to dgetrs (see here). This latter call should be fast compared to the LU factorization. The code of LAPACK implementations is generally pretty generic, so it may have significant overhead on 7x7 matrices (AFAIK, the Intel implementation is one of the fastest for that).
An alternative solution is to write your own specialized LU decomposition and your own system solver using Numba or Cython. This solution is tedious, but it should be significantly faster, since the compiler can unroll the loops if it knows the bounds, reducing the overheads. You can also perform one allocation instead of multiple ones.
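A hedged sketch of that idea (my own illustration; the function name and the hard-coded size are made up here): a fixed-size LU solve with partial pivoting, compiled eagerly with Numba so the loop bounds are known at compile time.

import numpy as np
import numba as nb

N = 7  # fixed size, known at compile time

@nb.njit('float64[::1](float64[:,::1], float64[::1])', cache=True)
def lu_solve_7x7(a, b):
    lu = a.copy()  # factor a copy so the inputs are preserved
    x = b.copy()
    for k in range(N):
        # partial pivoting: pick the largest pivot in column k
        p = k
        for i in range(k + 1, N):
            if abs(lu[i, k]) > abs(lu[p, k]):
                p = i
        if p != k:
            for j in range(N):
                lu[k, j], lu[p, j] = lu[p, j], lu[k, j]
            x[k], x[p] = x[p], x[k]
        # eliminate below the pivot, updating the right-hand side on the fly
        for i in range(k + 1, N):
            f = lu[i, k] / lu[k, k]
            for j in range(k + 1, N):
                lu[i, j] -= f * lu[k, j]
            x[i] -= f * x[k]
    # back substitution on the upper-triangular factor
    for i in range(N - 1, -1, -1):
        s = x[i]
        for j in range(i + 1, N):
            s -= lu[i, j] * x[j]
        x[i] = s / lu[i, i]
    return x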

Need help understanding Kernel Transport speed on GPU (numba, cupy, cuda)

While GPUs speed up math calculations, there is a high fixed overhead for moving a kernel out to the GPU for execution.
I'm using cupy and numba. The first time I execute a function call that uses cupy's GPU version of numpy, it is quite slow. But the second time it is fast.
I've realized I don't understand how the kernel, or GPU code, gets out to the GPU to run. Operationally, I want to understand this better so that I can know when the things I do will accidentally create a slow step due to some kernel transfer. So I need some sort of rule of thumb to understand the concept.
For example, if I multiply two cupy arrays that are already stashed on the GPU, I might write C = A*B.
At some point the cupy overload on * multiplication has to be compiled out on the GPU, and it will automagically also get wrapped by loops that break it down into blocks and threads. So presumably this code is some kernel that gets transported out to the GPU. I'm guessing that the next time I call C*D, the GPU no longer needs to be taught what * means, and so it will be fast.
But at some point I would imagine the GPU needs to clear out old code, so * or other operations not being used at that moment might get flushed from memory, and then later on, when the call for A*B happens again, there's going to be a time penalty to recompile it out on the GPU.
Or so I imagine. If I'm right, how do I know when these kernels are going to stick around or disappear?
If I'm wrong and this isn't how it works, or there is some other slow step (I'm assuming the data is already transported to arrays on the GPU), then what is this slow step, and how does one organize things so one pays it as little as possible?
I'm trying to avoid writing explicit numba thread-management kernels as one does in CUDA C++, and just use the standard numba @njit, @vectorize, and @stencil decorators. Likewise, in CuPy I want to just work at the level of the numpy syntax, not dive into thread management.
I've read a lot of docs on this, but they just refer to the overheads of kernels, not when these get paid and how one controls that, so I'm confused.
I don't have a full answer to this yet. But so far the biggest clue I've gotten has come from reading up on the currently undocumented function cupy.fuse(), which makes it clearer than the numba.jit documents where the kernel launch costs are paid. I have not found the connection to Contexts yet, as recommended by @talonmies.
See https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
The key example is this:
import cupy

c = cupy.arange(4)

@cupy.fuse()
def foo(x):
    return x + x + x + x + x
foo(c) will be about three times slower with @cupy.fuse() commented out, because each "+" involves a kernel launch and a kernel free. Fusion merges all the adds into a single kernel, so the launch and free are paid once. For matrices of fewer than 1 million elements on a typical 2018 GPU, the add() is so fast that the launch and free are the dominant costs. A timing sketch follows below.
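A hedged timing sketch of that claim (my own illustration, built on the gist's example): compare the fused and unfused versions. GPU launches are asynchronous, so synchronize before reading the clock.

import time
import cupy

def foo_unfused(x):
    return x + x + x + x + x

@cupy.fuse()
def foo_fused(x):
    return x + x + x + x + x

c = cupy.arange(1024, dtype=cupy.float32)
foo_fused(c)  # warm-up: the first call pays the compilation cost

for name, f in (('unfused', foo_unfused), ('fused', foo_fused)):
    cupy.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    for _ in range(1000):
        f(c)
    cupy.cuda.Stream.null.synchronize()  # wait for all queued kernels to finish
    print(name, time.perf_counter() - t0)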
I wish I could find some documentation on @cupy.fuse(). For example, does it unroll internal functions the way @jit does? Could I achieve that by stacking @jit and @fuse?
I'm still largely in the dark, however, about when the costs get paid in numba.

What direction should I go to go faster than np.fft [duplicate]

This question already has answers here: Improving FFT performance in Python (6 answers). Closed 9 years ago.
I have some code that heavily uses np.fft.rfft and np.fft.irfft, such that this is the bottleneck for optimisation.
Is there any chance of going faster than this, and if so, what are my best options? Thoughts that occur to me:
Cython - I've heard this is very fast; but would it help here?
digging into numpy - rfft calls _raw_fft, which does lots of checking and then calls fftpack.cfftf. The profiler is telling me only 80% of the time is in fftpack.cfftf, so stripping the wrapping down to only the bits I need could save a little time.
find a faster DFT algorithm somewhere?
buy more computers
So really the question boils down to:
Does anyone with Cython experience know whether it would be worth trying here, or can it not make numpy any faster?
Are there any faster packages out there? How much faster is possible?
I've found this question/answer, which actually answers part of this:
https://stackoverflow.com/a/8481916/1900520
It shows that there is another FFT implementation in scipy that is quite a bit faster, but also that there is a package called FFTW that goes faster still (up to about 3x looking at those benchmarks).
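A hedged sketch, assuming the pyFFTW package (the Python wrapper around FFTW) is installed: its numpy_fft interface is a drop-in replacement for np.fft, and enabling the cache reuses FFTW "plans" between calls so repeated transforms of the same size stay fast.

import numpy as np
import pyfftw

pyfftw.interfaces.cache.enable()  # reuse plans across calls

x = np.random.rand(4096)
y = pyfftw.interfaces.numpy_fft.rfft(x)
x_back = pyfftw.interfaces.numpy_fft.irfft(y)
assert np.allclose(x, x_back)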
So that just leaves the question of whether Cython would make this go any faster.

What to beware of reading old Numarray tutorials and examples?

Python currently uses Numpy for heavy duty math and image processing.
The earlier Numeric and Numarray are obsolete, but there are still many tutorials, notes, sample code and other documentation using them today. Some of these cover special topics of interest, some are well written but haven't been updated or replaced, or are otherwise of use. Quite a bit is the same between Numeric, Numarray and Numpy, so I usually get good mileage out of these older docs. Occasionally, though, I run into a line of code that results in an error. Not often enough to remember how to get around it, but usually I figure it out at the cost of some time.
What are the main things to watch out for when relying on such older documentation for current Numpy use? Is there a list of how to translate the differences that exist?
Two good resources:
Numarray to numpy guide
Differences between Numeric and numpy

How to optimize MATLAB loops?

I have been working lately on a number of iterative algorithms in MATLAB, and have been getting hit hard by MATLAB's performance (or lack thereof) when it comes to loops. I'm aware of the benefit of vectorizing code when possible, but are there any tools for optimization when you need the loop for your algorithm?
I am aware of the MEX-file option to write small subroutines in C/C++, although given my algorithms, this can be a very painful option given the data structures required. I mainly use MATLAB for the simplicity and speed of prototyping, so a syntactically complex, statically typed language is not ideal for my situation.
Are there any other suggestions? Even other languages (python?) which have relatively painless matrix tools are an option.
It was once true that vectorization would improve the speed of your MATLAB code. However, that is largely no longer true with the JIT accelerator.
This video demonstrating the MATLAB profiler might help.
The profiler is a very useful tool for finding bottlenecks in MATLAB code. It does not change your code, of course, but it helps you find which functions/lines to optimize with vectorization or MEX.
http://www.mathworks.com/access/helpdesk/help/techdoc/ref/profile.html
http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_env/f9-17018.html
If you have a choice, be sure to set up your loops so you scan the data column-wise, which is how the data in MATLAB are arranged. In addition, be sure to preallocate any output arrays before the loop and index into them instead of growing the arrays inside the for-loop.
If you can cast your code so your operations are called on the whole matrix then you will see great improvement in the speed of your code. Many functions are much quicker when operating on the whole matrix rather than in an element-wise fashion with loops.
You might want to investigate MATLAB's Parallel Computing Toolbox, which can make a big difference if you have the right hardware. I rewrote about 12 lines of code and got a 4-6x speedup for one of our loop-intensive programs on an eight-core PC.